# GPTC

General-purpose text classifier in Python

GPTC provides both a CLI tool and a Python library.

## Installation

    pip install gptc[emoji] # handles emojis! (see section "Emoji")
    # Or, if you don't need emoji support,
    pip install gptc # no dependencies!

## CLI Tool

### Classifying text

    gptc classify [-n <max_ngram_length>] <compiled model file>

This will prompt for a string and classify it, then print (in JSON) a dict of
the format `{category: probability, category: probability, ...}` to stdout. (For
information about `-n <max_ngram_length>`, see section "Ngrams.")

Alternatively, if you only need the most likely category, you can use this:

    gptc classify [-n <max_ngram_length>] <-c|--category> <compiled model file>

This will prompt for a string and classify it, outputting the category on
stdout (or "None" if it cannot determine anything).

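If you capture the JSON output in a script, picking the most likely category is a one-liner. The following is a minimal sketch; the sample output string and its category names are invented for illustration, and real output depends on your model.

```python
import json

# Hypothetical output captured from `gptc classify`; the categories and
# probabilities below are made up for this example.
sample_output = '{"twain": 0.25, "shakespeare": 0.75}'


def most_likely(confidences):
    """Return the category with the highest probability, or None if empty."""
    if not confidences:
        return None
    return max(confidences, key=confidences.get)


confidences = json.loads(sample_output)
best = most_likely(confidences)
print(best)
```

An empty dict maps to `None` here, which parallels the "None" printed by the `--category` form when nothing can be determined.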
### Compiling models

    gptc compile [-n <max_ngram_length>] <raw model file>

This will print the compiled model in JSON to stdout.

## Library

### `gptc.Classifier(model, max_ngram_length=1)`

Create a `Classifier` object using the given *compiled* model (as a dict, not
JSON).

For information about `max_ngram_length`, see section "Ngrams."

#### `Classifier.confidence(text)`

Classify `text`. Returns a dict of the format `{category: probability,
category: probability, ...}`.

#### `Classifier.classify(text)`

Classify `text`. Returns the category into which the text is placed (as a
string), or `None` when it cannot classify the text.

### `gptc.compile(raw_model, max_ngram_length=1)`

Compile a raw model (as a list, not JSON) and return the compiled model (as a
dict).

For information about `max_ngram_length`, see section "Ngrams."

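Putting the calls above together, a rough end-to-end sketch might look like this. It uses only the signatures documented in this section; the helper name `classify_text` and the two training entries are hypothetical, and far too small to give meaningful results.

```python
def classify_text(raw_model, text, max_ngram_length=1):
    """Compile raw_model, then classify text with the resulting model."""
    # Imported inside the function so this sketch loads even without GPTC.
    import gptc

    compiled = gptc.compile(raw_model, max_ngram_length=max_ngram_length)
    classifier = gptc.Classifier(compiled, max_ngram_length=max_ngram_length)
    # classify() gives one category (or None); confidence() gives the full dict.
    return classifier.classify(text), classifier.confidence(text)


# Made-up training data in the raw model format, for illustration only.
raw_model = [
    {"text": "the river boat drifted down the Mississippi", "category": "twain"},
    {"text": "wherefore art thou so melancholy, my lord", "category": "shakespeare"},
]
```

Calling `classify_text(raw_model, "some text")` with GPTC installed should return the chosen category together with the per-category probabilities.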
## Ngrams

GPTC optionally supports using ngrams to improve classification accuracy. They
are disabled by default (maximum length set to 1) for performance and
compatibility reasons. Enabling them significantly increases the time required
for both compilation and classification; the effect seems more significant for
compilation than for classification. Compiled models are also much larger when
ngrams are enabled. Larger maximum ngram lengths result in slower performance
and larger files. It is a good idea to experiment with different values and use
the highest one at which GPTC is fast enough and models are small enough for
your needs.

Once a model is compiled at a certain maximum ngram length, it cannot be used
for classification with a higher value. If you instantiate a `Classifier` with
a higher `max_ngram_length` than the model was compiled with, the value will be
silently reduced to the one used when compiling the model.

Models compiled with older versions of GPTC which did not support ngrams are
handled the same way as models compiled with `max_ngram_length=1`.

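To make the size and speed trade-off concrete, here is an illustrative sketch of word ngrams and of the length-capping rule described above. Both function names are hypothetical, and GPTC's actual tokenizer and internals may differ; the point is only that each extra ngram length adds more features per text.

```python
def word_ngrams(words, max_ngram_length=1):
    """Return every run of 1..max_ngram_length consecutive words."""
    ngrams = []
    for length in range(1, max_ngram_length + 1):
        for start in range(len(words) - length + 1):
            ngrams.append(" ".join(words[start:start + length]))
    return ngrams


def effective_max_ngram_length(requested, compiled_with=1):
    """Cap the requested length at the value used when compiling the model."""
    return min(requested, compiled_with)


words = ["lend", "me", "your", "ears"]
print(word_ngrams(words, 1))  # single words only
print(word_ngrams(words, 2))  # single words plus adjacent pairs
print(effective_max_ngram_length(5, compiled_with=3))
```

With four words, a maximum length of 1 yields four features and a maximum length of 2 yields seven, which hints at why compiled models grow as the limit is raised.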
## Emoji

If the [`emoji`](https://pypi.org/project/emoji/) package is installed, GPTC
will automatically handle emojis the same way as words. If it is not installed,
GPTC will still work but will ignore emojis.

## Model format

This section explains the raw model format, which is how you should create and
edit models.

Raw models are formatted as a list of dicts. See below for the format:

    [
        {
            "text": "<text in the category>",
            "category": "<the category>"
        }
    ]

GPTC handles models as Python `list`s of `dict`s of `str`s (for raw models) or
`dict`s of `str`s and `float`s (for compiled models), and they can be stored
in any way these Python objects can be. However, it is recommended to store
them in JSON format for compatibility with the command-line tool.

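As a small sketch of the recommended JSON storage, the following builds a raw model in Python, writes it to a file, and reads it back. The training texts, the file name, and the temporary directory are placeholders chosen for this example.

```python
import json
import os
import tempfile

# Made-up raw model: a list of dicts, each with "text" and "category" strings.
raw_model = [
    {"text": "the river boat drifted down the Mississippi", "category": "twain"},
    {"text": "wherefore art thou so melancholy, my lord", "category": "shakespeare"},
]

# Write the model as JSON so the command-line tool could read it.
path = os.path.join(tempfile.mkdtemp(), "raw.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(raw_model, f, indent=4)

# Loading it back yields the same list of dicts.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == raw_model)
```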
## Example model

An example model, which is designed to distinguish between texts written by
Mark Twain and those written by William Shakespeare, is available in `models`.
The raw model is in `models/raw.json`; the compiled model is in
`models/compiled.json`.

The example model was compiled with `max_ngram_length=10`.

## Benchmark

A benchmark script is available for comparing performance of GPTC between
different Python versions. To use it, run `benchmark.py` with all of the Python
installations you want to test. It tests both compilation and classification.
It uses the default Twain/Shakespeare model for both, and for classification it
uses [Mark Antony's "Friends, Romans, countrymen"
speech](https://en.wikipedia.org/wiki/Friends,_Romans,_countrymen,_lend_me_your_ears)
from Shakespeare's *Julius Caesar*.