# GPTC

General-purpose text classifier in Python

GPTC provides both a CLI tool and a Python library.

## Installation

    pip install gptc

## CLI Tool

### Classifying text

    gptc classify [-n <max_ngram_length>] <compiled model file>

This will prompt for a string and classify it, then print (in JSON) a dict of
the format `{category: probability, category: probability, ...}` to stdout. (For
information about `-n <max_ngram_length>`, see section "Ngrams.")

### Checking individual words or ngrams

    gptc check <compiled model file> <token or ngram>

This is very similar to `gptc classify`, except that it takes the input as an
argument and treats it as a single token or ngram.

### Compiling models

    gptc compile [-n <max_ngram_length>] [-c <min_count>] <raw model file> <compiled model file>

This will write the compiled model, encoded in a binary format, to
`<compiled model file>`.

If `-c` is specified, words and ngrams used fewer than `min_count` times will be
excluded from the compiled model.

### Packing models

    gptc pack <dir>

This will print the raw model in JSON to stdout. See `models/unpacked/` for an
example of the format. Any exceptions will be printed to stderr.

## Library

### `Model.serialize(file)`

Write binary data representing the model to `file`.

### `Model.deserialize(encoded_model)`

Deserialize a `Model` from a file containing data from `Model.serialize()`.

### `Model.confidence(text, max_ngram_length)`

Classify `text`. Returns a dict of the format `{category: probability,
category: probability, ...}`.

Note that this may not include values for all categories. If there are no
common words between the input and the training data (likely, for example, with
input in a different language from the training data), an empty dict will be
returned.

For information about `max_ngram_length`, see section "Ngrams."
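
Because the returned dict can be empty, callers need to handle that case before picking a winner. The following is a minimal sketch that works on any dict shaped like the documented return value. The helper name `top_category` and the sample categories are hypothetical and are not part of GPTC.

```python
from typing import Dict, Optional


def top_category(confidences: Dict[str, float]) -> Optional[str]:
    """Return the most probable category, or None for an empty dict."""
    if not confidences:
        # An empty dict means the input shared no words with the training data.
        return None
    return max(confidences, key=confidences.get)


# Hypothetical results shaped like the return value of Model.confidence().
print(top_category({"twain": 0.8, "shakespeare": 0.2}))  # twain
print(top_category({}))  # None
```

Returning `None` rather than raising keeps the "no overlap" case explicit for the caller, which may want to fall back to a default category.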

### `Model.get(token)`

Return a confidence dict for the given token or ngram. This function is very
similar to `Model.confidence()`, except it treats the input as a single token
or ngram.

### `Model.compile(raw_model, max_ngram_length=1, min_count=1, hash_algorithm="sha256")`

Compile a raw model (as a list, not JSON) and return the compiled model (as a
`gptc.Model` object).

For information about `max_ngram_length`, see section "Ngrams."

Words or ngrams used fewer than `min_count` times throughout the input text are
excluded from the model.
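
To illustrate the `min_count` rule only, here is a small sketch of count-based filtering. It is not GPTC's implementation: the function name `filter_by_min_count` and the whitespace split are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List


def filter_by_min_count(tokens: List[str], min_count: int) -> Dict[str, int]:
    """Keep only tokens that occur at least min_count times."""
    counts = Counter(tokens)
    return {token: n for token, n in counts.items() if n >= min_count}


# Naive whitespace split, used here only to get some tokens to count.
words = "to be or not to be".split()
print(filter_by_min_count(words, 2))  # {'to': 2, 'be': 2}
```

With the default `min_count=1`, nothing is dropped, since every token present occurs at least once.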

The hash algorithm should be left as the default, which may change with a minor
version update, but it can be changed by the application if needed. It is
stored in the model, so changing the algorithm does not affect compatibility.
The following algorithms are supported:

* `md5`
* `sha1`
* `sha224`
* `sha256`
* `sha384`
* `sha512`
* `sha3_224`
* `sha3_256`
* `sha3_384`
* `sha3_512`
* `shake_128`
* `shake_256`
* `blake2b`
* `blake2s`
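
An application that lets users choose the algorithm may want to validate the name before passing it on. The sketch below checks a name against the list above. The helper `check_hash_algorithm` is hypothetical. The listed names also happen to be guaranteed by Python's `hashlib` module, although whether GPTC uses `hashlib` internally is an assumption.

```python
import hashlib

# Names copied from the list of supported algorithms above.
SUPPORTED_HASH_ALGORITHMS = frozenset({
    "md5", "sha1", "sha224", "sha256", "sha384", "sha512",
    "sha3_224", "sha3_256", "sha3_384", "sha3_512",
    "shake_128", "shake_256", "blake2b", "blake2s",
})


def check_hash_algorithm(name: str) -> str:
    """Return name unchanged if it is in the documented list, else raise."""
    if name not in SUPPORTED_HASH_ALGORITHMS:
        raise ValueError(f"unsupported hash algorithm: {name!r}")
    return name


print(check_hash_algorithm("sha256"))  # sha256
# Every documented name is also in hashlib's guaranteed set.
print(SUPPORTED_HASH_ALGORITHMS <= hashlib.algorithms_guaranteed)  # True
```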

### `gptc.pack(directory, print_exceptions=False)`

Pack the model in `directory` and return a tuple of the format:

    (raw_model, [(exception,), (exception,), ...])

Note that the exceptions are contained in single-item tuples. This is to allow
more information to be provided without breaking the API in future versions of
GPTC.

See `models/unpacked/` for an example of the format.
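
Since the exception tuples may grow in future versions, it is safer to index into them than to unpack them by length. The following sketch operates on a hand-built value shaped like the documented return format. The helper `split_pack_result` and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Tuple

RawModel = List[Dict[str, str]]
PackResult = Tuple[RawModel, List[Tuple[Any, ...]]]


def split_pack_result(result: PackResult) -> Tuple[RawModel, List[Any]]:
    """Separate the raw model from the exceptions, unwrapping each tuple."""
    raw_model, wrapped = result
    # Index rather than unpack, so extra tuple items added later are ignored.
    exceptions = [item[0] for item in wrapped]
    return raw_model, exceptions


# Hypothetical value shaped like the documented return format.
sample = ([{"text": "hello", "category": "greeting"}], [(ValueError("bad file"),)])
raw_model, exceptions = split_pack_result(sample)
print(len(raw_model), [type(e).__name__ for e in exceptions])  # 1 ['ValueError']
```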

## Ngrams

GPTC optionally supports using ngrams to improve classification accuracy. They
are disabled by default (maximum length set to 1) for performance reasons.
Enabling them significantly increases the time required for both compilation
and classification. The effect seems more significant for compilation than for
classification. Compiled models are also much larger when ngrams are enabled.
Larger maximum ngram lengths will result in slower performance and larger
files. It is a good idea to experiment with different values and use the
highest one at which GPTC is fast enough and models are small enough for your
needs.

Once a model is compiled at a certain maximum ngram length, it cannot be used
for classification with a higher value. If you classify text with a
`max_ngram_length` higher than the one the model was compiled with, the value
will be silently reduced to the one used when compiling the model.
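
As a rough illustration of why larger values cost more, the sketch below lists every run of consecutive words up to a maximum length, and shows the clamping rule as a simple `min`. It shows the general concept only. GPTC's own tokenizer and storage are not described here, and both function names are hypothetical.

```python
from typing import List


def word_ngrams(words: List[str], max_ngram_length: int) -> List[str]:
    """Return all runs of 1..max_ngram_length consecutive words."""
    ngrams = []
    for length in range(1, max_ngram_length + 1):
        for start in range(len(words) - length + 1):
            ngrams.append(" ".join(words[start:start + length]))
    return ngrams


def effective_length(requested: int, compiled: int) -> int:
    """A requested length above the compiled one is reduced to it."""
    return min(requested, compiled)


print(word_ngrams(["lend", "me", "your", "ears"], 2))
# ['lend', 'me', 'your', 'ears', 'lend me', 'me your', 'your ears']
print(effective_length(5, 3))  # 3
```

Four words yield four entries at length 1 but seven at length 2, which is the kind of growth that makes compiled models larger.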

## Model format

This section explains the raw model format, which is how models are created and
edited.

Raw models are formatted as a list of dicts. See below for the format:

    [
        {
            "text": "<text in the category>",
            "category": "<the category>"
        }
    ]

GPTC handles raw models as `list`s of `dict`s of `str`s (`List[Dict[str,
str]]`), and they can be stored in any way these Python objects can be.
However, it is recommended to store them in JSON format for compatibility with
the command-line tool.
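
A raw model can be built and round-tripped through JSON with the standard library alone. The sketch below assumes nothing about GPTC beyond the format shown above. The helper names, the category labels, and the sample entries are made up for illustration.

```python
import json
from typing import Dict, List

RawModel = List[Dict[str, str]]

# Hypothetical training entries in the documented format.
raw_model: RawModel = [
    {"text": "The report of my death was an exaggeration.", "category": "twain"},
    {"text": "Friends, Romans, countrymen, lend me your ears", "category": "shakespeare"},
]


def dump_raw_model(model: RawModel) -> str:
    """Encode a raw model as JSON text."""
    return json.dumps(model, ensure_ascii=False, indent=4)


def load_raw_model(text: str) -> RawModel:
    """Decode JSON text and check that each entry has the two documented keys."""
    model = json.loads(text)
    for entry in model:
        if not {"text", "category"} <= set(entry):
            raise ValueError(f"entry is missing a required key: {entry!r}")
    return model


print(load_raw_model(dump_raw_model(raw_model)) == raw_model)  # True
```

`ensure_ascii=False` keeps non-ASCII training text readable in the stored file.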

## Emoji

GPTC treats individual emoji as words.

## Example model

An example model, which is designed to distinguish between texts written by
Mark Twain and those written by William Shakespeare, is available in `models`.
The raw model is in `models/raw.json`; the compiled model is in
`models/compiled.json`.

The example model was compiled with `max_ngram_length=10`.

## Benchmark

A benchmark script is available for comparing performance of GPTC between
different Python versions. To use it, run `benchmark.py` with all of the Python
installations you want to test. It tests both compilation and classification.
It uses the default Twain/Shakespeare model for both, and for classification it
uses [Mark Antony's "Friends, Romans, countrymen"
speech](https://en.wikipedia.org/wiki/Friends,_Romans,_countrymen,_lend_me_your_ears)
from Shakespeare's *Julius Caesar*.