# GPTC

General-purpose text classifier in Python

GPTC provides both a CLI tool and a Python library.

## Installation

    pip install gptc

## CLI Tool

### Classifying text

    gptc classify [-n <max_ngram_length>] <compiled model file>

This will prompt for a string and classify it, then print a JSON object of the
format `{category: probability, category: probability, ...}` to stdout. (For
information about `-n <max_ngram_length>`, see section "Ngrams.")

### Checking individual words or ngrams

    gptc check <compiled model file> <token or ngram>

This is very similar to `gptc classify`, except it takes the input as an
argument, and it treats the input as a single token or ngram.

### Compiling models

    gptc compile [-n <max_ngram_length>] [-c <min_count>] <raw model file>

This will print the compiled model encoded in binary format to stdout.

If `-c` is specified, words and ngrams used fewer than `min_count` times will
be excluded from the compiled model.

### Packing models

    gptc pack <dir>

This will print the raw model in JSON to stdout. See `models/unpacked/` for an
example of the format. Any exceptions will be printed to stderr.

## Library

### `Model.serialize(file)`

Write binary data representing the model to `file`.

### `gptc.deserialize(encoded_model)`

Deserialize a `Model` from a file containing data from `Model.serialize()`.

### `Model.confidence(text, max_ngram_length)`

Classify `text`. Returns a dict of the format `{category: probability,
category: probability, ...}`.

Note that this may not include values for all categories. If there are no
common words between the input and the training data (likely, for example, with
input in a different language from the training data), an empty dict will be
returned.

For information about `max_ngram_length`, see section "Ngrams."
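Because the returned dict can be empty, callers usually need to handle the
no-match case before picking a winner. The following is a minimal sketch of one
way to do that. `best_category` is a hypothetical helper, not part of GPTC, and
the dicts passed to it here are hand-written stand-ins for real results.

```python
from typing import Dict, Optional, Tuple


def best_category(confidences: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return the (category, probability) pair with the highest probability.

    Returns None for an empty dict, which is what Model.confidence() gives
    when the input shares no words with the training data.
    """
    if not confidences:
        return None
    category = max(confidences, key=confidences.get)
    return category, confidences[category]


# Hand-written dicts standing in for Model.confidence() results.
print(best_category({"twain": 0.25, "shakespeare": 0.75}))
print(best_category({}))
```

Returning `None` rather than raising keeps the "no common words" case an
ordinary outcome that the caller can branch on.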

### `Model.get(token)`

Return a confidence dict for the given token or ngram. This function is very
similar to `Model.confidence()`, except it treats the input as a single token
or ngram.

### `gptc.compile(raw_model, max_ngram_length=1, min_count=1, hash_algorithm="sha256")`

Compile a raw model (as a list, not JSON) and return the compiled model (as a
`gptc.Model` object).

For information about `max_ngram_length`, see section "Ngrams."

Words or ngrams used fewer than `min_count` times throughout the input text are
excluded from the model.
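To make the `min_count` threshold concrete, here is a small sketch of the
filtering rule. It is an illustration only, not GPTC's implementation:
`drop_rare` is a hypothetical helper, and splitting on whitespace is a
simplifying assumption that does not reflect how GPTC tokenizes text.

```python
from collections import Counter
from typing import Dict, Iterable


def drop_rare(tokens: Iterable[str], min_count: int = 1) -> Dict[str, int]:
    """Count tokens and keep only those seen at least min_count times."""
    counts = Counter(tokens)
    return {token: n for token, n in counts.items() if n >= min_count}


# Whitespace splitting is an assumption made for this example only.
words = "to be or not to be".split()
print(drop_rare(words, min_count=2))
```

With the default `min_count=1`, nothing is dropped, since every token that
appears at all has a count of at least 1.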

The hash algorithm should be left as the default, which may change with a minor
version update, but it can be changed by the application if needed. It is
stored in the model, so changing the algorithm does not affect compatibility.
The following algorithms are supported:

* `md5`
* `sha1`
* `sha224`
* `sha256`
* `sha384`
* `sha512`
* `sha3_224`
* `sha3_256`
* `sha3_384`
* `sha3_512`
* `shake_128`
* `shake_256`
* `blake2b`
* `blake2s`
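An application that lets users choose the algorithm may want to check the name
before passing it to `gptc.compile()`. The sketch below is one way to do that.
`check_hash_algorithm` is a hypothetical helper, and the comparison against
Python's `hashlib` rests on the assumption that the names above correspond to
`hashlib` algorithm names, which this README does not state.

```python
import hashlib

# Algorithm names copied from the list above.
SUPPORTED_HASH_ALGORITHMS = (
    "md5", "sha1", "sha224", "sha256", "sha384", "sha512",
    "sha3_224", "sha3_256", "sha3_384", "sha3_512",
    "shake_128", "shake_256", "blake2b", "blake2s",
)


def check_hash_algorithm(name: str) -> str:
    """Return name unchanged if it is in the documented list, else raise."""
    if name not in SUPPORTED_HASH_ALGORITHMS:
        raise ValueError(f"unsupported hash algorithm: {name!r}")
    return name


# Assumption: the documented names match hashlib's. This lists any name that
# hashlib does not guarantee on the running Python version.
missing = set(SUPPORTED_HASH_ALGORITHMS) - hashlib.algorithms_guaranteed
print(sorted(missing))
```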

### `gptc.pack(directory, print_exceptions=False)`

Pack the model in `directory` and return a tuple of the format:

    (raw_model, [(exception,),(exception,)...])

Note that the exceptions are contained in single-item tuples. This is to allow
more information to be provided without breaking the API in future versions of
GPTC.

See `models/unpacked/` for an example of the format.
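Since the exception tuples may gain extra items in future versions, code that
reads them is safer indexing the first item than unpacking a fixed length. The
sketch below shows that pattern on a hand-built stand-in for the return value.
`split_pack_result` is a hypothetical helper, not part of GPTC.

```python
from typing import Any, List, Tuple


def split_pack_result(
    result: Tuple[Any, List[tuple]],
) -> Tuple[Any, List[BaseException]]:
    """Separate the raw model from the exceptions in a pack()-style result.

    Each exception is read with entry[0] rather than by unpacking `(exc,)`,
    so this keeps working if later versions add more items per tuple.
    """
    raw_model, wrapped = result
    return raw_model, [entry[0] for entry in wrapped]


# Hand-built stand-in for a gptc.pack() return value.
fake_result = (
    [{"text": "some text", "category": "example"}],
    [(OSError("unreadable file"),)],
)
raw_model, errors = split_pack_result(fake_result)
for error in errors:
    print(f"problem while packing: {error}")
```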

### `gptc.Classifier(model, max_ngram_length=1)`

`Classifier` objects are deprecated starting with GPTC 3.1.0, and will be
removed in 4.0.0. See [the README from
3.0.2](https://git.kj7rrv.com/kj7rrv/gptc/src/tag/v3.0.1/README.md) if you need
documentation.

## Ngrams

GPTC optionally supports ngrams to improve classification accuracy. They are
disabled by default (maximum length set to 1) for performance reasons:
enabling them significantly increases the time required for both compilation
and classification, and the effect seems more significant for compilation.
Compiled models are also much larger when ngrams are enabled, and larger
maximum ngram lengths result in slower performance and larger files. It is a
good idea to experiment with different values and use the highest one at which
GPTC is fast enough and models are small enough for your needs.
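The cost of raising the maximum length is easier to see by counting how many
ngrams a short text produces. The sketch below is an illustration only:
`ngrams_up_to` is a hypothetical function, and both the whitespace
tokenization and the space-joined representation are assumptions, since this
README does not describe how GPTC builds ngrams internally.

```python
from typing import List


def ngrams_up_to(tokens: List[str], max_ngram_length: int = 1) -> List[str]:
    """Return every run of 1..max_ngram_length consecutive tokens."""
    result = []
    for length in range(1, max_ngram_length + 1):
        for start in range(len(tokens) - length + 1):
            result.append(" ".join(tokens[start:start + length]))
    return result


tokens = "friends romans countrymen lend me your ears".split()
for n in (1, 2, 3):
    # Print the maximum length and how many ngrams it yields.
    print(n, len(ngrams_up_to(tokens, n)))
```

Each extra unit of length adds nearly one more ngram per token, which is
consistent with the slower compilation and larger models described above.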

Once a model is compiled at a certain maximum ngram length, it cannot be used
for classification with a higher value. If you instantiate a `Classifier` with
a model compiled with a lower `max_ngram_length`, the value will be silently
reduced to the one used when compiling the model.

## Model format

This section explains the raw model format, which is how models are created and
edited.

Raw models are lists of dicts, each with a `text` key and a `category` key:

    [
        {
            "text": "<text in the category>",
            "category": "<the category>"
        }
    ]

GPTC handles raw models as `list`s of `dict`s of `str`s (`List[Dict[str,
str]]`), and they can be stored in any way these Python objects can be.
However, it is recommended to store them in JSON format for compatibility with
the command-line tool.
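As a sketch of working with this format from Python, the following builds a
small raw model, checks its shape, and round-trips it through JSON using only
the standard library. `validate_raw_model` is a hypothetical helper and the
entries are placeholders, not real training data.

```python
import json
from typing import Dict, List

RawModel = List[Dict[str, str]]


def validate_raw_model(raw_model: RawModel) -> None:
    """Raise ValueError unless every entry is a dict with str text/category."""
    for index, entry in enumerate(raw_model):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not a dict")
        for key in ("text", "category"):
            if not isinstance(entry.get(key), str):
                raise ValueError(f"entry {index} needs a string {key!r}")


raw_model: RawModel = [
    {"text": "placeholder text for the first category", "category": "first"},
    {"text": "placeholder text for the second category", "category": "second"},
]
validate_raw_model(raw_model)

# JSON round trip: this string is what would be written to a raw model file.
encoded = json.dumps(raw_model, indent=4)
assert json.loads(encoded) == raw_model
```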

## Emoji

GPTC treats individual emoji as words.

## Example model

An example model, which is designed to distinguish between texts written by
Mark Twain and those written by William Shakespeare, is available in `models`.
The raw model is in `models/raw.json`; the compiled model is in
`models/compiled.json`.

The example model was compiled with `max_ngram_length=10`.

## Benchmark

A benchmark script is available for comparing performance of GPTC between
different Python versions. To use it, run `benchmark.py` with all of the Python
installations you want to test. It tests both compilation and classification.
It uses the default Twain/Shakespeare model for both, and for classification it
uses [Mark Antony's "Friends, Romans, countrymen"
speech](https://en.wikipedia.org/wiki/Friends,_Romans,_countrymen,_lend_me_your_ears)
from Shakespeare's *Julius Caesar*.