Compare commits

...

46 Commits

SHA1        Message                                         Date
71e9249ff4  Classifier objects will be removed in 5.0       2023-05-31 13:42:42 -07:00
97c4eef086  Move deserialize to Model object                2023-04-17 21:35:38 -07:00
457b569741  Update README                                   2023-04-17 21:33:03 -07:00
4546c4cffa  Fix profiler and benchmark                      2023-04-17 21:28:24 -07:00
7b7ef39d0b  Merge compiler into model.py                    2023-04-17 21:15:18 -07:00
a252a15e9d  Clean up code                                   2023-04-17 21:06:47 -07:00
9513025e60  Fix type annotations                            2023-04-17 18:16:20 -07:00
2c3fc77ba6  Finish classification explanations              2023-04-16 15:48:19 -07:00
            A couple things I missed in 7f68dc6fc6
d8f3d2e701  Bump model version                              2023-04-16 15:36:49 -07:00
            99ad07a876 broke the model format, although probably only in a few
            edge cases
            Still enough of a change for a model version bump
7f68dc6fc6  Add classification explanations                 2023-04-16 15:35:53 -07:00
            Closes #17
99ad07a876  Casefold                                        2023-04-16 14:49:03 -07:00
            Closes #14
f38f4ca801  Add profiler                                    2023-04-16 14:27:31 -07:00
56550ca457  Remove Classifier objects                       2023-04-16 14:27:07 -07:00
            Closes #16
75fdb5ba3c  Split compiler into two functions               2023-01-15 09:39:35 -08:00
071656c2d2  Bump version to 4.0.1                           2022-12-24 12:49:12 -08:00
aad590636a  Fix type annotations                            2022-12-24 12:48:43 -08:00
099e810a18  Fix check                                       2022-12-24 12:44:09 -08:00
822aa7d1fd  Bump version to 4.0.0                           2022-12-24 12:18:51 -08:00
8417c8acda  Recompile model                                 2022-12-24 12:18:25 -08:00
ec7f4116fc  Include file name of output in arguments        2022-12-24 12:17:44 -08:00
f8dbc78b82  Allow hash algorithm selection                  2022-12-24 11:18:05 -08:00
            Closes #9
6f21e0d4e9  Remove debug print lines from compiler          2022-12-24 10:48:09 -08:00
41bba61410  Remove has_emoji and bump model version         2022-12-24 10:47:23 -08:00
            Closes #11
10668691ea  Normalize characters                            2022-12-24 10:46:40 -08:00
            Closes #3
295a1189de  Include numbers in tokenized output             2022-12-24 10:42:50 -08:00
            Closes #12
74b2ba81b9  Deserialize from file                           2022-12-23 10:49:24 -08:00
9916744801  New type annotation for serialize               2022-12-23 10:33:56 -08:00
7e7b5f3e9c  Performance improvements                        2022-12-22 18:01:37 -08:00
a76c6d3da8  Bump version to 3.1.1                           2022-11-27 15:01:06 -08:00
c84758af56  list, not tuple                                 2022-11-27 15:00:37 -08:00
3a9c8d2bf2  Revert "Bump version to 3.1.1"                  2022-11-27 14:56:10 -08:00
            This reverts commit 12f97ae765.
12f97ae765  Bump version to 3.1.1                           2022-11-27 14:54:11 -08:00
c754293d69  Compiler performance improvements               2022-11-27 14:32:44 -08:00
8d42a92848  Add type annotation to Model.get()              2022-11-27 13:36:49 -08:00
e4eb322aa7  Bump version to 3.1.0                           2022-11-26 18:37:11 -08:00
83ef71e8ce  Remove doc for gptc classify --category         2022-11-26 18:36:41 -08:00
991d3fd54a  Revert "Bump version to 3.1.0"                  2022-11-26 18:36:18 -08:00
            This reverts commit b3e6a13e65.
b3e6a13e65  Bump version to 3.1.0                           2022-11-26 18:34:04 -08:00
b1228edd9c  Add CLI for Model.get()                         2022-11-26 18:28:44 -08:00
25192ffddf  Add ability to look up individual token         2022-11-26 18:17:02 -08:00
            Closes #10
548d670960  Use Classifier for --category                   2022-11-26 17:50:26 -08:00
b3a43150d8  Split hash function                             2022-11-26 17:42:42 -08:00
08437a2696  Add normalize()                                 2022-11-26 17:17:28 -08:00
fc4665bb9e  Separate tokenization and hashing               2022-11-26 17:04:56 -08:00
30287288f2  Fix README issues                               2022-11-26 16:45:30 -08:00
448f200923  Add confidence to Model; deprecate Classifier   2022-11-26 16:41:29 -08:00
13 changed files with 452 additions and 325 deletions

View File

@@ -18,18 +18,19 @@ This will prompt for a string and classify it, then print (in JSON) a dict of
 the format `{category: probability, category:probability, ...}` to stdout. (For
 information about `-n <max_ngram_length>`, see section "Ngrams.")
 
-Alternatively, if you only need the most likely category, you can use this:
+### Checking individual words or ngrams
 
-    gptc classify [-n <max_ngram_length>] <-c|--category> <compiled model file>
+    gptc check <compiled model file> <token or ngram>
 
-This will prompt for a string and classify it, outputting the category on
-stdout (or "None" if it cannot determine anything).
+This is very similar to `gptc classify`, except it takes the input as an
+argument, and it treats the input as a single token or ngram.
 
 ### Compiling models
 
-    gptc compile [-n <max_ngram_length>] [-c <min_count>] <raw model file>
+    gptc compile [-n <max_ngram_length>] [-c <min_count>] <raw model file> <compiled model file>
 
-This will print the compiled model encoded in binary format to stdout.
+This will write the compiled model encoded in binary format to `<compiled model
+file>`.
 
 If `-c` is specified, words and ngrams used less than `min_count` times will be
 excluded from the compiled model.
@@ -43,14 +44,15 @@ example of the format. Any exceptions will be printed to stderr.
 ## Library
 
-### `gptc.Classifier(model, max_ngram_length=1)`
+### `Model.serialize(file)`
 
-Create a `Classifier` object using the given compiled model (as a `gptc.Model`
-object, not as a serialized byte string).
+Write binary data representing the model to `file`.
 
-For information about `max_ngram_length`, see section "Ngrams."
+### `Model.deserialize(encoded_model)`
 
-#### `Classifier.confidence(text)`
+Deserialize a `Model` from a file containing data from `Model.serialize()`.
+
+### `Model.confidence(text, max_ngram_length)`
 
 Classify `text`. Returns a dict of the format `{category: probability,
 category:probability, ...}`
@@ -60,16 +62,15 @@ common words between the input and the training data (likely, for example, with
 input in a different language from the training data), an empty dict will be
 returned.
 
-#### `Classifier.classify(text)`
+For information about `max_ngram_length`, see section "Ngrams."
 
-Classify `text`. Returns the category into which the text is placed (as a
-string), or `None` when it cannot classify the text.
+### `Model.get(token)`
 
-#### `Classifier.model`
+Return a confidence dict for the given token or ngram. This function is very
+similar to `Model.confidence()`, except it treats the input as a single token
+or ngram.
 
-The classifier's model.
-
-### `gptc.compile(raw_model, max_ngram_length=1, min_count=1)`
+### `Model.compile(raw_model, max_ngram_length=1, min_count=1, hash_algorithm="sha256")`
 
 Compile a raw model (as a list, not JSON) and return the compiled model (as a
 `gptc.Model` object).
@@ -79,15 +80,27 @@ For information about `max_ngram_length`, see section "Ngrams."
 Words or ngrams used less than `min_count` times throughout the input text are
 excluded from the model.
 
-### `gptc.Model.serialize()`
+The hash algorithm should be left as the default, which may change with a minor
+version update, but it can be changed by the application if needed. It is
+stored in the model, so changing the algorithm does not affect compatibility.
+The following algorithms are supported:
 
-Returns a `bytes` representing the model.
+* `md5`
+* `sha1`
+* `sha224`
+* `sha256`
+* `sha384`
+* `sha512`
+* `sha3_224`
+* `sha3_384`
+* `sha3_256`
+* `sha3_512`
+* `shake_128`
+* `shake_256`
+* `blake2b`
+* `blake2s`
 
-### `gptc.deserialize(encoded_model)`
-
-Deserialize a `Model` from a `bytes` returned by `Model.serialize()`.
-
-### `gptc.pack(directory, print_exceptions=False`
+### `gptc.pack(directory, print_exceptions=False)`
 
 Pack the model in `directory` and return a tuple of the format:
@@ -99,6 +112,13 @@ GPTC.
 See `models/unpacked/` for an example of the format.
 
+### `gptc.Classifier(model, max_ngram_length=1)`
+
+`Classifier` objects are deprecated starting with GPTC 3.1.0, and will be
+removed in 5.0.0. See [the README from
+3.0.2](https://git.kj7rrv.com/kj7rrv/gptc/src/tag/v3.0.1/README.md) if you need
+documentation.
+
 ## Ngrams
 
 GPTC optionally supports using ngrams to improve classification accuracy. They
@@ -118,7 +138,8 @@ reduced to the one used when compiling the model.
 ## Model format
 
-This section explains the raw model format, which is how models are created and edited.
+This section explains the raw model format, which is how models are created and
+edited.
 
 Raw models are formatted as a list of dicts. See below for the format:
@@ -129,9 +150,10 @@ Raw models are formatted as a list of dicts. See below for the format:
     }
 ]
 
-GPTC handles raw models as `list`s of `dict`s of `str`s (`List[Dict[str, str]]`), and they can be stored
-in any way these Python objects can be. However, it is recommended to store
-them in JSON format for compatibility with the command-line tool.
+GPTC handles raw models as `list`s of `dict`s of `str`s (`List[Dict[str,
+str]]`), and they can be stored in any way these Python objects can be.
+However, it is recommended to store them in JSON format for compatibility with
+the command-line tool.
 
 ## Emoji
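The README changes above amount to a new file-based workflow. A minimal sketch of that workflow under the documented v4 API, with an invented two-category raw model and illustrative file names:

    import gptc

    # Invented raw model, in the format described in the README above.
    raw_model = [
        {"category": "dog", "text": "dogs bark and wag their tails"},
        {"category": "cat", "text": "cats meow and purr on the couch"},
    ]

    # Compile, then write the model to a file (compilation no longer
    # prints to stdout).
    model = gptc.Model.compile(raw_model, max_ngram_length=2)
    with open("example.gptc", "wb") as output_file:
        model.serialize(output_file)

    # deserialize now takes a file object rather than a bytes value.
    with open("example.gptc", "rb") as model_file:
        model = gptc.Model.deserialize(model_file)

    print(model.confidence("my dog barked", max_ngram_length=2))
    print(model.get("cats"))  # confidence dict for a single token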

View File

@@ -25,7 +25,7 @@ print(
     round(
         1000000
         * timeit.timeit(
-            "gptc.compile(raw_model, max_ngram_length)",
+            "gptc.Model.compile(raw_model, max_ngram_length)",
             number=compile_iterations,
             globals=globals(),
         )

View File

@@ -2,12 +2,11 @@
 """General-Purpose Text Classifier"""
 
-from gptc.compiler import compile as compile
-from gptc.classifier import Classifier as Classifier
-from gptc.pack import pack as pack
-from gptc.model import Model as Model, deserialize as deserialize
+from gptc.pack import pack
+from gptc.model import Model
+from gptc.tokenizer import normalize
 from gptc.exceptions import (
-    GPTCError as GPTCError,
-    ModelError as ModelError,
-    InvalidModelError as InvalidModelError,
+    GPTCError,
+    ModelError,
+    InvalidModelError,
 )

View File

@@ -17,6 +17,9 @@ def main() -> None:
         "compile", help="compile a raw model"
     )
     compile_parser.add_argument("model", help="raw model to compile")
+    compile_parser.add_argument(
+        "out", help="name of file to write compiled model to"
+    )
     compile_parser.add_argument(
         "--max-ngram-length",
         "-n",
@@ -41,19 +44,12 @@ def main() -> None:
         type=int,
         default=1,
     )
 
-    group = classify_parser.add_mutually_exclusive_group()
-    group.add_argument(
-        "-j",
-        "--json",
-        help="output confidence dict as JSON (default)",
-        action="store_true",
-    )
-    group.add_argument(
-        "-c",
-        "--category",
-        help="output most likely category or `None`",
-        action="store_true",
+    check_parser = subparsers.add_parser(
+        "check", help="check one word or ngram in model"
     )
+    check_parser.add_argument("model", help="compiled model to use")
+    check_parser.add_argument("token", help="token or ngram to check")
 
     pack_parser = subparsers.add_parser(
         "pack", help="pack a model from a directory"
@@ -63,29 +59,27 @@ def main() -> None:
     args = parser.parse_args()
 
     if args.subparser_name == "compile":
-        with open(args.model, "r") as f:
-            model = json.load(f)
-
-        sys.stdout.buffer.write(
-            gptc.compile(
-                model, args.max_ngram_length, args.min_count
-            ).serialize()
-        )
+        with open(args.model, "r", encoding="utf-8") as input_file:
+            model = json.load(input_file)
+
+        with open(args.out, "wb+") as output_file:
+            gptc.Model.compile(
+                model, args.max_ngram_length, args.min_count
+            ).serialize(output_file)
     elif args.subparser_name == "classify":
-        with open(args.model, "rb") as f:
-            model = gptc.deserialize(f.read())
-
-        classifier = gptc.Classifier(model, args.max_ngram_length)
+        with open(args.model, "rb") as model_file:
+            model = gptc.Model.deserialize(model_file)
 
         if sys.stdin.isatty():
            text = input("Text to analyse: ")
        else:
            text = sys.stdin.read()
 
-        if args.category:
-            print(classifier.classify(text))
-        else:
-            print(json.dumps(classifier.confidence(text)))
+        print(json.dumps(model.confidence(text, args.max_ngram_length)))
+    elif args.subparser_name == "check":
+        with open(args.model, "rb") as model_file:
+            model = gptc.Model.deserialize(model_file)
+
+        print(json.dumps(model.get(args.token)))
     else:
         print(json.dumps(gptc.pack(args.model, True)[0]))
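For reference, the reworked CLI can be exercised end to end. A sketch using subprocess, assuming the `gptc` command documented in the README is on PATH and the input files exist:

    import subprocess

    # compile now writes to a named output file instead of stdout
    subprocess.run(
        ["gptc", "compile", "-n", "2", "raw.json", "compiled.gptc"],
        check=True,
    )

    # classify always prints the JSON confidence dict now
    result = subprocess.run(
        ["gptc", "classify", "-n", "2", "compiled.gptc"],
        input="my dog barked",
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)

    # the new check subcommand looks up a single token or ngram
    subprocess.run(["gptc", "check", "compiled.gptc", "dog"], check=True)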

View File

@@ -1,91 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-
-import gptc.tokenizer, gptc.compiler, gptc.exceptions, gptc.weighting
-import warnings
-from typing import Dict, Union, cast, List
-
-
-class Classifier:
-    """A text classifier.
-
-    Parameters
-    ----------
-    model : dict
-        A compiled GPTC model.
-
-    max_ngram_length : int
-        The maximum ngram length to use when tokenizing input. If this is
-        greater than the value used when the model was compiled, it will be
-        silently lowered to that value.
-
-    Attributes
-    ----------
-    model : dict
-        The model used.
-    """
-
-    def __init__(self, model: gptc.model.Model, max_ngram_length: int = 1):
-        self.model = model
-        model_ngrams = model.max_ngram_length
-        self.max_ngram_length = min(max_ngram_length, model_ngrams)
-
-    def confidence(self, text: str) -> Dict[str, float]:
-        """Classify text with confidence.
-
-        Parameters
-        ----------
-        text : str
-            The text to classify
-
-        Returns
-        -------
-        dict
-            {category:probability, category:probability...} or {} if no words
-            matching any categories in the model were found
-        """
-
-        model = self.model.weights
-
-        tokens = gptc.tokenizer.tokenize(text, self.max_ngram_length)
-        numbered_probs: Dict[int, float] = {}
-        for word in tokens:
-            try:
-                weighted_numbers = gptc.weighting.weight(
-                    [i / 65535 for i in cast(List[float], model[word])]
-                )
-                for category, value in enumerate(weighted_numbers):
-                    try:
-                        numbered_probs[category] += value
-                    except KeyError:
-                        numbered_probs[category] = value
-            except KeyError:
-                pass
-        total = sum(numbered_probs.values())
-        probs: Dict[str, float] = {
-            self.model.names[category]: value / total
-            for category, value in numbered_probs.items()
-        }
-        return probs
-
-    def classify(self, text: str) -> Union[str, None]:
-        """Classify text.
-
-        Parameters
-        ----------
-        text : str
-            The text to classify
-
-        Returns
-        -------
-        str or None
-            The most likely category, or None if no words matching any
-            category in the model were found.
-        """
-        probs: Dict[str, float] = self.confidence(text)
-        try:
-            return sorted(probs.items(), key=lambda x: x[1])[-1][0]
-        except IndexError:
-            return None
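The deleted `classify()` above is only an argmax over the confidence dict, so applications migrating off `Classifier` can reproduce it with a small helper. A sketch against the new API (`model` is assumed to be a deserialized `gptc.Model`):

    from typing import Optional

    import gptc

    def classify(
        model: gptc.Model, text: str, max_ngram_length: int = 1
    ) -> Optional[str]:
        """Return the most likely category, or None if nothing matched."""
        probs = model.confidence(text, max_ngram_length)
        if not probs:
            return None  # no tokens matched any category in the model
        return max(probs.items(), key=lambda item: item[1])[0]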

View File

@@ -1,86 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-
-import gptc.tokenizer
-import gptc.model
-from typing import Iterable, Mapping, List, Dict, Union
-
-
-def compile(
-    raw_model: Iterable[Mapping[str, str]],
-    max_ngram_length: int = 1,
-    min_count: int = 1,
-) -> gptc.model.Model:
-    """Compile a raw model.
-
-    Parameters
-    ----------
-    raw_model : list of dict
-        A raw GPTC model.
-
-    max_ngram_length : int
-        Maximum ngram length to compile with.
-
-    Returns
-    -------
-    dict
-        A compiled GPTC model.
-    """
-
-    categories: Dict[str, List[int]] = {}
-
-    for portion in raw_model:
-        text = gptc.tokenizer.tokenize(portion["text"], max_ngram_length)
-        category = portion["category"]
-        try:
-            categories[category] += text
-        except KeyError:
-            categories[category] = text
-
-    word_counts: Dict[int, Dict[str, int]] = {}
-    names = []
-
-    for category, text in categories.items():
-        if not category in names:
-            names.append(category)
-        for word in text:
-            try:
-                counts_for_word = word_counts[word]
-            except KeyError:
-                counts_for_word = {}
-                word_counts[word] = counts_for_word
-
-            try:
-                word_counts[word][category] += 1
-            except KeyError:
-                word_counts[word][category] = 1
-
-    word_counts = {
-        word: counts
-        for word, counts in word_counts.items()
-        if sum(counts.values()) >= min_count
-    }
-
-    word_weights: Dict[int, Dict[str, float]] = {}
-    for word, values in word_counts.items():
-        for category, value in values.items():
-            try:
-                word_weights[word][category] = value / len(categories[category])
-            except KeyError:
-                word_weights[word] = {
-                    category: value / len(categories[category])
-                }
-
-    model: Dict[int, List[int]] = {}
-    for word, weights in word_weights.items():
-        total = sum(weights.values())
-        new_weights: List[int] = []
-        for category in names:
-            new_weights.append(
-                round((weights.get(category, 0) / total) * 65535)
-            )
-        model[word] = new_weights
-
-    return gptc.model.Model(model, names, max_ngram_length)
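The weight arithmetic in this deleted file survives in model.py (see `_get_weights` in the next diff). A worked run of that math on invented numbers, showing how per-category counts become 16-bit weights:

    # One token that appears 3 times in "dog" text and once in "cat" text.
    counts = {"dog": 3, "cat": 1}
    category_lengths = {"dog": 6, "cat": 4}  # total tokens per category

    weights = {
        category: count / category_lengths[category]
        for category, count in counts.items()
    }  # {'dog': 0.5, 'cat': 0.25}
    total = sum(weights.values())  # 0.75
    encoded = [
        round((weights.get(category, 0) / total) * 65535)
        for category in ["dog", "cat"]
    ]
    print(encoded)  # [43690, 21845], i.e. roughly 2/3 and 1/3 of 65535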

View File

@@ -1,9 +1,120 @@
 # SPDX-License-Identifier: GPL-3.0-or-later
 
+from typing import (
+    Iterable,
+    Mapping,
+    List,
+    Dict,
+    cast,
+    BinaryIO,
+    Tuple,
+    TypedDict,
+)
+import json
 import gptc.tokenizer
 from gptc.exceptions import InvalidModelError
-from typing import Iterable, Mapping, List, Dict, Union
-import json
+import gptc.weighting
+
+
+def _count_words(
+    raw_model: Iterable[Mapping[str, str]],
+    max_ngram_length: int,
+    hash_algorithm: str,
+) -> Tuple[Dict[int, Dict[str, int]], Dict[str, int], List[str]]:
+    word_counts: Dict[int, Dict[str, int]] = {}
+    category_lengths: Dict[str, int] = {}
+    names: List[str] = []
+
+    for portion in raw_model:
+        text = gptc.tokenizer.hash_list(
+            gptc.tokenizer.tokenize(portion["text"], max_ngram_length),
+            hash_algorithm,
+        )
+        category = portion["category"]
+
+        if not category in names:
+            names.append(category)
+
+        category_lengths[category] = category_lengths.get(category, 0) + len(
+            text
+        )
+
+        for word in text:
+            if word in word_counts:
+                try:
+                    word_counts[word][category] += 1
+                except KeyError:
+                    word_counts[word][category] = 1
+            else:
+                word_counts[word] = {category: 1}
+
+    return word_counts, category_lengths, names
+
+
+def _get_weights(
+    min_count: int,
+    word_counts: Dict[int, Dict[str, int]],
+    category_lengths: Dict[str, int],
+    names: List[str],
+) -> Dict[int, List[int]]:
+    model: Dict[int, List[int]] = {}
+    for word, counts in word_counts.items():
+        if sum(counts.values()) >= min_count:
+            weights = {
+                category: value / category_lengths[category]
+                for category, value in counts.items()
+            }
+            total = sum(weights.values())
+            new_weights: List[int] = []
+            for category in names:
+                new_weights.append(
+                    round((weights.get(category, 0) / total) * 65535)
+                )
+            model[word] = new_weights
+    return model
+
+
+class ExplanationEntry(TypedDict):
+    weight: float
+    probabilities: Dict[str, float]
+    count: int
+
+
+Explanation = Dict[
+    str,
+    ExplanationEntry,
+]
+
+Log = List[Tuple[str, float, List[float]]]
+
+
+class Confidences(dict[str, float]):
+    def __init__(self, probs: Dict[str, float]):
+        dict.__init__(self, probs)
+
+
+class TransparentConfidences(Confidences):
+    def __init__(
+        self,
+        probs: Dict[str, float],
+        explanation: Explanation,
+    ):
+        self.explanation = explanation
+        Confidences.__init__(self, probs)
+
+
+def convert_log(log: Log, names: List[str]) -> Explanation:
+    explanation: Explanation = {}
+    for word2, weight, word_probs in log:
+        if word2 in explanation:
+            explanation[word2]["count"] += 1
+        else:
+            explanation[word2] = {
+                "weight": weight,
+                "probabilities": {
+                    name: word_probs[index] for index, name in enumerate(names)
+                },
+                "count": 1,
+            }
+    return explanation
 
 
 class Model:
@@ -12,80 +123,200 @@ class Model:
         weights: Dict[int, List[int]],
         names: List[str],
         max_ngram_length: int,
+        hash_algorithm: str,
     ):
         self.weights = weights
         self.names = names
         self.max_ngram_length = max_ngram_length
+        self.hash_algorithm = hash_algorithm
 
-    def serialize(self) -> bytes:
-        out = b"GPTC model v4\n"
-        out += (
+    def confidence(
+        self, text: str, max_ngram_length: int, transparent: bool = False
+    ) -> Confidences:
+        """Classify text with confidence.
+
+        Parameters
+        ----------
+        text : str
+            The text to classify
+
+        max_ngram_length : int
+            The maximum ngram length to use in classifying
+
+        Returns
+        -------
+        dict
+            {category:probability, category:probability...} or {} if no words
+            matching any categories in the model were found
+        """
+
+        model = self.weights
+
+        max_ngram_length = min(self.max_ngram_length, max_ngram_length)
+
+        raw_tokens = gptc.tokenizer.tokenize(
+            text, min(max_ngram_length, self.max_ngram_length)
+        )
+
+        tokens = gptc.tokenizer.hash_list(
+            raw_tokens,
+            self.hash_algorithm,
+        )
+
+        if transparent:
+            token_map = {tokens[i]: raw_tokens[i] for i in range(len(tokens))}
+            log: Log = []
+
+        numbered_probs: Dict[int, float] = {}
+
+        for word in tokens:
+            try:
+                unweighted_numbers = [
+                    i / 65535 for i in cast(List[float], model[word])
+                ]
+
+                weight, weighted_numbers = gptc.weighting.weight(
+                    unweighted_numbers
+                )
+
+                if transparent:
+                    log.append(
+                        (
+                            token_map[word],
+                            weight,
+                            unweighted_numbers,
+                        )
+                    )
+
+                for category, value in enumerate(weighted_numbers):
+                    try:
+                        numbered_probs[category] += value
+                    except KeyError:
+                        numbered_probs[category] = value
+            except KeyError:
+                pass
+
+        total = sum(numbered_probs.values())
+        probs: Dict[str, float] = {
+            self.names[category]: value / total
+            for category, value in numbered_probs.items()
+        }
+
+        if transparent:
+            explanation = convert_log(log, self.names)
+            return TransparentConfidences(probs, explanation)
+
+        return Confidences(probs)
+
+    def get(self, token: str) -> Dict[str, float]:
+        try:
+            weights = self.weights[
+                gptc.tokenizer.hash_single(
+                    gptc.tokenizer.normalize(token), self.hash_algorithm
+                )
+            ]
+        except KeyError:
+            return {}
+        return {
+            category: weights[index] / 65535
+            for index, category in enumerate(self.names)
+        }
+
+    def serialize(self, file: BinaryIO) -> None:
+        file.write(b"GPTC model v6\n")
+        file.write(
             json.dumps(
                 {
                     "names": self.names,
                     "max_ngram_length": self.max_ngram_length,
-                    "has_emoji": True,
-                    # Due to an oversight in development, version 3.0.0 still
-                    # had the code used to make emoji support optional, even
-                    # though the `emoji` library was made a hard dependency.
-                    # Part of this code checked whether or not the model
-                    # supports emoji; deserialization would not work in 3.0.0
-                    # if the model was compiled without this field. Emoji are
-                    # always supported with 3.0.0 and newer when GPTC has been
-                    # installed correctly, so this value should always be True.
-                    # Related: #11
+                    "hash_algorithm": self.hash_algorithm,
                 }
             ).encode("utf-8")
             + b"\n"
         )
         for word, weights in self.weights.items():
-            out += word.to_bytes(6, "big") + b"".join(
-                [weight.to_bytes(2, "big") for weight in weights]
+            file.write(
+                word.to_bytes(6, "big")
+                + b"".join([weight.to_bytes(2, "big") for weight in weights])
             )
-        return out
 
+    @staticmethod
+    def compile(
+        raw_model: Iterable[Mapping[str, str]],
+        max_ngram_length: int = 1,
+        min_count: int = 1,
+        hash_algorithm: str = "sha256",
+    ) -> "Model":
+        """Compile a raw model.
 
-def deserialize(encoded_model: bytes) -> Model:
-    try:
-        prefix, config_json, encoded_weights = encoded_model.split(b"\n", 2)
-    except ValueError:
-        raise InvalidModelError()
+        Parameters
+        ----------
+        raw_model : list of dict
+            A raw GPTC model.
 
-    if prefix != b"GPTC model v4":
-        raise InvalidModelError()
+        max_ngram_length : int
+            Maximum ngram length to compile with.
 
-    try:
-        config = json.loads(config_json.decode("utf-8"))
-    except (UnicodeDecodeError, json.JSONDecodeError):
-        raise InvalidModelError()
+        Returns
+        -------
+        dict
+            A compiled GPTC model.
+        """
 
-    try:
-        names = config["names"]
-        max_ngram_length = config["max_ngram_length"]
-    except KeyError:
-        raise InvalidModelError()
+        word_counts, category_lengths, names = _count_words(
+            raw_model, max_ngram_length, hash_algorithm
+        )
+        model = _get_weights(min_count, word_counts, category_lengths, names)
+        return Model(model, names, max_ngram_length, hash_algorithm)
 
-    if not (
-        isinstance(names, list) and isinstance(max_ngram_length, int)
-    ) or not all([isinstance(name, str) for name in names]):
-        raise InvalidModelError()
+    @staticmethod
+    def deserialize(encoded_model: BinaryIO) -> "Model":
+        prefix = encoded_model.read(14)
+        if prefix != b"GPTC model v6\n":
+            raise InvalidModelError()
 
-    weight_code_length = 6 + 2 * len(names)
+        config_json = b""
+        while True:
+            byte = encoded_model.read(1)
+            if byte == b"\n":
+                break
+            if byte == b"":
+                raise InvalidModelError()
+            config_json += byte
 
-    if len(encoded_weights) % weight_code_length != 0:
-        raise InvalidModelError()
+        try:
+            config = json.loads(config_json.decode("utf-8"))
+        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
+            raise InvalidModelError() from exc
 
-    weight_codes = [
-        encoded_weights[x : x + weight_code_length]
-        for x in range(0, len(encoded_weights), weight_code_length)
-    ]
+        try:
+            names = config["names"]
+            max_ngram_length = config["max_ngram_length"]
+            hash_algorithm = config["hash_algorithm"]
+        except KeyError as exc:
+            raise InvalidModelError() from exc
 
-    weights = {
-        int.from_bytes(code[:6], "big"): [
-            int.from_bytes(value, "big")
-            for value in [code[x : x + 2] for x in range(6, len(code), 2)]
-        ]
-        for code in weight_codes
-    }
+        if not (
+            isinstance(names, list) and isinstance(max_ngram_length, int)
+        ) or not all(isinstance(name, str) for name in names):
+            raise InvalidModelError()
 
-    return Model(weights, names, max_ngram_length)
+        weight_code_length = 6 + 2 * len(names)
+
+        weights: Dict[int, List[int]] = {}
+
+        while True:
+            code = encoded_model.read(weight_code_length)
+            if not code:
+                break
+            if len(code) != weight_code_length:
+                raise InvalidModelError()
+
+            weights[int.from_bytes(code[:6], "big")] = [
+                int.from_bytes(value, "big")
+                for value in [code[x : x + 2] for x in range(6, len(code), 2)]
+            ]
+
+        return Model(weights, names, max_ngram_length, hash_algorithm)
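The new `serialize()`/`deserialize()` pair above defines the v6 on-disk layout: a 14-byte header line, a JSON config line, then fixed-width records of a 6-byte token hash followed by one 2-byte big-endian weight per category. A sketch that walks an existing model file by hand (the path is illustrative):

    import json

    with open("example.gptc", "rb") as model_file:
        assert model_file.read(14) == b"GPTC model v6\n"
        config = json.loads(model_file.readline().decode("utf-8"))
        names = config["names"]

        record_length = 6 + 2 * len(names)  # hash + one weight per category
        while record := model_file.read(record_length):
            token_hash = int.from_bytes(record[:6], "big")
            weights = [
                int.from_bytes(record[i : i + 2], "big")
                for i in range(6, record_length, 2)
            ]
            print(token_hash, dict(zip(names, weights)))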

View File

@@ -7,7 +7,7 @@ from typing import List, Dict, Tuple
 def pack(
     directory: str, print_exceptions: bool = False
-) -> Tuple[List[Dict[str, str]], List[Tuple[Exception]]]:
+) -> Tuple[List[Dict[str, str]], List[Tuple[OSError]]]:
     paths = os.listdir(directory)
     texts: Dict[str, List[str]] = {}
     exceptions = []
@@ -17,16 +17,18 @@ def pack(
         try:
             for file in os.listdir(os.path.join(directory, path)):
                 try:
-                    with open(os.path.join(directory, path, file)) as f:
-                        texts[path].append(f.read())
-                except Exception as e:
-                    exceptions.append((e,))
+                    with open(
+                        os.path.join(directory, path, file), encoding="utf-8"
+                    ) as input_file:
+                        texts[path].append(input_file.read())
+                except OSError as error:
+                    exceptions.append((error,))
                     if print_exceptions:
-                        print(e, file=sys.stderr)
-        except Exception as e:
-            exceptions.append((e,))
+                        print(error, file=sys.stderr)
+        except OSError as error:
+            exceptions.append((error,))
             if print_exceptions:
-                print(e, file=sys.stderr)
+                print(error, file=sys.stderr)
 
     raw_model = []
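As the loops above show, `pack()` expects one subdirectory per category, each holding text files. A sketch with an invented directory layout:

    import json
    import os

    import gptc

    # Invented layout: model_dir/<category>/<any text files>
    os.makedirs("model_dir/dog", exist_ok=True)
    os.makedirs("model_dir/cat", exist_ok=True)
    with open("model_dir/dog/sample.txt", "w", encoding="utf-8") as f:
        f.write("dogs bark")
    with open("model_dir/cat/sample.txt", "w", encoding="utf-8") as f:
        f.write("cats meow")

    raw_model, errors = gptc.pack("model_dir", print_exceptions=True)
    print(json.dumps(raw_model))  # raw model ready for Model.compile()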

View File

@@ -1,13 +1,13 @@
 # SPDX-License-Identifier: GPL-3.0-or-later
 
-from typing import List, Union
+import unicodedata
+from typing import List, cast
 import hashlib
-import base64
 import emoji
 
 
-def tokenize(text: str, max_ngram_length: int = 1) -> List[int]:
-    text = text.lower()
+def tokenize(text: str, max_ngram_length: int = 1) -> List[str]:
+    text = unicodedata.normalize("NFKD", text).casefold()
     parts = []
     highest_end = 0
     for emoji_part in emoji.emoji_list(text):
@@ -20,7 +20,12 @@ def tokenize(text: str, max_ngram_length: int = 1) -> List[int]:
     tokens = [""]
     for char in converted_text:
-        if char.isalpha() or char == "'":
+        if (
+            char.isalpha()
+            or char.isnumeric()
+            or char == "'"
+            or (char in ",." and (" " + tokens[-1])[-1].isnumeric())
+        ):
             tokens[-1] += char
         elif emoji.is_emoji(char):
             tokens.append(char)
@@ -31,16 +36,51 @@ def tokenize(text: str, max_ngram_length: int = 1) -> List[int]:
     tokens = [string for string in tokens if string]
 
     if max_ngram_length == 1:
-        ngrams = tokens
-    else:
-        ngrams = []
-        for ngram_length in range(1, max_ngram_length + 1):
-            for index in range(len(tokens) + 1 - ngram_length):
-                ngrams.append(" ".join(tokens[index : index + ngram_length]))
+        return tokens
 
-    return [
-        int.from_bytes(
-            hashlib.sha256(token.encode("utf-8")).digest()[:6], "big"
-        )
-        for token in ngrams
-    ]
+    ngrams = []
+    for ngram_length in range(1, max_ngram_length + 1):
+        for index in range(len(tokens) + 1 - ngram_length):
+            ngrams.append(" ".join(tokens[index : index + ngram_length]))
+    return ngrams
+
+
+def _hash_single(token: str, hash_function: type) -> int:
+    return int.from_bytes(
+        hash_function(token.encode("utf-8")).digest()[:6], "big"
+    )
+
+
+def _get_hash_function(hash_algorithm: str) -> type:
+    if hash_algorithm in {
+        "sha224",
+        "md5",
+        "sha512",
+        "sha3_256",
+        "blake2s",
+        "sha3_224",
+        "sha1",
+        "sha256",
+        "sha384",
+        "shake_256",
+        "blake2b",
+        "sha3_512",
+        "shake_128",
+        "sha3_384",
+    }:
+        return cast(type, getattr(hashlib, hash_algorithm))
+    raise ValueError("not a valid hash function: " + hash_algorithm)
+
+
+def hash_single(token: str, hash_algorithm: str) -> int:
+    return _hash_single(token, _get_hash_function(hash_algorithm))
+
+
+def hash_list(tokens: List[str], hash_algorithm: str) -> List[int]:
+    hash_function = _get_hash_function(hash_algorithm)
+    return [_hash_single(token, hash_function) for token in tokens]
+
+
+def normalize(text: str) -> str:
+    return " ".join(tokenize(text, 1))
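A quick demonstration of the new tokenizer split: `tokenize()` now returns ngram strings (NFKD-normalized, casefolded, numbers kept), and hashing is a separate, algorithm-aware step. `gptc.tokenizer` is an internal module, used directly here only for illustration:

    import gptc.tokenizer

    tokens = gptc.tokenizer.tokenize("Cats chase 2.5 mice!", 2)
    print(tokens)
    # ['cats', 'chase', '2.5', 'mice', 'cats chase', 'chase 2.5', '2.5 mice']

    hashes = gptc.tokenizer.hash_list(tokens, "sha256")
    print(hashes[:2])  # 48-bit integers taken from each token's digest

    print(gptc.tokenizer.normalize("Cats CHASE mice"))  # 'cats chase mice'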

View File

@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-3.0-or-later
 
 import math
-from typing import Sequence, Union, Tuple, List
+from typing import Sequence, Tuple, List
 
 
 def _mean(numbers: Sequence[float]) -> float:
@@ -39,8 +39,8 @@ def _standard_deviation(numbers: Sequence[float]) -> float:
     return math.sqrt(_mean(squared_deviations))
 
 
-def weight(numbers: Sequence[float]) -> List[float]:
+def weight(numbers: Sequence[float]) -> Tuple[float, List[float]]:
     standard_deviation = _standard_deviation(numbers)
-    weight = standard_deviation * 2
-    weighted_numbers = [i * weight for i in numbers]
-    return weighted_numbers
+    weight_assigned = standard_deviation * 2
+    weighted_numbers = [i * weight_assigned for i in numbers]
+    return weight_assigned, weighted_numbers
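The change above makes `weight()` return the multiplier it applied alongside the weighted values, which the transparent/explanation path in model.py records in its log. A worked run of the same math on invented numbers:

    import math

    numbers = [0.9, 0.1]  # normalized per-category values for one token

    mean = sum(numbers) / len(numbers)  # 0.5
    standard_deviation = math.sqrt(
        sum((n - mean) ** 2 for n in numbers) / len(numbers)
    )  # 0.4
    weight_assigned = standard_deviation * 2  # 0.8
    weighted = [n * weight_assigned for n in numbers]  # [0.72, 0.08]

    print(weight_assigned, weighted)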

Binary file not shown.

profiler.py (new file, 16 lines)
View File

@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+
+import cProfile
+import gptc
+import json
+import sys
+
+max_ngram_length = 10
+
+with open("models/raw.json") as f:
+    raw_model = json.load(f)
+
+with open("models/benchmark_text.txt") as f:
+    text = f.read()
+
+cProfile.run("gptc.Model.compile(raw_model, max_ngram_length)")

View File

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "gptc"
-version = "3.0.1"
+version = "4.0.1"
 description = "General-purpose text classifier"
 readme = "README.md"
 authors = [{ name = "Samuel Sloniker", email = "sam@kj7rrv.com"}]