13 changed files with 348 additions and 461 deletions
--- a/README.md
+++ b/README.md
@@ -18,19 +18,18 @@ This will prompt for a string and classify it, then print (in JSON) a dict of
 the format `{category: probability, category:probability, ...}` to stdout. (For
 information about `-n <max_ngram_length>`, see section "Ngrams.")
-### Checking individual words or ngrams
+Alternatively, if you only need the most likely category, you can use this:
-    gptc check <compiled model file> <token or ngram>
+    gptc classify [-n <max_ngram_length>] <-c|--category> <compiled model file>
-This is very similar to `gptc classify`, except it takes the input as an
+This will prompt for a string and classify it, outputting the category on
-argument, and it treats the input as a single token or ngram.
+stdout (or "None" if it cannot determine anything).
 ### Compiling models
-    gptc compile [-n <max_ngram_length>] [-c <min_count>] <raw model file> <compiled model file>
+    gptc compile [-n <max_ngram_length>] [-c <min_count>] <raw model file>
-This will write the compiled model encoded in binary format to `<compiled model
+This will print the compiled model encoded in binary format to stdout.
 file>`.
 If `-c` is specified, words and ngrams used less than `min_count` times will be
 excluded from the compiled model.
@@ -44,15 +43,14 @@ example of the format. Any exceptions will be printed to stderr.
 ## Library
-### `Model.serialize(file)`
+### `gptc.Classifier(model, max_ngram_length=1)`
-Write binary data representing the model to `file`.
+Create a `Classifier` object using the given compiled model (as a `gptc.Model`
 object, not as a serialized byte string).
-### `Model.deserialize(encoded_model)`
+For information about `max_ngram_length`, see section "Ngrams."
-Deserialize a `Model` from a file containing data from `Model.serialize()`.
+#### `Classifier.confidence(text)`
 ### `Model.confidence(text, max_ngram_length)`
 Classify `text`. Returns a dict of the format `{category: probability,
 category:probability, ...}`
@@ -62,15 +60,16 @@ common words between the input and the training data (likely, for example, with
 input in a different language from the training data), an empty dict will be
 returned.
-For information about `max_ngram_length`, see section "Ngrams."
+#### `Classifier.classify(text)`
-### `Model.get(token)`
+Classify `text`. Returns the category into which the text is placed (as a
 string), or `None` when it cannot classify the text.
-Return a confidence dict for the given token or ngram. This function is very
+#### `Classifier.model`
 similar to `Model.confidence()`, except it treats the input as a single token
 or ngram.
-### `Model.compile(raw_model, max_ngram_length=1, min_count=1, hash_algorithm="sha256")`
+The classifier's model.
 ### `gptc.compile(raw_model, max_ngram_length=1, min_count=1)`
 Compile a raw model (as a list, not JSON) and return the compiled model (as a
 `gptc.Model` object).
@@ -80,27 +79,15 @@ For information about `max_ngram_length`, see section "Ngrams."
 Words or ngrams used less than `min_count` times throughout the input text are
 excluded from the model.
-The hash algorithm should be left as the default, which may change with a minor
+### `gptc.Model.serialize()`
 version update, but it can be changed by the application if needed. It is
 stored in the model, so changing the algorithm does not affect compatibility.
 The following algorithms are supported:
-* `md5`
+Returns a `bytes` representing the model.
 * `sha1`
 * `sha224`
 * `sha256`
 * `sha384`
 * `sha512`
 * `sha3_224`
 * `sha3_384`
 * `sha3_256`
 * `sha3_512`
 * `shake_128`
 * `shake_256`
 * `blake2b`
 * `blake2s`
-### `gptc.pack(directory, print_exceptions=False)`
+### `gptc.deserialize(encoded_model)`
 Deserialize a `Model` from a `bytes` returned by `Model.serialize()`.
 ### `gptc.pack(directory, print_exceptions=False)`
 Pack the model in `directory` and return a tuple of the format:
@@ -112,13 +99,6 @@ GPTC.
 See `models/unpacked/` for an example of the format.
 ### `gptc.Classifier(model, max_ngram_length=1)`
 `Classifier` objects are deprecated starting with GPTC 3.1.0, and will be
 removed in 5.0.0. See [the README from
 3.0.2](https://git.kj7rrv.com/kj7rrv/gptc/src/tag/v3.0.1/README.md) if you need
 documentation.
 ## Ngrams
 GPTC optionally supports using ngrams to improve classification accuracy. They
@@ -138,8 +118,7 @@ reduced to the one used when compiling the model.
 ## Model format
-This section explains the raw model format, which is how models are created and
+This section explains the raw model format, which is how models are created and edited.
 edited.
 Raw models are formatted as a list of dicts. See below for the format:
@@ -150,10 +129,9 @@ Raw models are formatted as a list of dicts. See below for the format:
        }
    ]
-GPTC handles raw models as `list`s of `dict`s of `str`s (`List[Dict[str,
+GPTC handles raw models as `list`s of `dict`s of `str`s (`List[Dict[str, str]]`), and they can be stored
-str]]`), and they can be stored in any way these Python objects can be.
+in any way these Python objects can be. However, it is recommended to store
-However, it is recommended to store them in JSON format for compatibility with
+them in JSON format for compatibility with the command-line tool.
 the command-line tool.
 ## Emoji
--- a/benchmark.py
+++ b/benchmark.py
@@ -25,7 +25,7 @@ print(
    round(
        1000000
        * timeit.timeit(
-            "gptc.Model.compile(raw_model, max_ngram_length)",
+            "gptc.compile(raw_model, max_ngram_length)",
            number=compile_iterations,
            globals=globals(),
        )
--- a/gptc/__init__.py
+++ b/gptc/__init__.py
@@ -2,11 +2,13 @@
 """General-Purpose Text Classifier"""
-from gptc.pack import pack
+from gptc.compiler import compile as compile
-from gptc.model import Model
+from gptc.classifier import Classifier as Classifier
-from gptc.tokenizer import normalize
+from gptc.pack import pack as pack
 from gptc.tokenizer import has_emoji as has_emoji
 from gptc.model import Model as Model, deserialize as deserialize
 from gptc.exceptions import (
-    GPTCError,
+    GPTCError as GPTCError,
-    ModelError,
+    ModelError as ModelError,
-    InvalidModelError,
+    InvalidModelError as InvalidModelError,
 )
--- a/gptc/__main__.py
+++ b/gptc/__main__.py
@@ -17,9 +17,6 @@ def main() -> None:
        "compile", help="compile a raw model"
    )
    compile_parser.add_argument("model", help="raw model to compile")
    compile_parser.add_argument(
        "out", help="name of file to write compiled model to"
    )
    compile_parser.add_argument(
        "--max-ngram-length",
        "-n",
@@ -44,12 +41,19 @@ def main() -> None:
        type=int,
        default=1,
    )
-
+    group = classify_parser.add_mutually_exclusive_group()
-    check_parser = subparsers.add_parser(
+    group.add_argument(
-        "check", help="check one word or ngram in model"
+        "-j",
        "--json",
        help="output confidence dict as JSON (default)",
        action="store_true",
    )
    group.add_argument(
        "-c",
        "--category",
        help="output most likely category or `None`",
        action="store_true",
    )
    check_parser.add_argument("model", help="compiled model to use")
    check_parser.add_argument("token", help="token or ngram to check")
    pack_parser = subparsers.add_parser(
        "pack", help="pack a model from a directory"
@@ -59,27 +63,29 @@ def main() -> None:
    args = parser.parse_args()
    if args.subparser_name == "compile":
-        with open(args.model, "r", encoding="utf-8") as input_file:
+        with open(args.model, "r") as f:
-            model = json.load(input_file)
+            model = json.load(f)
-        with open(args.out, "wb+") as output_file:
+        sys.stdout.buffer.write(
-            gptc.Model.compile(
+            gptc.compile(
                model, args.max_ngram_length, args.min_count
-            ).serialize(output_file)
+            ).serialize()
        )
    elif args.subparser_name == "classify":
-        with open(args.model, "rb") as model_file:
+        with open(args.model, "rb") as f:
-            model = gptc.Model.deserialize(model_file)
+            model = gptc.deserialize(f.read())
        classifier = gptc.Classifier(model, args.max_ngram_length)
        if sys.stdin.isatty():
            text = input("Text to analyse: ")
        else:
            text = sys.stdin.read()
-        print(json.dumps(model.confidence(text, args.max_ngram_length)))
+        if args.category:
-    elif args.subparser_name == "check":
+            print(classifier.classify(text))
-        with open(args.model, "rb") as model_file:
+        else:
-            model = gptc.Model.deserialize(model_file)
+            print(json.dumps(classifier.confidence(text)))
        print(json.dumps(model.get(args.token)))
    else:
        print(json.dumps(gptc.pack(args.model, True)[0]))
--- a/gptc/classifier.py
+++ b/gptc/classifier.py
@@ -0,0 +1,94 @@
 # SPDX-License-Identifier: GPL-3.0-or-later
 import gptc.tokenizer, gptc.compiler, gptc.exceptions, gptc.weighting
 import warnings
 from typing import Dict, Union, cast, List
 class Classifier:
    """A text classifier.
    Parameters
    ----------
    model : gptc.model.Model
        A compiled GPTC model.
    max_ngram_length : int
        The maximum ngram length to use when tokenizing input. If this is
        greater than the value used when the model was compiled, it will be
        silently lowered to that value.
    Attributes
    ----------
    model : gptc.model.Model
        The model used.
    """
    def __init__(self, model: gptc.model.Model, max_ngram_length: int = 1):
        self.model = model
        model_ngrams = model.max_ngram_length
        self.max_ngram_length = min(max_ngram_length, model_ngrams)
        self.has_emoji = gptc.tokenizer.has_emoji and model.has_emoji
    def confidence(self, text: str) -> Dict[str, float]:
        """Classify text with confidence.
        Parameters
        ----------
        text : str
            The text to classify
        Returns
        -------
        dict
            {category:probability, category:probability...} or {} if no words
            matching any categories in the model were found
        """
        model = self.model.weights
        tokens = gptc.tokenizer.tokenize(
            text, self.max_ngram_length, self.has_emoji
        )
        numbered_probs: Dict[int, float] = {}
        for word in tokens:
            try:
                weighted_numbers = gptc.weighting.weight(
                    [i / 65535 for i in cast(List[float], model[word])]
                )
                for category, value in enumerate(weighted_numbers):
                    try:
                        numbered_probs[category] += value
                    except KeyError:
                        numbered_probs[category] = value
            except KeyError:
                pass
        total = sum(numbered_probs.values())
        probs: Dict[str, float] = {
            self.model.names[category]: value / total
            for category, value in numbered_probs.items()
        }
        return probs
    def classify(self, text: str) -> Union[str, None]:
        """Classify text.
        Parameters
        ----------
        text : str
            The text to classify
        Returns
        -------
        str or None
            The most likely category, or None if no words matching any
            category in the model were found.
        """
        probs: Dict[str, float] = self.confidence(text)
        try:
            return sorted(probs.items(), key=lambda x: x[1])[-1][0]
        except IndexError:
            return None
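The aggregation performed by `Classifier.confidence()` and `Classifier.classify()` above can be sketched without the rest of the package. This is a minimal illustration, not the library's code: the names `confidence_from_rows` and `top_category` are hypothetical, the input is assumed to be per-token score lists that have already been weighted, and the zero-total guard is an added assumption that the code in this diff does not contain.

```python
from typing import Dict, List, Optional


def confidence_from_rows(
    rows: List[List[float]], names: List[str]
) -> Dict[str, float]:
    """Sum per-token category scores and normalise them to probabilities."""
    totals: Dict[int, float] = {}
    for row in rows:
        for index, value in enumerate(row):
            totals[index] = totals.get(index, 0.0) + value
    total = sum(totals.values())
    if total == 0:
        # Assumption: report "no signal" as an empty dict instead of dividing
        # by zero. The diff's code has no such guard.
        return {}
    return {names[index]: value / total for index, value in totals.items()}


def top_category(probs: Dict[str, float]) -> Optional[str]:
    """Return the most probable category, or None for an empty dict."""
    if not probs:
        return None
    return max(probs, key=lambda name: probs[name])


probs = confidence_from_rows([[0.25, 0.5], [0.0, 0.25]], ["ham", "spam"])
print(probs, top_category(probs))
```

On ties, `max` returns the first maximal key, whereas `sorted(...)[-1]` in the diff returns the last, so tie-breaking may differ between the two.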
--- a/gptc/compiler.py
+++ b/gptc/compiler.py
@@ -0,0 +1,86 @@
 # SPDX-License-Identifier: GPL-3.0-or-later
 import gptc.tokenizer
 import gptc.model
 from typing import Iterable, Mapping, List, Dict, Union
 def compile(
    raw_model: Iterable[Mapping[str, str]],
    max_ngram_length: int = 1,
    min_count: int = 1,
 ) -> gptc.model.Model:
    """Compile a raw model.
    Parameters
    ----------
    raw_model : list of dict
        A raw GPTC model.
    max_ngram_length : int
        Maximum ngram length to compile with.
    Returns
    -------
    gptc.model.Model
        A compiled GPTC model.
    """
    categories: Dict[str, List[int]] = {}
    for portion in raw_model:
        text = gptc.tokenizer.tokenize(portion["text"], max_ngram_length)
        category = portion["category"]
        try:
            categories[category] += text
        except KeyError:
            categories[category] = text
    word_counts: Dict[int, Dict[str, int]] = {}
    names = []
    for category, text in categories.items():
        if category not in names:
            names.append(category)
        for word in text:
            try:
                counts_for_word = word_counts[word]
            except KeyError:
                counts_for_word = {}
                word_counts[word] = counts_for_word
            try:
                word_counts[word][category] += 1
            except KeyError:
                word_counts[word][category] = 1
    word_counts = {
        word: counts
        for word, counts in word_counts.items()
        if sum(counts.values()) >= min_count
    }
    word_weights: Dict[int, Dict[str, float]] = {}
    for word, values in word_counts.items():
        for category, value in values.items():
            try:
                word_weights[word][category] = value / len(categories[category])
            except KeyError:
                word_weights[word] = {
                    category: value / len(categories[category])
                }
    model: Dict[int, List[int]] = {}
    for word, weights in word_weights.items():
        total = sum(weights.values())
        new_weights: List[int] = []
        for category in names:
            new_weights.append(
                round((weights.get(category, 0) / total) * 65535)
            )
        model[word] = new_weights
    return gptc.model.Model(model, names, max_ngram_length)
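The final step of `compile()` above divides each token's per-category count by that category's token total, then rescales so the weights sum to roughly 65535. A minimal sketch of that step for a single token follows. The helper name `scaled_weights` and its arguments are hypothetical; `category_lengths` stands in for `len(categories[category])` in the code above.

```python
from typing import Dict, List


def scaled_weights(
    counts: Dict[str, int],
    category_lengths: Dict[str, int],
    names: List[str],
) -> List[int]:
    """Turn one token's per-category counts into 16-bit integer weights."""
    # Relative frequency of the token within each category's training text.
    frequencies = {
        category: count / category_lengths[category]
        for category, count in counts.items()
    }
    total = sum(frequencies.values())
    # Categories in which the token never appears get weight 0.
    return [
        round((frequencies.get(name, 0) / total) * 65535) for name in names
    ]


print(scaled_weights({"spam": 3, "ham": 1}, {"spam": 4, "ham": 4}, ["spam", "ham"]))
```

Because each weight is rounded independently, the list does not always sum to exactly 65535.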
--- a/gptc/model.py
+++ b/gptc/model.py
@@ -1,120 +1,9 @@
 # SPDX-License-Identifier: GPL-3.0-or-later
 from typing import (
        Iterable,
        Mapping,
    List,
    Dict,
    cast,
    BinaryIO,
    Tuple,
    TypedDict,
 )
 import json
 import gptc.tokenizer
 from gptc.exceptions import InvalidModelError
-import gptc.weighting
+from typing import Iterable, Mapping, List, Dict, Union
-
+import json
 def _count_words(
    raw_model: Iterable[Mapping[str, str]],
    max_ngram_length: int,
    hash_algorithm: str,
 ) -> Tuple[Dict[int, Dict[str, int]], Dict[str, int], List[str]]:
    word_counts: Dict[int, Dict[str, int]] = {}
    category_lengths: Dict[str, int] = {}
    names: List[str] = []
    for portion in raw_model:
        text = gptc.tokenizer.hash_list(
            gptc.tokenizer.tokenize(portion["text"], max_ngram_length),
            hash_algorithm,
        )
        category = portion["category"]
        if not category in names:
            names.append(category)
        category_lengths[category] = category_lengths.get(category, 0) + len(
            text
        )
        for word in text:
            if word in word_counts:
                try:
                    word_counts[word][category] += 1
                except KeyError:
                    word_counts[word][category] = 1
            else:
                word_counts[word] = {category: 1}
    return word_counts, category_lengths, names
 def _get_weights(
    min_count: int,
    word_counts: Dict[int, Dict[str, int]],
    category_lengths: Dict[str, int],
    names: List[str],
 ) -> Dict[int, List[int]]:
    model: Dict[int, List[int]] = {}
    for word, counts in word_counts.items():
        if sum(counts.values()) >= min_count:
            weights = {
                category: value / category_lengths[category]
                for category, value in counts.items()
            }
            total = sum(weights.values())
            new_weights: List[int] = []
            for category in names:
                new_weights.append(
                    round((weights.get(category, 0) / total) * 65535)
                )
            model[word] = new_weights
    return model
 class ExplanationEntry(TypedDict):
    weight: float
    probabilities: Dict[str, float]
    count: int
 Explanation = Dict[
    str,
    ExplanationEntry,
 ]
 Log = List[Tuple[str, float, List[float]]]
 class Confidences(dict[str, float]):
    def __init__(self, probs: Dict[str, float]):
        dict.__init__(self, probs)
 class TransparentConfidences(Confidences):
    def __init__(
        self,
        probs: Dict[str, float],
        explanation: Explanation,
    ):
        self.explanation = explanation
        Confidences.__init__(self, probs)
 def convert_log(log: Log, names: List[str]) -> Explanation:
    explanation: Explanation = {}
    for word2, weight, word_probs in log:
        if word2 in explanation:
            explanation[word2]["count"] += 1
        else:
            explanation[word2] = {
                "weight": weight,
                "probabilities": {
                    name: word_probs[index] for index, name in enumerate(names)
                },
                "count": 1,
            }
    return explanation
 class Model:
@@ -123,200 +12,78 @@ class Model:
        weights: Dict[int, List[int]],
        names: List[str],
        max_ngram_length: int,
-        hash_algorithm: str,
+        has_emoji: Union[None, bool] = None,
    ):
        self.weights = weights
        self.names = names
        self.max_ngram_length = max_ngram_length
-        self.hash_algorithm = hash_algorithm
+        self.has_emoji = (
-
+            gptc.tokenizer.has_emoji if has_emoji is None else has_emoji
    def confidence(
        self, text: str, max_ngram_length: int, transparent: bool = False
    ) -> Confidences:
        """Classify text with confidence.
        Parameters
        ----------
        text : str
            The text to classify
        max_ngram_length : int
            The maximum ngram length to use in classifying
        Returns
        -------
        dict
            {category:probability, category:probability...} or {} if no words
            matching any categories in the model were found
        """
        model = self.weights
        max_ngram_length = min(self.max_ngram_length, max_ngram_length)
        raw_tokens = gptc.tokenizer.tokenize(
            text, min(max_ngram_length, self.max_ngram_length)
        )
-        tokens = gptc.tokenizer.hash_list(
+    def serialize(self) -> bytes:
-            raw_tokens,
+        out = b"GPTC model v4\n"
-            self.hash_algorithm,
+        out += (
        )
        if transparent:
            token_map = {tokens[i]: raw_tokens[i] for i in range(len(tokens))}
            log: Log = []
        numbered_probs: Dict[int, float] = {}
        for word in tokens:
            try:
                unweighted_numbers = [
                    i / 65535 for i in cast(List[float], model[word])
                ]
                weight, weighted_numbers = gptc.weighting.weight(
                    unweighted_numbers
                )
                if transparent:
                    log.append(
                        (
                            token_map[word],
                            weight,
                            unweighted_numbers,
                        )
                    )
                for category, value in enumerate(weighted_numbers):
                    try:
                        numbered_probs[category] += value
                    except KeyError:
                        numbered_probs[category] = value
            except KeyError:
                pass
        total = sum(numbered_probs.values())
        probs: Dict[str, float] = {
            self.names[category]: value / total
            for category, value in numbered_probs.items()
        }
        if transparent:
            explanation = convert_log(log, self.names)
            return TransparentConfidences(probs, explanation)
        return Confidences(probs)
    def get(self, token: str) -> Dict[str, float]:
        try:
            weights = self.weights[
                gptc.tokenizer.hash_single(
                    gptc.tokenizer.normalize(token), self.hash_algorithm
                )
            ]
        except KeyError:
            return {}
        return {
            category: weights[index] / 65535
            for index, category in enumerate(self.names)
        }
    def serialize(self, file: BinaryIO) -> None:
        file.write(b"GPTC model v6\n")
        file.write(
            json.dumps(
                {
                    "names": self.names,
                    "max_ngram_length": self.max_ngram_length,
-                    "hash_algorithm": self.hash_algorithm,
+                    "has_emoji": self.has_emoji,
                }
            ).encode("utf-8")
            + b"\n"
        )
        for word, weights in self.weights.items():
-            file.write(
+            out += word.to_bytes(6, "big") + b"".join(
-                word.to_bytes(6, "big")
+                [weight.to_bytes(2, "big") for weight in weights]
                + b"".join([weight.to_bytes(2, "big") for weight in weights])
            )
        return out
    @staticmethod
    def compile(
        raw_model: Iterable[Mapping[str, str]],
        max_ngram_length: int = 1,
        min_count: int = 1,
        hash_algorithm: str = "sha256",
        ) -> 'Model':
        """Compile a raw model.
-        Parameters
+def deserialize(encoded_model: bytes) -> Model:
-        ----------
+    try:
-        raw_model : list of dict
+        prefix, config_json, encoded_weights = encoded_model.split(b"\n", 2)
-            A raw GPTC model.
+    except ValueError:
        max_ngram_length : int
            Maximum ngram length to compile with.
        Returns
        -------
        dict
            A compiled GPTC model.
        """
        word_counts, category_lengths, names = _count_words(
            raw_model, max_ngram_length, hash_algorithm
        )
        model = _get_weights(min_count, word_counts, category_lengths, names)
        return Model(model, names, max_ngram_length, hash_algorithm)
    @staticmethod
    def deserialize(encoded_model: BinaryIO) -> "Model":
        prefix = encoded_model.read(14)
        if prefix != b"GPTC model v6\n":
        raise InvalidModelError()
-        config_json = b""
+    if prefix != b"GPTC model v4":
        while True:
            byte = encoded_model.read(1)
            if byte == b"\n":
                break
            if byte == b"":
        raise InvalidModelError()
            config_json += byte
    try:
        config = json.loads(config_json.decode("utf-8"))
-        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
+    except (UnicodeDecodeError, json.JSONDecodeError):
-            raise InvalidModelError() from exc
+        raise InvalidModelError()
    try:
        names = config["names"]
        max_ngram_length = config["max_ngram_length"]
-            hash_algorithm = config["hash_algorithm"]
+        has_emoji = config["has_emoji"]
-        except KeyError as exc:
+    except KeyError:
-            raise InvalidModelError() from exc
+        raise InvalidModelError()
    if not (
-            isinstance(names, list) and isinstance(max_ngram_length, int)
+        isinstance(names, list)
-        ) or not all(isinstance(name, str) for name in names):
+        and isinstance(max_ngram_length, int)
        and isinstance(has_emoji, bool)
    ) or not all([isinstance(name, str) for name in names]):
        raise InvalidModelError()
    weight_code_length = 6 + 2 * len(names)
-        weights: Dict[int, List[int]] = {}
+    if len(encoded_weights) % weight_code_length != 0:
        while True:
            code = encoded_model.read(weight_code_length)
            if not code:
                break
            if len(code) != weight_code_length:
        raise InvalidModelError()
-            weights[int.from_bytes(code[:6], "big")] = [
+    weight_codes = [
        encoded_weights[x : x + weight_code_length]
        for x in range(0, len(encoded_weights), weight_code_length)
    ]
    weights = {
        int.from_bytes(code[:6], "big"): [
            int.from_bytes(value, "big")
            for value in [code[x : x + 2] for x in range(6, len(code), 2)]
        ]
        for code in weight_codes
    }
-        return Model(weights, names, max_ngram_length, hash_algorithm)
+    return Model(weights, names, max_ngram_length, has_emoji)
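The byte layout that the new `serialize()` / `deserialize()` pair reads and writes is: a header line, a JSON config line, then fixed-width records of a 6-byte token hash followed by one 2-byte weight per category. The round trip can be sketched with only the standard library. `encode` and `decode` are hypothetical stand-ins that work on plain dicts rather than `Model` objects, and they raise `ValueError` where the diff raises `InvalidModelError`.

```python
import json
from typing import Dict, List, Tuple

HEADER = b"GPTC model v4"  # prefix checked by deserialize() in this diff


def encode(
    weights: Dict[int, List[int]], names: List[str], max_ngram_length: int
) -> bytes:
    """Serialize a weight table using the layout described above."""
    config = {
        "names": names,
        "max_ngram_length": max_ngram_length,
        "has_emoji": False,  # assumption: fixed value for this sketch
    }
    out = HEADER + b"\n" + json.dumps(config).encode("utf-8") + b"\n"
    for token_hash, token_weights in weights.items():
        out += token_hash.to_bytes(6, "big")
        out += b"".join(w.to_bytes(2, "big") for w in token_weights)
    return out


def decode(data: bytes) -> Tuple[Dict[int, List[int]], List[str], int]:
    """Parse bytes produced by encode()."""
    # maxsplit=2 matters: the binary body may itself contain newline bytes.
    prefix, config_json, body = data.split(b"\n", 2)
    if prefix != HEADER:
        raise ValueError("unexpected header")
    config = json.loads(config_json.decode("utf-8"))
    names = config["names"]
    record_length = 6 + 2 * len(names)
    if len(body) % record_length != 0:
        raise ValueError("weight table is truncated")
    weights: Dict[int, List[int]] = {}
    for start in range(0, len(body), record_length):
        record = body[start : start + record_length]
        weights[int.from_bytes(record[:6], "big")] = [
            int.from_bytes(record[i : i + 2], "big")
            for i in range(6, record_length, 2)
        ]
    return weights, names, config["max_ngram_length"]


table = {10: [65535, 0], 2**47: [100, 200]}
assert decode(encode(table, ["ham", "spam"], 2)) == (table, ["ham", "spam"], 2)
```

Token hash 10 is included deliberately: its encoding contains the byte `0x0a`, which exercises the `maxsplit` argument.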
--- a/gptc/pack.py
+++ b/gptc/pack.py
@@ -7,7 +7,7 @@ from typing import List, Dict, Tuple
 def pack(
    directory: str, print_exceptions: bool = False
-) -> Tuple[List[Dict[str, str]], List[Tuple[OSError]]]:
+) -> Tuple[List[Dict[str, str]], List[Tuple[Exception]]]:
    paths = os.listdir(directory)
    texts: Dict[str, List[str]] = {}
    exceptions = []
@@ -17,18 +17,16 @@ def pack(
        try:
            for file in os.listdir(os.path.join(directory, path)):
                try:
-                    with open(
+                    with open(os.path.join(directory, path, file)) as f:
-                        os.path.join(directory, path, file), encoding="utf-8"
+                        texts[path].append(f.read())
-                    ) as input_file:
+                except Exception as e:
-                        texts[path].append(input_file.read())
+                    exceptions.append((e,))
                except OSError as error:
                    exceptions.append((error,))
                    if print_exceptions:
-                        print(error, file=sys.stderr)
+                        print(e, file=sys.stderr)
-        except OSError as error:
+        except Exception as e:
-            exceptions.append((error,))
+            exceptions.append((e,))
            if print_exceptions:
-                print(error, file=sys.stderr)
+                print(e, file=sys.stderr)
    raw_model = []
--- a/gptc/tokenizer.py
+++ b/gptc/tokenizer.py
@@ -1,13 +1,25 @@
 # SPDX-License-Identifier: GPL-3.0-or-later
-import unicodedata
+from typing import List, Union
 from typing import List, cast
 import hashlib
 import base64
 try:
    import emoji
    has_emoji = True
 except ImportError:
    has_emoji = False
-def tokenize(text: str, max_ngram_length: int = 1) -> List[str]:
+
-    text = unicodedata.normalize("NFKD", text).casefold()
+def tokenize(
    text: str, max_ngram_length: int = 1, use_emoji: bool = True
 ) -> List[int]:
    """Convert a string to a list of lemmas."""
    converted_text: Union[str, List[str]] = text.lower()
    if has_emoji and use_emoji:
        text = text.lower()
        parts = []
        highest_end = 0
        for emoji_part in emoji.emoji_list(text):
@@ -20,14 +32,9 @@ def tokenize(text: str, max_ngram_length: int = 1) -> List[str]:
    tokens = [""]
    for char in converted_text:
-        if (
+        if char.isalpha() or char == "'":
            char.isalpha()
            or char.isnumeric()
            or char == "'"
            or (char in ",." and (" " + tokens[-1])[-1].isnumeric())
        ):
            tokens[-1] += char
-        elif emoji.is_emoji(char):
+        elif has_emoji and emoji.is_emoji(char):
            tokens.append(char)
            tokens.append("")
        elif tokens[-1] != "":
@@ -36,51 +43,16 @@ def tokenize(text: str, max_ngram_length: int = 1) -> List[str]:
    tokens = [string for string in tokens if string]
    if max_ngram_length == 1:
-        return tokens
+        ngrams = tokens
-
+    else:
        ngrams = []
        for ngram_length in range(1, max_ngram_length + 1):
            for index in range(len(tokens) + 1 - ngram_length):
                ngrams.append(" ".join(tokens[index : index + ngram_length]))
    return ngrams
-
+    return [
-def _hash_single(token: str, hash_function: type) -> int:
+        int.from_bytes(
-    return int.from_bytes(
+            hashlib.sha256(token.encode("utf-8")).digest()[:6], "big"
        hash_function(token.encode("utf-8")).digest()[:6], "big"
        )
-
+        for token in ngrams
-
+    ]
 def _get_hash_function(hash_algorithm: str) -> type:
    if hash_algorithm in {
        "sha224",
        "md5",
        "sha512",
        "sha3_256",
        "blake2s",
        "sha3_224",
        "sha1",
        "sha256",
        "sha384",
        "shake_256",
        "blake2b",
        "sha3_512",
        "shake_128",
        "sha3_384",
    }:
        return cast(type, getattr(hashlib, hash_algorithm))
    raise ValueError("not a valid hash function: " + hash_algorithm)
 def hash_single(token: str, hash_algorithm: str) -> int:
    return _hash_single(token, _get_hash_function(hash_algorithm))
 def hash_list(tokens: List[str], hash_algorithm: str) -> List[int]:
    hash_function = _get_hash_function(hash_algorithm)
    return [_hash_single(token, hash_function) for token in tokens]
 def normalize(text: str) -> str:
    return " ".join(tokenize(text, 1))
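With this change `tokenize()` returns integer hashes directly: ngrams are built from the token list, and each one is reduced to the first 6 bytes of its SHA-256 digest. A small sketch of those two steps is below. The helper names `build_ngrams` and `hash_token` are hypothetical, and the input is assumed to be a list of tokens that has already been split.

```python
import hashlib
from typing import List


def build_ngrams(tokens: List[str], max_ngram_length: int) -> List[str]:
    """Return all ngrams of length 1..max_ngram_length, shortest first."""
    ngrams: List[str] = []
    for ngram_length in range(1, max_ngram_length + 1):
        for index in range(len(tokens) + 1 - ngram_length):
            ngrams.append(" ".join(tokens[index : index + ngram_length]))
    return ngrams


def hash_token(token: str) -> int:
    """Map a token to a 48-bit integer taken from its SHA-256 digest."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big")


for ngram in build_ngrams(["free", "gift", "card"], 2):
    print(ngram, hash_token(ngram))
```

Truncating to 6 bytes keeps each key below 2**48, which matches the 6-byte field used when the model is serialized.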
--- a/gptc/weighting.py
+++ b/gptc/weighting.py
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-3.0-or-later
 import math
-from typing import Sequence, Tuple, List
+from typing import Sequence, Union, Tuple, List
 def _mean(numbers: Sequence[float]) -> float:
@@ -39,8 +39,8 @@ def _standard_deviation(numbers: Sequence[float]) -> float:
    return math.sqrt(_mean(squared_deviations))
-def weight(numbers: Sequence[float]) -> Tuple[float, List[float]]:
+def weight(numbers: Sequence[float]) -> List[float]:
    standard_deviation = _standard_deviation(numbers)
-    weight_assigned = standard_deviation * 2
+    weight = standard_deviation * 2
-    weighted_numbers = [i * weight_assigned for i in numbers]
+    weighted_numbers = [i * weight for i in numbers]
-    return weight_assigned, weighted_numbers
+    return weighted_numbers
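After this change `weight()` returns only the scaled list, multiplying every probability by twice the standard deviation of the list. The effect is that tokens spread evenly across categories contribute little or nothing. The sketch below illustrates this. The bodies of `_mean` and `_standard_deviation` are not shown in the diff, so the population standard deviation used here is an assumption.

```python
import math
from typing import List, Sequence


def mean(numbers: Sequence[float]) -> float:
    return sum(numbers) / len(numbers)


def standard_deviation(numbers: Sequence[float]) -> float:
    # Assumption: population standard deviation (divide by n, not n - 1).
    average = mean(numbers)
    return math.sqrt(mean([(n - average) ** 2 for n in numbers]))


def weight(numbers: Sequence[float]) -> List[float]:
    """Scale each value by twice the standard deviation of the list."""
    factor = standard_deviation(numbers) * 2
    return [n * factor for n in numbers]


print(weight([1.0, 0.0]))  # token seen in only one category
print(weight([0.5, 0.5]))  # token equally common in both categories
```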
--- a/models/compiled.gptc
+++ b/models/compiled.gptc
--- a/profiler.py
+++ b/profiler.py
@@ -1,16 +0,0 @@
 # SPDX-License-Identifier: GPL-3.0-or-later
 import cProfile
 import gptc
 import json
 import sys
 max_ngram_length = 10
 with open("models/raw.json") as f:
    raw_model = json.load(f)
 with open("models/benchmark_text.txt") as f:
    text = f.read()
 cProfile.run("gptc.Model.compile(raw_model, max_ngram_length)")
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "gptc"
-version = "4.0.1"
+version = "3.0.0"
 description = "General-purpose text classifier"
 readme = "README.md"
 authors = [{ name = "Samuel Sloniker", email = "sam@kj7rrv.com"}]