Compare commits

...

46 Commits

SHA1        Message                                         Date
71e9249ff4  Classifier objects will be removed in 5.0       2023-05-31 13:42:42 -07:00
97c4eef086  Move deserialize to Model object                2023-04-17 21:35:38 -07:00
457b569741  Update README                                   2023-04-17 21:33:03 -07:00
4546c4cffa  Fix profiler and benchmark                      2023-04-17 21:28:24 -07:00
7b7ef39d0b  Merge compiler into model.py                    2023-04-17 21:15:18 -07:00
a252a15e9d  Clean up code                                   2023-04-17 21:06:47 -07:00
9513025e60  Fix type annotations                            2023-04-17 18:16:20 -07:00
2c3fc77ba6  Finish classification explanations              2023-04-16 15:48:19 -07:00
            A couple things I missed in 7f68dc6fc6
d8f3d2e701  Bump model version                              2023-04-16 15:36:49 -07:00
            99ad07a876 broke the model format, although probably only in a few
            edge cases
            Still enough of a change for a model version bump
7f68dc6fc6  Add classification explanations                 2023-04-16 15:35:53 -07:00
            Closes #17
99ad07a876  Casefold                                        2023-04-16 14:49:03 -07:00
            Closes #14
f38f4ca801  Add profiler                                    2023-04-16 14:27:31 -07:00
56550ca457  Remove Classifier objects                       2023-04-16 14:27:07 -07:00
            Closes #16
75fdb5ba3c  Split compiler into two functions               2023-01-15 09:39:35 -08:00
071656c2d2  Bump version to 4.0.1                           2022-12-24 12:49:12 -08:00
aad590636a  Fix type annotations                            2022-12-24 12:48:43 -08:00
099e810a18  Fix check                                       2022-12-24 12:44:09 -08:00
822aa7d1fd  Bump version to 4.0.0                           2022-12-24 12:18:51 -08:00
8417c8acda  Recompile model                                 2022-12-24 12:18:25 -08:00
ec7f4116fc  Include file name of output in arguments        2022-12-24 12:17:44 -08:00
f8dbc78b82  Allow hash algorithm selection                  2022-12-24 11:18:05 -08:00
            Closes #9
6f21e0d4e9  Remove debug print lines from compiler          2022-12-24 10:48:09 -08:00
41bba61410  Remove has_emoji and bump model version         2022-12-24 10:47:23 -08:00
            Closes #11
10668691ea  Normalize characters                            2022-12-24 10:46:40 -08:00
            Closes #3
295a1189de  Include numbers in tokenized output             2022-12-24 10:42:50 -08:00
            Closes #12
74b2ba81b9  Deserialize from file                           2022-12-23 10:49:24 -08:00
9916744801  New type annotation for serialize               2022-12-23 10:33:56 -08:00
7e7b5f3e9c  Performance improvements                        2022-12-22 18:01:37 -08:00
a76c6d3da8  Bump version to 3.1.1                           2022-11-27 15:01:06 -08:00
c84758af56  list, not tuple                                 2022-11-27 15:00:37 -08:00
3a9c8d2bf2  Revert "Bump version to 3.1.1"                  2022-11-27 14:56:10 -08:00
            This reverts commit 12f97ae765.
12f97ae765  Bump version to 3.1.1                           2022-11-27 14:54:11 -08:00
c754293d69  Compiler performance improvements               2022-11-27 14:32:44 -08:00
8d42a92848  Add type annotation to Model.get()              2022-11-27 13:36:49 -08:00
e4eb322aa7  Bump version to 3.1.0                           2022-11-26 18:37:11 -08:00
83ef71e8ce  Remove doc for gptc classify --category         2022-11-26 18:36:41 -08:00
991d3fd54a  Revert "Bump version to 3.1.0"                  2022-11-26 18:36:18 -08:00
            This reverts commit b3e6a13e65.
b3e6a13e65  Bump version to 3.1.0                           2022-11-26 18:34:04 -08:00
b1228edd9c  Add CLI for Model.get()                         2022-11-26 18:28:44 -08:00
25192ffddf  Add ability to look up individual token         2022-11-26 18:17:02 -08:00
            Closes #10
548d670960  Use Classifier for --category                   2022-11-26 17:50:26 -08:00
b3a43150d8  Split hash function                             2022-11-26 17:42:42 -08:00
08437a2696  Add normalize()                                 2022-11-26 17:17:28 -08:00
fc4665bb9e  Separate tokenization and hashing               2022-11-26 17:04:56 -08:00
30287288f2  Fix README issues                               2022-11-26 16:45:30 -08:00
448f200923  Add confidence to Model; deprecate Classifier   2022-11-26 16:41:29 -08:00
13 changed files with 452 additions and 325 deletions

View File

@@ -18,18 +18,19 @@ This will prompt for a string and classify it, then print (in JSON) a dict of
 the format `{category: probability, category:probability, ...}` to stdout. (For
 information about `-n <max_ngram_length>`, see section "Ngrams.")
 
-Alternatively, if you only need the most likely category, you can use this:
+### Checking individual words or ngrams
 
-    gptc classify [-n <max_ngram_length>] <-c|--category> <compiled model file>
+    gptc check <compiled model file> <token or ngram>
 
-This will prompt for a string and classify it, outputting the category on
-stdout (or "None" if it cannot determine anything).
+This is very similar to `gptc classify`, except it takes the input as an
+argument, and it treats the input as a single token or ngram.
 
 ### Compiling models
 
-    gptc compile [-n <max_ngram_length>] [-c <min_count>] <raw model file>
+    gptc compile [-n <max_ngram_length>] [-c <min_count>] <raw model file> <compiled model file>
 
-This will print the compiled model encoded in binary format to stdout.
+This will write the compiled model encoded in binary format to `<compiled model
+file>`.
 
 If `-c` is specified, words and ngrams used less than `min_count` times will be
 excluded from the compiled model.
@@ -43,14 +44,15 @@ example of the format. Any exceptions will be printed to stderr.
 ## Library
 
-### `gptc.Classifier(model, max_ngram_length=1)`
+### `Model.serialize(file)`
 
-Create a `Classifier` object using the given compiled model (as a `gptc.Model`
-object, not as a serialized byte string).
+Write binary data representing the model to `file`.
 
-For information about `max_ngram_length`, see section "Ngrams."
+### `Model.deserialize(encoded_model)`
 
-#### `Classifier.confidence(text)`
+Deserialize a `Model` from a file containing data from `Model.serialize()`.
+
+### `Model.confidence(text, max_ngram_length)`
 
 Classify `text`. Returns a dict of the format `{category: probability,
 category:probability, ...}`
@@ -60,16 +62,15 @@ common words between the input and the training data (likely, for example, with
 input in a different language from the training data), an empty dict will be
 returned.
 
-#### `Classifier.classify(text)`
+For information about `max_ngram_length`, see section "Ngrams."
 
-Classify `text`. Returns the category into which the text is placed (as a
-string), or `None` when it cannot classify the text.
+### `Model.get(token)`
 
-#### `Classifier.model`
+Return a confidence dict for the given token or ngram. This function is very
+similar to `Model.confidence()`, except it treats the input as a single token
+or ngram.
 
-The classifier's model.
-
-### `gptc.compile(raw_model, max_ngram_length=1, min_count=1)`
+### `Model.compile(raw_model, max_ngram_length=1, min_count=1, hash_algorithm="sha256")`
 
 Compile a raw model (as a list, not JSON) and return the compiled model (as a
 `gptc.Model` object).
@@ -79,15 +80,27 @@ For information about `max_ngram_length`, see section "Ngrams."
 Words or ngrams used less than `min_count` times throughout the input text are
 excluded from the model.
 
-### `gptc.Model.serialize()`
+The hash algorithm should be left as the default, which may change with a minor
+version update, but it can be changed by the application if needed. It is
+stored in the model, so changing the algorithm does not affect compatibility.
+The following algorithms are supported:
 
-Returns a `bytes` representing the model.
+* `md5`
+* `sha1`
+* `sha224`
+* `sha256`
+* `sha384`
+* `sha512`
+* `sha3_224`
+* `sha3_384`
+* `sha3_256`
+* `sha3_512`
+* `shake_128`
+* `shake_256`
+* `blake2b`
+* `blake2s`
 
-### `gptc.deserialize(encoded_model)`
-
-Deserialize a `Model` from a `bytes` returned by `Model.serialize()`.
-
-### `gptc.pack(directory, print_exceptions=False`
+### `gptc.pack(directory, print_exceptions=False)`
 
 Pack the model in `directory` and return a tuple of the format:
@@ -99,6 +112,13 @@ GPTC.
 See `models/unpacked/` for an example of the format.
 
+### `gptc.Classifier(model, max_ngram_length=1)`
+
+`Classifier` objects are deprecated starting with GPTC 3.1.0, and will be
+removed in 5.0.0. See [the README from
+3.0.2](https://git.kj7rrv.com/kj7rrv/gptc/src/tag/v3.0.1/README.md) if you need
+documentation.
+
 ## Ngrams
 
 GPTC optionally supports using ngrams to improve classification accuracy. They
@@ -118,7 +138,8 @@ reduced to the one used when compiling the model.
 ## Model format
 
-This section explains the raw model format, which is how models are created and edited.
+This section explains the raw model format, which is how models are created and
+edited.
 
 Raw models are formatted as a list of dicts. See below for the format:
@@ -129,9 +150,10 @@ Raw models are formatted as a list of dicts. See below for the format:
     }
 ]
 
-GPTC handles raw models as `list`s of `dict`s of `str`s (`List[Dict[str, str]]`), and they can be stored
-in any way these Python objects can be. However, it is recommended to store
-them in JSON format for compatibility with the command-line tool.
+GPTC handles raw models as `list`s of `dict`s of `str`s (`List[Dict[str,
+str]]`), and they can be stored in any way these Python objects can be.
+However, it is recommended to store them in JSON format for compatibility with
+the command-line tool.
 
 ## Emoji
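The README changes above amount to a new file-based workflow. A minimal sketch of that workflow under the documented v4 API, with an invented two-category raw model and illustrative file names:

    import gptc

    # Invented raw model, in the format described in the README above.
    raw_model = [
        {"category": "dog", "text": "dogs bark and wag their tails"},
        {"category": "cat", "text": "cats meow and purr on the couch"},
    ]

    # Compile, then write the model to a file (compilation no longer
    # prints to stdout).
    model = gptc.Model.compile(raw_model, max_ngram_length=2)
    with open("example.gptc", "wb") as output_file:
        model.serialize(output_file)

    # deserialize now takes a file object rather than a bytes value.
    with open("example.gptc", "rb") as model_file:
        model = gptc.Model.deserialize(model_file)

    print(model.confidence("my dog barked", max_ngram_length=2))
    print(model.get("cats"))  # confidence dict for a single token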

View File

@@ -25,7 +25,7 @@ print(
     round(
         1000000
         * timeit.timeit(
-            "gptc.compile(raw_model, max_ngram_length)",
+            "gptc.Model.compile(raw_model, max_ngram_length)",
             number=compile_iterations,
             globals=globals(),
         )

View File

@@ -2,12 +2,11 @@
 """General-Purpose Text Classifier"""
 
-from gptc.compiler import compile as compile
-from gptc.classifier import Classifier as Classifier
-from gptc.pack import pack as pack
-from gptc.model import Model as Model, deserialize as deserialize
+from gptc.pack import pack
+from gptc.model import Model
+from gptc.tokenizer import normalize
 from gptc.exceptions import (
-    GPTCError as GPTCError,
-    ModelError as ModelError,
-    InvalidModelError as InvalidModelError,
+    GPTCError,
+    ModelError,
+    InvalidModelError,
 )

View File

@@ -17,6 +17,9 @@ def main() -> None:
         "compile", help="compile a raw model"
     )
     compile_parser.add_argument("model", help="raw model to compile")
+    compile_parser.add_argument(
+        "out", help="name of file to write compiled model to"
+    )
     compile_parser.add_argument(
         "--max-ngram-length",
         "-n",
@@ -41,19 +44,12 @@ def main() -> None:
         type=int,
         default=1,
     )
 
-    group = classify_parser.add_mutually_exclusive_group()
-    group.add_argument(
-        "-j",
-        "--json",
-        help="output confidence dict as JSON (default)",
-        action="store_true",
-    )
-    group.add_argument(
-        "-c",
-        "--category",
-        help="output most likely category or `None`",
-        action="store_true",
+    check_parser = subparsers.add_parser(
+        "check", help="check one word or ngram in model"
     )
+    check_parser.add_argument("model", help="compiled model to use")
+    check_parser.add_argument("token", help="token or ngram to check")
 
     pack_parser = subparsers.add_parser(
         "pack", help="pack a model from a directory"
@@ -63,29 +59,27 @@ def main() -> None:
     args = parser.parse_args()
 
     if args.subparser_name == "compile":
-        with open(args.model, "r") as f:
-            model = json.load(f)
-
-        sys.stdout.buffer.write(
-            gptc.compile(
-                model, args.max_ngram_length, args.min_count
-            ).serialize()
-        )
+        with open(args.model, "r", encoding="utf-8") as input_file:
+            model = json.load(input_file)
+
+        with open(args.out, "wb+") as output_file:
+            gptc.Model.compile(
+                model, args.max_ngram_length, args.min_count
+            ).serialize(output_file)
     elif args.subparser_name == "classify":
-        with open(args.model, "rb") as f:
-            model = gptc.deserialize(f.read())
-
-        classifier = gptc.Classifier(model, args.max_ngram_length)
+        with open(args.model, "rb") as model_file:
+            model = gptc.Model.deserialize(model_file)
 
         if sys.stdin.isatty():
            text = input("Text to analyse: ")
        else:
            text = sys.stdin.read()
 
-        if args.category:
-            print(classifier.classify(text))
-        else:
-            print(json.dumps(classifier.confidence(text)))
+        print(json.dumps(model.confidence(text, args.max_ngram_length)))
+    elif args.subparser_name == "check":
+        with open(args.model, "rb") as model_file:
+            model = gptc.Model.deserialize(model_file)
+
+        print(json.dumps(model.get(args.token)))
     else:
         print(json.dumps(gptc.pack(args.model, True)[0]))
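For reference, the reworked CLI can be exercised end to end. A sketch using subprocess, assuming the `gptc` command documented in the README is on PATH and the input files exist:

    import subprocess

    # compile now writes to a named output file instead of stdout
    subprocess.run(
        ["gptc", "compile", "-n", "2", "raw.json", "compiled.gptc"],
        check=True,
    )

    # classify always prints the JSON confidence dict now
    result = subprocess.run(
        ["gptc", "classify", "-n", "2", "compiled.gptc"],
        input="my dog barked",
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)

    # the new check subcommand looks up a single token or ngram
    subprocess.run(["gptc", "check", "compiled.gptc", "dog"], check=True)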

View File

@@ -1,91 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-
-import gptc.tokenizer, gptc.compiler, gptc.exceptions, gptc.weighting
-import warnings
-from typing import Dict, Union, cast, List
-
-
-class Classifier:
-    """A text classifier.
-
-    Parameters
-    ----------
-    model : dict
-        A compiled GPTC model.
-
-    max_ngram_length : int
-        The maximum ngram length to use when tokenizing input. If this is
-        greater than the value used when the model was compiled, it will be
-        silently lowered to that value.
-
-    Attributes
-    ----------
-    model : dict
-        The model used.
-    """
-
-    def __init__(self, model: gptc.model.Model, max_ngram_length: int = 1):
-        self.model = model
-        model_ngrams = model.max_ngram_length
-        self.max_ngram_length = min(max_ngram_length, model_ngrams)
-
-    def confidence(self, text: str) -> Dict[str, float]:
-        """Classify text with confidence.
-
-        Parameters
-        ----------
-        text : str
-            The text to classify
-
-        Returns
-        -------
-        dict
-            {category:probability, category:probability...} or {} if no words
-            matching any categories in the model were found
-        """
-
-        model = self.model.weights
-
-        tokens = gptc.tokenizer.tokenize(text, self.max_ngram_length)
-        numbered_probs: Dict[int, float] = {}
-        for word in tokens:
-            try:
-                weighted_numbers = gptc.weighting.weight(
-                    [i / 65535 for i in cast(List[float], model[word])]
-                )
-                for category, value in enumerate(weighted_numbers):
-                    try:
-                        numbered_probs[category] += value
-                    except KeyError:
-                        numbered_probs[category] = value
-            except KeyError:
-                pass
-        total = sum(numbered_probs.values())
-        probs: Dict[str, float] = {
-            self.model.names[category]: value / total
-            for category, value in numbered_probs.items()
-        }
-        return probs
-
-    def classify(self, text: str) -> Union[str, None]:
-        """Classify text.
-
-        Parameters
-        ----------
-        text : str
-            The text to classify
-
-        Returns
-        -------
-        str or None
-            The most likely category, or None if no words matching any
-            category in the model were found.
-        """
-        probs: Dict[str, float] = self.confidence(text)
-        try:
-            return sorted(probs.items(), key=lambda x: x[1])[-1][0]
-        except IndexError:
-            return None
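The deleted `classify()` above is only an argmax over the confidence dict, so applications migrating off `Classifier` can reproduce it with a small helper. A sketch against the new API (`model` is assumed to be a deserialized `gptc.Model`):

    from typing import Optional

    import gptc

    def classify(
        model: gptc.Model, text: str, max_ngram_length: int = 1
    ) -> Optional[str]:
        """Return the most likely category, or None if nothing matched."""
        probs = model.confidence(text, max_ngram_length)
        if not probs:
            return None  # no tokens matched any category in the model
        return max(probs.items(), key=lambda item: item[1])[0]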

View File

@@ -1,86 +0,0 @@
-# SPDX-License-Identifier: GPL-3.0-or-later
-
-import gptc.tokenizer
-import gptc.model
-from typing import Iterable, Mapping, List, Dict, Union
-
-
-def compile(
-    raw_model: Iterable[Mapping[str, str]],
-    max_ngram_length: int = 1,
-    min_count: int = 1,
-) -> gptc.model.Model:
-    """Compile a raw model.
-
-    Parameters
-    ----------
-    raw_model : list of dict
-        A raw GPTC model.
-
-    max_ngram_length : int
-        Maximum ngram length to compile with.
-
-    Returns
-    -------
-    dict
-        A compiled GPTC model.
-    """
-
-    categories: Dict[str, List[int]] = {}
-
-    for portion in raw_model:
-        text = gptc.tokenizer.tokenize(portion["text"], max_ngram_length)
-        category = portion["category"]
-        try:
-            categories[category] += text
-        except KeyError:
-            categories[category] = text
-
-    word_counts: Dict[int, Dict[str, int]] = {}
-    names = []
-
-    for category, text in categories.items():
-        if not category in names:
-            names.append(category)
-        for word in text:
-            try:
-                counts_for_word = word_counts[word]
-            except KeyError:
-                counts_for_word = {}
-                word_counts[word] = counts_for_word
-
-            try:
-                word_counts[word][category] += 1
-            except KeyError:
-                word_counts[word][category] = 1
-
-    word_counts = {
-        word: counts
-        for word, counts in word_counts.items()
-        if sum(counts.values()) >= min_count
-    }
-
-    word_weights: Dict[int, Dict[str, float]] = {}
-    for word, values in word_counts.items():
-        for category, value in values.items():
-            try:
-                word_weights[word][category] = value / len(categories[category])
-            except KeyError:
-                word_weights[word] = {
-                    category: value / len(categories[category])
-                }
-
-    model: Dict[int, List[int]] = {}
-    for word, weights in word_weights.items():
-        total = sum(weights.values())
-        new_weights: List[int] = []
-        for category in names:
-            new_weights.append(
-                round((weights.get(category, 0) / total) * 65535)
-            )
-        model[word] = new_weights
-
-    return gptc.model.Model(model, names, max_ngram_length)
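The weight arithmetic in this deleted file survives in model.py (see `_get_weights` in the next diff). A worked run of that math on invented numbers, showing how per-category counts become 16-bit weights:

    # One token that appears 3 times in "dog" text and once in "cat" text.
    counts = {"dog": 3, "cat": 1}
    category_lengths = {"dog": 6, "cat": 4}  # total tokens per category

    weights = {
        category: count / category_lengths[category]
        for category, count in counts.items()
    }  # {'dog': 0.5, 'cat': 0.25}
    total = sum(weights.values())  # 0.75
    encoded = [
        round((weights.get(category, 0) / total) * 65535)
        for category in ["dog", "cat"]
    ]
    print(encoded)  # [43690, 21845], i.e. roughly 2/3 and 1/3 of 65535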

View File

@@ -1,9 +1,120 @@
 # SPDX-License-Identifier: GPL-3.0-or-later
 
+from typing import (
+    Iterable,
+    Mapping,
+    List,
+    Dict,
+    cast,
+    BinaryIO,
+    Tuple,
+    TypedDict,
+)
+import json
 import gptc.tokenizer
 from gptc.exceptions import InvalidModelError
-from typing import Iterable, Mapping, List, Dict, Union
-import json
+import gptc.weighting
+
+
+def _count_words(
+    raw_model: Iterable[Mapping[str, str]],
+    max_ngram_length: int,
+    hash_algorithm: str,
+) -> Tuple[Dict[int, Dict[str, int]], Dict[str, int], List[str]]:
+    word_counts: Dict[int, Dict[str, int]] = {}
+    category_lengths: Dict[str, int] = {}
+    names: List[str] = []
+
+    for portion in raw_model:
+        text = gptc.tokenizer.hash_list(
+            gptc.tokenizer.tokenize(portion["text"], max_ngram_length),
+            hash_algorithm,
+        )
+        category = portion["category"]
+
+        if not category in names:
+            names.append(category)
+
+        category_lengths[category] = category_lengths.get(category, 0) + len(
+            text
+        )
+
+        for word in text:
+            if word in word_counts:
+                try:
+                    word_counts[word][category] += 1
+                except KeyError:
+                    word_counts[word][category] = 1
+            else:
+                word_counts[word] = {category: 1}
+
+    return word_counts, category_lengths, names
+
+
+def _get_weights(
+    min_count: int,
+    word_counts: Dict[int, Dict[str, int]],
+    category_lengths: Dict[str, int],
+    names: List[str],
+) -> Dict[int, List[int]]:
+    model: Dict[int, List[int]] = {}
+    for word, counts in word_counts.items():
+        if sum(counts.values()) >= min_count:
+            weights = {
+                category: value / category_lengths[category]
+                for category, value in counts.items()
+            }
+            total = sum(weights.values())
+            new_weights: List[int] = []
+            for category in names:
+                new_weights.append(
+                    round((weights.get(category, 0) / total) * 65535)
+                )
+            model[word] = new_weights
+    return model
+
+
+class ExplanationEntry(TypedDict):
+    weight: float
+    probabilities: Dict[str, float]
+    count: int
+
+
+Explanation = Dict[
+    str,
+    ExplanationEntry,
+]
+
+Log = List[Tuple[str, float, List[float]]]
+
+
+class Confidences(dict[str, float]):
+    def __init__(self, probs: Dict[str, float]):
+        dict.__init__(self, probs)
+
+
+class TransparentConfidences(Confidences):
+    def __init__(
+        self,
+        probs: Dict[str, float],
+        explanation: Explanation,
+    ):
+        self.explanation = explanation
+        Confidences.__init__(self, probs)
+
+
+def convert_log(log: Log, names: List[str]) -> Explanation:
+    explanation: Explanation = {}
+    for word2, weight, word_probs in log:
+        if word2 in explanation:
+            explanation[word2]["count"] += 1
+        else:
+            explanation[word2] = {
+                "weight": weight,
+                "probabilities": {
+                    name: word_probs[index] for index, name in enumerate(names)
+                },
+                "count": 1,
+            }
+    return explanation
 
 
 class Model:
@@ -12,80 +123,200 @@ class Model:
         weights: Dict[int, List[int]],
         names: List[str],
         max_ngram_length: int,
+        hash_algorithm: str,
     ):
         self.weights = weights
         self.names = names
         self.max_ngram_length = max_ngram_length
+        self.hash_algorithm = hash_algorithm
 
-    def serialize(self) -> bytes:
-        out = b"GPTC model v4\n"
-        out += (
+    def confidence(
+        self, text: str, max_ngram_length: int, transparent: bool = False
+    ) -> Confidences:
+        """Classify text with confidence.
+
+        Parameters
+        ----------
+        text : str
+            The text to classify
+
+        max_ngram_length : int
+            The maximum ngram length to use in classifying
+
+        Returns
+        -------
+        dict
+            {category:probability, category:probability...} or {} if no words
+            matching any categories in the model were found
+        """
+
+        model = self.weights
+
+        max_ngram_length = min(self.max_ngram_length, max_ngram_length)
+
+        raw_tokens = gptc.tokenizer.tokenize(
+            text, min(max_ngram_length, self.max_ngram_length)
+        )
+
+        tokens = gptc.tokenizer.hash_list(
+            raw_tokens,
+            self.hash_algorithm,
+        )
+
+        if transparent:
+            token_map = {tokens[i]: raw_tokens[i] for i in range(len(tokens))}
+            log: Log = []
+
+        numbered_probs: Dict[int, float] = {}
+
+        for word in tokens:
+            try:
+                unweighted_numbers = [
+                    i / 65535 for i in cast(List[float], model[word])
+                ]
+
+                weight, weighted_numbers = gptc.weighting.weight(
+                    unweighted_numbers
+                )
+
+                if transparent:
+                    log.append(
+                        (
+                            token_map[word],
+                            weight,
+                            unweighted_numbers,
+                        )
+                    )
+
+                for category, value in enumerate(weighted_numbers):
+                    try:
+                        numbered_probs[category] += value
+                    except KeyError:
+                        numbered_probs[category] = value
+            except KeyError:
+                pass
+
+        total = sum(numbered_probs.values())
+        probs: Dict[str, float] = {
+            self.names[category]: value / total
+            for category, value in numbered_probs.items()
+        }
+
+        if transparent:
+            explanation = convert_log(log, self.names)
+            return TransparentConfidences(probs, explanation)
+
+        return Confidences(probs)
+
+    def get(self, token: str) -> Dict[str, float]:
+        try:
+            weights = self.weights[
+                gptc.tokenizer.hash_single(
+                    gptc.tokenizer.normalize(token), self.hash_algorithm
+                )
+            ]
+        except KeyError:
+            return {}
+        return {
+            category: weights[index] / 65535
+            for index, category in enumerate(self.names)
+        }
+
+    def serialize(self, file: BinaryIO) -> None:
+        file.write(b"GPTC model v6\n")
+        file.write(
             json.dumps(
                 {
                     "names": self.names,
                     "max_ngram_length": self.max_ngram_length,
-                    "has_emoji": True,
-                    # Due to an oversight in development, version 3.0.0 still
-                    # had the code used to make emoji support optional, even
-                    # though the `emoji` library was made a hard dependency.
-                    # Part of this code checked whether or not the model
-                    # supports emoji; deserialization would not work in 3.0.0
-                    # if the model was compiled without this field. Emoji are
-                    # always supported with 3.0.0 and newer when GPTC has been
-                    # installed correctly, so this value should always be True.
-                    # Related: #11
+                    "hash_algorithm": self.hash_algorithm,
                 }
             ).encode("utf-8")
             + b"\n"
         )
         for word, weights in self.weights.items():
-            out += word.to_bytes(6, "big") + b"".join(
-                [weight.to_bytes(2, "big") for weight in weights]
+            file.write(
+                word.to_bytes(6, "big")
+                + b"".join([weight.to_bytes(2, "big") for weight in weights])
             )
-        return out
 
+    @staticmethod
+    def compile(
+        raw_model: Iterable[Mapping[str, str]],
+        max_ngram_length: int = 1,
+        min_count: int = 1,
+        hash_algorithm: str = "sha256",
+    ) -> "Model":
+        """Compile a raw model.
 
-def deserialize(encoded_model: bytes) -> Model:
-    try:
-        prefix, config_json, encoded_weights = encoded_model.split(b"\n", 2)
-    except ValueError:
-        raise InvalidModelError()
+        Parameters
+        ----------
+        raw_model : list of dict
+            A raw GPTC model.
 
-    if prefix != b"GPTC model v4":
-        raise InvalidModelError()
+        max_ngram_length : int
+            Maximum ngram length to compile with.
 
-    try:
-        config = json.loads(config_json.decode("utf-8"))
-    except (UnicodeDecodeError, json.JSONDecodeError):
-        raise InvalidModelError()
+        Returns
+        -------
+        dict
+            A compiled GPTC model.
+        """
 
-    try:
-        names = config["names"]
-        max_ngram_length = config["max_ngram_length"]
-    except KeyError:
-        raise InvalidModelError()
+        word_counts, category_lengths, names = _count_words(
+            raw_model, max_ngram_length, hash_algorithm
+        )
+        model = _get_weights(min_count, word_counts, category_lengths, names)
+        return Model(model, names, max_ngram_length, hash_algorithm)
 
-    if not (
-        isinstance(names, list) and isinstance(max_ngram_length, int)
-    ) or not all([isinstance(name, str) for name in names]):
-        raise InvalidModelError()
+    @staticmethod
+    def deserialize(encoded_model: BinaryIO) -> "Model":
+        prefix = encoded_model.read(14)
+        if prefix != b"GPTC model v6\n":
+            raise InvalidModelError()
 
-    weight_code_length = 6 + 2 * len(names)
+        config_json = b""
+        while True:
+            byte = encoded_model.read(1)
+            if byte == b"\n":
+                break
+            if byte == b"":
+                raise InvalidModelError()
+            config_json += byte
 
-    if len(encoded_weights) % weight_code_length != 0:
-        raise InvalidModelError()
+        try:
+            config = json.loads(config_json.decode("utf-8"))
+        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
+            raise InvalidModelError() from exc
 
-    weight_codes = [
-        encoded_weights[x : x + weight_code_length]
-        for x in range(0, len(encoded_weights), weight_code_length)
-    ]
+        try:
+            names = config["names"]
+            max_ngram_length = config["max_ngram_length"]
+            hash_algorithm = config["hash_algorithm"]
+        except KeyError as exc:
+            raise InvalidModelError() from exc
 
-    weights = {
-        int.from_bytes(code[:6], "big"): [
-            int.from_bytes(value, "big")
-            for value in [code[x : x + 2] for x in range(6, len(code), 2)]
-        ]
-        for code in weight_codes
-    }
+        if not (
+            isinstance(names, list) and isinstance(max_ngram_length, int)
+        ) or not all(isinstance(name, str) for name in names):
+            raise InvalidModelError()
 
-    return Model(weights, names, max_ngram_length)
+        weight_code_length = 6 + 2 * len(names)
+
+        weights: Dict[int, List[int]] = {}
+
+        while True:
+            code = encoded_model.read(weight_code_length)
+            if not code:
+                break
+            if len(code) != weight_code_length:
+                raise InvalidModelError()
+
+            weights[int.from_bytes(code[:6], "big")] = [
+                int.from_bytes(value, "big")
+                for value in [code[x : x + 2] for x in range(6, len(code), 2)]
+            ]
+
+        return Model(weights, names, max_ngram_length, hash_algorithm)
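The new `serialize()`/`deserialize()` pair above defines the v6 on-disk layout: a 14-byte header line, a JSON config line, then fixed-width records of a 6-byte token hash followed by one 2-byte big-endian weight per category. A sketch that walks an existing model file by hand (the path is illustrative):

    import json

    with open("example.gptc", "rb") as model_file:
        assert model_file.read(14) == b"GPTC model v6\n"
        config = json.loads(model_file.readline().decode("utf-8"))
        names = config["names"]

        record_length = 6 + 2 * len(names)  # hash + one weight per category
        while record := model_file.read(record_length):
            token_hash = int.from_bytes(record[:6], "big")
            weights = [
                int.from_bytes(record[i : i + 2], "big")
                for i in range(6, record_length, 2)
            ]
            print(token_hash, dict(zip(names, weights)))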

View File

@@ -7,7 +7,7 @@ from typing import List, Dict, Tuple
 def pack(
     directory: str, print_exceptions: bool = False
-) -> Tuple[List[Dict[str, str]], List[Tuple[Exception]]]:
+) -> Tuple[List[Dict[str, str]], List[Tuple[OSError]]]:
     paths = os.listdir(directory)
     texts: Dict[str, List[str]] = {}
     exceptions = []
@@ -17,16 +17,18 @@ def pack(
         try:
             for file in os.listdir(os.path.join(directory, path)):
                 try:
-                    with open(os.path.join(directory, path, file)) as f:
-                        texts[path].append(f.read())
-                except Exception as e:
-                    exceptions.append((e,))
+                    with open(
+                        os.path.join(directory, path, file), encoding="utf-8"
+                    ) as input_file:
+                        texts[path].append(input_file.read())
+                except OSError as error:
+                    exceptions.append((error,))
                     if print_exceptions:
-                        print(e, file=sys.stderr)
-        except Exception as e:
-            exceptions.append((e,))
+                        print(error, file=sys.stderr)
+        except OSError as error:
+            exceptions.append((error,))
             if print_exceptions:
-                print(e, file=sys.stderr)
+                print(error, file=sys.stderr)
 
     raw_model = []
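As the loops above show, `pack()` expects one subdirectory per category, each holding text files. A sketch with an invented directory layout:

    import json
    import os

    import gptc

    # Invented layout: model_dir/<category>/<any text files>
    os.makedirs("model_dir/dog", exist_ok=True)
    os.makedirs("model_dir/cat", exist_ok=True)
    with open("model_dir/dog/sample.txt", "w", encoding="utf-8") as f:
        f.write("dogs bark")
    with open("model_dir/cat/sample.txt", "w", encoding="utf-8") as f:
        f.write("cats meow")

    raw_model, errors = gptc.pack("model_dir", print_exceptions=True)
    print(json.dumps(raw_model))  # raw model ready for Model.compile()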

View File

@@ -1,13 +1,13 @@
 # SPDX-License-Identifier: GPL-3.0-or-later
 
-from typing import List, Union
+import unicodedata
+from typing import List, cast
 import hashlib
-import base64
 import emoji
 
 
-def tokenize(text: str, max_ngram_length: int = 1) -> List[int]:
-    text = text.lower()
+def tokenize(text: str, max_ngram_length: int = 1) -> List[str]:
+    text = unicodedata.normalize("NFKD", text).casefold()
     parts = []
     highest_end = 0
     for emoji_part in emoji.emoji_list(text):
@@ -20,7 +20,12 @@ def tokenize(text: str, max_ngram_length: int = 1) -> List[int]:
     tokens = [""]
     for char in converted_text:
-        if char.isalpha() or char == "'":
+        if (
+            char.isalpha()
+            or char.isnumeric()
+            or char == "'"
+            or (char in ",." and (" " + tokens[-1])[-1].isnumeric())
+        ):
             tokens[-1] += char
         elif emoji.is_emoji(char):
             tokens.append(char)
@@ -31,16 +36,51 @@ def tokenize(text: str, max_ngram_length: int = 1) -> List[int]:
     tokens = [string for string in tokens if string]
 
     if max_ngram_length == 1:
-        ngrams = tokens
-    else:
-        ngrams = []
-        for ngram_length in range(1, max_ngram_length + 1):
-            for index in range(len(tokens) + 1 - ngram_length):
-                ngrams.append(" ".join(tokens[index : index + ngram_length]))
+        return tokens
 
-    return [
-        int.from_bytes(
-            hashlib.sha256(token.encode("utf-8")).digest()[:6], "big"
-        )
-        for token in ngrams
-    ]
+    ngrams = []
+    for ngram_length in range(1, max_ngram_length + 1):
+        for index in range(len(tokens) + 1 - ngram_length):
+            ngrams.append(" ".join(tokens[index : index + ngram_length]))
+    return ngrams
+
+
+def _hash_single(token: str, hash_function: type) -> int:
+    return int.from_bytes(
+        hash_function(token.encode("utf-8")).digest()[:6], "big"
+    )
+
+
+def _get_hash_function(hash_algorithm: str) -> type:
+    if hash_algorithm in {
+        "sha224",
+        "md5",
+        "sha512",
+        "sha3_256",
+        "blake2s",
+        "sha3_224",
+        "sha1",
+        "sha256",
+        "sha384",
+        "shake_256",
+        "blake2b",
+        "sha3_512",
+        "shake_128",
+        "sha3_384",
+    }:
+        return cast(type, getattr(hashlib, hash_algorithm))
+    raise ValueError("not a valid hash function: " + hash_algorithm)
+
+
+def hash_single(token: str, hash_algorithm: str) -> int:
+    return _hash_single(token, _get_hash_function(hash_algorithm))
+
+
+def hash_list(tokens: List[str], hash_algorithm: str) -> List[int]:
+    hash_function = _get_hash_function(hash_algorithm)
+    return [_hash_single(token, hash_function) for token in tokens]
+
+
+def normalize(text: str) -> str:
+    return " ".join(tokenize(text, 1))
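A quick demonstration of the new tokenizer split: `tokenize()` now returns ngram strings (NFKD-normalized, casefolded, numbers kept), and hashing is a separate, algorithm-aware step. `gptc.tokenizer` is an internal module, used directly here only for illustration:

    import gptc.tokenizer

    tokens = gptc.tokenizer.tokenize("Cats chase 2.5 mice!", 2)
    print(tokens)
    # ['cats', 'chase', '2.5', 'mice', 'cats chase', 'chase 2.5', '2.5 mice']

    hashes = gptc.tokenizer.hash_list(tokens, "sha256")
    print(hashes[:2])  # 48-bit integers taken from each token's digest

    print(gptc.tokenizer.normalize("Cats CHASE mice"))  # 'cats chase mice'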

View File

@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-3.0-or-later
 
 import math
-from typing import Sequence, Union, Tuple, List
+from typing import Sequence, Tuple, List
 
 
 def _mean(numbers: Sequence[float]) -> float:
@@ -39,8 +39,8 @@ def _standard_deviation(numbers: Sequence[float]) -> float:
     return math.sqrt(_mean(squared_deviations))
 
 
-def weight(numbers: Sequence[float]) -> List[float]:
+def weight(numbers: Sequence[float]) -> Tuple[float, List[float]]:
     standard_deviation = _standard_deviation(numbers)
-    weight = standard_deviation * 2
-    weighted_numbers = [i * weight for i in numbers]
-    return weighted_numbers
+    weight_assigned = standard_deviation * 2
+    weighted_numbers = [i * weight_assigned for i in numbers]
+    return weight_assigned, weighted_numbers
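The change above makes `weight()` return the multiplier it applied alongside the weighted values, which the transparent/explanation path in model.py records in its log. A worked run of the same math on invented numbers:

    import math

    numbers = [0.9, 0.1]  # normalized per-category values for one token

    mean = sum(numbers) / len(numbers)  # 0.5
    standard_deviation = math.sqrt(
        sum((n - mean) ** 2 for n in numbers) / len(numbers)
    )  # 0.4
    weight_assigned = standard_deviation * 2  # 0.8
    weighted = [n * weight_assigned for n in numbers]  # [0.72, 0.08]

    print(weight_assigned, weighted)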

Binary file not shown.

profiler.py (new file, 16 lines)
View File

@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+
+import cProfile
+import gptc
+import json
+import sys
+
+max_ngram_length = 10
+
+with open("models/raw.json") as f:
+    raw_model = json.load(f)
+
+with open("models/benchmark_text.txt") as f:
+    text = f.read()
+
+cProfile.run("gptc.Model.compile(raw_model, max_ngram_length)")

View File

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "gptc"
-version = "3.0.1"
+version = "4.0.1"
 description = "General-purpose text classifier"
 readme = "README.md"
 authors = [{ name = "Samuel Sloniker", email = "sam@kj7rrv.com"}]