mlx.data.tokenizer_helpers.read_trie_from_spm
- class mlx.data.tokenizer_helpers.read_trie_from_spm(spm_file)
Read an mlx.data.core.CharTrie from a sentencepiece file.
Reading directly from a model file requires installing sentencepiece. However, if the vocabulary and the scores have been exported, the file can be read without installing sentencepiece.
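As a minimal usage sketch, the returned trie and weights can be handed to a trie-based tokenizer. The helper name `load_trie_tokenizer` and the file path are hypothetical, and the example assumes that `mlx.data.core.Tokenizer` accepts the weights through a `trie_key_scores` argument and exposes `tokenize_shortest()`; check both against the installed version.

```python
def load_trie_tokenizer(spm_file):
    """Build a trie-based tokenizer from a sentencepiece model or vocab file."""
    # Imported lazily so this sketch can be defined without mlx-data installed.
    from mlx.data.core import Tokenizer
    from mlx.data.tokenizer_helpers import read_trie_from_spm

    # read_trie_from_spm returns the trie and one weight per token.
    trie, weights = read_trie_from_spm(spm_file)
    # Assumption: Tokenizer takes the per-token scores via `trie_key_scores`.
    return Tokenizer(trie, trie_key_scores=weights)


# Hypothetical usage; "tokenizer.model" is a placeholder path.
# tokenizer = load_trie_tokenizer("tokenizer.model")
# token_ids = tokenizer.tokenize_shortest("hello world")
```

Passing a sentencepiece model file requires the sentencepiece package, while an exported vocab file does not.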
Note
Sentencepiece models are almost always BPE models whose scores are the associated log likelihoods from a unigram language model. Using the mlx.data.core.CharTrie and the loaded scores yields the shortest possible tokenization with the highest possible log likelihood, but it can differ slightly from the BPE one.
Use read_bpe_from_spm() to load the model for use with a mlx.data.core.BPETokenizer.
- Parameters:
spm_file (str) – Either a sentencepiece model file or a vocab file extracted from a sentencepiece model.
- Returns:
The trie and the corresponding weights from the SPM model.
- Return type:
tuple[mlx.data.core.CharTrie, list[float]]
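To illustrate what "shortest possible tokenization with the highest possible log likelihood" means in the note above, here is a pure-Python sketch. It is not the library's implementation: the function `shortest_best_tokenization` and the toy vocabulary are made up for illustration, and a plain dictionary stands in for the trie and its weights.

```python
def shortest_best_tokenization(text, vocab_scores):
    """Split `text` into the fewest tokens, breaking ties by the highest
    total score. `vocab_scores` maps each token to its log likelihood."""
    n = len(text)
    max_len = max(map(len, vocab_scores), default=0)
    # best[i] holds (token count, negated total score, tokens) for text[:i],
    # or None if text[:i] cannot be segmented with this vocabulary.
    best = [None] * (n + 1)
    best[0] = (0, 0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev = best[start]
            piece = text[start:end]
            if prev is None or piece not in vocab_scores:
                continue
            cand = (prev[0] + 1, prev[1] - vocab_scores[piece], prev[2] + [piece])
            # Fewer tokens wins; on a tie, the higher total score wins.
            if best[end] is None or cand[:2] < best[end][:2]:
                best[end] = cand
    if best[n] is None:
        raise ValueError("text cannot be tokenized with this vocabulary")
    return best[n][2]


# Toy vocabulary with made-up log likelihoods.
toy_vocab = {"a": -1.0, "b": -1.0, "c": -1.0, "ab": -3.0, "bc": -0.5}
print(shortest_best_tokenization("abc", toy_vocab))  # ['a', 'bc']
```

Both `["ab", "c"]` and `["a", "bc"]` use two tokens, so the total score decides between them. A BPE tokenizer applies merges in a fixed order instead and may therefore pick a different split.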