mlx.data.core.Tokenizer

class mlx.data.core.Tokenizer

A Tokenizer that can be used to tokenize arbitrary strings.

Parameters:
  • trie (mlx.data.core.CharTrie) – The trie containing the possible tokens.

  • ignore_unk (bool) – Whether unknown tokens should be ignored instead of raising an error. (default: False)

  • trie_key_scores (list[float]) – A list containing one score per trie node. If left empty, each score is assumed to be 1. tokenize_shortest minimizes the sum of these scores over the sequence of tokens.
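
To make the role of trie_key_scores concrete, the following is a minimal pure-Python sketch of score-minimizing tokenization. It does not use mlx.data and is not the library's implementation: the function name tokenize_min_score, the plain list vocabulary, and the dict of per-token scores are all assumptions made for illustration (the real class takes a CharTrie and a list with one score per trie node).

```python
def tokenize_min_score(text, vocab, scores=None):
    """Hypothetical helper (not part of mlx.data).

    Split `text` into tokens from `vocab` so that the sum of the
    per-token scores is as small as possible.
    """
    # Mirror the documented default: a missing score list means every
    # token costs 1, so the result uses the fewest tokens.
    if scores is None:
        scores = {tok: 1.0 for tok in vocab}

    n = len(text)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: minimal score to cover text[:i]
    back = [None] * (n + 1)  # back[i]: last token of that best cover
    best[0] = 0.0

    for end in range(1, n + 1):
        for tok in vocab:
            start = end - len(tok)
            # Skip empty tokens and tokens that do not end at `end`.
            if not tok or start < 0 or best[start] == inf:
                continue
            if text.startswith(tok, start):
                cand = best[start] + scores[tok]
                if cand < best[end]:
                    best[end] = cand
                    back[end] = tok

    if best[n] == inf:
        raise ValueError("input cannot be tokenized with this vocabulary")

    # Walk the back-pointers from the end of the string to the start.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(back[pos])
        pos -= len(back[pos])
    return tokens[::-1]


vocab = ["a", "b", "ab", "c"]
unit = tokenize_min_score("abc", vocab)
weighted = tokenize_min_score("abc", vocab, {"a": 1, "b": 1, "ab": 5, "c": 1})
```

With unit scores the sketch prefers the two-token split; raising the score of "ab" makes the three-token split cheaper.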

Methods

__init__(self, trie[, ignore_unk, ...])

Make a tokenizer object that can be used to tokenize arbitrary strings.

tokenize(self, input)

Return the full graph of possible tokenizations.

tokenize_rand(self, input)

Tokenize the input with a valid tokenization chosen randomly from the set of valid tokenizations.

tokenize_shortest(self, input)

Tokenize the input such that the sum of trie_key_scores is minimized.
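
The difference between returning every possible tokenization and picking one at random can be illustrated with a small pure-Python sketch. It is independent of mlx.data: the names all_tokenizations and tokenize_random, the list-based vocabulary, and the uniform choice are assumptions for illustration only, and the sampling distribution of the real tokenize_rand is not specified here.

```python
import random


def all_tokenizations(text, vocab):
    """Hypothetical helper: list every way to split `text` into tokens
    from `vocab`. Exponential in the worst case; illustration only."""
    if not text:
        return [[]]  # exactly one way to tokenize the empty string
    results = []
    for tok in vocab:
        if tok and text.startswith(tok):
            for rest in all_tokenizations(text[len(tok):], vocab):
                results.append([tok] + rest)
    return results


def tokenize_random(text, vocab, rng=None):
    """Hypothetical helper: pick one valid tokenization uniformly
    (uniformity is an assumption of this sketch)."""
    rng = rng or random.Random()
    options = all_tokenizations(text, vocab)
    if not options:
        raise ValueError("input cannot be tokenized with this vocabulary")
    return rng.choice(options)


vocab = ["a", "b", "ab", "c"]
options = all_tokenizations("abc", vocab)
sample = tokenize_random("abc", vocab, random.Random(0))
```

Here options holds both valid splits of "abc", and sample is one of them.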