mlx.data.Buffer.tokenize

Buffer.tokenize(self: mlx.data._c.Buffer, key: str, trie: mlx.data.core.CharTrie, mode: mlx.data._c.TokenizeMode = TokenizeMode.Shortest, ignore_unk: bool = False, trie_key_scores: List[float] = [], output_key: str = '') → mlx.data._c.Buffer

Tokenize the contents of the array at key.

This operation uses an mlx.data.core.CharTrie to tokenize the contents of the array. The tokenizer computes a graph of trie nodes that matches the content of the array at key. It then either samples a path through the graph (if mode is mlx.data.core.TokenizeMode.rand) or finds the shortest weighted path, using trie_key_scores as the weights.

If trie_key_scores is not provided, every trie node gets the same weight of 1 and the result is the smallest number of tokens that can represent the content.

Parameters:
  • key (str) – The sample key that contains the array we are operating on.

  • trie (mlx.data.core.CharTrie) – The trie to use for the tokenization.

  • mode (mlx.data.core.TokenizeMode) – The tokenizer mode to use: shortest or random, as described above. (default: mlx.data.core.TokenizeMode.shortest)

  • ignore_unk (bool) – If True then ignore content that cannot be represented. Otherwise throw an exception. (default: False)

  • trie_key_scores (list of float) – The weights of each node in the trie. (default: [], which means every node gets a weight of 1)

  • output_key (str) – If it is not empty then write the result to this key instead of overwriting key. (default: ‘’)
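
Example: a minimal sketch of tokenizing a buffer with a small character trie. The vocabulary, the use of CharTrie.insert with plain strings, and the automatic conversion of sample strings to char arrays by buffer_from_vector are assumptions made here for illustration; in practice the trie would typically be built from a real tokenizer vocabulary.

    from mlx.data import buffer_from_vector
    from mlx.data.core import CharTrie

    # Hypothetical vocabulary for illustration only.
    trie = CharTrie()
    for token in ["hello", "he", "llo", " ", "world", "wor", "ld"]:
        trie.insert(token)  # assumes insert accepts a string of chars

    # Sample strings are stored as char arrays, which tokenize operates on.
    dset = buffer_from_vector([{"text": "hello world"}])

    # Shortest mode (the default) picks the fewest tokens: "hello", " ", "world".
    # Writing to output_key keeps the original text under "text".
    dset = dset.tokenize("text", trie, output_key="tokens")

    print(dset[0]["tokens"])  # token ids from the trie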