mlx.data.core.Tokenizer.tokenize_rand

Tokenizer.tokenize_rand(self: mlx.data._c.core.Tokenizer, input: str) → List[int]

Tokenize the input with a valid tokenization chosen randomly from the set of valid tokenizations.

For instance, if the set of tokens is {‘a’, ‘aa’, ‘b’}, with ids 0, 1, and 2 respectively, then the string ‘aab’ has 2 valid tokenizations:

  • 0, 0, 2

  • 1, 2

Tokenizer.tokenize_shortest() returns the second one when no trie_key_scores are provided, because it uses the fewest tokens, while Tokenizer.tokenize_rand() samples either of the two.
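The pure-Python sketch below reproduces this example without using mlx.data. It is an illustration of the idea only, not the library's implementation: the helper all_tokenizations and the vocab mapping are hypothetical names, and the uniform sampling done by random.choice is an assumption, since the sampling distribution is not stated here.

```python
import random
from typing import Dict, List


def all_tokenizations(text: str, vocab: Dict[str, int]) -> List[List[int]]:
    """Enumerate every way to split `text` into tokens from `vocab`.

    Hypothetical helper for illustration; assumes no empty-string tokens.
    """
    if not text:
        return [[]]  # exactly one way to tokenize the empty string
    results = []
    for token, token_id in vocab.items():
        if text.startswith(token):
            # Consume the matched token and tokenize the remainder.
            for rest in all_tokenizations(text[len(token):], vocab):
                results.append([token_id] + rest)
    return results


# Token ids follow the example: 'a' -> 0, 'aa' -> 1, 'b' -> 2.
vocab = {"a": 0, "aa": 1, "b": 2}

candidates = all_tokenizations("aab", vocab)  # both valid tokenizations
shortest = min(candidates, key=len)           # analogue of tokenize_shortest()
sampled = random.choice(candidates)           # analogue of tokenize_rand() (uniform: an assumption)

print(candidates)
print(shortest)
print(sampled)
```

Exhaustive enumeration is only practical for short strings and small vocabularies; it is used here to make the set of valid tokenizations explicit.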

Parameters:

input (str) – The input string to be tokenized.