mlx.core.quantize
- quantize(w: array, /, group_size: int = 64, bits: int = 4, *, stream: None | Stream | Device = None) -> tuple[array, array, array]
Quantize the matrix w using bits bits per element.

Note, each row of w is divided into groups of group_size consecutive elements, and the elements of each group are quantized together. Hence, the number of columns of w should be divisible by group_size.

Warning
quantize currently only supports 2D inputs with dimensions which are multiples of 32.

Formally, for a group of \(g\) consecutive elements \(w_1\) to \(w_g\) in a row of w, we compute the quantized representation \(\hat{w_i}\) of each element as follows, where \(g\) is group_size and \(b\) is bits:

\[
\begin{aligned}
\alpha &= \max_i w_i \\
\beta &= \min_i w_i \\
s &= \frac{\alpha - \beta}{2^b - 1} \\
\hat{w_i} &= \textrm{round}\left( \frac{w_i - \beta}{s}\right).
\end{aligned}
\]

After the above computation, \(\hat{w_i}\) fits in \(b\) bits and is packed in an unsigned 32-bit integer from the lower to upper bits. For instance, for 4-bit quantization we fit 8 elements in an unsigned 32-bit integer, where the 1st element occupies the 4 least significant bits (bits 0-3), the 2nd occupies bits 4-7, and so on.
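The per-group formula and the bit packing can be sketched in plain Python. This is only an illustration of the arithmetic described above, not the library's implementation: the helper names quantize_group and pack_codes are hypothetical, Python's built-in round (which rounds ties to even) stands in for \(\textrm{round}\), the handling of a constant group is an assumption, and bits is assumed to divide 32.

```python
def quantize_group(group, bits=4):
    """Quantize one group of floats; returns (codes, scale, bias)."""
    alpha = max(group)  # largest value in the group
    beta = min(group)   # smallest value in the group, returned as the bias
    scale = (alpha - beta) / (2**bits - 1)
    if scale == 0.0:
        # Assumption: a constant group maps every element to code 0.
        return [0] * len(group), scale, beta
    # Each code is an integer in [0, 2**bits - 1].
    codes = [round((w - beta) / scale) for w in group]
    return codes, scale, beta


def pack_codes(codes, bits=4):
    """Pack b-bit codes into unsigned 32-bit words, lowest bits first."""
    per_word = 32 // bits  # e.g. 8 codes per word for 4-bit quantization
    words = []
    for start in range(0, len(codes), per_word):
        word = 0
        for k, code in enumerate(codes[start:start + per_word]):
            word |= code << (k * bits)  # k-th code occupies bits k*b .. k*b + b - 1
        words.append(word)
    return words


# Hypothetical group of 8 values, quantized with 4 bits per element.
group = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.5]
codes, scale, bias = quantize_group(group, bits=4)
words = pack_codes(codes, bits=4)
print(codes, scale, bias)         # [0, 2, 4, 6, 8, 10, 12, 15] 0.5 0.0
print([hex(w) for w in words])    # ['0xfca86420']
```

Reading the packed word from its least significant hex digit upward gives back the codes in order, which matches the "lower to upper bits" layout.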
In order to be able to dequantize the elements of w we also need to save \(s\) and \(\beta\), which are the returned scales and biases respectively.

- Parameters:
  - w (array): The matrix to be quantized.
  - group_size (int, optional): The number of consecutive elements in a row of w that are quantized together and share one scale and bias. Default: 64.
  - bits (int, optional): The number of bits used for each quantized element of w. Default: 4.
  - stream (None | Stream | Device, optional): The stream or device on which to run the operation. Default: None.
- Returns:
  A tuple containing:
  - w_q (array): The quantized version of w
  - scales (array): The scale to multiply each element with, namely \(s\)
  - biases (array): The biases to add to each element, namely \(\beta\)
- Return type: tuple[array, array, array]
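To make the role of the returned values concrete, here is a minimal pure-Python sketch of how one group could be dequantized from a packed word, using the relation \(w_i \approx s \hat{w_i} + \beta\) implied by the descriptions of scales and biases. The helper names unpack_codes, dequantize_group and quantized_shapes are hypothetical, and the output shapes computed by quantized_shapes are inferred from the packing and grouping rules described on this page rather than stated by it.

```python
def unpack_codes(words, bits, count):
    """Extract `count` b-bit codes from 32-bit words, lowest bits first."""
    mask = (1 << bits) - 1
    per_word = 32 // bits
    return [
        (words[i // per_word] >> ((i % per_word) * bits)) & mask
        for i in range(count)
    ]


def dequantize_group(codes, scale, bias):
    """Approximate the original values: multiply by the scale, add the bias."""
    return [scale * code + bias for code in codes]


def quantized_shapes(rows, cols, group_size=64, bits=4):
    """Inferred shapes of (w_q, scales/biases) for a rows x cols input."""
    if cols % group_size != 0:
        raise ValueError("number of columns must be divisible by group_size")
    # 32 // bits elements share one 32-bit word; each group has one scale and bias.
    return (rows, cols * bits // 32), (rows, cols // group_size)


# One packed word holding eight 4-bit codes, with a made-up scale and bias.
words = [0xFCA86420]
codes = unpack_codes(words, bits=4, count=8)
approx = dequantize_group(codes, scale=0.5, bias=0.0)
print(codes)    # [0, 2, 4, 6, 8, 10, 12, 15]
print(approx)   # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.5]
print(quantized_shapes(512, 512))  # ((512, 64), (512, 8))
```

Because each code is the nearest integer to \((w_i - \beta)/s\), the reconstructed value differs from the original by at most \(s/2\) within a group.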