mlx.core.quantize


quantize(w: array, /, group_size: int = 64, bits: int = 4, mode: str = 'affine', *, stream: None | Stream | Device = None) → tuple[array, array, array]

Quantize the matrix w using bits bits per element.

Note that every group_size consecutive elements in a row of w are quantized together; hence, the number of columns of w must be divisible by group_size. In other words, each row of w is divided into groups of group_size elements, and each group is quantized independently.

Warning

quantize currently only supports 2D inputs whose second dimension is divisible by group_size.

The supported quantization modes are "affine" and "mxfp4". They are described in more detail below.

Parameters:
  • w (array) – Matrix to be quantized

  • group_size (int, optional) – The size of the group in w that shares a scale and bias. Default: 64.

  • bits (int, optional) – The number of bits occupied by each element of w in the returned quantized matrix. Default: 4.

  • mode (str, optional) – The quantization mode. Default: "affine".

Returns:

A tuple with either two or three elements containing:

  • w_q (array): The quantized version of w

  • scales (array): The quantization scales

  • biases (array): The quantization biases (only returned when mode == "affine").

Return type:

tuple

Notes

The affine mode quantizes groups of \(g\) consecutive elements in a row of w. For each group the quantized representation of each element \(\hat{w_i}\) is computed as follows:

\[\begin{split}\begin{aligned} \alpha &= \max_i w_i \\ \beta &= \min_i w_i \\ s &= \frac{\alpha - \beta}{2^b - 1} \\ \hat{w_i} &= \textrm{round}\left( \frac{w_i - \beta}{s}\right). \end{aligned}\end{split}\]
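As a concrete illustration, the per-group computation above can be sketched in plain Python. This is a reference sketch of the formula, not the library's implementation (which operates on arrays and packs the results):

```python
def affine_quantize_group(group, bits=4):
    """Quantize one group of floats to `bits`-bit integers (affine-mode sketch)."""
    alpha = max(group)                      # alpha = max_i w_i
    beta = min(group)                       # beta  = min_i w_i
    s = (alpha - beta) / (2**bits - 1)      # scale s = (alpha - beta) / (2^b - 1)
    if s == 0.0:                            # guard: all elements in the group equal
        s = 1.0
    w_hat = [round((w - beta) / s) for w in group]
    return w_hat, s, beta

# Each quantized value lands in [0, 2^bits - 1]:
w_hat, s, beta = affine_quantize_group([0.1, -0.2, 0.3, 0.05])
```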

After the above computation, \(\hat{w_i}\) fits in \(b\) bits and is packed into an unsigned 32-bit integer from the lower to the upper bits. For instance, for 4-bit quantization we fit 8 elements in an unsigned 32-bit integer, where the 1st element occupies the 4 least significant bits, the 2nd occupies bits 4–7, and so on.
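The packing scheme just described can be sketched with plain integer arithmetic (an illustrative sketch of the layout, not the library's packing code):

```python
def pack_4bit(values):
    """Pack 8 unsigned 4-bit values into one 32-bit integer, lowest bits first."""
    assert len(values) == 8 and all(0 <= v < 16 for v in values)
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (4 * i)   # element i occupies bits 4*i .. 4*i + 3
    return packed

def unpack_4bit(packed):
    """Recover the 8 values, least-significant nibble first."""
    return [(packed >> (4 * i)) & 0xF for i in range(8)]
```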

To dequantize the elements of w, we also save \(s\) and \(\beta\), which are returned as the scales and biases respectively.
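Dequantization then inverts the affine mapping, reconstructing each element as \(w_i \approx s \hat{w_i} + \beta\). A minimal sketch:

```python
def affine_dequantize_group(w_hat, s, beta):
    """Approximately reconstruct a group's values from quantized integers.

    The reconstruction error per element is at most s / 2 (half a quantization
    step), since each w_hat was produced by rounding.
    """
    return [s * q + beta for q in w_hat]
```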

The mxfp4 mode similarly quantizes groups of \(g\) elements of w. For mxfp4 the group size must be 32. The elements are quantized to 4-bit precision floating-point values (E2M1) with a shared 8-bit scale per group. Unlike affine quantization, mxfp4 does not have a bias value. More details on the format can be found in the OCP Microscaling Formats (MX) specification.
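To make the E2M1 element format concrete, the sketch below rounds a value to its nearest E2M1-representable magnitude. The value table follows from the format's 1 sign bit, 2 exponent bits, and 1 mantissa bit (a round-to-nearest sketch under that assumption; the shared per-group scale and the library's actual rounding behavior are omitted):

```python
# Non-negative values representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_ALL = sorted({sign * v for v in E2M1_MAGNITUDES for sign in (1.0, -1.0)})

def quantize_to_e2m1(x):
    """Round x to the nearest representable E2M1 value (illustrative sketch).

    Values beyond the largest magnitude (6.0) simply clamp to it; in the real
    format the shared 8-bit group scale keeps elements in range.
    """
    return min(E2M1_ALL, key=lambda v: abs(v - x))
```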