mlx.core.quantize#
- quantize(w: array, /, group_size: int = 64, bits: int = 4, mode: str = 'affine', *, stream: None | Stream | Device = None) → tuple[array, array, array]#
Quantize the matrix w using bits bits per element.

Note, every group_size elements in a row of w are quantized together. Hence, the number of columns of w should be divisible by group_size. In particular, the rows of w are divided into groups of size group_size which are quantized together.

Warning

quantize currently only supports 2D inputs with the second dimension divisible by group_size.

The supported quantization modes are "affine" and "mxfp4". They are described in more detail below.

- Parameters:
w (array) – Matrix to be quantized
group_size (int, optional) – The size of the group in w that shares a scale and bias. Default: 64.
bits (int, optional) – The number of bits occupied by each element of w in the returned quantized matrix. Default: 4.
mode (str, optional) – The quantization mode. Default: "affine".
- Returns:
A tuple with either two or three elements containing:
w_q (array): The quantized version of w
scales (array): The quantization scales
biases (array): The quantization biases (returned for mode=="affine")
- Return type:
tuple[array, array, array]
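For example, a minimal usage sketch (assuming mlx.core is imported as mx; the matrix shape is illustrative, and dequantize is the companion routine for reconstructing an approximation of w):

```python
import mlx.core as mx

# A 2D matrix whose second dimension is divisible by group_size.
w = mx.random.normal((128, 256))

# 4-bit affine quantization with groups of 64 elements per row.
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4, mode="affine")

# Reconstruct an approximation of w from the quantized representation.
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)
```

The quantized arrays can also be used with quantized_matmul to multiply against the packed weights without dequantizing them first.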
Notes
The affine mode quantizes groups of \(g\) consecutive elements in a row of w. For each group the quantized representation of each element \(\hat{w_i}\) is computed as follows:

\[\begin{split}\begin{aligned} \alpha &= \max_i w_i \\ \beta &= \min_i w_i \\ s &= \frac{\alpha - \beta}{2^b - 1} \\ \hat{w_i} &= \textrm{round}\left( \frac{w_i - \beta}{s}\right). \end{aligned}\end{split}\]

After the above computation, \(\hat{w_i}\) fits in \(b\) bits and is packed in an unsigned 32-bit integer from the lower to upper bits. For instance, for 4-bit quantization we fit 8 elements in an unsigned 32-bit integer, where the 1st element occupies the 4 least significant bits, the 2nd bits 4-7, etc.
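To make the formulas concrete, here is a NumPy sketch of the per-group affine computation and the 32-bit packing (illustrative only, not the MLX implementation; the helper name affine_quantize_group is hypothetical):

```python
import numpy as np

def affine_quantize_group(w_group, bits=4):
    # Per-group affine parameters, following the equations above
    # (assumes the group is not constant, so s > 0).
    alpha = w_group.max()
    beta = w_group.min()
    s = (alpha - beta) / (2**bits - 1)

    # Each element is mapped to an integer in [0, 2^bits - 1].
    w_hat = np.round((w_group - beta) / s).astype(np.uint32)

    # Pack the b-bit values into uint32 words, least significant bits first:
    # with bits=4, eight elements fit in one 32-bit word.
    per_word = 32 // bits
    packed = np.zeros(len(w_hat) // per_word, dtype=np.uint32)
    for i, q in enumerate(w_hat):
        packed[i // per_word] |= np.uint32(q) << np.uint32(bits * (i % per_word))

    # Dequantization recovers w_i ≈ s * w_hat_i + beta.
    return packed, s, beta

# Example: quantize one group of 64 random values.
group = np.random.randn(64).astype(np.float32)
packed, s, beta = affine_quantize_group(group)
```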
To dequantize the elements of w, we also save \(s\) and \(\beta\), which are the returned scales and biases respectively.

The mxfp4 mode similarly quantizes groups of \(g\) elements of w. For mxfp4 the group size must be 32. The elements are quantized to 4-bit precision floating-point values (E2M1) with a shared 8-bit scale per group. Unlike affine quantization, mxfp4 does not have a bias value. More details on the format can be found in the specification.
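As an illustrative sketch of the mxfp4 mode, assuming the two-element return described above (no biases array):

```python
import mlx.core as mx

w = mx.random.normal((128, 256))

# mxfp4 requires group_size == 32 and produces no bias array.
w_q, scales = mx.quantize(w, group_size=32, bits=4, mode="mxfp4")
```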