mlx.core.quantize
- quantize(w: array, /, group_size: int = 64, bits: int = 4, mode: str = 'affine', *, stream: None | Stream | Device = None) → tuple[array, array, array]
Quantize the matrix w using bits bits per element.
Note, every group_size elements in a row of w are quantized together. Hence, the number of columns of w should be divisible by group_size. In particular, the rows of w are divided into groups of size group_size which are quantized together.
Warning
quantize currently only supports 2D inputs with the second dimension divisible by group_size.
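The divisibility constraint can be checked before calling quantize. The following is a minimal sketch in plain Python, not part of MLX: the helper name quantized_shapes is hypothetical, and the shapes it reports assume that the quantized elements are packed into 32-bit words and that each group of group_size elements in a row has exactly one scale.

```python
def quantized_shapes(rows, cols, group_size=64, bits=4):
    """Return (packed_shape, scales_shape) for a (rows, cols) matrix.

    Hypothetical helper for illustration. Assumes bits-wide elements are
    packed into 32-bit words and one scale is stored per group.
    """
    if cols % group_size != 0:
        raise ValueError(
            f"columns ({cols}) must be divisible by group_size ({group_size})"
        )
    packed_cols = cols * bits // 32      # 32-bit words per row
    groups_per_row = cols // group_size  # one scale (and bias) per group
    return (rows, packed_cols), (rows, groups_per_row)


print(quantized_shapes(512, 1024))  # ((512, 128), (512, 16))
```

With the defaults, a 512 x 1024 matrix has 16 groups per row, and each row of 1024 4-bit elements fits in 128 32-bit words.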
The supported quantization modes are "affine" and "mxfp4". They are described in more detail below.
- Parameters:
w (array) – Matrix to be quantized.
group_size (int, optional) – The size of the group in w that shares a scale and bias. Default: 64.
bits (int, optional) – The number of bits occupied by each element of w in the returned quantized matrix. Default: 4.
mode (str, optional) – The quantization mode. Default: "affine".
- Returns:
A tuple with either two or three elements containing:
w_q (array): The quantized version of w
scales (array): The quantization scales
biases (array): The quantization biases (returned for mode=="affine").
- Return type:
tuple
Notes
The affine mode quantizes groups of \(g\) consecutive elements in a row of w, where \(g\) is group_size and \(b\) below is bits. For each group the quantized representation of each element \(\hat{w_i}\) is computed as follows:
\[\begin{aligned} \alpha &= \max_i w_i \\ \beta &= \min_i w_i \\ s &= \frac{\alpha - \beta}{2^b - 1} \\ \hat{w_i} &= \textrm{round}\left( \frac{w_i - \beta}{s}\right). \end{aligned}\]
After the above computation, \(\hat{w_i}\) fits in \(b\) bits and is packed in an unsigned 32-bit integer from the lower to upper bits. For instance, for 4-bit quantization we fit 8 elements in an unsigned 32-bit integer where the 1st element occupies the 4 least significant bits, the 2nd bits 4-7 etc.
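The formulas and the packing rule above can be sketched in plain Python. This is an illustrative reference, not MLX's implementation: the function names are hypothetical, the handling of a constant group (zero range) is an assumption, Python's round resolves ties to the nearest even integer (which may differ from the library's rounding), and the packing helper only covers bit widths that divide 32 evenly.

```python
def quantize_group(values, bits=4):
    """Affine-quantize one group; returns (codes, scale, bias)."""
    alpha, beta = max(values), min(values)
    levels = (1 << bits) - 1               # 2^b - 1
    scale = (alpha - beta) / levels
    if scale == 0.0:                       # constant group (assumption: avoid 0-division)
        scale = 1.0
    codes = [min(levels, max(0, round((v - beta) / scale))) for v in values]
    return codes, scale, beta


def pack_codes(codes, bits=4):
    """Pack codes into 32-bit words, first element in the lowest bits."""
    per_word = 32 // bits                  # assumes bits divides 32
    words = []
    for start in range(0, len(codes), per_word):
        word = 0
        for k, code in enumerate(codes[start:start + per_word]):
            word |= code << (k * bits)
        words.append(word)
    return words


def unpack_codes(words, bits=4):
    """Inverse of pack_codes."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    return [(w >> (k * bits)) & mask for w in words for k in range(per_word)]


# Eight 4-bit codes fill one word; the 1st code lands in the 4 lowest bits.
print(hex(pack_codes([1, 2, 3, 4, 5, 6, 7, 8])[0]))  # 0x87654321

group = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.25, -0.25, 0.75]
codes, scale, bias = quantize_group(group)
restored = [scale * c + bias for c in unpack_codes(pack_codes(codes))]
print(max(abs(a - b) for a, b in zip(group, restored)) <= scale / 2 + 1e-9)  # True
```

The last line reconstructs each element as scale * code + bias and checks that the error is at most half a quantization step.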
To dequantize the elements of w, we also save \(s\) and \(\beta\) which are the returned scales and biases respectively. Each element is then recovered approximately as \(s \hat{w_i} + \beta\).
The mxfp4 mode similarly quantizes groups of \(g\) elements of w. For mxfp4 the group size must be 32. The elements are quantized to 4-bit precision floating-point values (E2M1) with a shared 8-bit scale per group. Unlike affine quantization, mxfp4 does not have a bias value. More details on the format can be found in the specification.
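To make the mxfp4 description concrete, here is a rough sketch of quantizing one group to E2M1 values with a shared scale. It rests on several assumptions that the text above does not confirm for MLX: the shared scale is taken to be a power of two derived from the group's largest magnitude, values are rounded to the nearest representable E2M1 magnitude with ties going to the smaller one, and out-of-range values saturate at 6. The function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_mxfp4_group(values):
    """Return (exponent, elements) with values[i] ~= 2**exponent * elements[i].

    Assumption: the shared scale is a power of two chosen so that the
    largest magnitude in the group maps into the top of the E2M1 range.
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    exponent = math.floor(math.log2(amax)) - 2   # 2 = largest E2M1 exponent
    scale = 2.0 ** exponent
    elements = []
    for v in values:
        # Nearest representable magnitude; values beyond 6 saturate at 6.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(v) / scale - m))
        elements.append(math.copysign(mag, v))
    return exponent, elements


def dequantize_mxfp4_group(exponent, elements):
    """Reconstruct values from the shared exponent; there is no bias term."""
    scale = 2.0 ** exponent
    return [scale * e for e in elements]


group = [4.0, -12.0, 24.0, 2.0] * 8              # one group of 32 elements
exponent, elements = quantize_mxfp4_group(group)
print(exponent, elements[:4])                    # 2 [1.0, -3.0, 6.0, 0.5]
print(dequantize_mxfp4_group(exponent, elements) == group)  # True
```

The example group was chosen so that every element is exactly representable; in general the reconstruction is only approximate, and a real encoding would store the exponent in the shared 8-bit scale.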