mlx.core.quantized_matmul#

quantized_matmul(x: array, w: array, /, scales: array, biases: array, transpose: bool = True, group_size: int = 64, bits: int = 4, *, stream: None | Stream | Device = None) → array#

Perform the matrix multiplication with the quantized matrix w. The quantization uses one floating point scale and bias per group_size of elements. Each element in w takes bits bits and is packed in an unsigned 32 bit integer.

Parameters:

x (array) – Input array
w (array) – Quantized matrix packed in unsigned integers
scales (array) – The scales to use per group_size elements of w
biases (array) – The biases to use per group_size elements of w
transpose (bool, optional) – Defines whether to multiply with the transposed w or not, namely whether we are performing x @ w.T or x @ w. Default: True.
group_size (int, optional) – The size of the group in w that shares a scale and bias. Default: 64.
bits (int, optional) – The number of bits occupied by each element in w. Default: 4.

Returns:

The result of the multiplication of x with w.

Return type:

array

mlx.core.quantized_matmul

Contents

mlx.core.quantized_matmul#