quantized_matmul(x: array, w: array, /, scales: array, biases: array, transpose: bool = True, group_size: int = 64, bits: int = 4, *, stream: None | Stream | Device = None) → array

Perform a matrix multiplication with the quantized matrix w. The quantization uses one floating-point scale and one bias per group of group_size elements. Each element of w occupies bits bits, and the elements are packed into unsigned 32-bit integers.
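The packing and per-group scale and bias can be illustrated with a small pure-Python sketch. It is not MLX code. It assumes an affine scheme in which each stored integer q stands for scale * q + bias, and a packing order in which the first element occupies the least significant bits. Neither detail is stated in this section. The helper names pack, unpack, and dequantize_group are hypothetical, and the group of 8 elements is a toy size chosen for brevity.

```python
# Illustration of the arithmetic only; MLX's kernels are not involved.
BITS = 4
PER_WORD = 32 // BITS  # eight 4-bit values fit in one 32-bit word


def pack(values, bits=BITS):
    """Pack small unsigned integers into one word, first value in the lowest bits (assumed order)."""
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << bits), "value does not fit in the given bit width"
        word |= v << (i * bits)
    return word


def unpack(word, bits=BITS):
    """Recover the packed integers from one 32-bit word."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(32 // bits)]


def dequantize_group(words, scale, bias, bits=BITS):
    """Map every packed integer q of one group to scale * q + bias (assumed affine scheme)."""
    return [scale * q + bias for word in words for q in unpack(word, bits)]


codes = [0, 1, 2, 3, 4, 5, 6, 15]       # one toy group of 4-bit codes
word = pack(codes)                       # a single 32-bit word holds the whole group
row = dequantize_group([word], scale=0.5, bias=-1.0)

# One output element of x @ w.T is the dot product of x with a dequantized row.
x = [1.0] * PER_WORD
dot = sum(a * b for a, b in zip(x, row))
print(hex(word), row, dot)
```

A fused kernel does not need to materialize row in memory, but the value it produces for each output element corresponds to this dot product under the stated assumptions.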

Parameters:

  • x (array) – Input array

  • w (array) – Quantized matrix packed in unsigned integers

  • scales (array) – The scales to use per group_size elements of w

  • biases (array) – The biases to use per group_size elements of w

  • transpose (bool, optional) – Whether to multiply by the transpose of w, that is, whether to compute x @ w.T (True) or x @ w (False). (default: True)

  • group_size (int, optional) – The size of the group in w that shares a scale and bias. (default: 64)

  • bits (int, optional) – The number of bits occupied by each element in w. (default: 4)


Returns:

The result of the multiplication of x with w.

Return type:

result (array)