class Adafactor(learning_rate: float | Callable[[array], array] | None = None, eps: Tuple[float, float] = (1e-30, 0.001), clip_threshold: float = 1.0, decay_rate: float = -0.8, beta_1: float | None = None, weight_decay: float = 0.0, scale_parameter: bool = True, relative_step: bool = True, warmup_init: bool = False)

The Adafactor optimizer.

Our Adafactor implementation follows the original paper, "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost".

  • learning_rate (float or callable, optional) – The learning rate. Default: None.

  • eps (tuple(float, float), optional) – A pair \((\epsilon_1, \epsilon_2)\). The first term \(\epsilon_1\) is added to the square of the gradients to improve numerical stability; the second term \(\epsilon_2\) is used for parameter scaling if scale_parameter is set to True. Default: (1e-30, 1e-3).

  • clip_threshold (float, optional) – Clips the unscaled update at clip_threshold. Default: 1.0.

  • decay_rate (float, optional) – Exponent \(c\) that sets the coefficient for the running average of the squared gradient, which at step \(t\) is \(1 - t^{c}\). Default: -0.8.

  • beta_1 (float, optional) – If set to a value greater than zero, the first moment of the updates is tracked and used. Default: None.

  • weight_decay (float, optional) – The weight decay \(\lambda\). Default: 0.0.

  • scale_parameter (bool, optional) – If set to True, the learning rate will be scaled by \(\max(\epsilon_2, \text{RMS}(w_{t-1}))\). Default: True.

  • relative_step (bool, optional) – If set to True, learning_rate is ignored and a relative step size is computed instead. Default: True.

  • warmup_init (bool, optional) – If set to True, the relative step size is computed from the current step count, so that it warms up over the first steps. Default: False.
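
To make the interaction between learning_rate, eps, scale_parameter, relative_step and warmup_init concrete, here is a minimal pure-Python sketch of how an effective step size could be derived. The helper names (relative_step_size, rms, effective_learning_rate) are hypothetical and not part of the library. The schedule \(\min(10^{-2}, 1/\sqrt{t})\) is the one proposed in the paper; the warmup cap of 1e-6 * step is an assumption borrowed from common implementations and may differ from what this class does internally.

```python
import math


def relative_step_size(step, warmup_init=False):
    # Paper schedule: min(1e-2, 1 / sqrt(step)).
    # ASSUMPTION: with warmup_init the cap grows linearly as 1e-6 * step.
    cap = 1e-6 * step if warmup_init else 1e-2
    return min(cap, 1.0 / math.sqrt(step))


def rms(values):
    # Root mean square of a flat list of parameter values.
    return math.sqrt(sum(v * v for v in values) / len(values))


def effective_learning_rate(
    step,
    weights,
    learning_rate=None,
    eps=(1e-30, 1e-3),
    scale_parameter=True,
    relative_step=True,
    warmup_init=False,
):
    # relative_step=True ignores learning_rate entirely.
    if relative_step:
        base = relative_step_size(step, warmup_init)
    else:
        base = learning_rate
    # Parameter scaling: floor the weight RMS at the second eps term.
    scale = max(eps[1], rms(weights)) if scale_parameter else 1.0
    return scale * base


# With all-zero weights the eps floor keeps the step from collapsing to zero.
print(effective_learning_rate(1, [0.0, 0.0]))
```

Under these assumptions, a freshly zero-initialized parameter still receives a small non-zero step, which is the purpose of the second eps term.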


__init__([learning_rate, eps, ...])

apply_single(gradient, parameter, state)

Performs the Adafactor parameter and state update.

init_single(parameter, state)

Initializes the optimizer state.
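
As an illustration of what the parameter and state update involves, the following is a simplified pure-Python sketch of one factored Adafactor step for a 2-D parameter, written from the paper's algorithm rather than from this class's source. The function adafactor_step and its arguments row_acc and col_acc are hypothetical names; the sketch omits beta_1 and weight_decay, takes the step size lr as given, and uses row and column sums as in the paper, whereas an actual implementation may use means or other numerically equivalent forms.

```python
import math


def adafactor_step(weights, grads, row_acc, col_acc, step, lr,
                   eps1=1e-30, clip_threshold=1.0, decay_rate=-0.8):
    """One factored update for an n x m parameter stored as nested lists."""
    n, m = len(grads), len(grads[0])
    # Decay coefficient of the running averages; equals 0 at step 1.
    beta2 = 1.0 - step ** decay_rate

    # Squared gradients with eps1 added for numerical stability.
    sq = [[g * g + eps1 for g in row] for row in grads]

    # Factored second-moment state: one accumulator per row and per column.
    row_acc = [beta2 * r + (1.0 - beta2) * sum(sq[i])
               for i, r in enumerate(row_acc)]
    col_acc = [beta2 * c + (1.0 - beta2) * sum(sq[i][j] for i in range(n))
               for j, c in enumerate(col_acc)]

    # Rank-1 reconstruction of the second moment: V[i][j] = R[i] * C[j] / sum(R).
    total = sum(row_acc)
    update = [[grads[i][j] / math.sqrt(row_acc[i] * col_acc[j] / total)
               for j in range(m)] for i in range(n)]

    # Clip the unscaled update so its RMS does not exceed clip_threshold.
    rms_u = math.sqrt(sum(u * u for row in update for u in row) / (n * m))
    denom = max(1.0, rms_u / clip_threshold)

    new_weights = [[weights[i][j] - lr * update[i][j] / denom
                    for j in range(m)] for i in range(n)]
    return new_weights, row_acc, col_acc


w = [[0.0, 0.0], [0.0, 0.0]]
g = [[1.0, 2.0], [2.0, 4.0]]
w, r, c = adafactor_step(w, g, [0.0, 0.0], [0.0, 0.0], step=1, lr=0.1)
print(w, r, c)
```

The state kept per matrix is only n + m numbers instead of n * m, which is where the sublinear memory cost in the paper's title comes from.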