mlx.optimizers.Adafactor

class Adafactor(learning_rate: float | Callable[[array], array] | None = None, eps: Tuple[float, float] = (1e-30, 0.001), clip_threshold: float = 1.0, decay_rate: float = -0.8, beta_1: float | None = None, weight_decay: float = 0.0, scale_parameter: bool = True, relative_step: bool = True, warmup_init: bool = False)
The Adafactor optimizer.
Our Adafactor implementation follows the original paper: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
- Parameters:
  - learning_rate (float or callable, optional) – The learning rate. Default: None.
  - eps (tuple(float, float), optional) – The first term \(\epsilon_1\) is added to the square of the gradients to improve numerical stability, and the second term \(\epsilon_2\) is used for parameter scaling if scale_parameter is set to True. Default: (1e-30, 1e-3).
  - clip_threshold (float, optional) – Clips the unscaled update at clip_threshold. Default: 1.0.
  - decay_rate (float, optional) – Coefficient for the running average of the squared gradient. Default: -0.8.
  - beta_1 (float, optional) – If set to a value greater than zero, the first moment will be used. Default: None.
  - weight_decay (float, optional) – The weight decay \(\lambda\). Default: 0.0.
  - scale_parameter (bool, optional) – If set to True, the learning rate will be scaled by \(\max(\epsilon_1, \text{RMS}(w_{t-1}))\). Default: True.
  - relative_step (bool, optional) – If set to True, the learning_rate will be ignored and a relative step size will be computed instead. Default: True.
  - warmup_init (bool, optional) – If set to True, the relative step size will be calculated using the current step. Default: False.
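A minimal usage sketch (not from the MLX documentation), assuming the standard MLX training pattern of mlx.nn.value_and_grad followed by Optimizer.update; the model, data, and shapes below are illustrative only. With the default relative_step=True, Adafactor computes its own step size, so no learning_rate needs to be supplied.

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

# Illustrative model and data (hypothetical shapes).
model = nn.Linear(10, 1)
x = mx.random.normal((32, 10))
y = mx.random.normal((32, 1))

# With relative_step=True (the default), Adafactor ignores learning_rate
# and derives its own step size.
optimizer = optim.Adafactor()

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y)

loss_and_grad_fn = nn.value_and_grad(model, loss_fn)

loss, grads = loss_and_grad_fn(model, x, y)
optimizer.update(model, grads)                 # apply the Adafactor update
mx.eval(model.parameters(), optimizer.state)   # force evaluation of the update
```

To use an externally supplied schedule instead, pass an explicit learning_rate and set relative_step=False, per the parameter descriptions above.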
Methods

- __init__([learning_rate, eps, ...])
- apply_single(gradient, parameter, state) – Performs the Adafactor parameter and state update.
- init_single(parameter, state) – Initialize the optimizer state.