class Lion(learning_rate: float | Callable[[array], array], betas: List[float] = [0.9, 0.99], weight_decay: float = 0.0)#

The Lion optimizer [1].

Since updates are computed through the sign operation, they tend to have larger norm than for other optimizers such as SGD and Adam. We recommend a learning rate that is 3-10x smaller than AdamW and a weight decay 3-10x larger than AdamW to maintain the strength (lr * wd). Our Lion implementation follows the original paper. In detail,

[1]: Chen, X. Symbolic Discovery of Optimization Algorithms. arXiv preprint arXiv:2302.06675.

\[\begin{split}c_{t + 1} &= \beta_1 m_t + (1 - \beta_1) g_t \\ m_{t + 1} &= \beta_2 m_t + (1 - \beta_2) g_t \\ w_{t + 1} &= w_t - \eta (\text{sign}(c_t) + \lambda w_t)\end{split}\]
  • learning_rate (float or callable) – The learning rate \(\eta\).

  • betas (Tuple[float, float], optional) – The coefficients \((\beta_1, \beta_2)\) used for computing the gradient momentum and update direction. Default: (0.9, 0.99)

  • weight_decay (float, optional) – The weight decay \(\lambda\). Default: 0.0


__init__(learning_rate[, betas, weight_decay])

apply_single(gradient, parameter, state)

Performs the Lion parameter update and stores \(m\) in the optimizer state.

init_single(parameter, state)

Initialize optimizer state