The case for and against fixed step-size: Stochastic approximation algorithms in optimization and machine learning

Caio Kalil Lauand, Ioannis Kontoyiannis, Sean Meyn

Published: 2023/9/6

Abstract

Theory and application of stochastic approximation (SA) have become increasingly relevant due in part to applications in optimization and reinforcement learning. This paper takes a new look at SA with constant step-size $\alpha>0$, defined by the recursion, $$\theta_{n+1} = \theta_{n}+ \alpha f(\theta_n,\Phi_{n+1})$$ in which $\theta_n\in\mathbb{R}^d$ and $\{\Phi_{n}\}$ is a Markov chain. The goal is to approximately solve root finding problem $\bar{f}(\theta^*)=0$, where $\bar{f}(\theta)=\mathbb{E}[f(\theta,\Phi)]$ and $\Phi$ has the steady-state distribution of $\{\Phi_{n}\}$. The following conclusions are obtained under an ergodicity assumption on the Markov chain, compatible assumptions on $f$, and for $\alpha>0$ sufficiently small: $\textbf{1.}$ The pair process $\{(\theta_n,\Phi_n)\}$ is geometrically ergodic in a topological sense. $\textbf{2.}$ For every $1\le p\le 4$, there is a constant $b_p$ such that $\limsup_{n\to\infty}\mathbb{E}[\|\theta_n-\theta^*\|^p]\le b_p \alpha^{p/2}$ for each initial condition. $\textbf{3.}$ The Polyak-Ruppert-style averaged estimates $\theta^{\text{PR}}_n=n^{-1}\sum_{k=1}^{n}\theta_k$ converge to a limit $\theta^{\text{PR}}_\infty$ almost surely and in mean square, which satisfies $\theta^{\text{PR}}_\infty=\theta^*+\alpha \bar{\Upsilon}^*+O(\alpha^2)$ for an identified non-random $\bar{\Upsilon}^*\in\mathbb{R}^d$. Moreover, the covariance is approximately optimal: The limiting covariance matrix of $\theta{\text PR}_n$ is approximately minimal in a matricial sense. The two main take-aways for practitioners are application-dependent. It is argued that, in applications to optimization, constant gain algorithms may be preferable even when the objective has multiple local minima; while a vanishing gain algorithm is preferable in applications to reinforcement learning due to the presence of bias.

Read Full Paper (arXiv.org)