Stochastic Bregman Proximal Gradient Method Revisited: Kernel Conditioning and Painless Variance Reduction

Junyu Zhang

Published: 2024/1/6

Abstract

We investigate stochastic Bregman proximal gradient (SBPG) methods for minimizing a finite-sum nonconvex function $\Psi(x):=\frac{1}{n}\sum_{i=1}^nf_i(x)+\phi(x)$, where $\phi$ is convex and nonsmooth, while each $f_i$, instead of having a globally Lipschitz gradient, satisfies a smooth-adaptability condition w.r.t. some kernel $h$. Standard acceleration techniques for stochastic algorithms (momentum, shuffling, variance reduction) depend on bounding stochastic errors by gradient differences that are in turn controlled via the Lipschitz property. Lacking this property, existing SBPG results are limited to vanilla stochastic approximation, which cannot yield the optimal $O(\sqrt{n})$ complexity dependence on $n$. Moreover, existing works report complexities under various nonstandard stationarity measures that deviate substantially from the standard one, the distance $\mathrm{dist}(0,\partial\Psi(\cdot))$ to the limiting Fréchet subdifferential. Our analysis reveals that these popular stationarity measures are often much smaller than $\mathrm{dist}(0,\partial\Psi(\cdot))$, leading to overstated solution quality and possibly non-stationary outputs. To resolve these issues, we design a new gradient mapping $\mathcal{D}_{\phi,h}^\lambda(\cdot)$ based on BPG residuals in the dual space, together with a new kernel-conditioning (KC) regularity condition, under which the mismatch between $\|\mathcal{D}_{\phi,h}^\lambda(\cdot)\|$ and $\mathrm{dist}(0,\partial\Psi(\cdot))$ is provably $O(1)$ and instance-free. Moreover, KC-regularity guarantees Lipschitz-like bounds on gradient differences, providing general analysis tools for momentum, shuffling, and variance reduction under smooth adaptability. We illustrate this point with variance-reduced SBPG methods and establish an $O(\sqrt{n})$ complexity dependence for $\|\mathcal{D}_{\phi,h}^\lambda(\cdot)\|$, which yields an instance-free (worst-case) complexity under $\mathrm{dist}(0,\partial\Psi(\cdot))$.
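For readers unfamiliar with the Bregman setting, the deterministic BPG step underlying SBPG replaces the Euclidean proximal step with a Bregman distance generated by the kernel $h$ (smooth adaptability means $Lh - f$ is convex for some $L > 0$). The display below is an illustrative sketch of this standard update and of the general shape a dual-space residual mapping can take; the exact definition and normalization of $\mathcal{D}_{\phi,h}^\lambda$ are those given in the paper.

$$D_h(x,y) := h(x) - h(y) - \langle \nabla h(y),\, x - y \rangle, \qquad x^{+} := \operatorname*{arg\,min}_{x}\Big\{ \langle \nabla f(x_k),\, x \rangle + \phi(x) + \tfrac{1}{\lambda} D_h(x, x_k) \Big\},$$

$$\mathcal{D}_{\phi,h}^{\lambda}(x_k) \,\approx\, \tfrac{1}{\lambda}\big( \nabla h(x_k) - \nabla h(x^{+}) \big),$$

which measures the BPG residual in the dual space through $\nabla h$; with the Euclidean kernel $h(x) = \tfrac{1}{2}\|x\|^2$ it reduces to the classical gradient mapping $\tfrac{1}{\lambda}(x_k - x^{+})$.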
