Non-Stationary Bandit Convex Optimization: An Optimal Algorithm with Two-Point Feedback

Chang He, Bo Jiang, Shuzhong Zhang

Published: 2025/8/6

Abstract

This paper studies bandit convex optimization in non-stationary environments with two-point feedback, using dynamic regret as the performance measure. We propose an algorithm based on bandit mirror descent that extends naturally to non-Euclidean settings. Let $T$ be the total number of iterations and $\mathcal{P}_{T,p}$ the path variation with respect to the $\ell_p$-norm. In Euclidean space, our algorithm attains the optimal regret bound $\mathcal{O}(\sqrt{dT(1+\mathcal{P}_{T,2})})$, improving upon \citet{zhao2021bandit} by a factor of $\mathcal{O}(\sqrt{d})$. Beyond the Euclidean setting, our algorithm achieves an upper bound of $\mathcal{O}(\sqrt{d\log(d)T\log(T)(1 + \mathcal{P}_{T,1})})$ on the simplex, which is optimal up to logarithmic factors. For the cross-polytope, we establish a bound of $\mathcal{O}(\sqrt{d\log(d)T(1+\mathcal{P}_{T,p})})$ with $p = 1 + 1/\log(d)$.
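For context, the quantities named in the abstract can be made precise with the definitions standard in this literature (a sketch; the paper's exact conventions may differ in minor details). Dynamic regret is measured against a drifting comparator sequence $u_1,\dots,u_T$, the path variation $\mathcal{P}_{T,p}$ measures the total movement of that sequence in the $\ell_p$-norm, and two-point feedback permits a low-variance zeroth-order gradient estimate from two function evaluations per round:

$$\mathrm{D\text{-}Reg}_T \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \sum_{t=1}^{T} f_t(u_t), \qquad \mathcal{P}_{T,p} \;=\; \sum_{t=2}^{T} \|u_t - u_{t-1}\|_p,$$

$$\hat{g}_t \;=\; \frac{d}{2\delta}\bigl(f_t(x_t + \delta v_t) - f_t(x_t - \delta v_t)\bigr)\, v_t,$$

where $v_t$ is drawn uniformly from the unit sphere and $\delta > 0$ is a perturbation radius; the estimate $\hat{g}_t$ is then fed into a mirror descent update whose mirror map depends on the geometry (Euclidean ball, simplex, or cross-polytope).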