Simple and Optimal Sublinear Algorithms for Mean Estimation
Beatrice Bertolotti, Matteo Russo, Chris Schwiegelshohn, Sudarshan Shyam
Published: 2024/6/7
Abstract
We study the sublinear multivariate mean estimation problem in $d$-dimensional Euclidean space. Specifically, we aim to find the mean $\mu$ of a ground point set $A$, which minimizes the sum of squared Euclidean distances of the points in $A$ to $\mu$. We first show that a multiplicative $(1+\varepsilon)$ approximation to $\mu$ can be found with probability $1-\delta$ using $O(\varepsilon^{-1}\log \delta^{-1})$ independent uniform random samples, and we provide a matching lower bound. Furthermore, we give two estimators with optimal sample complexity that can be computed in optimal running time for extracting a suitable approximate mean:
1. The coordinate-wise median of $\log \delta^{-1}$ sample means, each computed from a sample of size $\varepsilon^{-1}$. As a corollary, we also show improved convergence rates for this estimator when estimating means of multivariate distributions.
2. The geometric median of $\log \delta^{-1}$ sample means, each computed from a sample of size $\varepsilon^{-1}$. To compute a solution efficiently, we design a novel and simple gradient descent algorithm that, in our specific setting, is significantly faster than all other known algorithms for computing geometric medians.
In addition, we propose an order-statistics approach that is empirically competitive with these algorithms, has optimal sample complexity, and matches the running time up to lower-order terms. Finally, we provide an extensive experimental evaluation of several estimators, which concludes that the geometric-median-of-means approach is typically the most competitive in practice.
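To make the two estimators concrete, the following Python sketch draws $\log \delta^{-1}$ blocks of $\varepsilon^{-1}$ uniform samples each (constants omitted) and aggregates the block means either by a coordinate-wise median or by a geometric median. This is an illustrative assumption-laden sketch, not the paper's implementation: the geometric median is approximated here with the classical Weiszfeld iteration rather than the paper's specialized gradient descent, and all function names and constants are our own.

```python
import numpy as np

def block_means(A, eps, delta, rng):
    """Draw k ~ log(1/delta) blocks of m ~ 1/eps uniform samples (with
    replacement) from the rows of A and return the k block means.
    Constants are illustrative, not the paper's."""
    n, d = A.shape
    k = max(1, int(np.ceil(np.log(1.0 / delta))))   # number of blocks
    m = max(1, int(np.ceil(1.0 / eps)))             # samples per block
    means = np.empty((k, d))
    for i in range(k):
        idx = rng.integers(0, n, size=m)
        means[i] = A[idx].mean(axis=0)
    return means

def coordinatewise_median_of_means(A, eps, delta, rng=None):
    """Coordinate-wise median of the block means."""
    rng = np.random.default_rng() if rng is None else rng
    return np.median(block_means(A, eps, delta, rng), axis=0)

def geometric_median_of_means(A, eps, delta, iters=100, rng=None):
    """Geometric median of the block means, approximated with the classical
    Weiszfeld iteration (a stand-in for the paper's gradient descent)."""
    rng = np.random.default_rng() if rng is None else rng
    B = block_means(A, eps, delta, rng)
    x = B.mean(axis=0)                              # initialize at the centroid
    for _ in range(iters):
        dists = np.maximum(np.linalg.norm(B - x, axis=1), 1e-12)
        w = 1.0 / dists                             # inverse-distance weights
        x = (w[:, None] * B).sum(axis=0) / w.sum()
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100_000, 10)) + 3.0    # synthetic ground set
    print(coordinatewise_median_of_means(A, eps=0.1, delta=0.01, rng=rng))
    print(geometric_median_of_means(A, eps=0.1, delta=0.01, rng=rng))
```

Both estimators read only $O(\varepsilon^{-1}\log \delta^{-1})$ points of $A$; they differ only in how the $\log \delta^{-1}$ block means are combined, which is where the paper's running-time analysis and specialized geometric-median solver come in.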