Differentially Private Clustering in Data Streams

Alessandro Epasto, Tamalika Mukherjee, Peilin Zhong

Published: 2023/7/14

Abstract

Clustering problems (such as $k$-means and $k$-median) are fundamental unsupervised machine learning primitives, and streaming clustering algorithms have been extensively studied in the past. However, since data privacy becomes a central concern in many real-world applications, non-private clustering algorithms may not be as applicable in many scenarios. In this work, we provide the first differentially private algorithms for $k$-means and $k$-median clustering of $d$-dimensional Euclidean data points over a stream with length at most $T$ using space that is sublinear (in $T$) in the continual release setting where the algorithm is required to output a clustering at every timestep. We achieve (1) an $O(1)$-multiplicative approximation with $\tilde{O}(k^{1.5} \cdot poly(d,\log(T)))$ space and $poly(k,d,\log(T))$ additive error, or (2) a $(1+\gamma)$-multiplicative approximation with $\tilde{O}_\gamma(poly(k,2^{O_\gamma(d)},\log(T)))$ space for any $\gamma>0$, and the additive error is $poly(k,2^{O_\gamma(d)},\log(T))$. Our main technical contribution is a differentially private clustering framework for data streams which only requires an offline DP coreset or clustering algorithm as a blackbox.

Differentially Private Clustering in Data Streams | SummarXiv | SummarXiv