Streaming periodicity with mismatches, wildcards, and edits

Taha El Ghazi, Tatiana Starikovskaya

Published: 2025/9/18

Abstract

In this work, we study the problem of detecting periodic trends in strings. While detecting exact periodicity has been studied extensively, real-world data is often noisy, where small deviations or mismatches occur between repetitions. This work focuses on a generalized approach to period detection that efficiently handles noise. Given a string $S$ of length $n$, the task is to identify integers $p$ such that the prefix and the suffix of $S$, each of length $n-p+1$, are similar under a given distance measure. Erg\"un et al. [APPROX-RANDOM 2017] were the first to study this problem in the streaming model under the Hamming distance. In this work, we combine, in a non-trivial way, the Hamming distance sketch of Clifford et al. [SODA 2019] and the structural description of the $k$-mismatch occurrences of a pattern in a text by Charalampopoulos et al. [FOCS 2020] to present a more efficient streaming algorithm for period detection under the Hamming distance. As a corollary, we derive a streaming algorithm for detecting periods of strings which may contain wildcards, a special symbol that match any character of the alphabet. Our algorithm is not only more efficient than that of Erg\"un et al. [TCS 2020], but it also operates without their assumption that the string must be free of wildcards in its final characters. Additionally, we introduce the first two-pass streaming algorithm for computing periods under the edit distance by leveraging and extending the Bhattacharya-Kouck\'y's grammar decomposition technique [STOC 2023].

Read Full Paper (arXiv.org)