Differentially Private Substring and Document Counting

Bernardini, Giulia; Bille, Philip; Gørtz, Inge Li; Steiner, Teresa Anna

Abstract:Differential privacy is the gold standard for privacy in data analysis. In many data analysis applications, the data is a database of documents. For databases consisting of many documents, one of the most fundamental problems is that of pattern matching and computing (i) how often a pattern appears as a substring in the database (substring counting) and (ii) how many documents in the collection contain the pattern as a substring (document counting). In this paper, we initiate the theoretical study of substring and document counting under differential privacy.
We give an $\epsilon$-differentially private data structure solving this problem for all patterns simultaneously with a maximum additive error of $O(\ell \cdot\mathrm{polylog}(n\ell|\Sigma|))$, where $\ell$ is the maximum length of a document in the database, $n$ is the number of documents, and $|\Sigma|$ is the size of the alphabet. We show that this is optimal up to a $O(\mathrm{polylog}(n\ell))$ factor. Further, we show that for $(\epsilon,\delta)$-differential privacy, the bound for document counting can be improved to $O(\sqrt{\ell} \cdot\mathrm{polylog}(n\ell|\Sigma|))$. Additionally, our data structures are efficient. In particular, our data structures use $O(n\ell^2)$ space, $O(n^2\ell^4)$ preprocessing time, and $O(|P|)$ query time where $P$ is the query pattern. Along the way, we develop a new technique for differentially privately computing a general class of counting functions on trees of independent interest.
Our data structures immediately lead to improved algorithms for related problems, such as privately mining frequent substrings and $q$-grams. For $q$-grams, we further improve the preprocessing time of the data structure.

Comments:	33 pages
Subjects:	Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR)
Cite as:	arXiv:2412.13813 [cs.DS]
	(or arXiv:2412.13813v1 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.2412.13813

Computer Science > Data Structures and Algorithms

Title:Differentially Private Substring and Document Counting

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators