Algorithms and Hardness for Estimating Statistical Similarity

Bhattacharyya, Arnab; Gayen, Sutanu; Meel, Kuldeep S.; Myrisiotis, Dimitrios; Pavan, A.; Vinodchandran, N. V.


arXiv:2502.10527 (cs)

This paper has been withdrawn by Dimitrios Myrisiotis

[Submitted on 14 Feb 2025 (v1), last revised 23 Apr 2025 (this version, v2)]


Authors: Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, Dimitrios Myrisiotis, A. Pavan, N. V. Vinodchandran


Abstract: We study the problem of computing statistical similarity between probability distributions. For distributions $P$ and $Q$ over a finite sample space, their statistical similarity is defined as $S_{\mathrm{stat}}(P, Q) := \sum_{x} \min(P(x), Q(x))$. Statistical similarity is a basic measure of similarity between distributions, with several natural interpretations, and captures the Bayes error in prediction and hypothesis testing problems. Recent work has established that, somewhat surprisingly, even for the simple class of product distributions, exactly computing statistical similarity is $\#\mathsf{P}$-hard. This motivates the question of designing approximation algorithms for statistical similarity. Our primary contribution is a Fully Polynomial-Time deterministic Approximation Scheme (FPTAS) for estimating statistical similarity between two product distributions. To obtain this result, we introduce a new variant of the Knapsack problem, which we call the Masked Knapsack problem, and design an FPTAS to estimate the number of solutions of a multidimensional version of this problem. This new technical contribution could be of independent interest. Furthermore, we also establish a complementary hardness result. We show that it is $\mathsf{NP}$-hard to estimate statistical similarity when $P$ and $Q$ are Bayes net distributions of in-degree $2$.
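As an illustration of the definition only, the sketch below evaluates $S_{\mathrm{stat}}(P, Q) = \sum_{x} \min(P(x), Q(x))$ directly. It is not the FPTAS described in the abstract. The function names and the encoding of a product distribution over $\{0,1\}^n$ as a list of per-coordinate probabilities are assumptions made here for the example. The second function enumerates all $2^n$ outcomes, so it is only practical for very small $n$.

```python
from itertools import product
from math import prod


def statistical_similarity(p, q):
    """Sum of min(P(x), Q(x)) over all outcomes.

    p and q are dicts mapping outcomes to probabilities (hypothetical encoding);
    an outcome missing from a dict is treated as having probability 0.
    """
    support = set(p) | set(q)
    return sum(min(p.get(x, 0.0), q.get(x, 0.0)) for x in support)


def product_similarity_bruteforce(p_bits, q_bits):
    """Exact similarity of two product distributions over {0,1}^n.

    p_bits[i] is the probability that coordinate i equals 1 under P
    (likewise q_bits for Q). Runs in time exponential in n.
    """
    if len(p_bits) != len(q_bits):
        raise ValueError("both distributions must have the same dimension")
    n = len(p_bits)
    total = 0.0
    for x in product((0, 1), repeat=n):
        # Probability of outcome x is the product of per-coordinate probabilities.
        px = prod(pi if b else 1.0 - pi for pi, b in zip(p_bits, x))
        qx = prod(qi if b else 1.0 - qi for qi, b in zip(q_bits, x))
        total += min(px, qx)
    return total
```

For example, `statistical_similarity({"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75})` sums `min(0.5, 0.25)` and `min(0.5, 0.75)`. Two identical distributions have similarity 1, and two distributions with disjoint supports have similarity 0.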

Comments:	There is an error in the proof of Lemma 23, which invalidates Theorems 11 and 8. The rest of our results hold true.
Subjects:	Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC)
Cite as:	arXiv:2502.10527 [cs.DS]
	(or arXiv:2502.10527v2 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.2502.10527

Submission history

From: Dimitrios Myrisiotis
[v1] Fri, 14 Feb 2025 19:45:11 UTC (22 KB)
[v2] Wed, 23 Apr 2025 20:15:38 UTC (1 KB) (withdrawn)
