Data Structures and Algorithms
- [1] arXiv:2405.03792 [pdf, ps, html, other]
-
Title: Prize-Collecting Steiner Tree: A 1.79 Approximation
Subjects: Data Structures and Algorithms (cs.DS)
Prize-Collecting Steiner Tree (PCST) is a generalization of the Steiner Tree problem, a fundamental problem in computer science. In the classic Steiner Tree problem, we aim to connect a set of vertices known as terminals using the minimum-weight tree in a given weighted graph. In this generalized version, each vertex has a penalty, and there is flexibility to decide whether to connect each vertex or pay its associated penalty, making the problem more realistic and practical.
Both the Steiner Tree problem and its Prize-Collecting version had long-standing $2$-approximation algorithms, matching the integrality gap of the natural LP formulations for both. This barrier for both problems has been surpassed, with algorithms achieving approximation factors below $2$. While research on the Steiner Tree problem has led to a series of reductions in the approximation ratio below $2$, culminating in a $\ln(4)+\epsilon$ approximation by Byrka, Grandoni, Rothvoß, and Sanità, the Prize-Collecting version has not seen improvements in the past 15 years since the work of Archer, Bateni, Hajiaghayi, and Karloff, which reduced the approximation factor for this problem from $2$ to $1.9672$. Interestingly, even the Prize-Collecting TSP approximation, which was first improved below $2$ in the same paper, has seen several advancements since then.
In this paper, we reduce the approximation factor for the PCST problem substantially to 1.7994 via a novel iterative approach.
- [2] arXiv:2405.03801 [pdf, ps, html, other]
-
Title: Finding Most Shattering Minimum Vertex Cuts of Polylogarithmic Size in Near-Linear Time
Comments: Accepted to ICALP 2024
Subjects: Data Structures and Algorithms (cs.DS)
We show the first near-linear time randomized algorithms for listing all minimum vertex cuts of polylogarithmic size that separate the graph into at least three connected components (also known as shredders) and for finding the most shattering one, i.e., the one maximizing the number of connected components. Our algorithms break the quadratic time bound by Cheriyan and Thurimella (STOC'96) for both problems that has been unimproved for more than two decades. Our work also removes an important bottleneck to near-linear time algorithms for the vertex connectivity augmentation problem (Jordan '95) and finding an even-length cycle in a directed graph, a problem shown to be equivalent to many other fundamental problems (Vazirani and Yannakakis '90, Robertson et al. '99). Note that it is necessary to list only minimum vertex cuts that separate the graph into at least three components because there can be an exponential number of minimum vertex cuts in general.
To obtain near-linear time algorithms, we have extended techniques in local flow algorithms developed by Forster et al. (SODA'20) to list shredders on a local scale. We also exploit fast queries to a pairwise vertex connectivity oracle subject to vertex failures (Long and Saranurak FOCS'22, Kosinas ESA'23). This is the first application of connectivity oracles subject to vertex failures to speed up a static graph algorithm.
- [3] arXiv:2405.03856 [pdf, ps, html, other]
-
Title: Finding perfect matchings in bridgeless cubic multigraphs without dynamic (2-)connectivity
Subjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM)
Petersen's theorem, one of the earliest results in graph theory, states that any bridgeless cubic multigraph contains a perfect matching. While the original proof was neither constructive nor algorithmic, Biedl, Bose, Demaine, and Lubiw [J. Algorithms 38(1)] showed how to implement a later constructive proof by Frink in $\mathcal{O}(n\log^{4}n)$ time using a fully dynamic 2-edge-connectivity structure. Then, Diks and Stańczyk [SOFSEM 2010] described a faster approach that only needs a fully dynamic connectivity structure and works in $\mathcal{O}(n\log^{2}n)$ time. Both algorithms, while reasonably simple, utilize non-trivial (2-edge-)connectivity structures. We show that this is not necessary, and in fact a structure for maintaining a dynamic tree, e.g. link-cut trees, suffices to obtain a simple $\mathcal{O}(n\log n)$ time algorithm.
- [4] arXiv:2405.04052 [pdf, ps, html, other]
-
Title: Minimizing the Minimizers via Alphabet Reordering
Comments: Extended version of a paper accepted at CPM 2024
Subjects: Data Structures and Algorithms (cs.DS)
Minimizers sampling is one of the most widely-used mechanisms for sampling strings [Roberts et al., Bioinformatics 2004]. Let $S=S[1]\ldots S[n]$ be a string over a totally ordered alphabet $\Sigma$. Further let $w\geq 2$ and $k\geq 1$ be two integers. The minimizer of $S[i\mathinner{.\,.} i+w+k-2]$ is the smallest position in $[i,i+w-1]$ where the lexicographically smallest length-$k$ substring of $S[i\mathinner{.\,.} i+w+k-2]$ starts. The set of minimizers over all $i\in[1,n-w-k+2]$ is the set $\mathcal{M}_{w,k}(S)$ of the minimizers of $S$. We consider the following basic problem: Given $S$, $w$, and $k$, can we efficiently compute a total order on $\Sigma$ that minimizes $|\mathcal{M}_{w,k}(S)|$? We show that this is unlikely by proving that the problem is NP-hard for any $w\geq 2$ and $k\geq 1$. Our result provides theoretical justification as to why there exist no exact algorithms for minimizing the minimizers samples, while there exists a plethora of heuristics for the same purpose.
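The definition of $\mathcal{M}_{w,k}(S)$ above can be made concrete with a small sketch. It is only an illustration, not the paper's method: the function names `minimizers` and `best_order_bruteforce` are hypothetical, positions are reported 1-based to match the abstract, and the search over alphabet orders is a plain exhaustive enumeration whose cost grows factorially with $|\Sigma|$, in line with the NP-hardness result.

```python
from itertools import permutations


def minimizers(s, w, k, order=None):
    """Return the set M_{w,k}(s) of 1-based minimizer positions.

    `order` lists the alphabet from smallest to largest letter;
    it defaults to the usual sorted order (an assumption of this sketch).
    """
    if order is None:
        order = sorted(set(s))
    rank = {c: r for r, c in enumerate(order)}
    positions = set()
    # One window per start i (0-based); each window has length w + k - 1.
    for i in range(len(s) - w - k + 2):
        # Smallest length-k substring under `order`; ties go to the leftmost start.
        best = min(
            range(i, i + w),
            key=lambda j: ([rank[c] for c in s[j:j + k]], j),
        )
        positions.add(best + 1)  # convert to 1-based position
    return positions


def best_order_bruteforce(s, w, k):
    """Try every total order on the alphabet and keep one with the fewest minimizers."""
    alphabet = sorted(set(s))
    return min(permutations(alphabet), key=lambda p: len(minimizers(s, w, k, p)))
```

For example, on the string `abb` with $w=2$ and $k=1$, the usual order yields two minimizers, while the order that places `b` before `a` yields only one.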
- [5] arXiv:2405.04428 [pdf, ps, html, other]
-
Title: BBK: a simpler, faster algorithm for enumerating maximal bicliques in large sparse bipartite graphs
Comments: 21 pages, 4 figures, 3 tables
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Bipartite graphs are a prevalent modeling tool for real-world networks, capturing interactions between vertices of two different types. Within this framework, bicliques emerge as crucial structures when studying dense subgraphs: they are sets of vertices such that all vertices of the first type interact with all vertices of the second type. Therefore, they allow identifying groups of closely related vertices of the network, such as individuals with similar interests or webpages with similar contents. This article introduces a new algorithm designed for the exhaustive enumeration of maximal bicliques within a bipartite graph. This algorithm, called BBK for Bipartite Bron-Kerbosch, is a new extension to the bipartite case of the Bron-Kerbosch algorithm, which enumerates the maximal cliques in standard (non-bipartite) graphs. It is faster than the state-of-the-art algorithms and allows the enumeration on massive bipartite graphs that are not manageable with existing implementations. We analyze it theoretically to establish two complexity formulas: one as a function of the input and one as a function of the output characteristics of the algorithm. We also provide an open-access implementation of BBK in C++, which we use to experiment and validate its efficiency on massive real-world datasets and show that its execution time is shorter in practice than state-of-the-art algorithms. These experiments also show that the order in which the vertices are processed, as well as the choice of one of the two types of vertices on which to initiate the enumeration, have an impact on the computation time.
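For orientation, the classic Bron-Kerbosch recursion that BBK extends can be sketched as follows. This is the textbook version for ordinary graphs, without pivoting, and is not the BBK algorithm itself; the function name `bron_kerbosch` and the adjacency-dictionary input format are assumptions of this sketch.

```python
def bron_kerbosch(adj, r=frozenset(), p=None, x=frozenset()):
    """Yield every maximal clique of an undirected graph.

    adj maps each vertex to the set of its neighbours.
    r: current clique, p: candidates that can extend r,
    x: vertices already explored (used to reject non-maximal cliques).
    """
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        # Nothing can extend r and nothing excluded dominates it: r is maximal.
        yield r
    for v in list(p):
        # Extend r by v; only common neighbours of v stay candidate or excluded.
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        # v is fully explored: move it from the candidates to the excluded set.
        p = p - {v}
        x = x | {v}
```

Called on a triangle with one pendant vertex, the generator reports the triangle and the pendant edge as the two maximal cliques.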
- [6] arXiv:2405.04467 [pdf, ps, html, other]
-
Title: Online List Labeling with Near-Logarithmic Writes
Comments: 12 pages, 1 figure. Improved version of a rejected draft
Subjects: Data Structures and Algorithms (cs.DS)
In the Online List Labeling problem, a set of $n \leq N$ elements from a totally ordered universe must be stored in sorted order in an array with $m=N+\lceil\varepsilon N \rceil$ slots, where $\varepsilon \in (0,1]$ is constant, while an adversary chooses elements that must be inserted and deleted from the set.
We devise a skip-list based algorithm for maintaining order against an oblivious adversary and show that the expected amortized number of writes is $O(\varepsilon^{-1}\log (n) \operatorname{poly}(\log \log n))$ per update.
New submissions for Wednesday, 8 May 2024 (showing 6 of 6 entries)
- [7] arXiv:2405.03851 (cross-list from cs.DB) [pdf, ps, html, other]
-
Title: Upper Bounds for Complexity of Asymptotically Optimal Learned Indexes
Subjects: Databases (cs.DB); Data Structures and Algorithms (cs.DS)
Learned indexes leverage machine learning models to accelerate query answering in databases, showing impressive practical performance. However, theoretical understanding of these methods remains incomplete. Existing research suggests that learned indexes have superior asymptotic complexity compared to their non-learned counterparts, but these findings have been established under restrictive probabilistic assumptions. Specifically, for a sorted array with $n$ elements, it has been shown that learned indexes can find a key in $O(\log(\log n))$ expected time using at most linear space, compared with $O(\log n)$ for non-learned methods.
In this work, we prove $O(1)$ expected time can be achieved with at most linear space, thereby establishing the tightest upper bound so far for the time complexity of an asymptotically optimal learned index. Notably, we use weaker probabilistic assumptions than prior work, meaning our results generalize previous efforts. Furthermore, we introduce a new measure of statistical complexity for data. This metric exhibits an information-theoretical interpretation and can be estimated in practice. This characterization provides further theoretical understanding of learned indexes, by helping to explain why some datasets seem to be particularly challenging for these methods.
- [8] arXiv:2405.04020 (cross-list from cs.GT) [pdf, ps, html, other]
-
Title: Metric Distortion of Line-up Elections: The Right Person for the Right Job
Subjects: Computer Science and Game Theory (cs.GT); Data Structures and Algorithms (cs.DS)
We provide mechanisms and new metric distortion bounds for line-up elections. In such elections, a set of $n$ voters, $k$ candidates, and $\ell$ positions are all located in a metric space. The goal is to choose a set of candidates and assign them to different positions, so as to minimize the total cost of the voters. The cost of each voter consists of the distances from itself to the chosen candidates (measuring how much the voter likes the chosen candidates, or how similar it is to them), as well as the distances from the candidates to the positions they are assigned to (measuring the fitness of the candidates for their positions). Our mechanisms, however, do not know the exact distances, and instead produce good outcomes while only using a smaller amount of information, resulting in small distortion.
We consider several different types of information: ordinal voter preferences, ordinal position preferences, and knowing the exact locations of candidates and positions, but not those of voters. In each of these cases, we provide constant distortion bounds, thus showing that only a small amount of information is enough to form outcomes close to optimum in line-up elections.
- [9] arXiv:2405.04237 (cross-list from cs.DC) [pdf, ps, other]
-
Title: QR factorization of ill-conditioned tall-and-skinny matrices on distributed-memory systems
Comments: 12 pages, 10 figures, 2 tables
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Performance (cs.PF)
In this paper we present a novel algorithm developed for computing the QR factorisation of extremely ill-conditioned tall-and-skinny matrices on distributed memory systems. The algorithm is based on the communication-avoiding CholeskyQR2 algorithm and its block Gram-Schmidt variant. The latter improves the numerical stability of the CholeskyQR2 algorithm and significantly reduces the loss of orthogonality even for matrices with condition numbers up to $10^{15}$. Currently, there is no distributed GPU version of this algorithm available in the literature, which prevents the application of this method to very large matrices. In our work we provide a distributed implementation of this algorithm and also introduce a modified version that improves the performance, especially in the case of extremely ill-conditioned matrices. The main innovation of our approach lies in the interleaving of the CholeskyQR steps with the Gram-Schmidt orthogonalisation, which ensures that update steps are performed with fully orthogonalised panels. The obtained orthogonality and numerical stability of our modified algorithm are equivalent to CholeskyQR2 with Gram-Schmidt and other state-of-the-art methods. Weak scaling tests performed with our test matrices show significant performance improvements. In particular, our algorithm outperforms state-of-the-art Householder-based QR factorisation algorithms available in ScaLAPACK by a factor of $6$ on CPU-only systems and up to $80\times$ on GPU-based systems with distributed memory.
- [10] arXiv:2405.04261 (cross-list from cs.IT) [pdf, ps, other]
-
Title: Graph Reconstruction from Noisy Random Subgraphs
Comments: 6 pages, to appear in ISIT 2024
Subjects: Information Theory (cs.IT); Data Structures and Algorithms (cs.DS)
We consider the problem of reconstructing an undirected graph $G$ on $n$ vertices given multiple random noisy subgraphs or "traces". Specifically, a trace is generated by sampling each vertex with probability $p_v$, then taking the resulting induced subgraph on the sampled vertices, and then adding noise in the form of either (a) deleting each edge in the subgraph with probability $1-p_e$, or (b) deleting each edge with probability $f_e$ and transforming a non-edge into an edge with probability $f_e$. We show that, under mild assumptions on $p_v$, $p_e$ and $f_e$, if $G$ is selected uniformly at random, then $O(p_e^{-1} p_v^{-2} \log n)$ or $O((f_e-1/2)^{-2} p_v^{-2} \log n)$ traces suffice to reconstruct $G$ with high probability. In contrast, if $G$ is arbitrary, then $\exp(\Omega(n))$ traces are necessary even when $p_v=1, p_e=1/2$.
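The trace model (a) described above can be sketched in a few lines. This is only an illustration of the sampling process, not of the reconstruction algorithm; the function name `sample_trace`, the edge-list input format, and the choice to keep the original vertex labels in the trace are assumptions of this sketch and may differ from the paper's exact setup.

```python
import random


def sample_trace(n, edges, p_v, p_e, rng):
    """Sample one trace of a graph on vertices 0..n-1 under noise model (a).

    Each vertex is kept with probability p_v; each edge of the induced
    subgraph is then deleted with probability 1 - p_e (kept with p_e).
    """
    kept = {v for v in range(n) if rng.random() < p_v}
    trace_edges = {
        (u, v)
        for (u, v) in edges
        # Only edges with both endpoints sampled can survive.
        if u in kept and v in kept and rng.random() < p_e
    }
    return kept, trace_edges


# Usage: draw a few traces of a 4-cycle with a fixed seed.
rng = random.Random(0)
cycle = [(0, 1), (1, 2), (2, 3), (0, 3)]
traces = [sample_trace(4, cycle, p_v=0.8, p_e=0.5, rng=rng) for _ in range(3)]
```

With $p_v=1$ and $p_e=1$ every trace is the graph itself, and with $p_v=0$ every trace is empty, which gives two quick sanity checks for the sketch.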
- [11] arXiv:2405.04435 (cross-list from cs.CL) [pdf, ps, html, other]
-
Title: Fast Exact Retrieval for Nearest-neighbor Lookup (FERN)
Comments: NAACL 2024 SRW
Subjects: Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS)
Exact nearest neighbor search is a computationally intensive process, and even its simpler sibling -- vector retrieval -- can be computationally complex. This is exacerbated when retrieving vectors which have high-dimension $d$ relative to the number of vectors, $N$, in the database. Exact nearest neighbor retrieval has been generally acknowledged to be an $O(Nd)$ problem with no sub-linear solutions. Attention has instead shifted towards Approximate Nearest-Neighbor (ANN) retrieval techniques, many of which have sub-linear or even logarithmic time complexities. However, if our intuition from binary search problems (e.g. $d=1$ vector retrieval) carries, there ought to be a way to retrieve an organized representation of vectors without brute-forcing our way to a solution. For low dimension (e.g. $d=2$ or $d=3$ cases), kd-trees provide an $O(d\log N)$ algorithm for retrieval. Unfortunately the algorithm deteriorates rapidly to an $O(dN)$ solution at high dimensions (e.g. $d=128$), in practice. We propose a novel algorithm for logarithmic Fast Exact Retrieval for Nearest-neighbor lookup (FERN), inspired by kd-trees. The algorithm achieves $O(d\log N)$ look-up with 100% recall on 10 million $d=128$ uniformly randomly generated vectors. (Code available at this https URL)
Cross submissions for Wednesday, 8 May 2024 (showing 5 of 5 entries)
- [12] arXiv:2304.01889 (replaced) [pdf, ps, html, other]
-
Title: Chasing Positive Bodies
Subjects: Data Structures and Algorithms (cs.DS)
We study the problem of chasing positive bodies in $\ell_1$: given a sequence of bodies $K_{t}=\{x^{t}\in\mathbb{R}_{+}^{n}\mid C^{t}x^{t}\geq 1,P^{t}x^{t}\leq 1\}$ revealed online, where $C^{t}$ and $P^{t}$ are nonnegative matrices, the goal is to (approximately) maintain a point $x_t \in K_t$ such that $\sum_t \|x_t - x_{t-1}\|_1$ is minimized. This captures the fully-dynamic low-recourse variant of any problem that can be expressed as a mixed packing-covering linear program and thus also the fractional version of many central problems in dynamic algorithms such as set cover, load balancing, hyperedge orientation, minimum spanning tree, and matching.
We give an $O(\log d)$-competitive algorithm for this problem, where $d$ is the maximum row sparsity of any matrix $C^t$. This bypasses and improves exponentially over the lower bound of $\sqrt{n}$ known for general convex bodies. Our algorithm is based on iterated information projections, and, in contrast to general convex body chasing algorithms, is entirely memoryless.
We also show how to round our solution dynamically to obtain the first fully dynamic algorithms with competitive recourse for all the stated problems above; i.e. their recourse is less than the recourse of every other algorithm on every update sequence, up to polylogarithmic factors. This is a significantly stronger notion than the notion of absolute recourse in the dynamic algorithms literature.
- [13] arXiv:2309.05172 (replaced) [pdf, ps, html, other]
-
Title: 2-Approximation for Prize-Collecting Steiner Forest
Subjects: Data Structures and Algorithms (cs.DS)
Approximation algorithms for the prize-collecting Steiner forest problem (PCSF) have been a subject of research for over three decades, starting with the seminal works of Agrawal, Klein, and Ravi and Goemans and Williamson on Steiner forest and prize-collecting problems. In this paper, we propose and analyze a natural deterministic algorithm for PCSF that achieves a $2$-approximate solution in polynomial time. This represents a significant improvement compared to the previously best known algorithm with a $2.54$-approximation factor developed by Hajiaghayi and Jain in 2006. Furthermore, Könemann, Olver, Pashkovich, Ravi, Swamy, and Vygen have established an integrality gap of at least $9/4$ for the natural LP relaxation for PCSF. However, we surpass this gap through the utilization of a combinatorial algorithm and a novel analysis technique. Since $2$ is the best known approximation guarantee for the Steiner forest problem, which is a special case of PCSF, our result matches this factor and closes the gap between the Steiner forest problem and its generalized version, PCSF.
- [14] arXiv:2310.05839 (replaced) [pdf, ps, other]
-
Title: Parameterized Complexity of MinCSP over the Point Algebra
Subjects: Data Structures and Algorithms (cs.DS)
The input in the Minimum-Cost Constraint Satisfaction Problem (MinCSP) over the Point Algebra contains a set of variables, a collection of constraints of the form $x < y$, $x = y$, $x \leq y$ and $x \neq y$, and a budget $k$. The goal is to check whether it is possible to assign rational values to the variables while breaking constraints of total cost at most $k$. This problem generalizes several prominent graph separation and transversal problems: MinCSP$(<)$ is equivalent to Directed Feedback Arc Set, MinCSP$(<,\leq)$ is equivalent to Directed Subset Feedback Arc Set, MinCSP$(=,\neq)$ is equivalent to Edge Multicut, and MinCSP$(\leq,\neq)$ is equivalent to Directed Symmetric Multicut. Apart from trivial cases, MinCSP$(\Gamma)$ for $\Gamma \subseteq \{<,=,\leq,\neq\}$ is NP-hard even to approximate within any constant factor under the Unique Games Conjecture. Hence, we study parameterized complexity of this problem under a natural parameterization by the solution cost $k$. We obtain a complete classification: if $\Gamma \subseteq \{<,=,\leq,\neq\}$ contains both $\leq$ and $\neq$, then MinCSP$(\Gamma)$ is W[1]-hard, otherwise it is fixed-parameter tractable. For the positive cases, we solve MinCSP$(<,=,\neq)$, generalizing the FPT results for Directed Feedback Arc Set and Edge Multicut as well as their weighted versions. Our algorithm works by reducing the problem into a Boolean MinCSP, which is in turn solved by flow augmentation. For the lower bounds, we prove that Directed Symmetric Multicut is W[1]-hard, solving an open problem.
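To make the problem statement concrete, the following sketch evaluates the cost of an assignment and finds the optimum by exhaustive search. It is an exponential-time baseline for tiny instances, not the paper's flow-augmentation algorithm; the names `broken_cost` and `min_cost_bruteforce` and the tuple encoding `(x, relation, y, cost)` of constraints are assumptions of this sketch. Since only the relative order of the values matters, it suffices to try values in $\{0,\ldots,n-1\}$ for $n$ variables.

```python
import operator
from itertools import product

# The four relations of the Point Algebra, keyed by an ASCII spelling.
RELATIONS = {
    "<": operator.lt,
    "=": operator.eq,
    "<=": operator.le,
    "!=": operator.ne,
}


def broken_cost(assignment, constraints):
    """Total cost of the constraints (x, rel, y, cost) violated by `assignment`."""
    return sum(
        cost
        for x, rel, y, cost in constraints
        if not RELATIONS[rel](assignment[x], assignment[y])
    )


def min_cost_bruteforce(variables, constraints):
    """Minimum broken cost over all assignments (exponential in len(variables))."""
    n = len(variables)
    return min(
        broken_cost(dict(zip(variables, values)), constraints)
        for values in product(range(n), repeat=n)
    )


# A directed 3-cycle of strict inequalities: one constraint must be broken,
# matching the correspondence with Directed Feedback Arc Set.
cycle = [("x", "<", "y", 1), ("y", "<", "z", 1), ("z", "<", "x", 1)]
```

An instance with budget $k$ is then a yes-instance exactly when `min_cost_bruteforce(variables, constraints) <= k`.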
- [15] arXiv:2401.01404 (replaced) [pdf, ps, other]
-
Title: Scalable network reconstruction in subquadratic time
Comments: 12 pages, 7 figures. Code and documentation available at this https URL
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Computation (stat.CO); Machine Learning (stat.ML)
Network reconstruction consists in determining the unobserved pairwise couplings between $N$ nodes given only observational data on the resulting behavior that is conditioned on those couplings -- typically a time-series or independent samples from a graphical model. A major obstacle to the scalability of algorithms proposed for this problem is a seemingly unavoidable quadratic complexity of $\Omega(N^2)$, corresponding to the requirement of each possible pairwise coupling being contemplated at least once, despite the fact that most networks of interest are sparse, with a number of non-zero couplings that is only $O(N)$. Here we present a general algorithm applicable to a broad range of reconstruction problems that significantly outperforms this quadratic baseline. Our algorithm relies on a stochastic second neighbor search (Dong et al., 2011) that produces the best edge candidates with high probability, thus bypassing an exhaustive quadratic search. If we rely on the conjecture that the second-neighbor search finishes in log-linear time (Baron & Darling, 2020; 2022), we demonstrate theoretically that our algorithm finishes in subquadratic time, with a data-dependent complexity loosely upper bounded by $O(N^{3/2}\log N)$, but with a more typical log-linear complexity of $O(N\log^2N)$. In practice, we show that our algorithm achieves a performance that is many orders of magnitude faster than the quadratic baseline -- in a manner consistent with our theoretical analysis -- allows for easy parallelization, and thus enables the reconstruction of networks with hundreds of thousands and even millions of nodes and edges.
- [16] arXiv:2302.03456 (replaced) [pdf, ps, html, other]
-
Title: 1-in-3 vs. Not-All-Equal: Dichotomy of a broken promise
Comments: Full version of a LICS 2024 paper; v2 has a different title, abstract, and introduction
Subjects: Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS)
The 1-in-3 and Not-All-Equal satisfiability problems for Boolean CNF formulas are two well-known NP-hard problems. In contrast, the promise 1-in-3 vs. Not-All-Equal problem can be solved in polynomial time. In the present work, we investigate this constraint satisfaction problem in a regime where the promise is weakened from either side by a rainbow-free structure, and establish a complexity dichotomy for the resulting class of computational problems.
- [17] arXiv:2404.06087 (replaced) [pdf, ps, html, other]
-
Title: The Overlap Gap Property limits limit swapping in QAOA
Comments: 22 pages, 2 figures
Subjects: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Data Structures and Algorithms (cs.DS)
The Quantum Approximate Optimization Algorithm (QAOA) is a quantum algorithm designed for Combinatorial Optimization Problems (COPs). We show that if a COP with an underlying Erdös--Rényi hypergraph exhibits the Overlap Gap Property (OGP), then a random regular hypergraph exhibits it as well. Given that Max-$q$-XORSAT on an Erdös--Rényi hypergraph is known to exhibit the OGP, and since the performance of QAOA for the pure $q$-spin model matches asymptotically that for Max-$q$-XORSAT on large-girth regular hypergraphs, we show that the average-case value obtained by QAOA for the pure $q$-spin model for even $q\ge 4$ is bounded away from optimality even when the algorithm runs indefinitely. This suggests that a necessary condition for the validity of limit swapping in QAOA is the absence of OGP in a given combinatorial optimization problem. Furthermore, the results suggest that even when sub-optimised, the performance of QAOA on spin glass is equal to that of classical algorithms in solving the mean field spin glass problem, providing further evidence that the conjecture of getting the exact solution under limit swapping for the Sherrington--Kirkpatrick model is true.
- [18] arXiv:2404.12953 (replaced) [pdf, ps, other]
-
Title: Low-Depth Spatial Tree Algorithms
Comments: to appear at IPDPS 2024
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
Contemporary accelerator designs exhibit a high degree of spatial localization, wherein two-dimensional physical distance determines communication costs between processing elements. This situation presents considerable algorithmic challenges, particularly when managing sparse data, a pivotal component in progressing data science. The spatial computer model quantifies communication locality by weighting processor communication costs by distance, introducing a term named energy. Moreover, it integrates depth, a widely-utilized metric, to promote high parallelism. We propose and analyze a framework for efficient spatial tree algorithms within the spatial computer model. Our primary method constructs a spatial tree layout that optimizes the locality of the neighbors in the compute grid. This approach thereby enables locality-optimized messaging within the tree. Our layout achieves a polynomial factor improvement in energy compared to utilizing a PRAM approach. Using this layout, we develop energy-efficient treefix sum and lowest common ancestor algorithms, which are both fundamental building blocks for other graph algorithms. With high probability, our algorithms exhibit near-linear energy and poly-logarithmic depth. Our contributions augment a growing body of work demonstrating that computations can have both high spatial locality and low depth. Moreover, our work constitutes an advancement in the spatial layout of irregular and sparse computations.