Databases
- [1] arXiv:2407.14098 [pdf, html, other]
Title: Top-k Representative Search for Comparative Tree Summarization
Subjects: Databases (cs.DB)
Data summarization aims to represent a massive dataset as a whole with a small-scale summary, which is useful for visualization and information snippet generation. However, most existing studies of hierarchical summarization work on only a single tree by selecting $k$ representative nodes, which neglects the important problem of comparative summarization on two trees. In this paper, given two trees with the same topology and different node weights, we aim to find $k$ representative nodes, where $k_1$ nodes summarize the common relationship between the trees and $k_2$ nodes highlight significantly different sub-trees, subject to $k_1+k_2=k$. To optimize summarization results, we introduce a scaling coefficient for balancing the summary view between two sub-trees in terms of similarity and difference. Additionally, we propose a novel definition based on the Hellinger distance to quantify the node distribution difference between the sub-trees. We present a greedy algorithm, SVDT, that efficiently finds high-quality results with an approximation guarantee. Furthermore, we explore an extension of our comparative summarization to handle two trees with different structures. Extensive experiments demonstrate the effectiveness and efficiency of our SVDT algorithm against existing summarization competitors.
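The abstract does not spell out how the Hellinger distance is adapted to sub-trees, so the following is only a minimal sketch of the standard Hellinger distance between two discrete distributions. It assumes the node weights of two topologically identical sub-trees are given as equal-length, non-negative lists in the same node order; the function name and the sample weights are hypothetical and not taken from the paper.

```python
import math


def hellinger_distance(weights_a, weights_b):
    """Standard Hellinger distance between two discrete distributions.

    Both inputs are non-negative weight lists of equal length. They are
    normalized to sum to 1 before comparison. The result lies in [0, 1].
    """
    if len(weights_a) != len(weights_b):
        raise ValueError("weight lists must have the same length")
    total_a, total_b = sum(weights_a), sum(weights_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each weight list must have a positive sum")
    # Sum of squared differences between the square roots of the probabilities.
    squared = sum(
        (math.sqrt(a / total_a) - math.sqrt(b / total_b)) ** 2
        for a, b in zip(weights_a, weights_b)
    )
    return math.sqrt(squared / 2.0)


# Hypothetical node weights of the same sub-tree in two trees.
subtree_in_tree_1 = [10, 5, 5]
subtree_in_tree_2 = [2, 9, 9]
print(hellinger_distance(subtree_in_tree_1, subtree_in_tree_2))
```

A value near 0 indicates that the two sub-trees distribute their weight similarly, while a value near 1 indicates almost disjoint distributions.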
- [2] arXiv:2407.14384 [pdf, html, other]
Title: The Sticky Path to Expressive Querying: Decidability of Navigational Queries under Existential Rules
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Extensive research in the field of ontology-based query answering has led to the identification of numerous fragments of existential rules (also known as tuple-generating dependencies) that exhibit decidable answering of atomic and conjunctive queries. Motivated by the increased theoretical and practical interest in navigational queries, this paper considers the question for which of these fragments decidability of querying extends to regular path queries (RPQs). In fact, decidability of RPQs has recently been shown to generally hold for the comprehensive family of all fragments that come with the guarantee of universal models being reasonably well-shaped (that is, being of finite cliquewidth). Yet, for the second major family of fragments, known as finite unification sets (short: fus), which are based on first-order-rewritability, corresponding results have been largely elusive so far. We complete the picture by showing that RPQ answering over arbitrary fus rulesets is undecidable. On the positive side, we establish that the problem is decidable for the prominent fus subclass of sticky rulesets, with the caveat that a very mild extension of the RPQ formalism turns the problem undecidable again.
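For readers unfamiliar with regular path queries, the sketch below evaluates an RPQ over a plain, finite edge-labelled graph by searching the product of the graph with a deterministic finite automaton. It is illustrative only: it involves no existential rules, so it does not address the entailment problem whose decidability the paper studies. The graph, the labels, and the function name are hypothetical.

```python
from collections import defaultdict, deque


def eval_rpq(edges, transitions, start_state, accepting_states, source):
    """Return all nodes reachable from `source` along a path whose label
    sequence is accepted by the given DFA.

    edges: iterable of (from_node, label, to_node) triples.
    transitions: dict mapping (state, label) -> next state.
    """
    adjacency = defaultdict(list)
    for from_node, label, to_node in edges:
        adjacency[from_node].append((label, to_node))

    # Breadth-first search over (graph node, automaton state) pairs.
    seen = {(source, start_state)}
    queue = deque(seen)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting_states:
            answers.add(node)
        for label, next_node in adjacency[node]:
            next_state = transitions.get((state, label))
            if next_state is not None and (next_node, next_state) not in seen:
                seen.add((next_node, next_state))
                queue.append((next_node, next_state))
    return answers


# Hypothetical graph and the RPQ "knows+ worksAt" encoded as a DFA.
graph = [("a", "knows", "b"), ("b", "knows", "c"), ("c", "worksAt", "d")]
dfa = {(0, "knows"): 1, (1, "knows"): 1, (1, "worksAt"): 2}
print(eval_rpq(graph, dfa, 0, {2}, "a"))
```

Because every (node, state) pair is visited at most once, the search terminates on any finite graph, even one with cycles.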
New submissions for Monday, 22 July 2024 (showing 2 of 2 entries)
- [3] arXiv:2407.14290 (cross-list from astro-ph.IM) [pdf, html, other]
Title: Evaluation of Provenance Serialisations for Astronomical Provenance
Authors: Michael A. C. Johnson, Marcus Paradies, Hans-Rainer Klöckner, Albina Muzafarova, Kristen Lackeos, David J. Champion, Marta Dembska, Sirko Schindler
Comments: 9 pages, 8 figures, to be published in the 16th International Workshop on Theory and Practice of Provenance
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Databases (cs.DB)
Provenance data from astronomical pipelines are instrumental in establishing trust and reproducibility in the data processing and products. In addition, astronomers can query their provenance to answer questions rooted in areas such as anomaly detection, recommendation, and prediction. The next generation of astronomical survey telescopes, such as the Vera Rubin Observatory or Square Kilometre Array, are capable of producing peta- to exabyte-scale data, thereby amplifying the importance of even small improvements to the efficiency of provenance storage or querying. In order to determine how astronomers should store and query their provenance data, this paper reports on a comparison between the Turtle and JSON provenance serialisations. The triple store Apache Jena Fuseki and the graph database system Neo4j were selected as representative database management systems (DBMS) for Turtle and JSON, respectively. Simulated provenance data were uploaded to and queried over each DBMS, and the metrics measured for comparison were the accuracy and timing of the queries as well as the data upload times. It was found that both serialisations are competent for this purpose, and both have similar query accuracy. The Turtle provenance was found to be more efficient at storing and uploading the data. Regarding queries, for small datasets ($<$5 MB) and simple information retrieval queries, the Turtle serialisation was also found to be more efficient. However, queries over JSON-serialised provenance were found to be more efficient for more complex queries that involved matching patterns across the DBMS; this effect scaled with the size of the queried provenance.
Cross submissions for Monday, 22 July 2024 (showing 1 of 1 entries)
- [4] arXiv:2404.07354 (replaced) [pdf, html, other]
Title: FairEM360: A Suite for Responsible Entity Matching
Subjects: Databases (cs.DB); Computers and Society (cs.CY); Machine Learning (cs.LG)
Entity matching is one of the earliest tasks that occur in the big data pipeline and is alarmingly exposed to unintentional biases that affect the quality of data. Identifying and mitigating the biases that exist in the data or are introduced by the matcher at this stage can contribute to promoting fairness in downstream tasks. This demonstration showcases FairEM360, a framework for 1) auditing the output of entity matchers across a wide range of fairness measures and paradigms, 2) providing potential explanations for the underlying reasons for unfairness, and 3) providing resolutions for the unfairness issues through an exploratory process with human-in-the-loop feedback, utilizing an ensemble of matchers. We aspire for FairEM360 to contribute to the prioritization of fairness as a key consideration in the evaluation of EM pipelines.
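The abstract does not list which fairness measures FairEM360 audits, so the sketch below only illustrates what auditing a matcher's output can look like, using one commonly used measure: true positive rate parity across groups. It is not FairEM360's API. The record layout, group labels, and function names are assumptions made for illustration.

```python
from collections import defaultdict


def true_positive_rate_by_group(records):
    """Compute the true positive rate of a matcher per group.

    records: iterable of (group, is_true_match, predicted_match) tuples,
    one per candidate record pair. Groups with no true matches are omitted.
    """
    true_matches = defaultdict(int)
    found_matches = defaultdict(int)
    for group, is_true_match, predicted_match in records:
        if is_true_match:
            true_matches[group] += 1
            if predicted_match:
                found_matches[group] += 1
    return {g: found_matches[g] / true_matches[g] for g in true_matches}


def true_positive_rate_disparity(records):
    """Largest gap in true positive rate between any two groups."""
    rates = true_positive_rate_by_group(records)
    return max(rates.values()) - min(rates.values())


# Hypothetical matcher output: (group, ground truth, prediction).
audit_sample = [
    ("A", True, True),
    ("A", True, True),
    ("B", True, True),
    ("B", True, False),
    ("B", False, False),
]
print(true_positive_rate_by_group(audit_sample))
print(true_positive_rate_disparity(audit_sample))
```

A disparity of 0 means the matcher recovers true matches equally well for every group; larger values flag groups whose true matches are missed more often.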
- [5] arXiv:2302.02325 (replaced) [pdf, other]
Title: Resilient Consensus Sustained Collaboratively
Comments: 15 pages, 7 figures
Subjects: Cryptography and Security (cs.CR); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
Decentralized systems built around blockchain technology promise clients an immutable ledger. They add a transaction to the ledger after it undergoes consensus among the replicas that run a Proof-of-Stake (PoS) or Byzantine Fault-Tolerant (BFT) consensus protocol. Unfortunately, these protocols face a long-range attack, in which an adversary with access to the private keys of the replicas can rewrite the ledger. One solution is to force each committed block from these protocols to undergo a second consensus, Proof-of-Work (PoW); however, PoW wastes computational resources because miners compete to solve complex puzzles. In this paper, we present the design of our Power-of-Collaboration (PoC) protocol, which guards existing PoS/BFT blockchains against long-range attacks and requires miners to collaborate rather than compete. PoC guarantees fairness and accountability and only marginally degrades the throughput of the underlying system.