Computer Science > Databases
[Submitted on 26 Mar 2025]
Title:RED2Hunt: an Actionable Framework for Cleaning Operational Databases with Surrogate Keys
View PDF HTML (experimental)Abstract:Surrogate keys are now extensively utilized by database designers to implement keys in SQL tables. They are straightforward, easy to understand, and enable efficient access, despite lacking any real-world semantic meaning. In this context, complex redundancy issues might emerge and often go unnoticed as long as they do not affect the operational applications built on top of the databases. These issues become evident when organizations seek to leverage data science, posing significant challenges to the implementation of analytical projects.
This paper, grounded in real-world applications, defines the concept of artificial unicity and proposes RED2Hunt (RElational Databases REDundancy Hunting), a human-in-the-loop framework for identifying hidden redundancy and, if problems occur, cleaning relational databases implemented with surrogate keys. We first define the central and intricate notion of artificial unicity and then the RED2Hunt framework to address it. We rely on simple abstractions easy to visualize based on the so-called redundancy profile associated to some relations and the notion of attribute stability. Quite interestingly, those profiles can be computed very efficiently in quasi-linear time. We have devised different metrics to guide the domain expert and an actionable framework to generate new redundancy-free databases. The proposed framework was implemented on top of PostgreSQL. From the publicly available IMDB database, we have generated synthetic databases, implementing different redundancy scenarios, on which we tested RED2Hunt to study its scalability. RED2Hunt has also been tested on operational databases implemented with surrogate keys. Lessons learned from these real-life applications are discussed.
References & Citations
Bibliographic and Citation Tools
Bibliographic Explorer (What is the Explorer?)
Connected Papers (What is Connected Papers?)
Litmaps (What is Litmaps?)
scite Smart Citations (What are Smart Citations?)
Code, Data and Media Associated with this Article
alphaXiv (What is alphaXiv?)
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub (What is DagsHub?)
Gotit.pub (What is GotitPub?)
Hugging Face (What is Huggingface?)
Papers with Code (What is Papers with Code?)
ScienceCast (What is ScienceCast?)
Demos
Recommenders and Search Tools
Influence Flower (What are Influence Flowers?)
CORE Recommender (What is CORE?)
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.