"hasSignification()": une nouvelle fonction de distance pour soutenir la détection de données personnelles

Mrabet, Amine; Hassan, Ali; Darmon, Patrice


arXiv:2206.06836 (cs)

[Submitted on 14 Jun 2022]

Title: "hasSignification()": une nouvelle fonction de distance pour soutenir la détection de données personnelles ("hasSignification()": a new distance function to support the detection of personal data)

Authors: Amine Mrabet, Ali Hassan, Patrice Darmon (Umanis)


Abstract: With Big Data and data lakes, we now face a mass of data that is very difficult to manage manually. Protecting personal data in this context requires automatic analysis for data discovery. Storing the names of attributes that have already been analyzed in a knowledge base could optimize this automatic discovery. To keep the knowledge base useful, attributes whose names are meaningless should not be stored. In this article, to check whether an attribute name has a meaning, we propose a solution that calculates the distances between the name and the words in a dictionary. Our studies of distance functions such as N-Gram, Jaro-Winkler and Levenshtein reveal limitations when it comes to setting an acceptance threshold for admitting an attribute into the knowledge base. To overcome these limitations, our solution strengthens the score calculation by using an exponential function based on the longest sequence. In addition, a double scan of the dictionary is proposed to handle attributes that have a compound name.
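
The idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual formula, so everything here is an assumption made for illustration: the names `longest_common_run`, `score` and `has_meaning`, the score `exp(run - longest)`, the default threshold, the minimum part length, and the way the second scan splits a compound name into two parts. It is not the authors' hasSignification() implementation.

```python
import math
from difflib import SequenceMatcher


def longest_common_run(a: str, b: str) -> int:
    """Length of the longest contiguous block of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def score(name: str, word: str) -> float:
    """Hypothetical score in (0, 1], exponential in the longest shared run.

    Assumption: exp(run - longest) is only one plausible shape for an
    'exponential function based on the longest sequence'.
    """
    if not name or not word:
        return 0.0
    run = longest_common_run(name.lower(), word.lower())
    longest = max(len(name), len(word))
    # Equals 1.0 for identical strings and decays quickly as the run shrinks.
    return math.exp(run - longest)


def has_meaning(name, dictionary, threshold=0.5, min_part=3):
    """Return True if the attribute name matches a dictionary word,
    or splits into two parts that each match one (assumed double scan)."""
    # Keep letters only, so 'first_name' and 'firstName' become 'firstname'.
    cleaned = "".join(ch for ch in name.lower() if ch.isalpha())

    def matches(candidate):
        return any(score(candidate, w) >= threshold for w in dictionary)

    # First scan: the whole name against every dictionary word.
    if matches(cleaned):
        return True
    # Second scan: every two-part split whose parts are long enough.
    for i in range(min_part, len(cleaned) - min_part + 1):
        if matches(cleaned[:i]) and matches(cleaned[i:]):
            return True
    return False


words = ["first", "name", "address", "city"]
print(has_meaning("first_name", words))  # compound name, found by the second scan
print(has_meaning("xqzt", words))        # no dictionary word is close
```

Because the score falls off exponentially, a near miss such as "adress" against "address" scores far below an exact match, which leaves a wide gap in which to place an acceptance threshold. That gap is the property the abstract attributes to the exponential score; the specific numbers in this sketch carry no further meaning.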

Comments:	in French language
Subjects:	Computation and Language (cs.CL); Databases (cs.DB)
Cite as:	arXiv:2206.06836 [cs.CL]
	(or arXiv:2206.06836v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2206.06836

Submission history

From: Ali Hassan [via CCSD proxy]
[v1] Tue, 14 Jun 2022 13:31:26 UTC (413 KB)
