Formalising lexical and syntactic diversity for data sampling in French

Estève, Louis; Scholivet, Manon; Savary, Agata

arXiv:2501.08003 (cs)

[Submitted on 14 Jan 2025]

Abstract: Diversity is an important property of datasets, and sampling data for diversity is useful in dataset creation. Finding the optimally diverse sample is expensive; we therefore present a heuristic that significantly increases diversity relative to random sampling. We also explore whether different kinds of diversity (lexical and syntactic) correlate, with the aim of sampling for expensive syntactic diversity by way of inexpensive lexical diversity. We find that the correlations fluctuate across datasets and across versions of the diversity measures. This shows that an arbitrarily chosen measure may fall short of capturing diversity-related properties of datasets.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2501.08003 [cs.CL]
	(or arXiv:2501.08003v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.08003

Submission history

From: Manon Scholivet
[v1] Tue, 14 Jan 2025 10:47:33 UTC (219 KB)
