When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems

Stengel-Eskin, Elias; Platanios, Emmanouil Antonios; Pauls, Adam; Thomson, Sam; Fang, Hao; Van Durme, Benjamin; Eisner, Jason; Su, Yu


arXiv:2205.12228 (cs)

[Submitted on 24 May 2022 (v1), last revised 8 Nov 2022 (this version, v2)]

Abstract: In natural language understanding (NLU) production systems, users' evolving needs necessitate the addition of new features over time, indexed by new symbols added to the meaning representation space. This requires additional training data and results in ever-growing datasets. We present the first systematic investigation of this incremental symbol learning scenario. Our analysis reveals a troubling quirk in building broad-coverage NLU systems: as the training dataset grows, performance on the new symbol often decreases if we do not accordingly increase its training data. This suggests that it becomes more difficult to learn new symbols with a larger training dataset. We show that this trend holds for multiple mainstream models on two common NLU tasks: intent recognition and semantic parsing. Rejecting class imbalance as the sole culprit, we reveal that the trend is closely associated with an effect we call source signal dilution, where strong lexical cues for the new symbol become diluted as the training dataset grows. Selectively dropping training examples to prevent dilution often reverses the trend, showing the over-reliance of mainstream neural NLU models on simple lexical cues. Code, models, and data are available at this https URL
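
Illustration (not from the paper): the dilution effect described in the abstract can be pictured with a toy count-based estimate of how strongly a single word predicts the new symbol. The sketch below is only an assumption about one simple way to quantify a "lexical cue", namely the fraction of training utterances containing the word that carry the new symbol. The function name `cue_strength`, the intent labels, and the utterances are all hypothetical and are not taken from the paper's code or data.

```python
def cue_strength(examples, cue, symbol):
    """Estimate P(symbol | cue word appears in the utterance) by counting.

    `examples` is a list of (utterance, label) pairs. This is a toy proxy
    for a lexical cue, not the paper's definition of source signal dilution.
    """
    labels_with_cue = [
        label for text, label in examples if cue in text.lower().split()
    ]
    if not labels_with_cue:
        return 0.0
    matches = sum(1 for label in labels_with_cue if label == symbol)
    return matches / len(labels_with_cue)


# Hypothetical toy data: a fixed handful of examples for the new symbol.
new_symbol_examples = [
    ("weather in boston", "GetWeather"),
    ("what is the weather tomorrow", "GetWeather"),
]

# A small background dataset in which the cue word never appears.
small_background = [
    ("book a flight to denver", "BookFlight"),
    ("play some jazz", "PlayMusic"),
]

# Additional background data that happens to contain the cue word
# under other labels, while the new symbol's data stays fixed.
extra_background = [
    ("cancel my flight if the weather is bad", "CancelFlight"),
    ("play the weather channel", "PlayMedia"),
]

small = new_symbol_examples + small_background
large = small + extra_background

small_strength = cue_strength(small, "weather", "GetWeather")
large_strength = cue_strength(large, "weather", "GetWeather")
print(small_strength, large_strength)
```

In this toy setting the word "weather" perfectly predicts the new symbol in the small dataset, but only half of the utterances containing it carry that symbol once the extra background data is added, even though the number of new-symbol examples is unchanged.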

Comments:	EMNLP 2022
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2205.12228 [cs.CL]
	(or arXiv:2205.12228v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2205.12228

Submission history

From: Elias Stengel-Eskin
[v1] Tue, 24 May 2022 17:36:27 UTC (2,985 KB)
[v2] Tue, 8 Nov 2022 13:45:14 UTC (3,524 KB)
