What Kind of Language Is Hard to Language-Model?

Mielke, Sabrina J.; Cotterell, Ryan; Gorman, Kyle; Roark, Brian; Eisner, Jason

arXiv:1906.04726 (cs)

[Submitted on 11 Jun 2019 (v1), last revised 25 Feb 2020 (this version, v2)]

Abstract: How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that "translationese" is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.
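To make the idea of a paired-sample multiplicative model more concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the cost (for example, the surprisal in bits) of sentence i in language j factors as a sentence effect times a language effect, so that log y_ij is approximately log n_i + d_j. The function name `fit_difficulties`, the alternating-averages fitting procedure, and the zero-mean constraint on the difficulty coefficients are all assumptions made for illustration. Missing translations are marked as NaN and are simply skipped, which is one simple way to handle a corpus that is only "at-least-pairwise" parallel.

```python
import numpy as np

def fit_difficulties(Y, n_iters=200):
    """Fit log Y[i, j] ~ log_n[i] + d[j] using only observed (non-NaN) cells.

    Y is a (sentences x languages) array of positive costs; NaN marks a
    missing translation. Each row is assumed to have at least two observed
    entries and each column at least one. This is a hypothetical
    least-squares simplification, not the authors' estimator.
    """
    logY = np.log(Y)                 # NaN cells stay NaN
    mask = ~np.isnan(logY)           # True where a translation exists
    log_n = np.zeros(Y.shape[0])     # per-sentence effect (log scale)
    d = np.zeros(Y.shape[1])         # per-language difficulty (log scale)
    for _ in range(n_iters):
        # Sentence effect: mean residual over the languages that have it.
        resid = np.where(mask, logY - d[None, :], 0.0)
        log_n = resid.sum(axis=1) / mask.sum(axis=1)
        # Language effect: mean residual over the sentences observed in it.
        resid = np.where(mask, logY - log_n[:, None], 0.0)
        d = resid.sum(axis=0) / mask.sum(axis=0)
        # Identifiability: difficulties are only defined up to a shared
        # constant, so center them and move the constant into log_n.
        shift = d.mean()
        d = d - shift
        log_n = log_n + shift
    return log_n, d
```

Under these assumptions, `np.exp(d[j])` reads as a multiplicative difficulty factor for language j relative to the average language in the sample. Because every sentence contributes its own effect `log_n[i]`, a language is not penalized merely for having been observed on intrinsically harder sentences.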

Comments:	Published at ACL 2019
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:1906.04726 [cs.CL]
	(or arXiv:1906.04726v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1906.04726

Submission history

From: Sabrina Mielke
[v1] Tue, 11 Jun 2019 17:56:08 UTC (279 KB)
[v2] Tue, 25 Feb 2020 18:38:57 UTC (279 KB)
