Robust Neural Machine Translation: Modeling Orthographic and Interpunctual Variation

Bergmanis, Toms; Stafanovičs, Artūrs; Pinnis, Mārcis

arXiv:2009.05460 (cs)

[Submitted on 11 Sep 2020 (v1), last revised 14 Sep 2020 (this version, v2)]

Abstract: Neural machine translation systems are typically trained on curated corpora and break when faced with non-standard orthography or punctuation. Resilience to spelling mistakes and typos, however, is crucial, as machine translation systems are used to translate texts of informal origin, such as chat conversations, social media posts and web pages. We propose a simple generative noise model to generate adversarial examples of ten different types. We use these to augment machine translation systems' training data and show that, when tested on noisy data, systems trained using adversarial examples perform almost as well as when translating clean data, while baseline systems' performance drops by 2-3 BLEU points. To measure the robustness and noise invariance of machine translation systems' outputs, we use the average translation edit rate between the translation of the original sentence and the translations of its noised variants. Using this measure, we show that systems trained on adversarial examples yield, on average, 50% consistency improvements compared to baselines trained on clean data.
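The two ideas in the abstract, noising source sentences and scoring how much a system's output changes under noise, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the abstract does not list the ten noise types, so the three operations below (`swap_adjacent_chars`, `drop_char`, `strip_punctuation`) are hypothetical stand-ins, and `edit_rate` is a simplified word-level edit distance normalized by reference length, not full TER (which also counts block shifts). The `translate` argument is a placeholder for any callable that maps a source string to a translation.

```python
import random

PUNCT = ".,!?;:"

# Hypothetical noise operations; the paper's ten types are not given in the abstract.
def swap_adjacent_chars(word, rng):
    """Swap two neighbouring characters (a typical typo)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def drop_char(word, rng):
    """Delete one character, never producing an empty word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

def strip_punctuation(word, rng):
    """Remove punctuation; keep the word unchanged if nothing would remain."""
    return "".join(c for c in word if c not in PUNCT) or word

NOISE_OPS = [swap_adjacent_chars, drop_char, strip_punctuation]

def add_noise(sentence, p=0.1, seed=None):
    """Apply a randomly chosen noise operation to each word with probability p."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if rng.random() < p:
            word = rng.choice(NOISE_OPS)(word, rng)
        out.append(word)
    return " ".join(out)

def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (hw != rw)))
        prev = cur
    return prev[-1]

def edit_rate(hyp, ref):
    """Edit distance normalized by reference length (simplified; no shifts as in TER)."""
    return word_edit_distance(hyp, ref) / max(len(ref.split()), 1)

def average_edit_rate(translate, sentence, noised_variants):
    """Mean edit rate between the translation of the clean sentence and of each variant."""
    if not noised_variants:
        raise ValueError("need at least one noised variant")
    reference = translate(sentence)
    rates = [edit_rate(translate(v), reference) for v in noised_variants]
    return sum(rates) / len(rates)
```

Under these assumptions, a lower `average_edit_rate` means the system's output changes less when the input is noised. Calling `add_noise` on training source sentences while leaving the targets untouched is one plausible way to build augmented training pairs, though the abstract does not specify the authors' exact procedure.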

Comments:	Accepted in BALTIC HLT 2020
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2009.05460 [cs.CL]
	(or arXiv:2009.05460v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2009.05460

Submission history

From: Toms Bergmanis
[v1] Fri, 11 Sep 2020 14:12:54 UTC (268 KB)
[v2] Mon, 14 Sep 2020 11:16:38 UTC (268 KB)
