Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

Mathur, Nitika; Baldwin, Timothy; Cohn, Trevor

arXiv:2006.06264 (cs)

[Submitted on 11 Jun 2020 (v1), last revised 12 Jun 2020 (this version, v2)]

Abstract: Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric's efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.

Comments:	Accepted at ACL 2020
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2006.06264 [cs.CL]
	(or arXiv:2006.06264v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2006.06264

Submission history

From: Nitika Mathur
[v1] Thu, 11 Jun 2020 09:12:53 UTC (480 KB)
[v2] Fri, 12 Jun 2020 04:35:41 UTC (480 KB)
