WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction

Wu, Qiyu; Nagata, Masaaki; Tsuruoka, Yoshimasa

arXiv:2306.05644 (cs)

[Submitted on 9 Jun 2023 (v1), last revised 19 Oct 2023 (this version, v2)]

Abstract: Most existing word alignment methods rely on manual alignment datasets or parallel corpora, which limits their usefulness. Here, to mitigate the dependence on manual data, we broaden the source of supervision by relaxing the requirement for correct, fully aligned, and parallel sentences. Specifically, we construct noisy, partially aligned, and non-parallel paragraphs. We then use this large-scale weakly supervised dataset for word alignment pre-training via span prediction. Extensive experiments with various settings empirically demonstrate that our approach, named WSPAlign, is an effective and scalable way to pre-train word aligners without manual data. When fine-tuned on standard benchmarks, WSPAlign sets a new state of the art, improving upon the best supervised baseline by 3.3 to 6.1 points in F1 and 1.5 to 6.1 points in AER. Furthermore, WSPAlign achieves competitive performance compared with the corresponding baselines in few-shot, zero-shot, and cross-lingual tests, which demonstrates that WSPAlign is potentially more practical for low-resource languages than existing methods.
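
The F1 and AER figures quoted in the abstract are standard word alignment metrics. As a rough illustration only (this is not the authors' evaluation script), the sketch below computes them using the commonly used definitions over "sure" and "possible" gold links. The function name `alignment_scores` and the toy link sets are hypothetical; when a benchmark provides only sure links, the possible set is assumed to equal the sure set.

```python
def alignment_scores(predicted, sure, possible=None):
    """Return (precision, recall, f1, aer) for sets of (src_idx, tgt_idx) links.

    Assumes the conventional definitions:
      precision = |A & P| / |A|
      recall    = |A & S| / |S|
      AER       = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    where A = predicted links, S = sure links, P = possible links (P includes S).
    """
    a = set(predicted)
    s = set(sure)
    # Possible links conventionally include the sure links.
    p = s | set(possible or ())
    if not a or not s:
        raise ValueError("predicted and sure link sets must be non-empty")
    precision = len(a & p) / len(a)
    recall = len(a & s) / len(s)
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    aer = 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, f1, aer


# Toy example with made-up links (hypothetical data, not from the paper).
predicted = {(0, 0), (1, 2), (2, 1), (3, 3)}
sure = {(0, 0), (1, 2), (2, 2)}
possible = {(2, 1)}
precision, recall, f1, aer = alignment_scores(predicted, sure, possible)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} AER={aer:.3f}")
```

Note that a higher F1 is better while a lower AER is better, so an "improvement" in AER corresponds to a lower error rate.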

Comments:	ACL 2023 main conference long paper
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2306.05644 [cs.CL]
	(or arXiv:2306.05644v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.05644

Submission history

From: Qiyu Wu
[v1] Fri, 9 Jun 2023 03:11:42 UTC (412 KB)
[v2] Thu, 19 Oct 2023 05:47:52 UTC (412 KB)
