Patching Leaks in the Charformer for Efficient Character-Level Generation

Edman, Lukas; Toral, Antonio; van Noord, Gertjan

arXiv:2205.14086 (cs)

[Submitted on 27 May 2022]

Abstract: Character-based representations have important advantages over subword-based ones for morphologically rich languages. They come with increased robustness to noisy input and do not need a separate tokenization step. However, they also have a crucial disadvantage: they notably increase the length of text sequences. The GBST method from Charformer groups (aka downsamples) characters to solve this, but it allows information to leak when applied to a Transformer decoder. We solve this information leak issue, thereby enabling character grouping in the decoder. We show that Charformer downsampling has no apparent benefits in NMT over previous downsampling methods in terms of translation quality; however, it can be trained roughly 30% faster. Promising performance on English-Turkish translation indicates the potential of character-level models for morphologically rich languages.
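To illustrate the kind of leak the abstract refers to, here is a minimal sketch, not the GBST method itself and not the authors' fix. It assumes plain mean pooling over fixed-size blocks as a stand-in for downsampling, and a one-block right shift as one generic way to restore causality. The function names `mean_pool_blocks` and `shift_right` are hypothetical.

```python
def mean_pool_blocks(values, block_size):
    """Downsample by averaging consecutive blocks (a stand-in for character grouping)."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        blocks.append(sum(block) / len(block))
    return blocks


def shift_right(blocks, pad=0.0):
    """Assumed workaround: block i is replaced by block i-1, so it holds no current-block data."""
    return [pad] + blocks[:-1]


# Two toy "character" sequences that differ only at index 1.
original = [1.0, 2.0, 3.0, 4.0]
changed = [1.0, 9.0, 3.0, 4.0]

pooled_original = mean_pool_blocks(original, 2)
pooled_changed = mean_pool_blocks(changed, 2)

# Block 0 covers characters 0 and 1. A decoder that predicts character 1
# from block 0 would be conditioning on character 1 itself: a leak.
leak = pooled_original[0] != pooled_changed[0]

# After shifting, the representation at block position 0 no longer depends
# on any character inside block 0.
no_leak = shift_right(pooled_original)[0] == shift_right(pooled_changed)[0]

print(leak, no_leak)
```

In this toy setup the pooled value for a block depends on every character in that block, including ones the decoder has not generated yet, which is why naive grouping breaks autoregressive decoding.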

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2205.14086 [cs.CL]
	(or arXiv:2205.14086v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2205.14086

Submission history

From: Lukas Edman
[v1] Fri, 27 May 2022 16:36:45 UTC (107 KB)
