ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language

Piau, Marcos; Lotufo, Roberto; Nogueira, Rodrigo

arXiv:2406.10806 (cs)

[Submitted on 16 Jun 2024]

Abstract: Despite advancements in Natural Language Processing (NLP) and the growing availability of pretrained models, the English language remains the primary focus of model development. Continued pretraining on language-specific corpora provides a practical solution for adapting models to other languages. However, the impact of different pretraining settings on downstream tasks remains underexplored. This work introduces $\texttt{ptt5-v2}$, investigating the continued pretraining of T5 models for Portuguese. We first develop a baseline set of settings and pretrain models with sizes up to 3B parameters. Finetuning on three Portuguese downstream tasks (assin2 STS, assin2 RTE, and TweetSentBR) yields SOTA results on the latter two. We then explore the effects of different pretraining configurations, including quality filters, optimization strategies, and multi-epoch pretraining. Perhaps surprisingly, their impact remains subtle compared to our baseline. We release $\texttt{ptt5-v2}$ pretrained checkpoints and the finetuned MonoT5 rerankers on HuggingFace.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Cite as:	arXiv:2406.10806 [cs.CL]
	(or arXiv:2406.10806v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.10806

Submission history

From: Marcos Piau Vieira
[v1] Sun, 16 Jun 2024 05:17:56 UTC (131 KB)
