ZeroBERTo: Leveraging Zero-Shot Text Classification by Topic Modeling

Alcoforado, Alexandre; Ferraz, Thomas Palmeira; Gerber, Rodrigo; Bustos, Enzo; Oliveira, André Seidel; Veloso, Bruno Miguel; Siqueira, Fabio Levy; Costa, Anna Helena Reali

doi:10.1007/978-3-030-98305-5_12

arXiv:2201.01337 (cs)

[Submitted on 4 Jan 2022 (v1), last revised 4 Jun 2022 (this version, v3)]

Abstract: Traditional text classification approaches often require a large amount of labeled data, which is difficult to obtain, especially in restricted domains or less widespread languages. This lack of labeled data has led to the rise of low-resource methods, which assume low data availability in natural language processing. Among them, zero-shot learning stands out: it consists of learning a classifier without any previously labeled data. The best results reported with this approach use language models such as Transformers, but they suffer from two problems: high execution time and an inability to handle long texts as input. This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task. We show that ZeroBERTo performs better on long inputs and runs in less time, outperforming XLM-R by about 12% in F1 score on the FolhaUOL dataset.

Keywords: Low-Resource NLP, Unlabeled data, Zero-Shot Learning, Topic Modeling, Transformers.
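The idea of clustering first and classifying a compressed representation afterwards can be illustrated with a toy sketch. This is not the authors' implementation: the greedy word-overlap clustering stands in for a real topic model, and `toy_zero_shot_score` with its hand-written `LEXICON` is a hypothetical stand-in for a Transformer-based zero-shot classifier. All names and data below are assumptions made for illustration only. The point it shows is that the (expensive) scorer is called once per cluster and label, rather than once per document and label.

```python
from collections import Counter

# Hypothetical stand-in for a zero-shot model: a tiny hand-written lexicon.
LEXICON = {
    "sports": {"football", "match", "goal", "team"},
    "economy": {"bank", "inflation", "interest", "rates"},
}

def tokens(text):
    """Lowercase whitespace tokens longer than three characters."""
    return [w for w in text.lower().split() if len(w) > 3]

def cluster(docs, min_shared=2):
    """Greedy clustering: a document joins the first cluster that shares
    at least `min_shared` tokens with it, otherwise it starts a new one."""
    clusters = []
    for i, doc in enumerate(docs):
        words = set(tokens(doc))
        for c in clusters:
            if len(words & c["vocab"]) >= min_shared:
                c["vocab"] |= words
                c["members"].append(i)
                break
        else:
            clusters.append({"vocab": set(words), "members": [i]})
    return clusters

def topic_words(docs, members, k=5):
    """Compressed cluster representation: its k most frequent tokens."""
    counts = Counter(w for i in members for w in tokens(docs[i]))
    return [w for w, _ in counts.most_common(k)]

def toy_zero_shot_score(text, label):
    """Assumed scorer: counts tokens of `text` found in the label lexicon."""
    return sum(1 for w in tokens(text) if w in LEXICON[label])

def classify_via_clusters(docs, labels):
    """Score each cluster's topic words once, then propagate the best
    label to every member. Returns (predictions, number of scorer calls)."""
    predictions = [None] * len(docs)
    calls = 0
    for c in cluster(docs):
        summary = " ".join(topic_words(docs, c["members"]))
        scores = {}
        for label in labels:
            scores[label] = toy_zero_shot_score(summary, label)
            calls += 1
        best = max(labels, key=lambda lab: scores[lab])
        for i in c["members"]:
            predictions[i] = best
    return predictions, calls

DOCS = [
    "the team won the football match after a late goal",
    "the football team lost the match despite an early goal",
    "the central bank raised interest rates to fight inflation",
    "inflation fell after the bank raised interest rates again",
]

if __name__ == "__main__":
    preds, calls = classify_via_clusters(DOCS, ["sports", "economy"])
    print(preds, calls)
```

With these four toy documents and two labels, the scorer runs once per cluster and label instead of once per document and label, and each short topic summary replaces the full text as classifier input. In a realistic setting the savings would depend on how many clusters the topic model produces relative to the corpus size.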

Comments:	Accepted at PROPOR 2022: 15th International Conference on Computational Processing of Portuguese
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2201.01337 [cs.CL]
	(or arXiv:2201.01337v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2201.01337
Journal reference:	In: Pinheiro V. et al. (eds) Computational Processing of the Portuguese Language. PROPOR 2022. Lecture Notes in Computer Science, vol 13208. Springer, Cham
Related DOI:	https://doi.org/10.1007/978-3-030-98305-5_12

Submission history

From: Thomas Palmeira Ferraz
[v1] Tue, 4 Jan 2022 20:08:17 UTC (366 KB)
[v2] Thu, 27 Jan 2022 17:46:32 UTC (83 KB)
[v3] Sat, 4 Jun 2022 21:02:16 UTC (63 KB)
