LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

Kuzman, Taja; Ljubešić, Nikola

doi:10.1109/ACCESS.2025.3544814

Computer Science > Computation and Language

arXiv:2411.19638 (cs)

[Submitted on 29 Nov 2024]

Title:LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

Authors:Taja Kuzman, Nikola Ljubešić

View PDF HTML (experimental)

Abstract:With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers' access to relevant content. To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news classification models of reasonable size with no need for manual data annotation. The framework employs a Generative Pretrained Transformer (GPT) model as the teacher model to develop an IPTC Media Topic training dataset through automatic annotation of news articles in Slovenian, Croatian, Greek, and Catalan. The teacher model exhibits a high zero-shot performance on all four languages. Its agreement with human annotators is comparable to that between the human annotators themselves. To mitigate the computational limitations associated with the requirement of processing millions of texts daily, smaller BERT-like student models are fine-tuned on the GPT-annotated dataset. These student models achieve high performance comparable to the teacher model. Furthermore, we explore the impact of the training data size on the performance of the student models and investigate their monolingual, multilingual and zero-shot cross-lingual capabilities. The findings indicate that student models can achieve high performance with a relatively small number of training instances, and demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the best-performing news topic classifier, enabling multilingual classification with the top-level categories of the IPTC Media Topic schema.

Comments:	This work has been submitted to the IEEE for possible publication
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2411.19638 [cs.CL]
	(or arXiv:2411.19638v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2411.19638
Journal reference:	IEEE Access 2025
Related DOI:	https://doi.org/10.1109/ACCESS.2025.3544814

Submission history

From: Taja Kuzman [view email]
[v1] Fri, 29 Nov 2024 11:42:58 UTC (2,157 KB)

Computer Science > Computation and Language

Title:LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computation and Language

Title:LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators