An exploration of features to improve the generalisability of fake news detection models

Hoy, Nathaniel; Koulouri, Theodora

doi:10.1016/j.eswa.2025.126949

Abstract:Fake news poses global risks by influencing elections and spreading misinformation, making detection critical. Existing NLP and supervised Machine Learning methods perform well under cross-validation but struggle to generalise across datasets, even within the same domain. This issue stems from coarsely labelled training data, where articles are labelled based on their publisher, introducing biases that token-based models like TF-IDF and BERT are sensitive to. While Large Language Models (LLMs) offer promise, their application in fake news detection remains limited. This study demonstrates that meaningful features can still be extracted from coarsely labelled data to improve real-world robustness. Stylistic features-lexical, syntactic, and semantic-are explored due to their reduced sensitivity to dataset biases. Additionally, novel social-monetisation features are introduced, capturing economic incentives behind fake news, such as advertisements, external links, and social media elements. The study trains on the coarsely labelled NELA 2020-21 dataset and evaluates using the manually labelled Facebook URLs dataset, a gold standard for generalisability. Results highlight the limitations of token-based models trained on biased data and contribute to the scarce evidence on LLMs like LLaMa in this field. Findings indicate that stylistic and social-monetisation features offer more generalisable predictions than token-based methods and LLMs. Statistical and permutation feature importance analyses further reveal their potential to enhance performance and mitigate dataset biases, providing a path forward for improving fake news detection.

Comments:	Accepted at Expert Systems with Applications (Elsevier)
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2502.20299 [cs.LG]
	(or arXiv:2502.20299v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2502.20299
Related DOI:	https://doi.org/10.1016/j.eswa.2025.126949

Computer Science > Machine Learning

Title:An exploration of features to improve the generalisability of fake news detection models

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators