Construction of Large-Scale Misinformation Labeled Datasets from Social Media Discourse using Label Refinement

Sharma, Karishma; Ferrara, Emilio; Liu, Yan

Computer Science > Social and Information Networks

arXiv:2202.12413 (cs)

[Submitted on 24 Feb 2022]

Abstract: Malicious accounts spreading misinformation have led to widespread false and misleading narratives in recent times, especially during the COVID-19 pandemic, and social media platforms struggle to eliminate this content rapidly. This is because adapting to new domains requires human-intensive fact-checking that is slow and difficult to scale. To address this challenge, we propose to leverage news-source credibility labels as weak labels for social media posts and propose model-guided refinement of labels to construct large-scale, diverse misinformation labeled datasets in new domains. The weak labels can be inaccurate at the article or social media post level where the stance of the user does not align with the news source or article credibility. We propose a framework that uses a detection model self-trained on the initial weak labels, with uncertainty sampling based on the entropy of the model's predictions, to identify potentially inaccurate labels and correct them using self-supervision or relabeling. The framework incorporates the social context of each post, in terms of the community of its associated user, to surface inaccurate labels and build a large-scale dataset with minimal human effort. To provide labeled datasets that distinguish misleading narratives, where information might be missing significant context or have inaccurate ancillary details, the proposed framework uses the few labeled samples as class prototypes to separate high-confidence samples into false, unproven, mixture, mostly false, mostly true, true, and debunk information. The approach is demonstrated by constructing a large-scale misinformation dataset on COVID-19 vaccines.
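The two selection steps named in the abstract, entropy-based uncertainty sampling and assignment to class prototypes, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the entropy threshold, the toy probabilities, the two-dimensional embeddings, and the use of squared Euclidean distance to the nearest prototype are all assumptions made for illustration.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def flag_uncertain(pred_probs, threshold):
    """Indices of posts whose prediction entropy exceeds the threshold.

    High entropy means the model is unsure, so the weak label on that
    post is a candidate for relabeling (hypothetical selection rule).
    """
    return [i for i, probs in enumerate(pred_probs)
            if prediction_entropy(probs) > threshold]

def nearest_prototype(embedding, prototypes):
    """Label of the closest class prototype.

    Squared Euclidean distance is an assumed similarity measure.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: sq_dist(embedding, prototypes[label]))

# Toy model outputs: P(unreliable), P(reliable) for three posts (made-up values).
pred_probs = [
    [0.98, 0.02],  # confident prediction
    [0.55, 0.45],  # near the decision boundary
    [0.10, 0.90],  # confident prediction
]
uncertain = flag_uncertain(pred_probs, threshold=0.5)

# Toy prototypes: one embedding per fine-grained class (made-up values).
prototypes = {
    "false": [0.0, 1.0],
    "mixture": [0.5, 0.5],
    "true": [1.0, 0.0],
}
assigned = nearest_prototype([0.9, 0.2], prototypes)

print(uncertain)
print(assigned)
```

In this toy run only the second post is flagged, since its prediction is close to uniform, and the example embedding is assigned to the "true" prototype. In practice the threshold and the embedding space would have to be chosen on held-out labeled data.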

Subjects:	Social and Information Networks (cs.SI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2202.12413 [cs.SI]
	(or arXiv:2202.12413v1 [cs.SI] for this version)
	https://doi.org/10.48550/arXiv.2202.12413
Journal reference:	WWW (2022)

Submission history

From: Karishma Sharma
[v1] Thu, 24 Feb 2022 23:10:29 UTC (712 KB)
