KOLD: Korean Offensive Language Dataset

Jeong, Younghoon; Oh, Juhyun; Ahn, Jaimeen; Lee, Jongwon; Mon, Jihyung; Park, Sungjoon; Oh, Alice

arXiv:2205.11315v1 (cs)

[Submitted on 23 May 2022 (this version), latest version 5 Nov 2022 (v2)]

Abstract: Although the detection of hate speech has received considerable attention, most work has been done in English, which makes it hard to apply to other languages. To fill this gap, we present the Korean Offensive Language Dataset (KOLD), 40k comments labeled with offensiveness, target, and targeted group information. We also collect two types of spans, offensive spans and target spans, that justify the categorization decision within the text. Comparing the distribution of targeted groups with that of an existing English dataset, we point out the necessity of a hate speech dataset fitted to the language that best reflects the culture. We report the baseline performance of models built on top of large pretrained language models and trained on our dataset. We also show that title information serves as context and helps discern the target of hatred, especially when the target is omitted from the comment.

Comments:	8 pages, 1 figure
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2205.11315 [cs.CL]
	(or arXiv:2205.11315v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2205.11315

Submission history

From: Younghoon Jeong
[v1] Mon, 23 May 2022 13:58:45 UTC (6,649 KB)
[v2] Sat, 5 Nov 2022 01:36:35 UTC (7,092 KB)
