Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC

Park, Chanjun; Shim, Midan; Eo, Sugyeong; Lee, Seolhwa; Seo, Jaehyung; Moon, Hyeonseok; Lim, Heuiseok

Computer Science > Computation and Language

arXiv:2110.15023 (cs)

[Submitted on 28 Oct 2021]


Abstract: A machine translation (MT) system aims to translate a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora for Korean are relatively scarce compared to those for other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora using Linguistic Inquiry and Word Count (LIWC) and several related experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features on a dictionary basis. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Based on a correlation analysis between LIWC features and NMT performance, our findings suggest directions for further research toward obtaining higher-quality parallel corpora.
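The abstract describes two steps: dictionary-based word counting and a correlation analysis between the resulting features and NMT performance. The following is a minimal sketch of that general idea only, not the authors' pipeline and not the LIWC software. The dictionary `TOY_DICTIONARY`, its two categories, and the function names `category_rates` and `pearson` are hypothetical and introduced purely for illustration; the real LIWC dictionary is proprietary and far larger, and the abstract does not say which correlation measure was used (Pearson is assumed here).

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical toy dictionary (category -> words). This is NOT the LIWC dictionary.
TOY_DICTIONARY = {
    "pronoun": {"i", "you", "we", "they", "it"},
    "negation": {"no", "not", "never"},
}


def category_rates(text, dictionary=TOY_DICTIONARY):
    """Return, for each category, the percentage of tokens that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    total = len(tokens)
    return {
        category: (100.0 * counts[category] / total if total else 0.0)
        for category in dictionary
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Under these assumptions, one would compute `category_rates` once per corpus and then pass the per-corpus values of a single category, together with a per-corpus translation quality score, to `pearson`. Any scores used that way are the caller's own inputs; none are taken from the paper.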

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2110.15023 [cs.CL]
	(or arXiv:2110.15023v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2110.15023

Submission history

From: Chanjun Park
[v1] Thu, 28 Oct 2021 11:15:54 UTC (4,503 KB)
