A Large-Scale Chinese Short-Text Conversation Dataset

Wang, Yida; Ke, Pei; Zheng, Yinhe; Huang, Kaili; Jiang, Yong; Zhu, Xiaoyan; Huang, Minlie

arXiv:2008.03946 (cs)

[Submitted on 10 Aug 2020 (v1), last revised 26 Apr 2022 (this version, v2)]

Abstract: Recent neural dialogue generation models show promising results in modeling short-text conversations. However, training such models usually requires a large-scale, high-quality dialogue corpus, which is hard to obtain. In this paper, we present LCCC, a large-scale cleaned Chinese conversation dataset that comes in a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of the dataset is ensured by a rigorous data cleaning pipeline built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-trained dialogue models trained on LCCC-base and LCCC-large, respectively. The cleaned dataset and the pre-trained models will facilitate research on short-text conversation modeling. All the models and datasets are available at this https URL.
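
The abstract describes the cleaning pipeline only as a set of rules combined with a trained classifier. As a rough illustration of that two-stage shape, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the specific rules (length cap, URL removal, echo removal), the names `passes_rules`, `clean`, and `dummy_classifier`, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
import re
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (post, response)

# Hypothetical rules chosen for illustration only; the abstract does not
# list the rules actually used to build LCCC.
URL_PATTERN = re.compile(r"https?://\S+")
MAX_CHARS = 50


def passes_rules(pair: Pair) -> bool:
    """Stage 1: cheap rule-based checks on a (post, response) pair."""
    post, response = (text.strip() for text in pair)
    for text in (post, response):
        if not text or len(text) > MAX_CHARS:
            return False  # empty or overly long utterance
        if URL_PATTERN.search(text):
            return False  # utterance contains a URL
    return post != response  # drop responses that merely echo the post


def clean(
    pairs: Iterable[Pair],
    classifier: Callable[[Pair], float],
    threshold: float = 0.5,
) -> List[Pair]:
    """Stage 2: keep rule-passing pairs whose classifier score is high enough."""
    return [
        pair
        for pair in pairs
        if passes_rules(pair) and classifier(pair) >= threshold
    ]


def dummy_classifier(pair: Pair) -> float:
    """Stand-in for a trained quality classifier (assumption, not a real model)."""
    return 0.9 if len(pair[1]) >= 2 else 0.1


pairs = [
    ("你好", "你好呀"),
    ("看这个 http://x.cn", "好的"),
    ("在吗", "在吗"),
    ("吃了吗", "嗯"),
]
cleaned = clean(pairs, dummy_classifier)
```

Running the rules first means the (presumably more expensive) classifier only scores pairs that survive the cheap checks. In a real setting `dummy_classifier` would be replaced by a model trained on annotated pairs.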

Comments:	Accepted by NLPCC 2020 (Best Student Paper)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2008.03946 [cs.CL]
	(or arXiv:2008.03946v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2008.03946

Submission history

From: Dr. Yinhe Zheng
[v1] Mon, 10 Aug 2020 08:12:49 UTC (33 KB)
[v2] Tue, 26 Apr 2022 07:07:56 UTC (1,046 KB)
