Comparing Titles vs. Full-text for Multi-Label Classification of Scientific Papers and News Articles

Galke, Lukas; Mai, Florian; Schelten, Alan; Brunsch, Dennis; Scherp, Ansgar

Computer Science > Digital Libraries

arXiv:1705.05311v1 (cs)

[Submitted on 15 May 2017 (this version), latest version 27 Sep 2017 (v2)]

Abstract: To date, there has been no systematic comparison of how well documents can be classified using only their titles. Title-only methods matter because automated processing of titles faces no legal barriers, whereas copyright law often hinders automated classification of full-text and even abstracts. In this paper, we compare established methods such as Bayes, Rocchio, kNN, SVM, and logistic regression, as well as recent methods such as Learning to Rank and neural networks, on the multi-label document classification problem. We demonstrate that classification using only the documents' titles can perform very well, with results very close to those obtained on full-text. We use two established news corpora and two scientific document collections. The experiments are large-scale in terms of both documents per corpus (up to 100,000) and number of labels (up to 10,000). The best method on title data is a modern variant of neural networks. For three datasets, the difference to full-text is very small. For one dataset, a stacking of logistic regression and decision trees performs slightly better than neural networks. Furthermore, we observe that the best methods on titles even outperform several state-of-the-art methods on full-text.
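To make the title-only setting concrete, the following is a minimal sketch of multi-label kNN, one of the established baselines named in the abstract, applied to titles alone. It is not the authors' implementation: the toy titles, the label names, the cosine similarity over raw term counts, and the parameters `k` and `threshold` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", title.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_labels(title, training, k=3, threshold=0.5):
    """Multi-label kNN: keep each label held by at least `threshold` of the k nearest titles."""
    query = Counter(tokenize(title))
    ranked = sorted(
        training,
        key=lambda example: cosine(query, Counter(tokenize(example[0]))),
        reverse=True,
    )
    neighbours = ranked[:k]
    votes = Counter(label for _, labels in neighbours for label in labels)
    return {label for label, n in votes.items() if n / len(neighbours) >= threshold}


# Hypothetical toy corpus: (title, set of labels). Not taken from the paper's datasets.
TRAINING = [
    ("Neural networks for text classification", {"machine learning", "text mining"}),
    ("Deep neural networks for image recognition", {"machine learning", "computer vision"}),
    ("Copyright law and digital libraries", {"law", "digital libraries"}),
    ("Text classification in digital libraries", {"text mining", "digital libraries"}),
    ("Monetary policy and inflation", {"economics"}),
]

predicted = predict_labels("Neural text classification for libraries", TRAINING)
print(sorted(predicted))
```

Because each document can carry several labels, the classifier returns a set rather than a single class. The same function could be run on full-text instead of titles by passing longer strings, which is the kind of comparison the abstract describes, though at a vastly larger scale and with stronger models.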

Comments:	10 pages, 1 figure, 3 tables
Subjects:	Digital Libraries (cs.DL); Computation and Language (cs.CL)
Cite as:	arXiv:1705.05311 [cs.DL]
	(or arXiv:1705.05311v1 [cs.DL] for this version)
	https://doi.org/10.48550/arXiv.1705.05311

Submission history

From: Lukas Galke
[v1] Mon, 15 May 2017 16:07:35 UTC (79 KB)
[v2] Wed, 27 Sep 2017 10:05:49 UTC (133 KB)
