Computer Science > Digital Libraries
[Submitted on 15 May 2017 (this version), latest version 27 Sep 2017 (v2)]
Title: Comparing Titles vs. Full-text for Multi-Label Classification of Scientific Papers and News Articles
Abstract: To date, there has been no systematic comparison of how well documents can be classified using only their titles. Title-only methods are important, however, because automated processing of titles faces no legal barriers, whereas copyright laws often hinder automated document classification on full-text and even on abstracts. In this paper, we compare established methods such as Bayes, Rocchio, kNN, SVM, and logistic regression, as well as recent methods such as Learning to Rank and neural networks, on the multi-label document classification problem. We demonstrate that classification using only the documents' titles can be very good and very close to the results obtained with full-text. We use two established news corpora and two scientific document collections. The experiments are large-scale in terms of both documents per corpus (up to 100,000) and number of labels (up to 10,000). The best method on title data is a modern variant of neural networks. For three datasets, the difference to full-text is very small. For one dataset, a stacking of logistic regression and decision trees performs slightly better than neural networks. Furthermore, we observe that the best methods on titles outperform several state-of-the-art methods on full-text.
Submission history
From: Lukas Galke
[v1] Mon, 15 May 2017 16:07:35 UTC (79 KB)
[v2] Wed, 27 Sep 2017 10:05:49 UTC (133 KB)