Peacock: Learning Long-Tail Topic Features for Industrial Applications

Wang, Yi; Zhao, Xuemin; Sun, Zhenlong; Yan, Hao; Wang, Lifeng; Jin, Zhihui; Wang, Liubin; Gao, Yang; Law, Ching; Zeng, Jia

arXiv:1405.4402 (cs)

[Submitted on 17 May 2014 (v1), last revised 6 Dec 2014 (this version, v3)]

Abstract: Latent Dirichlet allocation (LDA) is a popular topic modeling technique in academia but less so in industry, especially in large-scale applications involving search engines and online advertising systems. A main underlying reason is that the topic models used have been too small in scale to be useful; for example, some of the largest LDA models reported in the literature have up to $10^3$ topics, which can hardly cover the long-tail semantic word sets. In this paper, we show that the number of topics is a key factor that can significantly boost the utility of topic-modeling systems. In particular, we show that a "big" LDA model with at least $10^5$ topics inferred from $10^9$ search queries can achieve a significant improvement on industrial search engine and online advertising systems, both of which serve hundreds of millions of users. We develop a novel distributed system called Peacock to learn big LDA models from big data. The main features of Peacock include a hierarchical distributed architecture, real-time prediction, and topic de-duplication. We empirically demonstrate that the Peacock system is capable of providing significant benefits via highly scalable LDA topic models for several industrial applications.

Comments:	23 pages, 11 figures, ACM Transactions on Intelligent Systems and Technology, 2015
Subjects:	Information Retrieval (cs.IR); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:1405.4402 [cs.IR]
	(or arXiv:1405.4402v3 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.1405.4402
Journal reference:	ACM Transactions on Intelligent Systems and Technology, Vol. 6, No. 4, Article 47, 2015

Submission history

From: Jia Zeng
[v1] Sat, 17 May 2014 14:36:52 UTC (2,606 KB)
[v2] Wed, 3 Dec 2014 09:56:44 UTC (1 KB) (withdrawn)
[v3] Sat, 6 Dec 2014 09:54:43 UTC (3,597 KB)
