Graph-based Incident Aggregation for Large-Scale Online Service Systems

Chen, Zhuangbin; Liu, Jinyang; Su, Yuxin; Zhang, Hongyu; Wen, Xuemin; Ling, Xiao; Yang, Yongqiang; Lyu, Michael R.

arXiv:2108.12179 (cs)

[Submitted on 27 Aug 2021]

Abstract: As online service systems continue to grow in terms of complexity and volume, how service incidents are managed will significantly impact company revenue and user trust. Due to the cascading effect, cloud failures often come with an overwhelming number of incidents from dependent services and devices. To pursue efficient incident management, related incidents should be quickly aggregated to narrow down the problem scope. To this end, in this paper, we propose GRLIA, an incident aggregation framework based on graph representation learning over the cascading graph of cloud failures. A representation vector is learned for each unique type of incident in an unsupervised and unified manner, which is able to simultaneously encode the topological and temporal correlations among incidents. Thus, it can be easily employed for online incident aggregation. In particular, to learn the correlations more accurately, we try to recover the complete scope of failures' cascading impact by leveraging fine-grained system monitoring data, i.e., Key Performance Indicators (KPIs). The proposed framework is evaluated with real-world incident data collected from a large-scale online service system of Huawei Cloud. The experimental results demonstrate that GRLIA is effective and outperforms existing methods. Furthermore, our framework has been successfully deployed in industrial practice.
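
To make the idea of "employing representation vectors for online incident aggregation" concrete, the following is a minimal sketch and not GRLIA itself. It assumes that a vector per incident type has already been learned, and it groups incoming incidents by cosine similarity against a fixed threshold. The similarity measure, the threshold value, the greedy grouping rule, the incident type names, and the vectors are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def aggregate(incidents, embeddings, threshold=0.8):
    """Greedily group incidents in arrival order.

    Each incident joins the first group whose first member has a similar
    enough type vector; otherwise it starts a new group. The threshold is
    an assumed value, not one taken from the paper.
    """
    groups = []
    for inc_id, inc_type in incidents:
        vec = embeddings[inc_type]
        for group in groups:
            representative_type = group[0][1]
            if cosine(vec, embeddings[representative_type]) >= threshold:
                group.append((inc_id, inc_type))
                break
        else:
            # No existing group was similar enough.
            groups.append([(inc_id, inc_type)])
    return groups


# Hypothetical incident types and made-up vectors, for illustration only.
EMBEDDINGS = {
    "disk_io_latency": [0.9, 0.1, 0.0],
    "vm_unreachable": [0.8, 0.2, 0.1],
    "login_failure": [0.0, 0.1, 0.9],
}
INCIDENTS = [(1, "disk_io_latency"), (2, "login_failure"), (3, "vm_unreachable")]
GROUPS = aggregate(INCIDENTS, EMBEDDINGS)
```

With these made-up vectors, incidents 1 and 3 fall into one group and incident 2 forms its own. The sketch ignores time windows entirely; in the framework described above, temporal correlation is said to be encoded in the learned vectors themselves.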

Comments:	Accepted by 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE'21)
Subjects:	Machine Learning (cs.LG); Software Engineering (cs.SE)
Cite as:	arXiv:2108.12179 [cs.LG]
	(or arXiv:2108.12179v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2108.12179

Submission history

From: Zhuangbin Chen
[v1] Fri, 27 Aug 2021 08:48:55 UTC (1,281 KB)
