Empowering incident management with query recommendations via large language models
When a production incident fires, engineers write domain-specific language (DSL) queries — like KQL for Azure Monitor — to sift through telemetry and pinpoint the problem. Crafting these queries is slow, error-prone, and demands expertise that varies across engineers.
Xpert automates KQL query recommendation for incident management. Starting from an empirical study of query usage patterns in a large-scale Microsoft cloud system, we built an end-to-end ML framework that leverages LLMs and historical incident data to generate tailored queries for new incidents.
Key contributions:
Empirical study of KQL query usage across thousands of real production incidents
Xpert framework: retrieval-augmented LLM pipeline for query generation
Xcore: a novel metric for evaluating query quality across correctness, specificity, and efficiency
Deployed in production at Microsoft, measurably reducing time-to-query for on-call engineers
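To make the idea of a retrieval-augmented pipeline concrete, here is a minimal sketch of the general pattern: retrieve the most similar historical incidents, then assemble them into a few-shot prompt for an LLM. It is not the Xpert implementation. The `HistoricalIncident` schema, the token-overlap (Jaccard) similarity, the prompt wording, and the table names in the sample queries are all assumptions made for illustration, and the LLM call itself is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class HistoricalIncident:
    """A past incident paired with a KQL query used for it (hypothetical schema)."""
    title: str
    kql: str


def tokenize(text: str) -> set[str]:
    """Lowercase and split on whitespace; a stand-in for a real embedding model."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-overlap similarity in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(new_title: str, history: list[HistoricalIncident], k: int = 2) -> list[HistoricalIncident]:
    """Return the k historical incidents whose titles best match the new one."""
    query_tokens = tokenize(new_title)
    ranked = sorted(
        history,
        key=lambda inc: jaccard(query_tokens, tokenize(inc.title)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(new_title: str, examples: list[HistoricalIncident]) -> str:
    """Assemble a few-shot prompt; the wording is an assumption, not the paper's."""
    parts = ["Recommend a KQL query for the new incident, using the examples as guidance.", ""]
    for ex in examples:
        parts += [f"Incident: {ex.title}", f"KQL: {ex.kql}", ""]
    parts += [f"Incident: {new_title}", "KQL:"]
    return "\n".join(parts)


# Illustrative data only: table and column names are hypothetical.
history = [
    HistoricalIncident("High CPU usage on storage node",
                       "PerfLogs | where Counter == 'cpu' | summarize avg(Value) by Node"),
    HistoricalIncident("Login failures spike in auth service",
                       "AuthLogs | where Result == 'Failure' | summarize count() by bin(Timestamp, 5m)"),
    HistoricalIncident("Disk latency increase on storage node",
                       "DiskLogs | summarize avg(LatencyMs) by Node"),
]
new_incident = "CPU usage alert on storage node"
top = retrieve(new_incident, history, k=2)
prompt = build_prompt(new_incident, top)
print(prompt)
```

The resulting prompt would then be sent to an LLM of choice. A production system would likely replace the token-overlap retrieval with embedding-based search and validate the generated query before showing it to an engineer.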
Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents occurring within these systems can lead to service disruptions and adversely affect user experience. To swiftly resolve such incidents, on-call engineers depend on crafting domain-specific language (DSL) queries to analyze telemetry data. However, writing these queries can be challenging and time-consuming. This paper presents a thorough empirical study of query usage in KQL, a DSL employed for incident management in a large-scale cloud management system at Microsoft. The findings underscore the importance and viability of KQL query recommendation for enhancing incident management. Building upon these insights, we introduce Xpert, an end-to-end machine learning framework that automates the KQL recommendation process. By leveraging historical incident data and large language models, Xpert generates customized KQL queries tailored to new incidents. Furthermore, Xpert incorporates a novel performance metric called Xcore, enabling a thorough evaluation of query quality from three comprehensive perspectives. We conduct extensive evaluations of Xpert, demonstrating its effectiveness in offline settings. Notably, we deploy Xpert in the real production environment of a large-scale incident management system at Microsoft, validating its efficiency in supporting incident management. To the best of our knowledge, this paper represents the first empirical study of its kind, and Xpert stands as a pioneering DSL query recommendation framework designed for incident management.
@inproceedings{10.1145/3597503.3639081,
  author = {Jiang, Yuxuan and Zhang, Chaoyun and He, Shilin and Yang, Zhihao and Ma, Minghua and Qin, Si and Kang, Yu and Dang, Yingnong and Rajmohan, Saravan and Lin, Qingwei and Zhang, Dongmei},
  title = {Xpert: Empowering Incident Management with Query Recommendations via Large Language Models},
  year = {2024},
  isbn = {9798400702174},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3597503.3639081},
  doi = {10.1145/3597503.3639081},
  booktitle = {Proceedings of the IEEE/ACM 46th International Conference on Software Engineering},
  articleno = {92},
  numpages = {13},
  keywords = {incident management, query generation, large language model},
  location = {Lisbon, Portugal},
  series = {ICSE '24},
  bibtex_show = {true}
}