Xpert

Empowering incident management with query recommendations via large language models

When a production incident fires, engineers write domain-specific language (DSL) queries, such as Kusto Query Language (KQL) queries for Azure Monitor, to sift through telemetry and pinpoint the problem. Crafting these queries is slow and error-prone, and it demands expertise that varies across engineers.

Xpert automates KQL query recommendation for incident management. Starting from an empirical study of query usage patterns in a large-scale Microsoft cloud system, we built an end-to-end ML framework that leverages LLMs and historical incident data to generate tailored queries for new incidents.
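To make the idea of combining historical incident data with an LLM concrete, here is a minimal sketch of the retrieval and prompt-assembly step. It is not the Xpert implementation: the `Incident` record, the word-overlap (Jaccard) similarity, the prompt layout, and the sample incidents and KQL queries are all assumptions chosen for illustration, and the final call to an LLM is left out.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical record of a past incident and the query used to triage it."""
    title: str
    query: str


def tokenize(text: str) -> set[str]:
    """Lowercase, whitespace-split bag of words (a stand-in for a real embedding)."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(new_title: str, history: list[Incident], k: int = 2) -> list[Incident]:
    """Return the k historical incidents whose titles are most similar to new_title."""
    target = tokenize(new_title)
    ranked = sorted(
        history,
        key=lambda inc: jaccard(target, tokenize(inc.title)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(new_title: str, examples: list[Incident]) -> str:
    """Assemble a few-shot prompt: retrieved (incident, query) pairs, then the new incident."""
    parts = ["Recommend a KQL query for the final incident.", ""]
    for inc in examples:
        parts += [f"Incident: {inc.title}", f"Query: {inc.query}", ""]
    parts += [f"Incident: {new_title}", "Query:"]
    return "\n".join(parts)


# Illustrative data only; table and column names are made up for this sketch.
HISTORY = [
    Incident("High CPU usage on storage node",
             "Perf | where CounterName == 'CPU' | summarize avg(CounterValue) by Computer"),
    Incident("Login failures spike in auth service",
             "SigninLogs | where ResultType != 0 | summarize count() by bin(TimeGenerated, 5m)"),
    Incident("Disk latency high on storage node",
             "Perf | where CounterName == 'Disk Latency' | top 10 by CounterValue"),
]

new_incident = "CPU usage spike on storage node"
prompt = build_prompt(new_incident, retrieve(new_incident, HISTORY))
print(prompt)
```

The resulting prompt would then be sent to an LLM, whose completion after the final `Query:` line serves as the recommended query. A production system would likely replace the word-overlap score with a stronger similarity measure, but the retrieve-then-prompt shape stays the same.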

Key contributions:

  • Empirical study of KQL query usage across thousands of real production incidents
  • Xpert framework: retrieval-augmented LLM pipeline for query generation
  • Xcore: a novel metric for evaluating query quality across correctness, specificity, and efficiency
  • Deployed in production at Microsoft, measurably reducing time-to-query for on-call engineers

(Jiang et al., 2024)

Published at ICSE ’24.

References

2024

  1. Xpert: Empowering Incident Management with Query Recommendations via Large Language Models
    Yuxuan Jiang, Chaoyun Zhang, Shilin He, and 8 more authors
    In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, Apr 2024