Comfey

An agentic framework for triaging incidents in production cloud infrastructure

When production incidents occur in large-scale cloud systems, on-call engineers face a flood of alerts, telemetry, and runbooks. Triaging the root cause quickly is critical but demands deep system knowledge that is hard to encode in static rules.

Comfey is an agentic framework that automates incident triage. It uses LLM-powered agents that dynamically query telemetry data, consult historical incident records, and follow structured reasoning steps to identify likely root causes and surface actionable next steps for engineers.

Key contributions:

  • An agent architecture that decomposes triage into sub-tasks (symptom analysis, hypothesis generation, evidence gathering)
  • Integration with live production telemetry and incident history at Microsoft scale
  • Evaluation against real incidents demonstrating improved triage accuracy and reduced time-to-mitigate

(Yao* et al., 2026)

Published at FSE ‘26 Industry Track.

References

2026

  1. An Agentic Framework for Triaging Incidents in Production Cloud Infrastructure
    Yuhan Yao*Yuxuan Jiang*, Minghua Ma, and 6 more authors
    In Proceedings of the ACM International Conference on the Foundations of Software Engineering, Industry Track, 2026