Yuxuan's Homepage

PhD Candidate in CSE · University of Michigan · Systems Correctness & Reliability

Welcome! I am Yuxuan Jiang, a PhD candidate in Computer Science and Engineering at the University of Michigan, advised by Prof. Ryan Huang. I work on system correctness and reliability — building tools that help large-scale distributed, ML, and agentic systems detect when they are behaving incorrectly, even when “correctness” is never explicitly defined.

Previously, I interned at Microsoft Research (Seattle and Beijing) and received my B.Eng. from Zhejiang University and the University of Illinois Urbana-Champaign. I am actively looking for research collaborations and opportunities — feel free to reach out!

news

Mar 26, 2026 Our paper “An Agentic Framework for Triaging Incidents in Production Cloud Infrastructure” (Comfey) is accepted by FSE 2026 Industry Track!
Sep 02, 2025 TrainCheck (OSDI’25) is accepted to appear at PyTorch Conference 2025. See you in San Francisco!
Jun 30, 2025 Officially a PhD candidate! Hats off to Ryan and all the lab folks.
Mar 25, 2025 Our paper “Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks” is accepted by OSDI 2025! See you in Boston!
Dec 11, 2024 Our paper “One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems” is accepted by NSDI 2025! See you in Philadelphia!

selected publications

  1. Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks
    Yuxuan Jiang, Ziming Zhou, Boyu Xu, and 3 more authors
    In Proceedings of the 19th USENIX Conference on Operating Systems Design and Implementation, Boston, MA, USA, Jul 2025
  2. An Agentic Framework for Triaging Incidents in Production Cloud Infrastructure
    Yuhan Yao*, Yuxuan Jiang*, Minghua Ma, and 6 more authors
    In Proceedings of the ACM International Conference on the Foundations of Software Engineering, Industry Track, Jul 2026
  3. One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems
    Ruiming Lu, Yunchi Lu, Yuxuan Jiang, and 2 more authors
    In Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, Philadelphia, PA, USA, Apr 2025