Yuxuan's Homepage

PhD Candidate in CSE · University of Michigan · Systems Correctness & Reliability

Welcome! I am Yuxuan Jiang, a PhD candidate in Computer Science and Engineering at the University of Michigan, working on system correctness and reliability for large-scale, distributed, machine-learning, and agentic systems.

My research is motivated by a simple but persistent observation: while code generation and system performance continue to scale rapidly, system correctness still depends heavily on time-consuming, expert-driven review.

I aim to develop general principles and practical techniques that improve correctness in settings where specifications are incomplete or implicit. My work focuses on automated reasoning, proactive checking, and effective bug diagnosis, helping systems detect when they are behaving incorrectly even when “correctness” is not explicitly defined.

I am fortunate to be advised by Prof. Ryan Huang. Previously, I interned at Microsoft Research in Seattle and Beijing. I received my Bachelor of Engineering in Computer Engineering from Zhejiang University and the University of Illinois Urbana-Champaign.

I am actively looking for research collaborations and industry opportunities. Feel free to reach out if our interests overlap!

news

Mar 26, 2026 Our paper “An Agentic Framework for Triaging Incidents in Production Cloud Infrastructure” (Comfey) is accepted by FSE 2026 Industry Track!
Sep 02, 2025 TrainCheck (OSDI’25) is accepted to appear at PyTorch Conference 2025. See you in San Francisco!
Jun 30, 2025 Officially a PhD candidate! Hats off to Ryan and all the lab folks.
Mar 25, 2025 Our paper “Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks” is accepted by OSDI 2025! See you in Boston!
Dec 11, 2024 Our paper “One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems” is accepted by NSDI 2025! See you in Philadelphia!

selected publications

  1. Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks
    Yuxuan Jiang, Ziming Zhou, Boyu Xu, and 3 more authors
    In Proceedings of the 19th USENIX Conference on Operating Systems Design and Implementation, Boston, MA, USA, Jul 2025
  2. An Agentic Framework for Triaging Incidents in Production Cloud Infrastructure
    Yuhan Yao*, Yuxuan Jiang*, Minghua Ma, and 6 more authors
    In Proceedings of the ACM International Conference on the Foundations of Software Engineering, Industry Track, Jul 2026
  3. One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems
    Ruiming Lu, Yunchi Lu, Yuxuan Jiang, and 2 more authors
    In Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, Philadelphia, PA, USA, Apr 2025