TrainCheck
Catching silent errors in deep learning training with automated proactive checks
Deep learning training is notoriously opaque: runs can silently diverge due to numerical issues, misconfigured hyperparameters, or subtle framework bugs — and the only signal is a bad loss curve discovered hours or days later.
TrainCheck addresses this by automatically synthesizing and injecting proactive checks into the training loop. These checks encode invariants derived from mathematical properties of optimization algorithms and neural network architectures, flagging violations at runtime without requiring users to write specifications.
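To make the idea concrete, here is a minimal, hypothetical sketch of the kind of invariant such a check might encode: for plain SGD, each parameter update must equal `p - lr * grad`, so an observed update that deviates (e.g., because `optimizer.step()` was silently skipped or gradients were stale) is flagged at runtime. The function name and signature are illustrative only, not TrainCheck's actual API.

```python
# Illustrative sketch (not TrainCheck's real API): an invariant check
# derived from the mathematical definition of the SGD update rule.

def check_sgd_update(p_before, grads, p_after, lr, tol=1e-8):
    """Return indices where the observed update violates SGD semantics."""
    expected = [p - lr * g for p, g in zip(p_before, grads)]
    return [i for i, (e, a) in enumerate(zip(expected, p_after))
            if abs(e - a) > tol]  # empty list => invariant holds

params = [0.5, -1.0]
grads = [0.2, 0.1]
lr = 0.1

# A correct optimizer step satisfies the invariant...
updated = [p - lr * g for p, g in zip(params, grads)]
assert check_sgd_update(params, grads, updated, lr) == []

# ...while parameters that never moved (a silently skipped step) are flagged.
assert check_sgd_update(params, grads, params, lr) == [0, 1]
```

A check like this fires at the moment of the faulty step, rather than hours later when the loss curve finally reveals the problem.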
Key contributions:
- A taxonomy of silent error patterns in DL training (numerical, semantic, configuration)
- An automated framework that generates and inserts checks derived from training invariants
- Evaluation on real-world training bugs across PyTorch workloads, catching errors that standard logging misses
Published at OSDI '25. Code available on GitHub.