TrainCheck

Catching silent errors in deep learning training with automated proactive checks

Deep learning training is notoriously opaque: runs can silently diverge due to numerical issues, misconfigured hyperparameters, or subtle framework bugs — and the only signal is a bad loss curve discovered hours or days later.

TrainCheck addresses this by automatically synthesizing and injecting proactive checks into the training loop. These checks encode invariants derived from mathematical properties of optimization algorithms and neural network architectures, flagging violations at runtime without requiring users to write specifications.
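To make the idea concrete, here is a minimal, hypothetical sketch of the kind of invariant check such a framework might inject into a training step. The function name, signature, and the two specific invariants (finite loss; nonzero gradients must change parameters) are illustrative assumptions, not TrainCheck's actual API:

```python
import math

def check_step_invariants(step, loss, params_before, params_after, grads):
    """Illustrative proactive checks (hypothetical, not TrainCheck's real API).

    Returns a list of violation messages so problems surface at runtime
    instead of as a bad loss curve hours later.
    """
    violations = []
    # Numerical invariant: the loss must stay finite.
    if not math.isfinite(loss):
        violations.append(f"step {step}: non-finite loss {loss}")
    # Semantic invariant: nonzero gradients should actually change the
    # parameters (catches e.g. an optimizer silently skipping updates).
    if any(g != 0 for g in grads) and params_before == params_after:
        violations.append(
            f"step {step}: parameters unchanged despite nonzero gradients"
        )
    return violations

# A healthy step passes silently...
assert check_step_invariants(1, 0.7, [1.0], [0.9], [0.5]) == []
# ...while a frozen update is flagged immediately.
assert check_step_invariants(2, 0.7, [0.9], [0.9], [0.5])
```

In the real system, checks like these are synthesized automatically rather than hand-written, and are inserted into the training loop without user-provided specifications.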

Key contributions:

  • A taxonomy of silent error patterns in DL training (numerical, semantic, configuration)
  • An automated framework that generates and inserts checks derived from training invariants
  • Evaluation on real-world training bugs across PyTorch workloads, catching errors that standard logging misses

(Jiang et al., 2025)

Published at OSDI ’25. Code available on GitHub.

References

2025

  1. 
    Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks
    Yuxuan Jiang, Ziming Zhou, Boyu Xu, and 3 more authors
    In Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’25), Boston, MA, USA, Jul 2025