A Practical Workflow for Debugging PyTorch Code on Debian

1. Environment Setup and Quick Diagnosis
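Create and activate a virtual environment, then install PyTorch:

    python3 -m venv venv && source venv/bin/activate
    pip install torch torchvision torchaudio

Install the system-level tools and the Python debugging helpers:

    sudo apt update && sudo apt install gdb valgrind linux-perf python3-pip
    pip install ipdb torchsnooper viztracer

Interactive breakpoints: insert import ipdb; ipdb.set_trace() at the point where execution should stop. Common commands: n (next), s (step), c (continue), b (breakpoint), l (list), p (print).

Crash tracebacks: import faulthandler; faulthandler.enable() makes the interpreter print the Python stack when it crashes, for example on a segmentation fault.

Quick checks: print(tensor.shape, tensor.device), and use assert to verify invariants.

Autograd anomaly detection: torch.autograd.set_detect_anomaly(True). It incurs a performance overhead, so enable it only while debugging.

2. Visualization and Performance Analysis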
TensorBoard: log scalars from the training script, then start the dashboard.

    from torch.utils.tensorboard import SummaryWriter
    writer = SummaryWriter('runs/exp1')
    writer.add_scalar('Loss/train', loss.item(), global_step=epoch)
    writer.close()

    tensorboard --logdir=runs

PyTorch profiler: wrap the training loop and export a Chrome trace.

    import torch

    with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
        on_trace_ready=lambda prof: prof.export_chrome_trace("trace.json"),
        record_shapes=True
    ) as prof:
        for step, (x, y) in enumerate(train_loader):
            ...
            loss.backward()
            optimizer.step()
            prof.step()  # must be called every iteration so the schedule advances
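
The profiler loop above depends on a training script that is not shown. The following is a minimal, CPU-only sketch of the same schedule on a toy model; the model, the synthetic batches, and the `on_ready` callback name are assumptions made for illustration. Instead of exporting a Chrome trace, it collects a summary table for each profiled window.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, schedule

# Toy model and synthetic data (illustrative assumptions, not from the article).
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
batches = [(torch.randn(32, 64), torch.randint(0, 10, (32,))) for _ in range(6)]

tables = []  # one summary table per completed "active" window


def on_ready(prof):
    # Called by the profiler each time an active window has finished.
    tables.append(
        prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
    )


with profile(
    activities=[ProfilerActivity.CPU],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=on_ready,
    record_shapes=True,
) as prof:
    for x, y in batches:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        prof.step()  # advances the wait/warmup/active schedule

print(tables[0])
```

With wait=1, warmup=1, active=3, the first window covers iterations 2 to 4, so at least five iterations are needed before any output appears. Because the schedule repeats, a callback that always writes to the same file name overwrites earlier windows; including `prof.step_num` in the file name is one way to keep them apart.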

3. In-Depth Debugging and Offline Analysis
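TorchSnooper: install with pip install torchsnooper, then decorate the suspect function with @torchsnooper.snoop(). It automatically prints the shape, dtype, device, and requires_grad of the tensors on each line of code, which makes it well suited to quickly tracking down dimension or device mismatches.

VizTracer: install with pip install viztracer, then either trace a whole script:

    viztracer my_script.py

or trace a selected region from inside the code:

    from viztracer import VizTracer

    with VizTracer(log_torch=True) as tracer:
        train_one_epoch(...)

4. External Debuggers and System-Level Tools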
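
Before attaching gdb to a process that appears hung, it can be cheaper to let the process dump its own Python stacks when it receives a signal. The following is a minimal sketch using the standard-library faulthandler module mentioned in section 1. The helper name `install_stack_dump` and the temporary log file are assumptions for illustration, and the `os.kill` call only simulates running `kill -USR1 <pid>` from another shell.

```python
import faulthandler
import os
import signal
import tempfile


def install_stack_dump(log_file, signum=signal.SIGUSR1):
    """Write Python stacks to log_file on fatal errors and whenever signum arrives."""
    # Fatal signals such as SIGSEGV or SIGABRT: dump the stacks of all threads.
    faulthandler.enable(file=log_file, all_threads=True)
    # On-demand dump: the process keeps running after the signal.
    faulthandler.register(signum, file=log_file, all_threads=True)


# In a real run this would be a fixed path next to the training logs.
log_file = tempfile.NamedTemporaryFile(mode="w+", suffix=".log", delete=False)
install_stack_dump(log_file)

# Simulate `kill -USR1 <pid>` sent from another terminal.
os.kill(os.getpid(), signal.SIGUSR1)

# faulthandler writes straight to the file descriptor, so rewind and read.
log_file.seek(0)
dump = log_file.read()
print(dump)
```

This only shows Python frames. When the dump ends inside a call into native code, gdb is still needed to see what the C++ side is doing.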
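gdb: find the process with pgrep -f python and attach with gdb -p $(pgrep -f python). If more than one process matches, pass a single PID to gdb -p instead. Inside gdb, run bt to view the native stack. A readable backtrace usually requires a PyTorch build with debug symbols, or a build from source.

perf: record a profile with call graphs, then inspect it:

    sudo perf record -g python train.py
    sudo perf report

5. Common Problems: Quick Reference and Troubleshooting Checklist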
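
When working through numerical problems, a useful first step is to find out which parameters or gradients already contain non-finite values. The following is a minimal sketch; the helper name `find_nonfinite` and the toy model are assumptions for illustration, not part of any library.

```python
import torch
from torch import nn


def find_nonfinite(model):
    """Return labels of parameters whose values or gradients contain NaN or Inf."""
    bad = []
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            bad.append(f"{name} (value)")
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(f"{name} (grad)")
    return bad


model = nn.Linear(4, 2)
model(torch.randn(3, 4)).sum().backward()
clean = find_nonfinite(model)  # nothing wrong yet

# Corrupt one bias entry on purpose to show what a report looks like.
with torch.no_grad():
    model.bias[0] = float("nan")
report = find_nonfinite(model)
print(clean, report)
```

Calling such a helper after `loss.backward()` and before `optimizer.step()` shows whether the gradients were already bad before the update was applied.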
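Enable torch.autograd.set_detect_anomaly(True) to locate the operation that produced an abnormal gradient, then check the loss, the learning rate, and numerical stability (for example with clamp or log-sum-exp).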
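
As a sketch of how anomaly detection and a numerically stable formulation fit together, the example below uses a deliberately naive log-sum-exp (the function `naive_logsumexp` is written for this illustration) that overflows for large inputs. With anomaly detection on, the backward pass raises an error that names the offending operation, and `torch.logsumexp` gives the stable result.

```python
import math

import torch


def naive_logsumexp(x):
    # exp(1000) overflows to inf in float32, so the result is inf.
    return torch.log(torch.exp(x).sum())


x = torch.tensor([1000.0, 1000.0], requires_grad=True)

message = None
# Used as a context manager so the global setting is restored afterwards.
with torch.autograd.set_detect_anomaly(True):
    naive = naive_logsumexp(x)
    try:
        naive.backward()  # 0 * inf inside the exp backward yields NaN
    except RuntimeError as err:
        message = str(err)

# The stable version shifts by the maximum internally: 1000 + log(2).
stable = torch.logsumexp(x.detach(), dim=0)

print("naive value:", naive.item())
print("anomaly report:", message)
print("stable value:", stable.item())
```

Running the forward pass inside the anomaly-detection context matters: that is what lets the error report include the traceback of the forward operation whose backward produced the NaN.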