It seems coordinate_descent_tuning bombs out on a rather trivial NanoGPT 117M model:
Error: torch.AcceleratorError: CUDA error: invalid argument
Docker image: nvcr.io/nvidia/pytorch:25.08-py3
PyTorch version: 2.8.0a0+34c6371d24.nv25.08
CUDA version: 13.0
GPU: NVIDIA B200
Repro steps:
Clone the repo: https://github.com/sytelus/nanuGPT (simple, reliable and well tested training code for quick experiments with transformer based models).
Follow the simple steps in the README, using the above Docker image.
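For reference, coordinate-descent tuning is an opt-in Inductor config flag; a minimal sketch of how it typically gets switched on (assuming the standard `torch._inductor.config` mechanism — nanuGPT may set it through its own config files):

```python
import torch._inductor.config as inductor_config

# Opt-in knob whose autotuner shows up in the stack trace; it can also be
# enabled via the TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 env var.
inductor_config.coordinate_descent_tuning = True
```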
Stack trace:
[rank7]: Traceback (most recent call last):
[rank7]:   File "/data/runs/shitals/shitals-perf-nemo-2025-09-10_08-07-25_949/source_dir/train.py", line 10, in <module>
[rank7]:     train(config)
[rank7]:   File "/data/runs/shitals/shitals-perf-nemo-2025-09-10_08-07-25_949/source_dir/nanugpt/train.py", line 256, in train
[rank7]:     scaler.backward(loss)  # type: ignore
[rank7]:   File "/data/runs/shitals/shitals-perf-nemo-2025-09-10_08-07-25_949/source_dir/nanugpt/scalers/keller_scaler.py", line 14, in backward
[rank7]:     loss.backward()
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_tensor.py", line 648, in backward
[rank7]:     torch.autograd.backward(
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/__init__.py", line 354, in backward
[rank7]:     _engine_run_backward(
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank7]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 311, in apply
[rank7]:     return user_fn(self, *args)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2218, in backward
[rank7]:     return impl_fn()
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2204, in impl_fn
[rank7]:     out = CompiledFunction._backward_impl(ctx, all_args)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2324, in _backward_impl
[rank7]:     out = call_func_at_runtime_with_args(
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
[rank7]:     out = normalize_as_list(f(args))
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 893, in _fn
[rank7]:     return fn(*args, **kwargs)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 583, in __call__
[rank7]:     return self.current_callable(inputs)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2683, in run
[rank7]:     out = model(new_inputs)
[rank7]:   File "/tmp/torchinductor_root/ro/crordv3vj5iwzlu4cg3aqthjxatpln7k4lqb6velvwto6okwzdxn.py", line 354, in call
[rank7]:     triton_red_fused__log_softmax__log_softmax_backward_data__to_copy_nll_loss_backward_nll_loss_forward_1.run(buf1, primals_3, tangents_1, convert_element_type_7, amax, log, 65536, 50257, stream=stream7)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 1159, in run
[rank7]:     self.coordinate_descent_tuning(self.launchers[0], *args, **kwargs)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 1088, in coordinate_descent_tuning
[rank7]:     best_config = self.coordesc_tuner.autotune(
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/coordinate_descent_tuner.py", line 240, in autotune
[rank7]:     baseline_timing = self.call_func(func, baseline_config)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/coordinate_descent_tuner.py", line 79, in call_func
[rank7]:     timing = func(config)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 1070, in benchmark_one_config
[rank7]:     out = self.bench(launcher, *args, **kwargs)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 779, in bench
[rank7]:     cpu_copies = self.copy_args_to_cpu_if_needed(*args, **kwargs)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 861, in copy_args_to_cpu_if_needed
[rank7]:     maybe_copy(name, arg)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 845, in maybe_copy
[rank7]:     cpu_arg = torch.empty_strided(
[rank7]: torch.AcceleratorError: CUDA error: invalid argument
[rank7]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
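Until this is root-caused, a possible workaround (my assumption: the trace dies inside the coordinate-descent tuner's benchmarking path in `copy_args_to_cpu_if_needed`, so skipping that tuner should sidestep it) is to turn the flag off before compiling:

```python
import torch._inductor.config as inductor_config

# Disable only coordinate-descent autotuning; the rest of the
# torch.compile pipeline is unaffected.
inductor_config.coordinate_descent_tuning = False
```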