Coordinate_descent_tuning errors out with torch.AcceleratorError: CUDA error: invalid argument

It seems coordinate_descent_tuning bombs out on a rather trivial NanoGPT 117M model:

Error: torch.AcceleratorError: CUDA error: invalid argument

Docker image: nvcr.io/nvidia/pytorch:25.08-py3

Pytorch version: 2.8.0a0+34c6371d24.nv25.08

CUDA version: 13.0

GPU: NVIDIA B200

Repro steps:
Clone the repo: https://github.com/sytelus/nanuGPT (simple, reliable and well-tested training code for quick experiments with transformer-based models).

Follow the simple steps in the readme using the above Docker image.

Stack trace:

[rank7]: Traceback (most recent call last):
[rank7]:   File "/data/runs/shitals/shitals-perf-nemo-2025-09-10_08-07-25_949/source_dir/train.py", line 10, in <module>
[rank7]:     train(config)
[rank7]:   File "/data/runs/shitals/shitals-perf-nemo-2025-09-10_08-07-25_949/source_dir/nanugpt/train.py", line 256, in train
[rank7]:     scaler.backward(loss) # type: ignore
[rank7]:     ^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/data/runs/shitals/shitals-perf-nemo-2025-09-10_08-07-25_949/source_dir/nanugpt/scalers/keller_scaler.py", line 14, in backward
[rank7]:     loss.backward()
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_tensor.py", line 648, in backward
[rank7]:     torch.autograd.backward(
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/__init__.py", line 354, in backward
[rank7]:     _engine_run_backward(
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank7]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 311, in apply
[rank7]:     return user_fn(self, *args)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2218, in backward
[rank7]:     return impl_fn()
[rank7]:            ^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2204, in impl_fn
[rank7]:     out = CompiledFunction._backward_impl(ctx, all_args)
[rank7]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2324, in _backward_impl
[rank7]:     out = call_func_at_runtime_with_args(
[rank7]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
[rank7]:     out = normalize_as_list(f(args))
[rank7]:                             ^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 893, in _fn
[rank7]:     return fn(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 583, in __call__
[rank7]:     return self.current_callable(inputs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2683, in run
[rank7]:     out = model(new_inputs)
[rank7]:           ^^^^^^^^^^^^^^^^^
[rank7]:   File "/tmp/torchinductor_root/ro/crordv3vj5iwzlu4cg3aqthjxatpln7k4lqb6velvwto6okwzdxn.py", line 354, in call
[rank7]:     triton_red_fused__log_softmax__log_softmax_backward_data__to_copy_nll_loss_backward_nll_loss_forward_1.run(buf1, primals_3, tangents_1, convert_element_type_7, amax, log, 65536, 50257, stream=stream7)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 1159, in run
[rank7]:     self.coordinate_descent_tuning(self.launchers[0], *args, **kwargs)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 1088, in coordinate_descent_tuning
[rank7]:     best_config = self.coordesc_tuner.autotune(
[rank7]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/coordinate_descent_tuner.py", line 240, in autotune
[rank7]:     baseline_timing = self.call_func(func, baseline_config)
[rank7]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/coordinate_descent_tuner.py", line 79, in call_func
[rank7]:     timing = func(config)
[rank7]:              ^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 1070, in benchmark_one_config
[rank7]:     out = self.bench(launcher, *args, **kwargs)
[rank7]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 779, in bench
[rank7]:     cpu_copies = self.copy_args_to_cpu_if_needed(*args, **kwargs)
[rank7]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 861, in copy_args_to_cpu_if_needed
[rank7]:     maybe_copy(name, arg)
[rank7]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 845, in maybe_copy
[rank7]:     cpu_arg = torch.empty_strided(
[rank7]:               ^^^^^^^^^^^^^^^^^^^^
[rank7]: torch.AcceleratorError: CUDA error: invalid argument
[rank7]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Interestingly, after disabling coordinate_descent_tuning I still got the same error, but this time from autotune_to_one_config.

Stack Trace

[rank6]: Traceback (most recent call last):
[rank6]:   File "/data/runs/shitals/shitals-perf-nemo-2025-09-10_10-14-46_784/source_dir/train.py", line 10, in <module>
[rank6]:     train(config)
[rank6]:   File "/data/runs/shitals/shitals-perf-nemo-2025-09-10_10-14-46_784/source_dir/nanugpt/train.py", line 256, in train
[rank6]:     scaler.backward(loss) # type: ignore
[rank6]:     ^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/data/runs/shitals/shitals-perf-nemo-2025-09-10_10-14-46_784/source_dir/nanugpt/scalers/keller_scaler.py", line 14, in backward
[rank6]:     loss.backward()
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_tensor.py", line 648, in backward
[rank6]:     torch.autograd.backward(
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/__init__.py", line 354, in backward
[rank6]:     _engine_run_backward(
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank6]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 311, in apply
[rank6]:     return user_fn(self, *args)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2218, in backward
[rank6]:     return impl_fn()
[rank6]:            ^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2204, in impl_fn
[rank6]:     out = CompiledFunction._backward_impl(ctx, all_args)
[rank6]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2324, in _backward_impl
[rank6]:     out = call_func_at_runtime_with_args(
[rank6]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
[rank6]:     out = normalize_as_list(f(args))
[rank6]:                             ^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 893, in _fn
[rank6]:     return fn(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 583, in __call__
[rank6]:     return self.current_callable(inputs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2683, in run
[rank6]:     out = model(new_inputs)
[rank6]:           ^^^^^^^^^^^^^^^^^
[rank6]:   File "/tmp/torchinductor_root/wt/cwtuelcs4a2ac2zgaqnfrru5ptmzdqxsrrtth5mr7c7tjqwzf5y5.py", line 366, in call
[rank6]:     triton_poi_fused_mm_3.run(buf1, 458752, stream=stream6)
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 1153, in run
[rank6]:     self.autotune_to_one_config(*args, **kwargs)
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 968, in autotune_to_one_config
[rank6]:     timings = self.benchmark_all_configs(*args, **kwargs)
[rank6]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 943, in benchmark_all_configs
[rank6]:     launcher: self.bench(launcher, *args, **kwargs)
[rank6]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 779, in bench
[rank6]:     cpu_copies = self.copy_args_to_cpu_if_needed(*args, **kwargs)
[rank6]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 861, in copy_args_to_cpu_if_needed
[rank6]:     maybe_copy(name, arg)
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/runtime/triton_heuristics.py", line 845, in maybe_copy
[rank6]:     cpu_arg = torch.empty_strided(
[rank6]:               ^^^^^^^^^^^^^^^^^^^^
[rank6]: torch.AcceleratorError: CUDA error: invalid argument
[rank6]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank6]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank6]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

A couple more findings:

  1. This happens with all of the following images, including ones with CUDA 12.9: nvcr.io/nvidia/pytorch:25.08-py3, nvcr.io/nvidia/pytorch:25.06-py3, and nvcr.io/nvidia/pytorch:25.05-py3.
  2. It doesn’t happen if the device batch size is 12, but bumping it up causes the above error!

One other possibility is that this is specific to B200s, where device memory is relatively much larger.


Could you narrow down which kernel exactly fails via cuda-gdb, please?
Based on your outputs it seems an Inductor kernel fails, but it’s unclear to me which op exactly fails.


Thanks for your response @ptrblck. I tried debugging in cuda-gdb. Interestingly, it breaks on this error, which doesn’t look like a kernel error but rather a memory-allocation error:

warning: Cuda Driver error detected: Failed to allocate physical memory
warning: Cuda Driver error detected: Returning 1 (CUDA_ERROR_INVALID_VALUE) from cuMemHostAlloc
[Switching to Thread 0x7fff323ff6c0 (LWP 27593)]
Cuda Runtime API error detected: cudaHostAlloc returned cudaErrorInvalidValue(CUresult=1): invalid argument

At first blush I thought this might be due to a pinned-memory allocation of length 0, but even after disabling pinning the error still occurred.
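
If it helps anyone probing the same hypothesis, here is a small sketch of my own. It assumes that Inductor's benchmarking path copies kernel args into pinned host memory via torch.empty_strided (which is what the maybe_copy frame suggests, but I have not verified the exact flags it passes), and tries the same kind of allocation in isolation:

import torch

# Roughly the size of the largest benchmark arg from the logs below:
# (61440, 50264) bf16 stored with row stride 50304 is ~5.8 GiB.
x = torch.empty_strided((61440, 50264), (50304, 1), dtype=torch.bfloat16, device="cuda")
cpu_copy = torch.empty_strided(
    x.size(), x.stride(), dtype=x.dtype, device="cpu", pin_memory=True
)
# The line above requests a pinned host allocation; if cuMemHostAlloc rejects it,
# this is roughly where a "CUDA error: invalid argument" would surface.
cpu_copy.copy_(x)
print("pinned copy ok:", cpu_copy.is_pinned())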

Below is part of the debug output in case it helps (whole output). I have saved the files as well and can post them somewhere if needed.

[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     from torch._dynamo.testing import rand_strided
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     from torch._inductor.utils import print_performance
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     primals_3 = rand_strided((60, 1024), (1024, 1), device='cuda:0', dtype=torch.int64)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     view = rand_strided((61440, 768), (768, 1), device='cuda:0', dtype=torch.bfloat16)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     mm_default_2 = rand_strided((61440, 50264), (50304, 1), device='cuda:0', dtype=torch.bfloat16)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     amax = rand_strided((61440, 1), (1, 1), device='cuda:0', dtype=torch.float32)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     log = rand_strided((61440, 1), (1, 1), device='cuda:0', dtype=torch.float32)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     convert_element_type_7 = rand_strided((), (), device='cuda:0', dtype=torch.float32)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     permute_3 = rand_strided((50257, 768), (768, 1), device='cuda:0', dtype=torch.bfloat16)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     tangents_1 = rand_strided((), (), device='cuda:0', dtype=torch.float32)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     fn = lambda: call([primals_3, view, mm_default_2, amax, log, convert_element_type_7, permute_3, tangents_1])
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     return print_performance(fn, times=times, repeat=repeat)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code] if __name__ == "__main__":
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     from torch._inductor.wrapper_benchmark import compiled_module_main
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     compiled_module_main('None', benchmark_compiled_module)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]
[rank0]:V0911 10:18:42.005000 27347 torch/_inductor/codecache.py:1189] [0/0] [__output_code] Output code written to: /tmp/torchinductor_root/fl/cflma6p5e72qss2pb4zms4ed2cfcttdlkmklhjhb7yj5gouhmcrl.py
[rank0]:W0911 10:18:42.013000 27347 torch/_inductor/debug.py:449] [0/0] model__13_backward_42 debug trace: /data/shitals/devbox/GitHubSrc/nanugpt/torch_compile_debug/run_2025_09_11_10_18_04_782379-pid_27347/torchinductor/model__13_backward_42.14

One other thing I tried was to set this:

export TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1

The hope was that a bad autotune config would die in the separate process instead of crashing the main process, but this didn’t work. Any workaround you can think of would be great!
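
For what it's worth, these are the Inductor knobs I'm aware of for dialing the benchmarking back (whether any of them actually avoids this code path is an assumption on my part; the autotune_to_one_config failure above suggests it may not be enough):

import torch
import torch._inductor.config as inductor_config

# Explicitly turn off coordinate-descent tuning (I believe the env-var
# equivalent is TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=0).
inductor_config.coordinate_descent_tuning = False

# Avoid mode="max-autotune", which benchmarks many more candidate configs.
net = torch.nn.Linear(8, 8).cuda()   # hypothetical toy module, just to show the call
compiled = torch.compile(net)        # default mode
out = compiled(torch.randn(4, 8, device="cuda"))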


I also don’t see any pinned memory usage in your logs and don’t know where cuMemHostAlloc comes from. Were you able to isolate the kernel, including the backtrace, via cuda-gdb?


At the above exception, `bt` followed by `info cuda kernels` says “No CUDA kernels”. I thought the exception reported above is raised outside of any kernel?


@ptrblck (and everyone else reading this 🙂), I finally found the cause and now have a single-script minimal repro that hits it consistently! It turns out the CUDA error: invalid argument error was only occurring when (1) I put my loss function call inside .forward() so it gets fused, and (2) the function returned a tuple of two tensors, where the first tensor is the one I call .backward() on, while the second tensor is a scalar value that doesn’t participate in backprop. The fix was to return a tuple of a tensor and a Python scalar (instead of a tensor) from the loss function (see the comment below in the code).

Below is the complete code to reproduce this:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.backends.cuda import sdp_kernel

class Block(nn.Module):
    def __init__(self, d_model=768, n_head=12, p=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.n_head = n_head
        self.d_head = d_model // n_head
        self.p = p
    def forward(self, x):
        B, S, D = x.shape
        h = self.ln1(x)
        q, k, v = self.qkv(h).split(D, dim=-1)
        q = q.view(B, S, self.n_head, self.d_head).transpose(1, 2)
        k = k.view(B, S, self.n_head, self.d_head).transpose(1, 2)
        v = v.view(B, S, self.n_head, self.d_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True, dropout_p=self.p if self.training else 0.0)
        y = y.transpose(1, 2).contiguous().view(B, S, D)
        x = x + self.proj(y)
        x = x + self.mlp(self.ln2(x))
        return x

class TinyTransformer(nn.Module):
    def __init__(self, vocab=50000, d_model=768, n_head=12, n_layer=12, seq_len=1024, p=0.1):
        super().__init__()
        self.wte = nn.Embedding(vocab, d_model)
        self.wpe = nn.Embedding(seq_len, d_model)
        self.h = nn.ModuleList([Block(d_model, n_head, p) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab, bias=False)
        self.lm_head.weight = self.wte.weight
    def forward(self, idx, y):
        B, S = idx.shape
        x = self.wte(idx) + self.wpe(torch.arange(S, device=idx.device))
        for blk in self.h:
            x = blk(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        preds = logits.view(-1, logits.size(-1))
        target = y.view(-1)
        loss = F.cross_entropy(preds, target, ignore_index=-1)
        # FIX: to get rid of error, use below instead
        # correct = (preds.argmax(dim=-1) == target).sum().detach().item()
        correct = (preds.argmax(dim=-1) == target).sum()
        return loss, correct

def main():
    from torch._dynamo import config as dconfig
    dconfig.capture_scalar_outputs = True
    sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False)
    device = "cuda"
    B, S, D, C = 60, 1024, 768, 50000
    idx = torch.randint(0, C, (B, S), device=device)
    y = torch.randint(0, C, (B, S), device=device)
    model = TinyTransformer(vocab=C, d_model=D, n_head=12, n_layer=12, seq_len=S, p=0.1).to(device).half()
    model.train()
    model = torch.compile(model, fullgraph=True)
    loss, correct = model(idx, y)
    print(float(loss), correct)
    (loss + 0.0 * correct.float()).backward()
    torch.cuda.synchronize()
    print("ok")

if __name__ == "__main__":
    main()

To reproduce the error, use an 8xB200 GPU node (unfortunately a single GPU doesn’t seem to repro) and launch with torchrun:

torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 above_script.py

Call Stack and Error Message

Interestingly, the error message now has much more info and tells us why cudaHostAlloc failed:

  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1231, in forward
    return compiled_fn(full_args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 343, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 100, in g
    return f(*args)
           ^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 579, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2033, in forward
    fw_outs = call_func_at_runtime_with_args(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 528, in wrapper
    return compiled_fn(runtime_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 722, in inner_fn
    outs = compiled_fn(args)
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 583, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2683, in run
    out = model(new_inputs)
          ^^^^^^^^^^^^^^^^^
  File "/tmp/torchinductor_root/d6/cd6xwpscfob6tos3lz2rv3dfw6pg6v3vaff4wqb4vzuo56xgorjp.py", line 1568, in call
    buf284 = empty_strided_cuda((61440, 50000), (50048, 1), torch.float16)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.73 GiB. GPU 0 has a total capacity of 178.36 GiB of which 4.44 GiB is free. Process 123 has 23.96 GiB memory in use. Process 121 has 18.12 GiB memory in use. Process 118 has 18.02 GiB memory in use. Including non-PyTorch memory, this process has 18.12 GiB memory in use. Process 117 has 23.96 GiB memory in use. Process 120 has 23.96 GiB memory in use. Process 122 has 23.96 GiB memory in use. Process 124 has 23.96 GiB memory in use. Of the allocated memory 17.34 GiB is allocated by PyTorch, and 21.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
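
As a back-of-the-envelope check (my own arithmetic, using only the shapes printed above), the 5.73 GiB appears to be the flattened fp16 logits-sized buffer with its padded row stride:

rows, row_stride, bytes_per_elem = 61440, 50048, 2    # (61440, 50000) fp16, stride (50048, 1)
print(rows * row_stride * bytes_per_elem / 2**30)     # ~5.73 GiB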

Notes

I tried disabling torch._dynamo.config.capture_scalar_outputs as well as setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but neither made any difference.


Are you sure you are running into the same issue? Your current code snippet raises an OOM error on the device, which seems to differ from the originally reported error.


I am fairly sure it’s the same underlying cause. Just converting one member of the tuple to a Python scalar via tensor.detach().item() makes the error go away. I think somehow the Dynamo engine is going crazy and wants to allocate huge memory? Another fun thing is that the error still occurs with export TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1.

The above is very minimal standalone code (just a run-of-the-mill 124M-param GPT-2 model) that doesn’t need any datasets or anything else.

FYI, I got the same torch.AcceleratorError: CUDA error: invalid argument error message during the torch.empty_strided call, but I found myself in a more dire situation. Instead of returning logits, loss I tried returning only loss, but even that didn’t help. The only workaround I could find is to unfuse the cross-entropy loss calculation and do it outside the compiled model (or… give up on torch.compile()). Here is the commit that fixed it. The code is also derived from nanogpt:

Pytorch version: 2.8.0
CUDA version: 12.8
GPU: 8x NVIDIA H100
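
For anyone else hitting this, here is a minimal sketch (my own toy example, not the linked commit) of what “unfusing” the loss looks like: compile only the part that produces logits and compute the cross-entropy eagerly outside the compiled region.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLM(nn.Module):  # hypothetical stand-in for the GPT-2-style model
    def __init__(self, vocab=50000, d_model=768):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        self.head = nn.Linear(d_model, vocab, bias=False)
    def forward(self, idx):
        # Returns logits only; no loss or metrics are computed inside the compiled graph.
        return self.head(self.emb(idx))

model = torch.compile(ToyLM().cuda())
idx = torch.randint(0, 50000, (4, 128), device="cuda")
y = torch.randint(0, 50000, (4, 128), device="cuda")
logits = model(idx)                                                    # compiled region
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))  # eager, outside compile
loss.backward()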