Agent skill

debug-cuda-crash

Tutorial for debugging CUDA crashes using API logging

Stars 232
Forks 15

Install this agent skill to your Project

npx add-skill https://github.com/aiskillstore/marketplace/tree/main/skills/flashinfer-ai/debug-cuda-crash

SKILL.md

Tutorial: Debugging CUDA Crashes with API Logging

This tutorial shows you how to debug CUDA crashes and errors in FlashInfer using the @flashinfer_api logging decorator.

Goal

When your code crashes with CUDA errors (illegal memory access, out-of-bounds, NaN/Inf), use API logging to:

  • Capture input tensors BEFORE the crash occurs
  • Understand what data caused the problem
  • Track tensor shapes, dtypes, and values through your pipeline
  • Detect numerical issues (NaN, Inf, wrong shapes)

Why Use API Logging?

Problem: CUDA errors often crash the program, leaving no debugging information.

Solution: FlashInfer's @flashinfer_api decorator logs inputs BEFORE execution, so you can see what caused the crash even after the program terminates.

Step 1: Enable API Logging

Basic Logging (Function Names Only)

bash
export FLASHINFER_LOGLEVEL=1        # Log function names
export FLASHINFER_LOGDEST=stdout    # Log to console

python my_script.py

Output:

[2025-12-18 10:30:45] FlashInfer API Call: batch_decode_with_padded_kv_cache

Detailed Logging (Inputs/Outputs with Metadata)

bash
export FLASHINFER_LOGLEVEL=3        # Log inputs/outputs with metadata
export FLASHINFER_LOGDEST=debug.log # Save to file

python my_script.py

Output in debug.log:

================================================================================
[2025-12-18 10:30:45] FlashInfer API Logging - System Information
================================================================================
FlashInfer version: 0.6.0
CUDA toolkit version: 12.1
GPU 0: NVIDIA H100 PCIe
  Compute capability: 9.0 (SM90)
PyTorch version: 2.1.0
================================================================================

================================================================================
[2025-12-18 10:30:46] FlashInfer API Call: batch_decode_with_padded_kv_cache
--------------------------------------------------------------------------------
Positional input arguments:
  arg[0]:
    Tensor(
      shape=(32, 8, 128)
      dtype=torch.bfloat16
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
Keyword input arguments:
  kv_cache=
    Tensor(
      shape=(1024, 2, 8, 128)
      dtype=torch.bfloat16
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )

Full Logging (With Tensor Statistics)

bash
export FLASHINFER_LOGLEVEL=5        # Log with min/max/mean/nan/inf
export FLASHINFER_LOGDEST=debug.log

python my_script.py

Additional output:

  Tensor(
    shape=(32, 8, 128)
    dtype=torch.bfloat16
    device=cuda:0
    requires_grad=False
    is_contiguous=True
    min=-3.125000
    max=4.250000
    mean=0.015625
    nan_count=0
    inf_count=0
  )

Step 2: Reproduce the Crash

Example: Shape Mismatch

Your code crashes with:

RuntimeError: CUDA error: an illegal memory access was encountered

Enable logging and run again:

bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=crash_log.txt

python my_script.py

The log shows inputs before the crash:

[2025-12-18 10:32:15] FlashInfer API Call: batch_decode_with_padded_kv_cache
Positional input arguments:
  arg[0]:
    Tensor(
      shape=(32, 8, 128)      # Query tensor
      ...
    )
Keyword input arguments:
  kv_cache=
    Tensor(
      shape=(1024, 2, 8, 64)  # ❌ Wrong! Should be (..., 128) not (..., 64)
      ...
    )

Found the bug: head_dim mismatch (64 vs 128)

Step 3: Common CUDA Errors and How to Debug

Error 1: Illegal Memory Access

Error Message:

RuntimeError: CUDA error: an illegal memory access was encountered

Enable logging:

bash
export FLASHINFER_LOGLEVEL=3
python my_script.py

What to check in logs:

  • ✅ Tensor shapes match expected dimensions
  • ✅ All tensors are on CUDA (not CPU)
  • ✅ Tensor strides are reasonable
  • is_contiguous=True (if required)

Common causes:

  • Wrong tensor dimensions
  • CPU tensor passed to GPU kernel
  • Incorrect stride patterns

Error 2: NaN or Inf Values

Error Message:

RuntimeError: Function ... returned nan or inf

Enable statistics logging:

bash
export FLASHINFER_LOGLEVEL=5        # Level 5 shows nan_count, inf_count
python my_script.py

What to check in logs:

Tensor(
  ...
  min=-1234567.000000     # ❌ Suspiciously large
  max=9876543.000000      # ❌ Suspiciously large
  mean=nan                # ❌ NaN detected
  nan_count=128           # ❌ 128 NaN values!
  inf_count=0
)

Common causes:

  • Division by zero in previous operation
  • Numerical overflow/underflow
  • Uninitialized memory

Error 3: Out of Memory

Error Message:

RuntimeError: CUDA out of memory

Enable logging:

bash
export FLASHINFER_LOGLEVEL=3
python my_script.py

What to check in logs:

  • ✅ Tensor shapes (are they unexpectedly large?)
  • ✅ Batch size
  • ✅ Sequence length

Example:

Tensor(
  shape=(1024, 8192, 128, 128)  # ❌ Way too large! Should be (1024, 128, 128)?
  ...
)

Error 4: Wrong Dtype

Error Message:

RuntimeError: expected scalar type BFloat16 but found Float16

Enable logging:

bash
export FLASHINFER_LOGLEVEL=3
python my_script.py

What to check in logs:

Tensor(
  dtype=torch.float16     # ❌ Should be torch.bfloat16
  ...
)

Step 4: Multi-Process Debugging

When running with multiple GPUs/processes, use %i pattern:

bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug_rank_%i.txt    # %i = process ID

torchrun --nproc_per_node=4 my_script.py

This creates separate logs:

  • debug_rank_12345.txt (process 12345)
  • debug_rank_12346.txt (process 12346)
  • debug_rank_12347.txt (process 12347)
  • debug_rank_12348.txt (process 12348)

Now you can debug each rank independently.

Step 5: Advanced Debugging with compute-sanitizer

For harder bugs, combine API logging with CUDA tools:

Use compute-sanitizer (Memory Checker)

bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log

compute-sanitizer --tool memcheck python my_script.py

Output shows exact memory errors:

========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
=========     at 0x1234 in ScaleKernel<float>
=========     by thread (256,0,0) in block (10,0,0)
=========     Address 0x7f1234567890 is out of bounds

Check debug.log to see what inputs caused this kernel to fail.

Use cuda-gdb (Debugger)

bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log

cuda-gdb --args python my_script.py

In gdb:

(cuda-gdb) run
(cuda-gdb) where     # Show stack trace when it crashes

Check debug.log for the inputs that led to the crash.

Step 6: Kernel-Level Debugging with printf()

You can use printf() inside CUDA kernels for debugging:

Basic Usage

cpp
__global__ void MyKernel(const float* input, float* output, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  // Print from one thread to avoid spam
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    printf("n=%d, input[0]=%f\n", n, input[0]);
  }

  if (idx < n) {
    output[idx] = input[idx] * 2.0f;
  }
}

Important: Flush printf buffer after kernel:

python
my_kernel(input, output)
torch.cuda.synchronize()  # ← Flushes printf output

⚠️ Warp-Specialized Kernels: Choosing the Right Print Thread

Problem: threadIdx.x == 0 doesn't work for all warps (warp starting at thread 32 won't have thread 0).

Solution: Choose one representative thread per specialization group.

cpp
__global__ void WarpSpecializedKernel(...) {
  // Define your group's representative thread
  // e.g., first thread of each warp: threadIdx.x % 32 == 0
  // e.g., first thread of each 4-warp group: threadIdx.x % 128 == 0

  if (is_group_representative) {
    printf("Group %d processing\n", group_id);
  }
}

Common mistake ❌:

cpp
// ❌ Only warp 0 will print!
if (threadIdx.x == 0) {
  printf("Warp %d processing\n", threadIdx.x / 32);
}

Quick Reference

Kernel Type Print Condition Notes
Simple kernel threadIdx.x == 0 One thread per block
Warp-specialized One thread per group Depends on kernel design

Other Kernel Debugging Tools

cpp
// Assert for invariants
assert(value >= 0.0f && "Value must be non-negative");

// Compile-time checks
static_assert(BLOCK_SIZE % 32 == 0, "BLOCK_SIZE must be multiple of warp size");

Environment Variables Reference

Variable Values Description
FLASHINFER_LOGLEVEL 0 No logging (default)
1 Function names only
3 Inputs/outputs with metadata
5 + Tensor statistics (min/max/mean/nan/inf)
FLASHINFER_LOGDEST stdout Log to console (default)
stderr Log to stderr
<path> Log to file
log_%i.txt Multi-process: %i = process ID

Best Practices

1. Always Start with Level 3

bash
export FLASHINFER_LOGLEVEL=3

Level 3 provides tensor metadata (shape, dtype, device) without overwhelming output.

2. Use Level 5 for Numerical Issues

bash
export FLASHINFER_LOGLEVEL=5

Only use level 5 when debugging NaN/Inf problems (adds statistics).

3. Log to File for Crashes

bash
export FLASHINFER_LOGDEST=crash_log.txt

Console output may be lost when program crashes. File logs persist.

4. Compare Before/After

Enable logging and compare:

  • Last successful API call (inputs logged, outputs logged) ✅
  • First failed API call (inputs logged, no outputs) ❌ ← This is where it crashed!

5. Disable Logging in Production

bash
unset FLASHINFER_LOGLEVEL   # or export FLASHINFER_LOGLEVEL=0

Logging has zero overhead when disabled (decorator returns original function).

Troubleshooting

No Logs Appearing

Problem: Set FLASHINFER_LOGLEVEL=3 but no logs appear

Solutions:

  1. Check if API has the decorator: Not all FlashInfer APIs have @flashinfer_api yet (work in progress)

  2. Verify environment variable:

    bash
    echo $FLASHINFER_LOGLEVEL    # Should print "3"
    
  3. Check log destination:

    bash
    echo $FLASHINFER_LOGDEST     # Should print path or "stdout"
    

Too Much Output

Problem: Level 5 produces too much output

Solution: Use level 3 instead:

bash
export FLASHINFER_LOGLEVEL=3   # Skip tensor statistics

Statistics Skipped in CUDA Graph

Warning: [statistics skipped: CUDA graph capture in progress]

What it means: Level 5 statistics are automatically skipped during CUDA graph capture (to avoid synchronization)

This is normal: The framework protects you from graph capture issues.

Quick Examples

Debug Shape Mismatch

bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=stdout
python my_script.py
# Check tensor shapes in output

Debug NaN/Inf

bash
export FLASHINFER_LOGLEVEL=5         # Show statistics
export FLASHINFER_LOGDEST=debug.log
python my_script.py
# Check nan_count and inf_count in debug.log

Debug Multi-GPU Training

bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=rank_%i.log   # Separate log per rank
torchrun --nproc_per_node=8 train.py
# Check rank_*.log files

Combine with Memory Checker

bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=inputs.log
compute-sanitizer --tool memcheck python my_script.py
# inputs.log shows what data caused the memory error

Example: Full Debug Session

Your code crashes:

python
import torch
from flashinfer import batch_decode_with_padded_kv_cache

q = torch.randn(32, 8, 128, dtype=torch.bfloat16, device="cuda")
kv = torch.randn(1024, 2, 8, 64, dtype=torch.bfloat16, device="cuda")  # Wrong dim!

output = batch_decode_with_padded_kv_cache(q, kv)  # ❌ Crashes

Enable logging:

bash
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log
python test.py

Check debug.log:

[2025-12-18 10:45:23] FlashInfer API Call: batch_decode_with_padded_kv_cache
Positional input arguments:
  arg[0]:
    Tensor(
      shape=(32, 8, 128)
      dtype=torch.bfloat16
      ...
    )
  arg[1]:
    Tensor(
      shape=(1024, 2, 8, 64)    # ❌ Found it! Last dim should be 128
      dtype=torch.bfloat16
      ...
    )

Fix the bug:

python
kv = torch.randn(1024, 2, 8, 128, dtype=torch.bfloat16, device="cuda")  # ✅ Fixed

Success!

bash
python test.py
# No crash, outputs logged successfully

Summary

  1. Enable logging before the crash:

    bash
    export FLASHINFER_LOGLEVEL=3
    export FLASHINFER_LOGDEST=debug.log
    
  2. Run your code - inputs are logged BEFORE crash

  3. Check the log - last API call shows what caused the crash

  4. Fix the issue based on logged input metadata

  5. Disable logging when done:

    bash
    export FLASHINFER_LOGLEVEL=0
    

Related Documentation

  • See CLAUDE.md "API Logging with @flashinfer_api" for technical details
  • See flashinfer/api_logging.py for implementation
  • See CUDA documentation for compute-sanitizer and cuda-gdb

Expand your agent's capabilities with these related and highly-rated skills.

aiskillstore/marketplace

perigon-backend

Perigon ASP.NET Core + EF Core + Aspire conventions

232 15
Explore
aiskillstore/marketplace

perigon-agent

Pointers for Copilot/agents to apply Perigon conventions

232 15
Explore
aiskillstore/marketplace

perigon-angular

Angular 21+ standalone/Material/signal conventions for Perigon WebApp

232 15
Explore
aiskillstore/marketplace

fastapi-mastery

Comprehensive FastAPI development skill covering REST API creation, routing, request/response handling, validation, authentication, database integration, middleware, and deployment. Use when working with FastAPI projects, building APIs, implementing CRUD operations, setting up authentication/authorization, integrating databases (SQL/NoSQL), adding middleware, handling WebSockets, or deploying FastAPI applications. Triggered by requests involving .py files with FastAPI code, API endpoint creation, Pydantic models, or FastAPI-specific features.

232 15
Explore
aiskillstore/marketplace

context7-efficient

Token-efficient library documentation fetcher using Context7 MCP with 86.8% token savings through intelligent shell pipeline filtering. Fetches code examples, API references, and best practices for JavaScript, Python, Go, Rust, and other libraries. Use when users ask about library documentation, need code examples, want API usage patterns, are learning a new framework, need syntax reference, or troubleshooting with library-specific information. Triggers include questions like "Show me React hooks", "How do I use Prisma", "What's the Next.js routing syntax", or any request for library/framework documentation.

232 15
Explore
aiskillstore/marketplace

browser-use

Browser automation using Playwright MCP. Navigate websites, fill forms, click elements, take screenshots, and extract data. Use when tasks require web browsing, form submission, web scraping, UI testing, or any browser interaction.

232 15
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results