Agent skill
debug-cuda-crash
Tutorial for debugging CUDA crashes using API logging
Install this agent skill to your Project
npx add-skill https://github.com/aiskillstore/marketplace/tree/main/skills/flashinfer-ai/debug-cuda-crash
SKILL.md
Tutorial: Debugging CUDA Crashes with API Logging
This tutorial shows you how to debug CUDA crashes and errors in FlashInfer using the @flashinfer_api logging decorator.
Goal
When your code crashes with CUDA errors (illegal memory access, out-of-bounds, NaN/Inf), use API logging to:
- Capture input tensors BEFORE the crash occurs
- Understand what data caused the problem
- Track tensor shapes, dtypes, and values through your pipeline
- Detect numerical issues (NaN, Inf, wrong shapes)
Why Use API Logging?
Problem: CUDA errors often crash the program, leaving no debugging information.
Solution: FlashInfer's @flashinfer_api decorator logs inputs BEFORE execution, so you can see what caused the crash even after the program terminates.
Step 1: Enable API Logging
Basic Logging (Function Names Only)
export FLASHINFER_LOGLEVEL=1 # Log function names
export FLASHINFER_LOGDEST=stdout # Log to console
python my_script.py
Output:
[2025-12-18 10:30:45] FlashInfer API Call: batch_decode_with_padded_kv_cache
Detailed Logging (Inputs/Outputs with Metadata)
export FLASHINFER_LOGLEVEL=3 # Log inputs/outputs with metadata
export FLASHINFER_LOGDEST=debug.log # Save to file
python my_script.py
Output in debug.log:
================================================================================
[2025-12-18 10:30:45] FlashInfer API Logging - System Information
================================================================================
FlashInfer version: 0.6.0
CUDA toolkit version: 12.1
GPU 0: NVIDIA H100 PCIe
Compute capability: 9.0 (SM90)
PyTorch version: 2.1.0
================================================================================
================================================================================
[2025-12-18 10:30:46] FlashInfer API Call: batch_decode_with_padded_kv_cache
--------------------------------------------------------------------------------
Positional input arguments:
arg[0]:
Tensor(
shape=(32, 8, 128)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
)
Keyword input arguments:
kv_cache=
Tensor(
shape=(1024, 2, 8, 128)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
)
Full Logging (With Tensor Statistics)
export FLASHINFER_LOGLEVEL=5 # Log with min/max/mean/nan/inf
export FLASHINFER_LOGDEST=debug.log
python my_script.py
Additional output:
Tensor(
shape=(32, 8, 128)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
min=-3.125000
max=4.250000
mean=0.015625
nan_count=0
inf_count=0
)
Step 2: Reproduce the Crash
Example: Shape Mismatch
Your code crashes with:
RuntimeError: CUDA error: an illegal memory access was encountered
Enable logging and run again:
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=crash_log.txt
python my_script.py
The log shows inputs before the crash:
[2025-12-18 10:32:15] FlashInfer API Call: batch_decode_with_padded_kv_cache
Positional input arguments:
arg[0]:
Tensor(
shape=(32, 8, 128) # Query tensor
...
)
Keyword input arguments:
kv_cache=
Tensor(
shape=(1024, 2, 8, 64) # ❌ Wrong! Should be (..., 128) not (..., 64)
...
)
Found the bug: head_dim mismatch (64 vs 128)
Step 3: Common CUDA Errors and How to Debug
Error 1: Illegal Memory Access
Error Message:
RuntimeError: CUDA error: an illegal memory access was encountered
Enable logging:
export FLASHINFER_LOGLEVEL=3
python my_script.py
What to check in logs:
- ✅ Tensor shapes match expected dimensions
- ✅ All tensors are on CUDA (not CPU)
- ✅ Tensor strides are reasonable
- ✅
is_contiguous=True(if required)
Common causes:
- Wrong tensor dimensions
- CPU tensor passed to GPU kernel
- Incorrect stride patterns
Error 2: NaN or Inf Values
Error Message:
RuntimeError: Function ... returned nan or inf
Enable statistics logging:
export FLASHINFER_LOGLEVEL=5 # Level 5 shows nan_count, inf_count
python my_script.py
What to check in logs:
Tensor(
...
min=-1234567.000000 # ❌ Suspiciously large
max=9876543.000000 # ❌ Suspiciously large
mean=nan # ❌ NaN detected
nan_count=128 # ❌ 128 NaN values!
inf_count=0
)
Common causes:
- Division by zero in previous operation
- Numerical overflow/underflow
- Uninitialized memory
Error 3: Out of Memory
Error Message:
RuntimeError: CUDA out of memory
Enable logging:
export FLASHINFER_LOGLEVEL=3
python my_script.py
What to check in logs:
- ✅ Tensor shapes (are they unexpectedly large?)
- ✅ Batch size
- ✅ Sequence length
Example:
Tensor(
shape=(1024, 8192, 128, 128) # ❌ Way too large! Should be (1024, 128, 128)?
...
)
Error 4: Wrong Dtype
Error Message:
RuntimeError: expected scalar type BFloat16 but found Float16
Enable logging:
export FLASHINFER_LOGLEVEL=3
python my_script.py
What to check in logs:
Tensor(
dtype=torch.float16 # ❌ Should be torch.bfloat16
...
)
Step 4: Multi-Process Debugging
When running with multiple GPUs/processes, use %i pattern:
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug_rank_%i.txt # %i = process ID
torchrun --nproc_per_node=4 my_script.py
This creates separate logs:
debug_rank_12345.txt(process 12345)debug_rank_12346.txt(process 12346)debug_rank_12347.txt(process 12347)debug_rank_12348.txt(process 12348)
Now you can debug each rank independently.
Step 5: Advanced Debugging with compute-sanitizer
For harder bugs, combine API logging with CUDA tools:
Use compute-sanitizer (Memory Checker)
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log
compute-sanitizer --tool memcheck python my_script.py
Output shows exact memory errors:
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
========= at 0x1234 in ScaleKernel<float>
========= by thread (256,0,0) in block (10,0,0)
========= Address 0x7f1234567890 is out of bounds
Check debug.log to see what inputs caused this kernel to fail.
Use cuda-gdb (Debugger)
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log
cuda-gdb --args python my_script.py
In gdb:
(cuda-gdb) run
(cuda-gdb) where # Show stack trace when it crashes
Check debug.log for the inputs that led to the crash.
Step 6: Kernel-Level Debugging with printf()
You can use printf() inside CUDA kernels for debugging:
Basic Usage
__global__ void MyKernel(const float* input, float* output, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Print from one thread to avoid spam
if (threadIdx.x == 0 && blockIdx.x == 0) {
printf("n=%d, input[0]=%f\n", n, input[0]);
}
if (idx < n) {
output[idx] = input[idx] * 2.0f;
}
}
Important: Flush printf buffer after kernel:
my_kernel(input, output)
torch.cuda.synchronize() # ← Flushes printf output
⚠️ Warp-Specialized Kernels: Choosing the Right Print Thread
Problem: threadIdx.x == 0 doesn't work for all warps (warp starting at thread 32 won't have thread 0).
Solution: Choose one representative thread per specialization group.
__global__ void WarpSpecializedKernel(...) {
// Define your group's representative thread
// e.g., first thread of each warp: threadIdx.x % 32 == 0
// e.g., first thread of each 4-warp group: threadIdx.x % 128 == 0
if (is_group_representative) {
printf("Group %d processing\n", group_id);
}
}
Common mistake ❌:
// ❌ Only warp 0 will print!
if (threadIdx.x == 0) {
printf("Warp %d processing\n", threadIdx.x / 32);
}
Quick Reference
| Kernel Type | Print Condition | Notes |
|---|---|---|
| Simple kernel | threadIdx.x == 0 |
One thread per block |
| Warp-specialized | One thread per group | Depends on kernel design |
Other Kernel Debugging Tools
// Assert for invariants
assert(value >= 0.0f && "Value must be non-negative");
// Compile-time checks
static_assert(BLOCK_SIZE % 32 == 0, "BLOCK_SIZE must be multiple of warp size");
Environment Variables Reference
| Variable | Values | Description |
|---|---|---|
FLASHINFER_LOGLEVEL |
0 |
No logging (default) |
1 |
Function names only | |
3 |
Inputs/outputs with metadata | |
5 |
+ Tensor statistics (min/max/mean/nan/inf) | |
FLASHINFER_LOGDEST |
stdout |
Log to console (default) |
stderr |
Log to stderr | |
<path> |
Log to file | |
log_%i.txt |
Multi-process: %i = process ID |
Best Practices
1. Always Start with Level 3
export FLASHINFER_LOGLEVEL=3
Level 3 provides tensor metadata (shape, dtype, device) without overwhelming output.
2. Use Level 5 for Numerical Issues
export FLASHINFER_LOGLEVEL=5
Only use level 5 when debugging NaN/Inf problems (adds statistics).
3. Log to File for Crashes
export FLASHINFER_LOGDEST=crash_log.txt
Console output may be lost when program crashes. File logs persist.
4. Compare Before/After
Enable logging and compare:
- Last successful API call (inputs logged, outputs logged) ✅
- First failed API call (inputs logged, no outputs) ❌ ← This is where it crashed!
5. Disable Logging in Production
unset FLASHINFER_LOGLEVEL # or export FLASHINFER_LOGLEVEL=0
Logging has zero overhead when disabled (decorator returns original function).
Troubleshooting
No Logs Appearing
Problem: Set FLASHINFER_LOGLEVEL=3 but no logs appear
Solutions:
-
Check if API has the decorator: Not all FlashInfer APIs have
@flashinfer_apiyet (work in progress) -
Verify environment variable:
bashecho $FLASHINFER_LOGLEVEL # Should print "3" -
Check log destination:
bashecho $FLASHINFER_LOGDEST # Should print path or "stdout"
Too Much Output
Problem: Level 5 produces too much output
Solution: Use level 3 instead:
export FLASHINFER_LOGLEVEL=3 # Skip tensor statistics
Statistics Skipped in CUDA Graph
Warning: [statistics skipped: CUDA graph capture in progress]
What it means: Level 5 statistics are automatically skipped during CUDA graph capture (to avoid synchronization)
This is normal: The framework protects you from graph capture issues.
Quick Examples
Debug Shape Mismatch
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=stdout
python my_script.py
# Check tensor shapes in output
Debug NaN/Inf
export FLASHINFER_LOGLEVEL=5 # Show statistics
export FLASHINFER_LOGDEST=debug.log
python my_script.py
# Check nan_count and inf_count in debug.log
Debug Multi-GPU Training
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=rank_%i.log # Separate log per rank
torchrun --nproc_per_node=8 train.py
# Check rank_*.log files
Combine with Memory Checker
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=inputs.log
compute-sanitizer --tool memcheck python my_script.py
# inputs.log shows what data caused the memory error
Example: Full Debug Session
Your code crashes:
import torch
from flashinfer import batch_decode_with_padded_kv_cache
q = torch.randn(32, 8, 128, dtype=torch.bfloat16, device="cuda")
kv = torch.randn(1024, 2, 8, 64, dtype=torch.bfloat16, device="cuda") # Wrong dim!
output = batch_decode_with_padded_kv_cache(q, kv) # ❌ Crashes
Enable logging:
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log
python test.py
Check debug.log:
[2025-12-18 10:45:23] FlashInfer API Call: batch_decode_with_padded_kv_cache
Positional input arguments:
arg[0]:
Tensor(
shape=(32, 8, 128)
dtype=torch.bfloat16
...
)
arg[1]:
Tensor(
shape=(1024, 2, 8, 64) # ❌ Found it! Last dim should be 128
dtype=torch.bfloat16
...
)
Fix the bug:
kv = torch.randn(1024, 2, 8, 128, dtype=torch.bfloat16, device="cuda") # ✅ Fixed
Success!
python test.py
# No crash, outputs logged successfully
Summary
-
Enable logging before the crash:
bashexport FLASHINFER_LOGLEVEL=3 export FLASHINFER_LOGDEST=debug.log -
Run your code - inputs are logged BEFORE crash
-
Check the log - last API call shows what caused the crash
-
Fix the issue based on logged input metadata
-
Disable logging when done:
bashexport FLASHINFER_LOGLEVEL=0
Related Documentation
- See CLAUDE.md "API Logging with @flashinfer_api" for technical details
- See
flashinfer/api_logging.pyfor implementation - See CUDA documentation for compute-sanitizer and cuda-gdb
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
perigon-backend
Perigon ASP.NET Core + EF Core + Aspire conventions
perigon-agent
Pointers for Copilot/agents to apply Perigon conventions
perigon-angular
Angular 21+ standalone/Material/signal conventions for Perigon WebApp
fastapi-mastery
Comprehensive FastAPI development skill covering REST API creation, routing, request/response handling, validation, authentication, database integration, middleware, and deployment. Use when working with FastAPI projects, building APIs, implementing CRUD operations, setting up authentication/authorization, integrating databases (SQL/NoSQL), adding middleware, handling WebSockets, or deploying FastAPI applications. Triggered by requests involving .py files with FastAPI code, API endpoint creation, Pydantic models, or FastAPI-specific features.
context7-efficient
Token-efficient library documentation fetcher using Context7 MCP with 86.8% token savings through intelligent shell pipeline filtering. Fetches code examples, API references, and best practices for JavaScript, Python, Go, Rust, and other libraries. Use when users ask about library documentation, need code examples, want API usage patterns, are learning a new framework, need syntax reference, or troubleshooting with library-specific information. Triggers include questions like "Show me React hooks", "How do I use Prisma", "What's the Next.js routing syntax", or any request for library/framework documentation.
browser-use
Browser automation using Playwright MCP. Navigate websites, fill forms, click elements, take screenshots, and extract data. Use when tasks require web browsing, form submission, web scraping, UI testing, or any browser interaction.
Didn't find tool you were looking for?