Agent skill

cuda-debugging

Expert skill for GPU debugging using CUDA-GDB and NVIDIA Compute Sanitizer. Detect memory errors, race conditions, uninitialized memory access, validate atomic operations, analyze kernel synchronization issues, and generate debugging reports with recommendations.

View SKILL.md on GitHub Repository

Stars 514

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/a5c-ai/babysitter/tree/main/library/specializations/gpu-programming/skills/cuda-debugging

Metadata

Additional technical details for this skill

author: babysitter-sdk
version: 1.0.0
category: debugging
backlog id: SK-010

SKILL.md

cuda-debugging

You are cuda-debugging - a specialized skill for GPU debugging and error detection using NVIDIA's Compute Sanitizer and CUDA-GDB tools. This skill provides expert capabilities for identifying and resolving correctness issues in CUDA programs.

Overview

This skill enables AI-powered GPU debugging operations including:

Executing compute-sanitizer memory checks (memcheck)
Detecting race conditions with racecheck tool
Identifying memory leaks and invalid accesses
Using CUDA-GDB for kernel debugging
Analyzing kernel synchronization issues
Validating atomic operation correctness
Detecting uninitialized memory access (initcheck)
Generating debugging reports with actionable recommendations

Prerequisites

NVIDIA CUDA Toolkit 11.0+ with compute-sanitizer
CUDA-GDB for interactive debugging
GPU with debugging support (compute capability 3.5+)
Debug build of CUDA application (-G -lineinfo flags)
Optional: Nsight Visual Studio Code Extension

Capabilities

1. Memory Error Detection (Memcheck)

Detect memory access errors and leaks:

bash

# Basic memory check
compute-sanitizer --tool memcheck ./cuda_program

# With detailed error reporting
compute-sanitizer --tool memcheck --report-api-errors all ./cuda_program

# Log errors to file
compute-sanitizer --tool memcheck --log-file memcheck.log ./cuda_program

# Check for memory leaks
compute-sanitizer --tool memcheck --leak-check full ./cuda_program

# Track allocations
compute-sanitizer --tool memcheck --track-alloc-dealloc yes ./cuda_program

Common memory errors detected:

Out-of-bounds global memory access
Misaligned memory access
Invalid global memory access
Memory leaks (device allocations not freed)
Double free errors
Invalid device pointer operations

2. Race Condition Detection (Racecheck)

Detect shared memory data access hazards:

bash

# Basic race check
compute-sanitizer --tool racecheck ./cuda_program

# With detailed analysis
compute-sanitizer --tool racecheck --racecheck-report all ./cuda_program

# Save analysis to file
compute-sanitizer --tool racecheck --save racecheck.nvsanreport ./cuda_program

# Analyze previous run
compute-sanitizer --tool racecheck --import racecheck.nvsanreport --print-analysis ./cuda_program

Race condition types detected:

Write-after-read (WAR) hazards
Write-after-write (WAW) hazards
Read-after-write (RAW) hazards
Bank conflicts in shared memory
Synchronization-related races

3. Uninitialized Memory Detection (Initcheck)

Detect uninitialized global memory access:

bash

# Basic initcheck
compute-sanitizer --tool initcheck ./cuda_program

# Track all memory accesses
compute-sanitizer --tool initcheck --track-unused-memory yes ./cuda_program

# With error details
compute-sanitizer --tool initcheck --show-backtrace yes ./cuda_program

4. Synchronization Validation (Synccheck)

Detect illegal synchronization in CUDA code:

bash

# Basic synccheck
compute-sanitizer --tool synccheck ./cuda_program

# With detailed reporting
compute-sanitizer --tool synccheck --show-backtrace all ./cuda_program

Synchronization issues detected:

Divergent __syncthreads() calls
Invalid thread block synchronization
Illegal cooperative groups usage
Missing synchronization barriers

5. CUDA-GDB Debugging Commands

Interactive debugging with CUDA-GDB:

bash

# Launch CUDA-GDB
cuda-gdb ./cuda_program

# Common debugging commands
(cuda-gdb) set cuda memcheck on        # Enable memory checking
(cuda-gdb) set cuda break_on_launch    # Break at kernel launch
(cuda-gdb) break kernel_name           # Set breakpoint at kernel
(cuda-gdb) run                         # Start execution

# Thread navigation
(cuda-gdb) info cuda threads           # List all GPU threads
(cuda-gdb) cuda thread (0,0,0) (0,0,0) # Switch to specific thread
(cuda-gdb) cuda block                  # Show current block
(cuda-gdb) cuda kernel                 # Show current kernel

# Memory inspection
(cuda-gdb) print *d_array@10           # Print device array
(cuda-gdb) print __shared_memory__     # Inspect shared memory
(cuda-gdb) info cuda devices           # List CUDA devices

# Stepping through code
(cuda-gdb) cuda step                   # Step one warp instruction
(cuda-gdb) cuda next                   # Step over function calls
(cuda-gdb) continue                    # Continue execution

6. Common Debugging Patterns

Pattern 1: Memory Bounds Checking

cuda

// Add bounds checking to kernel
__global__ void safeKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Bounds check
    if (idx >= n) return;

    // Safe access
    data[idx] = data[idx] * 2.0f;
}

Pattern 2: Shared Memory Synchronization

cuda

__global__ void reductionKernel(float* input, float* output, int n) {
    __shared__ float sdata[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Load to shared memory
    sdata[tid] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();  // Required before reading shared memory

    // Reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();  // Required after each reduction step
    }

    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}

Pattern 3: Atomic Operation Validation

cuda

// Validate atomic operations
__global__ void atomicTest(int* counter, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Use atomicAdd for thread-safe increment
        atomicAdd(counter, 1);
    }
}

// Verify result on host
int h_counter;
cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
assert(h_counter == n);  // Should equal number of threads

7. Error Code Handling

Comprehensive CUDA error checking:

cuda

// Error checking macro
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// Usage
CUDA_CHECK(cudaMalloc(&d_data, size));
CUDA_CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));

// Check for kernel errors
myKernel<<<blocks, threads>>>(d_data, n);
CUDA_CHECK(cudaGetLastError());       // Check launch errors
CUDA_CHECK(cudaDeviceSynchronize());  // Check execution errors

8. Debugging Report Generation

Generate comprehensive debugging reports:

bash

# Full debugging session
compute-sanitizer --tool memcheck \
    --report-api-errors all \
    --show-backtrace yes \
    --log-file debug_report.txt \
    ./cuda_program 2>&1 | tee debug_output.log

# Summary report generation
echo "=== CUDA Debugging Report ===" > debug_summary.md
echo "Date: $(date)" >> debug_summary.md
echo "" >> debug_summary.md
echo "## Memory Check Results" >> debug_summary.md
compute-sanitizer --tool memcheck ./cuda_program 2>&1 >> debug_summary.md
echo "" >> debug_summary.md
echo "## Race Check Results" >> debug_summary.md
compute-sanitizer --tool racecheck ./cuda_program 2>&1 >> debug_summary.md

MCP Server Integration

This skill can leverage the following MCP servers:

Server	Description	Installation
claude-debugs-for-you	Interactive debugging via Claude	GitHub

Best Practices

Debugging Build Configuration

makefile

# Debug build flags
DEBUG_FLAGS = -G -lineinfo -Xcompiler -rdynamic -O0

# Release build with symbols
RELEASE_FLAGS = -O3 -lineinfo

# Compile for debugging
nvcc $(DEBUG_FLAGS) -o program_debug program.cu

# Compile for profiling (with symbols)
nvcc $(RELEASE_FLAGS) -o program_release program.cu

Debugging Strategy

Start with memcheck - Catches most common errors
Run racecheck if results are inconsistent - Finds synchronization bugs
Use initcheck for data corruption - Finds uninitialized reads
Profile after correctness - Don't optimize buggy code

Common Pitfalls

Issue	Symptom	Solution
Uncoalesced access	Memory errors at specific offsets	Align data to 128 bytes
Missing sync	Intermittent wrong results	Add `__syncthreads()`
Out of bounds	Access violation errors	Add bounds checking
Uninitialized shared memory	Random values	Initialize before use

Process Integration

This skill integrates with the following processes:

gpu-debugging-techniques.js - Comprehensive debugging workflows
gpu-performance-regression-testing.js - Correctness verification
atomic-operations-synchronization.js - Synchronization validation

Output Format

When executing operations, provide structured output:

json

{
  "operation": "memory-check",
  "status": "errors_found",
  "tool": "compute-sanitizer",
  "summary": {
    "total_errors": 3,
    "memory_errors": 2,
    "leak_errors": 1
  },
  "errors": [
    {
      "type": "Invalid __global__ read",
      "size": 4,
      "address": "0x7f1234567890",
      "location": {
        "file": "kernel.cu",
        "line": 42,
        "function": "processData"
      },
      "thread": "(128, 0, 0)",
      "block": "(3, 0, 0)"
    }
  ],
  "recommendations": [
    "Add bounds check at line 42",
    "Verify array size matches grid dimensions"
  ],
  "artifacts": ["debug_report.txt", "memcheck.log"]
}

Error Handling

Common Issues

Error	Cause	Resolution
`Invalid __global__ read`	Out-of-bounds access	Add bounds checking
`Potential WAW hazard`	Missing synchronization	Add `__syncthreads()`
`Memory leak`	Missing cudaFree	Free all allocations
`Uninitialized __global__ read`	Reading before write	Initialize memory

Constraints

Debug builds are significantly slower than release builds
Compute-sanitizer adds overhead; don't use in production
Some race conditions may not appear consistently
GPU must support debugging (sm_35+)
CUDA-GDB requires X11 forwarding for remote debugging

Maintainer

a5c-ai Core maintainer

Source details

Full Name: a5c-ai/babysitter
Branch: main
Path in repo: library/specializations/gpu-programming/skills/cuda-debugging
License: MIT License
Topics: claude-code agent-skills claude-code-skills ai-agents claude-skills vibe-coding agentic-workflow agentic-ai ai-automation agent-orchestration babysitter trustworthy-ai

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

a5c-ai/babysitter

gsd-tools

Central utility skill for GSD operations. Provides config parsing, slug generation, timestamps, path operations, and orchestrates calls to other specialized skills. Acts as the unified entry point that the original gsd-tools.cjs provided via its lib/ modules (commands, config, core, init).

514 31

Explore

a5c-ai/babysitter

model-profile-resolution

Resolve model profile (quality/balanced/budget) at orchestration start and map agents to specific models. Enables cost/quality tradeoffs by selecting appropriate AI models for each agent role.

514 31

Explore

a5c-ai/babysitter

verification-suite

Plan structure validation, phase completeness checks, reference integrity verification, and artifact existence confirmation. Provides the structured verification layer ensuring GSD artifacts are well-formed and complete.

514 31

Explore

a5c-ai/babysitter

state-management

STATE.md reading, writing, and field-level updates. Provides cross-session state persistence via .planning/STATE.md with structured fields for current task, completed phases, blockers, decisions, and quick tasks.

514 31

Explore

a5c-ai/babysitter

git-integration

Git commit patterns, formats, and conventions for GSD methodology. Provides atomic commits per task, structured commit messages, planning file commits, branch management, and milestone tag operations.

514 31

Explore

a5c-ai/babysitter

frontmatter-parsing

YAML frontmatter parsing and manipulation for .planning/ documents. Provides read, write, update, query, and validation operations on frontmatter blocks in GSD markdown artifacts.

514 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

Metadata

SKILL.md

cuda-debugging

Overview

Prerequisites

Capabilities

1. Memory Error Detection (Memcheck)

2. Race Condition Detection (Racecheck)

3. Uninitialized Memory Detection (Initcheck)

4. Synchronization Validation (Synccheck)

5. CUDA-GDB Debugging Commands

6. Common Debugging Patterns

Pattern 1: Memory Bounds Checking

Pattern 2: Shared Memory Synchronization

Pattern 3: Atomic Operation Validation

7. Error Code Handling

8. Debugging Report Generation

MCP Server Integration

Best Practices

Debugging Build Configuration

Debugging Strategy

Common Pitfalls

Process Integration

Output Format

Error Handling

Common Issues

Constraints

Recommended Agent Skills

gsd-tools

model-profile-resolution

verification-suite

state-management

git-integration

frontmatter-parsing