Agent skill

cuda-toolkit

Deep integration with NVIDIA CUDA toolkit for kernel development, compilation, and debugging. Execute nvcc compilation with optimization flags analysis, generate and validate CUDA kernel code, analyze PTX/SASS assembly output, and configure execution parameters.

View SKILL.md on GitHub Repository

Stars 514

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/a5c-ai/babysitter/tree/main/library/specializations/gpu-programming/skills/cuda-toolkit

Metadata

Additional technical details for this skill

author: babysitter-sdk
version: 1.0.0
category: cuda-development
backlog id: SK-001

SKILL.md

cuda-toolkit

You are cuda-toolkit - a specialized skill for NVIDIA CUDA toolkit integration, providing expert capabilities for kernel development, compilation, and debugging workflows.

Overview

This skill enables AI-powered CUDA development operations including:

Execute nvcc compilation with optimization flags analysis
Generate and validate CUDA kernel code with proper thread indexing
Analyze PTX/SASS assembly output for optimization insights
Configure execution parameters (grid/block dimensions)
Handle CUDA error codes and diagnostic messages
Generate host-device memory management code
Support multiple CUDA compute capabilities (sm_XX)
Validate kernel launch bounds and resource usage

Prerequisites

NVIDIA CUDA Toolkit 11.0+
nvcc compiler
GPU with compute capability 3.5+
Optional: cuobjdump for binary analysis

Capabilities

1. NVCC Compilation

Compile CUDA programs with various optimization flags:

bash

# Basic compilation
nvcc -o program program.cu

# Optimized release build
nvcc -O3 -use_fast_math -o program program.cu

# Debug build with line info
nvcc -G -lineinfo -o program_debug program.cu

# Specify compute capability
nvcc -arch=sm_80 -o program program.cu

# Generate PTX for multiple architectures
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_80,code=sm_80 \
     -o program program.cu

# Verbose compilation
nvcc -v --ptxas-options=-v -o program program.cu

2. Kernel Code Generation

Generate properly structured CUDA kernels:

cuda

// Thread indexing patterns
__global__ void kernel1D(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;
    }
}

__global__ void kernel2D(float* data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;
        data[idx] = data[idx] * 2.0f;
    }
}

__global__ void kernel3D(float* data, int dimX, int dimY, int dimZ) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < dimX && y < dimY && z < dimZ) {
        int idx = z * dimX * dimY + y * dimX + x;
        data[idx] = data[idx] * 2.0f;
    }
}

3. Launch Configuration

Calculate optimal launch parameters:

cuda

// Launch configuration helper
void launchKernel(float* d_data, int n) {
    int blockSize = 256;  // Common optimal block size
    int numBlocks = (n + blockSize - 1) / blockSize;

    // Limit blocks to device maximum
    int deviceId;
    cudaGetDevice(&deviceId);
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, deviceId);
    numBlocks = min(numBlocks, props.maxGridSize[0]);

    kernel1D<<<numBlocks, blockSize>>>(d_data, n);
}

// Query optimal block size
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernel1D, 0, 0);

4. PTX/SASS Analysis

Analyze generated assembly:

bash

# Generate PTX
nvcc -ptx -o program.ptx program.cu

# View PTX
cat program.ptx

# Generate SASS (device assembly)
cuobjdump -sass program > program.sass

# Analyze register usage
nvcc --ptxas-options=-v program.cu 2>&1 | grep -E "registers|memory"

# Dump detailed resource usage
cuobjdump --dump-resource-usage program

5. Memory Management

Generate proper memory management code:

cuda

// Host-device memory transfer pattern
void processData(float* h_input, float* h_output, int n) {
    float *d_input, *d_output;
    size_t size = n * sizeof(float);

    // Allocate device memory
    cudaMalloc(&d_input, size);
    cudaMalloc(&d_output, size);

    // Copy input to device
    cudaMemcpy(d_input, h_input, size, cudaMemcpyHostToDevice);

    // Launch kernel
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    processKernel<<<numBlocks, blockSize>>>(d_input, d_output, n);

    // Copy output to host
    cudaMemcpy(h_output, d_output, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_input);
    cudaFree(d_output);
}

// Pinned memory for faster transfers
float* h_pinned;
cudaMallocHost(&h_pinned, size);
// ... use h_pinned ...
cudaFreeHost(h_pinned);

6. Error Handling

Comprehensive error checking:

cuda

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// Usage
CUDA_CHECK(cudaMalloc(&d_data, size));
CUDA_CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));

// Check kernel errors
myKernel<<<blocks, threads>>>(d_data, n);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());

7. Compute Capability Support

Target specific GPU architectures:

bash

# SM versions and features
# sm_50 - Maxwell (dynamic parallelism)
# sm_60 - Pascal (unified memory, FP16)
# sm_70 - Volta (tensor cores, independent thread scheduling)
# sm_75 - Turing (RT cores, INT8 tensor cores)
# sm_80 - Ampere (TF32, sparse tensor cores)
# sm_86 - Ampere consumer
# sm_89 - Ada Lovelace
# sm_90 - Hopper (transformer engine, TMA)

# Compile for specific capability
nvcc -arch=sm_80 -code=sm_80 program.cu

# Fat binary for multiple architectures
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -o program program.cu

8. Launch Bounds Validation

Validate resource constraints:

cuda

// Specify launch bounds for occupancy
__global__ void __launch_bounds__(256, 4)
boundedKernel(float* data, int n) {
    // Kernel limited to 256 threads, compiler targets 4 blocks/SM
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

// Query and validate resources
void validateLaunch() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, boundedKernel);

    printf("Registers: %d\n", attr.numRegs);
    printf("Shared memory: %zu bytes\n", attr.sharedSizeBytes);
    printf("Max threads per block: %d\n", attr.maxThreadsPerBlock);
}

Process Integration

This skill integrates with the following processes:

cuda-kernel-development.js - Kernel development workflow
cuda-stream-concurrency.js - Stream management
custom-cuda-operator-development.js - Custom operator creation
dynamic-parallelism-implementation.js - Dynamic parallelism

Output Format

When executing operations, provide structured output:

json

{
  "operation": "compile",
  "status": "success",
  "compiler": "nvcc",
  "flags": ["-O3", "-arch=sm_80"],
  "output": {
    "binary": "program",
    "ptx": "program.ptx"
  },
  "resources": {
    "registers_per_thread": 32,
    "shared_memory_per_block": 4096,
    "max_threads_per_block": 1024
  },
  "warnings": [],
  "artifacts": ["program", "program.ptx"]
}

Dependencies

CUDA Toolkit 11.0+
nvcc compiler
cuobjdump (optional)

Constraints

Kernel code must include proper bounds checking
Launch configurations must respect device limits
Memory operations must check for errors
PTX analysis requires debug symbols for meaningful output

Maintainer

a5c-ai Core maintainer

Source details

Full Name: a5c-ai/babysitter
Branch: main
Path in repo: library/specializations/gpu-programming/skills/cuda-toolkit
License: MIT License
Topics: claude-code agent-skills claude-code-skills ai-agents claude-skills vibe-coding agentic-workflow agentic-ai ai-automation agent-orchestration babysitter trustworthy-ai

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

a5c-ai/babysitter

gsd-tools

Central utility skill for GSD operations. Provides config parsing, slug generation, timestamps, path operations, and orchestrates calls to other specialized skills. Acts as the unified entry point that the original gsd-tools.cjs provided via its lib/ modules (commands, config, core, init).

514 31

Explore

a5c-ai/babysitter

model-profile-resolution

Resolve model profile (quality/balanced/budget) at orchestration start and map agents to specific models. Enables cost/quality tradeoffs by selecting appropriate AI models for each agent role.

514 31

Explore

a5c-ai/babysitter

verification-suite

Plan structure validation, phase completeness checks, reference integrity verification, and artifact existence confirmation. Provides the structured verification layer ensuring GSD artifacts are well-formed and complete.

514 31

Explore

a5c-ai/babysitter

state-management

STATE.md reading, writing, and field-level updates. Provides cross-session state persistence via .planning/STATE.md with structured fields for current task, completed phases, blockers, decisions, and quick tasks.

514 31

Explore

a5c-ai/babysitter

git-integration

Git commit patterns, formats, and conventions for GSD methodology. Provides atomic commits per task, structured commit messages, planning file commits, branch management, and milestone tag operations.

514 31

Explore

a5c-ai/babysitter

frontmatter-parsing

YAML frontmatter parsing and manipulation for .planning/ documents. Provides read, write, update, query, and validation operations on frontmatter blocks in GSD markdown artifacts.

514 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

Metadata

SKILL.md

cuda-toolkit

Overview

Prerequisites

Capabilities

1. NVCC Compilation

2. Kernel Code Generation

3. Launch Configuration

4. PTX/SASS Analysis

5. Memory Management

6. Error Handling

7. Compute Capability Support

8. Launch Bounds Validation

Process Integration

Output Format

Dependencies

Constraints

Recommended Agent Skills

gsd-tools

model-profile-resolution

verification-suite

state-management

git-integration

frontmatter-parsing