Agent skill

nsight-profiler

Expert skill for NVIDIA Nsight Systems and Nsight Compute profiling tools. Configure profiling sessions, analyze kernel reports, interpret occupancy metrics, roofline model data, memory bandwidth bottlenecks, and warp execution efficiency.

Install this agent skill to your Project

npx add-skill https://github.com/a5c-ai/babysitter/tree/main/library/specializations/gpu-programming/skills/nsight-profiler

Metadata

Additional technical details for this skill

author: babysitter-sdk
version: 1.0.0
category: performance-profiling
backlog id: SK-002

SKILL.md

nsight-profiler

You are nsight-profiler - a specialized skill for NVIDIA Nsight Systems and Nsight Compute profiling tools. This skill provides expert capabilities for performance analysis and optimization of GPU applications.

Overview

This skill enables AI-powered GPU profiling operations including:

  • Configure and execute Nsight Systems profiling sessions
  • Analyze Nsight Compute kernel reports
  • Interpret occupancy metrics and SM utilization
  • Parse and visualize roofline model data
  • Identify memory bandwidth bottlenecks
  • Analyze warp execution efficiency
  • Generate optimization recommendations from profiler data
  • Compare kernel performance across different configurations

Prerequisites

  • NVIDIA Nsight Systems 2023.1+
  • NVIDIA Nsight Compute 2023.1+
  • CUDA Toolkit 11.0+
  • GPU with compute capability 7.0+ (for full profiling features)

Capabilities

1. Nsight Systems Profiling

System-wide performance analysis:

bash
# Basic system profile
nsys profile -o report ./cuda_program

# Trace CUDA API calls, NVTX ranges, and OS runtime library calls
nsys profile -t cuda,nvtx,osrt -o report ./cuda_program

# Capture GPU metrics
nsys profile --gpu-metrics-device=all -o report ./cuda_program

# Profile specific duration
nsys profile -d 10 -o report ./cuda_program

# Export to SQLite and JSON (one format per invocation)
nsys export --type sqlite report.nsys-rep
nsys export --type json report.nsys-rep

# Generate summary statistics
nsys stats report.nsys-rep
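
The SQLite export can also be queried directly when the built-in summaries are not enough. The following is a minimal sketch, not an official API: it assumes the export contains a `CUPTI_ACTIVITY_KIND_KERNEL` table with `start`, `end`, and `shortName` columns and a `StringIds` table with `id` and `value` columns. Table and column names can change between Nsight Systems versions, so check the schema of your own export first. The function name `top_kernels` is hypothetical.

```python
import sqlite3


def top_kernels(conn, limit=5):
    """Return (kernel_name, launches, total_ns) rows, longest total GPU time first.

    Assumed schema (verify against your export):
      CUPTI_ACTIVITY_KIND_KERNEL(start, end, shortName, ...)
      StringIds(id, value)
    """
    query = """
        SELECT s.value                  AS name,
               COUNT(*)                 AS launches,
               SUM(k."end" - k."start") AS total_ns
        FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
        JOIN StringIds AS s ON s.id = k.shortName
        GROUP BY s.value
        ORDER BY total_ns DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()
```

A typical call would be `top_kernels(sqlite3.connect("report.sqlite"))`, where `report.sqlite` is the file produced by the SQLite export.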

2. Nsight Compute Profiling

Detailed kernel analysis:

bash
# Profile all kernels
ncu -o profile ./cuda_program

# Profile specific kernel
ncu --kernel-name myKernel -o profile ./cuda_program

# Full metric collection
ncu --set full -o profile ./cuda_program

# Roofline analysis
ncu --set roofline -o profile ./cuda_program

# Memory analysis
ncu --section MemoryWorkloadAnalysis -o profile ./cuda_program

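# Compare two runs: capture one report per run, then compare the reports
# (add one report as a baseline in the Nsight Compute GUI, or diff CSV exports)
ncu -o baseline ./cuda_program
ncu -o candidate ./cuda_program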

3. Occupancy Analysis

Analyze and optimize occupancy:

bash
# Collect occupancy metrics
ncu --section Occupancy -o occupancy ./cuda_program

# Key metrics to analyze:
# - Achieved Occupancy
# - Theoretical Occupancy
# - Block Limit (registers, shared memory, warps)
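# - Occupancy Limiter

cuda
// Query theoretical occupancy in code (myKernel is the kernel under test)
int numBlocks = 0;
int blockSize = 256;
size_t sharedMemSize = 0;  // dynamic shared memory per block, in bytes

cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, 0);  // device 0

// Maximum number of resident blocks per SM for this launch configuration
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks, myKernel, blockSize, sharedMemSize);

float occupancy = (numBlocks * blockSize) /
    (float)deviceProp.maxThreadsPerMultiProcessor;
printf("Theoretical Occupancy: %.2f%%\n", occupancy * 100);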
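
To reason about which resource limits occupancy before touching the GPU, the same ratio can be estimated offline. The sketch below is a simplified model, not what `cudaOccupancyMaxActiveBlocksPerMultiprocessor` computes internally: the per-SM limits are illustrative defaults (assumptions, not values for any particular GPU), and register and shared memory allocation granularity is ignored, so the estimate can be optimistic. The function name `theoretical_occupancy` is hypothetical.

```python
def theoretical_occupancy(block_size, regs_per_thread, smem_per_block,
                          max_threads_per_sm=2048, max_blocks_per_sm=32,
                          regs_per_sm=65536, smem_per_sm=65536):
    """Estimate resident blocks per SM, occupancy, and the limiting resource.

    All per-SM limits are assumed defaults; read the real ones from
    cudaDeviceProp for the target device.
    """
    limits = {
        "threads": max_threads_per_sm // block_size,
        "blocks": max_blocks_per_sm,
        "registers": regs_per_sm // (regs_per_thread * block_size),
    }
    if smem_per_block > 0:
        limits["shared_memory"] = smem_per_sm // smem_per_block

    # The resource that allows the fewest resident blocks is the limiter.
    limiter = min(limits, key=limits.get)
    blocks = limits[limiter]
    occupancy = blocks * block_size / max_threads_per_sm
    return blocks, occupancy, limiter


# 64 registers per thread halves occupancy under these assumed limits.
print(theoretical_occupancy(256, 64, 0))
```

Comparing the reported limiter with the Occupancy Limiter shown by Nsight Compute is a quick sanity check on the assumed limits.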

4. Roofline Model Analysis

Performance bound analysis:

bash
# Generate roofline data
ncu --set roofline -o roofline ./cuda_program

# Key metrics:
# - Achieved FLOP/s
# - Achieved Memory Bandwidth
# - Arithmetic Intensity (FLOP/byte)
# - Ridge Point

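Interpretation guide:

  • Arithmetic intensity left of the ridge point: memory bound (limited by the sloped bandwidth roof)
  • Arithmetic intensity right of the ridge point: compute bound (limited by the flat compute roof)
  • Close to the limiting roof: near-optimal use of that resource; a large gap below it points to latency, occupancy, or instruction-mix issues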
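
The ridge point is the arithmetic intensity at which the bandwidth roof meets the compute roof, that is, peak FLOP/s divided by peak memory bandwidth. The following minimal sketch classifies a kernel from those quantities. The function name and the 800 GB/s peak bandwidth in the example are assumptions for illustration; take real peaks from the roofline chart or the device specification.

```python
def classify_roofline(arithmetic_intensity, achieved_gflops,
                      peak_gflops, peak_bw_gbs):
    """Classify a kernel against a simple two-roof roofline model.

    arithmetic_intensity is in FLOP/byte, peak_bw_gbs in GB/s.
    """
    ridge_point = peak_gflops / peak_bw_gbs
    # Attainable performance is capped by whichever roof is lower at this intensity.
    attainable_gflops = min(peak_gflops, arithmetic_intensity * peak_bw_gbs)
    bound = "memory" if arithmetic_intensity < ridge_point else "compute"
    return {
        "ridge_point": ridge_point,
        "attainable_gflops": attainable_gflops,
        "bound": bound,
        "fraction_of_roof": achieved_gflops / attainable_gflops,
    }


# Assumed peaks: 8000 GFLOP/s and 800 GB/s (hypothetical device).
print(classify_roofline(12.5, 4500, 8000, 800))
```

A low `fraction_of_roof` means the kernel sits well below the roof that limits it, so the next step is usually the warp state and memory sections rather than the roofline itself.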

5. Memory Bandwidth Analysis

Identify memory bottlenecks:

bash
# Memory analysis sections
ncu --section MemoryWorkloadAnalysis \
    --section MemoryWorkloadAnalysis_Chart \
    --section MemoryWorkloadAnalysis_Tables \
    -o memory ./cuda_program

Key metrics:

  • Global Load/Store Throughput
  • L1/L2 Cache Hit Rate
  • Shared Memory Bandwidth
  • Memory Transactions per Request

6. Warp Execution Analysis

Analyze warp efficiency:

bash
# Warp state analysis
ncu --section WarpStateStatistics -o warp ./cuda_program

# Scheduler statistics
ncu --section SchedulerStatistics -o scheduler ./cuda_program

Key metrics:

  • Warp Cycles Per Issued Instruction
  • Eligible Warps Per Active Cycle
  • Active Warps Per Scheduler
  • Stall Reasons (memory, sync, execution)
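
When reading the stall breakdown, it often helps to rank the reasons by their share of sampled stalls. This is a minimal sketch over a plain dictionary of sample counts; the reason names and counts in the example are illustrative placeholders, not output copied from a real report, and the function name `rank_stalls` is hypothetical.

```python
def rank_stalls(stall_samples):
    """Return (reason, share) pairs sorted by share of total stall samples."""
    total = sum(stall_samples.values())
    if total == 0:
        return []
    shares = [(reason, count / total) for reason, count in stall_samples.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Illustrative sample counts per stall reason.
for reason, share in rank_stalls({"barrier": 300, "long_scoreboard": 600, "not_selected": 100}):
    print(f"{reason}: {share:.0%}")
```

The top one or two reasons usually decide which section to inspect next, for example the memory sections when memory-related stalls dominate.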

7. Kernel Comparison

Compare kernel variants:

bash
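# Baseline capture
ncu -o baseline ./program_v1

# Capture the new version
ncu -o optimized ./program_v2

# Export both reports to CSV for an external diff
# (for interactive comparison, open both reports in the Nsight Compute GUI and add one as a baseline)
ncu --import baseline.ncu-rep --page details --csv > baseline.csv
ncu --import optimized.ncu-rep --page details --csv > optimized.csv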
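
Two reports can also be compared outside the tool. The sketch below assumes each report was exported separately with `ncu --import <report>.ncu-rep --page details --csv` and that the CSV has `Kernel Name`, `Metric Name`, and `Metric Value` columns; column names may differ between Nsight Compute versions, so treat them as assumptions. If a kernel is launched more than once, later rows overwrite earlier ones in this simplified version. The function names are hypothetical.

```python
import csv
import io


def load_metrics(csv_text):
    """Map (kernel name, metric name) -> numeric value from details-page CSV text."""
    metrics = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            key = (row["Kernel Name"], row["Metric Name"])
            value = float(row["Metric Value"].replace(",", ""))
        except (KeyError, ValueError, AttributeError):
            continue  # skip non-numeric or malformed rows
        metrics[key] = value
    return metrics


def percent_change(baseline_csv, optimized_csv):
    """Return percent change per (kernel, metric) present in both exports."""
    base = load_metrics(baseline_csv)
    opt = load_metrics(optimized_csv)
    changes = {}
    for key in base.keys() & opt.keys():
        if base[key] != 0:
            changes[key] = (opt[key] - base[key]) * 100.0 / base[key]
    return changes
```

Calling `percent_change(open("baseline.csv").read(), open("optimized.csv").read())` yields a dictionary in which a negative value means the metric decreased; whether that is an improvement depends on the metric.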

8. Performance Recommendations

Automated analysis:

bash
# Get optimization recommendations
ncu --section SpeedOfLight \
    --section SpeedOfLight_RooflineChart \
    -o speedoflight ./cuda_program

# Export with recommendations
ncu --import profile.ncu-rep --page details --csv > details.csv

Common Profiling Workflows

Workflow 1: Initial Performance Assessment

bash
# Step 1: System overview
nsys profile -t cuda -o system_overview ./program
nsys stats system_overview.nsys-rep

# Step 2: Profile a few launches after warm-up (hot kernels are the top entries in the nsys stats kernel summary)
ncu --launch-skip 10 --launch-count 5 -o hot_kernels ./program

# Step 3: Deep dive on bottleneck kernel
ncu --kernel-name hotKernel --set full -o detailed ./program

Workflow 2: Memory Optimization

bash
# Analyze memory access patterns
ncu --section SourceCounters \
    --section MemoryWorkloadAnalysis \
    --kernel-name targetKernel \
    -o memory_analysis ./program

# Check for coalescing issues
ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,\
l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum \
    -o coalescing ./program
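
The two metrics above are usually read as a ratio: sectors per request. For a fully active warp of 32 threads loading 4 bytes each, a perfectly coalesced request touches 128 contiguous bytes, which is 4 sectors of 32 bytes; a fully scattered access can touch up to 32. The following minimal sketch computes that ratio and an efficiency relative to the ideal. The function names are hypothetical, and the defaults assume 4-byte elements and all 32 threads active.

```python
def sectors_per_request(sectors, requests):
    """Average number of 32-byte sectors touched per global load request."""
    return sectors / requests if requests else 0.0


def coalescing_efficiency(sectors, requests, bytes_per_thread=4,
                          warp_size=32, sector_bytes=32):
    """Ratio of ideal to measured sectors per request, capped at 1.0."""
    # Ideal: the warp's bytes packed into as few sectors as possible (ceiling division).
    ideal = -(-(bytes_per_thread * warp_size) // sector_bytes)
    actual = sectors_per_request(sectors, requests)
    return min(1.0, ideal / actual) if actual else 0.0


# 32 sectors per request for 4-byte loads suggests strided or scattered access.
print(sectors_per_request(32000, 1000), coalescing_efficiency(32000, 1000))
```

Values well above the ideal for the element size are a hint to look at the access pattern in the SourceCounters section.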

Workflow 3: Occupancy Optimization

bash
# Profile with occupancy focus
ncu --section Occupancy \
    --section LaunchStatistics \
    -o occupancy ./program

# Test different block sizes
# (assumes ./program accepts its own --block-size argument)
for bs in 64 128 256 512 1024; do
    ncu --section Occupancy -o "occ_$bs" ./program --block-size "$bs"
done

Process Integration

This skill integrates with the following processes:

  • performance-profiling-analysis.js - Performance analysis workflow
  • occupancy-optimization.js - Occupancy optimization
  • warp-efficiency-optimization.js - Warp efficiency
  • gpu-memory-optimization.js - Memory optimization

Output Format

When executing operations, provide structured output:

json
{
  "operation": "kernel-profile",
  "tool": "nsight-compute",
  "kernel": "matrixMultiply",
  "metrics": {
    "duration_us": 125.4,
    "achieved_occupancy": 0.78,
    "theoretical_occupancy": 1.0,
    "compute_throughput_pct": 65.2,
    "memory_throughput_pct": 89.3,
    "roofline": {
      "arithmetic_intensity": 12.5,
      "achieved_gflops": 4500,
      "peak_gflops": 8000,
      "bound": "compute"
    }
  },
  "recommendations": [
    "Increase block size to improve occupancy",
    "Consider loop unrolling to reduce instruction overhead"
  ],
  "artifacts": ["profile.ncu-rep", "summary.csv"]
}
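
One way to produce and sanity-check this structure is sketched below. The helper names `build_report` and `validate_report` are hypothetical, and the checks are only examples of consistency rules one might apply; they are not part of the skill's defined contract.

```python
import json

REQUIRED_KEYS = ("operation", "tool", "kernel", "metrics",
                 "recommendations", "artifacts")


def build_report(kernel, metrics, recommendations, artifacts):
    """Assemble a kernel-profile result in the structure shown above."""
    return {
        "operation": "kernel-profile",
        "tool": "nsight-compute",
        "kernel": kernel,
        "metrics": metrics,
        "recommendations": list(recommendations),
        "artifacts": list(artifacts),
    }


def validate_report(report):
    """Return a list of problems; an empty list means no issue was found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in report]
    metrics = report.get("metrics", {})
    achieved = metrics.get("achieved_occupancy")
    theoretical = metrics.get("theoretical_occupancy")
    if achieved is not None and theoretical is not None and achieved > theoretical:
        problems.append("achieved_occupancy exceeds theoretical_occupancy")
    for name in ("compute_throughput_pct", "memory_throughput_pct"):
        value = metrics.get(name)
        if value is not None and not 0 <= value <= 100:
            problems.append(f"{name} out of range: {value}")
    return problems


report = build_report(
    kernel="matrixMultiply",
    metrics={"duration_us": 125.4, "achieved_occupancy": 0.78,
             "theoretical_occupancy": 1.0},
    recommendations=["Increase block size to improve occupancy"],
    artifacts=["profile.ncu-rep"],
)
print(json.dumps(report, indent=2))
```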

Dependencies

  • Nsight Systems 2023.1+
  • Nsight Compute 2023.1+
  • CUDA Toolkit 11.0+

Constraints

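  • Access to GPU performance counters may require root/admin privileges unless the driver is configured to allow profiling for all users
  • Some metrics are only available on specific GPU architectures
  • Profiling adds overhead; timings measured under the profiler may differ from production
  • Nsight Compute serializes kernel launches and may replay each profiled kernel several times to collect all requested metrics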
