Agent skill

benchmark-kernel

Guide for benchmarking FlashInfer kernels with CUPTI timing


Install this agent skill to your Project

npx add-skill https://github.com/aiskillstore/marketplace/tree/main/skills/flashinfer-ai/benchmark-kernel

SKILL.md

Tutorial: Benchmarking FlashInfer Kernels

This tutorial shows you how to accurately benchmark FlashInfer kernels.

Goal

Measure the performance of FlashInfer kernels:

  • Get accurate GPU kernel execution time
  • Compare multiple backends (FlashAttention2/3, cuDNN, CUTLASS, TensorRT-LLM)
  • Generate reproducible benchmark results
  • Save results to CSV for analysis

Timing Methods

FlashInfer supports two timing methods:

  1. CUPTI (Preferred): Hardware-level profiling for the most accurate GPU kernel time

    • Measures pure GPU compute time without host-device overhead
    • Requires cupti-python >= 13.0.0 (CUDA 13+)
  2. CUDA Events (Fallback): Standard CUDA event timing

    • Automatically used if CUPTI is not available
    • Good accuracy, slight overhead from host synchronization

The framework automatically uses CUPTI if available, otherwise falls back to CUDA events.
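The selection rule described above can be sketched in a few lines. This is a minimal illustration, not FlashInfer's actual implementation: the helper name `pick_timing_method` is hypothetical, and the sketch assumes that the `cupti-python` distribution installs an importable top-level module named `cupti`.

```python
import importlib.util
from importlib import metadata


def pick_timing_method(min_major: int = 13) -> str:
    """Return "cupti" when a new-enough cupti-python is importable, else "cuda_events"."""
    # Assumption: the cupti-python wheel exposes a top-level module named `cupti`.
    if importlib.util.find_spec("cupti") is None:
        return "cuda_events"
    try:
        version = metadata.version("cupti-python")
    except metadata.PackageNotFoundError:
        return "cuda_events"
    # Compare only the major version against the documented 13.0.0 minimum.
    major = version.split(".")[0]
    if major.isdigit() and int(major) >= min_major:
        return "cupti"
    return "cuda_events"


print(pick_timing_method())
```

Running this before a benchmark session tells you which timing path to expect on the current machine.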

Installation

Install CUPTI (Recommended)

For the most accurate benchmarking:

bash
pip install -U cupti-python

Requirements: CUDA 13+ (CUPTI version 13+)

Without CUPTI

If you don't install CUPTI, the framework will:

  • Print a warning: CUPTI is not installed. Falling back to CUDA events.
  • Automatically use CUDA events for timing
  • Still provide good benchmark results

Method 1: Using flashinfer_benchmark.py (Recommended)

Step 1: Choose Your Test Routine

Available routines:

  • Attention: BatchDecodeWithPagedKVCacheWrapper, BatchPrefillWithPagedKVCacheWrapper, BatchPrefillWithRaggedKVCacheWrapper, BatchMLAPagedAttentionWrapper
  • GEMM: bmm_fp8, gemm_fp8_nt_groupwise, group_gemm_fp8_nt_groupwise, mm_fp4
  • MOE: trtllm_fp4_block_scale_moe, trtllm_fp8_block_scale_moe, trtllm_fp8_per_tensor_scale_moe, cutlass_fused_moe

Step 2: Run a Single Benchmark

Example - Benchmark decode attention:

bash
# CUPTI will be used automatically if installed
python benchmarks/flashinfer_benchmark.py \
    --routine BatchDecodeWithPagedKVCacheWrapper \
    --backends fa2 fa2_tc cudnn \
    --page_size 16 \
    --batch_size 32 \
    --s_qo 1 \
    --s_kv 2048 \
    --num_qo_heads 32 \
    --num_kv_heads 8 \
    --head_dim_qk 128 \
    --head_dim_vo 128 \
    --q_dtype bfloat16 \
    --kv_dtype bfloat16 \
    --num_iters 30 \
    --dry_run_iters 5 \
    --refcheck \
    -vv

Example - Benchmark FP8 GEMM:

bash
python benchmarks/flashinfer_benchmark.py \
    --routine bmm_fp8 \
    --backends cudnn cublas cutlass \
    --batch_size 256 \
    --m 1 \
    --n 1024 \
    --k 7168 \
    --input_dtype fp8_e4m3 \
    --mat2_dtype fp8_e4m3 \
    --out_dtype bfloat16 \
    --refcheck \
    -vv \
    --generate_repro_command

Timing behavior:

  • ✅ If CUPTI installed: Uses CUPTI (most accurate)
  • ⚠️ If CUPTI not installed: Automatically falls back to CUDA events with warning
  • 🔧 To force CUDA events: Add --use_cuda_events flag

Step 3: Understand the Output

[INFO] FlashInfer version: 0.6.0
[VVERBOSE] gpu_name = 'NVIDIA_H100_PCIe'
[PERF] fa2            :: median time 0.145 ms; std 0.002 ms; achieved tflops 125.3 TFLOPs/sec; achieved tb_per_sec 1.87 TB/sec
[PERF] fa2_tc         :: median time 0.138 ms; std 0.001 ms; achieved tflops 131.5 TFLOPs/sec; achieved tb_per_sec 1.96 TB/sec
[PERF] cudnn          :: median time 0.142 ms; std 0.001 ms; achieved tflops 127.8 TFLOPs/sec; achieved tb_per_sec 1.91 TB/sec

Key metrics:

  • median time: Median kernel execution time in milliseconds (lower is better)
  • std: Standard deviation of the measured times (lower means more consistent)
  • achieved tflops: Achieved compute throughput in TFLOPS (higher is better)
  • achieved tb_per_sec: Achieved memory bandwidth in TB/s (higher is better)
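To make the relationship between these four numbers concrete, the sketch below reduces a list of per-iteration times to the same metrics. The function name `summarize` and all input values are hypothetical, and the source does not say whether the tool reports a sample or population standard deviation; the sketch uses the sample form.

```python
import statistics


def summarize(times_ms, flops, bytes_moved):
    """Reduce per-iteration times (ms) to the four reported metrics."""
    median_ms = statistics.median(times_ms)
    std_ms = statistics.stdev(times_ms)  # sample std; needs at least 2 measurements
    seconds = median_ms / 1e3
    return {
        "median_ms": median_ms,
        "std_ms": std_ms,
        "tflops": flops / seconds / 1e12,
        "tb_per_sec": bytes_moved / seconds / 1e12,
    }


# Hypothetical measurements and operation counts, for illustration only.
stats = summarize([0.150, 0.140, 0.145, 0.143, 0.147], flops=2.0e9, bytes_moved=2.9e8)
print(stats)
```

Both throughput figures are derived from the median time, so a noisy run affects them only through the median, while std reports the spread separately.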

Step 4: Run Batch Benchmarks

Create a test list file my_benchmarks.txt:

bash
--routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 cudnn --page_size 16 --batch_size 32 --s_kv 2048 --num_qo_heads 32 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128
--routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 cudnn --page_size 16 --batch_size 64 --s_kv 4096 --num_qo_heads 32 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128
--routine bmm_fp8 --backends cudnn cutlass --batch_size 256 --m 1 --n 1024 --k 7168 --input_dtype fp8_e4m3 --mat2_dtype fp8_e4m3 --out_dtype bfloat16
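For larger sweeps, a test list like the one above can be generated rather than typed by hand. The following is a small sketch that uses only flags already shown in this tutorial; the helper name `build_testlist` and the script name in the usage note are hypothetical.

```python
import itertools

# Flags shared by every line; values copied from the decode example above.
COMMON = (
    "--routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 cudnn "
    "--page_size 16 --num_qo_heads 32 --num_kv_heads 8 "
    "--head_dim_qk 128 --head_dim_vo 128"
)


def build_testlist(batch_sizes, kv_lens):
    """Return one command-line string per (batch_size, s_kv) combination."""
    return [
        f"{COMMON} --batch_size {b} --s_kv {s}"
        for b, s in itertools.product(batch_sizes, kv_lens)
    ]


lines = build_testlist([32, 64], [2048, 4096])
print("\n".join(lines))
```

Saved as, say, make_testlist.py, the output can be redirected into the list file with `python make_testlist.py > my_benchmarks.txt`.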

Run all tests:

bash
python benchmarks/flashinfer_benchmark.py \
    --testlist my_benchmarks.txt \
    --output_path results.csv \
    --generate_repro_command \
    --refcheck

Results are saved to results.csv with all metrics and reproducer commands.
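Once the CSV exists, a few lines of Python are enough to pick the fastest backend per routine. The sketch below is illustrative only: the column names `routine`, `backend` and `median_time` are assumptions, so check the header row of your own results.csv, and the inline sample data is made up so that the example runs on its own.

```python
import csv
import io

# Stand-in for results.csv with made-up values; the column names are assumptions.
SAMPLE = """routine,backend,median_time
bmm_fp8,cudnn,0.021
bmm_fp8,cutlass,0.019
BatchDecodeWithPagedKVCacheWrapper,fa2,0.145
BatchDecodeWithPagedKVCacheWrapper,cudnn,0.142
"""


def fastest_backend(rows):
    """Map each routine to the (backend, median time) pair with the lowest median."""
    best = {}
    for row in rows:
        t = float(row["median_time"])
        if row["routine"] not in best or t < best[row["routine"]][1]:
            best[row["routine"]] = (row["backend"], t)
    return best


best = fastest_backend(csv.DictReader(io.StringIO(SAMPLE)))
for routine, (backend, t) in best.items():
    print(f"{routine}: {backend} ({t:.3f} ms)")
```

To read a real file, replace `io.StringIO(SAMPLE)` with an open file handle. If the test list contains several shapes per routine, the grouping key would also need the shape columns.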

Step 5: Common Flags

| Flag | Description | Default |
|---|---|---|
| --num_iters | Measurement iterations | 30 |
| --dry_run_iters | Warmup iterations | 5 |
| --refcheck | Verify output correctness | False |
| --allow_output_mismatch | Continue on mismatch | False |
| --use_cuda_events | Force CUDA events (skip CUPTI) | False |
| --no_cuda_graph | Disable CUDA graph | False |
| -vv | Very verbose output | - |
| --generate_repro_command | Print reproducer command | False |
| --case_tag | Tag for CSV output | None |

Method 2: Using bench_gpu_time() in Python

For custom benchmarking in your own code:

Step 1: Write Your Benchmark Script

python
import torch
from flashinfer.testing import bench_gpu_time

# Set up your kernel
def my_kernel_wrapper(q, k, v):
    output = ...  # Replace with your kernel call
    return output

# Create test inputs
device = torch.device("cuda")
q = torch.randn(32, 8, 128, dtype=torch.bfloat16, device=device)
k = torch.randn(2048, 8, 128, dtype=torch.bfloat16, device=device)
v = torch.randn(2048, 8, 128, dtype=torch.bfloat16, device=device)

# Benchmark - CUPTI preferred, CUDA events if CUPTI unavailable
median_time, std_time = bench_gpu_time(
    my_kernel_wrapper,
    args=(q, k, v),
    enable_cupti=True,          # Prefer CUPTI, fallback to CUDA events
    num_iters=30,               # Number of iterations
    dry_run_iters=5,            # Warmup iterations
)

print(f"Kernel time: {median_time:.3f} ms ± {std_time:.3f} ms")

# Calculate FLOPS if you know the operation count
flops = ...  # Your FLOP count
tflops = (flops / 1e12) / (median_time / 1000)
print(f"Achieved: {tflops:.2f} TFLOPS/sec")

Note: If CUPTI is not installed, you'll see a warning and the function will automatically use CUDA events instead.
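The script above leaves the FLOP count as a placeholder. A common estimate counts only the matrix multiplications, with each multiply-add as 2 FLOPs. The helpers below are a sketch under that assumption: the function names are hypothetical, softmax and masking are ignored, and the source does not state that flashinfer_benchmark.py uses exactly these formulas.

```python
def attention_flops(batch_size, s_qo, s_kv, num_qo_heads, head_dim_qk, head_dim_vo):
    """Matmul FLOPs for dense (non-causal) attention: QK^T plus PV."""
    qk = 2 * batch_size * num_qo_heads * s_qo * s_kv * head_dim_qk
    pv = 2 * batch_size * num_qo_heads * s_qo * s_kv * head_dim_vo
    return qk + pv


def bmm_flops(batch_size, m, n, k):
    """FLOPs for a batched (m x k) @ (k x n) matmul."""
    return 2 * batch_size * m * n * k


# Shapes taken from the decode and FP8 GEMM command lines earlier in this tutorial.
print(attention_flops(32, 1, 2048, 32, 128, 128))  # 1073741824
print(bmm_flops(256, 1, 1024, 7168))               # 3758096384
```

Either result can be assigned to `flops` in the script to turn the measured median time into a throughput figure.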

Step 2: Run Your Benchmark

bash
python my_benchmark.py

Output with CUPTI:

Kernel time: 0.145 ms ± 0.002 ms
Achieved: 125.3 TFLOPS/sec

Output without CUPTI (automatic fallback):

[WARNING] CUPTI is not installed. Try 'pip install -U cupti-python'. Falling back to CUDA events.
Kernel time: 0.147 ms ± 0.003 ms
Achieved: 124.1 TFLOPS/sec

Step 3: Advanced Options

python
# Cold L2 cache benchmarking (optional)
median_time, std_time = bench_gpu_time(
    my_kernel,
    args=(x, y),
    enable_cupti=True,          # Will use CUDA events if CUPTI unavailable
    cold_l2_cache=True,         # Flush L2 or rotate buffers automatically
    num_iters=30
)

# Force CUDA events (skip CUPTI even if installed)
median_time, std_time = bench_gpu_time(
    my_kernel,
    args=(x, y),
    enable_cupti=False,         # Explicitly use CUDA events
    num_iters=30
)
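The comment on `cold_l2_cache` mentions rotating buffers. One way to think about rotation is to cycle through enough copies of the inputs that their combined size exceeds the L2 cache, so no call finds its inputs still cached. The sketch below only illustrates that arithmetic: the helper name, the 2x safety factor and the 50 MB L2 size are assumptions, not values taken from FlashInfer.

```python
import math


def num_rotating_buffers(bytes_per_call, l2_bytes, safety=2.0):
    """Copies of the inputs needed so that one full rotation overflows L2."""
    return max(1, math.ceil(safety * l2_bytes / bytes_per_call))


# q, k, v shapes from the script above; bfloat16 takes 2 bytes per element.
bytes_per_call = 2 * (32 * 8 * 128 + 2 * 2048 * 8 * 128)
copies = num_rotating_buffers(bytes_per_call, l2_bytes=50 * 1024**2)
print(bytes_per_call, copies)
```

Small inputs need many copies and large inputs need few, which is why a flag that handles this automatically is convenient.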

Troubleshooting

CUPTI Warning Message

Warning: CUPTI is not installed. Falling back to CUDA events.

What it means: CUPTI is not available, so the framework uses CUDA events instead.

Impact: Timing is less accurate for very fast kernels (5-50 µs) because of synchronization overhead; the difference becomes negligible for longer-running kernels.

Solution (optional): Install CUPTI for best accuracy:

bash
pip install -U cupti-python

If installation fails, check:

  • CUDA version >= 13
  • Compatible cupti-python version

You can still run benchmarks without CUPTI - the framework handles this automatically.
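To see why the fallback matters mostly for short kernels, consider a fixed per-measurement overhead added to every reading. The 2 µs figure below is purely hypothetical and chosen for illustration; it is not a measured FlashInfer or CUDA value.

```python
def relative_error(kernel_us, overhead_us):
    """Fraction by which a fixed per-measurement overhead inflates the reading."""
    return overhead_us / kernel_us


OVERHEAD_US = 2.0  # hypothetical fixed overhead, not a measured value
for kernel_us in (5, 50, 1000):
    print(f"{kernel_us:>5} us kernel: +{relative_error(kernel_us, OVERHEAD_US):.1%}")
```

Under this assumption the same overhead inflates a 5 µs kernel by 40% but a 1 ms kernel by only 0.2%.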

Inconsistent Results

Problem: Large standard deviation or varying results

Solutions:

  1. Increase warmup iterations:

    bash
    --dry_run_iters 10
    
  2. Increase measurement iterations:

    bash
    --num_iters 50
    
  3. Use cold L2 cache (in Python):

    python
    bench_gpu_time(..., cold_l2_cache=True)
    
  4. Disable GPU boost by locking the GPU clocks (advanced; undo with sudo nvidia-smi -rgc):

    bash
    sudo nvidia-smi -lgc <base_clock>
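A quick way to decide whether a run is noisy enough to need these remedies is to compare the spread with the median. The sketch below uses the coefficient of variation; the helper name `is_noisy` and the 5% cutoff are arbitrary assumptions, and the timing lists are made up.

```python
import statistics


def is_noisy(times_ms, max_cv=0.05):
    """True when std / median exceeds max_cv (5% by default, an arbitrary cutoff)."""
    cv = statistics.stdev(times_ms) / statistics.median(times_ms)
    return cv > max_cv


stable = [0.145, 0.146, 0.144, 0.145, 0.147]
jittery = [0.145, 0.190, 0.120, 0.160, 0.210]
print(is_noisy(stable), is_noisy(jittery))
```

If a run is flagged, rerun with more warmup or measurement iterations before comparing backends.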
    

Reference Check Failures

Error: [ERROR] Output mismatch between backends

What it means: Different backends produce different results

Solutions:

  1. Allow mismatch and continue:

    bash
    --allow_output_mismatch
    
  2. Check numerical tolerance: Some backends use different precisions (FP32 vs FP16)

  3. Investigate the difference:

    bash
    -vv  # Very verbose mode shows tensor statistics
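When judging whether a mismatch is a real bug or ordinary rounding, it helps to compare against a tolerance sized for the data type. bfloat16 values just above 1.0 are spaced 2**-7 apart, so differences of a few times that can arise from rounding and accumulation order alone. The sketch below uses only the standard library; the helper name and tolerance values are illustrative assumptions, not the thresholds that --refcheck applies.

```python
import math

BF16_EPS = 2.0 ** -7  # spacing of bfloat16 values just above 1.0


def outputs_match(ref, out, rel_tol=4 * BF16_EPS, abs_tol=1e-3):
    """Element-wise closeness check; the default tolerances are illustrative."""
    return all(
        math.isclose(r, o, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, o in zip(ref, out)
    )


ref = [0.5000, -1.2500, 3.0000]
out = [0.5020, -1.2460, 3.0100]
print(outputs_match(ref, out))                # bf16-sized tolerance
print(outputs_match(ref, out, rel_tol=1e-5))  # fp32-sized tolerance
```

The same pair of outputs passes with the bfloat16-sized tolerance and fails with the much tighter one, which is the pattern to expect when two backends differ only in precision.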
    

Backend Not Supported

Error: [WARNING] fa3 for routine ... is not supported on compute capability X.X

Solution: Check the backend support matrix in benchmarks/README.md, or remove that backend from the --backends list.

Best Practices

  1. Install CUPTI for best accuracy (but not required):

    bash
    pip install -U cupti-python
    
  2. Use reference checking to verify correctness:

    bash
    --refcheck
    
  3. Use verbose mode to see input shapes and dtypes:

    bash
    -vv
    
  4. Generate reproducer commands for sharing results:

    bash
    --generate_repro_command
    
  5. Run multiple iterations for statistical significance:

    bash
    --num_iters 30 --dry_run_iters 5
    
  6. Save results to CSV for later analysis:

    bash
    --output_path results.csv
    
  7. Compare multiple backends to find the best:

    bash
    --backends fa2 fa3 cudnn cutlass
    

Quick Examples

Decode Attention (H100)

bash
python benchmarks/flashinfer_benchmark.py \
    --routine BatchDecodeWithPagedKVCacheWrapper \
    --backends fa2 fa2_tc cudnn trtllm-gen \
    --page_size 16 --batch_size 128 --s_kv 8192 \
    --num_qo_heads 64 --num_kv_heads 8 \
    --head_dim_qk 128 --head_dim_vo 128 \
    --refcheck -vv --generate_repro_command

Prefill Attention (Multi-head)

bash
python benchmarks/flashinfer_benchmark.py \
    --routine BatchPrefillWithRaggedKVCacheWrapper \
    --backends fa2 fa3 cudnn cutlass \
    --batch_size 16 --s_qo 1024 --s_kv 1024 \
    --num_qo_heads 128 --num_kv_heads 128 \
    --head_dim_qk 192 --head_dim_vo 128 \
    --causal --random_actual_seq_len \
    --q_dtype bfloat16 --kv_dtype bfloat16 \
    --refcheck -vv

FP8 GEMM (Batched)

bash
python benchmarks/flashinfer_benchmark.py \
    --routine bmm_fp8 \
    --backends cudnn cublas cutlass \
    --batch_size 256 --m 1 --n 1024 --k 7168 \
    --input_dtype fp8_e4m3 --mat2_dtype fp8_e4m3 \
    --out_dtype bfloat16 \
    --refcheck -vv

MOE (DeepSeek-style routing)

bash
python benchmarks/flashinfer_benchmark.py \
    --routine trtllm_fp8_block_scale_moe \
    --backends trtllm \
    --num_tokens 1024 --hidden_size 5120 \
    --intermediate_size 13824 --num_experts 256 \
    --top_k 8 --n_group 8 --topk_group 1 \
    --routing_method deepseek_v3 \
    --routed_scaling_factor 2.5 \
    --use_routing_bias \
    -vv

Summary: CUPTI vs CUDA Events

| Aspect | CUPTI (Preferred) | CUDA Events (Fallback) |
|---|---|---|
| Accuracy | Highest (hardware-level) | Good (slight overhead) |
| Installation | pip install cupti-python | Built-in with CUDA |
| Requirements | CUDA 13+ | Any CUDA version |
| Fallback | N/A | Automatic if CUPTI unavailable |
| When to use | Always (if available) | When CUPTI can't be installed |

Recommendation: Install CUPTI for best results, but benchmarks work fine without it.

Next Steps

  • Profile kernels with nsys or ncu for detailed analysis
  • Debug performance issues using FLASHINFER_LOGLEVEL=3
  • Compare with baselines using reference implementations
  • Optimize kernels based on profiling results

Related Documentation

  • See benchmarks/README.md for full flag documentation
  • See benchmarks/samples/sample_testlist.txt for more examples
  • See CLAUDE.md "Benchmarking" section for technical details
