Agent skill
rtl-debugging
RTL design debugging methodology and reasoning process. Use when investigating test failures, assertion violations, scoreboard mismatches, or analyzing verification results to identify RTL bugs.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/rtl-debugging
SKILL.md
RTL Debugging Methodology
Systematic approach for debugging RTL from verification results and test scenarios.
When to Use This Skill
- Analyzing UVM test failures to identify RTL bugs
- Investigating assertion violations in simulation
- Debugging scoreboard mismatches between expected and actual behavior
- Triaging multiple test failures to find common root causes
- Understanding why specific test scenarios fail while others pass
- Analyzing coverage holes related to bugs
Debugging Workflow (Reasoning Process)
1. Analyze Test Failure Pattern
Objective: Understand which tests fail and why
Questions to answer:
- Which tests pass and which fail? (Pattern analysis)
- Do failures occur in specific test scenarios only?
- Is the failure deterministic or random? (Check with different seeds)
- At what phase does the test fail? (Build, run, scoreboard check)
Evidence sources:
- Test execution logs (sim/logs/)
- Regression test results (sim/reports/)
- UVM report summary (UVM_ERROR, UVM_FATAL locations)
- DSIM …
2. Collect Verification Evidence
Objective: Gather all available evidence from verification components
Verification evidence sources:
From Assertions:
UVM_ERROR @ 1250ns: Assertion 'a_axi_wdata_stable' failed
Location: sim/assertions/axi4_protocol_checker.sv:45
Property: wdata must remain stable when wvalid=1 and wready=0
→ RTL violates AXI4 protocol specification
From Scoreboard:
UVM_ERROR: [SCOREBOARD] Data mismatch detected
Expected: 0xDEADBEEF
Actual: 0xDEADBEE0
Address: 0x1000
Time: 1250ns
→ LSB nibble corrupted, check datapath width or masking logic
From Monitor:
UVM_WARNING: [MONITOR] Unexpected transaction observed
Type: WRITE
Address: 0x1004 (expected: 0x1000)
Possi…
3. Map Evidence to RTL Problem Domain
**Objective**: Translate verification failures to RTL problem categories
**Evidence-to-Problem mapping**:
| Verification Evidence | RTL Problem Domain | Investigation Focus |
|----------------------|-------------------|---------------------|
| **Assertion: Protocol violation** | Interface logic | Check handshake FSM, signal timing |
| **Scoreboard: Data mismatch** | Datapath logic | Check ALU, mux select, forwarding |
| **Scoreboard: Missing transaction** | Control logic | Check enable signals, FSM transitions |
| **Scoreboard: Extra transaction** | Control logic | Check termination conditions, counters |
| **Monitor: Wrong address** | Address generation | Check increment/decrement logic, offset calculation |
| **Monitor: Wrong timing** | Pipeline control | Check stall logic, valid/ready propagation |
| **Assertion: X-propagation** | Reset/initialization | Check reset assignments, case completeness |
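
The X-propagation row of the table can be turned into a concrete check. The following is a minimal sketch, not the project's actual checker: the module name `x_check`, its ports, and the active-low reset polarity are all assumptions made for illustration.

```systemverilog
// Hypothetical checker: flags unknown (X/Z) values once reset is released.
module x_check #(parameter int WIDTH = 32) (
  input logic             clk,
  input logic             rst_n,   // active-low reset (assumed polarity)
  input logic             valid,
  input logic [WIDTH-1:0] data
);
  // Control signals must never be unknown after reset.
  a_valid_known: assert property (
    @(posedge clk) disable iff (!rst_n) !$isunknown(valid)
  ) else $error("valid is X/Z");

  // Data only needs to be known while it is qualified by valid.
  a_data_known: assert property (
    @(posedge clk) disable iff (!rst_n) valid |-> !$isunknown(data)
  ) else $error("data is X/Z while valid=1");
endmodule
```

If one of these fires, the first failing timestamp is a good starting point for checking reset assignments and incomplete case statements upstream of the signal.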
**Test scenario analysis**:
Failing scenario: Back-to-back writes with no idle cycles
Passing scenario: Writes with 2-cycle gaps
Hypothesis generation:
- Pipeline hazard when no bubble between transactions
- Backpressure handling assumes idle cycles
- State machine doesn't handle consecutive valid inputs
- Register forwarding path missing for zero-latency case
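
Before testing these hypotheses, it helps to confirm in simulation that the two stimulus shapes really occur as described. A minimal sketch using cover properties is shown here; the module name `b2b_write_cover` and the reset port are assumptions, and only `wvalid`/`wready` come from the examples in this document.

```systemverilog
// Hypothetical monitor: confirms which stimulus shapes a test actually produces.
module b2b_write_cover (
  input logic clk,
  input logic rst_n,   // active-low reset (assumed polarity)
  input logic wvalid,
  input logic wready
);
  // Two accepted write beats in consecutive cycles (no bubble).
  c_b2b: cover property (
    @(posedge clk) disable iff (!rst_n)
    (wvalid && wready) ##1 (wvalid && wready)
  );

  // An accepted beat, two idle cycles, then another accepted beat.
  c_gap2: cover property (
    @(posedge clk) disable iff (!rst_n)
    (wvalid && wready) ##1 (!wvalid)[*2] ##1 (wvalid && wready)
  );
endmodule
```

If `c_b2b` is hit only in failing tests and `c_gap2` only in passing ones, the failure correlates with the missing bubble rather than with data or address values.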
4. Design Targeted Experiments
**Objective**: Create minimal test to isolate root cause
**Experiment design strategies**:
**Modify existing failing test**:
```systemverilog
// Original failing test: back-to-back writes
// ('sequence' is a reserved word in SystemVerilog, so the handle is named seq)
seq.add_transaction(WRITE, .addr(32'h1000), .data(32'hAA));
seq.add_transaction(WRITE, .addr(32'h1004), .data(32'hBB)); // ← FAILS

// Experiment 1: Add gap between transactions
seq.add_transaction(WRITE, .addr(32'h1000), .data(32'hAA));
seq.add_idle_cycles(2);
seq.add_transaction(WRITE, .addr(32'h1004), .data(32'hBB)); // ← PASS?
// If passes: Confirms pipeline hazard hypothesis

// Experiment 2: Same address back-to-back
seq.add_transaction(WRITE, .addr(32'h1000), .data(32'hAA));
seq.add_transaction(WRITE, .addr(32'h1000), .data(32'hBB)); // ← PASS/FAIL?
// If passes: Problem is address-generation specific
```
Create minimal directed test:
// Hypothesis: Burst counter overflows at length=16
class minimal_burst_test extends base_test;
  virtual task run_phase(uvm_phase phase);
    phase.raise_objection(this);
    // Test exactly at boundary
    send_burst(.addr(32'h0), .length(15)); // Should work
    send_burst(.addr(32'h0), .length(16)); // Should fail
    send_burst(.addr(32'h0), .length(17)); // Should fail
    phase.drop_objection(this);
  endtask
endclass
Add debug assertions:
// Insert temporary assertion at suspected problem point
// A bind needs an instance name after the bound module type
bind axi_slave_fsm debug_assertions u_debug_assertions (
  .clk(clk),
  .state(current_state),
  .wvalid(wvalid),
  .wready(wready)
);
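
The bind statement above needs a `debug_assertions` module to instantiate. One possible body is sketched below. The port list mirrors the bind, but the `STATE_W` parameter and both checks are assumptions for illustration, not the project's actual debug module.

```systemverilog
// Hypothetical body for the bound debug_assertions module.
// STATE_W and the checks are assumptions; adapt them to the real FSM.
module debug_assertions #(parameter int STATE_W = 3) (
  input logic               clk,
  input logic [STATE_W-1:0] state,
  input logic               wvalid,
  input logic               wready
);
  // Log every accepted write beat together with the FSM state in that cycle.
  always @(posedge clk)
    if (wvalid && wready)
      $display("[%0t] write beat accepted, state=%0d", $time, state);

  // A stalled write beat (wvalid=1, wready=0) must still be offered next cycle.
  a_wvalid_held: assert property (
    @(posedge clk) (wvalid && !wready) |=> wvalid
  ) else $error("wvalid dropped before handshake, state=%0d", state);
endmodule
```

Because the module has no reset input, the assertion may need a `disable iff` term if the bound design drives unknown values before reset. Remove the bind once the root cause is found.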
5. Trace from Verification to RTL Root Cause
**Objective**: Navigate from high-level test failure to specific RTL bug
**Top-down tracing workflow**:
- Test Failure
  └─ axiuart_burst_test fails with scoreboard mismatch
- Scoreboard Analysis
  └─ Expected data: 0xBB, Actual: 0xAA
  └─ Read of the second write's address returned the first write's data
- Monitor Analysis (check transactions observed)
  └─ WRITE(addr=0x1000, data=0xAA) @ 1000ns - acknowledged
  └─ WRITE(addr=0x1004, data=0xBB) @ 1002ns - acknowledged
  └─ READ(addr=0x1004) @ 1010ns - returned 0xAA (wrong!)
- Waveform Analysis at 1002ns (second write)
  └─ axi_wdata = 0xBB ✓
  └─ axi_waddr = 0x1004 ✓
  └─ write_enable = 1'b1 ✓
  └─ But: register_select still points to 0x1000 ✗
- RTL Module Analysis …
6. Verify with Test Suite
Objective: Confirm fix resolves issue without breaking other tests
Verification workflow:
Step 1: Re-run failing test
# Run specific test that previously failed
run_uvm_simulation --test axiuart_burst_test --seed 12345
# Expected: PASS
Step 2: Run related tests (test suite partitioning)
# Run all tests that exercise same RTL module
run_uvm_simulation --regression smoke_suite
# Focus: Tests with write transactions, address decoding
Step 3: Full regression
### By Test Failure Type
| Failure Type | Root Cause Category | Investigation Focus |
|-------------|---------------------|---------------------|
| **Scoreboard mismatch: wrong data** | Datapath error | Trace data from source to sink, check mux selects, forwarding |
| **Scoreboard mismatch: missing transaction** | Control flow error | Check FSM transitions, enable signals, counter termination |
| **Scoreboard mismatch: extra transaction** | Control flow error | Check counter overflow, FSM looping, duplicate strobes |
| **Assertion: Protocol violation** | Interface timing | Check handshake sequences, stability requirements, backpressure |
| **Assertion: Stability violation** | Combinational logic | Check for unintended signal changes, glitches, race conditions |
| **Assertion: X-propagation** | Initialization error | Check reset coverage, case statement completeness, undriven signals |
| **Timeout: No response** | Deadlock or FSM stuck | Check FSM for unreachable transitions, missing conditions |
| **UVM_FATAL: Null object** | Verification code bug | Not RTL issue - check testbench configuration |
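
For the "Timeout: No response" row, a bounded-response assertion can report a hang close to its origin instead of at the end-of-test timeout. This is a minimal sketch: the module name, the generic `req`/`ack` ports, and the `MAX_WAIT` bound are all assumptions to be replaced with the real interface signals and a realistic latency limit.

```systemverilog
// Hypothetical watchdog: every new request must be answered within MAX_WAIT cycles.
module handshake_watchdog #(parameter int MAX_WAIT = 64) (
  input logic clk,
  input logic rst_n,  // active-low reset (assumed polarity)
  input logic req,    // request side of the handshake
  input logic ack     // response side of the handshake
);
  a_no_hang: assert property (
    @(posedge clk) disable iff (!rst_n)
    $rose(req) |-> ##[1:MAX_WAIT] ack
  ) else $error("no response within %0d cycles of request", MAX_WAIT);
endmodule
```

The failure time of `a_no_hang` minus `MAX_WAIT` cycles points at the request that was never answered, which narrows the FSM state to inspect.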
### By Test Pass/Fail Pattern
**Pattern: Only random tests fail, directed tests pass**
- **Hypothesis**: Corner case not covered by directed tests
- **Action**: Analyze failing random test stimulus for common characteristics
- **Example**: Random test hits burst length=256, directed tests only ≤16
**Pattern: All tests with feature X fail, others pass**
- **Hypothesis**: Feature X has RTL bug
- **Action**: Focus debug on RTL module implementing feature X
- **Example**: All interrupt tests fail → debug interrupt controller
**Pattern: Intermittent failures with different seeds**
- **Hypothesis**: Race condition or initialization dependency
- **Action**: …
## From Verification Evidence to RTL Root Cause
### Scoreboard-Driven Investigation
**Scoreboard reports data mismatch**:
Step 1: Identify transaction with mismatch
  Monitor: WRITE(addr=0x1000, data=0xDEADBEEF) @ 1000ns
  Scoreboard: Expected 0xDEADBEEF at 0x1000
  Monitor: READ(addr=0x1000) → 0xDEADBEE0 @ 1100ns
  Mismatch: LSB nibble changed 0xF → 0x0
Step 2: Hypothesize based on bit pattern
- All bits except LSB nibble correct → byte masking issue
- LSB nibble zeroed → possible width/alignment problem
Step 3: Check waveform at write cycle (1000ns)
  axi_wdata[31:0] = 0xDEADBEEF ✓
  write_strobe[3:0] = 4'b1111 ✓
  register_wdata[31:0] = 0xDEADBEE0 ✗ ← BUG IS HERE
Step 4: Trace write path
  axi_wdata → data_align_unit → register_wdata
  Check data_align_unit for LSB nibble handling
Step 5: Find root cause in RTL
  // Bug found in data_align_unit
  assign register_wdata = {axi_wdata[31:4], 4'b0000}; // ← Hardcoded zero!
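
Step 2 relies on spotting which bits differ. XOR-ing the expected and actual values makes that mechanical, as in this small sketch; the module and function names are hypothetical, and only the two data values come from the example above.

```systemverilog
// Hypothetical scoreboard helper: shows which bits differ between expected and actual.
module tb_bit_diff;
  // Returns a mask with a 1 at every mismatching bit position.
  function automatic logic [31:0] diff_mask(logic [31:0] expected,
                                            logic [31:0] actual);
    return expected ^ actual;
  endfunction

  initial begin
    logic [31:0] mask;
    mask = diff_mask(32'hDEADBEEF, 32'hDEADBEE0);
    // Only bits [3:0] are set, so the corruption is confined to the low nibble.
    $display("diff mask       = 0x%08h", mask);
    $display("mismatched bits = %0d", $countones(mask));
  end
endmodule
```

A mask confined to one nibble or byte suggests masking or alignment logic, while scattered bits are more consistent with a wrong mux select or stale data.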
### Assertion-Driven Investigation
**Assertion reports protocol violation**:
Assertion 'a_axi_wdata_stable' failed @ 1250ns
  Property: (wvalid && !wready) |=> $stable(wdata)
Step 1: Understand assertion semantics
- wdata must not change when wvalid=1 and wready=0
- This is AXI4 protocol requirement
Step 2: Check waveform at violation timestamp
  @1249ns: wvalid=1, wready=0, wdata=0xAAAA
  @1250ns: wvalid=1, wready=0, wdata=0xBBBB ← Changed illegally
Step 3: Find source of wdata in RTL
  assign wdata = write_fifo_dout;
Step 4: Check FIFO read logic
  assign fifo_read_en = wvalid && wready; ✓ Correct condition
Step 5: Check for other paths affecting wdata
  // Found: Debug logic bypassing FIFO!
  assign wdata = debug_mode ? debug_data : write_fifo_dout;
  // debug_mode changed during backpressure → violation
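
The property from this walkthrough can be packaged as a standalone checker, together with a second check aimed at the cause found in Step 5. This is a sketch under assumptions: the module name, the `DATA_W` parameter, the reset port, and the `a_debug_mode_stable` check are illustrative additions, not part of the original protocol checker.

```systemverilog
// Hypothetical standalone version of the stability check, with reset gating.
module axi_w_stable_check #(parameter int DATA_W = 32) (
  input logic              clk,
  input logic              rst_n,       // active-low reset (assumed polarity)
  input logic              wvalid,
  input logic              wready,
  input logic [DATA_W-1:0] wdata,
  input logic              debug_mode   // mux select found in Step 5
);
  // Symptom-level check: data must hold while the beat is stalled.
  a_axi_wdata_stable: assert property (
    @(posedge clk) disable iff (!rst_n)
    (wvalid && !wready) |=> $stable(wdata)
  ) else $error("wdata changed while wvalid=1 and wready=0");

  // Cause-level check: the data source select must not switch mid-stall.
  a_debug_mode_stable: assert property (
    @(posedge clk) disable iff (!rst_n)
    (wvalid && !wready) |=> $stable(debug_mode)
  ) else $error("debug_mode toggled during backpressure");
endmodule
```

When both assertions fire in the same cycle, the mux select is the likely cause. If only the first fires, the data source itself changed and the FIFO path needs another look.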
### Test Suite Differential Analysis
**Multiple tests analysis**:
| Test Name | Scenario | Result | Common Attribute |
|-----------|----------|--------|------------------|
| basic_write | Single write | ✓ PASS | Burst length = 1 |
| burst4_write | 4-beat burst | ✓ PASS | Burst length = 4 |
## Debugging Techniques from Test Results
### Regression Test Triage
**Analyze multiple test results to find common root cause**:
Regression suite: 42 tests total
- 38 PASS
- 4 FAIL: axiuart_burst16, axiuart_burst32, axiuart_wrap16, axiuart_wrap32
Pattern recognition:
- All failures involve burst length ≥ 16
- Both INCR and WRAP burst types affected
- Burst length ≤ 8 always passes
Common root cause hypothesis:
- Burst counter width insufficient for length ≥ 16
- Not specific to burst type (INCR vs WRAP)
- Not data-pattern dependent
Single fix expected to resolve all 4 failures.
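
The counter-width hypothesis can be illustrated in isolation. The sketch below assumes a 4-bit counter purely for demonstration (the real RTL width is not given here) and shows how lengths of 16 and above alias onto small values.

```systemverilog
// Hypothetical demo: a 4-bit beat counter cannot represent a 16-beat burst.
module tb_counter_width;
  logic [3:0] narrow_cnt;  // assumed buggy width: holds 0..15
  logic [4:0] wide_cnt;    // one extra bit: holds 0..31

  initial begin
    for (int len = 15; len <= 17; len++) begin
      narrow_cnt = len[3:0];  // explicit truncation, as an undersized register would do
      wide_cnt   = len[4:0];
      $display("len=%0d narrow_cnt=%0d wide_cnt=%0d", len, narrow_cnt, wide_cnt);
    end
  end
endmodule
```

With these widths, a length of 16 loads the narrow counter with 0 and a length of 17 loads it with 1, which matches a pattern where 15 passes and 16 and above fail. Sizing the counter as `$clog2(MAX_LEN + 1)` bits avoids the truncation when the counter stores the length itself.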
### Minimal Reproducing Test
**Create simplest test that triggers bug**:
```systemverilog
// Original failing test: 200 lines, 10 minutes runtime
class axiuart_burst16_test extends base_test;
  // Complex randomization, multiple sequences, ...
endclass

// Minimal reproducer: 15 lines, 10 seconds runtime
class minimal_burst16_test extends base_test;
  task run_phase(uvm_phase phase);
    axi_seq seq = axi_seq::type_id::create("seq");
    phase.raise_objection(this);
    // Single burst-16 transaction
    seq.addr = 32'h1000;
    seq.burst_length = 16; // Minimal case that fails
    seq.start(env.agent.sequencer);
    phase.drop_objection(this);
  endtask
endclass
// Run: Still fails with same root cause
// Benefit: Faster debug iteration (10s vs 10min)
```
Test Modification Experiments
Systematically modify test to isolate variable:
…
// Final conclusion: Pure burst length issue, check counter width
Debugging Pitfalls
Don't Debug Without Test Evidence
❌ Wrong: "I think the problem is in module X, let me check the code"
✅ Right: "Test Y failed with scoreboard mismatch at time T, let me analyze the evidence"
Don't Ignore Test Pass/Fail Patterns
❌ Wrong: Debug first failure in isolation, ignore other tests
✅ Right: Analyze which tests pass/fail to identify common characteristics
Don't Trust Single Test Result
❌ Wrong: Test passed once → bug is fixed
✅ Right: Run regression suite (multiple seeds, scenarios) to confirm fix
Don't Modify RTL Without Evidence
❌ Wrong: Change RTL based on intuition, hope test passes
✅ Right: Trace from test failure → scoreboard → monitor → waveform → RTL
Don't Create Tests Without Purpose
❌ Wrong: Write random tests hoping to find bugs
✅ Right: Analyze coverage holes, create tests targeting untested scenarios
Don't Skip Regression After Fix
❌ Wrong: Failing test now passes → Done
✅ Right: Run full regression to ensure fix doesn't break other tests
### Coverage-Guided Root Cause Analysis
**Use coverage to identify untested paths related to bug**:
```systemverilog
// Coverage report after test failures
covergroup cg_burst_length;
  cp_length: coverpoint burst_length {
    bins short[]  = {[1:8]};    // 100% hit
    bins boundary = {15, 16};   // 16 causes failures
    bins long[]   = {[17:256]}; // 0% hit ← Never tested!
  }
endgroup
// Analysis:
// - Tests never tried burst_length > 16
// - Bug might affect all values ≥ 16, not just 16
// - After fix, add test for burst_length=256 to verify
```
## Waveform Analysis from Test Failures
### From Scoreboard Timestamp to Waveform
**Workflow**:
- Test log shows scoreboard error at simulation time 1250ns
  UVM_ERROR: [SCOREBOARD] Data mismatch at addr=0x1000
- Set waveform viewer to time 1250ns
- Identify relevant signals from monitor transaction:
  - axi_awaddr (write address channel)
  - axi_wdata (write data channel)
  - Internal register_file signals
- Check transaction timing:
  @1240ns: awvalid=1, awaddr=0x1000, awready=1 (address accepted)
  @1242ns: wvalid=1, wdata=0xBEEF, wready=1 (data accepted)
  @1250ns: register_file[0] = 0xBEE0 ← Should be 0xBEEF
- Trace internal path:
  axi_wdata (0xBEEF) → write_data_reg (0xBEEF) → data_align (0xBEE0) ← BUG HERE
### Backward Tracing from Assertion
**Assertion fires, trace backward to root cause**:
Assertion violation @ 1250ns:
  a_valid_stable: (valid && !ready) |=> $stable(data)
Waveform analysis:
  @1249ns: valid=1, ready=0, data=0xAAAA
  @1250ns: valid=1, ready=0, data=0xBBBB ← Violated $stable()
Trace data signal backward:
  data ← output_mux
  output_mux ← select between fifo_out and bypass_data
  mux_select changed at 1250ns ← WHY?
Trace mux_select …
RTL debugging from verification results is evidence-driven investigation:
- Analyze test failure patterns - Which tests fail? What do they have in common?
- Collect verification evidence - Scoreboard, assertions, monitors, logs
- Map evidence to RTL problem domain - Translate test failure to RTL category
- Design targeted experiments - Create minimal tests to isolate root cause
- Trace from verification to RTL - Navigate from test → scoreboard → waveform → RTL
- Verify with test suite - Confirm fix with regression, add prevention tests
Key principle: Test results guide investigation. Start from verification evidence (test failures, assertion violations, scoreboard mismatches), not from reading RTL code.
By Affected Component
Datapath issues:
- Check operand widths, sign extension, overflow handling
- Verify bypass/forwarding conditions
- Trace data flow from source to destination
Control logic issues:
- Draw state transition diagram from code
- Verify all states are reachable
- Check for conflicting control signals
Interface issues:
- Review protocol timing diagrams
- Check handshake signal relationships (valid before ready, stable until accepted)
- Verify backpressure handling
Hypothesis Generation Strategies
Backwards Tracing
Start at the failure point and work backwards:
- Identify the first wrong signal at failure timestamp
- Find all signals that directly drive it (combinational or registered)
- Check if those signals are correct one cycle earlier
- Repeat until you find where correct values become incorrect
Dependency Analysis
Map signal dependencies:
output_wrong [time=1250ns]
├─ driven by: alu_result (combinational)
│ ├─ operand_a (registered at 1249ns) ✓ correct
│ ├─ operand_b (registered at 1249ns) ✗ INCORRECT
│ └─ operation (registered at 1249ns) ✓ correct
└─ operand_b driven by: bypass_mux
├─ mem_result (registered at 1248ns) ✓ correct
├─ ex_result (registered at 1249ns) ✗ INCORRECT
└─ bypass_select ✗ WRONG MUX SELECT ← ROOT CAUSE
Differential Diagnosis
Compare working vs failing cases:
| Aspect | Working Case | Failing Case | Insight |
|---|---|---|---|
| Input pattern | 0x00000001 | 0x80000000 | MSB triggers bug |
| Execution path | State A→B→C | State A→B→D | Transition B→D buggy |
| Timing | No stalls | Pipeline stall | Stall logic incorrect |
Verification Techniques
Assertion-Based Isolation
Insert temporary assertions to partition the design:
// Check: Does problem occur before or after this pipeline stage?
property p_debug_stage2_input;
@(posedge clk) stage2_valid |-> stage2_input inside {[0:1000]};
endproperty
assert property (p_debug_stage2_input)
else $error("Problem exists at stage2 input");
Minimal Reproducer
Reduce test case to absolute minimum:
- Start with failing test
- Remove stimulus that doesn't affect failure
- Shorten simulation time to just before failure
- Remove unrelated RTL modules
- Result: ~20 line testbench, ~50 line RTL
Benefits: Faster iteration, easier to share, clearer root cause
Force/Release Experiments
Test hypotheses by overriding signals:
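// Hypothesis: Bug disappears if bypass is disabled
initial begin
  #100ns;
  force top.cpu.bypass_enable = 1'b0;
  // Observe if problem still occurs, then hand the signal back to the design
  #1us; // observation window (arbitrary length, adjust to the test)
  release top.cpu.bypass_enable;
end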
Caution: Only for debugging, never in production code
Coverage-Guided Debugging
Use coverage holes to identify untested scenarios:
covergroup cg_state_transitions @(posedge clk);
cp_current: coverpoint state;
cp_next: coverpoint state_next;
cross cp_current, cp_next; // Are all transitions covered?
endgroup
If bug occurs: Check if failing scenario corresponds to coverage hole
Common Pitfalls
Don't Trust Assumptions
❌ Wrong: "Signal X is always stable, so I won't check it"
✅ Right: Add assertion to verify assumption, then proceed
Don't Skip Symptom Observation
❌ Wrong: Jump straight to suspected module and start modifying
✅ Right: Observe exact failure in waveform, then form hypothesis
Don't Fix Symptoms
❌ Wrong: Add logic to mask the symptom without understanding root cause
✅ Right: Trace to root cause, fix it, verify symptom disappears
Don't Test Multiple Changes at Once
❌ Wrong: Make 3 changes simultaneously, rerun simulation
✅ Right: Change one thing at a time, verify effect
Waveform Analysis Patterns
Cause → Effect Tracing
- Find the symptom signal at failure timestamp
- Look 1-2 cycles back for potential causes
- Check if cause signals deviated from expected
- Repeat backwards until finding the origin
Critical Path Analysis
Identify longest combinational path:
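// RTL simulation is zero-delay, so it cannot measure path length.
// Estimate logic depth by reading the code, then confirm with synthesis/STA timing reports.
always_comb begin
  logic [31:0] temp1, temp2, temp3;
  temp1 = input_a & input_b;   // 1 logic level
  temp2 = temp1 | input_c;     // 1 logic level
  temp3 = temp2 ^ input_d;     // 1 logic level
  output_z = temp3 + input_e;  // 32-bit adder: many logic levels (carry chain)
  // Total: 3 logic levels plus a 32-bit adder - may violate timing
end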
Clock Domain Crossing Detection
Look for signals crossing without proper synchronization:
Clock A domain: signal_a toggles at time 1250ns
Clock B domain: signal_b samples signal_a at 1251ns
↑ METASTABILITY RISK if clocks unrelated
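
For comparison, a properly synchronized single-bit crossing usually passes through a two-flop synchronizer in the destination domain. The sketch below is one common form. The module and port names are hypothetical, and it suits only single-bit signals that change slowly relative to the destination clock.

```systemverilog
// Two-flop synchronizer sketch for a single-bit, slowly changing signal.
// Not suitable for multi-bit buses; those need a handshake or an async FIFO.
module sync_2ff (
  input  logic clk_b,         // destination clock
  input  logic rst_b_n,       // destination-domain reset, active low (assumed)
  input  logic signal_a,      // launched from the clock A domain
  output logic signal_a_sync  // safe to use in the clock B domain
);
  logic meta;  // first stage: may go metastable, never used directly

  always_ff @(posedge clk_b or negedge rst_b_n) begin
    if (!rst_b_n) begin
      meta          <= 1'b0;
      signal_a_sync <= 1'b0;
    end else begin
      meta          <= signal_a;
      signal_a_sync <= meta;
    end
  end
endmodule
```

Plain RTL simulation does not model metastability, so a missing synchronizer rarely shows up as a deterministic test failure. When debugging, look for clock B logic that reads `signal_a` directly instead of a synchronized copy.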
Integration with Other Skills
- dsim-debugging: Use when DSIM tool itself has issues (environment, waves, logs)
- rtl-coding-standards: Apply when fixing identified bugs to maintain code quality
- assertion-design: Create permanent assertions for bugs found during debugging
- mcp-workflow: Use MCP commands to compile/run debug experiments quickly
Summary
RTL debugging is systematic reasoning:
- Reproduce the problem reliably
- Observe symptoms without assumptions
- Generate hypotheses based on evidence
- Test each hypothesis independently
- Narrow down to single root cause
- Verify fix and prevent regression
Key principle: Evidence over intuition. Always trace from observed symptoms to root cause using waveforms and assertions.