MechInterp Cross-Model Matcher

Match features between the Ultra (24K features) and Full (2K features) SAE models to understand feature correspondence and discover monosemantic representations.

Purpose

The cross-model matcher skill:

Finds corresponding features across models
Computes similarity based on top token overlap
Identifies features unique to each model
Helps validate interpretations across model scales

When to Use

Use this skill when you:

Have interpreted a feature in one model and want to find its counterpart
Want to validate that a pattern exists across model scales
Need to understand which patterns the Ultra model splits into separate features that the Full model keeps merged

Usage

Programmatic

python

from splatnlp.mechinterp.analysis import FeatureMatcher
from splatnlp.mechinterp.skill_helpers import load_context

# Load source context (the model with your known feature)
source_ctx = load_context("ultra")

# Initialize matcher (automatically loads target model)
matcher = FeatureMatcher(source_ctx)

# Find matches for an Ultra feature in the Full model
report = matcher.find_matches(
    source_feature=18712,
    n_candidates=500,  # How many Full features to check
    n_top_matches=10   # How many matches to return
)

# View results
print(f"Searched {report.n_candidates_tested} candidates")
print(f"Best correlation: {report.best_correlation:.3f}")

for match in report.matches:
    print(f"\nFull feature {match.target_feature}:")
    print(f"  Token overlap: {match.top_token_overlap:.3f}")
    print(f"  Shared tokens: {match.shared_top_tokens[:5]}")
    print(f"  Notes: {match.notes}")

Detailed Comparison

python

# Compare two specific features in detail
comparison = matcher.compare_features(
    source_fid=18712,  # Ultra feature
    target_fid=1024,   # Full feature
)

print(f"Jaccard similarity: {comparison['jaccard_similarity']:.3f}")
print(f"Shared tokens: {comparison['shared_tokens'][:10]}")
print(f"Ultra-only tokens: {comparison['source_only_tokens'][:10]}")
print(f"Full-only tokens: {comparison['target_only_tokens'][:10]}")

Matching Metrics

Token Overlap (Jaccard Similarity)

Compares the top-token sets of the source and target features (source_top and target_top below):

overlap = |source_top ∩ target_top| / |source_top ∪ target_top|

> 0.3: Strong match - likely same underlying concept
0.1 - 0.3: Moderate match - related but not identical
< 0.1: Weak match - probably different concepts
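
As a minimal sketch of the metric and bands above: the helper names `jaccard_overlap` and `classify_match` and the token lists are illustrative assumptions, not part of `splatnlp`, and the library's own computation may differ in details such as how many top tokens it considers.

```python
def jaccard_overlap(source_top, target_top):
    """Jaccard similarity of two collections of top tokens."""
    source_set, target_set = set(source_top), set(target_top)
    union = source_set | target_set
    if not union:
        return 0.0  # assumed convention: two empty sets count as no overlap
    return len(source_set & target_set) / len(union)


def classify_match(overlap):
    """Map an overlap score to the strong / moderate / weak bands."""
    if overlap > 0.3:
        return "strong"
    if overlap >= 0.1:
        return "moderate"
    return "weak"


# Made-up token lists, for illustration only
source_top = ["a", "b", "c", "d"]
target_top = ["c", "d", "e", "f"]

overlap = jaccard_overlap(source_top, target_top)  # 2 shared / 6 in the union
print(f"{overlap:.3f} -> {classify_match(overlap)}")
```

Because the denominator is the union, two features that each have many top tokens need a large shared set to reach the strong band.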

Interpretation

High overlap suggests:

Features detect the same pattern
Ultra feature may be a "refinement" of Full feature
Good candidate for cross-model validation

Low overlap with similar activation patterns suggests:

Ultra model has decomposed the Full feature
Multiple Ultra features may combine to match one Full feature
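
One way to probe the decomposition case is to check how much of a Full feature's top-token set is covered by the union of several Ultra features, even when each Ultra feature overlaps only modestly on its own. The sketch below assumes you already have the top-token lists in hand; `union_coverage` and the token lists are hypothetical and not part of `splatnlp`.

```python
def union_coverage(full_top, ultra_tops):
    """Fraction of the Full feature's top tokens that appear in the
    top tokens of at least one of the given Ultra features."""
    full_set = set(full_top)
    if not full_set:
        return 0.0
    combined = set().union(*ultra_tops)  # tokens seen in any Ultra feature
    return len(full_set & combined) / len(full_set)


# Made-up token lists, for illustration only
full_top = ["a", "b", "c", "d", "e", "f"]
ultra_tops = [
    ["a", "b", "x"],
    ["c", "d", "y"],
    ["e", "z"],
]

coverage = union_coverage(full_top, ultra_tops)
print(f"Combined coverage of the Full feature: {coverage:.3f}")
```

High combined coverage alongside low individual overlaps is consistent with a split, but it is only a token-level hint and does not confirm that the activations decompose the same way.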

Example: Finding Ultra Decomposition

python

from splatnlp.mechinterp.analysis import FeatureMatcher
from splatnlp.mechinterp.skill_helpers import load_context

# Example: a Full model feature that might be polysemantic
full_ctx = load_context("full")
matcher = FeatureMatcher(full_ctx)  # Source = Full, so the target is Ultra

# Find which Ultra features correspond to Full feature 512
report = matcher.find_matches(source_feature=512)

# Keep only the matches whose combined score clears the 0.1 cutoff
candidates = [m for m in report.matches if m.combined_score > 0.1]

# If several Ultra features match, the Full feature may be polysemantic
if len(candidates) > 3:
    print("Full feature 512 may be polysemantic")
    print("Ultra decomposition:")
    for m in candidates[:5]:
        print(f"  Ultra {m.target_feature}: {m.shared_top_tokens[:3]}")

Workflow Integration

Start with interpreted feature: Begin with a feature you understand
Find matches: Use this skill to find counterparts
Validate interpretation: Check if matches have similar behavior
Document correspondence: Update research state with cross-model links
Investigate decomposition: If Ultra splits a Full feature, analyze each part
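
The "document correspondence" step could be captured as a small serializable record. This document does not describe the research-state format, so the `CrossModelLink` class and its fields below are purely hypothetical; adapt them to whatever your state store expects.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CrossModelLink:
    """Hypothetical record of one cross-model feature correspondence."""
    source_model: str
    source_feature: int
    target_model: str
    target_feature: int
    token_overlap: float
    notes: str = ""


link = CrossModelLink(
    source_model="ultra",
    source_feature=18712,
    target_model="full",
    target_feature=1024,
    token_overlap=0.0,  # placeholder; fill in the overlap you actually measured
    notes="candidate counterpart, pending validation",
)

# Serialize so the link can be stored alongside other research notes
record = json.dumps(asdict(link), indent=2)
print(record)
```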

Limitations

Token overlap is a proxy; true matching would require shared activation data
Different expansion factors mean different granularity
Some features may not have clear counterparts
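
If both models can be run on the same inputs, activation-level agreement can be checked directly. The following is a sketch only: it assumes you have already collected per-example activations for one feature from each model (the matcher is not documented to provide these), and `pearson` is an illustrative helper, not a `splatnlp` function.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length activation sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant (e.g. never-active) feature gives no signal
    return cov / math.sqrt(var_x * var_y)


# Made-up activations on the same four inputs, for illustration only
ultra_acts = [0.0, 1.0, 2.0, 0.0]
full_acts = [0.0, 2.0, 4.0, 0.0]

corr = pearson(ultra_acts, full_acts)
print(f"Activation correlation: {corr:.3f}")
```

Correlation ignores scale, which suits this comparison because the two SAEs need not use the same activation magnitudes.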
