Agent skill
mteb-retrieve
This skill provides guidance for semantic similarity retrieval tasks using embedding models (e.g., MTEB benchmarks, document ranking). It should be used when computing embeddings for documents/queries, ranking documents by similarity, or identifying top-k similar items. Covers data preprocessing, model selection, similarity computation, and result verification.
MTEB Retrieve
Overview
This skill guides semantic similarity retrieval tasks where documents must be ranked by their similarity to a query using embedding models. These tasks typically involve loading documents, computing embeddings, calculating similarity scores, and identifying documents at specific ranks.
Workflow
Step 1: Data Inspection and Preprocessing
Before computing embeddings, thoroughly inspect the input data format:
- Examine raw file contents - Read a sample of lines to understand the actual format
- Identify formatting artifacts - Look for:
  - Line number prefixes (e.g., `1→`, `2→`, `11→`)
  - Index markers or delimiters
  - Whitespace padding or alignment characters
  - Header rows or metadata lines
- Clean the data - Remove any non-semantic content:
  - Strip line numbers and prefixes using regex (e.g., `re.sub(r'^\s*\d+→', '', line)`)
  - Remove leading/trailing whitespace
  - Filter empty lines
- Validate preprocessing - Print sample cleaned documents to verify they contain only semantic content
Example preprocessing pattern:
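import re

def clean_line(line):
    # Remove a line number prefix like " 1→" or "11→" (arrow or tab delimiter)
    cleaned = re.sub(r'^\s*\d+[→\t]', '', line)
    return cleaned.strip()

# raw_lines: the lines read from the input file
# Clean each line once, then drop lines that end up empty
cleaned_lines = (clean_line(line) for line in raw_lines)
documents = [doc for doc in cleaned_lines if doc]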
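The "validate preprocessing" step can be partly automated. The following is a minimal sketch, assuming the same arrow-or-tab prefix format as the pattern above; `find_unclean` is a hypothetical helper name, not part of any library. It returns the positions of documents that are empty or still look like they carry a line-number prefix.

```python
import re

# Same prefix shape as the cleaning pattern: optional spaces, digits, then → or a tab
PREFIX_PATTERN = re.compile(r'^\s*\d+[→\t]')

def find_unclean(documents):
    """Return indices of documents that are empty or still start with a numeric prefix."""
    return [
        i for i, doc in enumerate(documents)
        if not doc.strip() or PREFIX_PATTERN.match(doc)
    ]
```

An empty result suggests the cleaning step worked. A non-empty result is only a hint: text that legitimately begins with a number followed by a tab would also be flagged, so inspect those documents by hand.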
Step 2: Model Selection
Select an appropriate embedding model for the content language and domain:
- Check model language - Models often have language indicators in their names:
  - `zh` = Chinese (e.g., `bge-small-zh-v1.5`)
  - `en` = English (e.g., `bge-small-en-v1.5`)
  - No suffix often means multilingual or English
- Match model to content - Using a Chinese-optimized model for English text (or vice versa) produces suboptimal embeddings
- Consider model size - Larger models generally produce better embeddings but are slower
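One rough way to check the content language before choosing a model is to measure how much of the text consists of CJK ideographs. The sketch below is a heuristic under stated assumptions: `cjk_ratio`, `suggest_model`, the 50-document sample, and the 0.3 threshold are all illustrative choices, and the two model names are simply the examples mentioned above. It does not distinguish Chinese from other languages that use the same ideographs.

```python
def cjk_ratio(text):
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if '\u4e00' <= c <= '\u9fff')
    return cjk / len(chars)

def suggest_model(documents, threshold=0.3):
    """Suggest a model name from a sample of the documents (threshold is an assumption)."""
    sample = " ".join(documents[:50])
    if cjk_ratio(sample) >= threshold:
        return "bge-small-zh-v1.5"
    return "bge-small-en-v1.5"
```

Treat the suggestion as a prompt to look at the data, not as a decision: mixed-language corpora may be better served by a multilingual model.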
Step 3: Embedding Computation
When computing embeddings:
- Normalize embeddings - Use `normalize_embeddings=True` to enable cosine similarity via dot product
- Batch processing - For large document sets, process in batches to manage memory
- Verify dimensions - Confirm embedding dimensions match expectations for the model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('model-name')
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode(query, normalize_embeddings=True)
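The batching and dimension checks from the list above can be wrapped in one helper. This is a sketch, assuming `model` is a `SentenceTransformer` instance (or any object whose `encode` accepts `batch_size` and `normalize_embeddings`); `encode_and_check` and `expected_dim` are hypothetical names introduced here for illustration.

```python
import numpy as np

def encode_and_check(model, texts, batch_size=32, expected_dim=None):
    """Encode texts in batches and sanity-check the resulting matrix."""
    embeddings = np.asarray(
        model.encode(texts, batch_size=batch_size, normalize_embeddings=True)
    )
    # One row per input text
    if embeddings.shape[0] != len(texts):
        raise ValueError(f"expected {len(texts)} rows, got {embeddings.shape[0]}")
    # Optional check against the dimension documented for the model
    if expected_dim is not None and embeddings.shape[1] != expected_dim:
        raise ValueError(f"expected dim {expected_dim}, got {embeddings.shape[1]}")
    # Normalized rows should have unit length, so dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1)
    if not np.allclose(norms, 1.0, atol=1e-3):
        raise ValueError("embeddings are not unit-normalized")
    return embeddings
```

A smaller `batch_size` lowers peak memory at the cost of throughput; the value 32 here is only a starting point.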
Step 4: Similarity Computation and Ranking
- Compute similarities - Use dot product for normalized embeddings (equivalent to cosine similarity)
- Handle ties - Be aware that identical similarity scores produce arbitrary ordering
- Use correct indexing - For the k-th highest, use index `k-1` after sorting in descending order
import numpy as np
similarities = np.dot(doc_embeddings, query_embedding)
sorted_indices = np.argsort(similarities)[::-1] # Descending order
# For 5th highest: index 4 (0-indexed)
fifth_highest_idx = sorted_indices[4]
fifth_highest_doc = documents[fifth_highest_idx]
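Reversing the output of `np.argsort` gives no guarantee about the order of tied scores. One way to make rankings reproducible is a stable sort on the negated scores, which breaks ties by original document position. The sketch below assumes that tie-break rule is acceptable for the task; `rank_descending` and `kth_highest` are hypothetical helper names.

```python
import numpy as np

def rank_descending(similarities):
    """Indices sorted by descending score; ties keep original document order."""
    sims = np.asarray(similarities, dtype=float)
    return np.argsort(-sims, kind="stable")

def kth_highest(similarities, k):
    """Return (index of the k-th highest score, indices with a near-equal score)."""
    sims = np.asarray(similarities, dtype=float)
    order = rank_descending(sims)
    if not 1 <= k <= len(order):
        raise ValueError(f"k must be between 1 and {len(order)}")
    idx = int(order[k - 1])  # k-th highest lives at 0-based position k-1
    tied = np.flatnonzero(np.isclose(sims, sims[idx])).tolist()
    return idx, tied
```

If `tied` contains more than one index, the k-th position is ambiguous and the tie should be reported alongside the result.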
Step 5: Result Verification
Before writing final results, verify correctness:
- Print document count - Confirm expected number of documents were loaded
- Show sample documents - Display first few cleaned documents to verify preprocessing
- Display top-k results - Print at least the top 5-10 documents with their similarity scores
- Cross-check output format - Ensure the output contains only the semantic content, not formatting artifacts
# Verification checklist
print(f"Total documents: {len(documents)}")
print(f"Sample document: {documents[0][:100]}...")
print("\nTop 10 by similarity:")
for i in range(min(10, len(sorted_indices))):
    idx = sorted_indices[i]
    print(f" {i+1}. [{similarities[idx]:.4f}] {documents[idx][:50]}...")
Common Pitfalls
Data Format Issues
- Line number prefixes - Input files often include line numbers (e.g., `1→Text`) that corrupt embeddings if not removed
- Invisible characters - Watch for tabs, non-breaking spaces, or Unicode formatting characters
- Mixed encodings - Explicitly specify file encoding (`encoding='utf-8'`)
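The encoding and invisible-character pitfalls above can be handled with the standard library. This is a minimal sketch, not a definitive cleaner: `read_lines` and `normalize_text` are hypothetical helper names, and NFKC normalization is an assumption that also folds other compatibility characters (for example full-width letters), which may not suit every corpus.

```python
import re
import unicodedata

def read_lines(path):
    """Read a UTF-8 file into a list of lines without line terminators."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()

def normalize_text(text):
    """Replace or drop characters that are invisible but still reach the tokenizer."""
    # NFKC folds compatibility characters, e.g. non-breaking space -> plain space
    text = unicodedata.normalize("NFKC", text)
    # Drop Unicode format characters (category Cf), e.g. zero-width space, BOM
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # Collapse tabs and repeated whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()
```

Removing format characters also removes zero-width joiners, which carry meaning in some scripts and emoji sequences, so apply this step only when the content does not depend on them.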
Model Mismatches
- Language mismatch - Using language-specific models on wrong-language content
- Version confusion - Confirm the model version (e.g., the `v1.5` in `bge-small-en-v1.5`) matches the one the task expects
Indexing Errors
- Off-by-one errors - k-th highest uses index `k-1` in 0-indexed arrays
- Original vs sorted indices - Track the mapping between sorted positions and original document indices
Verification Gaps
- No sanity checks - Always verify document count, sample content, and score distribution
- Missing tie handling - Document when ties exist and how they affect results
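A quick look at the score distribution can catch problems such as all scores being nearly identical. The sketch below uses only the standard library; `summarize_scores` and the keys of the returned dictionary are illustrative names, not an established convention.

```python
import statistics

def summarize_scores(scores):
    """Summarize similarity scores: count, range, mean, and the gap between the top two."""
    ordered = sorted(scores, reverse=True)
    if not ordered:
        raise ValueError("no scores to summarize")
    return {
        "count": len(ordered),
        "min": ordered[-1],
        "max": ordered[0],
        "mean": statistics.fmean(ordered),
        # Gap between best and second-best score; None when there is only one score
        "top_gap": ordered[0] - ordered[1] if len(ordered) > 1 else None,
    }
```

A very small `top_gap`, or a `max` and `min` that are almost equal, suggests the ranking is fragile and worth inspecting before reporting a result.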