Agent skill

mteb-retrieve

This skill provides guidance for semantic similarity retrieval tasks using embedding models (e.g., MTEB benchmarks, document ranking). Use it when computing embeddings for documents or queries, ranking documents by similarity, or identifying the top-k most similar items. It covers data preprocessing, model selection, similarity computation, and result verification.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/mteb-retrieve

SKILL.md

MTEB Retrieve

Overview

This skill guides semantic similarity retrieval tasks where documents must be ranked by their similarity to a query using embedding models. These tasks typically involve loading documents, computing embeddings, calculating similarity scores, and identifying documents at specific ranks.

Workflow

Step 1: Data Inspection and Preprocessing

Before computing embeddings, thoroughly inspect the input data format:

Examine raw file contents - Read a sample of lines to understand the actual format
Identify formatting artifacts - Look for:
- Line number prefixes (e.g., 1→, 2→, 11→)
- Index markers or delimiters
- Whitespace padding or alignment characters
- Header rows or metadata lines
Clean the data - Remove any non-semantic content:
- Strip line numbers and prefixes using regex (e.g., re.sub(r'^\s*\d+[→\t]', '', line))
- Remove leading/trailing whitespace
- Filter empty lines
Validate preprocessing - Print sample cleaned documents to verify they contain only semantic content

Example preprocessing pattern:

python

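import re

def clean_line(line):
    # Remove a line-number prefix such as "  1→" or "11<TAB>", then trim whitespace
    cleaned = re.sub(r'^\s*\d+[→\t]', '', line)
    return cleaned.strip()

# raw_lines holds the lines read from the input file (opened with encoding='utf-8')
cleaned_lines = (clean_line(line) for line in raw_lines)
# Clean each line once and drop the lines that end up empty
documents = [doc for doc in cleaned_lines if doc]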
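
Before applying the cleaning pattern, it can help to measure how many lines actually carry a numeric prefix. The following is a minimal sketch under stated assumptions: `prefix_ratio`, `PREFIX_RE`, and the sample lines are hypothetical names introduced here for illustration, and the regex is the one used in `clean_line`.

```python
import re

# Same prefix pattern as the cleaning step: optional whitespace, digits, then "→" or a tab
PREFIX_RE = re.compile(r'^\s*\d+[→\t]')

def prefix_ratio(lines):
    """Return the fraction of non-empty lines that start with a line-number prefix."""
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        return 0.0
    matches = sum(1 for line in non_empty if PREFIX_RE.match(line))
    return matches / len(non_empty)

# Hypothetical sample of raw lines, as they might appear in an input file
sample_lines = ["  1→First document", "  2→Second document", "", "Plain line without prefix"]

# repr() makes arrows, tabs, and padding visible when inspecting the format
for line in sample_lines[:3]:
    print(repr(line))

print(f"Lines with a numeric prefix: {prefix_ratio(sample_lines):.0%}")
```

A ratio close to 100% suggests the prefixes are formatting artifacts. A small ratio may instead mean that a few documents legitimately begin with a number, in which case stripping deserves a closer look.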

Step 2: Model Selection

Select an appropriate embedding model for the content language and domain:

Check model language - Models often have language indicators in their names:
- zh = Chinese (e.g., bge-small-zh-v1.5)
- en = English (e.g., bge-small-en-v1.5)
- No language tag usually means English-only or multilingual; check the model card to confirm
Match model to content - Using a Chinese-optimized model for English text (or vice versa) typically degrades embedding quality and therefore the ranking
Consider model size - Larger models generally produce better embeddings but are slower
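
One way to make the language check explicit is a rough script heuristic. This is only a sketch: `cjk_ratio`, `pick_model`, and the 0.3 threshold are assumptions introduced for illustration, and the model names are the two examples listed above (the identifier on a model hub may carry an organization prefix).

```python
def cjk_ratio(texts):
    """Return the fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    total = 0
    cjk = 0
    for text in texts:
        for ch in text:
            if ch.isspace():
                continue
            total += 1
            if '\u4e00' <= ch <= '\u9fff':
                cjk += 1
    return cjk / total if total else 0.0

def pick_model(texts, threshold=0.3):
    """Pick a model name by dominant script; the threshold is an illustrative assumption."""
    if cjk_ratio(texts) >= threshold:
        return 'bge-small-zh-v1.5'
    return 'bge-small-en-v1.5'

print(pick_model(["The quick brown fox"]))
print(pick_model(["今天天气很好"]))
```

The heuristic cannot tell Chinese from Japanese kanji and says nothing about domain fit, so it is best treated as a sanity check on a model that was chosen by reading the model card.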

Step 3: Embedding Computation

When computing embeddings:

Normalize embeddings - Pass normalize_embeddings=True so that every vector has unit length and the dot product equals cosine similarity
Batch processing - For large document sets, process in batches to manage memory
Verify dimensions - Confirm embedding dimensions match expectations for the model

python

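from sentence_transformers import SentenceTransformer

# 'model-name' is a placeholder; substitute the model chosen in Step 2
model = SentenceTransformer('model-name')
# Encoding a list of strings returns an array of shape (n_docs, dim)
doc_embeddings = model.encode(documents, normalize_embeddings=True)
# Encoding a single string returns a 1-D vector of shape (dim,)
query_embedding = model.encode(query, normalize_embeddings=True)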
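
To see why normalization matters, the sketch below compares a raw dot product with cosine similarity on toy vectors. It uses plain Python rather than a real model; the helper names (`l2_normalize`, `dot`, `cosine`) and the 3-dimensional vectors are illustrative assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 3-dimensional vectors standing in for real embeddings
doc = [3.0, 4.0, 0.0]
query = [1.0, 2.0, 2.0]

raw_dot = dot(doc, query)  # depends on the vector lengths
cos = cosine(doc, query)   # independent of the vector lengths
normalized_dot = dot(l2_normalize(doc), l2_normalize(query))

print(raw_dot, cos, normalized_dot)
```

The raw dot product differs from the cosine value, while the dot product of the normalized vectors matches it up to floating-point error. That is the property the normalize_embeddings=True setting relies on.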

Step 4: Similarity Computation and Ranking

Compute similarities - Use dot product for normalized embeddings (equivalent to cosine similarity)
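Handle ties - Documents with identical similarity scores have no guaranteed relative order (np.argsort is not stable by default), so a tie at the requested rank makes the answer ambiguous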
Use correct indexing - For k-th highest, use index k-1 after sorting in descending order

python

import numpy as np

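# Shape (n_docs,); equals cosine similarity when the embeddings are normalized
similarities = np.dot(doc_embeddings, query_embedding)
# Original document indices, highest similarity first
sorted_indices = np.argsort(similarities)[::-1]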

# For 5th highest: index 4 (0-indexed)
fifth_highest_idx = sorted_indices[4]
fifth_highest_doc = documents[fifth_highest_idx]
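
If reproducible ordering among tied scores matters, one option is to break ties by original document index. The sketch below does this in plain Python; `rank_descending` and the sample scores are hypothetical and not part of the original workflow.

```python
def rank_descending(scores):
    """Return original indices sorted by score (highest first), ties broken by lower index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))

# Sample scores with two tied pairs (0.91 and 0.75)
scores = [0.91, 0.75, 0.91, 0.40, 0.75]
order = rank_descending(scores)
print(order)
```

With NumPy arrays, `np.argsort(-similarities, kind="stable")` should give the same tie order, because a stable sort keeps tied elements in their original sequence.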

Step 5: Result Verification

Before writing final results, verify correctness:

Print document count - Confirm expected number of documents were loaded
Show sample documents - Display first few cleaned documents to verify preprocessing
Display top-k results - Print at least the top 5-10 documents with their similarity scores
Cross-check output format - Ensure the output contains only the semantic content, not formatting artifacts

python

# Verification checklist
print(f"Total documents: {len(documents)}")
print(f"Sample document: {documents[0][:100]}...")
print("\nTop 10 by similarity:")
for i in range(min(10, len(sorted_indices))):
    idx = sorted_indices[i]
    print(f"  {i+1}. [{similarities[idx]:.4f}] {documents[idx][:50]}...")

Common Pitfalls

Data Format Issues

Line number prefixes - Input files often include line numbers (e.g., 1→Text) that corrupt embeddings if not removed
Invisible characters - Watch for tabs, non-breaking spaces, or Unicode formatting characters
Mixed encodings - Explicitly specify file encoding (encoding='utf-8')
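
Invisible characters can be handled with the standard library. The following sketch assumes that NFKC normalization and removal of Unicode format characters are acceptable for the data at hand; `normalize_text` is a hypothetical helper introduced here.

```python
import unicodedata

def normalize_text(text):
    """Apply NFKC normalization and drop Unicode format characters (category Cf)."""
    text = unicodedata.normalize('NFKC', text)
    text = ''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')
    # Collapse runs of whitespace (tabs, newlines, repeated spaces) into single spaces
    return ' '.join(text.split())

# Byte-order mark, non-breaking space, zero-width space, and a tab in one string
raw = '\ufeffHello\u00a0world\u200b\t again '
print(repr(normalize_text(raw)))
```

NFKC also folds compatibility forms such as full-width letters into their standard equivalents, and dropping format characters removes zero-width joiners. Both change the text, so it is worth confirming on a sample that the semantic content survives.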

Model Mismatches

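Language mismatch - Applying a language-specific model to content in a different language degrades ranking quality
Version confusion - When the task specifies a model revision, pin it; different revisions can produce different embeddings and therefore different rankings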

Indexing Errors

Off-by-one errors - k-th highest uses index k-1 in 0-indexed arrays
Original vs sorted indices - Track the mapping between sorted positions and original document indices
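
Both indexing pitfalls can be contained in one small helper that takes a 1-based rank and returns the original document index. This is a sketch under stated assumptions: `kth_highest` is a hypothetical function, and ties are broken by lower original index.

```python
def kth_highest(scores, documents, k):
    """Return (original_index, score, document) for the k-th highest score, with k starting at 1."""
    if len(scores) != len(documents):
        raise ValueError("scores and documents must have the same length")
    if not 1 <= k <= len(scores):
        raise ValueError(f"k must be between 1 and {len(scores)}, got {k}")
    # Sorted positions map back to original indices; ties go to the lower index
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    idx = order[k - 1]  # rank k sits at 0-based position k - 1
    return idx, scores[idx], documents[idx]

# Hypothetical documents and scores
docs = ["a", "b", "c", "d"]
scores = [0.2, 0.9, 0.5, 0.7]
print(kth_highest(scores, docs, 2))
```

Validating k up front turns a silent wrong answer (for example, a negative index wrapping around) into an explicit error.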

Verification Gaps

No sanity checks - Always verify document count, sample content, and score distribution
Missing tie handling - Document when ties exist and how they affect results
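
A tie at the requested rank can be detected and reported before the result is written. The sketch below is illustrative only: `ties_at_rank` is a hypothetical helper, and the tolerance of 1e-6 is an assumption that may need adjusting to the precision of the embeddings.

```python
def ties_at_rank(scores, k, tol=1e-6):
    """Return original indices whose score is within tol of the k-th highest score (k starts at 1)."""
    ranked = sorted(scores, reverse=True)
    target = ranked[k - 1]
    return [i for i, s in enumerate(scores) if abs(s - target) <= tol]

# Sample scores with two tied pairs
scores = [0.91, 0.75, 0.91, 0.40, 0.75]
tied = ties_at_rank(scores, k=3)
if len(tied) > 1:
    print(f"Rank 3 is ambiguous: documents {tied} share score {scores[tied[0]]:.4f}")

# A quick look at the score distribution
print(f"min={min(scores):.4f} max={max(scores):.4f} mean={sum(scores) / len(scores):.4f}")
```

A very narrow spread between the minimum and maximum score can also hint at a problem upstream, such as a model that does not match the content language.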

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/mteb-retrieve
License: MIT License
