Agent skill

chunking-strategies

Document chunking strategies for RAG systems. Use when implementing document processing pipelines to determine optimal chunking approaches based on document type and retrieval requirements.


Install this agent skill in your project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/chunking-strategies

SKILL.md

Chunking Strategies Skill

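This skill describes three chunking methods for RAG document processing (fixed-size, semantic, and recursive), recommended chunk sizes by document type, and a schema for enriching chunks with metadata.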

Chunking Methods

1. Fixed-Size Chunking

python
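def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that share overlap characters."""
    # overlap must be smaller than chunk_size, otherwise start never advances.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last window reached; avoids a trailing chunk made only of overlap
        start = end - overlap
    return chunks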
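
The function above measures chunk_size and overlap in characters, while the size recommendations later in this skill are given in tokens. A minimal sketch of the same sliding window applied to whitespace-separated words, assuming words are an acceptable rough proxy for tokens (the name fixed_size_chunk_words is hypothetical and not part of this skill):

```python
def fixed_size_chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Sliding window over words; chunk_size and overlap are counted in words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()  # assumption: whitespace words approximate tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


print(fixed_size_chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
# → ['a b c d', 'd e f g', 'g h i j']
```

Each chunk repeats the last word of the previous one, which is the overlap at work. For exact token counts, the word split would need to be replaced with the tokenizer of the embedding model in use.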

2. Semantic Chunking

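Split on natural boundaries (sentences, paragraphs). The example packs consecutive paragraphs into a chunk until adding the next one would exceed max_tokens.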

python
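def count_tokens(text: str) -> int:
    # Stand-in (assumption): whitespace word count. Swap in your model's tokenizer.
    return len(text.split())

def semantic_chunk(text: str, max_tokens: int = 500) -> list[str]:
    """Pack consecutive paragraphs into chunks of up to max_tokens tokens.

    A single paragraph longer than max_tokens is kept whole as its own chunk.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks = []
    current_chunk = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = count_tokens(para)
        # Flush only a non-empty chunk, so an oversized first paragraph
        # does not produce an empty chunk.
        if current_chunk and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current_chunk))
            current_chunk = [para]
            current_tokens = para_tokens
        else:
            current_chunk.append(para)
            current_tokens += para_tokens

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))
    return chunks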
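
The function above only splits on paragraph breaks, so a single paragraph longer than max_tokens still ends up as one oversized chunk. A minimal sketch of falling back to sentence boundaries for such a paragraph, assuming a naive regex sentence splitter and whitespace word counts as the token measure (split_sentences and pack_sentences are illustrative names, not part of this skill):

```python
import re


def split_sentences(paragraph: str) -> list[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def pack_sentences(paragraph: str, max_tokens: int = 500) -> list[str]:
    """Group whole sentences into chunks of up to max_tokens words."""
    chunks, current, current_tokens = [], [], 0
    for sentence in split_sentences(paragraph):
        n = len(sentence.split())  # assumption: words approximate tokens
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(pack_sentences("One two three. Four five six. Seven eight.", max_tokens=6))
# → ['One two three. Four five six.', 'Seven eight.']
```

The regex will mis-split abbreviations such as "e.g. this", so a dedicated sentence tokenizer is likely a better fit for real documents.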

3. Recursive Chunking

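Hierarchical splitting on multiple separators: split on the coarsest separator first, then recurse into any piece that is still too large using the next, finer separator.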

python
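SEPARATORS = ["\n\n", "\n", ". ", " "]

def recursive_chunk(text: str, max_size: int, separators: list[str] = SEPARATORS) -> list[str]:
    """Split text into pieces of at most max_size characters, coarsest separator first."""
    if len(text) <= max_size:
        return [text]

    if not separators:
        # No separators left: hard-split by length so no text is dropped.
        # (str.split("") raises ValueError, and a single slice would lose the rest.)
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    chunks = []
    for part in text.split(separators[0]):
        if not part:
            continue  # skip empty pieces produced by consecutive separators
        if len(part) <= max_size:
            chunks.append(part)
        else:
            # Still too large: retry with the next, finer separator.
            chunks.extend(recursive_chunk(part, max_size, separators[1:]))

    return chunks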
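
Splitting this way can return many pieces that are far smaller than max_size, since every separator occurrence produces a cut. A hedged sketch of a follow-up pass that greedily merges adjacent pieces back together while they still fit, assuming a single space is an acceptable joiner (the original separators are not preserved by the split, and merge_small_chunks is a hypothetical helper name):

```python
def merge_small_chunks(pieces: list[str], max_size: int, joiner: str = " ") -> list[str]:
    """Greedily join neighbouring pieces while the result stays within max_size."""
    merged: list[str] = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(joiner) + len(piece) <= max_size:
            merged[-1] = merged[-1] + joiner + piece
        else:
            merged.append(piece)
    return merged


print(merge_small_chunks(["aa", "bb", "cccccc", "d"], max_size=6))
# → ['aa bb', 'cccccc', 'd']
```

Order is preserved and no piece is ever split, so a piece that is already longer than max_size passes through unchanged.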

Chunking by Document Type

| Document Type   | Recommended Strategy | Chunk Size       |
|-----------------|----------------------|------------------|
| Technical docs  | Semantic (headers)   | 500-1000 tokens  |
| Legal documents | Semantic (sections)  | 1000-2000 tokens |
| Code            | Function/class based | 200-500 tokens   |
| Conversations   | Message boundaries   | 100-300 tokens   |
| General text    | Recursive            | 300-500 tokens   |
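
One way to use these recommendations in a pipeline is to encode them as a lookup that the chunking step consults. A minimal sketch, assuming document types arrive as short string keys; the key names, strategy labels, and the ChunkingProfile and profile_for names are all hypothetical, while the token ranges are copied from the table:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingProfile:
    strategy: str      # label for the chunker to dispatch to (hypothetical names)
    min_tokens: int
    max_tokens: int


PROFILES = {
    "technical_docs": ChunkingProfile("semantic_headers", 500, 1000),
    "legal": ChunkingProfile("semantic_sections", 1000, 2000),
    "code": ChunkingProfile("function_class", 200, 500),
    "conversation": ChunkingProfile("message_boundaries", 100, 300),
    "general": ChunkingProfile("recursive", 300, 500),
}


def profile_for(doc_type: str) -> ChunkingProfile:
    # Assumption: unknown document types fall back to the general-text profile.
    return PROFILES.get(doc_type, PROFILES["general"])


print(profile_for("code"))
print(profile_for("spreadsheet").strategy)  # falls back to "recursive"
```

Keeping the ranges in one place makes it easier to adjust them after measuring retrieval quality on real queries.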

Chunk Enrichment

python
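from dataclasses import dataclass

@dataclass
class EnrichedChunk:
    content: str         # Raw chunk text
    metadata: dict       # e.g. source and position
    summary: str         # LLM-generated
    keywords: list[str]  # Key terms describing the chunk
    parent_id: str       # For hierarchical retrieval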
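
A rough sketch of how chunks might be wrapped in this structure. The dataclass is repeated so the snippet runs on its own. The summarizer here is a placeholder that truncates the text (a real pipeline would call an LLM), keywords come from simple word frequency, and the enrich and top_keywords helpers and the stopword list are assumptions, not part of this skill:

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class EnrichedChunk:  # repeated from above so the example is self-contained
    content: str
    metadata: dict
    summary: str
    keywords: list[str]
    parent_id: str


STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}


def top_keywords(text: str, k: int = 5) -> list[str]:
    # Frequency-based keywords; a stand-in for a proper keyword extractor.
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS and len(w) > 2]
    return [w for w, _ in Counter(words).most_common(k)]


def enrich(chunks: list[str], source: str, parent_id: str,
           summarize=lambda t: t[:80]) -> list[EnrichedChunk]:
    # summarize defaults to truncation; pass an LLM-backed callable in practice.
    return [
        EnrichedChunk(
            content=c,
            metadata={"source": source, "position": i},
            summary=summarize(c),
            keywords=top_keywords(c),
            parent_id=parent_id,
        )
        for i, c in enumerate(chunks)
    ]


enriched = enrich(["Chunking splits documents. Chunking helps retrieval."], "doc.md", "doc-1")
print(enriched[0].metadata, enriched[0].keywords[0])
```

Passing the summarizer as a callable keeps the LLM dependency outside the chunking code, which makes the enrichment step straightforward to test offline.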

Best Practices

  • Add overlap between adjacent chunks (10-20% of the chunk size)
  • Preserve semantic boundaries
  • Include metadata (source, position)
  • Consider hierarchical chunking for long docs
  • Test retrieval quality with different sizes
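
To make the last point concrete, retrieval quality can be compared across chunk sizes with a small set of query and expected-text pairs. The sketch below is a toy harness under strong assumptions: chunks are word windows, relevance is plain word overlap instead of embedding similarity, and a "hit" means the top chunk contains the expected text. All function names are hypothetical.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Toy chunker: windows of `size` words sharing `overlap` words."""
    words = text.split()
    step = max(1, size - overlap)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def hit_rate(text: str, cases: list[tuple[str, str]], size: int,
             overlap_ratio: float = 0.15) -> float:
    """Fraction of cases whose top-scoring chunk contains the expected text."""
    chunks = chunk_words(text, size, int(size * overlap_ratio))
    if not chunks or not cases:
        return 0.0
    hits = 0
    for query, expected in cases:
        best = max(chunks, key=lambda c: overlap_score(query, c))
        hits += expected in best
    return hits / len(cases)


text = "alpha beta gamma delta epsilon zeta eta theta"
cases = [("gamma delta", "gamma delta")]
for size in (3, 4):
    print(size, hit_rate(text, cases, size, overlap_ratio=0.0))
# → 3 0.0
# → 4 1.0
```

With a window of 3 words and no overlap, the expected phrase is cut across two chunks and the lookup misses; a window of 4 keeps it intact. A real evaluation would swap in the production retriever and a larger labelled query set.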
