Agent skill

semtools

High-performance semantic search and document parsing toolkit. Use PROACTIVELY for searching across documentation, YAML manifests, configuration files, or any text where semantic understanding is needed. Particularly effective when exploring unfamiliar codebases, finding conceptually related content, or when exact keywords don't match the desired information.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/semtools

SKILL.md

Semtools - Semantic Search & Document Parsing

Overview

Semtools is a high-performance CLI toolkit that provides semantic search using AI embeddings, plus document parsing for formats that cannot be searched as plain text (such as PDF, DOCX, and PPTX). Unlike traditional text search (grep), semtools matches on semantic meaning, so it can surface conceptually related content even when the exact keywords don't match.

When to Use This Skill

Use semtools PROACTIVELY when:

Documentation & Configuration Exploration

Searching across docs/ for relevant guides, architecture docs, or reference materials
Finding examples in YAML manifests (ArgoCD applications, Helm values, Crossplane XRDs)
Locating configuration patterns in config/, terraform/, or application directories
Exploring unfamiliar parts of the codebase to understand patterns and conventions

Semantic Understanding Requirements

When keyword search (grep) returns too many irrelevant results or misses important content
When searching for concepts that might be described using different terminology
When exploring how a feature or pattern is implemented across the codebase
When the exact wording is unknown but the concept is clear

Document Analysis

Parsing and searching PDF documentation, vendor guides, or research papers
Analyzing large collections of heterogeneous documents
Extracting information from Office documents (DOCX, PPTX)

Examples of When to Use

"Find all examples of Vault authentication configuration" (may be described as "Vault auth", "authentication to Vault", "Vault login", etc.)
"Search for documentation about Crossplane composition patterns" (semantic understanding of "composition" vs "XRD" vs "managed resources")
"Locate YAML manifests using external secrets" (finds various ways of referencing external secrets)
"Find similar error handling patterns" (understands conceptual similarity, not just keyword matches)

Installation & Setup

Installing semtools with cargo requires a Rust toolchain. A LlamaCloud API key is needed only for parsing operations.

bash

# Install semtools
cargo install semtools

# Set API key (required for parse command)
export LLAMA_CLOUD_API_KEY=your_api_key

# Add to shell profile for persistence
echo 'export LLAMA_CLOUD_API_KEY=your_api_key' >> ~/.zshrc
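
After installing, it can help to confirm that the binaries are actually on PATH. This is a minimal sketch, assuming the install provides the `search`, `parse`, and `workspace` commands used throughout this skill; `missing_cmds` is a hypothetical helper name, not part of semtools.

```shell
# Hypothetical helper (not part of semtools): print each named command
# that does NOT resolve on PATH.
missing_cmds() {
  for cmd in "$@"; do
    # `command -v` is the POSIX way to test whether a command resolves
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Report anything missing from the commands this skill relies on
missing=$(missing_cmds search parse workspace)
if [ -n "$missing" ]; then
  echo "not on PATH:" $missing >&2
fi
```

If anything is reported, check that cargo's bin directory is on PATH before continuing.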

Core Workflow: Search Documentation

The primary workflow when searching documentation or configuration files:

Step 1: Decide on Workspace Usage

Create a workspace for repeated searches or large file collections:

bash

# Create/activate workspace (recommended for multiple searches)
workspace use cluster-ops-docs
export SEMTOOLS_WORKSPACE=cluster-ops-docs

Skip workspace for one-off quick searches.

Step 2: Execute Semantic Search

Use recommended flags for consistent, high-quality results:

bash

search "vault OIDC authentication" docs/ \
  --ignore-case \
  --n-lines 30 \
  --max-distance 0.3

Flag explanation:

--ignore-case (or -i): Case-insensitive matching (ALWAYS use this)
--n-lines 30 (or -n 30): Context lines (default 3 is too small, use 30-50)
--max-distance 0.3 (or -m 0.3): Similarity threshold (0.3 is good default)
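
Because the same three flags recur in every example, a small shell wrapper can apply them by default. The sketch below is an assumption-laden convenience, not part of semtools: `semsearch`, `SEM_LINES`, `SEM_DIST`, and `SEM_DRY_RUN` are hypothetical names introduced here for illustration.

```shell
# Hypothetical wrapper (not part of semtools): applies the three
# recommended flags so they are never forgotten.
# SEM_LINES / SEM_DIST override the defaults; SEM_DRY_RUN=1 prints the
# command instead of running it.
semsearch() {
  if [ "$#" -lt 1 ]; then
    echo "usage: semsearch QUERY [PATH...]" >&2
    return 2
  fi
  query=$1
  shift
  if [ -n "${SEM_DRY_RUN:-}" ]; then
    printf '%s\n' "search \"$query\" $* --ignore-case --n-lines ${SEM_LINES:-30} --max-distance ${SEM_DIST:-0.3}"
  else
    search "$query" "$@" --ignore-case \
      --n-lines "${SEM_LINES:-30}" \
      --max-distance "${SEM_DIST:-0.3}"
  fi
}

# Preview the command for the example query above
SEM_DRY_RUN=1 semsearch "vault OIDC authentication" docs/
```

The dry-run mode makes it easy to see exactly which flags will be passed before any embeddings are computed.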

Step 3: Refine if Needed

Adjust search parameters based on results:

bash

# Too many results? Tighten similarity threshold
search "query" docs/ -i -n 30 -m 0.2

# Too few results? Loosen similarity threshold
search "query" docs/ -i -n 30 -m 0.4

# Need more context? Increase lines
search "query" docs/ -i -n 50 -m 0.3

Common Search Patterns

Pattern 1: Quick Documentation Search

For one-off searches without workspace overhead:

bash

search "crossplane composition patterns" docs/ -i -n 30 -m 0.3

Pattern 2: Research Mode with Workspace

For exploring a topic with multiple related searches:

bash

# Setup workspace once
workspace use architecture-research
export SEMTOOLS_WORKSPACE=architecture-research

# Multiple searches (fast after first one)
search "service mesh networking" docs/ -i -n 30 -m 0.3
search "cilium cluster mesh" docs/ -i -n 30 -m 0.3
search "BGP configuration" docs/ -i -n 30 -m 0.3

Pattern 3: Multi-Directory Configuration Search

Search across multiple directories for configuration patterns:

bash

search "external secrets configuration" apps/ argocd/ crossplane/ \
  -i -n 40 -m 0.3

Pattern 4: YAML Manifest Search

Find specific patterns in Kubernetes/Helm manifests:

bash

# Find ingress configurations with authentication
search "ingress with oauth2 proxy" apps/ -i -n 30 -m 0.3

# Find Crossplane resource examples
search "vault kubernetes role composition" crossplane/ -i -n 30 -m 0.3

# Find ArgoCD sync configurations
search "sync waves and hooks" argocd/ -i -n 25 -m 0.3

Pattern 5: Pre-filter with grep

Combine exact-match filtering with semantic search:

bash

# Find files mentioning "vault", then semantic search within
grep -l "vault" apps/*/values.yaml | \
  xargs search "authentication configuration" -i -n 30 -m 0.3

# Pre-filter by section, then semantic search
cat docs/reference/*.md | grep -A 50 "## Authentication" | \
  search "OIDC provider setup" -i -n 20 -m 0.3
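
One caveat with the `grep -l ... | xargs search` pattern: when grep matches no files, some xargs implementations still invoke `search` once with no file arguments. A hedged sketch of a guard follows; `prefilter_search` is a hypothetical helper, and it assumes file names without whitespace or quotes.

```shell
# Hypothetical helper: exact-match pre-filter, then semantic search.
# Usage: prefilter_search KEYWORD QUERY FILE...
prefilter_search() {
  if [ "$#" -lt 3 ]; then
    echo "usage: prefilter_search KEYWORD QUERY FILE..." >&2
    return 2
  fi
  keyword=$1
  query=$2
  shift 2
  # -l lists matching files, -F treats the keyword as a literal string
  files=$(grep -lF -- "$keyword" "$@" 2>/dev/null) || true
  if [ -z "$files" ]; then
    echo "no files contain '$keyword'; skipping semantic search" >&2
    return 1
  fi
  # Assumption: file names contain no whitespace or quotes
  printf '%s\n' "$files" | xargs search "$query" -i -n 30 -m 0.3
}
```

For example, `prefilter_search vault "authentication configuration" apps/*/values.yaml` runs the semantic search only when at least one values file mentions "vault".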

Document Parsing Workflow

Parse file formats that cannot be searched directly (PDF, DOCX, PPTX) into searchable markdown.

Step 1: Create Workspace for PDFs

bash

workspace use vendor-docs
export SEMTOOLS_WORKSPACE=vendor-docs

Step 2: Parse Documents

bash

# Parse PDFs (results cached to ~/.parse/)
parse docs/vendor/*.pdf

# Parse multiple formats
parse whitepaper.pdf guide.docx presentation.pptx

Step 3: Search Parsed Content

bash

# Search cached parsed files
search "API authentication" ~/.parse/*.pdf.md -i -n 30 -m 0.3

# Or parse and search in pipeline
parse document.pdf | xargs cat | search "security model" -i -n 40 -m 0.3

Step 4: Repeated Searches

Subsequent searches use cached embeddings (very fast):

bash

search "rate limiting" ~/.parse/ -i -n 30 -m 0.3
search "error handling" ~/.parse/ -i -n 30 -m 0.3
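
To avoid re-submitting documents that are already cached, a script can list only the PDFs that have no parsed markdown yet. This sketch assumes the cache file is named after the original file with `.md` appended, as the `~/.parse/*.pdf.md` glob above suggests; that naming and the helper name `unparsed_pdfs` are assumptions to verify against your installed version.

```shell
# Hypothetical helper: list PDFs in DIR that have no cached markdown in CACHE.
# Assumption: the cache file is named "<pdf file name>.md".
unparsed_pdfs() {
  dir=$1
  cache=${2:-"$HOME/.parse"}
  for pdf in "$dir"/*.pdf; do
    [ -e "$pdf" ] || continue          # glob matched nothing
    name=$(basename "$pdf")
    [ -e "$cache/$name.md" ] || printf '%s\n' "$pdf"
  done
}
```

Running `unparsed_pdfs docs/vendor` prints one path per line, which can then be passed to `parse`.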

Advanced Workflows

Workflow: Complex Research Pipeline

Chain Unix tools for powerful document processing:

bash

# Find YAML files, keep only lines mentioning "ingress", then semantic search
# (-print0 / -0 keep file names containing spaces intact)
find apps/ argocd/ -name "*.yaml" -print0 | \
  xargs -0 cat | \
  grep -i "ingress" | \
  search "TLS certificate management" -i -n 30 -m 0.3

Workflow: PDF Document Analysis

Analyze large PDF collections efficiently:

bash

# Setup dedicated workspace
workspace use research-papers
export SEMTOOLS_WORKSPACE=research-papers

# Parse all PDFs once (cached)
parse papers/*.pdf

# Run multiple semantic queries (fast)
search "kubernetes security" ~/.parse/ -i -n 50 -m 0.2
search "container isolation" ~/.parse/ -i -n 50 -m 0.3
search "network policies" ~/.parse/ -i -n 50 -m 0.3

# Remove stale cache entries when done
workspace prune

Workflow: Codebase Pattern Discovery

Discover how patterns are implemented across the codebase:

bash

# Find authentication implementations
search "authentication middleware" src/ -i -n 40 -m 0.3

# Find error handling patterns
search "error propagation patterns" src/ lib/ -i -n 30 -m 0.3

# Find similar API endpoint structures
search "REST endpoint with validation" src/api/ -i -n 35 -m 0.3

Understanding Distance Thresholds

The --max-distance parameter sets the maximum cosine distance allowed between the query and a match. Lower values demand closer semantic similarity:

Threshold	Precision	Recall	Use When
0.2	Very High	Low	Need only highly relevant, precise matches
0.3	High	Medium	Recommended default, good balance
0.4	Medium	High	Exploratory search, broader results
0.5+	Low	Very High	Very broad search, discovery mode

General guidance:

Start with 0.3 (good default)
If too many irrelevant results → decrease to 0.2
If missing relevant results → increase to 0.4
For discovery/exploration → use 0.4-0.5
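
The guidance above amounts to stepping the threshold by 0.1 in one direction or the other. A minimal sketch of that rule, where `adjust_distance` is a hypothetical helper and the 0.1 floor is an arbitrary choice made here to keep the value positive:

```shell
# Hypothetical helper: suggest the next --max-distance value.
# Usage: adjust_distance CURRENT too-many|too-few
adjust_distance() {
  case $2 in
    too-many) step=-0.1 ;;   # tighten: fewer, closer matches
    too-few)  step=0.1 ;;    # loosen: more, broader matches
    *) echo "usage: adjust_distance CURRENT too-many|too-few" >&2; return 2 ;;
  esac
  # awk does the floating-point arithmetic; never go below 0.1
  awk -v d="$1" -v s="$step" 'BEGIN {
    n = d + s
    if (n < 0.1) n = 0.1
    printf "%.1f\n", n
  }'
}

adjust_distance 0.3 too-many   # prints 0.2
adjust_distance 0.3 too-few    # prints 0.4
```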

Best Practices

1. Always Use These Three Flags

For consistent, high-quality results, always use:

bash

search "query" path/ -i -n 30 -m 0.3

Why:

-i: Capitalization in the files is usually unknown in advance
-n 30: The default of 3 lines is rarely enough context (use 30-50)
-m 0.3: Good precision/recall balance

2. Use Workspaces for Large Operations

Create workspace before searching large file collections or running multiple related searches:

bash

workspace use project-name
export SEMTOOLS_WORKSPACE=project-name
# First search generates embeddings (slower)
# Subsequent searches use cache (much faster)

Benefits:

Dramatically faster repeated searches
Cost-effective (avoids re-embedding)
Automatic cache invalidation when files change

3. Pre-parse PDFs Once

Parse PDFs as a one-time operation, then search cached markdown:

bash

# Parse once
parse documents/*.pdf

# Search many times (fast)
search "topic 1" ~/.parse/ -i -n 30 -m 0.3
search "topic 2" ~/.parse/ -i -n 30 -m 0.3

4. Chain with grep for Exact Pre-filtering

Combine exact-match filtering with semantic understanding:

bash

# Find files with exact keyword, then semantic search
grep -l "kubernetes" docs/*.md | \
  xargs search "authentication patterns" -i -n 30 -m 0.3

5. Adjust Context Lines Based on Content Type

Different content types need different context amounts:

bash

# Code/configs: 30-40 lines
search "query" src/ -i -n 35 -m 0.3

# Documentation: 40-50 lines
search "query" docs/ -i -n 45 -m 0.3

# Dense technical docs: 50+ lines
search "query" papers/ -i -n 60 -m 0.3
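
If these ranges are used often, they can be encoded once rather than remembered. The sketch below maps a content type to a context size using the values from the examples above; `context_lines` and its type names are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: map a content type to a --n-lines value,
# using the values from the examples above.
context_lines() {
  case $1 in
    code|config) echo 35 ;;   # code/configs: 30-40 lines
    docs)        echo 45 ;;   # documentation: 40-50 lines
    dense)       echo 60 ;;   # dense technical docs: 50+ lines
    *)           echo 30 ;;   # fall back to the general default
  esac
}
```

A call such as `search "query" docs/ -i -n "$(context_lines docs)" -m 0.3` then picks the context size from the content type.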

When NOT to Use Semtools

Don't use semtools when:

1. Simple Exact-Match Search

Use grep or the Grep tool for literal string matching:

bash

# Use grep, not semtools
grep -r "function_name" src/
grep "exact error message" logs/*.txt

2. Code Structure Search

Use ast-grep for AST-based code pattern matching:

bash

# Use ast-grep, not semtools
ast-grep --pattern 'async function $NAME($$$)'

3. File Name Search

Use find or the Glob tool for file path patterns:

bash

# Use find/glob, not semtools
find . -name "*.yaml"
# Glob tool pattern (not a shell command): apps/**/*.values.yaml

4. Real-time Operations

Semtools requires embedding time; use grep for instant results:

bash

# For instant results, use grep
grep "ERROR" logs/current.log

5. Very Small or Single-File Searches

Use native tools for focused, single-file operations:

bash

# Use grep for single files
grep "keyword" single_file.txt

Tool Selection Guide

Need	Tool	Reason
Exact string match	`grep`	Fast, instant, literal matching
Code structure/AST	`ast-grep`	Understands syntax, structural patterns
File name patterns	`find`/`Glob`	Fast directory traversal
Semantic/concept search	semtools	Understands meaning, finds related content
Exploring unfamiliar code	semtools	Discovers patterns via semantic similarity
Multi-format docs	semtools	Parses PDFs, DOCX, searches semantically
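
One way to apply this guide in practice is to try the cheap exact match first and fall back to semantic search only when it finds nothing. This is a sketch under that assumption; `find_text` is a hypothetical helper, not a semtools command.

```shell
# Hypothetical helper: try an exact match first, fall back to semantic search.
# Usage: find_text TERM PATH...
find_text() {
  if [ "$#" -lt 2 ]; then
    echo "usage: find_text TERM PATH..." >&2
    return 2
  fi
  term=$1
  shift
  # grep exits 0 only when at least one line matched
  if grep -rnF -- "$term" "$@" 2>/dev/null; then
    return 0
  fi
  echo "no exact match for '$term'; trying semantic search" >&2
  search "$term" "$@" -i -n 30 -m 0.3
}
```

When grep finds the literal term, its output is returned immediately and no embeddings are computed.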

Troubleshooting

Search Returns No Results

Increase distance threshold:

bash

search "query" files/ -i -n 30 -m 0.4  # or 0.5

Rephrase query using different terminology

Check file paths exist and are readable:

bash

ls docs/  # verify path

Verify workspace status:

bash

workspace status

Search is Slow

Use workspace for repeated searches:

bash

workspace use project
export SEMTOOLS_WORKSPACE=project

Pre-parse PDFs to avoid parsing repeatedly
Reduce scope - search specific directories, not entire repo

Parse Fails

Check API key:

bash

echo $LLAMA_CLOUD_API_KEY  # should output your key

Verify network connection (LlamaParse requires internet)
Check file format (PDF, DOCX, PPTX supported)
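
Echoing the key prints a secret to the terminal and to any captured log. A hedged alternative is to report only whether the variable is set; `api_key_status` below is a hypothetical helper name.

```shell
# Hypothetical helper: report whether the key is set without printing it.
api_key_status() {
  if [ -n "${LLAMA_CLOUD_API_KEY:-}" ]; then
    echo "LLAMA_CLOUD_API_KEY is set (${#LLAMA_CLOUD_API_KEY} characters)"
  else
    echo "LLAMA_CLOUD_API_KEY is not set"
    return 1
  fi
}
```

Calling `api_key_status` before `parse` gives a quick pass/fail signal that is safe to paste into a bug report.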

Embeddings Seem Stale

bash

# Remove stale cache entries
workspace prune

# Or create fresh workspace
workspace use new-workspace
export SEMTOOLS_WORKSPACE=new-workspace

Too Many Irrelevant Results

Tighten distance threshold:

bash

search "query" files/ -i -n 30 -m 0.2

Use more specific query terms

Pre-filter with grep:

bash

grep -l "exact_term" *.yaml | xargs search "broader query" -i -n 30 -m 0.3

Cost Considerations

Semtools uses the LlamaCloud API for:

Document parsing (PDF, DOCX, etc.)

Embedding generation for semantic search is described upstream as running locally, so it costs compute time rather than API credits.

Cost optimization strategies:

Use workspaces to cache embeddings (avoids re-embedding)
Pre-parse documents once, search multiple times
Parse → cache → search workflow minimizes API calls

Example costs:

910 ACL research papers analyzed for ~$0.68 (from docs)

Workspace Management

Create and Activate Workspace

bash

workspace use my-project
export SEMTOOLS_WORKSPACE=my-project

Add to shell profile for persistence:

bash

echo 'export SEMTOOLS_WORKSPACE=my-project' >> ~/.zshrc
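
Running the `echo ... >>` line more than once leaves duplicate exports in the profile. A small guard avoids that; `append_once` is a hypothetical helper shown as a sketch.

```shell
# Hypothetical helper: append LINE to FILE only if it is not already there.
append_once() {
  line=$1
  file=$2
  # -q quiet, -x whole-line match, -F literal string
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

For example, `append_once 'export SEMTOOLS_WORKSPACE=my-project' ~/.zshrc` can be run repeatedly without growing the file.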

Check Workspace Status

bash

workspace status
# Shows: active workspace, cached files, size, last modified

Clean Up Workspace

bash

# Remove stale/missing files from cache
workspace prune

Workspace Directory

Workspaces are stored at: ~/.semtools/workspaces/<workspace-name>/

Examples for This Codebase

Search Architecture Documentation

bash

search "service mesh design" docs/architecture/ -i -n 40 -m 0.3

Find Crossplane XRD Patterns

bash

search "composition with managed resources" crossplane/ -i -n 30 -m 0.3

Locate Vault Configuration Examples

bash

search "vault kubernetes authentication role" apps/ crossplane/ terraform/ \
  -i -n 30 -m 0.3

Search Terraform Modules

bash

search "authentik group provisioning" terraform/modules/ -i -n 30 -m 0.3

Find ArgoCD Application Patterns

bash

search "argocd sync waves and hooks" argocd/ -i -n 30 -m 0.3

Explore Helm Values

bash

search "ingress configuration with TLS" apps/*/values.yaml -i -n 35 -m 0.3

Search Across Documentation Types

bash

# Search guides, reference docs, and architecture docs
search "secret management patterns" docs/ -i -n 40 -m 0.3

Quick Reference Card

Essential Pattern

bash

# Standard search command
search "<query>" <path> -i -n 30 -m 0.3

Create Workspace

bash

workspace use <name>
export SEMTOOLS_WORKSPACE=<name>

Parse Documents

bash

parse *.pdf
search "query" ~/.parse/ -i -n 30 -m 0.3

Adjust Precision/Recall

bash

-m 0.2  # High precision (strict)
-m 0.3  # Balanced (default)
-m 0.4  # High recall (broad)

Resources

For detailed CLI documentation, see references/cli_reference.md.

External resources:

GitHub: https://github.com/run-llama/semtools
Documentation: https://github.com/run-llama/semtools/blob/main/README.md
Examples: https://github.com/run-llama/semtools/tree/main/examples

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/semtools
License: MIT License
