Agent skill
using-spacy-nlp
Industrial-strength NLP with spaCy 3.x for text processing and custom classifier training. Use when "installing spaCy", "selecting model for nlp" (en_core_web_sm/md/lg/trf), "tokenization", "POS tagging", "named entity recognition" (NER), "dependency parsing", "training TextCategorizer models", "troubleshooting spaCy errors" (E050/E941 model errors, E927 version mismatch, memory issues), "batch processing with nlp.pipe", or "deploying nlp models to production". Includes data preparation scripts, config templates, and FastAPI serving examples.
Install this agent skill to your Project
npx add-skill https://github.com/SpillwaveSolutions/spacy-nlp-agentic-skill/tree/main/skills/using-spacy-nlp
SKILL.md
spaCy NLP
Production-ready NLP with spaCy 3.x. This skill covers installation through deployment.
Contents
- Quick Start
- Installation
- Text Processing
- Training Classifiers
- Troubleshooting
- Production Deployment
Scope
In Scope:
- spaCy 3.x installation and text processing
- TextCategorizer training for document classification
- Production deployment and optimization patterns
Out of Scope (use other tools/skills):
- Training custom NER models (different workflow)
- spaCy 2.x (deprecated, incompatible with 3.x)
- Rule-based matching (EntityRuler, Matcher, PhraseMatcher)
- Custom tokenizers or language models
Quick Start
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
# Entities
for ent in doc.ents:
    print(ent.text, ent.label_)
# Tokens with attributes
for token in doc:
    print(token.text, token.pos_, token.dep_)
Installation
Standard Setup
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
Model Selection
| Model | Size | Speed | Use Case |
|---|---|---|---|
| en_core_web_sm | 12 MB | Fastest | Prototyping, speed-critical |
| en_core_web_md | 40 MB | Fast | General use with word vectors |
| en_core_web_lg | 560 MB | Fast | Semantic similarity tasks |
| en_core_web_trf | 438 MB | Slow | Maximum accuracy (GPU) |
Verify Installation
import spacy
print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Test sentence.")
print(f"Tokens: {len(doc)}")
For detailed installation options (conda, GPU, transformers): See references/installation.md
Text Processing
Basic Pipeline
nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats are hanging on their feet.")
# Tokenization + attributes
for token in doc:
    print(f"{token.text:10} | {token.lemma_:10} | {token.pos_:6} | {token.dep_}")
Named Entity Recognition
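# The bats sentence above contains no named entities, so use one that does
doc = nlp("Apple Inc. was founded by Steve Jobs.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # typically: Apple Inc. ORG, Steve Jobs PERSON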
For entity types, filtering, and span details: See references/basic-usage.md
Batch Processing (Critical for Production)
# WRONG - slow
for text in texts:
    doc = nlp(text)  # Don't do this
# CORRECT - fast
for doc in nlp.pipe(texts, batch_size=50):
    process(doc)
# With multiprocessing (on Windows/macOS, call this under `if __name__ == "__main__":`)
docs = list(nlp.pipe(texts, n_process=4))
Disable Unused Components
# Only need NER - disable the rest for 2x speed
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer"])
For Doc/Token/Span details, noun chunks, similarity: See references/basic-usage.md
Training Classifiers
Train custom text classifiers with TextCategorizer.
Workflow Overview
- Prepare data → Run scripts/prepare_training_data.py
- Generate config → Run scripts/generate_config.py or use assets/config_textcat.cfg
- Validate → python -m spacy debug data config.cfg (catches issues before training)
- Train → python -m spacy train config.cfg --output ./output
- Evaluate → Run scripts/evaluate_model.py
- Use → nlp = spacy.load("./output/model-best")
Data Format
Training data uses spaCy's DocBin format. Example input (JSON):
[
{"text": "Quarterly revenue exceeded expectations", "label": "Business"},
{"text": "Fixed null pointer exception in parser", "label": "Programming"},
{"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]
Convert with script:
python scripts/prepare_training_data.py \
--input data.json \
--output-train train.spacy \
--output-dev dev.spacy \
--split 0.8
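The conversion script itself is not reproduced here, so the following is only a sketch of what such a conversion might look like for the JSON shape above. The function names `split_records` and `write_docbin`, the fixed seed, and the last two sample records are assumptions made for illustration, not the script's actual interface or data.

```python
import random


def split_records(records, train_ratio=0.8, seed=0):
    """Shuffle a copy of the records and split it into (train, dev)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def write_docbin(records, labels, path):
    """Serialize {"text", "label"} records to a .spacy file for textcat training."""
    # Imported here so split_records stays usable without spaCy installed.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")  # tokenizer only; no trained components needed
    doc_bin = DocBin()
    for record in records:
        doc = nlp.make_doc(record["text"])
        # Exclusive classes: 1.0 for the gold label, 0.0 for every other label.
        doc.cats = {label: float(label == record["label"]) for label in labels}
        doc_bin.add(doc)
    doc_bin.to_disk(path)


records = [
    {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
    {"text": "Fixed null pointer exception in parser", "label": "Programming"},
    {"text": "Kubernetes deployment manifest updated", "label": "DevOps"},
    # The two records below are made up for illustration.
    {"text": "Refactored the build pipeline", "label": "DevOps"},
    {"text": "Merged the authentication patch", "label": "Programming"},
]
train, dev = split_records(records, train_ratio=0.8)
labels = sorted({record["label"] for record in records})
```

With spaCy installed, calling `write_docbin(train, labels, "train.spacy")` and `write_docbin(dev, labels, "dev.spacy")` would produce the two files that the training config expects. Every label appears in `doc.cats` for every document, which is the shape an exclusive-class text categorizer is trained on.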
Training Command
# Generate optimized config
python scripts/generate_config.py --categories "Business,Technology,Programming,DevOps"
# Or use template
cp assets/config_textcat.cfg config.cfg
# Train
python -m spacy train config.cfg --output ./output
# With GPU
python -m spacy train config.cfg --output ./output --gpu-id 0
Using Trained Model
nlp = spacy.load("./output/model-best")
doc = nlp("Deploy the application to Kubernetes cluster")
predicted = max(doc.cats, key=doc.cats.get)
confidence = doc.cats[predicted]
print(f"{predicted}: {confidence:.1%}")  # e.g. DevOps: 94.2%
For detailed training guide: See references/text-classification.md
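Taking `max` over `doc.cats` always returns a label, even when no category scores well. One option is to reject low-confidence predictions, as in the minimal sketch below. The helper name `top_category`, the 0.5 threshold, and the example scores are assumptions for illustration; a suitable threshold depends on your data.

```python
def top_category(cats, threshold=0.5):
    """Return (label, score) for the best category, or (None, score) below threshold."""
    if not cats:
        return None, 0.0
    label = max(cats, key=cats.get)
    score = cats[label]
    return (label, score) if score >= threshold else (None, score)


# A dict shaped like doc.cats; the numbers are invented for illustration.
example_cats = {"Business": 0.03, "Technology": 0.01, "Programming": 0.02, "DevOps": 0.94}
label, score = top_category(example_cats)
print(f"{label}: {score:.1%}")  # DevOps: 94.0%
```

In real use, pass `doc.cats` from the trained pipeline instead of `example_cats`, and route `None` results to a fallback such as manual review.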
Troubleshooting
Model Not Found (E050)
OSError: [E050] Can't find model 'en_core_web_sm'
Fix:
python -m spacy download en_core_web_sm
Alternative (avoids path issues):
import en_core_web_sm
nlp = en_core_web_sm.load()
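Because pipeline packages installed by the download command are ordinary Python packages, a start-up check can catch a missing model before E050 is raised. This is a minimal sketch using only the standard library; the helper name `is_model_installed` is an assumption, and spaCy's own `spacy.util.is_package` serves a similar purpose.

```python
import importlib.util


def is_model_installed(name):
    """Return True if a pip-installed pipeline package with this name is importable."""
    return importlib.util.find_spec(name) is not None


model_name = "en_core_web_sm"
if not is_model_installed(model_name):
    print(f"Missing model; run: python -m spacy download {model_name}")
```

The check only covers models installed as packages. A model loaded from a directory path such as ./output/model-best needs a path check instead.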
Memory Issues
Symptoms: OOM errors, slow processing
Fixes:
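# 1. Exclude unused components (exclude= never loads them; disable= loads them but skips running them)
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])
# 2. Process in chunks (chunk_text is a helper you define yourself; nlp.max_length defaults to 1,000,000 characters)
for chunk in chunk_text(large_text, max_length=100000):
    doc = nlp(chunk)
# 3. Use memory zones (spaCy 3.8+); Docs created inside the zone must not be used after it exits
with nlp.memory_zone():
    for doc in nlp.pipe(batch):
        process(doc)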
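`chunk_text` in fix 2 is not a spaCy function. One possible definition is sketched below, assuming that splitting on the nearest preceding space is acceptable. Chunk boundaries can still fall mid-sentence, which may affect annotations near the edges, so a sentence-aware splitter may suit some workloads better.

```python
def chunk_text(text, max_length=100_000):
    """Yield pieces of text no longer than max_length, breaking on a space when possible."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if end < len(text):
            # Back up to the last space in the window so words are not cut in half.
            split_at = text.rfind(" ", start, end)
            if split_at > start:
                end = split_at + 1
        yield text[start:end]
        start = end


chunks = list(chunk_text("aaa bbb ccc", max_length=5))
print(chunks)  # ['aaa ', 'bbb ', 'ccc']
```

Joining the chunks reproduces the input exactly, so character offsets can be mapped back to the original text by tracking each chunk's start position.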
GPU Not Working
import spacy
# Must call BEFORE loading model
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("GPU not available")
nlp = spacy.load("en_core_web_trf") # Now loads on GPU
Version Compatibility
spaCy 2.x models do not work with spaCy 3.x. Check compatibility:
python -m spacy validate
For more troubleshooting: See references/troubleshooting.md
Production Deployment
Package Model
python -m spacy package ./output/model-best ./packages \
--name my_classifier \
--version 1.0.0
pip install ./packages/en_my_classifier-1.0.0/
FastAPI Server
Use the production template:
python scripts/serve_model.py --model ./output/model-best --port 8000
Or customize from template:
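from fastapi import FastAPI
import spacy
app = FastAPI()
nlp = spacy.load("en_my_classifier")  # load once at import time, not per request
@app.post("/classify")
def classify(text: str):  # a plain str is read from the query string; use a Pydantic model for a JSON body
    # Plain `def` lets FastAPI run the blocking nlp call in its threadpool instead of the event loop
    with nlp.memory_zone():  # requires spaCy 3.8+
        doc = nlp(text)
        scores = dict(doc.cats)  # copy the scores out before the zone is released
    return {
        "category": max(scores, key=scores.get),
        "scores": scores
    }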
Performance Optimization
| Technique | Speedup | When to Use |
|---|---|---|
| Disable components | 2-3x | Don't need all annotations |
| nlp.pipe() | 5-10x | Processing multiple texts |
| Multiprocessing | 2-4x | CPU-bound, many cores |
| GPU | 2-5x | Transformer models |
For evaluation metrics and hyperparameter tuning: See references/production.md
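The speedups in the table are rough ranges that depend on hardware, model, and text length, so it is worth measuring on your own data. Below is a minimal timing sketch; the helper name `docs_per_second` and the stand-in workload are assumptions so that the example runs without spaCy installed.

```python
import time


def docs_per_second(process, texts):
    """Time process(texts), which must fully consume the texts, and return texts per second."""
    start = time.perf_counter()
    process(texts)
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed if elapsed > 0 else float("inf")


# Stand-in workload; with a loaded pipeline, compare for example
#   lambda ts: [nlp(t) for t in ts]    against    lambda ts: list(nlp.pipe(ts))
sample_texts = ["one two three"] * 1000
rate = docs_per_second(lambda ts: [t.split() for t in ts], sample_texts)
print(f"{rate:,.0f} texts/sec")
```

Wrapping `nlp.pipe` in `list(...)` matters here because it returns a generator; without consuming it, no processing is timed.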
Scripts Reference
| Script | Purpose | Usage |
|---|---|---|
| prepare_training_data.py | Convert JSON to DocBin | python scripts/prepare_training_data.py --input data.json |
| generate_config.py | Create training config | python scripts/generate_config.py --categories "A,B,C" |
| evaluate_model.py | Detailed metrics | python scripts/evaluate_model.py --model ./output/model-best |
| serve_model.py | FastAPI server | python scripts/serve_model.py --model ./model --port 8000 |
Assets Reference
| Asset | Purpose | Usage |
|---|---|---|
| config_textcat.cfg | Base training config | Copy and customize for your labels |
| training_data_template.json | Data format example | Reference for preparing your data |