Agent skill

using-spacy-nlp

Industrial-strength NLP with spaCy 3.x for text processing and custom classifier training. Use when "installing spaCy", "selecting model for nlp" (en_core_web_sm/md/lg/trf), "tokenization", "POS tagging", "named entity recognition" (NER), "dependency parsing", "training TextCategorizer models", "troubleshooting spaCy errors" (E050/E941 model errors, E927 version mismatch, memory issues), "batch processing with nlp.pipe", or "deploying nlp models to production". Includes data preparation scripts, config templates, and FastAPI serving examples.

Install this agent skill to your Project

npx add-skill https://github.com/SpillwaveSolutions/spacy-nlp-agentic-skill/tree/main/skills/using-spacy-nlp

SKILL.md

spaCy NLP

Production-ready NLP with spaCy 3.x. This skill covers installation through deployment.

Contents

  • Quick Start
  • Installation
  • Text Processing
  • Training Classifiers
  • Troubleshooting
  • Production Deployment

Scope

In Scope:

  • spaCy 3.x installation and text processing
  • TextCategorizer training for document classification
  • Production deployment and optimization patterns

Out of Scope (use other tools/skills):

  • Training custom NER models (different workflow)
  • spaCy 2.x (deprecated, incompatible with 3.x)
  • Rule-based matching (EntityRuler, Matcher, PhraseMatcher)
  • Custom tokenizers or language models

Quick Start

python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokens with attributes
for token in doc:
    print(token.text, token.pos_, token.dep_)

Installation

Standard Setup

bash
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm

Model Selection

Model | Size | Speed | Use Case
----- | ---- | ----- | --------
en_core_web_sm | 12 MB | Fastest | Prototyping, speed-critical
en_core_web_md | 40 MB | Fast | General use with word vectors
en_core_web_lg | 560 MB | Fast | Semantic similarity tasks
en_core_web_trf | 438 MB | Slow | Maximum accuracy (GPU)

Verify Installation

python
import spacy
print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Test sentence.")
print(f"Tokens: {len(doc)}")

For detailed installation options (conda, GPU, transformers): See references/installation.md


Text Processing

Basic Pipeline

python
nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats are hanging on their feet.")

# Tokenization + attributes
for token in doc:
    print(f"{token.text:10} | {token.lemma_:10} | {token.pos_:6} | {token.dep_}")

Named Entity Recognition

python
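# Use a sentence that actually contains entities
doc = nlp("Apple Inc. was founded by Steve Jobs.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Apple Inc." ORG, "Steve Jobs" PERSON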

For entity types, filtering, and span details: See references/basic-usage.md

Batch Processing (Critical for Production)

python
# WRONG - slow
for text in texts:
    doc = nlp(text)  # Don't do this

# CORRECT - fast
for doc in nlp.pipe(texts, batch_size=50):
    process(doc)

# With multiprocessing
docs = list(nlp.pipe(texts, n_process=4))
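
When each text carries metadata such as a record ID, nlp.pipe can pass it through alongside the Doc. The following is a minimal sketch, assuming a blank English pipeline (tokenizer only) so that it runs without a downloaded model; the records and their id field are hypothetical.

```python
import spacy

# Blank English pipeline (tokenizer only); swap in spacy.load(...) for real use.
nlp = spacy.blank("en")

# Hypothetical input: (text, metadata) pairs.
records = [
    ("Quarterly revenue exceeded expectations", {"id": 1}),
    ("Fixed null pointer exception in parser", {"id": 2}),
    ("Kubernetes deployment manifest updated", {"id": 3}),
]

results = []
# as_tuples=True makes nlp.pipe yield (doc, context) pairs in input order.
for doc, context in nlp.pipe(records, as_tuples=True, batch_size=50):
    results.append({"id": context["id"], "n_tokens": len(doc)})

print(results)
```

This keeps the speed benefit of batching without a separate lookup to match each Doc back to its source record.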

Disable Unused Components

python
# Only need NER - disable the rest for 2x speed
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer"])
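
Components can also be switched off temporarily rather than at load time. Below is a minimal sketch using nlp.select_pipes, with a blank pipeline plus a sentencizer standing in for a full model (an assumption made so the example runs without a download).

```python
import spacy

# Stand-in pipeline: tokenizer plus a rule-based sentence splitter.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "First sentence. Second sentence."

# Inside the block the sentencizer is skipped; it is restored on exit.
with nlp.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp.pipe_names)
    doc_without = nlp(text)

active_after = list(nlp.pipe_names)
doc_with = nlp(text)
sentence_count = len(list(doc_with.sents))

print(active_inside, active_after, sentence_count)
```

This suits a service that needs the full pipeline for some requests and only part of it for others, since the model is loaded once.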

For Doc/Token/Span details, noun chunks, similarity: See references/basic-usage.md


Training Classifiers

Train custom text classifiers with TextCategorizer.

Workflow Overview

  1. Prepare data → Run scripts/prepare_training_data.py
  2. Generate config → Run scripts/generate_config.py or use assets/config_textcat.cfg
  3. Validate → python -m spacy debug data config.cfg (catches issues before training)
  4. Train → python -m spacy train config.cfg --output ./output
  5. Evaluate → Run scripts/evaluate_model.py
  6. Use → nlp = spacy.load("./output/model-best")

Data Format

Training data uses spaCy's DocBin format. Example input (JSON):

json
[
  {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
  {"text": "Fixed null pointer exception in parser", "label": "Programming"},
  {"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]

Convert with script:

bash
python scripts/prepare_training_data.py \
  --input data.json \
  --output-train train.spacy \
  --output-dev dev.spacy \
  --split 0.8
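
The conversion script itself is not shown here. As a rough sketch of what such a conversion could look like (not the actual contents of scripts/prepare_training_data.py), each record becomes a Doc whose doc.cats maps every label to 1.0 or 0.0, and the docs are written to DocBin files. The function name records_to_docbin, the fixed random seed, and the temporary output directory are assumptions for illustration.

```python
import random
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin


def records_to_docbin(records, labels, nlp):
    """Turn {"text", "label"} records into a DocBin with one-hot doc.cats."""
    doc_bin = DocBin()
    for record in records:
        doc = nlp.make_doc(record["text"])  # tokenize only, no pipeline components
        doc.cats = {label: float(label == record["label"]) for label in labels}
        doc_bin.add(doc)
    return doc_bin


records = [
    {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
    {"text": "Fixed null pointer exception in parser", "label": "Programming"},
    {"text": "Kubernetes deployment manifest updated", "label": "DevOps"},
]

nlp = spacy.blank("en")
labels = sorted({record["label"] for record in records})

# Shuffle with a fixed seed, then split 80/20 into train and dev.
random.Random(0).shuffle(records)
cut = int(len(records) * 0.8)
train_records, dev_records = records[:cut], records[cut:]

# Temporary directory used here only so the sketch leaves no files behind.
out_dir = Path(tempfile.mkdtemp())
records_to_docbin(train_records, labels, nlp).to_disk(out_dir / "train.spacy")
records_to_docbin(dev_records, labels, nlp).to_disk(out_dir / "dev.spacy")
```

A real dataset would need far more than three examples, and a stratified split keeps rare labels represented in both files.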

Training Command

bash
# Generate optimized config
python scripts/generate_config.py --categories "Business,Technology,Programming,DevOps"

# Or use template
cp assets/config_textcat.cfg config.cfg

# Train
python -m spacy train config.cfg --output ./output

# With GPU
python -m spacy train config.cfg --output ./output --gpu-id 0

Using Trained Model

python
nlp = spacy.load("./output/model-best")
doc = nlp("Deploy the application to Kubernetes cluster")
predicted = max(doc.cats, key=doc.cats.get)
confidence = doc.cats[predicted]
print(f"{predicted}: {confidence:.1%}")  # DevOps: 94.2%
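
Taking the maximum always returns some label, even for text that fits none of the categories. One possible refinement, sketched below with a hypothetical top_category helper and an arbitrary 0.5 threshold, is to return None when the best score is too low. It operates on a plain dict shaped like doc.cats, so it runs without a trained model.

```python
def top_category(cats, threshold=0.5):
    """Return (label, score); label is None when no score reaches threshold."""
    if not cats:
        return None, 0.0
    label = max(cats, key=cats.get)
    score = cats[label]
    return (label if score >= threshold else None), score


# Hypothetical scores in the same shape as doc.cats.
confident = {"Business": 0.03, "Programming": 0.03, "DevOps": 0.94}
uncertain = {"Business": 0.35, "Programming": 0.33, "DevOps": 0.32}

print(top_category(confident))
print(top_category(uncertain))
```

The right threshold depends on the data; checking it against the dev set is safer than picking a number up front.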

For detailed training guide: See references/text-classification.md


Troubleshooting

Model Not Found (E050)

OSError: [E050] Can't find model 'en_core_web_sm'

Fix:

bash
python -m spacy download en_core_web_sm

Alternative (avoids path issues):

python
import en_core_web_sm
nlp = en_core_web_sm.load()
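
Models installed with the download command are regular Python packages, which is why the import-based alternative works. A small sketch of a pre-flight check built on that fact, using only the standard library; the model_installed name is hypothetical.

```python
import importlib.util


def model_installed(name):
    """Return True if a package with this name is importable in the current environment."""
    return importlib.util.find_spec(name) is not None


if not model_installed("en_core_web_sm"):
    print("Model missing. Run: python -m spacy download en_core_web_sm")
```

Running a check like this at startup reports a missing model before the first request reaches spacy.load.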

Memory Issues

Symptoms: OOM errors, slow processing

Fixes:

python
# 1. Exclude unused components (exclude= skips loading them entirely)
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# 2. Process in chunks (chunk_text is a user-defined helper, not part of spaCy)
for chunk in chunk_text(large_text, max_length=100000):
    doc = nlp(chunk)

# 3. Use memory zones (spaCy 3.8+)
with nlp.memory_zone():
    for doc in nlp.pipe(batch):
        process(doc)
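
The chunk_text call in fix 2 is not a spaCy function. Below is a minimal sketch of such a helper, assuming it is acceptable to break on whitespace (a sentence or entity that spans a chunk boundary may be split).

```python
def chunk_text(text, max_length=100_000):
    """Yield consecutive pieces of text, each at most max_length characters.

    Breaks at the last space or newline inside the window when one exists,
    otherwise cuts at max_length.
    """
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if end < len(text):
            split_at = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split_at > start:
                end = split_at + 1  # keep the whitespace with the left chunk
        yield text[start:end]
        start = end


chunks = list(chunk_text("aaa bbb ccc", max_length=5))
print(chunks)
```

Joining the chunks reproduces the input exactly, so character offsets can be mapped back by tracking the running length.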

GPU Not Working

python
import spacy

# Must call BEFORE loading model
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("GPU not available")

nlp = spacy.load("en_core_web_trf")  # Now loads on GPU

Version Compatibility

spaCy 2.x models do not work with spaCy 3.x. Check compatibility:

bash
python -m spacy validate

For more troubleshooting: See references/troubleshooting.md


Production Deployment

Package Model

bash
mkdir -p ./packages  # spacy package expects the output directory to exist
python -m spacy package ./output/model-best ./packages \
  --name my_classifier \
  --version 1.0.0

pip install ./packages/en_my_classifier-1.0.0/

FastAPI Server

Use the production template:

bash
python scripts/serve_model.py --model ./output/model-best --port 8000

Or customize from template:

python
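from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("en_my_classifier")  # load once at startup, not per request


class ClassifyRequest(BaseModel):
    # Request body: clients POST JSON like {"text": "..."}
    text: str


@app.post("/classify")
async def classify(request: ClassifyRequest):
    # memory_zone requires spaCy 3.8+; it frees per-request vocab data on exit
    with nlp.memory_zone():
        doc = nlp(request.text)
        # Copy plain Python values out before the zone closes
        return {
            "category": max(doc.cats, key=doc.cats.get),
            "scores": dict(doc.cats),
        }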

Performance Optimization

Technique | Speedup | When to Use
--------- | ------- | -----------
Disable components | 2-3x | Don't need all annotations
nlp.pipe() | 5-10x | Processing multiple texts
Multiprocessing | 2-4x | CPU-bound, many cores
GPU | 2-5x | Transformer models

For evaluation metrics and hyperparameter tuning: See references/production.md
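
As a rough idea of what per-label evaluation metrics involve, the sketch below computes precision, recall, and F1 from gold and predicted labels. It is not the contents of scripts/evaluate_model.py; the per_label_metrics name and the sample labels are assumptions for illustration.

```python
from collections import Counter


def per_label_metrics(gold, predicted):
    """Compute precision, recall and F1 per label for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    metrics = {}
    for label in sorted(set(gold) | set(predicted)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical gold labels and model predictions.
gold = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]
metrics = per_label_metrics(gold, predicted)
print(metrics)
```

Per-label numbers expose weak categories that a single overall accuracy figure can hide.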


Scripts Reference

Script | Purpose | Usage
------ | ------- | -----
prepare_training_data.py | Convert JSON to DocBin | python scripts/prepare_training_data.py --input data.json
generate_config.py | Create training config | python scripts/generate_config.py --categories "A,B,C"
evaluate_model.py | Detailed metrics | python scripts/evaluate_model.py --model ./output/model-best
serve_model.py | FastAPI server | python scripts/serve_model.py --model ./model --port 8000

Assets Reference

Asset | Purpose | Usage
----- | ------- | -----
config_textcat.cfg | Base training config | Copy and customize for your labels
training_data_template.json | Data format example | Reference for preparing your data
