Agent skill
normalize
Normalize text to handle PDF/Unicode encoding issues. Converts Windows-1252 punctuation, curly quotes, em/en dashes, and ligatures to clean ASCII, and strips directional formatting, zero-width characters, and more.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/normalize
Metadata
Additional technical details for this skill
- project path: /home/graham/workspace/experiments/pi-mono
- short description: Clean PDF/Unicode text to ASCII
SKILL.md
Text Normalize
Comprehensive text normalization for handling PDF and Unicode encoding issues.
Quick Start
# Normalize text from stdin (printf expands \u escapes in bash 4.2+ and zsh; plain echo may print them literally)
printf 'Hello\u2019world\n' | .pi/skills/normalize/run.sh
# Normalize a file
.pi/skills/normalize/run.sh document.txt
# Normalize with output file
.pi/skills/normalize/run.sh document.txt -o clean.txt
# Treat argument as text (not filename); $'...' lets bash 4.2+ and zsh expand the \u escapes
.pi/skills/normalize/run.sh -t $'Hello\u201cworld\u201d'
# Show statistics
.pi/skills/normalize/run.sh document.txt --stats
What It Normalizes
| Category | Examples | Normalized To |
|---|---|---|
| Whitespace | Non-breaking, em/en space, hair space | Regular space |
| Hyphens | En dash, em dash, minus sign, figure dash | ASCII hyphen - |
| Quotes | Curly quotes, guillemets, primes | Straight ' and " |
| Windows-1252 | \x93, \x94, \x92 | ", ", ' |
| Ligatures | fi, fl, ffi, ffl | Expanded letters |
| Bullets | Various bullet points | Hyphen - |
| Zero-width | ZWSP, ZWNJ, ZWJ, BOM | Removed |
| Directional | LTR/RTL marks | Removed |
| Control chars | C0/C1 (except newline/tab) | Removed |
| Line breaks | intro-\nduction | introduction |
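Several rows of the table above are single-character substitutions or removals, which Python can express as one translation table. The following is a minimal sketch under that assumption; `CHAR_MAP` and `map_chars` are hypothetical names covering only a handful of code points, not the skill's actual mapping.

```python
# Hypothetical subset of the table above; the skill's real mapping may differ.
CHAR_MAP = {
    "\u00a0": " ",   # non-breaking space -> regular space
    "\u200a": " ",   # hair space -> regular space
    "\u2013": "-",   # en dash -> ASCII hyphen
    "\u2014": "-",   # em dash -> ASCII hyphen
    "\u2212": "-",   # minus sign -> ASCII hyphen
    "\u2018": "'",   # left single curly quote
    "\u2019": "'",   # right single curly quote
    "\u201c": '"',   # left double curly quote
    "\u201d": '"',   # right double curly quote
    "\u2022": "-",   # bullet -> hyphen
    "\u200b": None,  # zero-width space: removed
    "\u200c": None,  # zero-width non-joiner: removed
    "\u200d": None,  # zero-width joiner: removed
    "\ufeff": None,  # byte order mark: removed
    "\u200e": None,  # left-to-right mark: removed
    "\u200f": None,  # right-to-left mark: removed
}

# str.maketrans accepts single-character keys; a value of None deletes the character.
TABLE = str.maketrans(CHAR_MAP)


def map_chars(text: str) -> str:
    """Apply the substitutions and removals in a single pass."""
    return text.translate(TABLE)


print(map_chars("\u201cA\u2014B\u201d\u200b"))
```

A translation table keeps the per-character rules in one place, while multi-character rules such as line-break hyphens still need a separate pass.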
Pipeline Integration
This skill is based on the same normalization used in the extractor pipeline's s02_marker_extractor.py. The code is kept in sync with text_toolz patterns.
Python Usage
from normalize import normalize_text
# Clean text for pattern matching
text = "1.\u00a0Introduction" # Non-breaking space
clean = normalize_text(text) # "1. Introduction"
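To illustrate why this matters for pattern matching, here is a small self-contained sketch. It does not import the skill's module; `stand_in_normalize` is a hypothetical stand-in that applies only NFKC from the standard library, which happens to be enough to turn a non-breaking space into a regular one.

```python
import re
import unicodedata

# A heading pattern that expects a plain ASCII space after the section number.
HEADING = re.compile(r"^(\d+)\. (\w+)")


def stand_in_normalize(text: str) -> str:
    """Hypothetical stand-in for normalize_text: NFKC only."""
    return unicodedata.normalize("NFKC", text)


raw = "1.\u00a0Introduction"  # non-breaking space, as often found in PDF text

# The raw string does not match because U+00A0 is not an ASCII space.
print(HEADING.match(raw))

# After normalization the same pattern matches.
match = HEADING.match(stand_in_normalize(raw))
print(match.groups() if match else None)
```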
Normalization Steps
- Windows-1252 conversion - Handle legacy MS Office encoding
- NFKC normalization - Unicode compatibility decomposition, followed by canonical composition
- Remove directional formatting - LTR/RTL marks
- Remove control characters - C0/C1 (preserve newlines and tabs)
- Normalize whitespace - All special spaces to ASCII
- Normalize hyphens - All dash variants to -
- Normalize quotes - Curly to straight
- Normalize dots - Ellipsis, leader dots
- Normalize bullets - All bullet types to -
- Expand ligatures - fi/fl/ffi/ffl
- Fix line-break hyphens - Join hyphenated words
- Collapse whitespace - Multiple spaces to single
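The steps above can be sketched with the Python standard library alone. This is an illustrative approximation, not the skill's implementation: `normalize_sketch` and every table in it are hypothetical, the character sets are deliberately small, and the sketch relies on NFKC to cover the special-space, dots, and ligature steps rather than handling them separately.

```python
import re
import unicodedata

# Step 1: C1 code points that are really Windows-1252 punctuation (assumed subset).
CP1252_C1 = {0x85: "...", 0x91: "'", 0x92: "'", 0x93: '"', 0x94: '"',
             0x96: "-", 0x97: "-"}

# Steps 3-4: directional marks, zero-width characters, and control characters.
INVISIBLE = re.compile(r"[\u200b-\u200f\u202a-\u202e\u2060\u2066-\u2069\ufeff]")
CONTROL = re.compile(r"[\x00-\x08\x0b-\x1f\x7f-\x9f]")  # keeps \t and \n

# Hyphen, quote, and bullet steps: per-character replacements (assumed subset).
PUNCT = {}
for chars, repl in (
    ("\u2010\u2011\u2012\u2013\u2014\u2015\u2212", "-"),  # dash variants
    ("\u2018\u2019\u201a\u201b\u2032", "'"),              # single quotes, prime
    ("\u201c\u201d\u201e\u201f\u00ab\u00bb", '"'),        # double quotes, guillemets
    ("\u2022\u2023\u2043\u25e6\u25aa", "-"),              # bullets
):
    PUNCT.update({ord(c): repl for c in chars})

LINEBREAK_HYPHEN = re.compile(r"(\w)-\n(\w)")
MULTISPACE = re.compile(r"[ \t]+")


def normalize_sketch(text: str) -> str:
    text = text.translate(CP1252_C1)            # Windows-1252 conversion
    text = unicodedata.normalize("NFKC", text)  # spaces, ellipsis, ligatures
    text = INVISIBLE.sub("", text)              # directional and zero-width
    text = CONTROL.sub("", text)                # remaining C0/C1 controls
    text = text.translate(PUNCT)                # hyphens, quotes, bullets
    text = LINEBREAK_HYPHEN.sub(r"\1\2", text)  # join words split across lines
    return MULTISPACE.sub(" ", text)            # collapse runs of spaces/tabs


print(normalize_sketch("\x93quoted\x94 \u2014 \ufb01le\u200b"))
```

The Windows-1252 step runs first in this sketch because the later control-character pass would otherwise delete the C1 code points before they could be mapped to punctuation.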
Based On
- text_toolz library patterns
- extractor pipeline s02 normalization
- NFKC Unicode standard