Agent skill

normalize

Normalize text to handle PDF/Unicode encoding issues. Converts Windows-1252, curly quotes, em/en dashes, ligatures, directional formatting, zero-width chars, and more to clean ASCII.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/normalize

Metadata

Additional technical details for this skill

project path: /home/graham/workspace/experiments/pi-mono
short description: Clean PDF/Unicode text to ASCII

SKILL.md

Text Normalize

Comprehensive text normalization for handling PDF and Unicode encoding issues.

Quick Start

bash

# Normalize text from stdin (printf expands \u escapes in bash 4.2+; plain echo does not)
printf 'Hello\u2019world\n' | .pi/skills/normalize/run.sh

# Normalize a file
.pi/skills/normalize/run.sh document.txt

# Normalize with output file
.pi/skills/normalize/run.sh document.txt -o clean.txt

# Treat argument as text (not filename); $'...' expands \u escapes in bash 4.2+
.pi/skills/normalize/run.sh -t $'Hello\u201cworld\u201d'

# Show statistics
.pi/skills/normalize/run.sh document.txt --stats

What It Normalizes

| Category | Examples | Normalized To |
|----------|----------|---------------|
| Whitespace | Non-breaking, em/en space, hair space | Regular space |
| Hyphens | En dash, em dash, minus sign, figure dash | ASCII hyphen `-` |
| Quotes | Curly quotes, guillemets, primes | Straight `'` and `"` |
| Windows-1252 | `\x93`, `\x94`, `\x92` | `"`, `"`, `'` |
| Ligatures | ﬁ, ﬂ, ﬃ, ﬄ (U+FB01 to U+FB04) | Expanded letters: fi, fl, ffi, ffl |
| Bullets | Various bullet points | Hyphen `-` |
| Zero-width | ZWSP, ZWNJ, ZWJ, BOM | Removed |
| Directional | LTR/RTL marks | Removed |
| Control chars | C0/C1 (except newline/tab) | Removed |
| Line breaks | `intro-\nduction` | `introduction` |
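To get a rough feel for which of these categories a document actually contains, the Unicode general category of each character can be counted. The sketch below is not part of the skill; `count_categories` and the `LABELS` mapping are hypothetical names, and the grouping is only an approximation of the table above (for example, the minus sign U+2212 is a math symbol, not dash punctuation, so it is not counted here).

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from Unicode general categories to the table's rows.
LABELS = {
    "Zs": "whitespace",               # space separators such as U+00A0
    "Pd": "hyphens",                  # dash punctuation such as U+2013
    "Pi": "quotes",                   # opening quotes such as U+201C
    "Pf": "quotes",                   # closing quotes such as U+201D
    "Cf": "zero-width/directional",   # format characters such as U+200B, U+200E
    "Cc": "control",                  # C0/C1 control characters
}

# Plain ASCII characters that are already in normalized form.
ALREADY_CLEAN = " \n\t-"


def count_categories(text: str) -> Counter:
    """Count characters that fall into one of the categories above."""
    counts = Counter()
    for ch in text:
        if ch in ALREADY_CLEAN:
            continue
        label = LABELS.get(unicodedata.category(ch))
        if label is not None:
            counts[label] += 1
    return counts


sample = "1.\u00a0Intro \u2013 \u201cx\u201d\u200b"
print(count_categories(sample))
```

A non-empty result suggests the text would change under normalization; an empty counter means none of these categories were found.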

Pipeline Integration

This skill is based on the same normalization used in the extractor pipeline's s02_marker_extractor.py. The code is kept in sync with text_toolz patterns.

Python Usage

python

from normalize import normalize_text

# Clean text for pattern matching
text = "1.\u00a0Introduction"  # Non-breaking space
clean = normalize_text(text)   # "1. Introduction"
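
The non-breaking-space case above can be reproduced with the standard library alone, which helps show what NFKC covers and what it leaves for the other steps. This is an illustration using `unicodedata`, not the skill's own code; the helper name `nfkc` is an assumption.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization only."""
    return unicodedata.normalize("NFKC", text)


print(nfkc("1.\u00a0Introduction"))       # "1. Introduction": NBSP becomes a space
print(nfkc("\ufb01le"))                   # "file": the fi ligature is expanded
print(nfkc("it\u2019s") == "it\u2019s")   # True: curly quotes are left untouched
```

Curly quotes and en/em dashes have no compatibility mapping, so they need explicit replacement rules in addition to NFKC.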

Normalization Steps

1. Windows-1252 conversion - Handle legacy MS Office encoding
2. NFKC normalization - Unicode compatibility decomposition followed by canonical composition
3. Remove directional formatting - LTR/RTL marks
4. Remove control characters - C0/C1 (preserve newlines and tabs)
5. Normalize whitespace - All special spaces to ASCII space
6. Normalize hyphens - All dash variants to `-`
7. Normalize quotes - Curly to straight
8. Normalize dots - Ellipsis, leader dots
9. Normalize bullets - All bullet types to `-`
10. Expand ligatures - ﬁ/ﬂ/ﬃ/ﬄ to fi/fl/ffi/ffl
11. Fix line-break hyphens - Join words hyphenated across a line break
12. Collapse whitespace - Multiple spaces to a single space
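
The steps above can be sketched in order with the standard library. This is a minimal illustration, not the skill's implementation: `normalize_sketch` and the three mapping tables are hypothetical, cover only a handful of code points per category, and omit leader dots.

```python
import re
import unicodedata

# Step 1: Windows-1252 bytes that were decoded as C1 control characters.
CP1252_MAP = {0x91: "'", 0x92: "'", 0x93: '"', 0x94: '"',
              0x96: "-", 0x97: "-", 0x85: "...", 0x95: "-"}

# Step 3 (plus zero-width characters): None deletes the character.
INVISIBLE = {cp: None for cp in (0x200B, 0x200C, 0x200D, 0xFEFF, 0x200E, 0x200F)}

# Steps 5-9: a small, assumed subset of the replacements.
CHAR_MAP = {
    0x00A0: " ", 0x2002: " ", 0x2003: " ", 0x2009: " ", 0x200A: " ",   # spaces
    0x2010: "-", 0x2012: "-", 0x2013: "-", 0x2014: "-", 0x2212: "-",   # dashes
    0x2018: "'", 0x2019: "'", 0x201C: '"', 0x201D: '"',                # curly quotes
    0x00AB: '"', 0x00BB: '"',                                          # guillemets
    0x2026: "...",                                                     # ellipsis
    0x2022: "-", 0x25E6: "-", 0x2043: "-",                             # bullets
}


def normalize_sketch(text: str) -> str:
    text = text.translate(CP1252_MAP)               # 1. Windows-1252 leftovers
    text = unicodedata.normalize("NFKC", text)      # 2. NFKC (also expands ligatures, step 10)
    text = text.translate(INVISIBLE)                # 3. directional and zero-width marks
    text = "".join(                                 # 4. drop controls, keep newline/tab
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = text.translate(CHAR_MAP)                 # 5-9. spaces, dashes, quotes, dots, bullets
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)    # 11. join words split across lines
    text = re.sub(r"[ \t]+", " ", text)             # 12. collapse runs of spaces/tabs
    return text


print(normalize_sketch("\x93Hi\x94 \u2013 it\u2019s intro-\nduction\u200b"))
```

Order matters in this sketch: the Windows-1252 mapping has to run before control characters are stripped, because `\x93` and friends are themselves C1 controls, and dashes are converted to `-` before the line-break join so that any dash variant at a line end is handled. Note that the join in step 11 also merges genuinely hyphenated compounds that happen to break across lines.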

Based On

text_toolz library patterns
extractor pipeline s02 normalization
Unicode NFKC normalization form (Unicode Standard Annex #15)

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/normalize
License: MIT License
