Agent skill

sanitize-text

Normalize raw text by removing excessive whitespace and non-printable characters and by standardizing Unicode. Use this to clean up text extracted from PDF or DOCX files before passing it to an LLM.


Install this agent skill in your project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/sanitize-text

SKILL.md

Sanitize Text

Overview

This skill cleans and normalizes raw text. It is intended as a preprocessing step for text extracted from documents such as PDFs, which often contains encoding artifacts, excessive whitespace, or stray control characters.
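The script's source is not reproduced here, so the following is only a minimal sketch of the kind of cleanup described above. The helper name `sanitize_text`, the choice of NFKC normalization, and the rule of keeping at most one blank line in a row are all assumptions for illustration; the actual `sanitize.py` may behave differently.

```python
import re
import unicodedata


def sanitize_text(raw: str) -> str:
    """Hypothetical helper; not taken from the skill's sanitize.py."""
    # Standardize Unicode. NFKC (an assumed choice) folds compatibility
    # forms such as ligatures and non-breaking spaces into plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Treat every line-ending style as "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop characters in the Unicode "C" categories (control, format,
    # unassigned, and so on), but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim whitespace at both ends of each line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Allow at most one blank line in a row.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Note that dropping format characters also removes zero-width joiners, which some scripts and emoji sequences rely on, so a production version might want a narrower filter.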

Usage

Sanitize Script

Syntax:

bash
python3 .agent/skills/sanitize-text/scripts/sanitize.py <input_file> [--output <output_file>]

Arguments:

  • input_file: Path to the file containing raw text.
  • --output: (Optional) Path to write cleaned text to. If omitted, prints to stdout.

Example:

bash
python3 .agent/skills/sanitize-text/scripts/sanitize.py raw_resume.txt --output clean_resume.txt
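For readers wiring up something similar, the command-line interface documented above could be sketched with `argparse` as follows. This is an illustration under stated assumptions, not the skill's actual code: `build_parser`, `run`, and the `clean` parameter (which defaults to a plain `str.strip` placeholder) are hypothetical names, and UTF-8 is an assumed encoding.

```python
import argparse
import sys


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(
        description="Clean raw text (illustrative sketch)."
    )
    parser.add_argument("input_file", help="Path to the file containing raw text.")
    parser.add_argument(
        "--output",
        default=None,
        help="Path to write cleaned text to. If omitted, prints to stdout.",
    )
    return parser


def run(argv, clean=str.strip):
    """Read the input file, clean it, and write to --output or stdout."""
    args = build_parser().parse_args(argv)
    # errors="replace" keeps going on undecodable bytes (an assumed policy).
    with open(args.input_file, encoding="utf-8", errors="replace") as fh:
        cleaned = clean(fh.read())
    if args.output:
        with open(args.output, "w", encoding="utf-8") as fh:
            fh.write(cleaned)
    else:
        sys.stdout.write(cleaned)
    return cleaned
```

A script entry point would typically call `run(sys.argv[1:])` and pass a real cleaning function in place of the placeholder.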
