Agent skill

doc-to-markdown

Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, simple tables, image paths, CJK bold spacing, attribute noise, and code blocks. Scored highest (7.6/10) in the project's benchmark against Docling, MarkItDown, raw Pandoc, and Mammoth. Trigger on "convert document", "docx to markdown", "parse word", "doc to markdown", "解析word" (parse Word), "转换文档" (convert document).

Install this agent skill to your Project

npx add-skill https://github.com/daymade/claude-code-skills/tree/main/doc-to-markdown

SKILL.md

Doc to Markdown

Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.

Architecture: Pandoc (best-in-class extraction) + 8 post-processing fixes (our value-add).

Quick Start

bash

# DOCX → Markdown (one command, zero manual fixes)
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.docx -o output.md --assets-dir ./media

# PDF → Markdown
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

# Run tests
uv run --with pytest pytest scripts/test_convert.py -v

Dual Mode

Mode	Speed	Quality	Use Case
Quick (default)	Fast	Good	Drafts, simple documents
Heavy	Slower	Best	Final documents, complex layouts

Tool Selection

Format	Quick Mode	Heavy Mode
PDF	pymupdf4llm	pymupdf4llm + markitdown
DOCX	pandoc + post-processing	pandoc + markitdown
PPTX	markitdown	markitdown + pandoc
XLSX	markitdown	markitdown

DOCX Post-Processing (automatic)

When converting DOCX via pandoc, 8 cleanups are applied automatically:

Problem	Fix	Test coverage
Grid tables (`+:---+`)	Single-column → blockquote, multi-column → pipe table	`TestPostprocessPipeline`
Simple tables (`---- ----`)	Multi-column images → pipe table with captions	`TestSimpleTable`
Image path nesting (`media/media/`)	Flatten to `media/`, absolute → relative	`test_stats_tracking`
Pandoc attributes (`{width="..."}`)	Removed	`test_pandoc_attributes_removed`
CJK bold spacing (`粗体中文`)	Add space around `**` for CJK bold spans	`TestCjkBoldSpacing` (15 cases)
Indented dashed code blocks	→ fenced ``` with language detection	`test_code_block_with_language`
Escaped brackets (`\[...\]`)	→ `[...]`	`test_escaped_brackets_fixed`
Double-bracket links (`[[text]](url)`)	→ `[text](url)`	`test_double_bracket_links_fixed`
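
As an illustration of the simpler rows in the table above, the following is a minimal sketch of four of the cleanups written as regular-expression substitutions. The function names and patterns are hypothetical and are not taken from `convert.py`; the actual implementation may differ.

```python
import re


def remove_pandoc_attributes(text: str) -> str:
    """Drop pandoc attribute blocks such as {width="3in" height="2in"}."""
    return re.sub(r'\{(?:\s*[\w-]+="[^"]*")+\s*\}', "", text)


def unescape_brackets(text: str) -> str:
    r"""Turn \[ and \] back into plain brackets."""
    return re.sub(r"\\([\[\]])", r"\1", text)


def fix_double_bracket_links(text: str) -> str:
    """Rewrite [[text]](url) as [text](url)."""
    return re.sub(r"\[\[([^\[\]]+)\]\]\(([^)\s]+)\)", r"[\1](\2)", text)


def flatten_media_path(text: str) -> str:
    """Collapse a nested media/media/ prefix into a single media/."""
    return re.sub(r"(?<![\w-])(?:media/){2,}", "media/", text)


if __name__ == "__main__":
    sample = '![](media/media/image1.png){width="3in" height="2in"}'
    print(flatten_media_path(remove_pandoc_attributes(sample)))
    # prints: ![](media/image1.png)
```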

CJK Bold Spacing — why and how

DOCX applies bold at the run level, so CJK text has no spaces between bold and normal runs. Some Markdown renderers need whitespace around ** to recognize bold boundaries.

Rule: if a **content** span contains any CJK character, ensure there is a space on both sides, unless one is already there or the span sits at a line boundary. This covers CJK punctuation, emoji adjacency, and mixed content.

Before: 打开**飞书**，就可以    → some renderers fail to bold
After:  打开 **飞书** ，就可以  → universally renders correctly
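
The rule can be sketched in a few lines of Python. This is an assumption-laden illustration, not the code in `convert.py`: the name `space_cjk_bold` and the character ranges treated as CJK (kana, CJK ideographs, Hangul syllables) are choices made for this example only.

```python
import re

# Assumed definition of "CJK character" for this sketch.
CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")
# A bold span: ** + content that starts and ends with non-space + **
BOLD = re.compile(r"\*\*(?=\S)(.+?)(?<=\S)\*\*")


def space_cjk_bold(line: str) -> str:
    """Pad CJK bold spans with spaces unless already spaced or at a line edge."""
    result = ""
    pos = 0
    for match in BOLD.finditer(line):
        result += line[pos:match.start()]
        span = match.group(0)
        if CJK.search(match.group(1)):
            # Leading space: only if something non-blank precedes the span.
            if result and not result[-1].isspace():
                span = " " + span
            # Trailing space: only if something non-blank follows the span.
            following = line[match.end():match.end() + 1]
            if following and not following.isspace():
                span += " "
        result += span
        pos = match.end()
    return result + line[pos:]


if __name__ == "__main__":
    print(space_cjk_bold("打开**飞书**，就可以"))
    # prints: 打开 **飞书** ，就可以
```

Bold spans without any CJK character are left untouched, so English text such as `a**b**c` passes through unchanged.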

Heavy Mode Workflow

Heavy Mode runs multiple tools in parallel and selects the best segments:

1. Parallel Execution: Run all applicable tools simultaneously
2. Segment Analysis: Parse each output into segments (tables, headings, images, paragraphs)
3. Quality Scoring: Score each segment based on completeness and structure
4. Intelligent Merge: Select best version of each segment across tools
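
The parallel-execution and segment-analysis steps could look roughly like the sketch below. It assumes each tool is wrapped as a Python callable that takes a file path and returns Markdown; `run_tools_in_parallel`, `classify`, and `split_segments` are hypothetical names, and splitting on blank lines is a simplification of whatever `convert.py` actually does.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def run_tools_in_parallel(
    path: str, tools: Dict[str, Callable[[str], str]]
) -> Dict[str, str]:
    """Run every converter on the same file; tools that raise are skipped."""
    with ThreadPoolExecutor(max_workers=len(tools) or 1) as pool:
        futures = {name: pool.submit(fn, path) for name, fn in tools.items()}
    results: Dict[str, str] = {}
    for name, future in futures.items():
        try:
            results[name] = future.result()
        except Exception:
            continue  # a failed tool simply contributes no output
    return results


def classify(block: str) -> str:
    """Guess the segment type from the first line of a block."""
    first = block.lstrip().splitlines()[0]
    if re.match(r"#{1,6}\s", first):
        return "heading"
    if first.startswith("|"):
        return "table"
    if first.startswith("!["):
        return "image"
    if re.match(r"(?:[-*+]|\d+[.)])\s", first):
        return "list"
    return "paragraph"


def split_segments(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown on blank lines and label each block."""
    blocks = [b for b in re.split(r"\n\s*\n", markdown.strip()) if b.strip()]
    return [(classify(b), b) for b in blocks]
```

Threads are used here on the assumption that each tool spends most of its time in a subprocess or in native code; a process pool would be the alternative for pure-Python converters.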

Merge Criteria

Segment Type	Selection Criteria
Tables	More rows/columns, proper header separator
Images	Alt text present, local paths preferred
Headings	Proper hierarchy, appropriate length
Lists	More items, nested structure preserved
Paragraphs	Content completeness
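
One way to turn the table and image criteria above into comparable numbers is sketched below. The weights (for example, the bonus of 10 for a header separator) are invented for illustration and are not the scores used by `merge_outputs.py`; the function names are likewise hypothetical.

```python
import re
from typing import List

SEPARATOR = re.compile(r"\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?")
IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)")


def score_table(block: str) -> int:
    """More rows and columns score higher; a header separator adds a bonus."""
    rows = [ln.strip() for ln in block.splitlines() if ln.strip().startswith("|")]
    if not rows:
        return 0
    cols = max(ln.strip("|").count("|") + 1 for ln in rows)
    has_separator = len(rows) > 1 and SEPARATOR.fullmatch(rows[1]) is not None
    return len(rows) * cols + (10 if has_separator else 0)


def score_image(block: str) -> int:
    """Alt text and a local (non-URL) path each add one point."""
    match = IMAGE.search(block)
    if not match:
        return 0
    alt, target = match.groups()
    score = 1
    if alt.strip():
        score += 1
    if not re.match(r"[a-z][a-z0-9+.-]*://", target, re.IGNORECASE):
        score += 1
    return score


def pick_best_table(candidates: List[str]) -> str:
    """Return the candidate with the highest table score."""
    return max(candidates, key=score_table)
```

For example, a three-line pipe table with a `|---|---|` separator row outranks the same table without it, and `![chart](media/a.png)` outranks `![](https://x.example/a.png)`.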

Image Extraction

bash

# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets

# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md

Output:

Images: assets/img_page1_1.png, assets/img_page2_1.jpg
Metadata: assets/images_metadata.json (page, position, dimensions)

Quality Validation

bash

# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md

# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html

Quality Metrics

Metric	Pass	Warn	Fail
Text Retention	>95%	85-95%	<85%
Table Retention	100%	90-99%	<90%
Image Retention	100%	80-99%	<80%
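
The thresholds in the table can be expressed as a small lookup. This is a sketch only: `grade` and `THRESHOLDS` are hypothetical names, and values that fall between the listed bands (such as 99.5% table retention) are assumed here to count as a warning, which the table itself does not specify.

```python
# metric -> (pass threshold, warn floor, whether pass requires strictly greater)
THRESHOLDS = {
    "text": (95.0, 85.0, True),     # pass >95, warn 85-95, fail <85
    "table": (100.0, 90.0, False),  # pass 100, warn 90-99, fail <90
    "image": (100.0, 80.0, False),  # pass 100, warn 80-99, fail <80
}


def grade(metric: str, retention_pct: float) -> str:
    """Map a retention percentage to 'pass', 'warn', or 'fail'."""
    pass_at, warn_at, strict = THRESHOLDS[metric]
    passed = retention_pct > pass_at if strict else retention_pct >= pass_at
    if passed:
        return "pass"
    return "warn" if retention_pct >= warn_at else "fail"


if __name__ == "__main__":
    print(grade("text", 96.2), grade("table", 95.0), grade("image", 70.0))
    # prints: pass warn fail
```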

Merge Outputs Manually

bash

# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md

# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose

Path Conversion (Windows/WSL)

bash

# Windows to WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf
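
The mapping itself is simple: the drive letter becomes a lowercase directory under `/mnt`, and backslashes become forward slashes. A minimal sketch follows; the function name `windows_to_wsl` is hypothetical and `convert_path.py` may handle more cases, such as UNC paths.

```python
import re


def windows_to_wsl(path: str) -> str:
    """Convert a drive-letter Windows path to its /mnt/<drive>/ WSL form."""
    match = re.match(r"^([A-Za-z]):[\\/](.*)$", path)
    if not match:
        # No drive letter: only normalize the separators.
        return path.replace("\\", "/")
    drive, rest = match.groups()
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")


if __name__ == "__main__":
    print(windows_to_wsl(r"C:\Users\name\Documents\file.pdf"))
    # prints: /mnt/c/Users/name/Documents/file.pdf
```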

Common Issues

"No conversion tools available"

bash

# Install all tools
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc

FontBBox warnings during PDF conversion

These are harmless font-parsing warnings; the output is still correct.

Images missing from output

Use Heavy Mode for better image preservation
Or extract separately with scripts/extract_pdf_images.py

Tables broken in output

Use Heavy Mode - it selects the most complete table version
Or validate with scripts/validate_output.py

Bundled Scripts

Script	Purpose
`convert.py`	Main orchestrator with Quick/Heavy mode + DOCX post-processing
`test_convert.py`	31 tests covering all post-processing functions
`merge_outputs.py`	Merge multiple markdown outputs
`validate_output.py`	Quality validation with HTML report
`extract_pdf_images.py`	PDF image extraction with metadata
`convert_path.py`	Windows to WSL path converter

References

references/benchmark-2026-03-22.md - 5-tool benchmark (Docling/MarkItDown/Pandoc/Mammoth/ours)
references/heavy-mode-guide.md - Detailed Heavy Mode documentation
references/tool-comparison.md - Tool capabilities comparison
references/conversion-examples.md - Batch operation examples

Maintainer

daymade Core maintainer

Source details

Full Name: daymade/claude-code-skills
Branch: main
Path in repo: doc-to-markdown
License: MIT License
