required_canon_version: >=3.0.0

Skill: pdf-to-markdown

Version: 0.1.0

Status: Draft

Trigger

Use when converting PDF documents to Markdown format, typically for documentation purposes or to make PDF content more accessible and editable.

Inputs

input.json with the following structure:

```json
{
  "pdf_path": "path/to/document.pdf",
  "output_path": "path/to/output.md",
  "options": {
    "extract_images": false,
    "preserve_formatting": true,
    "page_breaks": "---"
  }
}
```

Fields:

- pdf_path (required, string): Absolute or relative path to the input PDF file
- output_path (required, string): Path where the Markdown file will be written
- options.extract_images (optional, boolean): Whether to extract embedded images (default: false)
- options.preserve_formatting (optional, boolean): Whether to attempt to preserve text formatting (default: true)
- options.page_breaks (optional, string): String inserted between pages (default: "---")
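The field list above could be applied roughly as follows. This is a minimal sketch, not the skill's actual loader: the names `DEFAULT_OPTIONS` and `load_config` are hypothetical, and the type checks are an assumption about how strictly the input is validated.

```python
import json

# Defaults copied from the field list above.
DEFAULT_OPTIONS = {
    "extract_images": False,
    "preserve_formatting": True,
    "page_breaks": "---",
}


def load_config(raw: str) -> dict:
    """Parse input.json text, check required fields, and fill in option defaults."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("input.json must contain a JSON object")

    # Both paths are required and must be non-empty strings.
    for key in ("pdf_path", "output_path"):
        if not isinstance(data.get(key), str) or not data[key]:
            raise ValueError(f"missing or invalid required field: {key}")

    given = data.get("options") or {}
    if not isinstance(given, dict):
        raise ValueError("options must be a JSON object")

    # Values supplied by the caller override the defaults.
    options = {**DEFAULT_OPTIONS, **given}
    for key, default in DEFAULT_OPTIONS.items():
        if not isinstance(options[key], type(default)):
            raise ValueError(f"options.{key} must be {type(default).__name__}")

    return {
        "pdf_path": data["pdf_path"],
        "output_path": data["output_path"],
        "options": options,
    }
```

Calling `load_config(Path("input.json").read_text())` (with `pathlib.Path` imported) would then yield a dictionary in which every option is present, so later steps need no further default handling.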

Outputs

Creates a Markdown file at the specified output_path containing:
- Extracted text from the PDF
- Headers converted from document structure
- Tables converted to Markdown tables
- Optional page break markers between pages
- Preserved whitespace and basic formatting

Output Format:

```markdown
# Document Title

## Section header

Paragraph text with **bold** and *italic* formatting.

| Column 1 | Column 2 |
|----------|----------|
| Data 1   | Data 2   |

---

Page 2 content continues...
```
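As an illustration of how the table and page-break parts of this format might be produced, here is a small sketch. The helper names `table_to_markdown` and `join_pages` are hypothetical, the first row is assumed to be the header, and column padding (as in the aligned example above) is left out for brevity.

```python
def table_to_markdown(rows):
    """Render a list of rows (first row is the header) as a Markdown table."""
    if not rows:
        return ""

    def clean(cell):
        # Missing cells become empty; pipes and newlines would break the row.
        text = "" if cell is None else str(cell)
        return text.replace("|", "\\|").replace("\n", " ")

    cleaned = [[clean(cell) for cell in row] for row in rows]
    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in cleaned)
    cleaned = [row + [""] * (width - len(row)) for row in cleaned]

    header, *body = cleaned
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def join_pages(pages, page_break="---"):
    """Join per-page Markdown, placing the page-break string between pages."""
    stripped = [page.strip() for page in pages]
    # An empty page_break falls back to a plain blank line between pages.
    separator = f"\n\n{page_break}\n\n" if page_break else "\n\n"
    return separator.join(stripped) + "\n"
```

Surrounding the page-break string with blank lines matters for the default `---`: without the blank line before it, Markdown would read the preceding paragraph as a setext heading.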

Constraints

- The input PDF must be readable and not password-protected
- The output path must be within the project root (enforced by GuardedWriter)
- Writes outside the allowed locations (BUILD/, CONTRACTS/_runs/, etc.) are rejected
- Output is deterministic: the same input PDF always produces the same Markdown
- All file writes must go through GuardedWriter (write firewall enforcement)
- Images are extracted only when explicitly requested
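The path constraints above amount to a containment check. The sketch below is not GuardedWriter's implementation or API, which this document does not describe; `ALLOWED_ROOTS` and `is_allowed_output` are hypothetical names, only the two locations named above are listed, and Python 3.9+ is assumed for `Path.is_relative_to`.

```python
from pathlib import Path

# Only the two locations named above; the "etc." implies the real list is longer.
ALLOWED_ROOTS = ("BUILD", "CONTRACTS/_runs")


def is_allowed_output(project_root: str, output_path: str) -> bool:
    """Return True if output_path resolves inside one of the allowed directories."""
    root = Path(project_root).resolve()
    # Resolving first collapses ".." segments, so "BUILD/../x.md" is rejected.
    target = (root / output_path).resolve()
    return any(
        target.is_relative_to((root / allowed).resolve())
        for allowed in ALLOWED_ROOTS
    )
```

For example, `is_allowed_output("/proj", "BUILD/out.md")` is true, while `"BUILD/../secrets.md"` and an absolute path such as `"/tmp/out.md"` are both rejected.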

Dependencies

- pdfplumber>=0.9.0: PDF text and structure extraction
- No other third-party packages; basic operation otherwise relies on the standard library only
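For orientation, a per-page extraction pass with pdfplumber might look like the sketch below. The function name `extract_pages` and the shape of the returned dictionaries are assumptions made for illustration; `pdfplumber.open`, `pdf.pages`, `page.extract_text()` and `page.extract_tables()` are the library calls it relies on.

```python
from pathlib import Path


def extract_pages(pdf_path):
    """Return one dict per page with its text and any tables pdfplumber detects."""
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    # Imported here so the missing-file check works even before the
    # dependency is installed.
    import pdfplumber

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            pages.append({
                # extract_text() can yield nothing for image-only pages.
                "text": page.extract_text() or "",
                # Each table is a list of rows; each row is a list of cell strings or None.
                "tables": page.extract_tables(),
            })
    return pages
```

Text returned by `extract_text()` may also contain the characters that make up a table, so a real converter would likely need to avoid emitting table content twice.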

Fixtures

- fixtures/basic/ - Simple PDF conversion test
- fixtures/multi-page/ - Multi-page document with page breaks
- fixtures/tables/ - PDF containing tables for table extraction

Error Handling

On any error, the skill exits with code 1 and prints a descriptive message.
Common PDF errors handled:
- File not found
- Invalid PDF format
- Password-protected PDF (not supported)
- Encoding issues in text extraction
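One way to map these failures onto exit code 1 is a thin wrapper around the conversion step. This is a sketch under assumptions: `run` and its `convert` callback are hypothetical names, and since the exact exception types raised for invalid or password-protected PDFs depend on the PDF library, they are caught here by a generic handler.

```python
import sys


def run(convert, config, stderr=sys.stderr):
    """Call convert(config); return 0 on success, or 1 after printing a message."""
    try:
        convert(config)
    except FileNotFoundError as exc:
        print(f"error: file not found: {exc}", file=stderr)
    except UnicodeError as exc:
        print(f"error: encoding issue during text extraction: {exc}", file=stderr)
    except Exception as exc:
        # Invalid or password-protected PDFs surface as library-specific errors.
        print(f"error: could not convert PDF: {exc}", file=stderr)
    else:
        return 0
    return 1
```

A command-line entry point could then end with `sys.exit(run(convert, config))`, so every failure path produces a message on stderr and the same non-zero exit code.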
