Agent skills
extracting-pdf-text

Agent skill

extracting-pdf-text

Extract text from PDFs for LLM consumption. Use when processing PDFs for RAG, document analysis, or text extraction. Supports API services (Mistral OCR) and local tools (PyMuPDF, pdfplumber). Handles text-based PDFs, tables, and scanned documents with OCR.

View SKILL.md on GitHub Repository

Stars 77

Forks 14

Install this agent skill to your Project

npx add-skill https://github.com/letta-ai/skills/tree/main/tools/extracting-pdf-text

SKILL.md

Extracting PDF Text for LLMs

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

Quick Decision Guide

PDF Type	Best Approach	Script
Simple text PDF	PyMuPDF	`scripts/extract_pymupdf.py`
PDF with tables	pdfplumber	`scripts/extract_pdfplumber.py`
Scanned/image PDF (local)	pytesseract	`scripts/extract_with_ocr.py`
Complex layout, highest accuracy	Mistral OCR API	`scripts/extract_mistral_ocr.py`
End-to-end RAG pipeline	marker-pdf	`pip install marker-pdf`

Recommended Workflow

Try PyMuPDF first - fastest, handles most text-based PDFs well
If tables are mangled - switch to pdfplumber
If scanned/image-based - use Mistral OCR API (best accuracy) or local OCR (free but slower)

Local Extraction (No API Required)

PyMuPDF - Fast General Extraction

Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.

bash

uv run scripts/extract_pymupdf.py input.pdf output.md

The script outputs markdown with preserved headings and paragraphs. For LLM-optimized output, it uses pymupdf4llm which formats text for RAG systems.

pdfplumber - Table Extraction

Best for: PDFs with tables, financial documents, structured data.

bash

uv run scripts/extract_pdfplumber.py input.pdf output.md

Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.

Local OCR - Scanned Documents

Best for: Scanned PDFs when API access is unavailable.

bash

uv run scripts/extract_with_ocr.py input.pdf output.txt

Requires: pytesseract, pdf2image, and Tesseract installed (brew install tesseract on macOS).

API-Based Extraction

Mistral OCR API

Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.

Pricing: ~1000 pages per dollar (very cost-effective)

bash

export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md

Features:

Outputs clean markdown
Preserves document structure (headings, lists, tables)
Handles images, math equations, multilingual text
95%+ accuracy on complex documents

For detailed API options and other services, see references/api-services.md.

Output Format Recommendations

For LLM consumption, markdown is preferred:

Preserves semantic structure (headings become context boundaries)
Tables remain readable
Compatible with most RAG chunking strategies

For detailed comparisons of local tools, see references/local-tools.md.

Maintainer

letta-ai Core maintainer

Source details

Full Name: letta-ai/skills
Branch: main
Path in repo: tools/extracting-pdf-text

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

letta-ai/skills

yelp-search

Search Yelp for local businesses, get contact info, ratings, and hours. Use when finding services (cleaners, groomers, restaurants, etc.), looking up business phone numbers to text, or checking ratings before booking. Triggers on queries about finding businesses, restaurants, services, or "look up on Yelp".

77 14

Explore

letta-ai/skills

morph-warpgrep

Integration guide for Morph's WarpGrep (fast agentic code search) and Fast Apply (10,500 tok/s code editing). Use when building coding agents that need fast, accurate code search or need to apply AI-generated edits to code efficiently. Particularly useful for large codebases, deep logic queries, bug tracing, and code path analysis.

77 14

Explore

letta-ai/skills

obsidian-cli

Work with Obsidian vaults using the official Obsidian CLI. Read, create, append, search, and manage notes, daily notes, properties, tags, tasks, sync, and more from the terminal. Use when the user mentions Obsidian, notes, vault, daily notes, or when working with markdown knowledge bases. Requires Obsidian desktop app running with CLI enabled in Settings > General.

77 14

Explore

letta-ai/skills

mcp-builder

Guide for creating high-quality MCP (Model Context Protocol) servers that enable LLMs to interact with external services through well-designed tools. Use when building MCP servers to integrate external APIs or services, whether in Python (FastMCP) or Node/TypeScript (MCP SDK).

77 14

Explore

letta-ai/skills

google-workspace

Connect to Gmail and Google Calendar via OAuth 2.0. Use when users want to search/read emails, create drafts, search calendar events, check availability, or schedule meetings. Triggers on queries about email, inbox, calendar, schedule, or meetings.

77 14

Explore

letta-ai/skills

linear

Manage Linear issues via GraphQL API. List, filter, update, prioritize, comment, and search issues. Use when the user asks about Linear, issues, project management, or backlog.

77 14

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Extracting PDF Text for LLMs

Quick Decision Guide

Recommended Workflow

Local Extraction (No API Required)

PyMuPDF - Fast General Extraction

pdfplumber - Table Extraction

Local OCR - Scanned Documents

API-Based Extraction

Mistral OCR API

Output Format Recommendations

Recommended Agent Skills

yelp-search

morph-warpgrep

obsidian-cli

mcp-builder

google-workspace

linear