Agent skill

pdf

Boîte à outils complète pour la manipulation de PDF : extraction de texte et tableaux, création de nouveaux PDF, fusion/découpage de documents et gestion de formulaires. Quand Claude doit remplir un formulaire PDF ou traiter, générer ou analyser des documents PDF de manière programmatique et à grande échelle.

View SKILL.md on GitHub Repository

Stars 2

Forks 1

Install this agent skill to your Project

npx add-skill https://github.com/Dedalus-ERP-PAS/hexagone-foundation-skills/tree/main/skills/pdf

SKILL.md

PDF Processing Guide

Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see reference.md. If you need to fill out a PDF form, read forms.md and follow its instructions.

Quick Start

python

from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()

Python Libraries

pypdf - Basic Operations

Merge PDFs

python

from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

Split PDF

python

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)

Extract Metadata

python

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")

Rotate Pages

python

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)

pdfplumber - Text and Table Extraction

Extract Text with Layout

python

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

Extract Tables

python

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)

Advanced Table Extraction

python

import pandas as pd

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Check if table is not empty
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)

reportlab - Create PDFs

Basic PDF Creation

python

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()

Create PDF with Multiple Pages

python

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)

story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build PDF
doc.build(story)

Command-Line Tools

pdftotext (poppler-utils)

bash

# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5

qpdf

bash

# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf

pdftk (if available)

bash

# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf

Common Tasks

Extract Text from Scanned PDFs

python

# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)

Add Watermark

python

from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

Extract Images

bash

# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix
# This extracts all images as output_prefix-000.jpg, output_prefix-001.jpg, etc.

Password Protection

python

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)

Quick Reference

Task	Best Tool	Command/Code
Merge PDFs	pypdf	`writer.add_page(page)`
Split PDFs	pypdf	One page per file
Extract text	pdfplumber	`page.extract_text()`
Extract tables	pdfplumber	`page.extract_tables()`
Create PDFs	reportlab	Canvas or Platypus
Command line merge	qpdf	`qpdf --empty --pages ...`
OCR scanned PDFs	pytesseract	Convert to image first
Fill PDF forms	pdf-lib or pypdf (see forms.md)	See forms.md

Next Steps

For advanced pypdfium2 usage, see reference.md
For JavaScript libraries (pdf-lib), see reference.md
If you need to fill out a PDF form, follow the instructions in forms.md
For troubleshooting guides, see reference.md

Maintainer

Dedalus-ERP-PAS Core maintainer

Source details

Full Name: Dedalus-ERP-PAS/hexagone-foundation-skills
Branch: main
Path in repo: skills/pdf
License: MIT License
Topics: skills llm dev

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

Dedalus-ERP-PAS/hexagone-foundation-skills

ubiquitous-language

Extrait un glossaire de langage ubiquitaire style DDD de la conversation en cours, signale les ambiguïtés et propose des termes canoniques. Sauvegarde dans UBIQUITOUS_LANGUAGE.md. À utiliser quand l'utilisateur veut définir des termes métier, construire un glossaire, durcir la terminologie, créer un langage ubiquitaire ou mentionne « domain model », « DDD », « glossaire » ou « langage ubiquitaire ».

2 1

Explore

Dedalus-ERP-PAS/hexagone-foundation-skills

hexagone-web-feature-extractor

Explore any Hexagone Web space via Playwright headless browser, capture screenshots, and produce a PO-oriented Markdown document.

2 1

Explore

Dedalus-ERP-PAS/hexagone-foundation-skills

gitlab-issue

Crée, récupère, met à jour et gère les issues GitLab avec collecte complète du contexte. À utiliser quand l'utilisateur veut créer une nouvelle issue, voir les détails d'une issue, mettre à jour des issues existantes, lister les issues du projet ou gérer les workflows d'issues dans GitLab.

2 1

Explore

Dedalus-ERP-PAS/hexagone-foundation-skills

tdd

Développement piloté par les tests avec boucle red-green-refactor. À utiliser quand l'utilisateur veut construire des fonctionnalités ou corriger des bugs en TDD, mentionne « red-green-refactor », veut des tests d'intégration ou demande du développement test-first.

2 1

Explore

Dedalus-ERP-PAS/hexagone-foundation-skills

testing-patterns

Patrons et stratégies de test complets pour les projets JavaScript/TypeScript. Couvre les tests unitaires, d'intégration et E2E, les stratégies de mocking, l'organisation des tests et les anti-patrons courants. À utiliser quand l'utilisateur veut écrire des tests, améliorer la couverture de tests, établir une stratégie de test ou corriger des tests instables.

2 1

Explore

Dedalus-ERP-PAS/hexagone-foundation-skills

uniface-procscript

Navigue et interroge la documentation de référence officielle Uniface 9.7 ProcScript (594 entrées couvrant les instructions, fonctions, triggers, types de données, directives préprocesseur et fonctions struct). À utiliser quand l'utilisateur pose des questions sur la syntaxe ProcScript, les triggers Uniface, les opérations base de données, la gestion des listes, la manipulation d'entités, les fonctions de chaînes, la gestion d'erreurs ou tout sujet de programmation Uniface 9.7.

2 1

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

PDF Processing Guide

Overview

Quick Start

Python Libraries

pypdf - Basic Operations

Merge PDFs

Split PDF

Extract Metadata

Rotate Pages

pdfplumber - Text and Table Extraction

Extract Text with Layout

Extract Tables

Advanced Table Extraction

reportlab - Create PDFs

Basic PDF Creation

Create PDF with Multiple Pages

Command-Line Tools

pdftotext (poppler-utils)

qpdf

pdftk (if available)

Common Tasks

Extract Text from Scanned PDFs

Add Watermark

Extract Images

Password Protection

Quick Reference

Next Steps

Recommended Agent Skills

ubiquitous-language

hexagone-web-feature-extractor

gitlab-issue

tdd

testing-patterns

uniface-procscript