Agent skill
pdf-parity-checker
Verify visual and structural parity between XHTML chapters and POD PDF files. Use it to ensure the print edition matches the digital EPUB layout.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/pdf-parity-checker
PDF Parity Checker Skill
Purpose
Compare the 44 XHTML chapter files against their corresponding POD (print-on-demand) PDF files to ensure visual and structural consistency. This is critical for maintaining brand quality across digital and print editions.
When to Invoke
- User asks "do the PDFs match the EPUB chapters?"
- Before sending POD files to IngramSpark or print vendor
- After making changes to XHTML or CSS
- User mentions "print edition" or "PDF consistency"
- User asks "verify the PDFs are up to date"
Workflow
Run PDF Parity Verification
python3 scripts/pdf_verify.py \
  --root REBRANDED_OUTPUT \
  --targets docs/REBRANDED_VISUAL_AUDIT.json \
  --update-json
What it does:
- For each of the 44 XHTML files:
  - Locates the corresponding PDF in REBRANDED_OUTPUT/pdf-pod/
  - Compares:
    - Page count (XHTML rendered vs PDF pages)
    - Media box dimensions (PDF page size)
    - First-page visual hash (downscaled grayscale comparison)
    - Text extraction and paragraph continuity
- If PDF is missing:
  - Generates temporary reference PDF via headless browser print-to-PDF
  - Uses this for comparison (but does NOT commit to repo)
  - Flags as "MISSING" in report
- Updates docs/REBRANDED_VISUAL_AUDIT.json with:
  - pdf_check object for each chapter
  - Fields: page_count_match, bbox_match, image_hash_delta, pdf_status
Comparison Metrics
1. Page Count Match
Compares rendered XHTML page count vs PDF page count.
Example:
Chapter I: "Unveiling Your Creative Odyssey"
- XHTML rendered: 8 pages (at 6×9" print size)
- PDF actual: 8 pages
- Status: ✅ MATCH
Acceptable variance:
- Exact match: ✅ PASS
- Difference of 1 page: ⚠️ WARN (minor reflow difference)
- Difference of 2 or more pages: ❌ FAIL (significant layout mismatch)
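The thresholds above can be sketched as a small helper. This is only an illustration, not the logic inside scripts/pdf_verify.py; the function name classify_page_count and its return strings are assumptions.

```python
def classify_page_count(xhtml_pages, pdf_pages):
    """Map a page-count difference to PASS / WARN / FAIL.

    Hypothetical helper: the thresholds follow the list above,
    the name and return values are illustrative assumptions.
    """
    delta = abs(xhtml_pages - pdf_pages)
    if delta == 0:
        return "PASS"   # exact match
    if delta == 1:
        return "WARN"   # minor reflow difference
    return "FAIL"       # two or more pages apart


print(classify_page_count(8, 8))    # PASS
print(classify_page_count(10, 11))  # WARN
print(classify_page_count(8, 11))   # FAIL
```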
2. Media Box (Page Size)
Verifies that the PDF pages have the correct physical dimensions.
Expected for 6×9" POD:
- Width: 432 points (6 inches × 72 points per inch)
- Height: 648 points (9 inches × 72 points per inch)
Example:
Chapter XV: Media box check
- Expected: 432×648 pt
- Actual: 432×648 pt
- Status: ✅ MATCH
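As a rough sketch of the arithmetic behind this check, the expected media box can be derived from the trim size in inches and compared with the values read from the PDF. The helper names and the 0.5 pt tolerance are assumptions for illustration; the actual script may compare exactly.

```python
POINTS_PER_INCH = 72  # PDF user-space unit: 1 pt = 1/72 inch


def expected_media_box(width_in, height_in):
    """Convert a trim size in inches to a (width, height) pair in points."""
    return (width_in * POINTS_PER_INCH, height_in * POINTS_PER_INCH)


def media_box_matches(actual, expected, tolerance_pt=0.5):
    """True if both dimensions are within tolerance_pt of the expected size.

    The tolerance is an assumed value that absorbs rounding in exported PDFs.
    """
    return all(abs(a - e) <= tolerance_pt for a, e in zip(actual, expected))


expected = expected_media_box(6, 9)
print(expected)                                 # (432, 648)
print(media_box_matches((432, 648), expected))  # True
print(media_box_matches((432, 660), expected))  # False
```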
3. Visual Hash Comparison
Computes perceptual hash of first page to detect visual differences.
Process:
- Render XHTML first page as PNG (grayscale, downscaled to 200×300)
- Convert PDF first page to PNG (same size)
- Compute average hash for both
- Calculate Hamming distance
Scoring:
- Hash delta 0-5: ✅ IDENTICAL (effectively identical)
- Hash delta 6-15: ✅ SIMILAR (acceptable variance)
- Hash delta 16-30: ⚠️ DIFFERENT (minor layout shift)
- Hash delta >30: ❌ MISMATCH (significant visual difference)
Example:
Chapter IV: Visual hash comparison
- XHTML hash: d4a3f2c1...
- PDF hash: d4a3f2c1...
- Hamming distance: 3
- Status: ✅ IDENTICAL
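To make the process and scoring above concrete, here is a minimal pure-Python sketch of an average hash, the Hamming distance between two hashes, and the verdict bands. It assumes the page has already been rendered and reduced to a small grid of grayscale values (in practice a library such as Pillow would do that step). The function names are hypothetical, and the lowercase verdict strings other than "identical" are assumed to mirror the labels in the scoring list.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 if brighter than the mean, else 0."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(bits_a, bits_b):
    """Count the positions where two equal-length hashes differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


def hash_verdict(delta):
    """Map a Hamming distance to the scoring bands listed above."""
    if delta <= 5:
        return "identical"
    if delta <= 15:
        return "similar"
    if delta <= 30:
        return "different"
    return "mismatch"


# Two tiny 2x4 grayscale "pages"; the second differs in its last pixel.
xhtml_page = [[10, 200, 10, 200], [200, 10, 200, 10]]
pdf_page = [[10, 200, 10, 200], [200, 10, 200, 250]]

delta = hamming_distance(average_hash(xhtml_page), average_hash(pdf_page))
print(delta, hash_verdict(delta))  # 1 identical
```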
4. Text Extraction
Extracts text from PDF and verifies key content is present.
Checks:
- Chapter title appears in first 500 characters
- Heading order matches XHTML heading structure
- Paragraph count is similar (±10%)
Example:
Chapter XII: Text extraction
- Title found: ✅ "Financial Wisdom"
- Headings: 12 in XHTML, 12 in PDF ✅
- Paragraphs: 84 in XHTML, 83 in PDF ✅ (within 10%)
- Status: ✅ PASS
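Two of these checks reduce to simple string and arithmetic tests, sketched below. The helper names are hypothetical, and the case-insensitive title match is an assumption; the script may compare differently.

```python
def title_in_opening(pdf_text, title, window=500):
    """Check that the title occurs in the first `window` characters.

    Case-insensitive matching is an assumption made for this sketch.
    """
    return title.lower() in pdf_text[:window].lower()


def paragraph_variance_pct(xhtml_count, pdf_count):
    """Absolute paragraph-count difference as a percentage of the XHTML count."""
    if xhtml_count <= 0:
        raise ValueError("XHTML paragraph count must be positive")
    return abs(xhtml_count - pdf_count) / xhtml_count * 100


variance = paragraph_variance_pct(84, 83)
print(round(variance, 1))  # 1.2
print(variance <= 10)      # True
print(title_in_opening("Chapter XII\nFinancial Wisdom\n...", "Financial Wisdom"))  # True
```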
Interpreting Results
JSON Output Structure
{
  "file": "REBRANDED_OUTPUT/xhtml/9-chapter-i-unveiling-your-creative-odyssey.xhtml",
  "basename": "9-chapter-i-unveiling-your-creative-odyssey",
  "pdf_check": {
    "pdf_path": "REBRANDED_OUTPUT/pdf-pod/chapters/9-chapter-i-unveiling-your-creative-odyssey.pdf",
    "pdf_status": "ok",
    "page_count_match": true,
    "page_count_xhtml": 8,
    "page_count_pdf": 8,
    "bbox_match": true,
    "bbox_expected": [432, 648],
    "bbox_actual": [432, 648],
    "image_hash_delta": 3,
    "image_hash_verdict": "identical",
    "text_checks": {
      "title_found": true,
      "heading_count_match": true,
      "paragraph_variance_pct": 1.2
    }
  }
}
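A single entry of this shape can be scanned for problems with a few dictionary lookups. The sketch below is illustrative only: collect_issues is a hypothetical helper, the issue wording is invented, and it handles one entry because the top-level layout of docs/REBRANDED_VISUAL_AUDIT.json is not shown here.

```python
import json

# Abbreviated copy of the entry above, used as sample input.
SAMPLE = """
{
  "basename": "9-chapter-i-unveiling-your-creative-odyssey",
  "pdf_check": {
    "pdf_status": "ok",
    "page_count_match": true,
    "bbox_match": true,
    "image_hash_delta": 3,
    "text_checks": {
      "title_found": true,
      "heading_count_match": true,
      "paragraph_variance_pct": 1.2
    }
  }
}
"""


def collect_issues(entry):
    """Return human-readable issues for one audit entry (empty list if clean)."""
    check = entry.get("pdf_check", {})
    if check.get("pdf_status") == "missing":
        return ["PDF not found"]
    issues = []
    if not check.get("page_count_match", False):
        issues.append("page count differs")
    if not check.get("bbox_match", False):
        issues.append("media box differs")
    if check.get("image_hash_delta", 0) > 15:
        issues.append("visual hash delta above 15")
    text = check.get("text_checks", {})
    if not text.get("title_found", False):
        issues.append("title not found")
    if not text.get("heading_count_match", False):
        issues.append("heading count differs")
    if text.get("paragraph_variance_pct", 0) > 10:
        issues.append("paragraph variance above 10%")
    return issues


entry = json.loads(SAMPLE)
print(collect_issues(entry))  # []
```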
Markdown Summary
Generated in docs/REBRANDED_VISUAL_AUDIT.md:
| File | PDF Status | Page Match | Visual Match | Issues |
|---|---|---|---|---|
| 9-chapter-i-... | ✅ OK | ✅ 8 pages | ✅ Identical | None |
| 15-chapter-vi-... | ⚠️ OK | ⚠️ 10 vs 11 | ✅ Similar | +1 page variance |
| 22-chapter-xii-... | ❌ MISSING | N/A | N/A | PDF not found |
Common Issues and Fixes
Issue: Page Count Mismatch
Symptom: XHTML renders as 8 pages, PDF has 9 pages
Possible causes:
- Extra blank page in PDF (page break issue)
- Different margin settings between XHTML and PDF export
- Widow/orphan control differences
How to fix:
- Open PDF in Acrobat to verify blank page
- Adjust the orphans/widows settings in print-pod.css:
  p { orphans: 2; widows: 2; }
- Re-export PDF from InDesign or print-to-PDF workflow
- Re-run parity check to verify
Issue: Visual Hash Mismatch
Symptom: Hash delta >30 (significant visual difference)
Possible causes:
- Font substitution in PDF vs XHTML
- Image resolution difference
- Different CSS applied (print vs digital styles)
How to fix:
- Open both side-by-side:
  - XHTML screenshot: docs/screenshots/{basename}/title_1080x1440.png
  - PDF page 1: Open in Acrobat, export as PNG
- Identify visual difference (typography, spacing, images)
- Adjust print-pod.css to match digital rendering
- Regenerate PDF
- Re-run parity check
Issue: Missing PDF
Symptom: pdf_status: "missing"
Action:
- Check if PDF was not generated yet for this chapter
- Verify PDF naming matches XHTML basename:
  - XHTML: 9-chapter-i-unveiling-your-creative-odyssey.xhtml
  - PDF should be: 9-chapter-i-unveiling-your-creative-odyssey.pdf
- If PDF exists with different name, rename to match
- If PDF doesn't exist:
  - Generate from XHTML using print-to-PDF workflow
  - Place in appropriate pdf-pod/ subdirectory
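The naming rule above (same basename, .pdf extension) can be checked with a short sketch like the following. Both function names are hypothetical, and the case-insensitive near-miss lookup is an added assumption meant to surface files that only need renaming.

```python
from pathlib import PurePosixPath


def expected_pdf_name(xhtml_path):
    """Derive the PDF file name that should accompany an XHTML chapter."""
    return PurePosixPath(xhtml_path).with_suffix(".pdf").name


def find_near_miss(xhtml_path, pdf_names):
    """Return a PDF whose name differs from the expected one only by case, if any."""
    wanted = expected_pdf_name(xhtml_path)
    for name in pdf_names:
        if name != wanted and name.lower() == wanted.lower():
            return name
    return None


xhtml = "REBRANDED_OUTPUT/xhtml/9-chapter-i-unveiling-your-creative-odyssey.xhtml"
print(expected_pdf_name(xhtml))
# 9-chapter-i-unveiling-your-creative-odyssey.pdf
print(find_near_miss(xhtml, ["9-Chapter-I-Unveiling-Your-Creative-Odyssey.pdf"]))
# 9-Chapter-I-Unveiling-Your-Creative-Odyssey.pdf
```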
Issue: Text Extraction Failure
Symptom: text_checks.title_found: false
Possible causes:
- PDF is image-only (scanned, not text-based)
- PDF has font embedding issues (text not extractable)
- PDF was generated with text as outlines/curves
How to fix:
- Verify PDF is text-based (select text in Acrobat)
- If image-only, regenerate PDF from source XHTML
- Ensure fonts are embedded as subsets (not outlines)
Example Invocation
User: "Are the print PDFs in sync with the latest EPUB chapters?"
Agent response:
- Run PDF parity checker
- Read results from updated JSON
- Summarize:
PDF Parity Check Complete
=========================
Total chapters: 44
PDFs verified: 42
Missing PDFs: 2
Results:
✅ PASS: 38 chapters (perfect parity)
⚠️ WARN: 4 chapters (minor variance)
❌ FAIL: 0 chapters (significant mismatch)
🔍 MISSING: 2 chapters (PDFs not found)
Warnings (4 chapters):
- Chapter VI: +1 page variance (11 vs 10) - Extra blank page in PDF
- Chapter XI: Visual hash delta 18 - Minor font rendering difference
- Chapter XV: Text variance 12% - Some paragraphs reflowed
- Chapter XXII: Media box 432×660 - Height should be 648
Missing PDFs (2 chapters):
- 30-SelfAssessment.xhtml (no matching PDF found)
- 43-DoodlePage.xhtml (no matching PDF found)
Recommended actions:
1. Fix page break in Chapter VI
2. Review font settings for Chapter XI
3. Generate missing PDFs for Self-Assessment and Doodle pages
4. Verify media box for Chapter XXII
Full report: docs/REBRANDED_VISUAL_AUDIT.md (PDF Parity column)
Detailed JSON: docs/REBRANDED_VISUAL_AUDIT.json (pdf_check objects)
Integration with Other Skills
Run after:
- epub-visual-auditor: Ensure XHTML rendering is correct first
Run before:
- Sending POD files to print vendor
- Uploading to IngramSpark or KDP Print
- Final publication package
Pair with:
- epub-publication-validator: Comprehensive pre-publication check
Notes
- PDF comparison requires pypdf and Pillow Python libraries
- First run may be slower (generates temporary PDFs for missing files)
- Temporary reference PDFs are stored in /tmp/ and not committed to repo
- Visual hash comparison is perceptual (small rendering differences are OK)
- Re-run after any CSS or XHTML changes to verify parity maintained