Agent skill
smart-screenshot
Intelligent screenshot and screen capture with OCR, markdown conversion, and annotation. Triggered by PrtSc key, captures screen regions, extracts text with OCR, converts to markdown using MarkItDown, saves with auto-formatting. Similar to Windows Snipping Tool with AI enhancements for text extraction and document processing.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/smart-screenshot
SKILL.md
Smart Screenshot
Intelligent screen capture with OCR, markdown conversion, and smart formatting. Capture screen regions, extract text, convert images/PDFs to markdown, and save with automated formatting.
Quick Start
Trigger methods:
- Keyboard shortcut: Press
PrtSc(customizable) - Command line:
python scripts/capture.py - Claude Code: Ask Claude to "take a screenshot"
Workflow:
- Press PrtSc → Capture mode activates
- Choose: Image or Text
- Select region/window
- If Image: Save with annotation options
- If Text: OCR → MarkItDown → Save markdown
Prerequisites
System Requirements
- Windows 10/11, macOS 10.14+, or Linux
- Python 3.8+
- Screen with display access
Install Dependencies
Core (required):
# Screenshot and OCR
pip install pillow pyautogui mss pytesseract pyscreenshot --break-system-packages
# MarkItDown (Microsoft's converter)
pip install markitdown --break-system-packages
# Keyboard hooks
pip install keyboard pynput --break-system-packages
# GUI for dialogs
pip install tkinter --break-system-packages # May be pre-installed
OCR engine (Tesseract):
Windows:
# Download installer from:
# https://github.com/UB-Mannheim/tesseract/wiki
# Install to: C:\Program Files\Tesseract-OCR\
# Add to PATH
macOS:
brew install tesseract
Linux:
sudo apt-get install tesseract-ocr
# or
sudo dnf install tesseract
Optional enhancements:
# Better OCR (EasyOCR - slower but more accurate)
pip install easyocr --break-system-packages
# PDF handling
pip install pdf2image pypdf2 --break-system-packages
# Image enhancement
pip install opencv-python --break-system-packages
# Clipboard integration
pip install pyperclip --break-system-packages
See reference/setup-guide.md for detailed installation.
Features
Capture Modes
1. Region Selection
- Click and drag to select area
- Real-time preview
- Pixel-perfect selection
2. Window Capture
- Automatically detect windows
- Capture specific application
- Includes/excludes borders
3. Full Screen
- Entire display
- Multi-monitor support
- All screens at once
4. Scrolling Capture
- Capture long web pages
- Auto-scroll and stitch
- Perfect for documentation
Text Extraction
OCR Engines:
- Tesseract - Fast, free, 100+ languages
- EasyOCR - Slower, more accurate
- Cloud OCR - Azure/Google (highest accuracy)
Smart text processing:
- Automatic language detection
- Text cleanup and formatting
- Table recognition
- Layout preservation
Markdown Conversion
Using MarkItDown (Microsoft):
- Images → Markdown with alt text
- PDFs → Clean markdown
- Screenshots → Formatted text
- Tables → Markdown tables
- Code blocks → Syntax highlighting
Conversion features:
- Smart heading detection
- List preservation
- Link extraction
- Code formatting
- Table structure recognition
Core Operations
Quick Capture
Keyboard shortcut:
# Run as background service
python scripts/screenshot_service.py
# Now press PrtSc anytime:
# 1. Screen freezes
# 2. Choose "Image" or "Text"
# 3. Select region
# 4. Auto-process and save
Command line:
# Capture with UI
python scripts/capture.py
# Capture full screen immediately
python scripts/capture.py --fullscreen --output screenshot.png
# Capture region with coordinates
python scripts/capture.py --region 100,100,800,600 --output region.png
Text Mode (OCR → Markdown)
Interactive:
# Start capture
python scripts/capture.py --mode text
# Process:
# 1. Select region
# 2. OCR extracts text
# 3. MarkItDown formats
# 4. Save dialog opens
# 5. Save as .md file
Automatic:
# Capture and OCR
python scripts/capture_text.py --output extracted.md
# With specific language
python scripts/capture_text.py --lang eng+fra --output text.md
# With enhancement
python scripts/capture_text.py --enhance --output clean.md
Image Mode
Interactive:
# Start capture
python scripts/capture.py --mode image
# Process:
# 1. Select region
# 2. Annotation tools appear
# 3. Add arrows, boxes, text
# 4. Save dialog opens
With annotations:
# Capture and annotate
python scripts/capture_annotate.py --output annotated.png
# Annotation tools:
# - Arrow
# - Rectangle
# - Circle
# - Text
# - Highlight
# - Blur (redact sensitive info)
PDF to Markdown
Convert PDF to markdown:
# Using MarkItDown
python scripts/pdf_to_markdown.py --input document.pdf --output document.md
# With OCR for scanned PDFs
python scripts/pdf_to_markdown.py --input scanned.pdf --ocr --output text.md
# Batch convert folder
python scripts/batch_pdf_convert.py --input ./pdfs/ --output ./markdown/
Screenshot from Image
Process existing image:
# Extract text to markdown
python scripts/image_to_markdown.py --input screenshot.png --output text.md
# Clean up image first
python scripts/enhance_and_extract.py --input noisy.png --output clean.md
Configuration
Settings file: config.yaml
# Keyboard shortcut
hotkey: "Print" # or "ctrl+shift+s", "cmd+shift+5", etc.
# Default capture mode
default_mode: "prompt" # "image", "text", or "prompt"
# OCR settings
ocr:
engine: "tesseract" # "tesseract", "easyocr", or "cloud"
language: "eng"
enhance: true # Pre-process image for better OCR
# Output settings
output:
directory: "~/Screenshots"
filename_pattern: "Screenshot-{date}-{time}"
auto_save: false # true = skip save dialog
clipboard: true # Copy to clipboard
# Markdown settings
markdown:
format_code_blocks: true
detect_tables: true
preserve_formatting: true
# Annotation defaults
annotation:
arrow_color: "#FF0000"
box_color: "#0000FF"
text_color: "#000000"
text_size: 12
line_width: 2
Common Workflows
Workflow 1: Code Documentation
Scenario: Capture code from screen → Markdown documentation
# 1. Run screenshot service
python scripts/screenshot_service.py &
# 2. Press PrtSc on your keyboard
# 3. Select "Text" mode
# 4. Select code region on screen
# 5. OCR extracts code
# 6. MarkItDown formats as code block:
```python
def example_function():
return "formatted code"
7. Save dialog opens → Save as code-snippet.md
### Workflow 2: Meeting Notes from Slides
**Scenario:** Capture presentation slides → Formatted notes
```bash
# Capture multiple slides
python scripts/capture_sequence.py \
--count 5 \
--delay 3 \
--mode text \
--output slides.md
# Result: All slides as markdown in one file
Workflow 3: Email/Document Processing
Scenario: Screenshot email → Extract and format text
# Capture email
python scripts/capture.py --mode text --enhance
# Text extracted, formatted, and saved
# Perfect for archiving or processing
Workflow 4: Research Paper Annotation
Scenario: Screenshot paper → Annotate → Save
# Capture and annotate
python scripts/capture_annotate.py --output paper-notes.png
# Add arrows, highlights, notes
# Save annotated version
Workflow 5: Batch PDF Conversion
Scenario: Convert all PDFs to markdown
# Convert folder of PDFs
python scripts/batch_pdf_convert.py \
--input ~/Documents/PDFs/ \
--output ~/Documents/Markdown/ \
--ocr # Enable OCR for scanned docs
# Progress shown for each file
# All PDFs → Clean markdown
MarkItDown Features
Microsoft's MarkItDown converts:
Images:
- Screenshots → Extracted text
- Diagrams → Alt text descriptions
- Charts → Data tables
PDFs:
- Native PDFs → Clean markdown
- Scanned PDFs → OCR + markdown
- Preserve structure and formatting
Documents:
- Word docs → Markdown
- PowerPoint → Slide content
- Excel → Markdown tables
Code:
- Syntax highlighted code blocks
- Language detection
- Proper indentation
Tables:
- Visual tables → Markdown tables
- Preserved alignment
- Header detection
Keyboard Shortcuts
During capture:
Esc- Cancel captureSpace- Toggle crosshair/selectionEnter- Confirm selectionCtrl+Z- Undo annotationCtrl+C- Copy to clipboardCtrl+S- Save
Annotation mode:
A- Arrow toolR- Rectangle toolC- Circle toolT- Text toolH- Highlight toolB- Blur toolDelete- Remove last annotation
OCR Accuracy Tips
Better results:
- Enhance image first - Increase contrast, denoise
- Correct language - Specify language(s)
- Proper DPI - Higher resolution = better OCR
- Clean background - Remove clutter
- Good lighting - For camera captures
- Straight text - Rotate if needed
Pre-processing:
# Enhance before OCR
python scripts/enhance_image.py \
--input screenshot.png \
--output enhanced.png \
--operations "grayscale,contrast,denoise"
# Then OCR
python scripts/image_to_markdown.py --input enhanced.png
Multi-Monitor Support
Capture from specific monitor:
# List monitors
python scripts/list_monitors.py
# Capture from monitor 2
python scripts/capture.py --monitor 2
# Capture all monitors
python scripts/capture.py --all-monitors
Save Dialog Options
When save dialog appears:
- Filename: Auto-generated or custom
- Location: Last used or default directory
- Format: .md, .txt, .png, .jpg
- Options:
- Copy to clipboard
- Open in editor
- Share
Skip dialog (auto-save):
# config.yaml
output:
auto_save: true
directory: "~/Screenshots"
Integration with Clipboard
Copy to clipboard automatically:
# Text mode - copies markdown
python scripts/capture.py --mode text --clipboard
# Image mode - copies image
python scripts/capture.py --mode image --clipboard
# Both
python scripts/capture.py --clipboard --save
Paste from clipboard:
# Process clipboard image
python scripts/process_clipboard.py --output result.md
Running as Service
Windows
Install as Windows service:
# Install
python scripts/install_windows_service.py
# Service runs on startup
# PrtSc always available
Or use Task Scheduler:
1. Open Task Scheduler
2. Create Basic Task
3. Trigger: At log on
4. Action: Start program
5. Program: python
6. Arguments: path\to\screenshot_service.py
macOS
LaunchAgent setup:
# Install service
python scripts/install_macos_service.py
# Creates ~/Library/LaunchAgents/com.screenshot.service.plist
# Runs on login
Manual:
# Create LaunchAgent plist
# Load with launchctl
launchctl load ~/Library/LaunchAgents/com.screenshot.service.plist
Linux
Systemd service:
# Install
python scripts/install_linux_service.py
# Creates ~/.config/systemd/user/screenshot.service
# Enable and start:
systemctl --user enable screenshot
systemctl --user start screenshot
Or use autostart:
# Copy desktop entry
cp screenshot.desktop ~/.config/autostart/
Cloud OCR (Optional)
For highest accuracy:
Azure Computer Vision:
export AZURE_CV_KEY="your-key"
export AZURE_CV_ENDPOINT="https://your-region.api.cognitive.microsoft.com/"
python scripts/capture.py --mode text --ocr-engine azure
Google Vision:
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
python scripts/capture.py --mode text --ocr-engine google
Costs:
- Azure: 1000 transactions/month free, then $1/1000
- Google: 1000 units/month free, then $1.50/1000
Scripts Reference
Capture:
capture.py- Main interactive capturecapture_text.py- Text mode onlycapture_annotate.py- Image with annotationscapture_sequence.py- Multiple captures
Service:
screenshot_service.py- Background serviceinstall_windows_service.py- Windows installerinstall_macos_service.py- macOS installerinstall_linux_service.py- Linux installer
Conversion:
image_to_markdown.py- Image → Markdownpdf_to_markdown.py- PDF → Markdownbatch_pdf_convert.py- Batch conversion
Processing:
enhance_image.py- Image enhancementprocess_clipboard.py- Clipboard processingextract_tables.py- Table extraction
Utilities:
list_monitors.py- List displaystest_ocr.py- Test OCR accuracyconfigure.py- Interactive config
Best Practices
- Run as service - Always available with hotkey
- Configure hotkey - Choose comfortable shortcut
- Enable clipboard - Quick copy-paste workflow
- Enhance first - Better OCR results
- Use appropriate OCR - Tesseract for speed, Cloud for accuracy
- Organize output - Set default directory
- Backup settings - Save config.yaml
- Test thoroughly - Verify OCR accuracy for your use case
Troubleshooting
"Tesseract not found"
# Install Tesseract
# Windows: Download installer
# macOS: brew install tesseract
# Linux: apt install tesseract-ocr
# Check installation
tesseract --version
"Permission denied" (screenshot)
Windows: Run as Administrator
macOS: System Preferences → Security → Privacy → Screen Recording
Linux: Check X11 permissions
"Keyboard hook failed"
# Requires administrator/root privileges
# Windows: Run as Administrator
# macOS: Grant Accessibility permissions
# Linux: Run with sudo or add user to input group
"Poor OCR quality"
# Enhance image first
python scripts/enhance_image.py --input screenshot.png
# Try different OCR engine
python scripts/capture.py --ocr-engine easyocr
# Specify language
python scripts/capture.py --lang eng+fra
"MarkItDown not working"
pip install --upgrade markitdown --break-system-packages
# Check version
python -c "import markitdown; print(markitdown.__version__)"
Platform-Specific Notes
Windows
- PrtSc key native support
- Windows Ink integration available
- OneDrive sync compatible
- Notification system integration
macOS
- cmd+shift+5 alternative
- Quick Look preview
- iCloud Drive sync
- Notification Center integration
Linux
- Wayland/X11 support
- Various hotkey daemons
- Desktop environment integration
- Screenshot directories vary
Integration Examples
See examples/ for complete workflows:
- examples/documentation-workflow.md - Code docs
- examples/research-notes.md - Paper processing
- examples/meeting-capture.md - Meeting slides
- examples/email-archival.md - Email processing
Reference Documentation
- reference/setup-guide.md - Complete setup
- reference/ocr-engines.md - OCR comparison
- reference/markitdown-guide.md - MarkItDown features
- reference/hotkey-config.md - Keyboard shortcuts
- reference/service-install.md - Service setup
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
Didn't find tool you were looking for?