article-extractor

Extract clean article content from URLs (blog posts, articles, tutorials) and save it as readable text. Use when the user wants to download, extract, or save an article or blog post from a URL without ads, navigation, or clutter.

Install this agent skill to your Project

npx add-skill https://github.com/Dedalus-ERP-PAS/hexagone-foundation-skills/tree/main/skills/article-extractor

Metadata

Additional technical details for this skill

author: Foundation Skills

SKILL.md

Article Extractor

This skill extracts the main content from web articles and blog posts, removes navigation, ads, newsletter signups, and other clutter, and saves the result as clean, readable text.

When to Use This Skill

Activate when the user:

Provides an article/blog URL and wants the text content
Asks to "download this article"
Wants to "extract the content from [URL]"
Asks to "save this blog post as text"
Needs clean article text without distractions

How It Works

Priority Order:

1. Check whether an extraction tool is installed (reader or trafilatura)
2. Download and extract the article with the best available tool
3. Clean up the content (remove extra whitespace, format properly)
4. Save to a file named after the article title
5. Confirm the location and show a preview
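
The clean-up step in the list above has no code of its own in this skill. A minimal Python sketch of one possible reading, assuming "clean up" means trimming line ends, collapsing runs of spaces and tabs, and squeezing repeated blank lines into one. The name `clean_text` is hypothetical and not part of the skill:

```python
import re


def clean_text(raw: str) -> str:
    """Normalize whitespace in extracted article text."""
    # Normalize Windows and old-Mac line endings first.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces/tabs inside each line and trim both ends.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Squeeze three or more newlines (two or more blank lines) into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    # End the file with exactly one newline.
    return text.strip() + "\n"


if __name__ == "__main__":
    print(clean_text("Title  \n\n\n\nFirst   paragraph.\t\n"), end="")
```

Paragraph breaks survive as single blank lines, so the saved file stays readable in any text editor.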

Installation Check

Check for article extraction tools in this order:

Option 1: reader (Recommended - Mozilla's Readability)

bash

command -v reader

If not installed:

bash

npm install -g @mozilla/readability-cli
# or
npm install -g reader-cli
# Confirm that the installed package actually provides a reader binary
command -v reader

Option 2: trafilatura (Python-based, very good)

bash

command -v trafilatura

If not installed:

bash

pip3 install trafilatura

Option 3: Fallback (curl + simple parsing)

If neither tool is available, fall back to curl with basic text extraction (less reliable, but it needs no extra installation).
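
The same detection order can be sketched in Python with the standard library's `shutil.which`. This is an illustration rather than part of the skill: `pick_tool` and `PREFERRED_TOOLS` are hypothetical names, and the `which` parameter exists only so the lookup can be replaced in tests:

```python
import shutil
from typing import Callable, Optional

# Tools in the order this skill prefers them.
PREFERRED_TOOLS = ("reader", "trafilatura")


def pick_tool(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return the first preferred tool found on PATH, else 'fallback'."""
    for name in PREFERRED_TOOLS:
        if which(name):
            return name
    return "fallback"


if __name__ == "__main__":
    print(f"Using: {pick_tool()}")
```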

Extraction Methods

Method 1: Using reader (Best for most articles)

bash

# Extract article
reader "URL" > article.txt

Pros:

Based on Mozilla's Readability algorithm
Excellent at removing clutter
Preserves article structure

Method 2: Using trafilatura (Best for blogs/news)

bash

# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt

Pros:

Very accurate extraction
Good with various site structures
Handles multiple languages

Options:

--no-comments: Skip comment sections
--no-tables: Skip data tables
--precision: Favor precision over recall
--recall: Extract more content (may include some noise)
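
When the command is assembled programmatically, the options above can be combined as in the following sketch. It uses only the flags listed in this section; the function name `trafilatura_command` and its parameters are hypothetical, and treating `--precision` and `--recall` as mutually exclusive is an assumption of this sketch:

```python
from typing import List


def trafilatura_command(url: str, *, comments: bool = True, tables: bool = True,
                        mode: str = "balanced") -> List[str]:
    """Build the argument list for a trafilatura CLI call."""
    if mode not in ("balanced", "precision", "recall"):
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["trafilatura", "--URL", url, "--output-format", "txt"]
    if not comments:
        cmd.append("--no-comments")
    if not tables:
        cmd.append("--no-tables")
    if mode != "balanced":
        cmd.append(f"--{mode}")  # adds --precision or --recall
    return cmd


if __name__ == "__main__":
    print(" ".join(trafilatura_command("https://example.com/article", comments=False)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with unusual URLs.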

Method 3: Fallback (curl + basic parsing)

bash

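# Download the page (following redirects) and keep only paragraph and heading text
curl -sL "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.block_tags = {'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
        self.skip_depth = 0    # > 0 while inside a skipped element
        self.in_block = False  # True while inside a paragraph or heading
        self.buffer = []       # text fragments of the current block
        self.content = []      # finished blocks

    def flush(self):
        # Join inline fragments and collapse whitespace into single spaces
        text = ' '.join(''.join(self.buffer).split())
        if text:
            self.content.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in self.block_tags:
            self.flush()  # an unclosed <p> ends where the next block starts
            self.in_block = True

    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.block_tags:
            self.flush()
            self.in_block = False

    def handle_data(self, data):
        if self.in_block and self.skip_depth == 0:
            self.buffer.append(data)

    def get_content(self):
        self.flush()
        return '\\n\\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt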

Note: This is less reliable than reader or trafilatura, but it needs nothing beyond curl and python3.

Getting Article Title

Extract title for filename:

Using reader:

bash

# reader outputs markdown with title at top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')

Using trafilatura:

bash

# Get metadata including title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin).get('title') or 'Article')")

Using curl (fallback):

bash

# grep -P needs GNU grep (the default grep on macOS does not support it)
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
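
Where GNU grep is not available, the same title clean-up can be sketched in Python. This is an illustration under stated assumptions, not the skill's own method: `title_from_html` is a hypothetical name, the regular expression only handles a plain `<title>` element, and the site name is assumed to follow the first " - " or " | ", as in the sed commands above:

```python
import html
import re
from typing import Optional


def title_from_html(page: str) -> Optional[str]:
    """Return the <title> text without a trailing site name, or None."""
    match = re.search(r"<title[^>]*>(.*?)</title>", page, re.IGNORECASE | re.DOTALL)
    if not match:
        return None
    # Decode entities such as &amp; and collapse internal whitespace.
    title = " ".join(html.unescape(match.group(1)).split())
    # Cut at the first " - " or " | " separator.
    title = re.split(r" [-|] ", title, maxsplit=1)[0].strip()
    return title or None


if __name__ == "__main__":
    sample = "<html><head><title>Fast &amp; Simple Parsing | Example Blog</title></head></html>"
    print(title_from_html(sample))
```

Unlike the grep pipeline, this also decodes HTML entities, so a title such as "Fast &amp;amp; Simple" does not leak into the filename.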

Filename Creation

Clean title for filesystem:

bash

# Get title
TITLE="Article Title from Website"

# Clean for filesystem: replace / : | with -, delete ? " < >, limit length
# (tr cannot replace with an empty string, so deletions use tr -d)
FILENAME=$(printf '%s\n' "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
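
The same rules can be written as a small Python helper, which is easier to unit-test than a shell pipeline. A minimal sketch: `safe_filename` is a hypothetical name, and the `default` fallback for titles that end up empty is an assumption that the shell version does not make:

```python
import re


def safe_filename(title: str, max_len: int = 100, default: str = "article") -> str:
    """Turn an article title into a filesystem-friendly .txt filename."""
    # Replace path-like separators with a hyphen.
    name = re.sub(r"[/:|]", "-", title)
    # Drop the remaining characters listed as unsafe.
    name = re.sub(r'[?"<>]', "", name)
    # Truncate, then trim spaces left at either end.
    name = name[:max_len].strip()
    return f"{name or default}.txt"


if __name__ == "__main__":
    print(safe_filename('What is "AI"? A/B: guide'))
```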

Complete Workflow

bash

ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt
        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;
    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")
        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;
    fallback)
        # Get title
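        TITLE=$(curl -sL "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}  # Remove site name
        TITLE=${TITLE%% | *}  # Remove site name (alternate)
        # Get content (basic extraction: paragraph and heading text only)
        curl -sL "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}
        self.block_tags = {'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
        self.skip_depth = 0    # > 0 while inside a skipped element
        self.in_block = False  # True while inside a paragraph or heading
        self.buffer = []       # text fragments of the current block
        self.content = []      # finished blocks

    def flush(self):
        # Join inline fragments and collapse whitespace into single spaces
        text = ' '.join(''.join(self.buffer).split())
        if text:
            self.content.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in self.block_tags:
            self.flush()  # an unclosed <p> ends where the next block starts
            self.in_block = True

    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.block_tags:
            self.flush()
            self.in_block = False

    def handle_data(self, data):
        if self.in_block and self.skip_depth == 0:
            self.buffer.append(data)

    def get_content(self):
        self.flush()
        return '\\n\\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt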
        ;;
esac

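# Stop if nothing was extracted
if [ ! -s temp_article.txt ]; then
    echo "Error: no content extracted from $ARTICLE_URL" >&2
    rm -f temp_article.txt
    exit 1
fi

# Clean filename (tr -d deletes characters; fall back to a default if the title is empty)
FILENAME=$(printf '%s\n' "${TITLE:-article}" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-80 | sed 's/^ *//; s/ *$//')
FILENAME="${FILENAME:-article}.txt"

# Move to final filename
mv temp_article.txt "$FILENAME"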

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"

Error Handling

Common Issues

1. Tool not installed

Try alternate tool (reader → trafilatura → fallback)
Offer to install: "Install reader with: npm install -g reader-cli"

2. Paywall or login required

Extraction tools may fail
Inform user: "This article requires authentication. Cannot extract."

3. Invalid URL

Check URL format
Retry while following redirects (for example, curl -L)

4. No content extracted

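Site may rely heavily on JavaScript to render its content
Try another tool or the fallback method (the curl fallback does not execute JavaScript, so it may also return little text)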
Inform user if extraction fails

5. Special characters in title

Clean title for filesystem
Remove: /, :, ?, ", <, >, |
Replace with - or remove
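
Two of the checks above (URL format and empty output) are easy to automate before and after the extraction. A minimal Python sketch with hypothetical helper names; the 50-word threshold is an arbitrary assumption and should be tuned to the kind of articles being saved:

```python
from urllib.parse import urlparse


def is_http_url(url: str) -> bool:
    """True if the string parses as an http(s) URL with a host."""
    try:
        parts = urlparse(url.strip())
    except ValueError:  # raised for some malformed inputs, e.g. broken IPv6 hosts
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def has_content(text: str, min_words: int = 50) -> bool:
    """Heuristic: treat very short output as a failed extraction."""
    return len(text.split()) >= min_words


if __name__ == "__main__":
    print(is_http_url("https://example.com/article"))
    print(has_content("Please log in to continue reading."))
```

A short result does not prove a paywall or a JavaScript-only page, but it is a cheap signal that the preview deserves a closer look before the file is kept.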

Output Format

Saved File Contains:

Article title (if available)
Author (if available from tool)
Main article text
Section headings
No navigation, ads, or clutter

What Gets Removed:

Navigation menus
Ads and promotional content
Newsletter signup forms
Related articles sidebars
Comment sections (optional)
Social media buttons
Cookie notices

Tips for Best Results

1. Use reader for most articles

Best all-around tool
Based on Firefox Reader View
Works on most news sites and blogs

2. Use trafilatura for:

Academic articles
News sites
Blogs with complex layouts
Non-English content

3. Fallback method limitations:

May include some noise
Less accurate paragraph detection
Better than nothing for simple sites

4. Check extraction quality:

Always show preview to user
Ask if it looks correct
Offer to try different tool if needed

Example Usage

Simple extraction:

bash

# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"

With error handling:

bash

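# Try reader first; fall back to trafilatura if it fails or writes nothing
if ! reader "$URL" > temp.txt 2>/dev/null || [ ! -s temp.txt ]; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura." >&2
        exit 1
    fi
fi

# Verify that extraction produced content before saving
if [ ! -s temp.txt ]; then
    echo "Error: Extraction produced no content." >&2
    exit 1
fi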
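
The reader → trafilatura → fallback chain generalizes to a loop over extractors. The sketch below is an illustration, not the skill's implementation: `extract_with_fallback`, `run_reader`, and `run_trafilatura` are hypothetical names, and the two runners assume the CLI invocations shown earlier in this document:

```python
import subprocess
from typing import Callable, Iterable, Optional, Tuple

Extractor = Callable[[str], Optional[str]]


def run_reader(url: str) -> str:
    """Call the reader CLI; raises if it is missing or exits non-zero."""
    return subprocess.run(["reader", url], capture_output=True,
                          text=True, check=True).stdout


def run_trafilatura(url: str) -> str:
    """Call the trafilatura CLI with plain-text output."""
    return subprocess.run(["trafilatura", "--URL", url, "--output-format", "txt"],
                          capture_output=True, text=True, check=True).stdout


def extract_with_fallback(url: str,
                          extractors: Iterable[Tuple[str, Extractor]]) -> Tuple[str, str]:
    """Try each (name, extractor) in order; return (name, text) of the first success."""
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(url)
        except Exception as exc:  # missing binary, non-zero exit, network error
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("Could not extract article (" + "; ".join(errors) + ")")
```

A call such as `extract_with_fallback(url, [("reader", run_reader), ("trafilatura", run_trafilatura)])` returns the name of the tool that succeeded, which makes it simple to tell the user which tool was used.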

Best Practices

✅ Always show preview after extraction (first 10 lines)
✅ Verify extraction succeeded before saving
✅ Clean filename for filesystem compatibility
✅ Try fallback method if primary fails
✅ Inform user which tool was used
✅ Keep filename length reasonable (< 100 chars)

After Extraction

Display to user:

"✓ Extracted: [Article Title]"
"✓ Saved to: [filename]"
Show preview (first 10-15 lines)
File size and location
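
The items above can be assembled into one message. A minimal Python sketch, assuming the article was saved as UTF-8 text; `extraction_report` is a hypothetical helper name and the exact wording of the message is illustrative:

```python
import os
from itertools import islice


def extraction_report(title: str, path: str, preview_lines: int = 10) -> str:
    """Build the confirmation message shown after a successful extraction."""
    # Read only the first few lines so large articles stay cheap to preview.
    with open(path, encoding="utf-8") as handle:
        preview = "".join(islice(handle, preview_lines)).rstrip("\n")
    size = os.path.getsize(path)
    return (
        f"✓ Extracted: {title}\n"
        f"✓ Saved to: {os.path.abspath(path)} ({size} bytes)\n\n"
        f"Preview (first {preview_lines} lines):\n{preview}"
    )
```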

Ask if needed:

"Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
"Should I extract another article?"

Maintainer

Dedalus-ERP-PAS Core maintainer

Source details

Full Name: Dedalus-ERP-PAS/hexagone-foundation-skills
Branch: main
Path in repo: skills/article-extractor
License: MIT License
Topics: skills llm dev


