Agent skill

apify-scraper-builder

Build Apify Actors (web scrapers) using Node.js and Crawlee. Use when creating new scrapers, defining input schemas, configuring Dockerfiles, or deploying to Apify. Triggers include apify, actor, scraper, crawlee, web scraping, data extraction.

Install this agent skill in your project:

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/apify-scraper-builder-dvorkinguy-claude-skills-agents

SKILL.md

Apify Scraper Builder

Build production-ready Apify Actors using Node.js/TypeScript and Crawlee.

Crawler Type Decision Tree

| Scenario | Crawler | Why |
| --- | --- | --- |
| Static HTML, no JavaScript | CheerioCrawler | Fastest, lowest memory |
| JavaScript-rendered content | PlaywrightCrawler | Modern, cross-browser |
| Legacy sites, specific Chrome behavior | PuppeteerCrawler | Chrome-specific features |
| Both static and JS-rendered pages | PlaywrightCrawler | More versatile |
| High-volume scraping (1000s of pages) | CheerioCrawler | Best performance |
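
All Crawlee crawlers share the same option and handler shape, so switching types later is usually a small change. A minimal sketch of the swap (the useBrowser flag and the handlers are illustrative, not part of any template here):

typescript
import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// Flip this when the target site turns out to need JavaScript rendering.
const useBrowser = false;

const crawler = useBrowser
    ? new PlaywrightCrawler({
          // Browser context: a Playwright `page` is available.
          async requestHandler({ page, pushData }) {
              await pushData({ title: await page.title() });
          },
      })
    : new CheerioCrawler({
          // HTTP context: a Cheerio `$` handle over the raw HTML.
          async requestHandler({ $, pushData }) {
              await pushData({ title: $('title').text() });
          },
      });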

Actor Creation Workflow

Step 1: Initialize Project

bash
python scripts/init_actor.py my-scraper --type cheerio

Or create the structure manually:

my-scraper/
├── .actor/
│   ├── actor.json           # REQUIRED
│   ├── input_schema.json    # Recommended
│   └── Dockerfile           # REQUIRED
├── src/
│   └── main.ts              # Entry point
├── package.json
└── tsconfig.json

Step 2: Configure actor.json

json
{
    "actorSpecification": 1,
    "name": "my-scraper",
    "version": "0.0",
    "buildTag": "latest",
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}

Step 3: Define Input Schema

bash
python scripts/generate_input_schema.py "Scrape product pages with URLs, max items limit, and proxy support"

Or use templates from references/input-schema-guide.md

Step 4: Implement Crawler

Use patterns from references/crawlee-patterns.md

Step 5: Validate Configuration

bash
python scripts/validate_actor.py /path/to/actor

Step 6: Deploy

bash
apify login
apify push

Project Structure

Required Files

.actor/actor.json

json
{
    "actorSpecification": 1,
    "name": "my-scraper",
    "version": "0.0",
    "buildTag": "latest",
    "minMemoryMbytes": 256,
    "maxMemoryMbytes": 4096,
    "dockerfile": "./Dockerfile",
    "input": "./input_schema.json",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}
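
The storages entry above references a dataset schema file not shown in this skill; a minimal sketch of .actor/dataset_schema.json, assuming the url/title/price fields produced by the crawler examples below:

json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {
                "fields": ["url", "title", "price"]
            },
            "display": {
                "component": "table",
                "properties": {
                    "url": { "label": "URL", "format": "link" },
                    "title": { "label": "Title", "format": "text" },
                    "price": { "label": "Price", "format": "text" }
                }
            }
        }
    }
}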

.actor/Dockerfile (Node.js)

dockerfile
FROM apify/actor-node:20

COPY package*.json ./

# Install all dependencies; dev dependencies are needed to compile TypeScript.
RUN npm --quiet set progress=false \
    && npm install --include=dev \
    && echo "Installed NPM packages:" \
    && npm list || true \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version

COPY . ./

# Compile src/main.ts to dist/main.js so `npm start` has something to run.
RUN npm run build

CMD npm start

package.json

json
{
    "name": "my-scraper",
    "version": "0.0.1",
    "type": "module",
    "main": "dist/main.js",
    "scripts": {
        "start": "node dist/main.js",
        "build": "tsc"
    },
    "dependencies": {
        "apify": "^3.0.0",
        "crawlee": "^3.0.0"
    },
    "devDependencies": {
        "typescript": "^5.0.0"
    }
}
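
The project structure also lists tsconfig.json, which is not shown; a minimal sketch compatible with "type": "module" and the dist/main.js entry point above:

json
{
    "compilerOptions": {
        "target": "ES2022",
        "module": "NodeNext",
        "moduleResolution": "NodeNext",
        "outDir": "dist",
        "strict": true,
        "skipLibCheck": true
    },
    "include": ["src"]
}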

Input Schema Editors

| Editor | Use Case | Example |
| --- | --- | --- |
| textfield | Single-line text | Name, URL |
| textarea | Multi-line text | CSS selectors, notes |
| requestListSources | URL list with labels | Start URLs |
| proxy | Proxy configuration | Apify Proxy settings |
| json | JSON object/array | Custom configuration |
| select | Dropdown options | Country, category |
| checkbox | Boolean toggle | Debug mode |
| number | Integer/float | Max items, delay |
| datepicker | Date selection | Date range filter |
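
For example, a select field pairs machine values with human-readable labels via enum and enumTitles; a sketch (the field name and values are illustrative):

json
{
    "country": {
        "title": "Country",
        "type": "string",
        "description": "Marketplace country to scrape",
        "editor": "select",
        "enum": ["us", "de", "fr"],
        "enumTitles": ["United States", "Germany", "France"],
        "default": "us"
    }
}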

Common Input Schema Pattern

json
{
    "title": "Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start scraping from",
            "editor": "requestListSources",
            "prefill": [{"url": "https://example.com"}]
        },
        "maxItems": {
            "title": "Max Items",
            "type": "integer",
            "description": "Maximum number of items to scrape",
            "default": 100,
            "minimum": 1
        },
        "proxyConfig": {
            "title": "Proxy Configuration",
            "type": "object",
            "description": "Proxy settings for the scraper",
            "editor": "proxy",
            "default": {"useApifyProxy": true}
        }
    },
    "required": ["startUrls"]
}

Crawlee Patterns

CheerioCrawler (Fast HTML Parsing)

typescript
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
}>();

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('h1').text().trim();
        const price = $('.price').text().trim();

        await Dataset.pushData({
            url: request.url,
            title,
            price,
        });

        // Enqueue pagination links
        await enqueueLinks({
            selector: 'a.next-page',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();
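
A common extension of this pattern is routing list pages and detail pages through separate labeled handlers; a sketch using Crawlee's createCheerioRouter (the selectors are illustrative):

typescript
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Default handler: list pages. Enqueue detail links and pagination.
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.product-link', label: 'DETAIL' });
    await enqueueLinks({ selector: 'a.next-page' });
});

// Labeled handler: detail pages. Extract and store a single item.
router.addHandler('DETAIL', async ({ request, $, pushData }) => {
    await pushData({
        url: request.url,
        title: $('h1').text().trim(),
        price: $('.price').text().trim(),
    });
});

// The router replaces an inline requestHandler.
const crawler = new CheerioCrawler({ requestHandler: router });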

PlaywrightCrawler (JavaScript Rendering)

typescript
import { Actor } from 'apify';
import { PlaywrightCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
    proxyConfig?: { useApifyProxy: boolean };
}>();

const proxyConfiguration = await Actor.createProxyConfiguration(
    input?.proxyConfig
);

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ page, request, enqueueLinks }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-list');

        const products = await page.$$eval('.product', items =>
            items.map(item => ({
                title: item.querySelector('h2')?.textContent?.trim(),
                price: item.querySelector('.price')?.textContent?.trim(),
            }))
        );

        for (const product of products) {
            await Dataset.pushData({
                url: request.url,
                ...product,
            });
        }

        await enqueueLinks({
            selector: 'a.pagination',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();
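
Browser crawls spend much of their bandwidth on images and fonts; blocking those requests in a pre-navigation hook is a common speedup. A sketch (the blocked extensions are a judgment call, not a fixed list):

typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Abort requests for heavy static assets before navigating.
            await page.route('**/*.{png,jpg,jpeg,gif,svg,woff,woff2}', (route) =>
                route.abort(),
            );
        },
    ],
    async requestHandler({ page, pushData }) {
        await pushData({ title: await page.title() });
    },
});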

PuppeteerCrawler (Chrome-specific)

typescript
import { Actor } from 'apify';
import { PuppeteerCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
}>();

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ page, request }) {
        await page.waitForSelector('.content');

        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            content: document.querySelector('.content')?.innerHTML,
        }));

        await Dataset.pushData({
            url: request.url,
            ...data,
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();

Scripts

Initialize New Actor

bash
python scripts/init_actor.py <name> --type <cheerio|playwright|puppeteer> [--path <dir>]

Validate Actor Configuration

bash
python scripts/validate_actor.py <actor-path>

Generate Input Schema

bash
python scripts/generate_input_schema.py "<description>" [--output <path>]

Deployment Commands

bash
# Install Apify CLI
npm install -g @apify/cli

# Login to Apify
apify login

# Create new Actor from template (interactive)
apify create my-actor

# Run Actor locally
apify run --purge

# Push to Apify platform
apify push

# Build Actor remotely
apify actors build

# Call Actor remotely
apify actors call <actor-id>

# Pull Actor code from Apify
apify actors pull <actor-id>
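
Runs can also be triggered programmatically with the apify-client package instead of the CLI; a minimal sketch (the Actor ID and input fields are placeholders):

typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
const run = await client.actor('username/my-scraper').call({
    startUrls: [{ url: 'https://example.com' }],
    maxItems: 50,
});

// Read the scraped items from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} items`);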

Validation Checklist

Before Building

  • Correct crawler type selected for target site
  • Input schema defines all required parameters
  • Dependencies in package.json are correct

Configuration

  • actor.json has actorSpecification: 1
  • actor.json has valid name and version
  • Dockerfile uses correct Node.js base image
  • Input schema editors match field types

Code Quality

  • Error handling for network failures
  • Proxy configuration used for production
  • Rate limiting/delays configured
  • Data validation before pushData (see the sketch after this checklist)

Pre-Deployment

  • apify run --purge succeeds locally
  • Output data structure is correct
  • Memory limits are appropriate
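
The Code Quality items above map directly onto Crawlee crawler options; a minimal sketch of wiring them together (the limits shown are illustrative, not recommendations from this skill):

typescript
import { CheerioCrawler, log } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,           // retry transient network failures
    maxConcurrency: 10,             // cap parallelism as a crude rate limit
    requestHandlerTimeoutSecs: 60,
    async requestHandler({ request, $, pushData }) {
        const title = $('h1').text().trim();
        // Validate before pushing: skip records missing required fields.
        if (!title) {
            log.warning(`Skipping ${request.url}: no title found`);
            return;
        }
        await pushData({ url: request.url, title });
    },
    // Called once a request has exhausted all retries.
    failedRequestHandler({ request }) {
        log.error(`Request ${request.url} failed too many times.`);
    },
});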

References

| Topic | File |
| --- | --- |
| actor.json specification | references/actor-json-spec.md |
| Input schema editors | references/input-schema-guide.md |
| Crawlee patterns | references/crawlee-patterns.md |

Templates

| Template | Description | Path |
| --- | --- | --- |
| Cheerio | Fast HTML scraping | templates/crawlee-cheerio/ |
| Playwright | JS-rendered content | templates/crawlee-playwright/ |
| Puppeteer | Chrome-specific | templates/crawlee-puppeteer/ |
