Agent skill

apify-scraper-builder

Build Apify Actors (web scrapers) using Node.js and Crawlee. Use when creating new scrapers, defining input schemas, configuring Dockerfiles, or deploying to Apify. Triggers include apify, actor, scraper, crawlee, web scraping, data extraction.

Install this agent skill in your project:

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/apify-scraper-builder-dvorkinguy-claude-skills-agents

SKILL.md

Apify Scraper Builder

Build production-ready Apify Actors using Node.js/TypeScript and Crawlee.

Crawler Type Decision Tree

| Scenario | Crawler | Why |
| --- | --- | --- |
| Static HTML, no JavaScript | CheerioCrawler | Fastest, lowest memory |
| JavaScript-rendered content | PlaywrightCrawler | Modern, cross-browser |
| Legacy sites, specific Chrome behavior | PuppeteerCrawler | Chrome-specific features |
| Both static and JS-rendered pages | PlaywrightCrawler | More versatile |
| High-volume scraping (1000s of pages) | CheerioCrawler | Best performance |
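
All Crawlee crawlers share the same option and handler shape, so switching types later is usually a small change. A minimal sketch of the swap (the useBrowser flag and the handlers are illustrative, not part of any template here):

typescript
import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// Flip this when the target site turns out to need JavaScript rendering.
const useBrowser = false;

const crawler = useBrowser
    ? new PlaywrightCrawler({
          // Browser context: a Playwright `page` is available.
          async requestHandler({ page, pushData }) {
              await pushData({ title: await page.title() });
          },
      })
    : new CheerioCrawler({
          // HTTP context: a Cheerio `$` handle over the raw HTML.
          async requestHandler({ $, pushData }) {
              await pushData({ title: $('title').text() });
          },
      });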

Actor Creation Workflow

Step 1: Initialize Project

bash
python scripts/init_actor.py my-scraper --type cheerio

Or create the structure manually:

my-scraper/
├── .actor/
│   ├── actor.json           # REQUIRED
│   ├── input_schema.json    # Recommended
│   └── Dockerfile           # REQUIRED
├── src/
│   └── main.ts              # Entry point
├── package.json
└── tsconfig.json

Step 2: Configure actor.json

json
{
    "actorSpecification": 1,
    "name": "my-scraper",
    "version": "0.0",
    "buildTag": "latest",
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}

Step 3: Define Input Schema

bash
python scripts/generate_input_schema.py "Scrape product pages with URLs, max items limit, and proxy support"

Or use templates from references/input-schema-guide.md

Step 4: Implement Crawler

Use patterns from references/crawlee-patterns.md

Step 5: Validate Configuration

bash
python scripts/validate_actor.py /path/to/actor

Step 6: Deploy

bash
apify login
apify push

Project Structure

Required Files

.actor/actor.json

json
{
    "actorSpecification": 1,
    "name": "my-scraper",
    "version": "0.0",
    "buildTag": "latest",
    "minMemoryMbytes": 256,
    "maxMemoryMbytes": 4096,
    "dockerfile": "./Dockerfile",
    "input": "./input_schema.json",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}
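
The storages entry above references a dataset schema file not shown in this skill; a minimal sketch of .actor/dataset_schema.json, assuming the url/title/price fields produced by the crawler examples below:

json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {
                "fields": ["url", "title", "price"]
            },
            "display": {
                "component": "table",
                "properties": {
                    "url": { "label": "URL", "format": "link" },
                    "title": { "label": "Title", "format": "text" },
                    "price": { "label": "Price", "format": "text" }
                }
            }
        }
    }
}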

.actor/Dockerfile (Node.js)

dockerfile
FROM apify/actor-node:20

COPY package*.json ./

# Install all dependencies; dev dependencies are needed to compile TypeScript.
RUN npm --quiet set progress=false \
    && npm install --include=dev \
    && echo "Installed NPM packages:" \
    && npm list || true \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version

COPY . ./

# Compile src/main.ts to dist/main.js so `npm start` has something to run.
RUN npm run build

CMD npm start

package.json

json
{
    "name": "my-scraper",
    "version": "0.0.1",
    "type": "module",
    "main": "dist/main.js",
    "scripts": {
        "start": "node dist/main.js",
        "build": "tsc"
    },
    "dependencies": {
        "apify": "^3.0.0",
        "crawlee": "^3.0.0"
    },
    "devDependencies": {
        "typescript": "^5.0.0"
    }
}
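
The project structure also lists tsconfig.json, which is not shown; a minimal sketch compatible with "type": "module" and the dist/main.js entry point above:

json
{
    "compilerOptions": {
        "target": "ES2022",
        "module": "NodeNext",
        "moduleResolution": "NodeNext",
        "outDir": "dist",
        "strict": true,
        "skipLibCheck": true
    },
    "include": ["src"]
}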

Input Schema Editors

| Editor | Use Case | Example |
| --- | --- | --- |
| textfield | Single-line text | Name, URL |
| textarea | Multi-line text | CSS selectors, notes |
| requestListSources | URL list with labels | Start URLs |
| proxy | Proxy configuration | Apify Proxy settings |
| json | JSON object/array | Custom configuration |
| select | Dropdown options | Country, category |
| checkbox | Boolean toggle | Debug mode |
| number | Integer/float | Max items, delay |
| datepicker | Date selection | Date range filter |
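
For example, a select field pairs machine values with human-readable labels via enum and enumTitles; a sketch (the field name and values are illustrative):

json
{
    "country": {
        "title": "Country",
        "type": "string",
        "description": "Marketplace country to scrape",
        "editor": "select",
        "enum": ["us", "de", "fr"],
        "enumTitles": ["United States", "Germany", "France"],
        "default": "us"
    }
}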

Common Input Schema Pattern

json
{
    "title": "Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start scraping from",
            "editor": "requestListSources",
            "prefill": [{"url": "https://example.com"}]
        },
        "maxItems": {
            "title": "Max Items",
            "type": "integer",
            "description": "Maximum number of items to scrape",
            "default": 100,
            "minimum": 1
        },
        "proxyConfig": {
            "title": "Proxy Configuration",
            "type": "object",
            "description": "Proxy settings for the scraper",
            "editor": "proxy",
            "default": {"useApifyProxy": true}
        }
    },
    "required": ["startUrls"]
}

Crawlee Patterns

CheerioCrawler (Fast HTML Parsing)

typescript
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
}>();

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('h1').text().trim();
        const price = $('.price').text().trim();

        await Dataset.pushData({
            url: request.url,
            title,
            price,
        });

        // Enqueue pagination links
        await enqueueLinks({
            selector: 'a.next-page',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();
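
A common extension of this pattern is routing list pages and detail pages through separate labeled handlers; a sketch using Crawlee's createCheerioRouter (the selectors are illustrative):

typescript
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Default handler: list pages. Enqueue detail links and pagination.
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.product-link', label: 'DETAIL' });
    await enqueueLinks({ selector: 'a.next-page' });
});

// Labeled handler: detail pages. Extract and store a single item.
router.addHandler('DETAIL', async ({ request, $, pushData }) => {
    await pushData({
        url: request.url,
        title: $('h1').text().trim(),
        price: $('.price').text().trim(),
    });
});

// The router replaces an inline requestHandler.
const crawler = new CheerioCrawler({ requestHandler: router });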

PlaywrightCrawler (JavaScript Rendering)

typescript
import { Actor } from 'apify';
import { PlaywrightCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
    proxyConfig?: { useApifyProxy: boolean };
}>();

const proxyConfiguration = await Actor.createProxyConfiguration(
    input?.proxyConfig
);

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ page, request, enqueueLinks }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-list');

        const products = await page.$$eval('.product', items =>
            items.map(item => ({
                title: item.querySelector('h2')?.textContent?.trim(),
                price: item.querySelector('.price')?.textContent?.trim(),
            }))
        );

        for (const product of products) {
            await Dataset.pushData({
                url: request.url,
                ...product,
            });
        }

        await enqueueLinks({
            selector: 'a.pagination',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();
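
Browser crawls spend much of their bandwidth on images and fonts; blocking those requests in a pre-navigation hook is a common speedup. A sketch (the blocked extensions are a judgment call, not a fixed list):

typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Abort requests for heavy static assets before navigating.
            await page.route('**/*.{png,jpg,jpeg,gif,svg,woff,woff2}', (route) =>
                route.abort(),
            );
        },
    ],
    async requestHandler({ page, pushData }) {
        await pushData({ title: await page.title() });
    },
});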

PuppeteerCrawler (Chrome-specific)

typescript
import { Actor } from 'apify';
import { PuppeteerCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
}>();

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ page, request }) {
        await page.waitForSelector('.content');

        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            content: document.querySelector('.content')?.innerHTML,
        }));

        await Dataset.pushData({
            url: request.url,
            ...data,
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();

Scripts

Initialize New Actor

bash
python scripts/init_actor.py <name> --type <cheerio|playwright|puppeteer> [--path <dir>]

Validate Actor Configuration

bash
python scripts/validate_actor.py <actor-path>

Generate Input Schema

bash
python scripts/generate_input_schema.py "<description>" [--output <path>]

Deployment Commands

bash
# Install Apify CLI
npm install -g @apify/cli

# Login to Apify
apify login

# Create new Actor from template (interactive)
apify create my-actor

# Run Actor locally
apify run --purge

# Push to Apify platform
apify push

# Build Actor remotely
apify actors build

# Call Actor remotely
apify actors call <actor-id>

# Pull Actor code from Apify
apify actors pull <actor-id>
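
Runs can also be triggered programmatically with the apify-client package instead of the CLI; a minimal sketch (the Actor ID and input fields are placeholders):

typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
const run = await client.actor('username/my-scraper').call({
    startUrls: [{ url: 'https://example.com' }],
    maxItems: 50,
});

// Read the scraped items from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} items`);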

Validation Checklist

Before Building

  • Correct crawler type selected for target site
  • Input schema defines all required parameters
  • Dependencies in package.json are correct

Configuration

  • actor.json has actorSpecification: 1
  • actor.json has valid name and version
  • Dockerfile uses correct Node.js base image
  • Input schema editors match field types

Code Quality

  • Error handling for network failures
  • Proxy configuration used for production
  • Rate limiting/delays configured
  • Data validation before pushData (see the sketch after this checklist)

Pre-Deployment

  • apify run --purge succeeds locally
  • Output data structure is correct
  • Memory limits are appropriate
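
The Code Quality items above map directly onto Crawlee crawler options; a minimal sketch of wiring them together (the limits shown are illustrative, not recommendations from this skill):

typescript
import { CheerioCrawler, log } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,           // retry transient network failures
    maxConcurrency: 10,             // cap parallelism as a crude rate limit
    requestHandlerTimeoutSecs: 60,
    async requestHandler({ request, $, pushData }) {
        const title = $('h1').text().trim();
        // Validate before pushing: skip records missing required fields.
        if (!title) {
            log.warning(`Skipping ${request.url}: no title found`);
            return;
        }
        await pushData({ url: request.url, title });
    },
    // Called once a request has exhausted all retries.
    failedRequestHandler({ request }) {
        log.error(`Request ${request.url} failed too many times.`);
    },
});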

References

| Topic | File |
| --- | --- |
| actor.json specification | references/actor-json-spec.md |
| Input schema editors | references/input-schema-guide.md |
| Crawlee patterns | references/crawlee-patterns.md |

Templates

| Template | Description | Path |
| --- | --- | --- |
| Cheerio | Fast HTML scraping | templates/crawlee-cheerio/ |
| Playwright | JS-rendered content | templates/crawlee-playwright/ |
| Puppeteer | Chrome-specific | templates/crawlee-puppeteer/ |
