Agent skill

document-indexing

Extract structured metadata from documents using AI. Classify content types, extract topics and tools. Supports async batch processing.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/document-indexing

SKILL.md

Document Indexing

Overview

Extract structured metadata from fetched documents using an LLM:

Content type: blog, tutorial, guide, reference, etc.
Topics & Tools: Main subjects and technologies
Structure: Code examples, procedures, narrative

Creates DocumentMetadata records for search and clustering.

Quick Start

bash

# Index single document
kurt index 5494cc13

# Batch index (async, 5-10x faster)
kurt index --url-prefix https://example.com/

# Re-index with custom concurrency
kurt index --url-prefix https://example.com/ --force --max-concurrent 10

Prerequisite: documents must be in FETCHED status (run kurt content fetch first).

Commands

bash

# Single
kurt index <doc-id>
kurt index <doc-id> --force

# Batch (async parallel)
kurt index --url-prefix <url>
kurt index --url-contains <string>
kurt index --max-concurrent 10     # Default: 5

# Filters
kurt index --status FETCHED --url-prefix <url>
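
When these commands are driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags listed above; the helper name `build_index_command` and its parameters are hypothetical and not part of kurt.

```python
from typing import List, Optional


def build_index_command(
    doc_id: Optional[str] = None,
    status: Optional[str] = None,
    url_prefix: Optional[str] = None,
    url_contains: Optional[str] = None,
    force: bool = False,
    max_concurrent: Optional[int] = None,
) -> List[str]:
    """Assemble a `kurt index` argv list (hypothetical helper, not a kurt API)."""
    if max_concurrent is not None and max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")

    cmd = ["kurt", "index"]
    if doc_id is not None:
        cmd.append(doc_id)
    if status is not None:
        cmd += ["--status", status]
    if url_prefix is not None:
        cmd += ["--url-prefix", url_prefix]
    if url_contains is not None:
        cmd += ["--url-contains", url_contains]
    if force:
        cmd.append("--force")
    if max_concurrent is not None:
        cmd += ["--max-concurrent", str(max_concurrent)]
    return cmd


batch_cmd = build_index_command(
    url_prefix="https://example.com/", force=True, max_concurrent=10
)
print(" ".join(batch_cmd))
```

The resulting list can be passed to `subprocess.run(batch_cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with URLs.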

Content Types

Extracted Metadata

json

{
  "content_type": "TUTORIAL",
  "extracted_title": "Machine Learning Guide",
  "primary_topics": ["Machine Learning", "Python"],
  "tools_technologies": ["TensorFlow", "Pandas"],
  "has_code_examples": true,
  "has_step_by_step_procedures": true,
  "has_narrative_structure": false
}
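
For downstream code that consumes this JSON, a typed container makes the fields easier to work with. The sketch below is an assumption-laden illustration: the class `ExtractedMetadata` and the function `parse_metadata` are hypothetical names that mirror the fields shown above, and they are not kurt's actual DocumentMetadata model.

```python
import json
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ExtractedMetadata:
    """Hypothetical container mirroring the JSON fields shown above."""

    content_type: str
    extracted_title: str
    primary_topics: List[str] = field(default_factory=list)
    tools_technologies: List[str] = field(default_factory=list)
    has_code_examples: bool = False
    has_step_by_step_procedures: bool = False
    has_narrative_structure: bool = False


def parse_metadata(raw: str) -> ExtractedMetadata:
    """Parse a JSON string, ignoring any keys the dataclass does not know."""
    data = json.loads(raw)
    known = {f.name for f in fields(ExtractedMetadata)}
    return ExtractedMetadata(**{k: v for k, v in data.items() if k in known})


sample = """
{
  "content_type": "TUTORIAL",
  "extracted_title": "Machine Learning Guide",
  "primary_topics": ["Machine Learning", "Python"],
  "tools_technologies": ["TensorFlow", "Pandas"],
  "has_code_examples": true,
  "has_step_by_step_procedures": true,
  "has_narrative_structure": false
}
"""

meta = parse_metadata(sample)
print(meta.content_type, meta.tools_technologies)
```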

Performance

Sequential: ~3-5s per document
Parallel (5 concurrent): ~1s per document avg
Example: 92 docs in 30s (parallel) vs 5 mins (sequential)
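
The parallel figure follows from simple arithmetic: if each call takes 3-5s and 5 calls overlap fully, the wall-clock cost averages roughly 0.6-1s per document. The sketch below illustrates the general bounded-concurrency pattern with `asyncio.Semaphore`; it is not kurt's implementation, and `fake_index` is a hypothetical stand-in that only sleeps.

```python
import asyncio
from typing import Dict, List, Tuple


async def fake_index(doc_id: str, seconds: float) -> str:
    """Stand-in for one LLM extraction call; it only sleeps."""
    await asyncio.sleep(seconds)
    return doc_id


async def index_all(
    doc_ids: List[str], max_concurrent: int = 5, seconds: float = 0.05
) -> Tuple[List[str], int]:
    """Run fake_index over doc_ids with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    stats: Dict[str, int] = {"active": 0, "peak": 0}

    async def bounded(doc_id: str) -> str:
        async with semaphore:
            # Track how many calls are running to confirm the limit holds.
            stats["active"] += 1
            stats["peak"] = max(stats["peak"], stats["active"])
            try:
                return await fake_index(doc_id, seconds)
            finally:
                stats["active"] -= 1

    # gather returns results in the same order as the input ids.
    results = await asyncio.gather(*(bounded(d) for d in doc_ids))
    return list(results), stats["peak"]


doc_ids = [f"doc-{i}" for i in range(10)]
results, peak = asyncio.run(index_all(doc_ids, max_concurrent=5))
print(f"indexed {len(results)} documents, peak concurrency {peak}")
```

Raising the limit shortens wall-clock time only until the upstream LLM provider starts throttling, which is why the concurrency is configurable.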

Python API

python

import asyncio

from kurt.indexing import extract_document_metadata, batch_extract_document_metadata

# Single document (synchronous call)
result = extract_document_metadata("abc-123")

# Batch: the coroutine is driven by asyncio.run,
# with up to max_concurrent extractions at a time
results = asyncio.run(batch_extract_document_metadata(
    ["abc-123", "def-456"],
    max_concurrent=5
))

Troubleshooting

Issue	Solution
"Document not FETCHED"	Run `kurt content fetch <id>` first
"Content file not found"	Re-fetch document
Slow batch	Increase `--max-concurrent`
Rate limits	Reduce `--max-concurrent`
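
The last two rows pull in opposite directions, so one option is to start high and step down when throttled. The following is a minimal sketch of that idea, assuming the batch call signals throttling with an exception. `RateLimitError`, `run_with_backoff`, and `fake_batch` are hypothetical names for illustration; the exception kurt actually raises is not documented here.

```python
import asyncio
from typing import Awaitable, Callable, List


class RateLimitError(Exception):
    """Hypothetical error type used only for this sketch."""


BatchFn = Callable[[List[str], int], Awaitable[List[str]]]


async def run_with_backoff(
    batch_fn: BatchFn,
    doc_ids: List[str],
    max_concurrent: int = 10,
    min_concurrent: int = 1,
) -> List[str]:
    """Retry the whole batch, halving concurrency after each rate-limit error."""
    current = max_concurrent
    while True:
        try:
            return await batch_fn(doc_ids, current)
        except RateLimitError:
            if current <= min_concurrent:
                raise  # Already at the floor; give up.
            current = max(min_concurrent, current // 2)


attempts: List[int] = []


async def fake_batch(doc_ids: List[str], max_concurrent: int) -> List[str]:
    """Pretend provider that throttles anything above 3 concurrent calls."""
    attempts.append(max_concurrent)
    if max_concurrent > 3:
        raise RateLimitError("too many concurrent requests")
    return doc_ids


done = asyncio.run(run_with_backoff(fake_batch, ["abc-123", "def-456"]))
print(attempts)  # → [10, 5, 2]
```

Note that this sketch retries the entire batch on each failure. A real wrapper would more likely retry only the documents that were not indexed.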

Next Steps

ingest-content-skill - Fetch documents first
document-management-skill - Query and manage documents

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/document-indexing
License: MIT License
