Agent skill

ingest-sources

Process multiple source documents with Extract-Then-Aggregate discipline. Use when the user shares multiple transcripts, emails, or documents for batch processing.

Install this agent skill to your Project

npx add-skill https://github.com/kbanc85/claudia/tree/main/template-v2/.claude/skills/ingest-sources

SKILL.md

Ingest Sources

Process multiple source documents (transcripts, emails, documents) using Extract-Then-Aggregate discipline to ensure no entity with dedicated sources gets lost.

Trigger

"Process these transcripts"
"Here are my notes from [event]"
Multiple files shared in sequence
"Here's everything about [topic]"
Folder path provided with multiple files
/ingest-sources

Why This Skill Exists

When processing many sources, the typical failure mode is jumping straight to aggregation, which misses entities that have dedicated sources but are not prominent in high-traffic threads. A person with two transcripts dedicated to them can get lost if they are rarely mentioned in emails.

The discipline: Inventory before processing, extraction before synthesis.

Input

User provides one of:

Folder path containing multiple files
List of file paths
Multiple documents pasted in sequence
Reference to previously shared content

The Five-Phase Workflow

Phase 1: Inventory

Before reading any content, create a manifest of all sources:

| # | Filename | Type | Date | Size | Likely Entities |
|---|----------|------|------|------|-----------------|
| 1 | call-with-sarah.md | transcript | 2026-01-15 | 4.2KB | Sarah Chen |
| 2 | jim-partnership-email.md | email | 2026-01-16 | 1.8KB | Jim Ferry |
| 3 | acme-contract.pdf | document | 2026-01-17 | 52KB | Acme Corp |
...

**Summary:**
- Total: 36 sources
- Date range: Jan 15 - Feb 1
- Types: 28 transcripts, 5 emails, 3 documents

Show inventory to user before proceeding. This prevents partial processing.
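A minimal sketch of the inventory step, assuming the sources sit in a single folder. It builds the manifest from filesystem metadata only, so no content is read yet. The `guess_type` heuristic and the `InventoryRow` fields are hypothetical and only illustrate the idea; the "Likely Entities" column is left out because filling it reliably needs judgment.

```python
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Dict, List


@dataclass
class InventoryRow:
    number: int
    filename: str
    source_type: str
    modified: date
    size_bytes: int


def guess_type(filename: str) -> str:
    """Hypothetical filename-only heuristic; a real run may need to correct it."""
    lower = filename.lower()
    if "email" in lower:
        return "email"
    if lower.endswith(".pdf"):
        return "document"
    if lower.endswith((".md", ".txt")):
        return "transcript"
    return "unknown"


def build_inventory(folder: Path) -> List[InventoryRow]:
    """List every file in `folder` using metadata only (file contents are not opened)."""
    files = sorted(p for p in folder.iterdir() if p.is_file())
    rows = []
    for number, path in enumerate(files, start=1):
        stat = path.stat()
        rows.append(InventoryRow(
            number=number,
            filename=path.name,
            source_type=guess_type(path.name),
            modified=datetime.fromtimestamp(stat.st_mtime).date(),
            size_bytes=stat.st_size,
        ))
    return rows


def summarize(rows: List[InventoryRow]) -> Dict[str, object]:
    """Totals for the summary shown under the manifest table."""
    counts: Dict[str, int] = {}
    for row in rows:
        counts[row.source_type] = counts.get(row.source_type, 0) + 1
    return {"total": len(rows), "types": counts}
```

The row count from `build_inventory` is the number that every later phase should reconcile against.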

Phase 2: File-Then-Extract (Per Document)

CRITICAL: For each document, file it BEFORE extracting. This ensures provenance.

For each source in inventory:
    1. READ the full content
    2. CALL `claudia memory document store --project-dir "$PWD"` immediately (do not skip!)
    3. THEN extract entities/facts/commitments

Process each document systematically. Use IngestService (via local Ollama) when available, or extract directly.
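The ordering constraint above can be sketched as a small loop. This is an illustration only: `store` and `extract` are hypothetical callables standing in for the document-store command and the extraction step, and nothing here reflects how IngestService is actually invoked.

```python
from typing import Callable, Dict, List, Tuple


def file_then_extract(
    sources: Dict[str, str],
    store: Callable[[str, str], None],
    extract: Callable[[str], dict],
) -> Tuple[Dict[str, dict], List[str]]:
    """Process sources in order, always filing a document before extracting from it.

    `sources` maps filename -> full content that has already been read.
    """
    records: Dict[str, dict] = {}
    progress: List[str] = []
    total = len(sources)
    for done, (name, content) in enumerate(sources.items(), start=1):
        store(name, content)              # file first, so provenance exists
        records[name] = extract(content)  # only then extract
        progress.append(f"{done}/{total} ({round(100 * done / total)}%)")
    return records, progress
```

Because `store` runs first, an exception raised while filing stops the loop before any extraction from an unfiled document can happen.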

Auto-detect source type:

.md, .txt with participant names → meeting mode
Email headers detected → email mode
.pdf or formal structure → document mode
Mixed content → general mode
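One way the detection rules above might look in code. The regular expressions and the "at least two matches" thresholds are assumptions chosen for illustration, not rules taken from the skill. Email headers are checked first because a line such as `From: ...` would otherwise also look like a speaker label.

```python
import re

# Assumed patterns: common email header names, and "Name: utterance" speaker lines.
EMAIL_HEADER = re.compile(r"^(?:From|To|Subject|Date):\s", re.MULTILINE)
SPEAKER_LINE = re.compile(r"^[A-Z][\w .'-]{0,40}:\s", re.MULTILINE)


def detect_mode(filename: str, text: str) -> str:
    """Return one of 'meeting', 'email', 'document', 'general'."""
    suffix = filename.lower().rsplit(".", 1)[-1] if "." in filename else ""
    if len(EMAIL_HEADER.findall(text)) >= 2:
        return "email"
    if suffix == "pdf":
        return "document"
    if suffix in {"md", "txt"} and len(SPEAKER_LINE.findall(text)) >= 2:
        return "meeting"
    return "general"
```

Anything the heuristics cannot classify falls through to general mode, which matches the "mixed content" rule.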

Extraction schema per document:

Source #1: call-with-sarah.md
├── entities[]
│   ├── name: "Sarah Chen"
│   ├── type: person
│   ├── mention_count: 47
│   └── first_context: "Product lead at Acme Corp"
├── facts[]
│   ├── content: "Sarah prefers async communication"
│   ├── about: ["Sarah Chen"]
│   └── importance: 0.7
├── commitments[]
│   ├── content: "Send proposal by Friday"
│   ├── who: "user"
│   ├── to: "Sarah Chen"
│   └── deadline: "2026-02-07"
├── relationships[]
│   ├── source: "Sarah Chen"
│   ├── target: "Acme Corp"
│   └── relationship: "works_at"
└── dedicated_to: "Sarah Chen"  ← CRITICAL: This source is primarily ABOUT Sarah

Progress tracking:

Extracting: [========>   ] 28/36 (78%)

The dedicated_to field is essential. If a source is primarily about a specific entity (not just mentioning them), mark it. This prevents the "missing entity" problem.
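For reference, the schema above can be written down as plain data classes. The field names follow the tree shown earlier; the class names and the `source_number`/`filename` fields are illustrative choices, not part of the skill.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    name: str
    type: str
    mention_count: int
    first_context: str


@dataclass
class Fact:
    content: str
    about: List[str]
    importance: float


@dataclass
class Commitment:
    content: str
    who: str
    to: str
    deadline: str  # ISO date string, as in the schema above


@dataclass
class Relationship:
    source: str
    target: str
    relationship: str


@dataclass
class ExtractionRecord:
    source_number: int
    filename: str
    entities: List[Entity] = field(default_factory=list)
    facts: List[Fact] = field(default_factory=list)
    commitments: List[Commitment] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)
    dedicated_to: Optional[str] = None  # set only when the source is primarily about one entity


record = ExtractionRecord(
    source_number=1,
    filename="call-with-sarah.md",
    entities=[Entity("Sarah Chen", "person", 47, "Product lead at Acme Corp")],
    facts=[Fact("Sarah prefers async communication", ["Sarah Chen"], 0.7)],
    commitments=[Commitment("Send proposal by Friday", "user", "Sarah Chen", "2026-02-07")],
    relationships=[Relationship("Sarah Chen", "Acme Corp", "works_at")],
    dedicated_to="Sarah Chen",
)
```

Keeping `dedicated_to` optional makes the distinction explicit: a source that merely mentions several entities leaves it as `None`.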

Phase 3: Consolidation

After all extractions complete, merge by entity:

Canonicalize names:

Check existing entity_aliases table for known aliases
Fuzzy match "Sarah" vs "Sarah Chen" vs "S. Chen"
Ask user to confirm ambiguous matches
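A rough sketch of the canonicalization steps, under two assumptions: the known aliases are available as a plain dictionary (standing in for the entity_aliases table), and "fuzzy" matching is approximated by a simple token-and-initial comparison. A real implementation might use a proper string-similarity measure instead.

```python
from typing import Dict, List, Tuple


def _tokens(name: str) -> List[str]:
    return [t.strip(".").lower() for t in name.split()]


def tokens_compatible(short: str, full: str) -> bool:
    """True if each token of `short` equals, or is an initial of, a distinct token of `full`."""
    remaining = _tokens(full)
    for tok in _tokens(short):
        hit = next(
            (r for r in remaining if r == tok or (len(tok) == 1 and r.startswith(tok))),
            None,
        )
        if hit is None:
            return False
        remaining.remove(hit)
    return True


def resolve(name: str, canonical: List[str], aliases: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return (status, candidates).

    status is 'known_alias', 'exact', 'confirm' (one plausible match, ask the user),
    'ambiguous' (several plausible matches, ask the user), or 'new'.
    """
    if name in aliases:
        return "known_alias", [aliases[name]]
    if name in canonical:
        return "exact", [name]
    candidates = [c for c in canonical if tokens_compatible(name, c)]
    if len(candidates) == 1:
        return "confirm", candidates
    if candidates:
        return "ambiguous", candidates
    return "new", []
```

Both `confirm` and `ambiguous` results are meant to be shown to the user rather than merged silently.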

Merge semantically identical facts:

"Sarah prefers Slack" + "Sarah likes async comms" → single fact about communication preference
Keep the more specific version

Track source counts:

Entity: Sarah Chen
├── Dedicated sources: 4 (#1, #5, #12, #18)
├── Total mentions: 12 sources
├── Facts extracted: 8
└── Commitments: 2
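The per-entity counts above can be derived directly from the extraction records. The sketch below assumes each record is a dictionary with `source`, `entities`, and `dedicated_to` keys (names already canonicalized); that shape is an illustration, not a format the skill prescribes.

```python
from collections import defaultdict
from typing import Dict, List


def entity_coverage(records: List[dict]) -> Dict[str, Dict[str, List[int]]]:
    """Map each entity to the source numbers that mention it and that are dedicated to it."""
    coverage = defaultdict(lambda: {"dedicated": [], "mentions": []})
    for rec in records:
        for name in set(rec["entities"]):  # count each source once per entity
            coverage[name]["mentions"].append(rec["source"])
        if rec.get("dedicated_to"):
            coverage[rec["dedicated_to"]]["dedicated"].append(rec["source"])
    return dict(coverage)
```

Keeping the source numbers, not just the counts, is what lets the verification table list which sources back each entity.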

Phase 4: Verification

Before storing anything, verify completeness:

### Entity Coverage

| Entity | Dedicated Sources | Total Mentions | Sources |
|--------|-------------------|----------------|---------|
| Sarah Chen | 4 | 12 | #1, #5, #12, #18, ... |
| Jim Ferry | 2 | 6 | #2, #15, ... |
| Acme Corp | 3 | 8 | #3, #7, #22, ... |
| Project Alpha | 0 | 4 | #4, #8, #11, #19 |

### Dedicated Source Rule

**Any entity with 2+ dedicated sources MUST appear proportionally in the final output.**

If Jim Ferry has 2 transcripts dedicated to him but doesn't show up in the entity coverage summary, that's a verification failure. Stop and investigate.

### Gaps Detected

- Source #14: No entities extracted (may need manual review)
- Source #22: References "the investor" without name

### Completeness Check

Before proceeding:
- [ ] Every dedicated source entity appears in coverage
- [ ] No sources skipped or failed
- [ ] Ambiguous entity names resolved
- [ ] Gaps acknowledged or explained

User must confirm before proceeding to storage. This is the checkpoint that catches the "missing entity" problem.
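A possible mechanical version of the checks above, using the same assumed record shape (`source`, `entities`, `dedicated_to`). It only checks presence: the skill does not define "proportionally" precisely, so that part is left to human review.

```python
from typing import Dict, List, Set


def verify(records: List[dict], summary_entities: Set[str]) -> dict:
    """Compare dedicated sources against the entities that made it into the coverage summary."""
    dedicated: Dict[str, List[int]] = {}
    gaps: List[int] = []
    for rec in records:
        if not rec["entities"]:
            gaps.append(rec["source"])  # nothing extracted: flag for manual review
        target = rec.get("dedicated_to")
        if target:
            dedicated.setdefault(target, []).append(rec["source"])

    # Entities that have dedicated sources but are absent from the summary.
    missing = {name: srcs for name, srcs in dedicated.items() if name not in summary_entities}
    # Dedicated source rule: 2+ dedicated sources and still missing.
    rule_violations = sorted(name for name, srcs in missing.items() if len(srcs) >= 2)
    return {
        "ok": not missing,
        "missing": missing,
        "rule_violations": rule_violations,
        "gaps": gaps,
    }
```

A result with `ok` set to `False` corresponds to the "stop and investigate" case; gaps are reported separately because they need acknowledgement rather than blocking outright.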

Phase 5: Storage

After user confirms verification:

1. Verify all sources filed: Sources were already filed during Phase 2 (File-Then-Extract). Verify the file count matches:

Confirm: [N] sources filed to ~/.claudia/files/

If any sources weren't filed in Phase 2, file them now before proceeding.

Files are auto-routed to entity folders:

people/sarah-chen/transcripts/...
clients/acme-corp/documents/...
projects/alpha/emails/...

2. Create/update entities:

bash

claudia memory batch --project-dir "$PWD" <<'EOF'
[
  { "op": "entity", "name": "Sarah Chen", "type": "person", "description": "Product lead at Acme Corp" },
  { "op": "entity", "name": "Jim Ferry", "type": "person", "description": "Partnership contact" },
  { "op": "entity", "name": "Acme Corp", "type": "organization", "description": "Client company" }
]
EOF

3. Store facts and relationships:

bash

claudia memory batch --project-dir "$PWD" <<'EOF'
[
  { "op": "remember", "content": "Sarah prefers async communication", "about": ["Sarah Chen"], "importance": 0.7 },
  { "op": "relate", "source": "Sarah Chen", "target": "Acme Corp", "relationship": "works_at", "strength": 0.9 }
]
EOF

4. Link provenance:

memory_sources table connects memories → source documents
entity_documents table connects documents → entities

This creates the chain: any fact can trace back to the exact document it came from.
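The batch payloads in steps 2 and 3 can be generated from the consolidated data rather than typed by hand. The sketch below only produces the JSON text, using exactly the operation shapes shown in those steps; it does not call the CLI, and the function names are hypothetical.

```python
import json
from typing import List


def entity_ops(entities: List[dict]) -> List[dict]:
    """Build 'entity' operations (step 2)."""
    return [
        {"op": "entity", "name": e["name"], "type": e["type"], "description": e["description"]}
        for e in entities
    ]


def memory_ops(facts: List[dict], relationships: List[dict]) -> List[dict]:
    """Build 'remember' and 'relate' operations (step 3)."""
    ops = [
        {"op": "remember", "content": f["content"], "about": f["about"], "importance": f["importance"]}
        for f in facts
    ]
    ops += [
        {
            "op": "relate",
            "source": r["source"],
            "target": r["target"],
            "relationship": r["relationship"],
            "strength": r["strength"],
        }
        for r in relationships
    ]
    return ops


def to_payload(ops: List[dict]) -> str:
    """Serialize operations as the JSON array the batch examples show."""
    return json.dumps(ops, indent=2)
```

The resulting string has the same shape as the heredoc bodies in steps 2 and 3, so it could presumably be supplied on standard input in their place.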

Output Format

**📥 Multi-Source Ingestion: [Topic/Event]**

### Phase 1: Inventory Complete
[Summary table shown above]

Proceed with extraction? [y/n]

---

### Phase 2: Extraction Complete
- Sources processed: 36/36
- Entities found: 12
- Facts extracted: 87
- Commitments detected: 14
- Relationships mapped: 23

---

### Phase 3: Consolidation Complete
- Unique entities: 9 (after deduplication)
- Canonical names resolved: 4 aliases merged

---

### Phase 4: Verification

[Coverage table shown above]

**Dedicated Source Check:**
✓ Sarah Chen: 4 dedicated sources, appears in 12 total
✓ Jim Ferry: 2 dedicated sources, appears in 6 total
✓ Acme Corp: 3 dedicated sources, appears in 8 total

**Gaps:**
⚠ Source #14: No entities extracted

Ready to store? [y/n]

---

### Phase 5: Storage Complete

**Files stored:** 36
**Entities created/updated:** 9
**Memories stored:** 87
**Relationships created:** 23

All sources linked to entities. Provenance chain complete.

**Query examples:**
- "What do I know about Jim Ferry?" → will surface all 6 source memories
- "Show me Sarah's transcripts" → will list all 4 dedicated files
- "Where did I learn about Acme's timeline?" → will cite exact source

---

Judgment Points

Ask for confirmation on:

Ambiguous entity matches (is "S. Chen" the same as "Sarah Chen"?)
Sources with no extractable entities (manual review needed?)
Importance scores for extracted facts
Proceeding past verification phase
Creating new entities vs linking to existing

Quality Checklist

Inventory created before reading content
Every source gets an extraction record (none skipped)
dedicated_to field populated for sources primarily about an entity
Verification phase completed with user confirmation
Dedicated source rule enforced (2+ dedicated = must appear proportionally)
All sources filed via claudia memory document store
Provenance chain complete (memories link to documents)
No entity lost that had dedicated sources

Error Handling

If extraction fails for a source:

Log the failure
Continue with other sources
Surface in verification phase
Offer manual review option

If IngestService unavailable (no Ollama):

Fall back to direct Claude extraction
Slower but still systematic
Same extraction schema applies

If verification fails:

Do NOT proceed to storage
Show which entities are missing
Offer to re-extract specific sources
User must explicitly override to continue
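The three rules above can be sketched together. `primary` and `fallback` are hypothetical extractor callables (for example, a local-model service and direct extraction); how either is really invoked is outside this sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple


def extract_all(
    sources: Dict[str, str],
    primary: Optional[Callable[[str], dict]],
    fallback: Callable[[str], dict],
) -> Tuple[Dict[str, dict], List[Tuple[str, str]]]:
    """Extract every source, logging failures instead of aborting the batch.

    Pass primary=None when the preferred service is unavailable; the same
    record shape is expected from either extractor.
    """
    extractor = primary if primary is not None else fallback
    records: Dict[str, dict] = {}
    failures: List[Tuple[str, str]] = []
    for name, content in sources.items():
        try:
            records[name] = extractor(content)
        except Exception as exc:  # log the failure and continue with other sources
            failures.append((name, str(exc)))
    return records, failures


def may_store(verification_ok: bool, user_override: bool = False) -> bool:
    """Storage is allowed only after a passing verification or an explicit override."""
    return verification_ok or user_override
```

The `failures` list is what gets surfaced during verification, together with the offer of manual review.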

Extensibility

This workflow is schema-agnostic and works for any source type:

| Data Type | Detection | Extraction Mode |
|-----------|-----------|-----------------|
| Meeting transcripts | `.md`, `.txt` with names | `meeting` |
| Email threads | Email headers | `email` |
| Documents/PDFs | `.pdf`, formal structure | `document` |
| Research notes | Mixed content | `general` |
| Slack exports | Message format | `general` |
| CRM exports | Structured records | `general` |

Add new extraction modes to IngestService if needed, or use general mode, which extracts facts, entities, relationships, and a summary.

Tone

Methodical: this is a systematic process
Transparent: show progress at each phase
Protective: catch errors before they become permanent
Efficient: batch operations, clear status updates

Maintainer

kbanc85 Core maintainer

Source details

Full Name: kbanc85/claudia
Branch: main
Path in repo: template-v2/.claude/skills/ingest-sources
License: Other
Topics: claude-code productivity ai-assistant terminal relationship-management
