locomo-benchmark
Run LoCoMo benchmark for long-term conversational memory
Install this agent skill to your Project
npx add-skill https://github.com/genomewalker/cc-soul/tree/main/skills/locomo-benchmark
LoCoMo Benchmark
Evaluate cc-soul's memory against the LoCoMo benchmark (ACL 2024) for long-term conversational memory.
Quick Start
Run the benchmark script:
# Test one conversation (default: conv-26)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py
# Test specific conversations
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py conv-26 conv-30
# Full benchmark (all 10 conversations)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --full
# Limit QA pairs per conversation
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --max-qa 20
Here, $PLUGIN_DIR is /maps/projects/fernandezguerra/apps/repos/cc-soul, or the path where the plugin is installed.
What the Script Does
- Downloads LoCoMo data from GitHub to /tmp/locomo/ (if not present)
- Ingests conversations into cc-soul memory:
  - Extracts session summaries as observations
  - Creates triplets for speaker facts
  - Tags with sample_id for retrieval
- Evaluates QA pairs:
  - Retrieves context using chitta recall --tag {sample_id}
  - Calculates F1 score vs ground truth
- Reports results by category
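The scoring step can be sketched as a token-level F1 between a predicted answer and the ground truth. This is a minimal sketch that assumes SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the function names `normalize` and `token_f1` are hypothetical, and the actual script may normalize or score differently.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall (assumed scoring rule)."""
    pred = normalize(prediction)
    gold = normalize(ground_truth)
    if not pred or not gold:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == gold)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

With this rule, a prediction that contains the correct answer plus extra words keeps full recall but loses precision, so verbose retrieved context lowers the score.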
Categories
| Cat | Name | Description |
|---|---|---|
| 1 | Multi-hop | Requires connecting multiple facts |
| 2 | Single-hop | Direct fact retrieval |
| 3 | Temporal | Date/time questions |
| 4 | Open-domain | General knowledge |
| 5 | Adversarial | Should answer "no information" |
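The per-category report can be sketched as a simple grouping of per-question F1 values by category number. The mapping below mirrors the table above; the function name `f1_by_category` and the `(category, f1)` input shape are assumptions made for illustration, not the script's actual interface.

```python
from collections import defaultdict

# Category numbers and names as listed in the table above.
CATEGORY_NAMES = {
    1: "Multi-hop",
    2: "Single-hop",
    3: "Temporal",
    4: "Open-domain",
    5: "Adversarial",
}


def f1_by_category(results):
    """Group (category, f1) pairs and return {name: (n, mean F1 in percent)}."""
    buckets = defaultdict(list)
    for category, f1 in results:
        buckets[category].append(f1)
    return {
        CATEGORY_NAMES.get(cat, f"Category {cat}"): (len(vals), 100 * sum(vals) / len(vals))
        for cat, vals in sorted(buckets.items())
    }
```

Reporting the count `n` next to each mean matters here because the categories are not equally sized, so a category mean over a handful of questions is much noisier than the overall figure.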
Baseline Scores (from paper, except AutoMem, which is a later, separately reported result)
| Model | F1 |
|---|---|
| Human ceiling | 87.9% |
| AutoMem | 90.5% |
| GPT-4 | 32.1% |
| GPT-3.5 | 23.7% |
| Mistral-7B | 13.9% |
Data
- Repository: https://github.com/snap-research/locomo
- Local cache: /tmp/locomo/data/locomo10.json
- 10 conversations, ~200 QA pairs each, ~35 sessions per conversation
Manual Execution
If you prefer to run manually:
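# Ensure data exists (git clone fails if /tmp/locomo already exists, so clone only when it is missing)
[ -d /tmp/locomo ] || git clone https://github.com/snap-research/locomo /tmp/locomo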
# Run benchmark
python3 /maps/projects/fernandezguerra/apps/repos/cc-soul/scripts/locomo-benchmark.py conv-26
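The "download if not present" step can also be done from Python before invoking the benchmark. The following is a sketch, not the script's own logic: `ensure_locomo` is a hypothetical helper, the shallow clone (`--depth 1`) is a choice made here to keep the download small, and the `run` parameter exists only so the clone can be stubbed out in tests.

```python
import subprocess
from pathlib import Path

LOCOMO_REPO = "https://github.com/snap-research/locomo"


def ensure_locomo(dest="/tmp/locomo", run=subprocess.run) -> bool:
    """Clone LoCoMo into dest unless the data file is already cached.

    Returns True if a clone was started, False if the cache was reused.
    """
    dest = Path(dest)
    if (dest / "data" / "locomo10.json").exists():
        return False
    # git refuses to clone into a non-empty directory, so a partially
    # populated dest has to be removed by hand before retrying.
    run(["git", "clone", "--depth", "1", LOCOMO_REPO, str(dest)], check=True)
    return True
```

Checking for the data file rather than the directory means an empty or unrelated /tmp/locomo is not mistaken for a valid cache.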
Expected Output
=== LoCoMo Benchmark Results ===
Total QA Pairs: 50
Overall F1: XX.X%
By Category:
Multi-hop (n=XX): XX.X%
Single-hop (n=XX): XX.X%
Temporal (n=XX): XX.X%
Open-domain (n=XX): XX.X%
Adversarial (n=XX): XX.X%
Per Conversation:
conv-26: XX.X% (50 QA)
Comparison (from paper):
Human ceiling: 87.9%
GPT-4 baseline: 32.1%
cc-soul: XX.X%