Agent skill

locomo-benchmark

Run the LoCoMo benchmark for long-term conversational memory

Install this agent skill to your Project

npx add-skill https://github.com/genomewalker/cc-soul/tree/main/skills/locomo-benchmark

SKILL.md

LoCoMo Benchmark

Evaluate cc-soul's memory against the LoCoMo benchmark (ACL 2024) for long-term conversational memory.

Quick Start

Run the benchmark script:

bash

# Test one conversation (default: conv-26)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py

# Test specific conversations
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py conv-26 conv-30

# Full benchmark (all 10 conversations)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --full

# Limit QA pairs per conversation
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --max-qa 20

Here, $PLUGIN_DIR is /maps/projects/fernandezguerra/apps/repos/cc-soul (or the path of the installed plugin).
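The invocations above imply a small command-line interface: optional conversation IDs, a --full switch, and a --max-qa limit. The following is a minimal sketch of how such an interface could be parsed with argparse. It is not taken from locomo-benchmark.py; the function name build_parser and the attribute names are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented invocations."""
    parser = argparse.ArgumentParser(prog="locomo-benchmark.py")
    # Zero or more conversation IDs; conv-26 is the documented default.
    parser.add_argument("conversations", nargs="*", default=["conv-26"])
    # Run every conversation in the dataset.
    parser.add_argument("--full", action="store_true")
    # Cap the number of QA pairs evaluated per conversation.
    parser.add_argument("--max-qa", type=int, default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.conversations, args.full, args.max_qa)
```

How the real script combines --full with explicit conversation IDs is not documented here, so the sketch leaves that decision to the caller.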

What the Script Does

1. Downloads LoCoMo data from GitHub to /tmp/locomo/ (if not already present)
2. Ingests conversations into cc-soul memory:
   - Extracts session summaries as observations
   - Creates triplets for speaker facts
   - Tags with sample_id for retrieval
3. Evaluates QA pairs:
   - Retrieves context using chitta recall --tag {sample_id}
   - Calculates F1 score vs ground truth
4. Reports results by category
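The F1 score in the evaluation step is commonly computed at the token level, as in SQuAD-style question answering. The sketch below shows that approach, assuming lowercasing, punctuation stripping, and article removal as normalization. The script's actual normalization may differ, and the function names normalize and token_f1 are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, token_f1("red car", "red") has precision 1/2 and recall 1, giving an F1 of 2/3.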

Categories

Cat	Name	Description
1	Multi-hop	Requires connecting multiple facts
2	Single-hop	Direct fact retrieval
3	Temporal	Date/time questions
4	Open-domain	General knowledge
5	Adversarial	Should answer "no information"
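Per-category reporting can be sketched as a simple grouping of per-question F1 scores by category ID. The mapping below copies the table above; the input shape (pairs of category ID and F1) and the function name summarize_by_category are assumptions for illustration, not the script's actual data model.

```python
from collections import defaultdict

# Category IDs and names as listed in the table above.
CATEGORY_NAMES = {
    1: "Multi-hop",
    2: "Single-hop",
    3: "Temporal",
    4: "Open-domain",
    5: "Adversarial",
}


def summarize_by_category(results):
    """Group (category_id, f1) pairs; return {name: (count, mean F1 in %)}."""
    buckets = defaultdict(list)
    for category_id, f1 in results:
        buckets[category_id].append(f1)
    return {
        CATEGORY_NAMES.get(cat, f"Category {cat}"): (
            len(scores),
            100.0 * sum(scores) / len(scores),
        )
        for cat, scores in sorted(buckets.items())
    }


if __name__ == "__main__":
    demo = [(1, 1.0), (1, 0.0), (5, 1.0)]
    for name, (n, pct) in summarize_by_category(demo).items():
        print(f"  {name} (n={n}): {pct:.1f}%")
```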

Baseline Scores (from paper)

Model	F1
Human ceiling	87.9%
AutoMem (self-reported; not from the LoCoMo paper)	90.5%
GPT-4	32.1%
GPT-3.5	23.7%
Mistral-7B	13.9%

Data

Repository: https://github.com/snap-research/locomo
Local cache: /tmp/locomo/data/locomo10.json
10 conversations, ~200 QA pairs each, up to 35 sessions per conversation
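To check the cached file before a run, the QA pairs per conversation can be counted directly. This sketch assumes locomo10.json is a JSON list of objects, each with a "sample_id" string and a "qa" list. That layout is an assumption to verify against the downloaded file, and count_qa_pairs is a hypothetical helper.

```python
import json
from pathlib import Path


def count_qa_pairs(path) -> dict[str, int]:
    """Return {sample_id: number of QA pairs} for a LoCoMo-style JSON file.

    Assumes a top-level list of objects with "sample_id" and "qa" keys.
    """
    samples = json.loads(Path(path).read_text(encoding="utf-8"))
    return {sample["sample_id"]: len(sample.get("qa", [])) for sample in samples}


if __name__ == "__main__":
    cache = Path("/tmp/locomo/data/locomo10.json")
    if cache.exists():
        for sample_id, n in count_qa_pairs(cache).items():
            print(f"{sample_id}: {n} QA pairs")
    else:
        print(f"{cache} not found; run the benchmark script or clone the repository first")
```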

Manual Execution

If you prefer to run manually:

bash

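# Clone the data only if it is not already present (git clone fails on an existing non-empty directory)
[ -d /tmp/locomo ] || git clone https://github.com/snap-research/locomo /tmp/locomo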

# Run benchmark
python3 /maps/projects/fernandezguerra/apps/repos/cc-soul/scripts/locomo-benchmark.py conv-26

Expected Output

=== LoCoMo Benchmark Results ===

Total QA Pairs: 50
Overall F1: XX.X%

By Category:
  Multi-hop (n=XX): XX.X%
  Single-hop (n=XX): XX.X%
  Temporal (n=XX): XX.X%
  Open-domain (n=XX): XX.X%
  Adversarial (n=XX): XX.X%

Per Conversation:
  conv-26: XX.X% (50 QA)

Comparison (from paper):
  Human ceiling: 87.9%
  GPT-4 baseline: 32.1%
  cc-soul: XX.X%

Maintainer

genomewalker Core maintainer

Source details

Full Name: genomewalker/cc-soul
Branch: main
Path in repo: skills/locomo-benchmark
