# clio-clustering

Build a complete data clustering and visualization pipeline from any data source. Use when the user wants to analyze patterns in text data (GitHub issues, Slack messages, support tickets, code reviews, forum posts, customer feedback, etc.), cluster similar items, or build an interactive visualization to explore the patterns. Triggers on: "cluster", "analyze patterns", "group similar", "clio-style", "pattern analysis", "visualize clusters", "find themes", "topic modeling", "semantic clustering".

Install this agent skill to your project:

```
npx add-skill https://github.com/josh-cooper/.claude/tree/main/skills/clio-clustering
```
# Clio-Style Clustering Pipeline

Build an end-to-end semantic clustering analysis from any text data source, with interactive visualization.

## What This Skill Does

This skill guides you through building a complete clustering pipeline:
- **Data Sourcing** - Identify APIs/methods to fetch data, build tests to verify access
- **Scraping** - Collect data with proper pagination and rate limiting
- **Embedding** - Generate embeddings using OpenAI's text-embedding-3-large
- **Clustering** - Hierarchical HDBSCAN clustering with UMAP projection
- **Labeling** - LLM-powered cluster naming and description
- **Visualization** - Interactive React/D3 explorer with drill-down
## Quick Start

When the user describes a data source (e.g., "GitHub issues from facebook/react"), follow these steps.

### Phase 1: Data Source Discovery

First, identify how to access the data:

- **Research the API** - Use web search to find official API documentation
- **Identify authentication** - What tokens/keys are needed?
- **Find pagination patterns** - How does the API handle large datasets?
- **Determine rate limits** - What are the constraints?

See data-sourcing.md for common patterns (GitHub, Slack, etc.).
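To make discovery concrete, the probing step can be sketched in Python. GitHub-style conventions (`Link` header for pagination, `X-RateLimit-*` headers for limits) are used here as an illustrative assumption; other APIs expose this information differently:

```python
# Sketch: inspect pagination and rate limits before writing the scraper.
# Assumes GitHub-style response headers; adapt for your data source.

def parse_link_header(link_header: str) -> dict:
    """Parse an RFC 5988 `Link` header into a {rel: url} dict."""
    links = {}
    for part in link_header.split(","):
        url_part, _, rel_part = part.partition(";")
        url = url_part.strip().strip("<>")
        rel = rel_part.strip().removeprefix('rel="').rstrip('"')
        links[rel] = url
    return links


def rate_limit_status(headers: dict) -> tuple:
    """Read remaining calls and reset timestamp from GitHub-style headers."""
    return (
        int(headers.get("X-RateLimit-Remaining", -1)),
        int(headers.get("X-RateLimit-Reset", 0)),
    )
```

Following the `next` link until it disappears is usually safer than computing page numbers yourself, since APIs are free to change their page sizes.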
### Phase 2: Build & Test Data Fetcher

**IMPORTANT:** Write tests BEFORE building the full scraper.
```python
# test_fetcher.py - Verify API access works
import os

import requests


def test_api_access():
    """Verify we can access the API."""
    # Adapt this for your specific data source
    token = os.environ.get('API_TOKEN')
    assert token, "API_TOKEN not set"
    response = requests.get(
        'https://api.example.com/endpoint',
        headers={'Authorization': f'Bearer {token}'},
        timeout=30,  # requests has no default timeout; don't hang forever
    )
    assert response.status_code == 200
    data = response.json()
    assert len(data) > 0, "No data returned"
    print(f"Successfully fetched {len(data)} items")


if __name__ == '__main__':
    test_api_access()
```

Run the test: `python test_fetcher.py`
Only proceed to the full scraper once tests pass.
### Phase 3: Build the Scraper

Create a scraper that:

- Handles pagination efficiently
- Respects rate limits
- Stores data in SQLite
- Saves progress so an interrupted run can resume where it left off
See data-sourcing.md for the database schema and scraper template.
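As a minimal sketch of the resumability idea (the real schema and template live in data-sourcing.md), the scraper can record the last completed page in SQLite and pick up from there on restart. The `fetch_page` callable and the two-column `items` table here are illustrative assumptions:

```python
import sqlite3


def scrape(db_path: str, fetch_page, start_page: int = 1) -> int:
    """Resumable scraper sketch.

    `fetch_page(page)` returns a list of {'id': ..., 'text': ...} dicts,
    or an empty list when the source is exhausted. Progress lives in the
    `meta` table, so a crash mid-run resumes from the last completed page.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, text TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)")
    row = conn.execute("SELECT value FROM meta WHERE key = 'last_page'").fetchone()
    page = int(row[0]) + 1 if row else start_page
    total = 0
    while True:
        items = fetch_page(page)
        if not items:
            break
        conn.executemany(
            "INSERT OR IGNORE INTO items (id, text) VALUES (?, ?)",
            [(str(it["id"]), it["text"]) for it in items],
        )
        conn.execute(
            "INSERT OR REPLACE INTO meta (key, value) VALUES ('last_page', ?)",
            (str(page),),
        )
        conn.commit()  # commit per page so progress survives a crash
        total += len(items)
        page += 1
    conn.close()
    return total
```

Committing once per page keeps the database consistent at every pagination boundary, which is what makes the resume logic safe.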
### Phase 4: Generate Embeddings & Cluster
Use the clustering pipeline to:
- Generate embeddings with OpenAI
- Run hierarchical HDBSCAN clustering
- Project to 2D with UMAP
- Label clusters with LLM
See clustering-reference.md for the complete implementation.
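The core of this phase, reduced to a sketch (clustering-reference.md has the full implementation; the batch size and `min_cluster_size` below are illustrative defaults, not tuned values):

```python
def batched(seq, size):
    """Yield successive chunks of `seq` of length `size` (last may be shorter)."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


def embed_and_cluster(texts, min_cluster_size=10):
    """Embed with OpenAI, cluster with HDBSCAN, project to 2D with UMAP.

    Heavy imports are deferred so the pure helper above stays importable
    without the pipeline dependencies installed.
    """
    import numpy as np
    import hdbscan
    import umap
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    vectors = []
    for chunk in batched(texts, 100):  # the embeddings API accepts batched input
        resp = client.embeddings.create(
            model="text-embedding-3-large", input=chunk
        )
        vectors.extend(d.embedding for d in resp.data)
    X = np.array(vectors)

    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(X)  # -1 = noise
    xy = umap.UMAP(n_components=2, metric="cosine").fit_transform(X)
    return labels, xy
```

HDBSCAN labels outliers as `-1` rather than forcing them into a cluster, which is one reason it is preferred over K-means here.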
### Phase 5: Build Visualization
Set up the interactive visualization:
- Export data to JSON
- Create Next.js app with D3 visualization
- Add hierarchical drill-down view
See visualization-setup.md for setup instructions.
The `components/` directory contains ready-to-copy React components.
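The export step only needs to write the two JSON files the visualizer reads from `public/data/`. A minimal sketch, where the item and cluster field names are assumptions rather than a fixed schema:

```python
import json
from pathlib import Path


def export_for_viz(items, clusters, out_dir="visualizer/public/data"):
    """Write items.json and clusters.json for the Next.js app.

    `items` is assumed to be a list of dicts (e.g. id, text, cluster, x, y);
    `clusters` a mapping from cluster id to its label/description.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create public/data/ if missing
    items_path = out / "items.json"
    clusters_path = out / "clusters.json"
    items_path.write_text(json.dumps(items))
    clusters_path.write_text(json.dumps(clusters))
    return items_path, clusters_path
```

Keeping the export as plain static JSON means the visualizer needs no backend: Next.js serves whatever lands in `public/`.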
## Project Structure

When complete, the project should look like:

```
project/
├── data/
│   └── items.db              # SQLite database
├── pipeline/
│   ├── __init__.py
│   ├── db.py                 # Database operations
│   ├── scraper.py            # Data fetcher
│   ├── embed.py              # Embedding generation
│   ├── cluster.py            # HDBSCAN clustering
│   ├── describe.py           # LLM labeling
│   └── export.py             # JSON export
├── visualizer/
│   ├── app/
│   │   ├── page.tsx
│   │   └── layout.tsx
│   ├── components/
│   │   ├── HierarchicalView.tsx
│   │   ├── ScatterPlot.tsx
│   │   └── ...
│   ├── lib/
│   │   ├── types.ts
│   │   ├── data.ts
│   │   └── utils.ts
│   └── public/data/
│       ├── items.json
│       └── clusters.json
├── test_fetcher.py           # API access tests
├── requirements.txt
└── README.md
```
## Dependencies

### Python (for pipeline)

```
openai>=1.0
instructor>=1.0
hdbscan>=0.8.33
umap-learn>=0.5
scikit-learn>=1.3
numpy>=1.24
requests>=2.28
rich>=13.0
```
### Node.js (for visualization)

```json
{
  "dependencies": {
    "next": "14.2.0",
    "react": "^18.2.0",
    "d3": "^7.8.5",
    "framer-motion": "^11.0.0",
    "tailwindcss": "^3.4.1"
  }
}
```
## Environment Variables

```
OPENAI_API_KEY=sk-...   # Required for embeddings and labeling

# Plus whatever auth your data source needs:
GITHUB_TOKEN=ghp_...    # For GitHub
SLACK_TOKEN=xoxb-...    # For Slack
# etc.
```
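A missing key otherwise surfaces deep inside the pipeline, so a fail-fast check at startup is cheap. A sketch, where the `REQUIRED` list is an assumption to extend for your data source:

```python
import os

# Extend with whichever token your data source needs, e.g. "GITHUB_TOKEN".
REQUIRED = ["OPENAI_API_KEY"]


def missing_env(required=REQUIRED, env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]
```

Call it at the top of each pipeline module and exit with a clear message if the returned list is non-empty.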
## Running the Pipeline

```bash
# 1. Test API access
python test_fetcher.py

# 2. Scrape data
python -m pipeline.scraper

# 3. Generate embeddings
python -m pipeline.embed

# 4. Cluster
python -m pipeline.cluster

# 5. Label clusters with LLM
python -m pipeline.describe

# 6. Export for visualization
python -m pipeline.export

# 7. Run visualizer
cd visualizer && npm run dev
```
## Key Design Decisions

- **SQLite for storage** - Simple, portable, supports resumability
- **HDBSCAN over K-means** - Finds natural clusters, handles noise
- **3-level hierarchy** - Coarse (L1) -> Medium (L2) -> Fine (L3)
- **UMAP for projection** - Preserves local structure better than t-SNE
- **text-embedding-3-large** - OpenAI's highest-quality embedding model for semantic similarity
- **Next.js + D3** - Fast, interactive visualization with SSR support
## Detailed Documentation

- **Data Sourcing Patterns** (data-sourcing.md) - API patterns, auth, pagination
- **Clustering Implementation** (clustering-reference.md) - Embedding, HDBSCAN, UMAP code
- **Visualization Setup** (visualization-setup.md) - Next.js app and components