Agent skill
taxonomy-resolver
Resolves ambiguous organism names to precise NCBI taxonomy IDs and scientific names, then searches for genomic data in ENA (European Nucleotide Archive). Use this skill when users provide common names (like "malaria parasite", "E. coli", "mouse"), abbreviated names, or when you need to convert any organism reference to an exact scientific name for API queries. This skill handles disambiguation through conversation and validates taxonomy IDs via NCBI Taxonomy API.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/taxonomy-resolver
SKILL.md
Taxonomy Resolver Skill
Purpose
This skill enables Claude to convert ambiguous organism names, common names, or taxonomy references into precise, API-ready scientific names and NCBI taxonomy IDs. It also helps users find relevant genomic data (FASTQ files, assemblies, BioProjects) from ENA. The core principle: let external APIs do the work. Claude's role is orchestration, disambiguation, and validation, not inventing taxonomy data.
When to Use This Skill
Use this skill when:
- User mentions organisms by common name ("malaria parasite", "mosquito", "house mouse")
- User provides ambiguous scientific names ("E. coli", "SARS-CoV-2 isolate")
- User asks to search for genomic data (FASTQ, assemblies, etc.) for an organism
- You need to validate or look up taxonomy IDs
- User provides a taxonomy ID that needs verification
- Converting organism names for NCBI, ENA, or other database queries
Core Workflow
1. Extract User Intent (Critical)
Before calling any APIs, understand what the user wants. Extract:
- Organism: What species/taxa are they interested in?
- Data type: FASTQ reads, assemblies, studies, samples, etc.
- Filters (optional): Library strategy (RNA-Seq, WGS, ChIP-Seq, etc.)
Examples of intent extraction:
- "Find FASTQ files for Plasmodium falciparum" → Organism: P. falciparum, Data: FASTQ reads
- "Search for E. coli genome assemblies" → Organism: E. coli (needs disambiguation), Data: assemblies
- "Get RNA-seq data for mouse" → Organism: Mus musculus, Data: FASTQ with RNA-Seq filter
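One way to keep the extracted intent explicit before any API call is a small record with the three fields listed above. This is only a sketch: the `SearchIntent` class and its field names are hypothetical and are not part of the skill's scripts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchIntent:
    """Hypothetical container for what the user asked for."""
    organism: str                           # as written by the user, e.g. "mouse"
    data_type: Optional[str] = None         # e.g. "fastq", "assemblies"
    library_strategy: Optional[str] = None  # e.g. "RNA-Seq", only if requested

    def summary(self) -> str:
        """Render the intent in the 'Organism / Data / Filter' form used above."""
        parts = [f"Organism: {self.organism}"]
        if self.data_type:
            parts.append(f"Data: {self.data_type}")
        if self.library_strategy:
            parts.append(f"Filter: {self.library_strategy}")
        return ", ".join(parts)


intent = SearchIntent("Plasmodium falciparum", data_type="fastq",
                      library_strategy="RNA-Seq")
print(intent.summary())
```

Leaving the optional fields as `None` makes it obvious which parts of the request still have to be asked about.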
2. Disambiguation (Critical)
NEVER pass ambiguous names to APIs. Always disambiguate to species-level or specific taxa first.
If the user's input is NOT an explicit species-level scientific name:
- Identify the ambiguity
- Ask a clarifying question OR show a small disambiguation list
- Wait for user confirmation before proceeding
Examples of ambiguous inputs that require clarification:
- "malaria parasite" → Ask: "Which malaria parasite? (Plasmodium falciparum, P. vivax, P. malariae, P. ovale, P. knowlesi)"
- "E. coli" → Ask: "Which E. coli strain? (K-12, O157:H7, other specific strain)"
- "mouse" → Ask: "Did you mean house mouse (Mus musculus) or a different species?"
- "SARS-CoV-2 isolate" → Ask: "Please provide the specific isolate or strain name"
- "bacteria" → Too broad, ask for specific genus/species
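A rough pre-check for "is this already a spelled-out binomial?" can be sketched with a regular expression. The function name `needs_disambiguation` is hypothetical, and the check is a heuristic only: a capitalised common name such as "House mouse" would slip through, so it complements the clarifying question rather than replacing it.

```python
import re

# Genus (capitalised, spelled out) followed by a lowercase species epithet,
# optionally followed by a strain or subspecies label. Heuristic only.
_BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z][a-z-]+( .+)?$")


def needs_disambiguation(name: str) -> bool:
    """Return True when the name does not look like a full scientific name."""
    return _BINOMIAL.match(name.strip()) is None


for candidate in ["Plasmodium falciparum", "E. coli", "malaria parasite", "mouse"]:
    print(candidate, "->", "ask first" if needs_disambiguation(candidate) else "ok")
```

Abbreviated genus names ("E. coli") and lowercase common names are flagged, while "Escherichia coli K-12" passes because the strain label is allowed after the binomial.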
3. Taxonomy Resolution
Once you have a specific name, use the resolve_taxonomy.py script to:
- Query NCBI Taxonomy API
- Get the official taxonomy ID
- Retrieve the scientific name
- Get taxonomic lineage
- Validate the organism exists in NCBI
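The resolution step could be driven from Python roughly as follows. This sketch assumes that resolve_taxonomy.py sits in the working directory and prints its JSON result to standard output; the helper names `build_resolve_command` and `resolve` are hypothetical.

```python
import json
import subprocess
import sys
from typing import List, Optional


def build_resolve_command(name: Optional[str] = None,
                          tax_id: Optional[int] = None) -> List[str]:
    """Build the argv for resolve_taxonomy.py from a name or a taxonomy ID."""
    if (name is None) == (tax_id is None):
        raise ValueError("pass exactly one of name or tax_id")
    command = [sys.executable, "resolve_taxonomy.py"]
    if tax_id is not None:
        command += ["--tax-id", str(tax_id)]
    else:
        command.append(name)
    return command


def resolve(name: Optional[str] = None, tax_id: Optional[int] = None) -> dict:
    """Run the script and parse its output (assumed to be JSON on stdout)."""
    done = subprocess.run(build_resolve_command(name, tax_id),
                          capture_output=True, text=True, check=True)
    return json.loads(done.stdout)
```

Passing the name as a single list element avoids shell quoting problems with names that contain spaces. Whatever the script returns should be reported as-is rather than corrected from memory.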
4. ENA Search (Optional)
If the user needs FASTQ files or genomic data, use the search_ena.py script with intent-based filtering:
- Use the extracted intent to add filters to your query
- For RNA-seq: Add library_strategy="RNA-Seq" to the query
- For WGS: Add library_strategy="WGS" to the query
- For ChIP-seq: Add library_strategy="ChIP-Seq" to the query
- Search ENA's database with these filters
- Automatically group results by BioProject
- Present technical details for each BioProject:
- Sequencing platform (Illumina, PacBio, Oxford Nanopore, etc.)
- Library layout (SINGLE or PAIRED)
- Read length and insert size (if available)
- Number of runs/samples
- Library strategy (RNA-Seq, WGS, etc.)
Example intent-based queries:
- User wants RNA-seq data → python search_ena.py 'scientific_name="Plasmodium falciparum" AND library_strategy="RNA-Seq"'
- User wants WGS data → python search_ena.py 'scientific_name="Mus musculus" AND library_strategy="WGS"'
- User just wants any data → python search_ena.py "Plasmodium falciparum"
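The query strings in these examples can be assembled from the resolved name and the optional filter. The helper below is a minimal sketch; `build_ena_query` is a hypothetical name, and it only covers the two fields used in this document.

```python
from typing import Optional


def build_ena_query(scientific_name: str,
                    library_strategy: Optional[str] = None) -> str:
    """Compose an ENA query string in the form used by the examples above."""
    for value in (scientific_name, library_strategy or ""):
        if '"' in value:
            # Values are wrapped in double quotes, so reject embedded ones.
            raise ValueError("double quotes are not allowed in query values")
    query = f'scientific_name="{scientific_name}"'
    if library_strategy:
        query += f' AND library_strategy="{library_strategy}"'
    return query


print(build_ena_query("Plasmodium falciparum", "RNA-Seq"))
print(build_ena_query("Mus musculus"))
```

Without a filter the helper still emits the `scientific_name="..."` form. Whether search_ena.py treats that identically to a bare organism name is not stated in this document, so the bare form shown above remains the safer choice for unfiltered searches.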
5. BioProject Details (Optional)
After getting ENA search results, you can fetch detailed descriptions for BioProjects using get_bioproject_details.py:
- Query ENA for BioProject metadata
- Get study title and description
- Retrieve organism information and submission details
- Provide context about what each BioProject contains
Important Principles
- Extract intent first: Before calling APIs, understand what the user wants (organism, data type, filters).
- Use intent to filter API calls: Add library_strategy filters to ENA searches based on data type. This gives more relevant results and saves the user time.
- Let the API handle validation: Don't try to validate taxonomy yourself. Call the API and report what it returns.
- Be conversational about disambiguation: Don't lecture, just ask naturally:
  - ✅ "Which malaria parasite are you interested in? Plasmodium falciparum or P. vivax?"
  - ❌ "I cannot proceed without a species-level designation. Please provide taxonomic clarification."
- Don't hallucinate taxonomy IDs: If you're not certain, use the API. Never make up taxonomy IDs.
- Species-level is usually the target: Most database queries work best with species-level names, but subspecies and strains are fine if specified.
- Common names are okay as starting points: Use them to begin disambiguation, but always convert to scientific names for APIs.
Available Scripts
resolve_taxonomy.py
Usage:
python resolve_taxonomy.py "Plasmodium falciparum"
python resolve_taxonomy.py --tax-id 5833
Purpose: Queries NCBI Taxonomy API to resolve organism names to taxonomy IDs and vice versa.
Returns: JSON with taxonomy ID, scientific name, common name, and lineage.
search_ena.py
Usage:
# Basic search
python search_ena.py "Plasmodium falciparum" --data-type fastq
# Intent-based search with library_strategy filter (RECOMMENDED)
python search_ena.py 'scientific_name="Plasmodium falciparum" AND library_strategy="RNA-Seq"'
python search_ena.py 'scientific_name="Mus musculus" AND library_strategy="WGS"'
python search_ena.py 'scientific_name="Severe acute respiratory syndrome coronavirus 2" AND library_strategy="AMPLICON"'
# Other options
python search_ena.py "Mus musculus" --limit 10
Purpose: Searches ENA (European Nucleotide Archive) for genomic data.
Intent-based filtering: Use ENA query syntax to add filters based on user intent:
- library_strategy="RNA-Seq" - For RNA-seq/transcriptomics
- library_strategy="WGS" - For whole genome sequencing
- library_strategy="WXS" - For whole exome sequencing
- library_strategy="ChIP-Seq" - For ChIP-seq/epigenetics
- library_strategy="AMPLICON" - For amplicon sequencing
- library_strategy="Bisulfite-Seq" - For methylation studies
Returns: JSON with accession numbers, study information, and metadata. For read_run searches, results are automatically grouped by BioProject with:
- BioProject accession
- Number of runs associated with each BioProject
- Study title (if available)
- Sample run details
- Library strategy (experiment type)
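The grouping described above can be pictured with a short sketch. It assumes each run record is a dict keyed by ENA read_run field names (study_accession, instrument_platform, library_layout, library_strategy); the exact shape of the script's output is not shown in this document, and the function names and sample records are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def _distinct(runs: List[dict], field: str) -> List[str]:
    """Sorted distinct non-empty values of one field across a group of runs."""
    return sorted({run[field] for run in runs if run.get(field)})


def group_by_bioproject(runs: Iterable[dict]) -> Dict[str, dict]:
    """Group run records by BioProject and summarise their technical details."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for run in runs:
        groups[run.get("study_accession") or "unknown"].append(run)
    return {
        accession: {
            "run_count": len(members),
            "platforms": _distinct(members, "instrument_platform"),
            "layouts": _distinct(members, "library_layout"),
            "strategies": _distinct(members, "library_strategy"),
        }
        for accession, members in groups.items()
    }


# Placeholder records; accessions and values are not real search results.
example_runs = [
    {"study_accession": "PRJEB1234", "instrument_platform": "ILLUMINA",
     "library_layout": "PAIRED", "library_strategy": "RNA-Seq"},
    {"study_accession": "PRJEB1234", "instrument_platform": "ILLUMINA",
     "library_layout": "PAIRED", "library_strategy": "RNA-Seq"},
    {"study_accession": "PRJEB5678", "instrument_platform": "OXFORD_NANOPORE",
     "library_layout": "SINGLE", "library_strategy": "WGS"},
]
print(group_by_bioproject(example_runs))
```

Records without a study accession are collected under "unknown" instead of being dropped, so missing metadata stays visible to the user.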
get_bioproject_details.py
Usage:
python get_bioproject_details.py PRJEB1234
python get_bioproject_details.py PRJNA123456 --format json
python get_bioproject_details.py PRJEB1234 PRJNA456789
Purpose: Fetches detailed information about BioProjects from ENA.
Returns: JSON with study title, description, organism, center name, and dates.
Example Interactions
Example 1: Simple Resolution
User: "What's the taxonomy ID for house mouse?"
Claude's Process:
- User said "house mouse" - this is clear enough (Mus musculus is unambiguous)
- Run: python resolve_taxonomy.py "Mus musculus"
- Return the taxonomy ID to user
Example 2: Disambiguation Required with BioProject Details
User: "Find FASTQ files for malaria parasite"
Claude's Process:
- "Malaria parasite" is ambiguous
- Ask: "Which malaria parasite? The main ones are:
- Plasmodium falciparum (most common, causes severe malaria)
- Plasmodium vivax (widespread, relapses common)
- Plasmodium malariae
- Plasmodium ovale"
- Wait for user response
- Once user specifies (e.g., "P. falciparum"), then:
- Run: python resolve_taxonomy.py "Plasmodium falciparum"
- Run: python search_ena.py "Plasmodium falciparum" --data-type fastq
- Results will be grouped by BioProject automatically
- Present BioProject results with technical details:
- Platform (e.g., "Illumina HiSeq 2500")
- Layout ("SINGLE" or "PAIRED")
- Read length (e.g., "150 bp")
- Number of runs
- (Optional) If user wants more context about specific BioProjects:
- Run: python get_bioproject_details.py PRJEB1234 PRJEB5678
- Present results with BioProject grouping, descriptions, and technical specifications
Example 3: Strain-Level Detail
User: "Search for E. coli K-12 data"
Claude's Process:
- "E. coli K-12" is specific enough
- Run: python resolve_taxonomy.py "Escherichia coli K-12"
- Run: python search_ena.py "Escherichia coli K-12"
- Present results
Example 4: Taxonomy ID Lookup
User: "What organism is taxonomy ID 9606?"
Claude's Process:
- Run: python resolve_taxonomy.py --tax-id 9606
- Report the result (Homo sapiens)
Example 5: Intent-Based Data Search
User: "I need Plasmodium falciparum RNA-seq data"
Claude's Process:
- Extract intent: Organism = P. falciparum, Data = RNA-seq/FASTQ, Filter = RNA-Seq
- Organism is specific enough (P. falciparum)
- Run: python resolve_taxonomy.py "Plasmodium falciparum"
- Run with intent-based filter: python search_ena.py 'scientific_name="Plasmodium falciparum" AND library_strategy="RNA-Seq"' --limit 10
- Present BioProject groupings with technical details:
- Example: "PRJEB1234: 12 runs, Illumina HiSeq 2500, PAIRED-end, 150bp reads, RNA-Seq"
- Provide BioProject accessions and details
Error Handling
If NCBI API returns no results:
- Don't assume the organism doesn't exist
- Suggest alternative spellings or ask if they meant something similar
- Example: "I couldn't find 'Homo sapian' in NCBI. Did you mean 'Homo sapiens'?"
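A "did you mean" suggestion can be sketched with the standard library's difflib. The list of known names is an assumption here: it stands for scientific names the API has already confirmed earlier in the session, never names invented on the spot. The function name `suggest_names` is hypothetical.

```python
import difflib
from typing import List


def suggest_names(query: str, known_names: List[str], limit: int = 3) -> List[str]:
    """Return previously confirmed names that closely resemble the query."""
    return difflib.get_close_matches(query, known_names, n=limit, cutoff=0.8)


confirmed = ["Homo sapiens", "Mus musculus", "Plasmodium falciparum"]
print(suggest_names("Homo sapian", confirmed))
```

Any suggestion should still be offered as a question and re-checked with resolve_taxonomy.py once the user confirms it.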
If ENA search returns no results:
- Report this clearly
- Suggest broadening the search or trying different terms
- Example: "No FASTQ files found for this specific search. You might try searching for the genus or checking NCBI SRA instead."
If network errors occur:
- Report the error clearly
- Suggest the user check their network settings
- Note which domains need to be allowlisted (api.ncbi.nlm.nih.gov, www.ebi.ac.uk)
If API rate limits are hit:
- Retry strategy: Wait 1-2 seconds, then retry the API call
- Maximum retries: Retry up to 3 times before reporting failure
- Exponential backoff: Consider doubling the wait time with each retry (1s, 2s, 4s)
- After the third failed retry, report to the user:
- "The API is currently rate-limited. Please wait a moment and try again."
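The retry policy above might look like this in code. The sketch reads "up to 3 times" as three retries after the initial call, which matches the 1s, 2s, 4s schedule; that reading is an assumption. `RateLimitError` and `call_with_backoff` are hypothetical names, since how the scripts signal a rate limit is not documented here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical marker for a rate-limit response (for example HTTP 429)."""


def call_with_backoff(call: Callable[[], T], max_retries: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call once, then retry on rate limits, doubling the wait each time."""
    for retry in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if retry == max_retries:
                raise  # retries exhausted: let the caller report the failure
            sleep(base_delay * 2 ** retry)  # waits of 1s, 2s, 4s by default
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the waits can be replaced in tests. When the final retry fails, the exception propagates and the user-facing message above can be shown.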
Network Requirements
⚠️ Important: This skill requires network access to:
- api.ncbi.nlm.nih.gov (NCBI Taxonomy API)
- www.ebi.ac.uk (ENA API)
If you encounter network errors, the user needs to add these domains to their network allowlist.
Best Practices
- Extract user intent FIRST - Understand what they want before calling any APIs
- Use intent to filter API calls - Add appropriate filters to get more relevant results (library_strategy for ENA)
- Always disambiguate before calling APIs
- Use the actual API responses, don't invent taxonomy data
- Be conversational and helpful with disambiguation
- Report API errors clearly and suggest solutions
- Remember: let the APIs do the heavy lifting, Claude just orchestrates
- Handle API rate limits gracefully: If you hit rate limits, wait 1-2 seconds and retry up to 3 times before reporting failure
- Present BioProject groupings: When searching ENA for reads, always present results grouped by BioProject with technical details
Common Library Strategies for ENA Filtering
When users mention specific data types, use these library_strategy values:
- RNA-seq, transcriptomics, gene expression → RNA-Seq
- Whole genome sequencing, WGS → WGS
- Whole exome sequencing, WXS, exome → WXS
- ChIP-seq, chromatin, histone → ChIP-Seq
- Amplicon sequencing, targeted sequencing → AMPLICON
- Methylation, bisulfite sequencing → Bisulfite-Seq
- ATAC-seq, chromatin accessibility → ATAC-seq
- Hi-C, chromosome conformation → Hi-C
- Metagenomics → METAGENOMIC (in the ENA/SRA vocabulary this is a library_source value, so filter with library_source="METAGENOMIC" rather than library_strategy)
- Small RNA, miRNA → miRNA-Seq
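The mapping above can be expressed as a keyword lookup. This is a sketch with hypothetical names (`LIBRARY_STRATEGY_KEYWORDS`, `strategy_for`); longer keywords are tried first so that "chromatin accessibility" resolves to ATAC-seq before the shorter "chromatin" can match ChIP-Seq.

```python
from typing import Optional

# Keywords follow the list above. Metagenomics is left out because
# METAGENOMIC is a library_source value, not a library_strategy.
LIBRARY_STRATEGY_KEYWORDS = {
    "RNA-Seq": ("rna-seq", "transcriptomics", "gene expression"),
    "WGS": ("whole genome sequencing", "wgs"),
    "WXS": ("whole exome sequencing", "wxs", "exome"),
    "ChIP-Seq": ("chip-seq", "chromatin", "histone"),
    "AMPLICON": ("amplicon", "targeted sequencing"),
    "Bisulfite-Seq": ("methylation", "bisulfite"),
    "ATAC-seq": ("atac-seq", "chromatin accessibility"),
    "Hi-C": ("hi-c", "chromosome conformation"),
    "miRNA-Seq": ("small rna", "mirna"),
}

# Flatten to (keyword, strategy) pairs, longest keyword first.
_PAIRS = sorted(
    ((kw, strategy) for strategy, kws in LIBRARY_STRATEGY_KEYWORDS.items() for kw in kws),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def strategy_for(request: str) -> Optional[str]:
    """Return the library_strategy value suggested by the request, if any."""
    text = request.lower()
    for keyword, strategy in _PAIRS:
        if keyword in text:
            return strategy
    return None


print(strategy_for("Get RNA-seq data for mouse"))
```

Plain substring matching is deliberately simple. A request that matches nothing returns `None`, in which case the search runs without a library_strategy filter.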
Testing the Skill
To verify the skill is working:
# Test taxonomy resolution
python resolve_taxonomy.py "Homo sapiens"
# Test with taxonomy ID
python resolve_taxonomy.py --tax-id 9606
# Test ENA search (will show BioProject grouping)
python search_ena.py "Saccharomyces cerevisiae" --data-type fastq --limit 5
# Test BioProject details
python get_bioproject_details.py PRJDB7788
Notes for Developers
This skill follows the principle: "Let the APIs do the work, Claude just orchestrates."
The skill doesn't try to make Claude an expert in taxonomy or bioinformatics. It just provides:
- Clear guidance on when to disambiguate
- Tools to call the right APIs
- Instructions on how to handle responses
- Guidance on filtering searches based on intent
All validation is the API's problem. If results seem wrong or missing, that's ENA/NCBI's issue to address, not ours.