Agent skill
bio-entrez-fetch
Install this agent skill to your Project
npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/bio-entrez-fetch
SKILL.md
name: bio-entrez-fetch description: Retrieve records from NCBI databases using Biopython Bio.Entrez. Use when downloading sequences, fetching GenBank records, getting document summaries, or parsing NCBI data into Biopython objects. tool_type: python primary_tool: Bio.Entrez measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools:
- read_file
- run_shell_command
Entrez Fetch
Retrieve records from NCBI databases using Biopython's Entrez module (EFetch, ESummary utilities).
Required Setup
from Bio import Entrez
Entrez.email = 'your.email@example.com' # Required by NCBI
Entrez.api_key = 'your_api_key' # Optional, raises rate limit 3->10 req/sec
Core Functions
Entrez.efetch() - Retrieve Full Records
Fetch complete records in various formats from any NCBI database.
# Fetch GenBank record by ID
handle = Entrez.efetch(db='nucleotide', id='NM_007294', rettype='gb', retmode='text')
genbank_text = handle.read()
handle.close()
# Fetch FASTA sequence
handle = Entrez.efetch(db='nucleotide', id='NM_007294', rettype='fasta', retmode='text')
fasta_text = handle.read()
handle.close()
# Fetch multiple records
handle = Entrez.efetch(db='nucleotide', id='NM_007294,NM_000059', rettype='fasta', retmode='text')
Key Parameters:
| Parameter | Description | Example |
|---|---|---|
db |
Database name | 'nucleotide', 'protein', 'pubmed' |
id |
Record ID(s) | 'NM_007294' or '123,456,789' |
rettype |
Return type | 'fasta', 'gb', 'abstract' |
retmode |
Return mode | 'text', 'xml' |
retstart |
Start index | 0 |
retmax |
Max records | 20 |
WebEnv |
History server session | From esearch |
query_key |
History server query | From esearch |
Common Return Types by Database
Nucleotide/Protein:
| rettype | retmode | Description |
|---|---|---|
'fasta' |
'text' |
FASTA sequence |
'gb' |
'text' |
GenBank flat file |
'gp' |
'text' |
GenPept flat file (protein) |
'gbwithparts' |
'text' |
GenBank with contig sequences |
'seqid' |
'text' |
Seq-id only |
'acc' |
'text' |
Accession only |
PubMed:
| rettype | retmode | Description |
|---|---|---|
'abstract' |
'text' |
Abstract text |
'medline' |
'text' |
MEDLINE format |
'xml' |
'xml' |
Full PubMed XML |
Gene:
| rettype | retmode | Description |
|---|---|---|
'gene_table' |
'text' |
Gene table format |
'xml' |
'xml' |
Full gene XML |
Entrez.esummary() - Document Summaries
Get brief summaries without downloading full records. Faster than efetch.
# Get summary for nucleotide record
handle = Entrez.esummary(db='nucleotide', id='NM_007294')
record = Entrez.read(handle)
handle.close()
summary = record[0] # First (only) record
print(f"Title: {summary['Title']}")
print(f"Length: {summary['Length']}")
print(f"Organism: {summary['Organism']}")
Common Summary Fields:
# Nucleotide/Protein
summary['Title'] # Record title/description
summary['Caption'] # Short identifier
summary['Length'] # Sequence length
summary['Organism'] # Source organism
summary['TaxId'] # Taxonomy ID
summary['AccessionVersion'] # Full accession.version
# PubMed
summary['Title'] # Article title
summary['AuthorList'] # Authors
summary['Source'] # Journal
summary['PubDate'] # Publication date
summary['DOI'] # Digital Object Identifier
Parsing with Biopython
Parse into SeqRecord Objects
from Bio import Entrez, SeqIO
Entrez.email = 'your.email@example.com'
# Parse GenBank into SeqRecord
handle = Entrez.efetch(db='nucleotide', id='NM_007294', rettype='gb', retmode='text')
record = SeqIO.read(handle, 'genbank')
handle.close()
print(f"ID: {record.id}")
print(f"Length: {len(record.seq)}")
print(f"Features: {len(record.features)}")
# Parse FASTA into SeqRecord
handle = Entrez.efetch(db='nucleotide', id='NM_007294', rettype='fasta', retmode='text')
record = SeqIO.read(handle, 'fasta')
handle.close()
Parse Multiple Records
# Fetch multiple as FASTA
handle = Entrez.efetch(db='nucleotide', id='NM_007294,NM_000059,NM_000546', rettype='fasta', retmode='text')
records = list(SeqIO.parse(handle, 'fasta'))
handle.close()
for record in records:
print(f"{record.id}: {len(record.seq)} bp")
Parse XML with Entrez.read()
# For structured data, use XML mode
handle = Entrez.efetch(db='gene', id='672', retmode='xml')
records = Entrez.read(handle)
handle.close()
# Navigate nested structure
gene = records[0]
print(f"Gene: {gene['Entrezgene_gene']['Gene-ref']['Gene-ref_locus']}")
Code Patterns
Fetch Sequence by Accession
from Bio import Entrez, SeqIO
Entrez.email = 'your.email@example.com'
def fetch_sequence(accession, db='nucleotide'):
handle = Entrez.efetch(db=db, id=accession, rettype='fasta', retmode='text')
record = SeqIO.read(handle, 'fasta')
handle.close()
return record
seq = fetch_sequence('NM_007294')
print(f"{seq.id}: {seq.seq[:50]}...")
Fetch GenBank with Features
def fetch_genbank(accession):
handle = Entrez.efetch(db='nucleotide', id=accession, rettype='gb', retmode='text')
record = SeqIO.read(handle, 'genbank')
handle.close()
return record
gb = fetch_genbank('NM_007294')
for feature in gb.features:
if feature.type == 'CDS':
print(f"CDS: {feature.location}")
print(f"Product: {feature.qualifiers.get('product', ['?'])[0]}")
Fetch PubMed Abstract
def fetch_abstract(pmid):
handle = Entrez.efetch(db='pubmed', id=pmid, rettype='abstract', retmode='text')
abstract = handle.read()
handle.close()
return abstract
abstract = fetch_abstract('35412348')
print(abstract)
Get Record Summaries
def get_summaries(db, ids):
if isinstance(ids, list):
ids = ','.join(ids)
handle = Entrez.esummary(db=db, id=ids)
records = Entrez.read(handle)
handle.close()
return records
summaries = get_summaries('nucleotide', ['NM_007294', 'NM_000059'])
for s in summaries:
print(f"{s['Caption']}: {s['Title'][:50]}... ({s['Length']} bp)")
Search Then Fetch
# Search for records
handle = Entrez.esearch(db='nucleotide', term='human[orgn] AND insulin[gene] AND mRNA[fkey]', retmax=5)
search_results = Entrez.read(handle)
handle.close()
ids = search_results['IdList']
# Fetch the sequences
handle = Entrez.efetch(db='nucleotide', id=','.join(ids), rettype='fasta', retmode='text')
records = list(SeqIO.parse(handle, 'fasta'))
handle.close()
for record in records:
print(f"{record.id}: {len(record.seq)} bp")
Fetch Protein by Gene ID
# Search gene database
handle = Entrez.esearch(db='gene', term='BRCA1[sym] AND human[orgn]')
result = Entrez.read(handle)
handle.close()
gene_id = result['IdList'][0]
# Get linked protein IDs
handle = Entrez.elink(dbfrom='gene', db='protein', id=gene_id)
links = Entrez.read(handle)
handle.close()
protein_ids = [link['Id'] for link in links[0]['LinkSetDb'][0]['Link'][:3]]
# Fetch proteins
handle = Entrez.efetch(db='protein', id=','.join(protein_ids), rettype='fasta', retmode='text')
proteins = list(SeqIO.parse(handle, 'fasta'))
handle.close()
Save Fetched Records to File
def download_sequences(ids, output_file, db='nucleotide', format='fasta'):
handle = Entrez.efetch(db=db, id=','.join(ids), rettype=format, retmode='text')
with open(output_file, 'w') as out:
out.write(handle.read())
handle.close()
download_sequences(['NM_007294', 'NM_000059'], 'brca_genes.fasta')
Common Errors
| Error | Cause | Solution |
|---|---|---|
HTTPError 400 |
Invalid ID or parameters | Verify ID exists, check rettype |
HTTPError 429 |
Rate limit exceeded | Add delays or use API key |
| Empty result | Record doesn't exist | Verify accession in web browser |
ValueError in SeqIO |
Wrong format specified | Match rettype with SeqIO format |
ExpatError |
XML parsing error | Use retmode='text' instead |
Decision Tree
Need to retrieve NCBI records?
├── Need full sequence?
│ └── Use efetch with rettype='fasta'
├── Need sequence + annotations?
│ └── Use efetch with rettype='gb' (GenBank)
├── Just need metadata (length, organism)?
│ └── Use esummary (faster)
├── Need PubMed abstract?
│ └── Use efetch with rettype='abstract'
├── Need structured data for parsing?
│ └── Use efetch with retmode='xml' + Entrez.read()
├── Downloading many records?
│ └── See batch-downloads skill
└── Need records from multiple databases?
└── See entrez-link skill first
Related Skills
- entrez-search - Find record IDs before fetching
- entrez-link - Find related records in other databases
- batch-downloads - Download large numbers of records efficiently
- sequence-io/read-sequences - Parse downloaded sequences with SeqIO
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
vcf-annotator
Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.
chemist-analyst
Analyzes events through chemistry lens using molecular structure, reaction mechanisms, thermodynamics, kinetics, and analytical techniques (spectroscopy, chromatography, mass spectrometry). Provides insights on chemical processes, material properties, reaction pathways, synthesis, and analytical methods. Use when: Chemical reactions, material analysis, synthesis planning, process optimization, environmental chemistry. Evaluates: Molecular structure, reaction mechanisms, yield, selectivity, safety, environmental impact.
bio-alignment-io
Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
sleep-analyzer
分析睡眠数据、识别睡眠模式、评估睡眠质量,并提供个性化睡眠改善建议。支持与其他健康数据的关联分析。
metabolomics-workbench-database
Access NIH Metabolomics Workbench via REST API (4,200+ studies). Query metabolites, RefMet nomenclature, MS/NMR data, m/z searches, study metadata, for metabolomics and biomarker discovery.
bio-hi-c-analysis-matrix-operations
Balance, normalize, and transform Hi-C contact matrices using cooler and cooltools. Apply iterative correction (ICE), compute expected values, and generate observed/expected matrices. Use when normalizing or transforming Hi-C matrices.
Didn't find tool you were looking for?