Agent skills
bio-batch-downloads

Agent skill

bio-batch-downloads

Stars 2,009

Forks 275

Install this agent skill to your Project

npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/bio-batch-downloads

SKILL.md

name: bio-batch-downloads description: Download large datasets from NCBI efficiently using history server, batching, and rate limiting. Use when performing bulk sequence downloads, handling large query results, or production-scale data retrieval. tool_type: python primary_tool: Bio.Entrez measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools:

read_file
run_shell_command

Batch Downloads

Download large numbers of records from NCBI efficiently using the history server, batching, and proper rate limiting.

Required Setup

python

from Bio import Entrez
import time

Entrez.email = 'your.email@example.com'  # Required by NCBI
Entrez.api_key = 'your_api_key'          # Recommended for large downloads

Rate Limits

Authentication	Requests/Second	Delay Between
Email only	3	0.34 seconds
Email + API key	10	0.1 seconds

Get an API key at: https://www.ncbi.nlm.nih.gov/account/settings/

History Server

The history server stores search results on NCBI servers, enabling efficient batch retrieval without re-sending large ID lists.

How It Works

Search with usehistory='y'
Get WebEnv (session ID) and query_key (result set ID)
Fetch results in batches using these identifiers
Results stay available for ~15 minutes

python

# Search with history
handle = Entrez.esearch(db='nucleotide', term='human[orgn] AND mRNA[fkey]', usehistory='y')
search = Entrez.read(handle)
handle.close()

webenv = search['WebEnv']
query_key = search['QueryKey']
total = int(search['Count'])

print(f"Found {total} records, stored in history")

Core Pattern: Batch Download

python

from Bio import Entrez, SeqIO
import time

Entrez.email = 'your.email@example.com'

def batch_download(db, term, output_file, rettype='fasta', batch_size=500):
    # Search with history
    handle = Entrez.esearch(db=db, term=term, usehistory='y')
    search = Entrez.read(handle)
    handle.close()

    webenv = search['WebEnv']
    query_key = search['QueryKey']
    total = int(search['Count'])

    print(f"Downloading {total} records...")

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            print(f"  Fetching {start+1}-{min(start+batch_size, total)}...")

            handle = Entrez.efetch(
                db=db,
                rettype=rettype,
                retmode='text',
                retstart=start,
                retmax=batch_size,
                webenv=webenv,
                query_key=query_key
            )
            out.write(handle.read())
            handle.close()

            time.sleep(0.34)  # Rate limiting (no API key)

    print(f"Saved to {output_file}")

Code Patterns

Download All Search Results

python

from Bio import Entrez
import time

Entrez.email = 'your.email@example.com'
Entrez.api_key = 'your_api_key'  # Optional

def download_search_results(db, term, output_file, rettype='fasta', batch_size=500):
    # Search with history server
    handle = Entrez.esearch(db=db, term=term, usehistory='y', retmax=0)
    search = Entrez.read(handle)
    handle.close()

    webenv = search['WebEnv']
    query_key = search['QueryKey']
    total = int(search['Count'])

    if total == 0:
        print("No records found")
        return

    delay = 0.1 if Entrez.api_key else 0.34

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            end = min(start + batch_size, total)
            print(f"Downloading {start+1}-{end} of {total}")

            attempts = 3
            for attempt in range(attempts):
                try:
                    handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                                           retstart=start, retmax=batch_size,
                                           webenv=webenv, query_key=query_key)
                    out.write(handle.read())
                    handle.close()
                    break
                except Exception as e:
                    if attempt < attempts - 1:
                        print(f"  Retry {attempt+1}: {e}")
                        time.sleep(5)
                    else:
                        raise

            time.sleep(delay)

    print(f"Downloaded {total} records to {output_file}")

download_search_results('nucleotide', 'human[orgn] AND insulin[gene] AND mRNA[fkey]', 'insulin_mrna.fasta')

Download by ID List

python

def download_by_ids(db, ids, output_file, rettype='fasta', batch_size=200):
    total = len(ids)
    delay = 0.1 if Entrez.api_key else 0.34

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            batch = ids[start:start+batch_size]
            print(f"Downloading {start+1}-{start+len(batch)} of {total}")

            handle = Entrez.efetch(db=db, id=','.join(batch), rettype=rettype, retmode='text')
            out.write(handle.read())
            handle.close()

            time.sleep(delay)

    print(f"Downloaded {total} records to {output_file}")

# Example with list of IDs
ids = ['NM_007294', 'NM_000059', 'NM_000546', 'NM_001126112', 'NM_004985']
download_by_ids('nucleotide', ids, 'genes.fasta')

Post IDs to History (EPost)

For very large ID lists, post them to the history server first:

python

def post_and_download(db, ids, output_file, rettype='fasta', batch_size=500):
    # Post IDs to history server
    handle = Entrez.epost(db=db, id=','.join(ids))
    result = Entrez.read(handle)
    handle.close()

    webenv = result['WebEnv']
    query_key = result['QueryKey']
    total = len(ids)

    delay = 0.1 if Entrez.api_key else 0.34

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            end = min(start + batch_size, total)
            print(f"Fetching {start+1}-{end} of {total}")

            handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                                   retstart=start, retmax=batch_size,
                                   webenv=webenv, query_key=query_key)
            out.write(handle.read())
            handle.close()

            time.sleep(delay)

    print(f"Downloaded {total} records")

Download GenBank with Progress

python

from Bio import Entrez, SeqIO
from io import StringIO
import time

def download_genbank_records(term, output_file, batch_size=100):
    Entrez.email = 'your.email@example.com'

    # Search
    handle = Entrez.esearch(db='nucleotide', term=term, usehistory='y')
    search = Entrez.read(handle)
    handle.close()

    webenv, query_key = search['WebEnv'], search['QueryKey']
    total = int(search['Count'])

    records = []
    for start in range(0, total, batch_size):
        print(f"Fetching {start+1}-{min(start+batch_size, total)} of {total}")

        handle = Entrez.efetch(db='nucleotide', rettype='gb', retmode='text',
                               retstart=start, retmax=batch_size,
                               webenv=webenv, query_key=query_key)
        batch_records = list(SeqIO.parse(handle, 'genbank'))
        handle.close()

        records.extend(batch_records)
        time.sleep(0.34)

    SeqIO.write(records, output_file, 'genbank')
    print(f"Saved {len(records)} GenBank records")
    return records

Download with Retry Logic

python

import time
from urllib.error import HTTPError

def robust_download(db, term, output_file, rettype='fasta', batch_size=500, max_retries=3):
    handle = Entrez.esearch(db=db, term=term, usehistory='y')
    search = Entrez.read(handle)
    handle.close()

    webenv, query_key = search['WebEnv'], search['QueryKey']
    total = int(search['Count'])
    delay = 0.1 if Entrez.api_key else 0.34

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            for retry in range(max_retries):
                try:
                    handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                                           retstart=start, retmax=batch_size,
                                           webenv=webenv, query_key=query_key)
                    data = handle.read()
                    handle.close()

                    if data.strip():
                        out.write(data)
                    break

                except HTTPError as e:
                    if e.code == 429:  # Rate limit
                        wait = 10 * (retry + 1)
                        print(f"Rate limited, waiting {wait}s...")
                        time.sleep(wait)
                    elif retry == max_retries - 1:
                        raise
                    else:
                        time.sleep(5)

            time.sleep(delay)

    print(f"Downloaded to {output_file}")

Stream to File (Memory Efficient)

python

def stream_download(db, term, output_file, rettype='fasta', batch_size=1000):
    handle = Entrez.esearch(db=db, term=term, usehistory='y')
    search = Entrez.read(handle)
    handle.close()

    webenv, query_key = search['WebEnv'], search['QueryKey']
    total = int(search['Count'])

    downloaded = 0
    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                                   retstart=start, retmax=batch_size,
                                   webenv=webenv, query_key=query_key)

            # Stream chunks to file
            while True:
                chunk = handle.read(8192)
                if not chunk:
                    break
                out.write(chunk)

            handle.close()
            downloaded = min(start + batch_size, total)
            print(f"Progress: {downloaded}/{total} ({100*downloaded/total:.1f}%)")
            time.sleep(0.34)

Batch Size Guidelines

Database	rettype	Recommended Batch
nucleotide	fasta	500-1000
nucleotide	gb	100-200
protein	fasta	500-1000
protein	gp	100-200
pubmed	abstract	1000-2000
pubmed	xml	200-500

Smaller batches for GenBank/XML (more data per record).

Common Errors

Error	Cause	Solution
`HTTPError 429`	Rate limit exceeded	Increase delay, use API key
`HTTPError 400`	Invalid WebEnv/query_key	Session expired, re-search
Incomplete data	Connection interrupted	Add retry logic
Memory error	Batch too large	Reduce batch_size
Empty response	No more records	Check total vs start

Decision Tree

Need to download many NCBI records?
├── Have search query?
│   └── Use esearch with usehistory='y', then batch efetch
├── Have list of IDs?
│   ├── < 200 IDs? → Direct efetch with comma-separated IDs
│   └── >= 200 IDs? → Use epost, then batch efetch
├── Need records as Biopython objects?
│   └── Parse each batch with SeqIO
├── Downloading > 10,000 records?
│   └── Use streaming to avoid memory issues
└── Getting rate limited?
    └── Get API key, add retry logic

Related Skills

entrez-search - Build search queries for batch downloads
entrez-fetch - Single record fetching
sra-data - For raw sequencing data (use SRA toolkit instead)

Maintainer

FreedomIntelligence Core maintainer

Source details

Full Name: FreedomIntelligence/OpenClaw-Medical-Skills
Branch: main
Path in repo: skills/bio-batch-downloads
Topics: claude-code skills openclaw awesome clawhub openclaw-skills medical nanoclaw

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

FreedomIntelligence/OpenClaw-Medical-Skills

vcf-annotator

Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.

2,009 275

Explore

FreedomIntelligence/OpenClaw-Medical-Skills

chemist-analyst

Analyzes events through chemistry lens using molecular structure, reaction mechanisms, thermodynamics, kinetics, and analytical techniques (spectroscopy, chromatography, mass spectrometry). Provides insights on chemical processes, material properties, reaction pathways, synthesis, and analytical methods. Use when: Chemical reactions, material analysis, synthesis planning, process optimization, environmental chemistry. Evaluates: Molecular structure, reaction mechanisms, yield, selectivity, safety, environmental impact.

2,009 275

Explore

FreedomIntelligence/OpenClaw-Medical-Skills

bio-alignment-io

Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.

2,009 275

Explore

FreedomIntelligence/OpenClaw-Medical-Skills

sleep-analyzer

分析睡眠数据、识别睡眠模式、评估睡眠质量，并提供个性化睡眠改善建议。支持与其他健康数据的关联分析。

2,009 275

Explore

FreedomIntelligence/OpenClaw-Medical-Skills

metabolomics-workbench-database

Access NIH Metabolomics Workbench via REST API (4,200+ studies). Query metabolites, RefMet nomenclature, MS/NMR data, m/z searches, study metadata, for metabolomics and biomarker discovery.

2,009 275

Explore

FreedomIntelligence/OpenClaw-Medical-Skills

bio-hi-c-analysis-matrix-operations

Balance, normalize, and transform Hi-C contact matrices using cooler and cooltools. Apply iterative correction (ICE), compute expected values, and generate observed/expected matrices. Use when normalizing or transforming Hi-C matrices.

2,009 275

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Batch Downloads

Required Setup

Rate Limits

History Server

How It Works

Core Pattern: Batch Download

Code Patterns

Download All Search Results

Download by ID List

Post IDs to History (EPost)

Download GenBank with Progress

Download with Retry Logic

Stream to File (Memory Efficient)

Batch Size Guidelines

Common Errors

Decision Tree

Related Skills

Recommended Agent Skills

vcf-annotator

chemist-analyst

bio-alignment-io

sleep-analyzer

metabolomics-workbench-database

bio-hi-c-analysis-matrix-operations