Agent skill
fastq-analysis-pipeline
Guide through omicverse's alignment module for SRA downloading, FASTQ quality control, STAR alignment, gene quantification, and single-cell kallisto/bustools pipelines covering both bulk and single-cell RNA-seq workflows.
Install this agent skill to your Project
npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/fastq-analysis
SKILL.md
Overview
OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the ov.alignment module. This skill covers:
- SRA data acquisition:
prefetchandfqdump(fasterq-dump wrapper) - Quality control:
fastpfor adapter trimming and QC reports - RNA-seq alignment:
STARaligner with auto-index building - Gene quantification:
featureCount(subread featureCounts wrapper) - Single-cell path:
refandcountvia kb-python (kallisto/bustools) - Parallel SRA download:
parallel_fastq_dump
All functions share a common CLI infrastructure (_cli_utils.py) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.
Instructions
-
Environment setup
- Bioinformatics tools are resolved automatically from PATH or the active conda environment.
- If
auto_install=True(default), missing tools are installed via mamba/conda on demand. - Supported tools:
prefetch,vdb-validate,fasterq-dump,fastp,STAR,samtools,featureCounts,pigz,gzip. - For the single-cell path, ensure
kb-pythonis installed:pip install kb-python.
-
SRA data download (
ov.alignment.prefetch+ov.alignment.fqdump)- Use
prefetchfirst for reliable downloads with integrity validation (vdb-validate). - Then convert to FASTQ with
fqdump. It auto-detects single-end vs paired-end. fqdumpcan also work directly from SRR accessions without prefetch.- Both support retry with exponential backoff for network errors.
pythonimport omicverse as ov # Step 1: Prefetch SRA files (optional but recommended) pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4) # Step 2: Convert to FASTQ fq = ov.alignment.fqdump(['SRR1234567', 'SRR1234568'], output_dir='fastq', sra_dir='prefetch', gzip=True, threads=8, jobs=4) - Use
-
FASTQ quality control (
ov.alignment.fastp)- Runs fastp for adapter trimming, quality filtering, and QC reporting.
- Supports single-end and paired-end reads.
- Produces per-sample JSON and HTML QC reports.
- Sample format: tuple of
(sample_name, fq1_path, fq2_path_or_None).
pythonsamples = [ ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'), ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'), ] clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2) -
STAR alignment (
ov.alignment.STAR)- Aligns FASTQ reads using the STAR aligner.
- Auto-index building: set
auto_index=True(default) withgenome_fasta_filesandgtfto build index automatically if missing. - Produces coordinate-sorted BAM files.
- Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).
- Use
strict=False(default) for graceful error handling per sample.
python# Prepare samples from fastp output star_samples = [ ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'), ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'), ] bams = ov.alignment.STAR( star_samples, genome_dir='star_index', output_dir='star_out', gtf='genes.gtf', genome_fasta_files=['genome.fa'], threads=8, memory='50G', ) -
Gene quantification (
ov.alignment.featureCount)- Counts aligned reads per gene using featureCounts (subread).
- Auto-detects paired-end from BAM headers (via pysam or samtools).
auto_fix=True(default) retries with corrected paired-end flag on error.gene_mapping=Truemaps gene_id to gene_name from the GTF.merge_matrix=Trueproduces a combined count matrix across all samples.
pythonbam_items = [ ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'), ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'), ] counts = ov.alignment.featureCount( bam_items, gtf='genes.gtf', output_dir='counts', gene_mapping=True, merge_matrix=True, threads=8, ) # counts is a pandas DataFrame (gene_id x samples) -
Single-cell path (
ov.alignment.ref+ov.alignment.count)- Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.
ref()builds a kallisto index and transcript-to-gene mapping.count()quantifies single-cell data with barcode/UMI handling.- Supports technologies: 10XV2, 10XV3, BULK, and custom.
- Output formats: h5ad, loom, cellranger MTX.
python# Build reference index ref_result = ov.alignment.ref( index_path='kb_ref/index.idx', t2g_path='kb_ref/t2g.txt', fasta_paths=['genome.fa'], gtf_paths=['genes.gtf'], threads=8, ) # Quantify 10x v3 data count_result = ov.alignment.count( index_path='kb_ref/index.idx', t2g_path='kb_ref/t2g.txt', technology='10XV3', fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'], output_path='kb_out', h5ad=True, filter_barcodes=True, threads=8, ) -
Wiring fastp output into STAR input
- fastp output is a list of dicts with keys:
sample,clean1,clean2,json,html. - Convert to STAR sample tuples:
pythonstar_samples = [ (r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None) for r in (clean if isinstance(clean, list) else [clean]) ] - fastp output is a list of dicts with keys:
-
Wiring STAR output into featureCount input
- STAR output is a list of dicts with keys:
sample,bam(orerror). - Convert to featureCount items:
pythonbam_items = [ (r['sample'], r['bam']) for r in (bams if isinstance(bams, list) else [bams]) if 'bam' in r ] - STAR output is a list of dicts with keys:
-
Skipping completed steps
- All functions check for existing outputs and skip if
overwrite=False(default). - Set
overwrite=Trueto force re-execution.
- All functions check for existing outputs and skip if
-
Troubleshooting
- If a tool is not found, check
auto_install=Trueand that conda/mamba is accessible. - For STAR index errors, ensure
genome_fasta_filespoints to uncompressed or gzip FASTA files. - For featureCounts paired-end detection errors,
auto_fix=Truehandles most cases automatically. - GTF files can be gzip-compressed; they are auto-decompressed as needed.
- If a tool is not found, check
Critical API Reference
Sample Format Convention
All alignment functions use a consistent sample tuple format:
- FASTQ samples:
(sample_name, fq1_path, fq2_path_or_None) - BAM items:
(sample_name, bam_path)or(sample_name, bam_path, is_paired_bool) - Single samples can be passed as a single tuple; multiple as a list of tuples.
- When a single tuple is passed, the return value is a single dict; for a list, a list of dicts.
Auto-installation
# All functions support these parameters:
auto_install=True # Auto-install missing tools via conda/mamba
overwrite=False # Skip if outputs already exist
threads=8 # Per-tool thread count
jobs=None # Concurrent job count (auto-detected from CPU count)
Examples
- Bulk RNA-seq from SRA:
prefetch->fqdump->fastp->STAR->featureCount-> pandas DataFrame - Single-cell 10x v3:
ref->countwithtechnology='10XV3'-> h5ad AnnData - Local FASTQ files: Skip download steps, start directly with
fastp->STAR->featureCount
References
- See reference.md for copy-paste-ready code templates.
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
vcf-annotator
Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.
chemist-analyst
Analyzes events through chemistry lens using molecular structure, reaction mechanisms, thermodynamics, kinetics, and analytical techniques (spectroscopy, chromatography, mass spectrometry). Provides insights on chemical processes, material properties, reaction pathways, synthesis, and analytical methods. Use when: Chemical reactions, material analysis, synthesis planning, process optimization, environmental chemistry. Evaluates: Molecular structure, reaction mechanisms, yield, selectivity, safety, environmental impact.
bio-alignment-io
Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
sleep-analyzer
分析睡眠数据、识别睡眠模式、评估睡眠质量,并提供个性化睡眠改善建议。支持与其他健康数据的关联分析。
metabolomics-workbench-database
Access NIH Metabolomics Workbench via REST API (4,200+ studies). Query metabolites, RefMet nomenclature, MS/NMR data, m/z searches, study metadata, for metabolomics and biomarker discovery.
bio-hi-c-analysis-matrix-operations
Balance, normalize, and transform Hi-C contact matrices using cooler and cooltools. Apply iterative correction (ICE), compute expected values, and generate observed/expected matrices. Use when normalizing or transforming Hi-C matrices.
Didn't find tool you were looking for?