Agent skill
bio-compressed-files
Read and write compressed sequence files (gzip, bzip2, BGZF) using Biopython. Use when working with .gz or .bz2 sequence files. Use BGZF for indexable compressed files.
Version Compatibility
Reference examples tested with: BioPython 1.83+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: pip show <package>, then help(module.function) to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
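The version and signature check described above can be sketched with the standard library alone. The helper name installed_version is hypothetical, 'biopython' is assumed to be the distribution name as published on PyPI, and gzip.open stands in for whichever function you intend to call.

```python
import gzip
import inspect
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints None when the distribution is not installed in this environment.
print('biopython:', installed_version('biopython'))

# Inspect the real signature before relying on a keyword argument.
# gzip.open is only a stand-in for the function you plan to call.
signature = inspect.signature(gzip.open)
print(signature)
has_encoding = 'encoding' in signature.parameters
```

If the expected parameter is missing from the printed signature, adapt the call to the installed API instead of retrying the same example.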
Compressed Files
Handle gzip, bzip2, and BGZF compressed sequence files with Biopython.
"Read a compressed sequence file" → Open a compressed file handle in text mode, then parse with the standard SeqIO interface.
- gzip: gzip.open(path, 'rt') (Python stdlib)
- bzip2: bz2.open(path, 'rt') (Python stdlib)
- BGZF: bgzf.open(path, 'rt') (BioPython); gzip.open(path, 'rt') also works because BGZF is valid gzip
"Make a compressed file indexable" → Convert to BGZF format. Only BGZF supports SeqIO.index() on compressed data.
Required Imports
import gzip
import bz2
from Bio import SeqIO
from Bio import bgzf
Reading Compressed Files
Goal: Parse sequence records from compressed files without decompressing to disk.
Approach: Open a decompression handle in text mode ('rt'), then pass the handle to SeqIO.parse(). The parser works identically to uncompressed input.
Gzip (.gz) (BioPython 1.83+)
with gzip.open('sequences.fasta.gz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id, len(record.seq))
Important: Use 'rt' (read text) mode, not 'rb' (read binary).
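To see why the mode matters, the sketch below writes a small gzip file to a temporary directory and reads it back in both modes, using only the standard library. The file name and contents are made up for illustration.

```python
import gzip
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'demo.fasta.gz'  # hypothetical file name
    with gzip.open(path, 'wt') as handle:
        handle.write('>seq1\nACGT\n')

    # Text mode decodes to str, which SeqIO's text-based parsers expect.
    with gzip.open(path, 'rt') as handle:
        text_line = handle.readline()

    # Binary mode yields bytes, which text-based parsers cannot consume.
    with gzip.open(path, 'rb') as handle:
        binary_line = handle.readline()

print(type(text_line).__name__, repr(text_line))
print(type(binary_line).__name__, repr(binary_line))
```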
Bzip2 (.bz2) (BioPython 1.83+)
with bz2.open('sequences.fasta.bz2', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id, len(record.seq))
BGZF (Block Gzip) (BioPython 1.83+)
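BGZF files are valid gzip, so they can be read through either handle type. Passing the path straight to SeqIO.parse() does not decompress it; only SeqIO.index() and SeqIO.index_db() detect BGZF from a filename.
# BGZF is gzip-compatible, so the stdlib gzip module can read it
with gzip.open('sequences.fasta.bgz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id)
# Or read through Biopython's BGZF reader
with bgzf.open('sequences.fasta.bgz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id)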
Writing Compressed Files
Goal: Save sequence records directly to compressed files without an intermediate uncompressed step.
Approach: Open a compression handle in text mode ('wt'), then pass it to SeqIO.write().
Gzip (.gz)
with gzip.open('output.fasta.gz', 'wt') as handle:
    SeqIO.write(records, handle, 'fasta')
Bzip2 (.bz2)
with bz2.open('output.fasta.bz2', 'wt') as handle:
    SeqIO.write(records, handle, 'fasta')
BGZF (.bgz)
with bgzf.open('output.fasta.bgz', 'wt') as handle:
    SeqIO.write(records, handle, 'fasta')
BGZF: Indexable Compression
Goal: Enable random access to records in compressed sequence files.
Approach: Write sequences in BGZF (Block GZip Format) — the only compressed format supporting SeqIO.index() and SeqIO.index_db(). BGZF is a gzip variant used by BAM and tabix-indexed files.
Create Indexable Compressed File
from Bio import SeqIO, bgzf
records = SeqIO.parse('input.fasta', 'fasta')
with bgzf.open('output.fasta.bgz', 'wt') as handle:
    SeqIO.write(records, handle, 'fasta')
Index a BGZF File
records = SeqIO.index('sequences.fasta.bgz', 'fasta')
seq = records['target_id'].seq
records.close()
records = SeqIO.index_db('index.sqlite', 'sequences.fasta.bgz', 'fasta')
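A minimal end-to-end sketch of the indexing behavior, assuming Biopython is installed. The records and file names are made up for illustration. It writes the same records as BGZF and as plain gzip, then shows that random access works on the BGZF copy while SeqIO.index() is expected to reject the plain gzip copy with a ValueError.

```python
import gzip
import tempfile
from pathlib import Path

from Bio import SeqIO, bgzf
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up records for illustration only.
records = [
    SeqRecord(Seq('ACGTACGT'), id='seq1', description=''),
    SeqRecord(Seq('GGGCCC'), id='seq2', description=''),
]

with tempfile.TemporaryDirectory() as tmp:
    bgzf_path = str(Path(tmp) / 'demo.fasta.bgz')
    gzip_path = str(Path(tmp) / 'demo.fasta.gz')

    with bgzf.open(bgzf_path, 'wt') as handle:
        SeqIO.write(records, handle, 'fasta')
    with gzip.open(gzip_path, 'wt') as handle:
        SeqIO.write(records, handle, 'fasta')

    # BGZF: random access by record ID works on the compressed file.
    index = SeqIO.index(bgzf_path, 'fasta')
    fetched = str(index['seq2'].seq)
    index.close()

    # Plain gzip: indexing is expected to be rejected.
    try:
        SeqIO.index(gzip_path, 'fasta')
        gzip_indexable = True
    except ValueError:
        gzip_indexable = False

print(fetched, gzip_indexable)
```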
Convert Gzip to BGZF
"Convert gzip to indexable format" → Parse from gzip handle, write through BGZF handle.
from Bio import SeqIO, bgzf
import gzip
with gzip.open('input.fasta.gz', 'rt') as in_handle:
    with bgzf.open('output.fasta.bgz', 'wt') as out_handle:
        SeqIO.write(SeqIO.parse(in_handle, 'fasta'), out_handle, 'fasta')
Code Patterns
Read Gzipped FASTQ
with gzip.open('reads.fastq.gz', 'rt') as handle:
    records = list(SeqIO.parse(handle, 'fastq'))
print(f'Loaded {len(records)} reads')
Count Records in Gzipped File
with gzip.open('sequences.fasta.gz', 'rt') as handle:
    count = sum(1 for _ in SeqIO.parse(handle, 'fasta'))
print(f'{count} sequences')
Fast Count with Low-Level Parser
from Bio.SeqIO.FastaIO import SimpleFastaParser
import gzip
with gzip.open('sequences.fasta.gz', 'rt') as handle:
    count = sum(1 for _ in SimpleFastaParser(handle))
Convert Compressed to Uncompressed
with gzip.open('input.fasta.gz', 'rt') as in_handle:
    records = SeqIO.parse(in_handle, 'fasta')
    # SeqIO.parse is lazy, so write while the input handle is still open
    SeqIO.write(records, 'output.fasta', 'fasta')
Convert Uncompressed to Compressed
records = SeqIO.parse('input.fasta', 'fasta')
with gzip.open('output.fasta.gz', 'wt') as out_handle:
    SeqIO.write(records, out_handle, 'fasta')
Auto-Detect Compression
from pathlib import Path
from Bio import SeqIO, bgzf
import gzip
import bz2
def open_sequence_file(filepath, format):
    filepath = Path(filepath)
    suffix = filepath.suffix.lower()
    if suffix == '.gz':
        # Could be gzip or BGZF - gzip.open() reads both; bgzf.open() rejects plain gzip
        handle = gzip.open(filepath, 'rt')
    elif suffix == '.bgz':
        handle = bgzf.open(filepath, 'rt')
    elif suffix == '.bz2':
        handle = bz2.open(filepath, 'rt')
    else:
        handle = open(filepath, 'r')
    # The handle stays open while the returned iterator is consumed
    return SeqIO.parse(handle, format)
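File extensions can be wrong or ambiguous (a .gz file may be plain gzip or BGZF), so detection by content is sometimes more reliable than detection by suffix. The following is a minimal sketch using only the standard library. The helper sniff_compression is hypothetical and not part of Biopython, and it assumes the BGZF 'BC' subfield is the first extra subfield in the header, which is how BGZF blocks are normally written.

```python
import bz2
import gzip
import lzma
import tempfile
from pathlib import Path


def sniff_compression(path):
    """Guess the compression type from the first bytes of a file."""
    with open(path, 'rb') as handle:
        head = handle.read(18)
    if head[:2] == b'\x1f\x8b':
        # BGZF blocks set the FEXTRA flag (0x04) and carry a 'BC' extra subfield.
        if len(head) >= 14 and head[3] & 0x04 and head[12:14] == b'BC':
            return 'bgzf'
        return 'gzip'
    if head[:3] == b'BZh':
        return 'bz2'
    if head[:6] == b'\xfd7zXZ\x00':
        return 'xz'
    return 'plain'


# The 28-byte empty BGZF block that marks end-of-file.
BGZF_EOF = (
    b'\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00'
    b'BC\x02\x00\x1b\x00\x03\x00' + b'\x00' * 8
)

fasta = b'>seq1\nACGT\n'
samples = {
    'plain.fasta': fasta,
    'reads.gz': gzip.compress(fasta),
    'reads.bz2': bz2.compress(fasta),
    'reads.xz': lzma.compress(fasta),
    'empty.bgz': BGZF_EOF,
}

detected = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in samples.items():
        path = Path(tmp) / name
        path.write_bytes(payload)
        detected[name] = sniff_compression(path)

print(detected)
```

The returned label can then select the matching opener (gzip.open, bgzf.open, bz2.open, lzma.open, or the built-in open) before the handle is passed to SeqIO.parse().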
Process Large Gzipped File (Memory Efficient)
with gzip.open('large.fastq.gz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fastq'):
        if len(record.seq) >= 100:
            # process() is a placeholder for your own per-record function
            process(record)
Compress Existing File (Raw Copy)
import gzip
import shutil
with open('sequences.fasta', 'rb') as f_in:
    with gzip.open('sequences.fasta.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
Compression Comparison
| Format | Extension | Indexable | Speed | Compression |
|---|---|---|---|---|
| Gzip | .gz | No | Fast | Good |
| BGZF | .bgz | Yes | Fast | Good |
| Bzip2 | .bz2 | No | Slow | Better |
| LZMA | .xz | No | Slowest | Best |
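The rankings in this table are general tendencies, and actual results depend on the data. A rough way to check them on your own data is sketched below with the standard library codecs. The synthetic random DNA is an assumption for illustration only, and BGZF is omitted because it requires Biopython.

```python
import bz2
import gzip
import lzma
import random
import time

# Synthetic DNA stands in for real data; real files will compress differently.
random.seed(0)
payload = ''.join(random.choice('ACGT') for _ in range(200_000)).encode('ascii')

codecs = {'gzip': gzip, 'bz2': bz2, 'xz': lzma}
results = {}
for name, module in codecs.items():
    start = time.perf_counter()
    packed = module.compress(payload)
    elapsed = time.perf_counter() - start
    # Sanity check: the round trip must be lossless.
    assert module.decompress(packed) == payload
    results[name] = (len(packed), elapsed)

for name, (size, elapsed) in results.items():
    ratio = size / len(payload)
    print(f'{name:5s} {size:8d} bytes  {ratio:6.1%}  {elapsed * 1000:7.1f} ms')
```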
When to Use Each Format
| Use Case | Recommended Format |
|---|---|
| Archive (no random access needed) | gzip or bzip2 |
| Need to index compressed file | BGZF |
| BAM files and tabix | BGZF (native) |
| Maximum compression | bzip2 or xz |
| Best speed | gzip or BGZF |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| StreamModeError (reading) or TypeError: a bytes-like object is required (writing) | Used binary mode ('rb' or 'wb') | Use text mode ('rt' or 'wt') |
| UnicodeDecodeError | Wrong encoding | Try gzip.open(file, 'rt', encoding='latin-1') |
| gzip.BadGzipFile | Not a gzip file | Check file extension matches actual format |
| OSError: Not a gzipped file | Corrupt or wrong format | Verify file integrity |
| SeqIO.index() fails on .gz | Regular gzip not indexable | Convert to BGZF first |
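For the "verify file integrity" step, one option is to decompress the whole stream and discard the output, similar in spirit to gzip -t. This is a minimal standard-library sketch. The helper gzip_is_intact and the demo file names are hypothetical.

```python
import gzip
import tempfile
import zlib
from pathlib import Path


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses without error."""
    try:
        with gzip.open(path, 'rb') as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (wrong magic bytes or CRC mismatch);
        # EOFError signals a truncated stream.
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / 'good.fasta.gz'
    truncated = Path(tmp) / 'truncated.fasta.gz'
    not_gzip = Path(tmp) / 'mislabeled.fasta.gz'

    data = gzip.compress(b'>seq1\n' + b'ACGT' * 1000 + b'\n')
    good.write_bytes(data)
    truncated.write_bytes(data[: len(data) // 2])  # simulate an interrupted copy
    not_gzip.write_bytes(b'>seq1\nACGT\n')  # plain text with a .gz suffix

    status = {p.name: gzip_is_intact(p) for p in (good, truncated, not_gzip)}

print(status)
```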
Decision Tree
Working with compressed sequence files?
├── Just reading sequentially?
│ └── Use gzip.open() or bz2.open() with 'rt' mode
├── Need to index the compressed file?
│ └── Convert to BGZF, then use SeqIO.index()
├── Writing compressed output?
│ ├── Will need to index later? → Use bgzf.open()
│ └── Just archiving? → Use gzip.open() or bz2.open()
└── Converting between formats?
└── Parse with SeqIO, write to new handle
Related Skills
- read-sequences - Core parsing functions used with compressed handles
- write-sequences - Write to compressed output files
- batch-processing - Process multiple compressed files
- alignment-files - BAM files use BGZF natively; samtools handles compression