Agent skill
bio-filter-sequences
Filter and select sequences by criteria (length, ID, GC content, patterns) using Biopython. Use when subsetting sequences, removing unwanted records, or selecting by specific criteria.
Install this agent skill to your Project
npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/bio-filter-sequences
SKILL.md
Version Compatibility
Reference examples tested with: BioPython 1.83+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
pip show <package>thenhelp(module.function)to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Filter Sequences
"Filter sequences by length, quality, or content" → Apply boolean criteria to a stream of sequence records and write survivors to output.
- Python: generator expression with
SeqIO.parse()+SeqIO.write()(BioPython) - CLI:
seqkit seq -m 200(SeqKit) orawkon FASTA
Filter and select sequences based on various criteria using Biopython.
Required Imports
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction
Core Pattern
Use generator expressions for memory-efficient filtering:
records = SeqIO.parse('input.fasta', 'fasta')
filtered = (rec for rec in records if len(rec.seq) >= 100)
SeqIO.write(filtered, 'output.fasta', 'fasta')
Filter by Length
Minimum Length
records = SeqIO.parse('input.fasta', 'fasta')
long_seqs = (rec for rec in records if len(rec.seq) >= 500)
SeqIO.write(long_seqs, 'long.fasta', 'fasta')
Length Range
records = SeqIO.parse('input.fasta', 'fasta')
sized = (rec for rec in records if 100 <= len(rec.seq) <= 1000)
SeqIO.write(sized, 'sized.fasta', 'fasta')
Remove Short Sequences
min_length = 200
records = SeqIO.parse('input.fasta', 'fasta')
filtered = (rec for rec in records if len(rec.seq) >= min_length)
count = SeqIO.write(filtered, 'filtered.fasta', 'fasta')
Filter by ID
Select Specific IDs
wanted_ids = {'seq1', 'seq2', 'seq3'}
records = SeqIO.parse('input.fasta', 'fasta')
selected = (rec for rec in records if rec.id in wanted_ids)
SeqIO.write(selected, 'selected.fasta', 'fasta')
Select from ID File
Goal: Extract sequences whose IDs appear in an external list file.
Approach: Load IDs into a set for O(1) lookup, then stream-filter and write matches.
Reference (BioPython 1.83+):
with open('ids.txt') as f:
wanted_ids = {line.strip() for line in f}
records = SeqIO.parse('input.fasta', 'fasta')
selected = (rec for rec in records if rec.id in wanted_ids)
SeqIO.write(selected, 'selected.fasta', 'fasta')
Exclude Specific IDs
exclude_ids = {'bad_seq1', 'bad_seq2'}
records = SeqIO.parse('input.fasta', 'fasta')
kept = (rec for rec in records if rec.id not in exclude_ids)
SeqIO.write(kept, 'kept.fasta', 'fasta')
Filter by ID Pattern
import re
pattern = re.compile(r'^chr\d+$') # Match chr1, chr2, etc.
records = SeqIO.parse('input.fasta', 'fasta')
chromosomes = (rec for rec in records if pattern.match(rec.id))
SeqIO.write(chromosomes, 'chromosomes.fasta', 'fasta')
Filter by GC Content
from Bio.SeqUtils import gc_fraction
records = SeqIO.parse('input.fasta', 'fasta')
moderate_gc = (rec for rec in records if 0.4 <= gc_fraction(rec.seq) <= 0.6)
SeqIO.write(moderate_gc, 'moderate_gc.fasta', 'fasta')
High GC Sequences
high_gc = (rec for rec in records if gc_fraction(rec.seq) >= 0.6)
Low GC Sequences
low_gc = (rec for rec in records if gc_fraction(rec.seq) <= 0.4)
Filter by Sequence Content
Remove Sequences with N's
records = SeqIO.parse('input.fasta', 'fasta')
clean = (rec for rec in records if 'N' not in str(rec.seq).upper())
SeqIO.write(clean, 'clean.fasta', 'fasta')
Limit N Content
def n_fraction(seq):
return str(seq).upper().count('N') / len(seq)
records = SeqIO.parse('input.fasta', 'fasta')
low_n = (rec for rec in records if n_fraction(rec.seq) < 0.05)
Contains Specific Motif
motif = 'GAATTC' # EcoRI site
records = SeqIO.parse('input.fasta', 'fasta')
with_motif = (rec for rec in records if motif in str(rec.seq).upper())
SeqIO.write(with_motif, 'with_ecori.fasta', 'fasta')
Regex Pattern in Sequence
import re
pattern = re.compile(r'ATG.{30,100}T(AA|AG|GA)') # ORF-like pattern
records = SeqIO.parse('input.fasta', 'fasta')
matches = (rec for rec in records if pattern.search(str(rec.seq)))
Filter by Description
Description Contains Keyword
records = SeqIO.parse('input.fasta', 'fasta')
kinases = (rec for rec in records if 'kinase' in rec.description.lower())
SeqIO.write(kinases, 'kinases.fasta', 'fasta')
Multiple Keywords (OR)
keywords = ['kinase', 'phosphatase', 'transferase']
records = SeqIO.parse('input.fasta', 'fasta')
enzymes = (rec for rec in records if any(k in rec.description.lower() for k in keywords))
Combine Multiple Filters
Goal: Remove sequences that fail any of several quality/content thresholds.
Approach: Define a predicate function that checks all criteria, apply it as a generator filter, and write survivors.
Reference (BioPython 1.83+):
from Bio.SeqUtils import gc_fraction
def passes_filters(record):
if len(record.seq) < 100:
return False
if gc_fraction(record.seq) < 0.3 or gc_fraction(record.seq) > 0.7:
return False
if 'N' in str(record.seq).upper():
return False
return True
records = SeqIO.parse('input.fasta', 'fasta')
filtered = (rec for rec in records if passes_filters(rec))
SeqIO.write(filtered, 'filtered.fasta', 'fasta')
Sample Sequences
Random Sample (requires loading all)
import random
records = list(SeqIO.parse('input.fasta', 'fasta'))
sample = random.sample(records, min(100, len(records)))
SeqIO.write(sample, 'sample.fasta', 'fasta')
First N Sequences
from itertools import islice
records = SeqIO.parse('input.fasta', 'fasta')
first_100 = islice(records, 100)
SeqIO.write(first_100, 'first100.fasta', 'fasta')
Every Nth Sequence
records = SeqIO.parse('input.fasta', 'fasta')
every_10th = (rec for i, rec in enumerate(records) if i % 10 == 0)
SeqIO.write(every_10th, 'sampled.fasta', 'fasta')
Split by Criteria
Split by Length
Goal: Partition sequences into separate files based on a length threshold.
Approach: Load all records, apply list comprehension split, and write each partition.
Reference (BioPython 1.83+):
records = list(SeqIO.parse('input.fasta', 'fasta'))
short = [r for r in records if len(r.seq) < 500]
long = [r for r in records if len(r.seq) >= 500]
SeqIO.write(short, 'short.fasta', 'fasta')
SeqIO.write(long, 'long.fasta', 'fasta')
Common Errors
| Error | Cause | Solution |
|---|---|---|
| Generator exhausted | Used generator twice | Re-create generator or use list() |
| Empty output | Filter too strict | Check filter conditions |
| Memory error | List too large | Use generator expressions |
Related Skills
- read-sequences - Parse sequences before filtering
- write-sequences - Write filtered sequences to output
- fastq-quality - Filter FASTQ by quality scores
- paired-end-fastq - Synchronized filtering of paired reads
- sequence-manipulation/motif-search - Filter by complex motif patterns
- alignment-files - Filter aligned reads with samtools view -f/-F
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
vcf-annotator
Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.
chemist-analyst
Analyzes events through chemistry lens using molecular structure, reaction mechanisms, thermodynamics, kinetics, and analytical techniques (spectroscopy, chromatography, mass spectrometry). Provides insights on chemical processes, material properties, reaction pathways, synthesis, and analytical methods. Use when: Chemical reactions, material analysis, synthesis planning, process optimization, environmental chemistry. Evaluates: Molecular structure, reaction mechanisms, yield, selectivity, safety, environmental impact.
bio-alignment-io
Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
sleep-analyzer
分析睡眠数据、识别睡眠模式、评估睡眠质量,并提供个性化睡眠改善建议。支持与其他健康数据的关联分析。
metabolomics-workbench-database
Access NIH Metabolomics Workbench via REST API (4,200+ studies). Query metabolites, RefMet nomenclature, MS/NMR data, m/z searches, study metadata, for metabolomics and biomarker discovery.
bio-hi-c-analysis-matrix-operations
Balance, normalize, and transform Hi-C contact matrices using cooler and cooltools. Apply iterative correction (ICE), compute expected values, and generate observed/expected matrices. Use when normalizing or transforming Hi-C matrices.
Didn't find tool you were looking for?