Read Sequences

Read biological sequence data from files using Biopython's Bio.SeqIO module.

Required Import

python

from Bio import SeqIO

Core Functions

SeqIO.parse() - Multiple Records (Iterator)

Use for files with one or more sequences. Returns an iterator of SeqRecord objects.

python

for record in SeqIO.parse('sequences.fasta', 'fasta'):
    print(record.id, len(record.seq))

Important: Always specify the format explicitly as the second argument.

SeqIO.read() - Single Record Only

Use when file contains exactly one sequence. Raises error if zero or multiple records.

python

record = SeqIO.read('single.fasta', 'fasta')

SeqIO.to_dict() - Load All Into Memory

Use for random access by record ID. Loads entire file into memory.

python

records = SeqIO.to_dict(SeqIO.parse('sequences.fasta', 'fasta'))
seq = records['sequence_id'].seq

SeqIO.index() - Large File Random Access

Use for large files when you need random access without loading everything into memory.

python

records = SeqIO.index('large.fasta', 'fasta')
seq = records['sequence_id'].seq
records.close()

SeqIO.index_db() - SQLite-Backed Indexing

Use for very large files or multiple files. Creates persistent SQLite index.

python

# Create index (first time - parses file)
records = SeqIO.index_db('index.sqlite', 'large.fasta', 'fasta')
seq = records['sequence_id'].seq
records.close()

# Reuse existing index (instant load)
records = SeqIO.index_db('index.sqlite')

# Index multiple files together
records = SeqIO.index_db('combined.sqlite', ['file1.fasta', 'file2.fasta'], 'fasta')

Advantages over index():

Persistent index survives program restarts
Can index multiple files as one database
Lower memory for extremely large files
SQLite file can be shared across processes

High-Performance Parsing

For maximum throughput on large files, use low-level parsers (3-6x faster than SeqIO.parse):

SimpleFastaParser

python

from Bio.SeqIO.FastaIO import SimpleFastaParser

with open('large.fasta') as handle:
    for title, sequence in SimpleFastaParser(handle):
        if len(sequence) > 1000:
            print(title.split()[0])  # First word is usually ID

Returns (title, sequence) tuples as strings (no SeqRecord overhead).

FastqGeneralIterator

python

from Bio.SeqIO.QualityIO import FastqGeneralIterator

with open('reads.fastq') as handle:
    for title, sequence, quality in FastqGeneralIterator(handle):
        avg_qual = sum(ord(c) - 33 for c in quality) / len(quality)

Returns (title, sequence, quality_string) tuples.

Common Formats

Format	String	Typical Extension	Notes
FASTA	`'fasta'`	.fasta, .fa, .fna, .faa	Most common
FASTA 2-line	`'fasta-2line'`	.fasta	One line per sequence (no wrapping)
FASTQ	`'fastq'`	.fastq, .fq	With quality scores
FASTQ Solexa	`'fastq-solexa'`	.fastq	Old Solexa/Illumina (pre-1.3)
FASTQ Illumina	`'fastq-illumina'`	.fastq	Illumina 1.3-1.7
GenBank	`'genbank'` or `'gb'`	.gb, .gbk	With features/annotations
EMBL	`'embl'`	.embl	European format with features
Swiss-Prot	`'swiss'`	.dat	UniProt format

Specialized Formats

Format	String	Use Case
ABI	`'abi'`	Sanger sequencing trace files (.ab1)
ABI Trimmed	`'abi-trim'`	ABI with low-quality ends trimmed
SFF	`'sff'`	454/Ion Torrent flowgram data
SFF Trimmed	`'sff-trim'`	SFF with adapter/quality trimming
QUAL	`'qual'`	Quality scores file (pairs with FASTA)
PHD	`'phd'`	Phred/Phrap/Consed output
ACE	`'ace'`	Assembly format (Consed)
PDB SEQRES	`'pdb-seqres'`	Protein sequences from PDB files
PDB ATOM	`'pdb-atom'`	Sequences from ATOM records in PDB
SnapGene	`'snapgene'`	SnapGene .dna files
GCK	`'gck'`	Gene Construction Kit files
XDNA	`'xdna'`	DNA Strider / SerialCloner files

Reading ABI Trace Files

python

# Read Sanger sequencing trace with quality
record = SeqIO.read('sample.ab1', 'abi')
print(f'Sequence: {record.seq}')
qualities = record.letter_annotations['phred_quality']

# Auto-trim low quality ends
record_trimmed = SeqIO.read('sample.ab1', 'abi-trim')

Reading 454/Ion Torrent SFF

python

for record in SeqIO.parse('reads.sff', 'sff'):
    print(record.id, len(record.seq))

# With trimming applied
for record in SeqIO.parse('reads.sff', 'sff-trim'):
    print(record.id, len(record.seq))

Reading PDB Sequences

python

# Get sequences from SEQRES records
for record in SeqIO.parse('structure.pdb', 'pdb-seqres'):
    print(f'Chain {record.id}: {record.seq}')

# Get sequences from ATOM coordinates
for record in SeqIO.parse('structure.pdb', 'pdb-atom'):
    print(f'Chain {record.id}: {record.seq}')

Alignment Formats (Read-Only)

Format	String	Notes
PHYLIP	`'phylip'`	Interleaved phylip
PHYLIP Sequential	`'phylip-sequential'`	Sequential phylip
PHYLIP Relaxed	`'phylip-relaxed'`	Longer names allowed
Clustal	`'clustal'`	ClustalW output
Stockholm	`'stockholm'`	Rfam/Pfam alignments
NEXUS	`'nexus'`	PAUP/MrBayes format
MAF	`'maf'`	Multiple Alignment Format

SeqRecord Object Attributes

After parsing, each record has these key attributes:

python

record.id          # Sequence identifier (string)
record.name        # Sequence name (string)
record.description # Full description line (string)
record.seq         # Sequence data (Seq object)
record.features    # List of SeqFeature objects (GenBank/EMBL)
record.annotations # Dictionary of annotations
record.letter_annotations  # Per-letter annotations (quality scores)
record.dbxrefs     # Database cross-references

Code Patterns

Collect All Sequences Into a List

python

records = list(SeqIO.parse('sequences.fasta', 'fasta'))

Count Records Without Loading All

python

count = sum(1 for _ in SeqIO.parse('sequences.fasta', 'fasta'))

Fast Count (FASTA only)

python

from Bio.SeqIO.FastaIO import SimpleFastaParser
with open('sequences.fasta') as f:
    count = sum(1 for _ in SimpleFastaParser(f))

Get Sequence IDs Only

python

ids = [record.id for record in SeqIO.parse('sequences.fasta', 'fasta')]

Read GenBank with Features

python

for record in SeqIO.parse('sequence.gb', 'genbank'):
    for feature in record.features:
        if feature.type == 'CDS':
            print(feature.qualifiers.get('product', ['Unknown'])[0])
            cds_seq = feature.extract(record.seq)  # Get feature sequence

Access FASTQ Quality Scores

python

for record in SeqIO.parse('reads.fastq', 'fastq'):
    qualities = record.letter_annotations['phred_quality']
    avg_quality = sum(qualities) / len(qualities)

Read From File Handle

python

with open('sequences.fasta', 'r') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id)

Custom ID Function for Indexing

python

def get_accession(identifier):
    return identifier.split('.')[0]  # Remove version

records = SeqIO.index('sequences.fasta', 'fasta', key_function=get_accession)

Common Errors

Error	Cause	Solution
`ValueError: More than one record`	Used `read()` on multi-record file	Use `parse()` instead
`ValueError: No records found`	Used `read()` on empty file	Check file exists and has content
`ValueError: unknown format`	Typo in format string	Check format string spelling
`UnicodeDecodeError`	Binary file or wrong encoding	Open with `encoding='latin-1'` or check file
`sqlite3.OperationalError`	index_db file locked	Close other connections first

Decision Tree

Need to read sequences?
├── Single record in file?
│   └── Use SeqIO.read()
├── Multiple records?
│   ├── Need all in memory at once?
│   │   └── Use list(SeqIO.parse()) or SeqIO.to_dict()
│   ├── Process one at a time (memory efficient)?
│   │   └── Use SeqIO.parse() iterator
│   ├── Large file, need random access by ID?
│   │   ├── Single session? → Use SeqIO.index()
│   │   └── Persistent/multi-file? → Use SeqIO.index_db()
│   └── Maximum throughput needed?
│       └── Use SimpleFastaParser or FastqGeneralIterator
├── Sanger sequencing trace?
│   └── Use 'abi' or 'abi-trim' format
├── 454/Ion Torrent data?
│   └── Use 'sff' or 'sff-trim' format
└── Protein from structure?
    └── Use 'pdb-seqres' or 'pdb-atom' format

Related Skills

write-sequences - Write parsed sequences to new files
filter-sequences - Filter sequences by criteria after reading
format-conversion - Convert between formats
compressed-files - Read gzip/bzip2/BGZF compressed sequence files
sequence-manipulation/seq-objects - Work with parsed SeqRecord objects
database-access - Fetch sequences from NCBI instead of local files
alignment-files - For SAM/BAM/CRAM alignment files, use samtools/pysam

Search AI Tools

bio-read-sequences

Install this agent skill to your Project

SKILL.md

Read Sequences

Required Import

Core Functions

SeqIO.parse() - Multiple Records (Iterator)

SeqIO.read() - Single Record Only

SeqIO.to_dict() - Load All Into Memory

SeqIO.index() - Large File Random Access

SeqIO.index_db() - SQLite-Backed Indexing

High-Performance Parsing

SimpleFastaParser

FastqGeneralIterator

Common Formats

Specialized Formats

Reading ABI Trace Files

Reading 454/Ion Torrent SFF

Reading PDB Sequences

Alignment Formats (Read-Only)

SeqRecord Object Attributes

Code Patterns

Collect All Sequences Into a List

Count Records Without Loading All

Fast Count (FASTA only)

Get Sequence IDs Only

Read GenBank with Features

Access FASTQ Quality Scores

Read From File Handle

Custom ID Function for Indexing

Common Errors

Decision Tree

Related Skills