Agent skill
bio-alignment-validation
Install this agent skill to your Project
npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/bio-alignment-validation
SKILL.md
name: bio-alignment-validation description: Validate alignment quality with insert size distribution, proper pairing rates, GC bias, strand balance, and other post-alignment metrics. Use when verifying alignment data quality before variant calling or quantification. tool_type: mixed primary_tool: samtools measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools:
- read_file
- run_shell_command
Alignment Validation
Post-alignment quality control to verify alignment quality and identify issues.
Insert Size Distribution
Insert size should match library preparation protocol.
samtools stats
samtools stats input.bam > stats.txt
grep "^IS" stats.txt | cut -f2,3 > insert_sizes.txt
Picard CollectInsertSizeMetrics
java -jar picard.jar CollectInsertSizeMetrics \
I=input.bam \
O=insert_metrics.txt \
H=insert_histogram.pdf
Expected Insert Sizes
| Library Type | Expected Size |
|---|---|
| Standard WGS | 300-500 bp |
| PCR-free | 350-550 bp |
| RNA-seq | 150-300 bp |
| ChIP-seq | 150-300 bp |
| ATAC-seq | Multimodal |
Python Insert Size Analysis
import pysam
import numpy as np
import matplotlib.pyplot as plt
def get_insert_sizes(bam_file, max_reads=100000):
sizes = []
bam = pysam.AlignmentFile(bam_file, 'rb')
for i, read in enumerate(bam.fetch()):
if i >= max_reads:
break
if read.is_proper_pair and not read.is_secondary and read.template_length > 0:
sizes.append(read.template_length)
bam.close()
return sizes
sizes = get_insert_sizes('sample.bam')
print(f'Median insert size: {np.median(sizes):.0f}')
print(f'Mean insert size: {np.mean(sizes):.0f}')
print(f'Std dev: {np.std(sizes):.0f}')
plt.hist(sizes, bins=100, range=(0, 1000))
plt.xlabel('Insert Size')
plt.ylabel('Count')
plt.savefig('insert_size_dist.pdf')
Proper Pairing Rate
Percentage of reads correctly paired.
samtools flagstat
samtools flagstat input.bam
samtools flagstat input.bam | grep "properly paired"
Calculate Pairing Rate
proper=$(samtools view -c -f 2 input.bam)
mapped=$(samtools view -c -F 4 input.bam)
rate=$(echo "scale=4; $proper / $mapped * 100" | bc)
echo "Proper pairing rate: ${rate}%"
Expected Rates
| Metric | Good | Marginal | Poor |
|---|---|---|---|
| Proper pair | > 90% | 80-90% | < 80% |
| Mapped | > 95% | 90-95% | < 90% |
| Singletons | < 5% | 5-10% | > 10% |
GC Bias
GC content correlation with coverage.
Picard CollectGcBiasMetrics
java -jar picard.jar CollectGcBiasMetrics \
I=input.bam \
O=gc_bias_metrics.txt \
CHART=gc_bias_chart.pdf \
S=gc_summary.txt \
R=reference.fa
deepTools computeGCBias
computeGCBias \
-b input.bam \
--effectiveGenomeSize 2913022398 \
-g hg38.2bit \
-o gc_bias.txt \
--biasPlot gc_bias.pdf
Interpret GC Bias
| Issue | Symptom |
|---|---|
| Under-representation | Low GC coverage drops |
| Over-representation | High GC coverage elevated |
| PCR bias | Strong correlation |
Strand Balance
Forward and reverse strand should be balanced.
Calculate Strand Ratio
forward=$(samtools view -c -F 16 input.bam)
reverse=$(samtools view -c -f 16 input.bam)
echo "Forward: $forward"
echo "Reverse: $reverse"
ratio=$(echo "scale=4; $forward / $reverse" | bc)
echo "F/R ratio: $ratio"
Check Strand Bias per Chromosome
for chr in chr1 chr2 chr3; do
fwd=$(samtools view -c -F 16 input.bam $chr)
rev=$(samtools view -c -f 16 input.bam $chr)
echo "$chr: F=$fwd R=$rev ratio=$(echo "scale=2; $fwd/$rev" | bc)"
done
Mapping Quality Distribution
Extract MAPQ Distribution
samtools view input.bam | cut -f5 | sort -n | uniq -c | sort -k2 -n
Calculate Mean MAPQ
samtools view input.bam | awk '{sum+=$5; count++} END {print "Mean MAPQ:", sum/count}'
MAPQ Thresholds
| MAPQ | Meaning |
|---|---|
| 0 | Multi-mapper |
| 1-10 | Low confidence |
| 20-30 | Moderate |
| 40+ | High confidence |
| 60 | Unique (BWA) |
Chromosome Coverage Balance
Calculate Per-Chromosome Coverage
samtools idxstats input.bam | awk '{print $1, $3/$2}' | head -25
Check for Aneuploidy/Contamination
samtools idxstats input.bam | awk '$3 > 0 {
sum += $3
len[$1] = $2
reads[$1] = $3
} END {
for (chr in reads) {
expected = len[chr] / sum * reads[chr]
ratio = reads[chr] / expected
if (ratio < 0.8 || ratio > 1.2) print chr, ratio
}
}'
Mismatch Rate
Picard CollectAlignmentSummaryMetrics
java -jar picard.jar CollectAlignmentSummaryMetrics \
I=input.bam \
R=reference.fa \
O=alignment_summary.txt
Key Metrics
| Metric | Description | Good Value |
|---|---|---|
| PCT_PF_READS_ALIGNED | Mapped % | > 95% |
| PF_MISMATCH_RATE | Mismatches | < 1% |
| PF_INDEL_RATE | Indels | < 0.1% |
| STRAND_BALANCE | Strand ratio | ~0.5 |
Comprehensive Validation Script
#!/bin/bash
BAM=$1
REF=$2
NAME=$(basename $BAM .bam)
OUTDIR=${3:-qc}
mkdir -p $OUTDIR
echo "=== Alignment Validation: $NAME ===" | tee $OUTDIR/report.txt
echo -e "\n--- Flagstat ---" | tee -a $OUTDIR/report.txt
samtools flagstat $BAM | tee -a $OUTDIR/report.txt
echo -e "\n--- Mapping Rate ---" | tee -a $OUTDIR/report.txt
mapped=$(samtools view -c -F 4 $BAM)
total=$(samtools view -c $BAM)
rate=$(echo "scale=2; $mapped / $total * 100" | bc)
echo "Mapping rate: ${rate}%" | tee -a $OUTDIR/report.txt
echo -e "\n--- Proper Pairing ---" | tee -a $OUTDIR/report.txt
proper=$(samtools view -c -f 2 $BAM)
pair_rate=$(echo "scale=2; $proper / $mapped * 100" | bc)
echo "Proper pairing: ${pair_rate}%" | tee -a $OUTDIR/report.txt
echo -e "\n--- Insert Size ---" | tee -a $OUTDIR/report.txt
samtools stats $BAM | grep "insert size average" | tee -a $OUTDIR/report.txt
echo -e "\n--- Strand Balance ---" | tee -a $OUTDIR/report.txt
fwd=$(samtools view -c -F 16 $BAM)
rev=$(samtools view -c -f 16 $BAM)
strand_ratio=$(echo "scale=3; $fwd / $rev" | bc)
echo "Forward: $fwd, Reverse: $rev, Ratio: $strand_ratio" | tee -a $OUTDIR/report.txt
echo -e "\n--- Chromosome Coverage ---" | tee -a $OUTDIR/report.txt
samtools idxstats $BAM | head -25 | tee -a $OUTDIR/report.txt
echo -e "\nReport: $OUTDIR/report.txt"
Python Validation Module
import pysam
import numpy as np
from collections import Counter
class AlignmentValidator:
def __init__(self, bam_file):
self.bam = pysam.AlignmentFile(bam_file, 'rb')
def mapping_rate(self):
stats = self.bam.get_index_statistics()
mapped = sum(s.mapped for s in stats)
unmapped = sum(s.unmapped for s in stats)
return mapped / (mapped + unmapped) * 100
def proper_pair_rate(self, sample_size=100000):
proper = 0
paired = 0
for i, read in enumerate(self.bam.fetch()):
if i >= sample_size:
break
if read.is_paired:
paired += 1
if read.is_proper_pair:
proper += 1
return proper / paired * 100 if paired > 0 else 0
def mapq_distribution(self, sample_size=100000):
mapqs = []
for i, read in enumerate(self.bam.fetch()):
if i >= sample_size:
break
if not read.is_unmapped:
mapqs.append(read.mapping_quality)
return Counter(mapqs)
def strand_balance(self, sample_size=100000):
forward = 0
reverse = 0
for i, read in enumerate(self.bam.fetch()):
if i >= sample_size:
break
if not read.is_unmapped:
if read.is_reverse:
reverse += 1
else:
forward += 1
return forward / (forward + reverse) if (forward + reverse) > 0 else 0.5
def report(self):
print(f'Mapping rate: {self.mapping_rate():.1f}%')
print(f'Proper pairing: {self.proper_pair_rate():.1f}%')
print(f'Strand balance: {self.strand_balance():.3f}')
mapq_dist = self.mapq_distribution()
high_qual = sum(v for k, v in mapq_dist.items() if k >= 30)
total = sum(mapq_dist.values())
print(f'High MAPQ (>=30): {high_qual/total*100:.1f}%')
def close(self):
self.bam.close()
validator = AlignmentValidator('sample.bam')
validator.report()
validator.close()
Quality Thresholds Summary
| Metric | Good | Warning | Fail |
|---|---|---|---|
| Mapping rate | > 95% | 90-95% | < 90% |
| Proper pairing | > 90% | 80-90% | < 80% |
| Duplicate rate | < 10% | 10-20% | > 20% |
| Strand balance | 0.48-0.52 | 0.45-0.55 | Outside |
| Mean MAPQ | > 40 | 30-40 | < 30 |
| GC bias | < 1.2x | 1.2-1.5x | > 1.5x |
Related Skills
- bam-statistics - Basic flagstat and depth
- duplicate-handling - Mark/remove duplicates
- alignment-filtering - Filter low-quality reads
- chip-seq/chipseq-qc - ChIP-specific metrics
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
vcf-annotator
Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.
chemist-analyst
Analyzes events through chemistry lens using molecular structure, reaction mechanisms, thermodynamics, kinetics, and analytical techniques (spectroscopy, chromatography, mass spectrometry). Provides insights on chemical processes, material properties, reaction pathways, synthesis, and analytical methods. Use when: Chemical reactions, material analysis, synthesis planning, process optimization, environmental chemistry. Evaluates: Molecular structure, reaction mechanisms, yield, selectivity, safety, environmental impact.
bio-alignment-io
Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
sleep-analyzer
分析睡眠数据、识别睡眠模式、评估睡眠质量,并提供个性化睡眠改善建议。支持与其他健康数据的关联分析。
metabolomics-workbench-database
Access NIH Metabolomics Workbench via REST API (4,200+ studies). Query metabolites, RefMet nomenclature, MS/NMR data, m/z searches, study metadata, for metabolomics and biomarker discovery.
bio-hi-c-analysis-matrix-operations
Balance, normalize, and transform Hi-C contact matrices using cooler and cooltools. Apply iterative correction (ICE), compute expected values, and generate observed/expected matrices. Use when normalizing or transforming Hi-C matrices.
Didn't find tool you were looking for?