Agent skill
variant-normalization
Install this agent skill to your Project
npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/variant-interpretation-acmg/bioSkills/variant-normalization
SKILL.md
name: bio-variant-normalization
description: Normalize indel representation and split multiallelic variants using bcftools norm. Use when comparing variants from different callers or preparing VCF for downstream analysis.
tool_type: cli
primary_tool: bcftools
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
Variant Normalization
Left-align indels and split multiallelic sites using bcftools norm.
Why Normalize?
The same variant can be represented multiple ways:
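# Same 2-bp deletion in a CA repeat, different representations
# (assumes the reference at chr1:100-107 reads GCACACAT)
chr1 100 GCA G (left-aligned and trimmed: normalized)
chr1 102 ACA A (shifted right within the repeat)
chr1 100 GCAC GC (left-aligned but carries an extra trailing base)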
Normalization ensures consistent representation for:
- Comparing variants from different callers
- Database lookups (dbSNP, ClinVar)
- Merging VCF files
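The idea behind left-alignment can be sketched in a few lines of Python. This is a toy illustration of the usual trim-and-extend procedure for a single biallelic variant, not the bcftools implementation; the `normalize` function, its `offset` parameter, and the toy contig are assumptions made for the example.

```python
def normalize(pos, ref, alt, reference, offset=1):
    """Left-align and trim one biallelic variant (toy, single contig).

    pos       -- 1-based position of the first REF base
    reference -- contig sequence; reference[0] sits at coordinate `offset`
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            pos -= 1
            if pos < offset:
                raise ValueError("cannot extend past the start of the contig")
            base = reference[pos - offset]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Hypothetical toy contig: coordinates 100-107 read GCACACAT (a CA repeat).
reference = "GCACACAT"
for record in [(100, "GCA", "G"), (102, "ACA", "A"), (100, "GCAC", "GC")]:
    print(record, "->", normalize(*record, reference, offset=100))
```

All three inputs describe the same 2-bp deletion, and each collapses to the same position and alleles, which is what makes downstream comparison by position and allele possible.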
bcftools norm
Left-Align Indels
bcftools norm -f reference.fa input.vcf.gz -Oz -o normalized.vcf.gz
Requires reference FASTA to determine left-most representation.
Check for Normalization Issues
bcftools norm -f reference.fa -c w input.vcf.gz > /dev/null
# Warns about REF allele mismatches on stderr without modifying records
Check modes (-c):
- e - Exit with an error on mismatch (default)
- w - Warn on mismatch and continue
- x - Exclude mismatching records
- s - Set the correct REF from the reference
Multiallelic Sites
Split Multiallelic to Biallelic
bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz
Before:
chr1 100 . A G,T 30 PASS . GT 1/2
After:
chr1 100 . A G 30 PASS . GT 1/0
chr1 100 . A T 30 PASS . GT 0/1
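The genotype recoding in the example above can be sketched in Python. This is an illustration only: `split_multiallelic` is a hypothetical helper that mirrors the before/after records shown here, and it ignores the INFO and FORMAT fields (per-allele counts, likelihoods) that bcftools also has to rewrite.

```python
def split_multiallelic(chrom, pos, ref, alts, gt):
    """Split one multiallelic record into one biallelic record per ALT.

    gt is a tuple of allele indices (0 = REF, 1.. = ALT, None = missing).
    In each output record the kept ALT becomes 1, every other called
    allele becomes 0, and missing alleles stay missing.
    """
    records = []
    for index, alt in enumerate(alts, start=1):
        new_gt = tuple(
            None if allele is None else (1 if allele == index else 0)
            for allele in gt
        )
        records.append((chrom, pos, ref, alt, new_gt))
    return records


for rec in split_multiallelic("chr1", 100, "A", ["G", "T"], (1, 2)):
    print(rec)
```

Note that after splitting, a `1/2` sample appears heterozygous against REF in both records, so downstream tools that count alleles need to be aware the site was split.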
Split SNPs Only
bcftools norm -m-snps input.vcf.gz -Oz -o split_snps.vcf.gz
Split Indels Only
bcftools norm -m-indels input.vcf.gz -Oz -o split_indels.vcf.gz
Join Biallelic to Multiallelic
bcftools norm -m+any input.vcf.gz -Oz -o merged.vcf.gz
Split Options
| Option | Description |
|---|---|
| -m-any | Split all multiallelic sites |
| -m-snps | Split multiallelic SNPs only |
| -m-indels | Split multiallelic indels only |
| -m-both | Split SNPs and indels separately |
| -m+any | Join biallelic sites into multiallelic |
| -m+snps | Join biallelic SNPs |
| -m+indels | Join biallelic indels |
| -m+both | Join SNPs and indels separately |
Combined Normalization
Standard Normalization Pipeline
bcftools norm -f reference.fa -m-any input.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz
This:
- Left-aligns indels
- Splits multiallelic sites
Remove Duplicates After Splitting
bcftools norm -f reference.fa -m-any -d exact input.vcf.gz -Oz -o normalized.vcf.gz
Duplicate removal options (-d):
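- exact - Remove records with identical position and alleles
- snps - Treat SNPs at the same position as duplicates
- indels - Treat indels at the same position as duplicates
- both - Treat SNPs and indels at the same position as duplicates, each type separately
- all - Treat any records at the same position as duplicates
When -d is omitted, duplicates are kept (default). For each set of duplicates, only the first record is output.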
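As a rough model of what exact duplicate removal does, the sketch below keeps the first record for each (chrom, pos, ref, alt) key. `remove_exact_duplicates` is a hypothetical helper operating on plain tuples, not a bcftools or cyvcf2 API.

```python
def remove_exact_duplicates(records):
    """Keep the first record for each (chrom, pos, ref, alt) key.

    Each record is a tuple whose first four items are chrom, pos, ref, alt;
    any further items (QUAL, INFO, ...) are carried along untouched.
    """
    seen = set()
    kept = []
    for rec in records:
        key = rec[:4]
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


records = [
    ("chr1", 100, "A", "G", 30),
    ("chr1", 100, "A", "T", 25),   # same site, different ALT: not an exact duplicate
    ("chr1", 100, "A", "G", 12),   # exact duplicate of the first record
]
print(remove_exact_duplicates(records))
```

Because only the first instance survives, the order of the input decides which QUAL and INFO values are retained.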
Fixing Reference Alleles
Fix Mismatches from Reference
bcftools norm -f reference.fa -c s input.vcf.gz -Oz -o fixed.vcf.gz
This sets REF alleles to match the reference genome. It does not fix strand flips, so use it only when the mismatch is known to be a genuine REF error.
Exclude Mismatches
bcftools norm -f reference.fa -c x input.vcf.gz -Oz -o clean.vcf.gz
Removes variants where REF doesn't match reference.
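The REF check itself is a string comparison against the reference at the record's position. A minimal sketch, assuming the contig is available as an in-memory string (the `ref_matches` and `exclude_mismatches` helpers and the `offset` parameter are hypothetical, and real files would be read with an indexed FASTA reader instead):

```python
def ref_matches(pos, ref, reference, offset=1):
    """Return True if REF equals the reference bases starting at pos (1-based)."""
    start = pos - offset
    if start < 0:
        return False
    return reference[start:start + len(ref)].upper() == ref.upper()


def exclude_mismatches(records, reference, offset=1):
    """Model the effect of excluding mismatches: split records into kept/dropped."""
    kept, dropped = [], []
    for rec in records:
        _chrom, pos, ref = rec[0], rec[1], rec[2]
        target = kept if ref_matches(pos, ref, reference, offset) else dropped
        target.append(rec)
    return kept, dropped


# Hypothetical toy contig: coordinates 100-107 read GCACACAT.
reference = "GCACACAT"
records = [
    ("chr1", 100, "G", "A"),
    ("chr1", 101, "T", "C"),    # reference base at 101 is C, so REF mismatches
    ("chr1", 106, "AT", "A"),
]
kept, dropped = exclude_mismatches(records, reference, offset=100)
print("kept:", kept)
print("dropped:", dropped)
```

A large share of dropped records usually points to a reference build mismatch rather than to individual bad calls, in which case fixing or excluding records is the wrong remedy.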
Atomize Complex Variants
Split MNPs to SNPs
bcftools norm --atomize input.vcf.gz -Oz -o atomized.vcf.gz
Before:
chr1 100 . ATG GCA 30 PASS
After:
chr1 100 . A G 30 PASS
chr1 101 . T C 30 PASS
chr1 102 . G A 30 PASS
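The decomposition shown above can be modelled as a per-base comparison. The sketch below handles only equal-length REF/ALT pairs and assumes that positions where REF and ALT agree produce no record; `atomize_mnp` is a hypothetical helper, and bcftools additionally handles mixed indel/substitution records that this toy does not.

```python
def atomize_mnp(chrom, pos, ref, alt):
    """Break an equal-length MNP into single-base substitutions.

    Bases that are identical in REF and ALT are skipped (assumption).
    """
    if len(ref) != len(alt):
        raise ValueError("only equal-length REF/ALT pairs are handled here")
    return [
        (chrom, pos + i, r, a)
        for i, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]


for rec in atomize_mnp("chr1", 100, "ATG", "GCA"):
    print(rec)
```

Atomizing discards the information that the substitutions were called together on one haplotype, which matters for codon-level annotation.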
Atomize and Left-Align
bcftools norm -f reference.fa --atomize input.vcf.gz -Oz -o atomized.vcf.gz
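Track Original Records
Tag Modified Records
bcftools norm -f reference.fa -m-any --old-rec-tag OLD input.vcf.gz -Oz -o tagged.vcf.gz
Adds an INFO/OLD annotation holding the original record to each record that normalization changed, so the pre-normalization representation stays traceable. This option does not change the VCF version.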
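If the tag value needs to be read back programmatically, a small parser is enough. The sketch below assumes the value has the pipe-separated layout CHROM|POS|REF|ALT with an optional trailing ALT index for split records; that layout is an assumption to verify against the bcftools version in use, and `parse_old_rec` is a hypothetical helper.

```python
def parse_old_rec(value):
    """Parse 'CHROM|POS|REF|ALT[|USED_ALT_IDX]' into a dict (assumed layout)."""
    fields = value.split("|")
    if len(fields) not in (4, 5):
        raise ValueError(f"unexpected tag layout: {value!r}")
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "ref": fields[2],
        "alts": fields[3].split(","),
        # Present only when the record came from a split multiallelic site.
        "used_alt_idx": int(fields[4]) if len(fields) == 5 else None,
    }


print(parse_old_rec("chr1|100|A|G,T|1"))
print(parse_old_rec("chr1|102|ACA|A"))
```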
Common Workflows
Before Comparing Callers
# Normalize both VCFs the same way
for vcf in caller1.vcf.gz caller2.vcf.gz; do
base=$(basename "$vcf" .vcf.gz)
bcftools norm -f reference.fa -m-any "$vcf" -Oz -o "${base}.norm.vcf.gz"
bcftools index "${base}.norm.vcf.gz"
done
# Now compare
bcftools isec -p comparison caller1.norm.vcf.gz caller2.norm.vcf.gz
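Why normalization has to come first is easy to see with plain set operations on (chrom, pos, ref, alt) keys. The sketch below is a toy stand-in for the comparison step, not a replacement for bcftools isec; `compare_callsets` and the example records are hypothetical.

```python
def compare_callsets(calls_a, calls_b):
    """Partition two collections of (chrom, pos, ref, alt) keys."""
    keys_a, keys_b = set(calls_a), set(calls_b)
    return {
        "only_a": sorted(keys_a - keys_b),
        "only_b": sorted(keys_b - keys_a),
        "shared": sorted(keys_a & keys_b),
    }


# The same deletion, reported left-aligned by one caller and shifted by the other.
caller1 = [("chr1", 100, "GCA", "G"), ("chr1", 250, "A", "T")]
caller2_raw = [("chr1", 102, "ACA", "A"), ("chr1", 250, "A", "T")]
caller2_norm = [("chr1", 100, "GCA", "G"), ("chr1", 250, "A", "T")]

print(compare_callsets(caller1, caller2_raw))    # deletion looks discordant
print(compare_callsets(caller1, caller2_norm))   # deletion is shared
```

Without normalization the deletion is counted once as private to each caller, which inflates apparent discordance between callers.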
Before Database Annotation
bcftools norm -f reference.fa -m-any variants.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz
# Now annotate against dbSNP, ClinVar, etc.
Prepare for GWAS
bcftools norm -f reference.fa -m-any -d exact input.vcf.gz | \
bcftools view -v snps -Oz -o gwas_ready.vcf.gz
bcftools index gwas_ready.vcf.gz
cyvcf2 Normalization Check
Check if Variants Need Normalization
from cyvcf2 import VCF
def needs_normalization(variant):
    # Multiallelic site: would be split by -m-any
    if len(variant.ALT) > 1:
        return True
    # Reference-only record (no ALT allele): nothing to normalize
    if not variant.ALT:
        return False
    # Equal-length REF/ALT longer than 1 bp: potential MNP
    ref, alt = variant.REF, variant.ALT[0]
    if len(ref) > 1 and len(ref) == len(alt):
        return True
    # Note: unaligned indels cannot be detected without the reference
    return False
count = 0
for variant in VCF('input.vcf.gz'):
    if needs_normalization(variant):
        count += 1
print(f'Variants needing normalization: {count}')
Count Multiallelic Sites
from cyvcf2 import VCF
multiallelic = 0
total = 0
for variant in VCF('input.vcf.gz'):
    total += 1
    if len(variant.ALT) > 1:
        multiallelic += 1
print(f'Total variants: {total}')
print(f'Multiallelic sites: {multiallelic}')
# Guard against an empty VCF before computing the percentage
if total:
    print(f'Percentage: {multiallelic/total*100:.1f}%')
Quick Reference
| Task | Command |
|---|---|
| Left-align indels | bcftools norm -f ref.fa in.vcf.gz |
| Split multiallelic | bcftools norm -m-any in.vcf.gz |
| Join to multiallelic | bcftools norm -m+any in.vcf.gz |
| Full normalization | bcftools norm -f ref.fa -m-any in.vcf.gz |
| Fix REF alleles | bcftools norm -f ref.fa -c s in.vcf.gz |
| Remove duplicates | bcftools norm -d exact in.vcf.gz |
| Atomize MNPs | bcftools norm --atomize in.vcf.gz |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| REF does not match | Wrong reference | Use same reference as caller |
| not sorted | Unsorted input | Run bcftools sort first |
| duplicate records | Same position twice | Use -d to remove |
Related Skills
- variant-calling - Generate VCF files
- filtering-best-practices - Filter after normalization
- vcf-manipulation - Compare normalized VCFs
- variant-annotation - Annotate normalized variants