Variant Normalization

Left-align indels and split multiallelic sites using bcftools norm.

Why Normalize?

The same variant can be represented multiple ways:

# Same deletion, different representations (reference chr1:100-107 = GCACACAT)
chr1  100  GCA   G      (left-aligned, normalized)
chr1  102  ACA   A      (right-shifted in the repeat)
chr1  100  GCAC  GC     (redundant trailing base)

Normalization ensures consistent representation for:

Comparing variants from different callers
Database lookups (dbSNP, ClinVar)
Merging VCF files
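To see why differently written records can describe the same event, one option is to apply each record to the reference and compare the resulting sequences. The sketch below does this in plain Python; the toy reference string and the helper name `apply_variant` are assumptions made for illustration and are not part of bcftools.

```python
def apply_variant(reference: str, pos: int, ref: str, alt: str) -> str:
    """Return the sequence produced by applying one variant to `reference`.

    `pos` is 1-based, as in VCF. Raises ValueError if REF does not match.
    """
    start = pos - 1  # convert to a 0-based string index
    if reference[start:start + len(ref)] != ref:
        raise ValueError(f"REF {ref!r} does not match reference at position {pos}")
    return reference[:start] + alt + reference[start + len(ref):]


# Hypothetical toy reference: a CA repeat flanked by G and T.
reference = "GCACACAT"

# Three ways to write "delete one CA unit" (1-based positions).
representations = [
    (1, "GCA", "G"),    # left-aligned and parsimonious
    (3, "ACA", "A"),    # shifted right within the repeat
    (1, "GCAC", "GC"),  # carries a redundant trailing base
]

# All three records collapse to a single resulting haplotype.
haplotypes = {apply_variant(reference, *rep) for rep in representations}
print(haplotypes)
```

A plain string comparison of POS/REF/ALT would treat these three records as different variants, which is the mismatch normalization removes.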

bcftools norm

Left-Align Indels

bash

bcftools norm -f reference.fa input.vcf.gz -Oz -o normalized.vcf.gz

The reference FASTA (-f) is required: each indel is shifted left through the reference sequence until it reaches its left-most equivalent representation.
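The left-shifting step can be sketched in a few lines of Python. This is a simplified illustration of the commonly described trim-and-extend procedure for a single biallelic record, not the bcftools implementation; the function name `left_align` and the toy reference are assumptions.

```python
from typing import Tuple


def left_align(reference: str, pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Left-align and trim one biallelic variant (1-based `pos`)."""
    if ref == alt:
        raise ValueError("REF and ALT are identical; nothing to normalize")
    if reference[pos - 1:pos - 1 + len(ref)] != ref:
        raise ValueError("REF does not match the reference")

    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pull in one reference base from the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot left-extend a variant at position 1")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


toy_reference = "GCACACAT"  # hypothetical sequence for illustration
print(left_align(toy_reference, 3, "ACA", "A"))    # right-shifted deletion
print(left_align(toy_reference, 1, "GCAC", "GC"))  # redundant trailing base
```

Both calls return the same record, which is why the reference sequence must be available: the loop reads bases to the left of the variant that are not stored in the VCF line itself.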

Check for Normalization Issues

bash

bcftools norm -f reference.fa -c w input.vcf.gz > /dev/null
# Warns about REF allele mismatches on stderr and keeps going

Check modes (-c):

e - Exit with an error on mismatch (default)
w - Warn on mismatch and continue
x - Exclude mismatched records
s - Set REF from the reference (can swap alleles; does not fix strand problems)
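The four modes differ only in what happens to a record whose REF disagrees with the reference. The sketch below mimics that decision for records on a single contig; it is a simplified model (the real `s` mode can also swap alleles and update genotypes), and `check_ref` and its tuple layout are hypothetical names chosen for this example.

```python
import sys


def check_ref(records, reference, mode):
    """Apply an e/w/x/s style REF check to (pos, ref, alt) tuples.

    `pos` is 1-based. Returns the list of records that would be kept.
    """
    if mode not in ("e", "w", "x", "s"):
        raise ValueError(f"unknown mode: {mode!r}")
    kept = []
    for pos, ref, alt in records:
        expected = reference[pos - 1:pos - 1 + len(ref)]
        if ref == expected:
            kept.append((pos, ref, alt))
        elif mode == "e":
            # Stop at the first mismatch.
            raise ValueError(f"REF mismatch at {pos}: {ref} vs {expected}")
        elif mode == "w":
            # Report the mismatch but keep the record unchanged.
            print(f"REF mismatch at {pos}: {ref} vs {expected}", file=sys.stderr)
            kept.append((pos, ref, alt))
        elif mode == "s":
            # Overwrite REF with the reference bases.
            kept.append((pos, expected, alt))
        # mode == "x": drop the record by not appending it
    return kept


toy_reference = "GCACACAT"                 # hypothetical sequence
records = [(1, "G", "T"), (2, "A", "G")]   # second record has a wrong REF
print(check_ref(records, toy_reference, "x"))
print(check_ref(records, toy_reference, "s"))
```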

Multiallelic Sites

Split Multiallelic to Biallelic

bash

bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz

Before:

chr1  100  .  A  G,T  30  PASS  .  GT  1/2

After:

chr1  100  .  A  G  30  PASS  .  GT  1/0
chr1  100  .  A  T  30  PASS  .  GT  0/1
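The genotype recoding in the example above follows a simple rule: in the record for a given ALT, that allele becomes 1 and every other allele becomes 0. A minimal sketch of that rule, assuming a single GT string per record and ignoring INFO and other FORMAT fields (which bcftools also splits); `split_multiallelic` is a hypothetical helper.

```python
def split_multiallelic(ref, alts, genotype):
    """Split one multiallelic site into biallelic (ref, alt, gt) tuples."""
    sep = "|" if "|" in genotype else "/"  # preserve phasing separator
    alleles = genotype.split(sep)
    records = []
    for idx, alt in enumerate(alts, start=1):
        recoded = []
        for allele in alleles:
            if allele == ".":
                recoded.append(".")        # keep missing calls missing
            elif int(allele) == idx:
                recoded.append("1")        # this record's ALT
            else:
                recoded.append("0")        # REF or a different ALT
        records.append((ref, alt, sep.join(recoded)))
    return records


for record in split_multiallelic("A", ["G", "T"], "1/2"):
    print(record)
```

Note that after splitting, neither record on its own shows that the sample carried two different ALT alleles, which is one reason to keep the original file.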

Split SNPs Only

bash

bcftools norm -m-snps input.vcf.gz -Oz -o split_snps.vcf.gz

Split Indels Only

bash

bcftools norm -m-indels input.vcf.gz -Oz -o split_indels.vcf.gz

Join Biallelic to Multiallelic

bash

bcftools norm -m+any input.vcf.gz -Oz -o merged.vcf.gz

Split Options

Option	Description
`-m-any`	Split all multiallelic sites
`-m-snps`	Split multiallelic SNPs only
`-m-indels`	Split multiallelic indels only
`-m-both`	Split SNPs and indels separately
`-m+any`	Join biallelic sites into multiallelic
`-m+snps`	Join biallelic SNPs
`-m+indels`	Join biallelic indels
`-m+both`	Join SNPs and indels separately

Combined Normalization

Standard Normalization Pipeline

bash

bcftools norm -f reference.fa -m-any input.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz

This:

Left-aligns indels
Splits multiallelic sites

Remove Duplicates After Splitting

bash

bcftools norm -f reference.fa -m-any -d exact input.vcf.gz -Oz -o normalized.vcf.gz

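Duplicate removal options (-d). The first record of each duplicate group is kept:

exact - Records with the same position, REF, and ALT are duplicates
snps - SNPs at the same position are duplicates, regardless of ALT
indels - Indels at the same position are duplicates, regardless of ALT
both - Apply the snps and indels rules together
all - Any records at the same position are duplicates, regardless of alleles

Without -d, duplicates are kept. Older bcftools releases used none as the name for exact matching.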
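As a rough model of exact-match duplicate removal, the sketch below keeps the first record for each (chrom, pos, ref, alt) key and drops later repeats. The tuple layout and the name `remove_exact_duplicates` are assumptions for illustration only.

```python
def remove_exact_duplicates(records):
    """Keep the first occurrence of each (chrom, pos, ref, alt) record."""
    seen = set()
    unique = []
    for record in records:
        key = record[:4]  # chrom, pos, ref, alt
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    ("chr1", 100, "A", "G"),
    ("chr1", 100, "A", "T"),   # same position, different ALT: kept
    ("chr1", 100, "A", "G"),   # exact repeat of the first record: dropped
]
print(remove_exact_duplicates(records))
```

Splitting multiallelic sites can create such repeats when the same allele was already present as a separate biallelic record, so deduplication is usually run in the same command.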

Fixing Reference Alleles

Fix Mismatches from Reference

bash

bcftools norm -f reference.fa -c s input.vcf.gz -Oz -o fixed.vcf.gz

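This overwrites mismatching REF alleles with the bases from the reference genome. It does not fix strand flips, so first confirm that the VCF and the FASTA come from the same assembly.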

Exclude Mismatches

bash

bcftools norm -f reference.fa -c x input.vcf.gz -Oz -o clean.vcf.gz

Removes variants where REF doesn't match reference.

Atomize Complex Variants

Split MNPs to SNPs

bash

bcftools norm --atomize input.vcf.gz -Oz -o atomized.vcf.gz

Before:

chr1  100  .  ATG  GCA  30  PASS

After:

chr1  100  .  A  G  30  PASS
chr1  101  .  T  C  30  PASS
chr1  102  .  G  A  30  PASS
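For an MNP whose REF and ALT have equal length, the decomposition shown above amounts to walking both alleles in step and emitting one SNV per differing base. A minimal sketch under that assumption (`atomize_mnp` is a hypothetical helper and does not cover complex indels):

```python
def atomize_mnp(pos, ref, alt):
    """Decompose an equal-length MNP into (pos, ref, alt) SNVs."""
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length REF and ALT")
    return [
        (pos + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base  # positions that match the reference are skipped
    ]


print(atomize_mnp(100, "ATG", "GCA"))
print(atomize_mnp(100, "ATG", "GTA"))  # middle base unchanged, so two SNVs
```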

Atomize and Left-Align

bash

bcftools norm -f reference.fa --atomize input.vcf.gz -Oz -o atomized.vcf.gz

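Track Original Records

Tag Modified Records

bash

bcftools norm -f reference.fa -m-any --old-rec-tag OLD input.vcf.gz -Oz -o tagged.vcf.gz

Adds an INFO/OLD annotation holding the original record to each record that normalization changed, so the pre-normalization representation can be traced.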
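To read the tag back downstream, a small parser can help. This sketch assumes the pipe-delimited layout CHROM|POS|REF|ALT|USED_ALT_IDX described in the bcftools manual and treats the last field as optional; check the output of your bcftools version before relying on it. The function name and the sample value are hypothetical.

```python
def parse_old_rec(value):
    """Parse an assumed CHROM|POS|REF|ALT[|USED_ALT_IDX] tag value."""
    fields = value.split("|")
    if len(fields) < 4:
        raise ValueError(f"unexpected tag value: {value!r}")
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "ref": fields[2],
        "alts": fields[3].split(","),
        # The index of the ALT used for this record, when present.
        "used_alt_idx": int(fields[4]) if len(fields) > 4 else None,
    }


# Hypothetical tag value for a split record.
original = parse_old_rec("chr1|100|A|G,T|1")
print(original)
```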

Common Workflows

Before Comparing Callers

bash

# Normalize both VCFs the same way
for vcf in caller1.vcf.gz caller2.vcf.gz; do
    base=$(basename "$vcf" .vcf.gz)
    bcftools norm -f reference.fa -m-any "$vcf" -Oz -o "${base}.norm.vcf.gz"
    bcftools index "${base}.norm.vcf.gz"
done

# Now compare
bcftools isec -p comparison caller1.norm.vcf.gz caller2.norm.vcf.gz
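Once both call sets use the same representation, the comparison reduces to set operations on (chrom, pos, ref, alt) keys. The sketch below illustrates that idea on in-memory tuples; it is a conceptual model with hypothetical names and data, not a replacement for bcftools isec.

```python
def compare_callsets(calls_a, calls_b):
    """Partition normalized (chrom, pos, ref, alt) keys into private and shared."""
    keys_a, keys_b = set(calls_a), set(calls_b)
    return {
        "only_a": keys_a - keys_b,
        "only_b": keys_b - keys_a,
        "shared": keys_a & keys_b,
    }


# Hypothetical normalized calls from two callers.
caller1 = [("chr1", 100, "GCA", "G"), ("chr1", 250, "A", "T")]
caller2 = [("chr1", 100, "GCA", "G"), ("chr1", 300, "C", "G")]

result = compare_callsets(caller1, caller2)
print(len(result["shared"]), len(result["only_a"]), len(result["only_b"]))
```

If caller2 had reported the deletion in a right-shifted form, the same key comparison would have counted it as private to each caller, which is why normalization comes first.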

Before Database Annotation

bash

bcftools norm -f reference.fa -m-any variants.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz
# Now annotate against dbSNP, ClinVar, etc.

Prepare for GWAS

bash

bcftools norm -f reference.fa -m-any -d exact input.vcf.gz | \
    bcftools view -v snps -Oz -o gwas_ready.vcf.gz
bcftools index gwas_ready.vcf.gz

cyvcf2 Normalization Check

Check if Variants Need Normalization

python

from cyvcf2 import VCF

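def needs_normalization(variant):
    """Flag records that bcftools norm would split or atomize.

    Indel left-alignment is not checked here because it needs the reference.
    """
    # Monomorphic records have no ALT allele, so there is nothing to normalize
    if not variant.ALT:
        return False

    # Check for multiallelic
    if len(variant.ALT) > 1:
        return True

    # Check for complex variants (potential MNPs)
    ref, alt = variant.REF, variant.ALT[0]
    if len(ref) > 1 and len(alt) > 1 and len(ref) == len(alt):
        return True

    return False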

count = 0
for variant in VCF('input.vcf.gz'):
    if needs_normalization(variant):
        count += 1

print(f'Variants needing normalization: {count}')

Count Multiallelic Sites

python

from cyvcf2 import VCF

multiallelic = 0
total = 0

for variant in VCF('input.vcf.gz'):
    total += 1
    if len(variant.ALT) > 1:
        multiallelic += 1

print(f'Total variants: {total}')
print(f'Multiallelic sites: {multiallelic}')
# Guard against an empty VCF to avoid dividing by zero
if total:
    print(f'Percentage: {multiallelic/total*100:.1f}%')

Quick Reference

Task	Command
Left-align indels	`bcftools norm -f ref.fa in.vcf.gz`
Split multiallelic	`bcftools norm -m-any in.vcf.gz`
Join to multiallelic	`bcftools norm -m+any in.vcf.gz`
Full normalization	`bcftools norm -f ref.fa -m-any in.vcf.gz`
Fix REF alleles	`bcftools norm -f ref.fa -c s in.vcf.gz`
Remove duplicates	`bcftools norm -d exact in.vcf.gz`
Atomize MNPs	`bcftools norm --atomize in.vcf.gz`

Common Errors

Error	Cause	Solution
`REF does not match`	Wrong reference	Use same reference as caller
`not sorted`	Unsorted input	Run `bcftools sort` first
`duplicate records`	Same position twice	Use `-d` to remove

Related Skills

variant-calling - Generate VCF files
filtering-best-practices - Filter after normalization
vcf-manipulation - Compare normalized VCFs
variant-annotation - Annotate normalized variants
