Agent skill

bio-variant-normalization

Normalize indel representation and split multiallelic variants using bcftools norm. Use when comparing variants from different callers or preparing VCF for downstream analysis.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/variant-normalization

SKILL.md

Variant Normalization

Left-align indels and split multiallelic sites using bcftools norm.

Why Normalize?

The same variant can be represented multiple ways:

# Same 2-bp deletion in a CA repeat, different representations
# (example reference: chr1:100-105 = GCACAT)
chr1  102  ACA   A      (right-shifted)
chr1  100  GCAC  GC     (left-aligned, extra trailing base)
chr1  100  GCA   G      (left-aligned and trimmed, normalized)

Normalization ensures consistent representation for:

  • Comparing variants from different callers
  • Database lookups (dbSNP, ClinVar)
  • Merging VCF files
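As a rough illustration of what left-alignment does, the sketch below trims and shifts a single REF/ALT pair against a reference string. It is a simplified pure-Python version of the usual trim-and-extend procedure, not bcftools' implementation. The function name `left_align`, the `seq`/`seq_start` arguments and the reference `GCACAT` are assumptions made up for the example.

```python
def left_align(seq, seq_start, pos, ref, alt):
    """Left-align and trim one REF/ALT pair (illustrative sketch only).

    seq       -- reference bases covering the variant (hypothetical input)
    seq_start -- 1-based coordinate of seq[0]
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("REF and ALT are identical")

    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, prepend the preceding reference base
        if not ref or not alt:
            if pos <= seq_start:
                raise ValueError("ran off the start of the reference window")
            pos -= 1
            base = seq[pos - seq_start]
            ref, alt = base + ref, base + alt
            changed = True

    # Drop shared leading bases while both alleles keep at least one base
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Made-up reference: chr1:100-105 = GCACAT (a short CA repeat)
print(left_align("GCACAT", 100, 102, "ACA", "A"))    # (100, 'GCA', 'G')
print(left_align("GCACAT", 100, 100, "GCAC", "GC"))  # (100, 'GCA', 'G')
```

Both inputs describe the same 2-bp deletion and end up as one record, which is why normalized VCFs can be compared position by position.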

bcftools norm

Left-Align Indels

bash
bcftools norm -f reference.fa input.vcf.gz -Oz -o normalized.vcf.gz

Requires reference FASTA to determine left-most representation.

Check for Normalization Issues

bash
bcftools norm -f reference.fa -c w input.vcf.gz > /dev/null
# Warns about REF allele mismatches on stderr without modifying records

Check modes (-c):

  • w - Warn on mismatch and continue
  • e - Exit with an error on mismatch (default)
  • x - Exclude mismatching records
  • s - Set REF from the reference (can swap alleles; does not fix strand issues)
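The four modes can be pictured with a small pure-Python sketch. It only mimics the idea of comparing REF against the reference; the real `-c s` also handles allele swaps and genotype updates, which this sketch does not attempt. The function `check_ref`, the tuple layout and the reference string are assumptions for illustration.

```python
import warnings


def check_ref(records, seq, seq_start, mode):
    """Apply an e/w/x/s style REF check to (pos, ref, alt) tuples.

    seq is a hypothetical reference window whose first base sits at
    1-based coordinate seq_start.
    """
    kept = []
    for pos, ref, alt in records:
        offset = pos - seq_start
        expected = seq[offset:offset + len(ref)]
        if ref.upper() == expected.upper():
            kept.append((pos, ref, alt))
        elif mode == "e":
            raise ValueError(f"REF mismatch at {pos}: {ref} vs {expected}")
        elif mode == "w":
            warnings.warn(f"REF mismatch at {pos}: {ref} vs {expected}")
            kept.append((pos, ref, alt))
        elif mode == "x":
            continue  # drop the record
        elif mode == "s":
            kept.append((pos, expected, alt))  # overwrite REF
        else:
            raise ValueError(f"unknown mode: {mode}")
    return kept


records = [(100, "G", "A"), (101, "T", "A")]  # second REF is wrong
print(check_ref(records, "GCACAT", 100, "x"))  # [(100, 'G', 'A')]
print(check_ref(records, "GCACAT", 100, "s"))  # [(100, 'G', 'A'), (101, 'C', 'A')]
```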

Multiallelic Sites

Split Multiallelic to Biallelic

bash
bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz

Before:

chr1  100  .  A  G,T  30  PASS  .  GT  1/2

After:

chr1  100  .  A  G  30  PASS  .  GT  1/0
chr1  100  .  A  T  30  PASS  .  GT  0/1
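The genotype rewrite in the Before/After example can be sketched in a few lines of Python. This is an illustration of the mapping only: each ALT gets its own record, and alleles pointing at a different ALT become 0, matching the 1/0 and 0/1 shown above. The function name `split_multiallelic` and its inputs are hypothetical; INFO and other FORMAT fields are ignored here.

```python
def split_multiallelic(alts, gt):
    """Split one multiallelic site into per-ALT (alt, genotype) pairs.

    alts -- list of ALT alleles, e.g. ["G", "T"]
    gt   -- allele indices of one sample, e.g. [1, 2]; None means missing
    """
    out = []
    for i, alt in enumerate(alts, start=1):
        # Keep calls for this ALT as 1, turn everything else into 0
        new_gt = [None if a is None else (1 if a == i else 0) for a in gt]
        out.append((alt, new_gt))
    return out


print(split_multiallelic(["G", "T"], [1, 2]))
# [('G', [1, 0]), ('T', [0, 1])]
```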

Split SNPs Only

bash
bcftools norm -m-snps input.vcf.gz -Oz -o split_snps.vcf.gz

Split Indels Only

bash
bcftools norm -m-indels input.vcf.gz -Oz -o split_indels.vcf.gz

Join Biallelic to Multiallelic

bash
bcftools norm -m+any input.vcf.gz -Oz -o merged.vcf.gz

Split Options

| Option | Description |
|--------|-------------|
| -m-any | Split all multiallelic sites |
| -m-snps | Split multiallelic SNPs only |
| -m-indels | Split multiallelic indels only |
| -m-both | Split SNPs and indels separately |
| -m+any | Join biallelic sites into multiallelic |
| -m+snps | Join biallelic SNPs |
| -m+indels | Join biallelic indels |
| -m+both | Join SNPs and indels separately |

Combined Normalization

Standard Normalization Pipeline

bash
bcftools norm -f reference.fa -m-any input.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz

This:

  1. Left-aligns indels
  2. Splits multiallelic sites
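When the pipeline is driven from Python rather than a shell, the same two commands can be wrapped with `subprocess`. The sketch below assumes `bcftools` is on `PATH`; the helper names `build_norm_cmd` and `normalize` are made up for this example, and the arguments mirror the command line shown above.

```python
import subprocess


def build_norm_cmd(vcf_in, vcf_out, reference, split=True):
    """Build the argv for the normalization command shown above."""
    cmd = ["bcftools", "norm", "-f", reference]
    if split:
        cmd.append("-m-any")  # split multiallelic sites
    cmd += [vcf_in, "-Oz", "-o", vcf_out]
    return cmd


def normalize(vcf_in, vcf_out, reference):
    """Run normalization, then index the output (needs bcftools installed)."""
    subprocess.run(build_norm_cmd(vcf_in, vcf_out, reference), check=True)
    subprocess.run(["bcftools", "index", vcf_out], check=True)


print(" ".join(build_norm_cmd("input.vcf.gz", "normalized.vcf.gz", "reference.fa")))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names, and `check=True` makes a failing bcftools call raise instead of passing silently.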

Remove Duplicates After Splitting

bash
bcftools norm -f reference.fa -m-any -d exact input.vcf.gz -Oz -o normalized.vcf.gz

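Duplicate removal options (-d). Without -d, duplicates are kept (default):

  • exact - Remove records with identical position, REF and ALT
  • snps - Remove duplicate SNPs
  • indels - Remove duplicate indels
  • both - Remove duplicate SNPs and indels
  • all - Remove all duplicates at a position, regardless of type
  • none - Accepted by older bcftools releases with the meaning of exact; it does not keep duplicates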
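What "exact" duplicate removal means can be shown with a short Python sketch: records are keyed on chromosome, position, REF and ALT, and only the first occurrence is kept. This is a simplified illustration over plain tuples, not bcftools' code; the function name `remove_exact_duplicates` and the tuple layout are assumptions.

```python
def remove_exact_duplicates(records):
    """Keep the first of any records sharing (chrom, pos, ref, alt)."""
    seen = set()
    unique = []
    for chrom, pos, ref, alt in records:
        key = (chrom, pos, ref.upper(), alt.upper())
        if key in seen:
            continue  # later copy of an already emitted record
        seen.add(key)
        unique.append((chrom, pos, ref, alt))
    return unique


records = [
    ("chr1", 100, "A", "G"),
    ("chr1", 100, "A", "T"),  # same site, different ALT: kept
    ("chr1", 100, "A", "G"),  # exact duplicate: dropped
]
print(remove_exact_duplicates(records))
# [('chr1', 100, 'A', 'G'), ('chr1', 100, 'A', 'T')]
```

Splitting multiallelic sites is a common source of such duplicates, which is why the two steps are combined in one command above.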

Fixing Reference Alleles

Fix Mismatches from Reference

bash
bcftools norm -f reference.fa -c s input.vcf.gz -Oz -o fixed.vcf.gz

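This sets REF alleles to match the reference genome. Use it with care: it can swap alleles and update genotypes, and it does not fix strand issues, so investigate widespread mismatches (usually a wrong reference build) before applying it.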

Exclude Mismatches

bash
bcftools norm -f reference.fa -c x input.vcf.gz -Oz -o clean.vcf.gz

Removes variants where REF doesn't match reference.

Atomize Complex Variants

Split MNPs to SNPs

bash
bcftools norm --atomize input.vcf.gz -Oz -o atomized.vcf.gz

Before:

chr1  100  .  ATG  GCA  30  PASS

After:

chr1  100  .  A  G  30  PASS
chr1  101  .  T  C  30  PASS
chr1  102  .  G  A  30  PASS
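The Before/After example corresponds to a simple per-base decomposition, sketched below in Python for equal-length REF/ALT pairs. The helper `atomize_mnp` is hypothetical; skipping positions where REF and ALT agree is an assumption of this sketch, and mixed MNP plus indel records are out of scope.

```python
def atomize_mnp(pos, ref, alt):
    """Break an equal-length substitution into single-base records."""
    if len(ref) != len(alt):
        raise ValueError("sketch handles equal-length REF/ALT only")
    return [
        (pos + i, r, a)
        for i, (r, a) in enumerate(zip(ref, alt))
        if r != a  # positions where the bases agree carry no variant
    ]


print(atomize_mnp(100, "ATG", "GCA"))
# [(100, 'A', 'G'), (101, 'T', 'C'), (102, 'G', 'A')]
```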

Atomize and Left-Align

bash
bcftools norm -f reference.fa --atomize input.vcf.gz -Oz -o atomized.vcf.gz

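Keep the Original Record

Tag Modified Records

bash
bcftools norm -f reference.fa -m-any --old-rec-tag OLD input.vcf.gz -Oz -o tagged.vcf.gz

Records changed by normalization get an INFO/OLD annotation holding the original variant, so each one can be traced back to its representation in the input.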

Common Workflows

Before Comparing Callers

bash
# Normalize both VCFs the same way
for vcf in caller1.vcf.gz caller2.vcf.gz; do
    base=$(basename "$vcf" .vcf.gz)
    bcftools norm -f reference.fa -m-any "$vcf" -Oz -o "${base}.norm.vcf.gz"
    bcftools index "${base}.norm.vcf.gz"
done

# Now compare
bcftools isec -p comparison caller1.norm.vcf.gz caller2.norm.vcf.gz

Before Database Annotation

bash
bcftools norm -f reference.fa -m-any variants.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz
# Now annotate against dbSNP, ClinVar, etc.

Prepare for GWAS

bash
bcftools norm -f reference.fa -m-any -d exact input.vcf.gz | \
    bcftools view -v snps -Oz -o gwas_ready.vcf.gz
bcftools index gwas_ready.vcf.gz

cyvcf2 Normalization Check

Check if Variants Need Normalization

python
from cyvcf2 import VCF

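def needs_normalization(variant):
    # Sites without an ALT allele have nothing to normalize
    if not variant.ALT:
        return False

    # Check for multiallelic
    if len(variant.ALT) > 1:
        return True

    # Check for complex variants (potential MNPs)
    ref, alt = variant.REF, variant.ALT[0]
    if len(ref) > 1 and len(alt) > 1 and len(ref) == len(alt):
        return True

    # Note: indels that are not left-aligned are not detected here;
    # that check needs the reference sequence (use bcftools norm -f)
    return False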

count = 0
for variant in VCF('input.vcf.gz'):
    if needs_normalization(variant):
        count += 1

print(f'Variants needing normalization: {count}')

Count Multiallelic Sites

python
from cyvcf2 import VCF

multiallelic = 0
total = 0

for variant in VCF('input.vcf.gz'):
    total += 1
    if len(variant.ALT) > 1:
        multiallelic += 1

print(f'Total variants: {total}')
print(f'Multiallelic sites: {multiallelic}')
# Avoid dividing by zero on an empty VCF
if total:
    print(f'Percentage: {multiallelic/total*100:.1f}%')

Quick Reference

| Task | Command |
|------|---------|
| Left-align indels | bcftools norm -f ref.fa in.vcf.gz |
| Split multiallelic | bcftools norm -m-any in.vcf.gz |
| Join to multiallelic | bcftools norm -m+any in.vcf.gz |
| Full normalization | bcftools norm -f ref.fa -m-any in.vcf.gz |
| Fix REF alleles | bcftools norm -f ref.fa -c s in.vcf.gz |
| Remove duplicates | bcftools norm -d exact in.vcf.gz |
| Atomize MNPs | bcftools norm --atomize in.vcf.gz |

Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| REF does not match | Wrong reference | Use same reference as caller |
| not sorted | Unsorted input | Run bcftools sort first |
| duplicate records | Same position twice | Use -d to remove |

Related Skills

  • variant-calling - Generate VCF files
  • filtering-best-practices - Filter after normalization
  • vcf-manipulation - Compare normalized VCFs
  • variant-annotation - Annotate normalized variants
