variant-normalization

Install this agent skill to your Project

npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/variant-interpretation-acmg/bioSkills/variant-normalization

SKILL.md

name: bio-variant-normalization
description: Normalize indel representation and split multiallelic variants using bcftools norm. Use when comparing variants from different callers or preparing VCF for downstream analysis.
tool_type: cli
primary_tool: bcftools
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
  - read_file
  - run_shell_command

Variant Normalization

Left-align indels and split multiallelic sites using bcftools norm.

Why Normalize?

The same variant can be represented multiple ways:

# Same 2-bp deletion (CA), assuming the reference reads GCACAT at chr1:100-105
chr1  100  GCA   G      (left-aligned and parsimonious: normalized)
chr1  102  ACA   A      (shifted right within the CA repeat)
chr1  100  GCAC  GC     (extra trailing base, not parsimonious)

Normalization ensures consistent representation for:

Comparing variants from different callers
Database lookups (dbSNP, ClinVar)
Merging VCF files
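The claim that one variant can be written several ways can be checked by applying each record to the reference and comparing the resulting haplotypes. The sketch below is plain Python with no bcftools involved; the reference string `GCACAT`, its start coordinate, and the `apply_variant` helper are illustrative assumptions, not part of the skill.

```python
def apply_variant(reference: str, start: int, pos: int, ref: str, alt: str) -> str:
    """Return the haplotype obtained by replacing REF with ALT.

    `reference` is a toy sequence whose first base sits at 1-based
    coordinate `start`; `pos` is the 1-based VCF POS of the record.
    """
    offset = pos - start
    if reference[offset:offset + len(ref)] != ref:
        raise ValueError(f"REF {ref!r} does not match the reference at {pos}")
    return reference[:offset] + alt + reference[offset + len(ref):]


# Hypothetical reference: chr1:100-105 reads GCACAT.
REFERENCE, START = "GCACAT", 100

records = [
    (100, "GCA", "G"),    # left-aligned and parsimonious
    (102, "ACA", "A"),    # shifted right within the CA repeat
    (100, "GCAC", "GC"),  # carries an extra trailing base
]

haplotypes = {apply_variant(REFERENCE, START, *rec) for rec in records}
print(haplotypes)  # one haplotype means the records describe the same variant
```

All three records produce the same sequence, yet a naive comparison on (POS, REF, ALT) would treat them as three different variants.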

bcftools norm

Left-Align Indels

bash

bcftools norm -f reference.fa input.vcf.gz -Oz -o normalized.vcf.gz

Requires reference FASTA to determine left-most representation.
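To see why the reference is needed, here is a minimal sketch of the commonly described trim-and-extend procedure for left-aligning an indel. It is an illustration only and not bcftools' implementation; the `left_align` function, the toy reference `GCACAT`, and its start coordinate are assumptions.

```python
def left_align(reference: str, start: int, pos: int, ref: str, alt: str):
    """Left-align and trim one biallelic record against a toy reference.

    `reference` starts at 1-based coordinate `start`. Returns (pos, ref, alt).
    """
    changed = True
    while changed:
        changed = False
        # Step 1: drop the last base while both alleles end identically.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # Step 2: if an allele became empty, pull in one reference base on the left.
        if not ref or not alt:
            if pos <= start:
                raise ValueError("cannot extend past the start of the toy reference")
            pos -= 1
            base = reference[pos - start]
            ref, alt = base + ref, base + alt
            changed = True
    # Step 3: drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Hypothetical reference: chr1:100-105 reads GCACAT.
print(left_align("GCACAT", 100, 102, "ACA", "A"))
print(left_align("GCACAT", 100, 100, "GCAC", "GC"))
```

Step 2 is the part that reads bases from the reference, which is why `-f reference.fa` is mandatory for left-alignment.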

Check for Normalization Issues

bash

bcftools norm -f reference.fa -c w input.vcf.gz > /dev/null
# Reports REF allele mismatches as warnings on stderr without modifying records

Check modes (-c):

e - Exit with an error on mismatch (default)
w - Warn on mismatch and continue
x - Exclude records whose REF mismatches
s - Set correct REF from reference
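The four check modes can be pictured with a small pure-Python sketch. It is a simplification under stated assumptions: `check_ref`, the toy reference, and the record tuples are hypothetical, and the real `s` mode does more than this (for example, it can swap alleles).

```python
import sys


def check_ref(records, reference, start, mode="e"):
    """Compare each record's REF with a toy reference.

    `records` holds (pos, ref, alt) tuples; `reference` starts at
    1-based coordinate `start`. Returns the records that are kept.
    """
    if mode not in {"e", "w", "x", "s"}:
        raise ValueError(f"unknown mode {mode!r}")
    kept = []
    for pos, ref, alt in records:
        offset = pos - start
        expected = reference[offset:offset + len(ref)]
        if ref.upper() == expected.upper():
            kept.append((pos, ref, alt))
        elif mode == "e":
            raise ValueError(f"REF mismatch at {pos}: {ref} vs {expected}")
        elif mode == "w":
            print(f"warning: REF mismatch at {pos}: {ref} vs {expected}", file=sys.stderr)
            kept.append((pos, ref, alt))
        elif mode == "x":
            continue  # drop the record
        else:  # mode == "s": take REF from the reference
            kept.append((pos, expected, alt))
    return kept


REFERENCE, START = "GCACAT", 100           # hypothetical reference
records = [(100, "G", "A"), (101, "T", "A")]  # the second REF should be C

print(check_ref(records, REFERENCE, START, mode="x"))
print(check_ref(records, REFERENCE, START, mode="s"))
```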

Multiallelic Sites

Split Multiallelic to Biallelic

bash

bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz

Before:

chr1  100  .  A  G,T  30  PASS  .  GT  1/2

After:

chr1  100  .  A  G  30  PASS  .  GT  1/0
chr1  100  .  A  T  30  PASS  .  GT  0/1
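The genotype rewrite in the Before/After records can be sketched in a few lines of Python. The `split_multiallelic` helper is hypothetical and only handles the GT field; it follows the convention shown above, where alleles belonging to another ALT are written as 0.

```python
def split_multiallelic(chrom, pos, ref, alts, gt):
    """Split one multiallelic record into biallelic rows.

    `alts` is a list of ALT alleles and `gt` a genotype string such as
    "1/2" or "0|2". In each output row the allele that matches that row's
    ALT becomes 1, every other called allele becomes 0, and "." stays ".".
    """
    sep = "|" if "|" in gt else "/"
    called = gt.split(sep)
    rows = []
    for index, alt in enumerate(alts, start=1):
        new_gt = sep.join(
            "." if a == "." else ("1" if int(a) == index else "0")
            for a in called
        )
        rows.append((chrom, pos, ref, alt, new_gt))
    return rows


for row in split_multiallelic("chr1", 100, "A", ["G", "T"], "1/2"):
    print(*row, sep="\t")
```

Real records also carry per-allele INFO and FORMAT fields (AC, AD, PL and so on) that must be split consistently, which is one reason to let bcftools do this rather than a script.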

Split SNPs Only

bash

bcftools norm -m-snps input.vcf.gz -Oz -o split_snps.vcf.gz

Split Indels Only

bash

bcftools norm -m-indels input.vcf.gz -Oz -o split_indels.vcf.gz

Join Biallelic to Multiallelic

bash

bcftools norm -m+any input.vcf.gz -Oz -o merged.vcf.gz
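Joining is the reverse grouping step. As a rough sketch, and assuming the biallelic records at a site already share the same REF, the ALT alleles can be collected per (CHROM, POS, REF). The `join_biallelic` helper is hypothetical and ignores genotypes and the allele padding that is needed when REF lengths differ.

```python
def join_biallelic(records):
    """Group (chrom, pos, ref, alt) tuples into multiallelic rows.

    Records are merged only when CHROM, POS and REF are identical;
    ALT alleles keep their first-seen order.
    """
    merged = {}
    for chrom, pos, ref, alt in records:
        alts = merged.setdefault((chrom, pos, ref), [])
        if alt not in alts:
            alts.append(alt)
    return [(c, p, r, ",".join(a)) for (c, p, r), a in merged.items()]


biallelic = [
    ("chr1", 100, "A", "G"),
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "C", "T"),
]
print(join_biallelic(biallelic))
```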

Split Options

Option	Description
`-m-any`	Split all multiallelic sites
`-m-snps`	Split multiallelic SNPs only
`-m-indels`	Split multiallelic indels only
`-m-both`	Split SNPs and indels separately
`-m+any`	Join biallelic sites into multiallelic
`-m+snps`	Join biallelic SNPs
`-m+indels`	Join biallelic indels
`-m+both`	Join SNPs and indels separately

Combined Normalization

Standard Normalization Pipeline

bash

bcftools norm -f reference.fa -m-any input.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz

This:

Left-aligns indels
Splits multiallelic sites

Remove Duplicates After Splitting

bash

bcftools norm -f reference.fa -m-any -d exact input.vcf.gz -Oz -o normalized.vcf.gz

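Duplicate removal options (-d); when a record occurs more than once, only the first instance is kept:

exact - Remove records with identical position, REF, and ALT
snps - Remove duplicate SNPs at the same position
indels - Remove duplicate indels at the same position
both - Remove duplicate SNPs and duplicate indels
all - Remove duplicates at the same position regardless of variant type
none - Treated like exact, not as "keep duplicates"; omit -d to keep duplicates (default)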
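Duplicates typically appear after splitting when the input already held a biallelic record for one of the split alleles. A minimal sketch of exact-duplicate removal, keeping the first instance, might look like this; the `remove_exact_duplicates` helper and the tuple layout are illustrative assumptions.

```python
def remove_exact_duplicates(records):
    """Keep the first record for each (chrom, pos, ref, alt) key.

    Each record is a tuple whose first four items are CHROM, POS, REF, ALT;
    anything after that (QUAL, GT, ...) is carried along untouched.
    """
    seen = set()
    unique = []
    for record in records:
        key = record[:4]
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


after_split = [
    ("chr1", 100, "A", "G", "1/0"),  # from splitting A>G,T
    ("chr1", 100, "A", "T", "0/1"),  # from splitting A>G,T
    ("chr1", 100, "A", "G", "0/1"),  # separate A>G record already in the input
]
print(remove_exact_duplicates(after_split))
```

Note that the discarded record can carry a different genotype, so it is worth checking why duplicates exist before dropping them.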

Fixing Reference Alleles

Fix Mismatches from Reference

bash

bcftools norm -f reference.fa -c s input.vcf.gz -Oz -o fixed.vcf.gz

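This sets REF alleles to match the reference genome. Use it with care: according to the bcftools documentation, this mode can swap alleles and updates GT and AC, but it does not adjust PL or other fields and does not fix strand issues.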

Exclude Mismatches

bash

bcftools norm -f reference.fa -c x input.vcf.gz -Oz -o clean.vcf.gz

Removes variants where REF doesn't match reference.

Atomize Complex Variants

Split MNPs to SNPs

bash

bcftools norm --atomize input.vcf.gz -Oz -o atomized.vcf.gz

Before:

chr1  100  .  ATG  GCA  30  PASS

After:

chr1  100  .  A  G  30  PASS
chr1  101  .  T  C  30  PASS
chr1  102  .  G  A  30  PASS
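The Before/After above amounts to walking the two alleles in parallel. A minimal sketch, assuming REF and ALT have equal length and skipping positions where the bases agree (the `atomize_mnp` helper is hypothetical):

```python
def atomize_mnp(chrom, pos, ref, alt):
    """Break an MNP into one SNP per differing base.

    Positions where REF and ALT carry the same base are skipped.
    """
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length alleles")
    return [
        (chrom, pos + i, r, a)
        for i, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]


for snp in atomize_mnp("chr1", 100, "ATG", "GCA"):
    print(*snp, sep="\t")
```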

Atomize and Left-Align

bash

bcftools norm -f reference.fa --atomize input.vcf.gz -Oz -o atomized.vcf.gz

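Track Original Records

Tag Records Changed by Normalization

bash

bcftools norm -f reference.fa -m-any --old-rec-tag OLD input.vcf.gz -Oz -o updated.vcf.gz

Adds an INFO/OLD annotation holding the original record to each record that normalization changed, so it can be traced back to the input. The tag has an effect only when combined with an operation that modifies records, such as -f or -m.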

Common Workflows

Before Comparing Callers

bash

# Normalize both VCFs the same way
for vcf in caller1.vcf.gz caller2.vcf.gz; do
    base=$(basename "$vcf" .vcf.gz)
    bcftools norm -f reference.fa -m-any "$vcf" -Oz -o "${base}.norm.vcf.gz"
    bcftools index "${base}.norm.vcf.gz"
done

# Now compare
bcftools isec -p comparison caller1.norm.vcf.gz caller2.norm.vcf.gz
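Why the normalization step matters for this comparison can be shown with plain sets of (CHROM, POS, REF, ALT) keys. The call sets below are made up for illustration, and the "normalized" list hard-codes what left-alignment would give under an assumed reference reading GCACAT at chr1:100-105.

```python
def variant_keys(records):
    """Return the set of (chrom, pos, ref, alt) keys for a call set."""
    return {(chrom, pos, ref, alt) for chrom, pos, ref, alt in records}


# Hypothetical call sets: both callers found the same CA deletion,
# but caller2 reported it shifted to the right.
caller1 = [("chr1", 100, "GCA", "G"), ("chr1", 200, "A", "T")]
caller2_raw = [("chr1", 102, "ACA", "A"), ("chr1", 300, "C", "G")]
caller2_norm = [("chr1", 100, "GCA", "G"), ("chr1", 300, "C", "G")]

shared_raw = variant_keys(caller1) & variant_keys(caller2_raw)
shared_norm = variant_keys(caller1) & variant_keys(caller2_norm)

print(f"Shared before normalization: {len(shared_raw)}")
print(f"Shared after normalization:  {len(shared_norm)}")
```

Without normalization the deletion is counted as private to each caller, which inflates the apparent disagreement between the two call sets.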

Before Database Annotation

bash

bcftools norm -f reference.fa -m-any variants.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz
# Now annotate against dbSNP, ClinVar, etc.

Prepare for GWAS

bash

bcftools norm -f reference.fa -m-any -d exact -Ou input.vcf.gz | \
    bcftools view -v snps -Oz -o gwas_ready.vcf.gz
bcftools index gwas_ready.vcf.gz

cyvcf2 Normalization Check

Check if Variants Need Normalization

python

from cyvcf2 import VCF

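def needs_normalization(variant):
    # Sites without an ALT allele (ALT is '.') have nothing to normalize
    if not variant.ALT:
        return False

    # Check for multiallelic
    if len(variant.ALT) > 1:
        return True

    # Check for complex variants (potential MNPs)
    ref, alt = variant.REF, variant.ALT[0]
    if len(ref) > 1 and len(ref) == len(alt):
        return True

    # Left-alignment of indels is not checked here; that requires the reference
    return False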

count = 0
for variant in VCF('input.vcf.gz'):
    if needs_normalization(variant):
        count += 1

print(f'Variants needing normalization: {count}')

Count Multiallelic Sites

python

from cyvcf2 import VCF

multiallelic = 0
total = 0

for variant in VCF('input.vcf.gz'):
    total += 1
    if len(variant.ALT) > 1:
        multiallelic += 1

print(f'Total variants: {total}')
print(f'Multiallelic sites: {multiallelic}')
# Guard against an empty VCF to avoid dividing by zero
if total:
    print(f'Percentage: {multiallelic/total*100:.1f}%')

Quick Reference

Task	Command
Left-align indels	`bcftools norm -f ref.fa in.vcf.gz`
Split multiallelic	`bcftools norm -m-any in.vcf.gz`
Join to multiallelic	`bcftools norm -m+any in.vcf.gz`
Full normalization	`bcftools norm -f ref.fa -m-any in.vcf.gz`
Fix REF alleles	`bcftools norm -f ref.fa -c s in.vcf.gz`
Remove duplicates	`bcftools norm -d exact in.vcf.gz`
Atomize MNPs	`bcftools norm --atomize in.vcf.gz`

Common Errors

Error	Cause	Solution
`REF does not match`	Wrong reference	Use same reference as caller
`not sorted`	Unsorted input	Run `bcftools sort` first
`duplicate records`	Same position twice	Use `-d` to remove

Related Skills

variant-calling - Generate VCF files
filtering-best-practices - Filter after normalization
vcf-manipulation - Compare normalized VCFs
variant-annotation - Annotate normalized variants

Maintainer

FreedomIntelligence Core maintainer

Source details

Full Name: FreedomIntelligence/OpenClaw-Medical-Skills
Branch: main
Path in repo: skills/variant-interpretation-acmg/bioSkills/variant-normalization
Topics: claude-code skills openclaw awesome clawhub openclaw-skills medical nanoclaw
