vcf-manipulation

Install this agent skill to your Project

npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/variant-interpretation-acmg/bioSkills/vcf-manipulation

SKILL.md

name: bio-vcf-manipulation
description: Merge, concatenate, sort, intersect, and subset VCF files using bcftools. Use when combining variant files, comparing call sets, or restructuring VCF data.
tool_type: cli
primary_tool: bcftools
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
  - read_file
  - run_shell_command

VCF Manipulation

Merge, concat, sort, and compare VCF files using bcftools.

Operations Overview

Operation	Command	Use Case
Merge	`bcftools merge`	Combine samples from multiple VCFs
Concat	`bcftools concat`	Combine regions from multiple VCFs
Sort	`bcftools sort`	Sort unsorted VCF
Intersect	`bcftools isec`	Compare/intersect call sets
Subset	`bcftools view`	Extract samples or regions

bcftools merge

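Combine VCF files that contain different samples into one multi-sample file. Inputs must be bgzip-compressed and indexed. Where a site is absent from one input, that input's samples receive missing genotypes at the site.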

Basic Merge

bash

bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

Merge Multiple Files

bash

bcftools merge *.vcf.gz -Oz -o all_samples.vcf.gz

Merge from File List

bash

# files.txt: one VCF path per line
bcftools merge -l files.txt -Oz -o merged.vcf.gz

Handle Missing Genotypes

bash

# Output missing genotypes as ./. (default)
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

# Output missing as reference (0/0)
bcftools merge --missing-to-ref sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
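
To make the difference between the two modes concrete, here is a toy Python model of how absent records turn into genotypes. It is only an illustration under simplifying assumptions (one genotype string per sample per site, no allele reconciliation). The function `toy_merge` and the example dictionaries are hypothetical names, not part of bcftools.

```python
# Toy model of genotype filling during a merge; not how bcftools is implemented.
def toy_merge(calls_by_sample, missing_to_ref=False):
    """calls_by_sample: {sample: {(chrom, pos): genotype}} -> {site: {sample: genotype}}."""
    fill = "0/0" if missing_to_ref else "./."
    sites = sorted({site for calls in calls_by_sample.values() for site in calls})
    samples = sorted(calls_by_sample)
    return {
        site: {s: calls_by_sample[s].get(site, fill) for s in samples}
        for site in sites
    }

# Hypothetical input: sample2 has no record at chr1:200.
calls = {
    "sample1": {("chr1", 100): "0/1", ("chr1", 200): "1/1"},
    "sample2": {("chr1", 100): "0/1"},
}
default = toy_merge(calls)
as_ref = toy_merge(calls, missing_to_ref=True)
print(default[("chr1", 200)])
print(as_ref[("chr1", 200)])
```

Filling with 0/0 asserts that the sample is homozygous reference at the site, which is only reasonable when a missing record means the site was examined and found to match the reference.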

Force Sample Names

When sample names conflict:

bash

bcftools merge --force-samples sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

Merge Specific Regions

bash

bcftools merge -r chr1:1000000-2000000 sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

bcftools concat

Combine VCF files that contain the same samples, in the same order, but cover different regions (for example, one file per chromosome).

Concatenate Chromosomes

bash

bcftools concat chr1.vcf.gz chr2.vcf.gz chr3.vcf.gz -Oz -o genome.vcf.gz

Concatenate All Chromosomes

bash

bcftools concat chr*.vcf.gz -Oz -o genome.vcf.gz

From File List

bash

# files.txt: one VCF path per line (in order)
bcftools concat -f files.txt -Oz -o concatenated.vcf.gz

Allow Overlapping Regions

bash

# -a lets the inputs overlap; the files must be indexed
bcftools concat -a chr1_part1.vcf.gz chr1_part2.vcf.gz -Oz -o chr1.vcf.gz

Remove Duplicates

bash

bcftools concat -a -d all file1.vcf.gz file2.vcf.gz -Oz -o merged.vcf.gz

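Options for -d (a record present in more than one input file is written once; duplicates within a single file are not removed, use bcftools norm -d for those):

snps - Deduplicate SNPs
indels - Deduplicate indels
both - Deduplicate SNPs and indels
all - Deduplicate records of any type
exact - Deduplicate only records that match exactly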

bcftools sort

Sort VCF by chromosome and position.

Basic Sort

bash

bcftools sort input.vcf -Oz -o sorted.vcf.gz

With Temporary Directory

For large files:

bash

bcftools sort -T /tmp input.vcf.gz -Oz -o sorted.vcf.gz

Memory Limit

bash

bcftools sort -m 4G input.vcf.gz -Oz -o sorted.vcf.gz

bcftools isec

Intersect and compare VCF files.

Find Shared Variants

bash

bcftools isec -p output_dir sample1.vcf.gz sample2.vcf.gz

Creates:

0000.vcf - Private to sample1
0001.vcf - Private to sample2
0002.vcf - Shared (sample1 records)
0003.vcf - Shared (sample2 records)

Output Compressed

bash

bcftools isec -p output_dir -Oz sample1.vcf.gz sample2.vcf.gz

Intersection Only

bash

bcftools isec -p output_dir -n=2 sample1.vcf.gz sample2.vcf.gz
# Only outputs variants present in exactly 2 files

Comparison Options

Flag	Description
`-n=2`	Present in exactly 2 files
`-n+2`	Present in 2 or more files
`-n-2`	Present in 2 or fewer files
`-n~11`	Boolean: file1 AND file2
`-n~10`	Boolean: file1 AND NOT file2
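
The selector syntax can be modelled in a few lines of Python. This is a sketch of the semantics as described in the bcftools manual (`=` exactly this many files, `+` this many or more, `-` this many or fewer, `~` an exact presence bitmap), not bcftools code. The function name `matches_nfiles` and the presence-string representation are assumptions made for illustration.

```python
def matches_nfiles(presence, selector):
    """Return True if a site passes an isec -n selector.

    presence: string with one character per input file, in input order,
              '1' if the site is present in that file and '0' otherwise.
    selector: the value passed to -n, e.g. '=2', '+2', '-2', '~10'.
    """
    op, value = selector[0], selector[1:]
    if op == "~":
        # Exact bitmap: presence pattern must match file by file.
        return presence == value
    count = presence.count("1")
    n = int(value)
    if op == "=":
        return count == n
    if op == "+":
        return count >= n
    if op == "-":
        return count <= n
    raise ValueError(f"unsupported selector: {selector!r}")

# A site found only in the first of two files.
print(matches_nfiles("10", "~10"), matches_nfiles("10", "=2"))
```

With two inputs, `~11` and `=2` select the same sites. They diverge once three or more files are compared, because the bitmap names specific files while the count does not.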

Two-File Intersection

bash

# Variants in both files
bcftools isec -n=2 -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o shared.vcf.gz

# Variants only in sample1
bcftools isec -n~10 -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o only_sample1.vcf.gz

Complement Mode

bash

# Variants in file1 not in file2 (-w1 writes the records from file1; without -w, isec prints a list of sites)
bcftools isec -C -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o unique.vcf.gz

Subsetting VCF Files

Extract Samples

bash

bcftools view -s sample1,sample2 input.vcf.gz -Oz -o subset.vcf.gz

Exclude Samples

bash

bcftools view -s ^sample3 input.vcf.gz -Oz -o without_sample3.vcf.gz

From Sample List File

bash

# samples.txt: one sample name per line
bcftools view -S samples.txt input.vcf.gz -Oz -o subset.vcf.gz

Extract Region

bash

bcftools view -r chr1:1000000-2000000 input.vcf.gz -Oz -o region.vcf.gz

Extract Multiple Regions

bash

bcftools view -R regions.bed input.vcf.gz -Oz -o targets.vcf.gz

Renaming Samples

Single Sample

bash

echo "old_name new_name" > rename.txt
bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz

Multiple Samples

bash

# rename.txt format: old_name new_name
cat > rename.txt << EOF
sample1 patient_001
sample2 patient_002
sample3 patient_003
EOF

bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz

Splitting VCF Files

Split by Sample

bash

for sample in $(bcftools query -l input.vcf.gz); do
    bcftools view -s "$sample" input.vcf.gz -Oz -o "${sample}.vcf.gz"
done

Split by Chromosome

bash

# Requires an index; bcftools index -s lists only contigs that have records
for chr in $(bcftools index -s input.vcf.gz | cut -f1); do
    bcftools view -r "$chr" input.vcf.gz -Oz -o "${chr}.vcf.gz"
done
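
When contig names have to come from the header text itself (for example, from the output of `bcftools view -h`), the pattern needs to handle `##contig` lines both with and without a length field. The following is a minimal Python sketch under that assumption. `contig_ids` and the sample header lines are hypothetical.

```python
import re

# Capture the ID up to the first comma or closing bracket, so that both
# "##contig=<ID=chr1,length=...>" and "##contig=<ID=chrM>" are handled.
CONTIG_ID = re.compile(r"^##contig=<ID=([^,>]+)")

def contig_ids(header_lines):
    """Return contig IDs in header order."""
    ids = []
    for line in header_lines:
        match = CONTIG_ID.match(line)
        if match:
            ids.append(match.group(1))
    return ids

# Hypothetical header lines for illustration.
header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
    "##contig=<ID=chrM>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
print(contig_ids(header))  # ['chr1', 'chrM']
```

Header-declared contigs may include sequences with no records, so splitting on this list can produce header-only output files.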

Split Multiallelic Sites

bash

bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz

Common Workflows

Merge Cohort VCFs

bash

# Create file list
ls *.vcf.gz > files.txt

# Merge all samples
bcftools merge -l files.txt -Oz -o cohort.vcf.gz
bcftools index cohort.vcf.gz

Combine Chromosome VCFs

bash

# After parallel variant calling by chromosome
bcftools concat chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz -Oz -o genome.vcf.gz
bcftools index genome.vcf.gz

Compare Two Callers

bash

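# Compare GATK and bcftools call sets (private and shared records go to comparison/)
bcftools isec -p comparison gatk.vcf.gz bcftools.vcf.gz

# Count records per output file, excluding header lines
for f in comparison/*.vcf; do
    printf '%s\t%s\n' "$f" "$(grep -vc '^#' "$f")"
done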
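
Record counts from the isec output directory can be turned into a simple concordance figure. The sketch below counts only non-header lines and then reports the shared fraction. The helper names (`count_records`, `summarize`) and the tiny demo files are made up for illustration. In practice `summarize` would be pointed at the real `comparison` directory.

```python
import gzip
import tempfile
from pathlib import Path

def count_records(path):
    """Count non-header, non-empty lines in a plain or gzip/bgzip VCF."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))

def summarize(directory):
    """Map each VCF in an isec output directory to its record count."""
    files = [p for p in sorted(Path(directory).iterdir())
             if p.name.endswith((".vcf", ".vcf.gz"))]
    return {p.name: count_records(p) for p in files}

# Demo on made-up files standing in for an isec output directory.
header = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
record = "chr1\t{pos}\t.\tA\tG\t.\t.\t.\n"
with tempfile.TemporaryDirectory() as tmp:
    for name, n in [("0000.vcf", 1), ("0001.vcf", 2), ("0002.vcf", 3)]:
        body = "".join(record.format(pos=100 + i) for i in range(n))
        Path(tmp, name).write_text(header + body)
    counts = summarize(tmp)

# 0000 = private to first file, 0001 = private to second, 0002 = shared.
shared = counts["0002.vcf"]
total = counts["0000.vcf"] + counts["0001.vcf"] + shared
print(counts, f"shared fraction: {shared / total:.2f}")
```

Counting lines directly on the raw files would include the header, which inflates every number by the same amount and hides small differences between callers.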

Extract Passing Variants

bash

bcftools view -f PASS input.vcf.gz -Oz -o pass_only.vcf.gz
bcftools index pass_only.vcf.gz

cyvcf2 Python Operations

Note: True VCF merging (combining samples at matching positions) is complex. Use bcftools merge for production work. cyvcf2 is better for filtering/querying.

Concatenate Records (Not True Merge)

python

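from cyvcf2 import VCF, Writer

# WARNING: this appends records file by file; it is not a true merge.
# Assumes every input has the same samples and compatible header definitions,
# and that the paths are listed in the desired output order.
# For actual merging of samples, use bcftools merge.
paths = ['file1.vcf.gz', 'file2.vcf.gz']  # file2.vcf.gz is an illustrative second input

template = VCF(paths[0])
writer = Writer('combined.vcf', template)  # output header is taken from the first file

for path in paths:
    vcf = VCF(path)
    for variant in vcf:
        writer.write_record(variant)
    vcf.close()

writer.close()
template.close()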

Find Shared Positions

python

from cyvcf2 import VCF

# Load positions from first VCF
vcf1_positions = set()
for variant in VCF('sample1.vcf.gz'):
    vcf1_positions.add((variant.CHROM, variant.POS))

# Check second VCF
shared = 0
unique = 0
for variant in VCF('sample2.vcf.gz'):
    if (variant.CHROM, variant.POS) in vcf1_positions:
        shared += 1
    else:
        unique += 1

print(f'Shared: {shared}')
print(f'Unique to sample2: {unique}')
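
Matching on chromosome and position alone counts two different alleles at the same site as shared. A stricter comparison keys on REF and ALT as well. The sketch below shows the idea on plain VCF text lines using only the standard library. `variant_keys` and the two small record lists are hypothetical, and multiallelic records are split into one key per ALT allele.

```python
def variant_keys(vcf_lines):
    """Yield one (CHROM, POS, REF, ALT) key per ALT allele in VCF text lines."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            yield (chrom, int(pos), ref, alt)

# Hypothetical records: same positions, but chr1:100 carries different ALT alleles.
a = ["chr1\t100\t.\tA\tG\t.\t.\t.", "chr1\t200\t.\tC\tT\t.\t.\t."]
b = ["chr1\t100\t.\tA\tT\t.\t.\t.", "chr1\t200\t.\tC\tT,G\t.\t.\t."]

keys_a, keys_b = set(variant_keys(a)), set(variant_keys(b))
shared_alleles = keys_a & keys_b
shared_positions = {k[:2] for k in keys_a} & {k[:2] for k in keys_b}
print(len(shared_positions), len(shared_alleles))  # 2 1
```

Allele-level comparison is only meaningful when both inputs represent variants the same way, which is why normalizing before comparing is usually recommended.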

Quick Reference

Task	Command
Merge samples	`bcftools merge s1.vcf.gz s2.vcf.gz -Oz -o merged.vcf.gz`
Concat regions	`bcftools concat chr1.vcf.gz chr2.vcf.gz -Oz -o all.vcf.gz`
Sort VCF	`bcftools sort input.vcf -Oz -o sorted.vcf.gz`
Intersect	`bcftools isec -p dir a.vcf.gz b.vcf.gz`
Extract samples	`bcftools view -s sample1 input.vcf.gz`
Rename samples	`bcftools reheader -s names.txt input.vcf.gz`

Common Errors

Error	Cause	Solution
`different samples`	merge vs concat confusion	Use merge for samples, concat for regions
`not sorted`	Unsorted input to concat	Sort first or use `-a` flag
`sample name conflict`	Duplicate sample names	Use `--force-samples`
`index required`	Missing index for merge/isec	Run `bcftools index` first
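
The missing-index error can be caught before a long merge or isec run starts. A minimal Python sketch of such a pre-flight check follows, assuming indexes sit next to the data files with a `.csi` or `.tbi` suffix. `missing_indexes` and the demo file names are hypothetical.

```python
import os
import tempfile

def missing_indexes(paths):
    """Return the paths that have neither a .csi nor a .tbi file beside them."""
    return [p for p in paths
            if not (os.path.exists(p + ".csi") or os.path.exists(p + ".tbi"))]

# Demo with empty placeholder files: a.vcf.gz has an index, b.vcf.gz does not.
with tempfile.TemporaryDirectory() as tmp:
    indexed = os.path.join(tmp, "a.vcf.gz")
    bare = os.path.join(tmp, "b.vcf.gz")
    for path in (indexed, indexed + ".csi", bare):
        open(path, "wb").close()
    unindexed = [os.path.basename(p) for p in missing_indexes([indexed, bare])]

print(unindexed)  # ['b.vcf.gz']
```

Any file reported by the check can then be indexed with `bcftools index` before it is passed to merge or isec.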

Related Skills

vcf-basics - View and query VCF files
filtering-best-practices - Filter variants before manipulation
variant-normalization - Normalize before comparing
vcf-statistics - Compare statistics after manipulation

Maintainer

FreedomIntelligence Core maintainer

Source details

Full Name: FreedomIntelligence/OpenClaw-Medical-Skills
Branch: main
Path in repo: skills/variant-interpretation-acmg/bioSkills/vcf-manipulation
Topics: claude-code skills openclaw awesome clawhub openclaw-skills medical nanoclaw
