Agent skill
vcf-manipulation
Install this agent skill to your Project
npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/variant-interpretation-acmg/bioSkills/vcf-manipulation
SKILL.md
name: bio-vcf-manipulation
description: Merge, concatenate, sort, intersect, and subset VCF files using bcftools. Use when combining variant files, comparing call sets, or restructuring VCF data.
tool_type: cli
primary_tool: bcftools
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
VCF Manipulation
Merge, concat, sort, and compare VCF files using bcftools.
Operations Overview
| Operation | Command | Use Case |
|---|---|---|
| Merge | bcftools merge | Combine samples from multiple VCFs |
| Concat | bcftools concat | Combine regions from multiple VCFs |
| Sort | bcftools sort | Sort unsorted VCF |
| Intersect | bcftools isec | Compare/intersect call sets |
| Subset | bcftools view | Extract samples or regions |
bcftools merge
Combine multiple VCF files with different samples at the same positions.
Basic Merge
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
Merge Multiple Files
bcftools merge *.vcf.gz -Oz -o all_samples.vcf.gz
Merge from File List
# files.txt: one VCF path per line
bcftools merge -l files.txt -Oz -o merged.vcf.gz
Handle Missing Genotypes
# Output missing genotypes as ./. (default)
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
# Output missing as reference (0/0)
bcftools merge --missing-to-ref sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
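The difference between the two behaviors can be sketched with a toy model in plain Python. This is not how bcftools is implemented: the record layout (a dict keyed by chromosome, position, REF and ALT) and the function name `toy_merge` are assumptions made purely for illustration.

```python
# Toy model of a sample-wise merge; NOT the bcftools implementation.
# Each input is a (sample_names, records) pair, where records maps a
# (chrom, pos, ref, alt) site key to {sample_name: genotype_string}.

def toy_merge(inputs, missing_to_ref=False):
    """Union the sites of all inputs and fill genotypes for absent samples."""
    fill = "0/0" if missing_to_ref else "./."
    # Sample columns follow the order of the input files.
    all_samples = [name for samples, _ in inputs for name in samples]
    all_sites = sorted({site for _, records in inputs for site in records})
    merged = {}
    for site in all_sites:
        row = {}
        for samples, records in inputs:
            genotypes = records.get(site)
            for name in samples:
                # A sample whose file lacks this site gets the fill value.
                row[name] = genotypes[name] if genotypes else fill
        merged[site] = row
    return all_samples, merged


s1 = (["sample1"], {("chr1", 100, "A", "G"): {"sample1": "0/1"}})
s2 = (["sample2"], {("chr1", 200, "C", "T"): {"sample2": "1/1"}})

_, default_merge = toy_merge([s1, s2])
_, ref_merge = toy_merge([s1, s2], missing_to_ref=True)
print(default_merge[("chr1", 100, "A", "G")])
print(ref_merge[("chr1", 100, "A", "G")])
```

Filling with 0/0 asserts that a sample is homozygous reference at sites where it may simply have had no call, so the default ./. is usually the safer choice unless the inputs are known to cover the same regions.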
Force Sample Names
When sample names conflict:
bcftools merge --force-samples sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
Merge Specific Regions
bcftools merge -r chr1:1000000-2000000 sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
bcftools concat
Combine VCF files with same samples from different regions.
Concatenate Chromosomes
bcftools concat chr1.vcf.gz chr2.vcf.gz chr3.vcf.gz -Oz -o genome.vcf.gz
Concatenate All Chromosomes
bcftools concat chr*.vcf.gz -Oz -o genome.vcf.gz
From File List
# files.txt: one VCF path per line (in order)
bcftools concat -f files.txt -Oz -o concatenated.vcf.gz
Allow Overlapping Regions
bcftools concat -a chr1_part1.vcf.gz chr1_part2.vcf.gz -Oz -o chr1.vcf.gz
Remove Duplicates
bcftools concat -a -d all file1.vcf.gz file2.vcf.gz -Oz -o merged.vcf.gz
Options for -d:
- snps - Remove duplicate SNPs
- indels - Remove duplicate indels
- both - Remove duplicate SNPs and indels
- all - Remove all duplicates
- exact - Remove exact duplicates only
bcftools sort
Sort VCF by chromosome and position.
Basic Sort
bcftools sort input.vcf -Oz -o sorted.vcf.gz
With Temporary Directory
For large files:
bcftools sort -T /tmp input.vcf.gz -Oz -o sorted.vcf.gz
Memory Limit
bcftools sort -m 4G input.vcf.gz -Oz -o sorted.vcf.gz
bcftools isec
Intersect and compare VCF files.
Find Shared Variants
bcftools isec -p output_dir sample1.vcf.gz sample2.vcf.gz
Creates:
- 0000.vcf - Private to sample1
- 0001.vcf - Private to sample2
- 0002.vcf - Shared (sample1 records)
- 0003.vcf - Shared (sample2 records)
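The four outputs can be pictured as a set partition. The following is a minimal sketch in plain Python, assuming records are matched on chromosome, position, REF and ALT; the dict layout and the name `toy_isec` are hypothetical and do not reflect bcftools internals.

```python
# Toy partition of two call sets into the four outputs listed above.
# Keys are (chrom, pos, ref, alt); values stand in for the rest of the record.

def toy_isec(a, b):
    """Split records of a and b into private and shared groups."""
    return {
        "0000": {k: a[k] for k in a if k not in b},  # private to first input
        "0001": {k: b[k] for k in b if k not in a},  # private to second input
        "0002": {k: a[k] for k in a if k in b},      # shared, first input's records
        "0003": {k: b[k] for k in b if k in a},      # shared, second input's records
    }


a = {("chr1", 100, "A", "G"): "QUAL=50", ("chr1", 200, "C", "T"): "QUAL=30"}
b = {("chr1", 200, "C", "T"): "QUAL=45", ("chr2", 50, "G", "A"): "QUAL=60"}

parts = toy_isec(a, b)
for name, records in parts.items():
    print(name, sorted(records))
```

The shared site appears twice, once per input, because each file may carry different annotations (here, different QUAL values) for the same variant.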
Output Compressed
bcftools isec -p output_dir -Oz sample1.vcf.gz sample2.vcf.gz
Intersection Only
bcftools isec -p output_dir -n=2 sample1.vcf.gz sample2.vcf.gz
# Only outputs variants present in exactly 2 files
Comparison Options
| Flag | Description |
|---|---|
| -n=2 | Present in exactly 2 files |
| -n+2 | Present in 2 or more files |
| -n-2 | Present in 2 or fewer files |
| -n~11 | Boolean: file1 AND file2 |
| -n~10 | Boolean: file1 AND NOT file2 |
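As a rough illustration of how these selectors read, the sketch below evaluates a selector against a presence string with one character per input file ('1' if the site is in that file). The function name `matches` is hypothetical, and the treatment of `-` as "this many or fewer" follows the wording of the bcftools manual; this is a toy model, not bcftools code.

```python
# Toy evaluation of -n selectors against a per-file presence string.

def matches(selector, presence):
    """Return True if a site with this presence pattern passes the selector.

    selector: text after -n, e.g. "=2", "+2", "-1", "~10"
    presence: e.g. "10" means present in file1, absent from file2
    """
    op, arg = selector[0], selector[1:]
    count = presence.count("1")
    if op == "=":
        return count == int(arg)   # exactly this many files
    if op == "+":
        return count >= int(arg)   # this many or more
    if op == "-":
        return count <= int(arg)   # this many or fewer
    if op == "~":
        return presence == arg     # exact presence pattern
    raise ValueError(f"unknown selector: {selector!r}")


for selector in ("=2", "+2", "-1", "~10"):
    passing = [p for p in ("10", "01", "11") if matches(selector, p)]
    print(selector, passing)
```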
Two-File Intersection
# Variants in both files
bcftools isec -n=2 -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o shared.vcf.gz
# Variants only in sample1
bcftools isec -n~10 -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o only_sample1.vcf.gz
Complement Mode
# Variants in file1 not in file2 (-w1 writes sample1's records as VCF rather than a site list)
bcftools isec -C -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o unique.vcf.gz
Subsetting VCF Files
Extract Samples
bcftools view -s sample1,sample2 input.vcf.gz -Oz -o subset.vcf.gz
Exclude Samples
bcftools view -s ^sample3 input.vcf.gz -Oz -o without_sample3.vcf.gz
From Sample List File
# samples.txt: one sample name per line
bcftools view -S samples.txt input.vcf.gz -Oz -o subset.vcf.gz
Extract Region
bcftools view -r chr1:1000000-2000000 input.vcf.gz -Oz -o region.vcf.gz
Extract Multiple Regions
bcftools view -R regions.bed input.vcf.gz -Oz -o targets.vcf.gz
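One detail worth keeping in mind when preparing regions.bed: region strings passed to -r are 1-based and inclusive, whereas BED files are 0-based and half-open. A small helper for converting between the two might look like the sketch below; the function name `region_to_bed` is hypothetical.

```python
# Convert a 1-based inclusive region string into a 0-based half-open BED line.

def region_to_bed(region):
    """'chr1:1000000-2000000' -> 'chr1<TAB>999999<TAB>2000000'"""
    # rsplit keeps contig names that themselves contain ':' intact.
    chrom, span = region.rsplit(":", 1)
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    if start < 1 or end < start:
        raise ValueError(f"invalid region: {region!r}")
    # BED start is one less than the 1-based start; the end is unchanged.
    return f"{chrom}\t{start - 1}\t{end}"


regions = ["chr1:1000000-2000000", "chr2:500-1500"]
bed_text = "\n".join(region_to_bed(r) for r in regions) + "\n"
print(bed_text, end="")
```

Writing `bed_text` to a file with a .bed suffix gives an input for -R that covers the same bases as the original region strings.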
Renaming Samples
Single Sample
echo "old_name new_name" > rename.txt
bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz
Multiple Samples
# rename.txt format: old_name new_name
cat > rename.txt << EOF
sample1 patient_001
sample2 patient_002
sample3 patient_003
EOF
bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz
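When the mapping comes from a sample sheet rather than being typed by hand, the rename file can be generated. The sketch below builds the "old_name new_name" lines from a dict and rejects mappings that would be ambiguous; the function name `rename_lines` is hypothetical, and refusing names that contain whitespace is a simplification chosen for this example.

```python
# Build the contents of a rename file from an old-name -> new-name mapping.

def rename_lines(mapping):
    """Return 'old new' lines, one per sample, ending with a newline."""
    new_names = list(mapping.values())
    if len(set(new_names)) != len(new_names):
        raise ValueError("new sample names must be unique")
    lines = []
    for old, new in mapping.items():
        # Whitespace separates the two columns, so names containing it
        # are refused here instead of being escaped.
        if any(ch.isspace() for ch in old + new):
            raise ValueError(f"whitespace in sample name: {old!r} -> {new!r}")
        lines.append(f"{old} {new}")
    return "\n".join(lines) + "\n"


mapping = {
    "sample1": "patient_001",
    "sample2": "patient_002",
    "sample3": "patient_003",
}
print(rename_lines(mapping), end="")
```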
Splitting VCF Files
Split by Sample
for sample in $(bcftools query -l input.vcf.gz); do
    bcftools view -s "$sample" input.vcf.gz -Oz -o "${sample}.vcf.gz"
done
Split by Chromosome
for chr in $(bcftools view -h input.vcf.gz | grep "^##contig" | sed 's/.*ID=\([^,>]*\).*/\1/'); do
    bcftools view -r "$chr" input.vcf.gz -Oz -o "${chr}.vcf.gz"
done
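The contig names used by this loop come from the ##contig header lines. For scripts that already hold the header as text, a stdlib-only way to extract them is sketched below; the function name `contig_ids` is hypothetical, and the pattern assumes IDs contain neither commas nor a closing angle bracket.

```python
import re

# Capture the ID value of a ##contig line, whether or not other fields follow.
CONTIG_ID = re.compile(r"^##contig=<.*?\bID=([^,>]+)")


def contig_ids(header_text):
    """Return contig IDs in the order they appear in the header."""
    ids = []
    for line in header_text.splitlines():
        match = CONTIG_ID.match(line)
        if match:
            ids.append(match.group(1))
    return ids


header = (
    "##fileformat=VCFv4.2\n"
    "##contig=<ID=chr1,length=248956422>\n"
    "##contig=<ID=chrM>\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)
print(contig_ids(header))
```

A header may list contigs that have no records, so splitting by this list can produce output files that contain only a header.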
Split Multiallelic Sites
bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz
Common Workflows
Merge Cohort VCFs
# Create file list
ls *.vcf.gz > files.txt
# Merge all samples
bcftools merge -l files.txt -Oz -o cohort.vcf.gz
bcftools index cohort.vcf.gz
Combine Chromosome VCFs
# After parallel variant calling by chromosome
bcftools concat chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz -Oz -o genome.vcf.gz
bcftools index genome.vcf.gz
Compare Two Callers
# Find variants called by both GATK and bcftools
bcftools isec -p comparison gatk.vcf.gz bcftools.vcf.gz
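# Count records per output file (wc -l would also count header lines)
for f in comparison/*.vcf; do echo "$f: $(grep -vc '^#' "$f")"; done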
Extract Passing Variants
bcftools view -f PASS input.vcf.gz -Oz -o pass_only.vcf.gz
bcftools index pass_only.vcf.gz
cyvcf2 Python Operations
Note: True VCF merging (combining samples at matching positions) is complex.
Use bcftools merge for production work. cyvcf2 is better for filtering/querying.
Concatenate Records (Not True Merge)
from cyvcf2 import VCF, Writer
# WARNING: This concatenates records, not a true merge
# For actual merging of samples, use bcftools merge
vcf1 = VCF('file1.vcf.gz')
writer = Writer('combined.vcf', vcf1)
for variant in vcf1:
    writer.write_record(variant)
writer.close()
vcf1.close()
Find Shared Positions
from cyvcf2 import VCF
# Load positions from first VCF
vcf1_positions = set()
for variant in VCF('sample1.vcf.gz'):
    vcf1_positions.add((variant.CHROM, variant.POS))
# Check second VCF
shared = 0
unique = 0
for variant in VCF('sample2.vcf.gz'):
    if (variant.CHROM, variant.POS) in vcf1_positions:
        shared += 1
    else:
        unique += 1
print(f'Shared: {shared}')
print(f'Unique to sample2: {unique}')
Quick Reference
| Task | Command |
|---|---|
| Merge samples | bcftools merge s1.vcf.gz s2.vcf.gz -Oz -o merged.vcf.gz |
| Concat regions | bcftools concat chr1.vcf.gz chr2.vcf.gz -Oz -o all.vcf.gz |
| Sort VCF | bcftools sort input.vcf -Oz -o sorted.vcf.gz |
| Intersect | bcftools isec -p dir a.vcf.gz b.vcf.gz |
| Extract samples | bcftools view -s sample1 input.vcf.gz |
| Rename samples | bcftools reheader -s names.txt input.vcf.gz |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| different samples | merge vs concat confusion | Use merge for samples, concat for regions |
| not sorted | Unsorted input to concat | Sort first or use -a flag |
| sample name conflict | Duplicate sample names | Use --force-samples |
| index required | Missing index for merge/isec | Run bcftools index first |
Related Skills
- vcf-basics - View and query VCF files
- filtering-best-practices - Filter variants before manipulation
- variant-normalization - Normalize before comparing
- vcf-statistics - Compare statistics after manipulation