Agent skill

vcf-statistics

Stars 2,009
Forks 275

Install this agent skill to your Project

npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/variant-interpretation-acmg/bioSkills/vcf-statistics

SKILL.md


name: bio-vcf-statistics description: Generate variant statistics, sample concordance, and quality metrics using bcftools stats and gtcheck. Use when evaluating variant quality, comparing samples, or summarizing VCF contents. tool_type: cli primary_tool: bcftools measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools:

  • read_file
  • run_shell_command

VCF Statistics

Generate statistics and quality metrics using bcftools.

Statistics Tools

Command Purpose
bcftools stats Comprehensive variant statistics
bcftools gtcheck Sample concordance and relatedness
bcftools query Custom summaries

bcftools stats

Basic Statistics

bash
bcftools stats input.vcf.gz > stats.txt

View Key Metrics

bash
bcftools stats input.vcf.gz | grep "^SN"

Output sections:

  • SN - Summary numbers
  • TSTV - Transitions/transversions
  • SiS - Singleton stats
  • AF - Allele frequency distribution
  • QUAL - Quality distribution
  • IDD - Indel distribution
  • ST - Substitution types
  • DP - Depth distribution

Summary Numbers (SN)

bash
bcftools stats input.vcf.gz | grep "^SN" | cut -f3-

Reports:

  • Number of samples
  • Number of records
  • Number of SNPs
  • Number of indels
  • Number of multiallelic sites
  • Number of multiallelic SNPs

Transition/Transversion Ratio

bash
bcftools stats input.vcf.gz | grep "^TSTV"

Expected Ti/Tv ratio:

  • Whole genome: ~2.0-2.1
  • Exome: ~2.8-3.3

Per-Sample Statistics

bash
bcftools stats -s - input.vcf.gz > per_sample.txt

Compare Two VCFs

bash
bcftools stats input1.vcf.gz input2.vcf.gz > comparison.txt

Region-Specific Stats

bash
bcftools stats -r chr1:1000000-2000000 input.vcf.gz > region_stats.txt

Exome Statistics

bash
bcftools stats -R exome.bed input.vcf.gz > exome_stats.txt

Plotting Statistics

Generate Plots

bash
bcftools stats input.vcf.gz > stats.txt
plot-vcfstats -p output_dir stats.txt

Creates:

  • output_dir/summary.pdf
  • Individual PNG files

Comparison Plots

bash
bcftools stats file1.vcf.gz file2.vcf.gz > comparison.txt
plot-vcfstats -p comparison_dir comparison.txt

bcftools gtcheck

Check Sample Identity

bash
bcftools gtcheck -g reference.vcf.gz query.vcf.gz

Reports concordance between samples.

Detect Sample Swaps

bash
bcftools gtcheck -G 1 input.vcf.gz > relatedness.txt

Compares all samples pairwise.

Output Format

DC  0  sample1  sample2  0.95  1234  1200

Fields:

  • DC: Data type (discordance)
  • Index
  • Sample 1
  • Sample 2
  • Discordance rate
  • Sites compared
  • Discordant sites

Check Against Reference Panel

bash
bcftools gtcheck -g 1000genomes.vcf.gz unknown_sample.vcf.gz

Quick Statistics with Query

Count Variants

bash
bcftools view -H input.vcf.gz | wc -l

Count by Type

bash
# SNPs
bcftools view -v snps -H input.vcf.gz | wc -l

# Indels
bcftools view -v indels -H input.vcf.gz | wc -l

Count PASS Variants

bash
bcftools view -f PASS -H input.vcf.gz | wc -l

Quality Distribution

bash
bcftools query -f '%QUAL\n' input.vcf.gz | \
    awk '{sum+=$1; count++} END {print "Mean QUAL:", sum/count}'

Depth Distribution

bash
bcftools query -f '%INFO/DP\n' input.vcf.gz | \
    awk '{sum+=$1; count++} END {print "Mean DP:", sum/count}'

Genotype Counts

bash
# Count heterozygous sites per sample
bcftools query -f '[%GT\t]\n' input.vcf.gz | \
    awk -F'\t' '{for(i=1;i<=NF;i++) if($i=="0/1" || $i=="0|1") het[i]++}
        END {for(i in het) print "Sample", i, "het:", het[i]}'

Allele Frequency Spectrum

bash
bcftools query -f '%INFO/AF\n' input.vcf.gz | \
    awk '{
        if($1<0.01) rare++
        else if($1<0.05) low++
        else if($1<0.5) common++
        else freq++
    } END {
        print "Rare (<1%):", rare
        print "Low (1-5%):", low
        print "Common (5-50%):", common
        print "Frequent (>50%):", freq
    }'

Sample Statistics

List Samples

bash
bcftools query -l input.vcf.gz

Count Samples

bash
bcftools query -l input.vcf.gz | wc -l

Per-Sample Variant Counts

bash
for sample in $(bcftools query -l input.vcf.gz); do
    count=$(bcftools view -s "$sample" -H input.vcf.gz | \
        bcftools view -c 1 -H | wc -l)
    echo "$sample: $count"
done

Missing Genotypes per Sample

bash
bcftools stats -s - input.vcf.gz | grep "^PSC"

cyvcf2 Statistics

Basic Counts

python
from cyvcf2 import VCF

stats = {'snps': 0, 'indels': 0, 'other': 0}

for variant in VCF('input.vcf.gz'):
    if variant.is_snp:
        stats['snps'] += 1
    elif variant.is_indel:
        stats['indels'] += 1
    else:
        stats['other'] += 1

print(f'SNPs: {stats["snps"]}')
print(f'Indels: {stats["indels"]}')
print(f'Other: {stats["other"]}')

Quality Statistics

python
from cyvcf2 import VCF
import numpy as np

quals = []
for variant in VCF('input.vcf.gz'):
    if variant.QUAL:
        quals.append(variant.QUAL)

quals = np.array(quals)
print(f'Mean QUAL: {np.mean(quals):.1f}')
print(f'Median QUAL: {np.median(quals):.1f}')
print(f'Min QUAL: {np.min(quals):.1f}')
print(f'Max QUAL: {np.max(quals):.1f}')

Genotype Distribution

python
from cyvcf2 import VCF

vcf = VCF('input.vcf.gz')
samples = vcf.samples

hom_ref = [0] * len(samples)
het = [0] * len(samples)
hom_alt = [0] * len(samples)
missing = [0] * len(samples)

for variant in vcf:
    for i, gt in enumerate(variant.gt_types):
        if gt == 0:
            hom_ref[i] += 1
        elif gt == 1:
            het[i] += 1
        elif gt == 3:
            hom_alt[i] += 1
        else:
            missing[i] += 1

for i, sample in enumerate(samples):
    print(f'{sample}: HOM_REF={hom_ref[i]}, HET={het[i]}, HOM_ALT={hom_alt[i]}, MISS={missing[i]}')

Transition/Transversion Calculation

python
from cyvcf2 import VCF

transitions = 0
transversions = 0

ti_pairs = {('A', 'G'), ('G', 'A'), ('C', 'T'), ('T', 'C')}

for variant in VCF('input.vcf.gz'):
    if not variant.is_snp:
        continue
    ref = variant.REF
    alt = variant.ALT[0]
    if (ref, alt) in ti_pairs:
        transitions += 1
    else:
        transversions += 1

ratio = transitions / transversions if transversions > 0 else 0
print(f'Transitions: {transitions}')
print(f'Transversions: {transversions}')
print(f'Ti/Tv ratio: {ratio:.2f}')

Common Workflows

Quality Control Report

bash
# Generate stats
bcftools stats input.vcf.gz > stats.txt

# Extract key metrics
echo "=== VCF Summary ==="
grep "^SN" stats.txt | cut -f3-

echo ""
echo "=== Ti/Tv Ratio ==="
grep "^TSTV" stats.txt | cut -f5

# Generate plots
plot-vcfstats -p qc_plots stats.txt

Compare Before/After Filtering

bash
bcftools stats raw.vcf.gz filtered.vcf.gz > comparison.txt

echo "=== Before Filtering ==="
grep "^SN.*raw" comparison.txt | cut -f3-

echo ""
echo "=== After Filtering ==="
grep "^SN.*filtered" comparison.txt | cut -f3-

Sample Relatedness Check

bash
bcftools gtcheck -G 1 cohort.vcf.gz > relatedness.txt
cat relatedness.txt

Quick Reference

Task Command
Full stats bcftools stats input.vcf.gz
Summary only bcftools stats input.vcf.gz | grep "^SN"
Ti/Tv ratio bcftools stats input.vcf.gz | grep "^TSTV"
Per-sample bcftools stats -s - input.vcf.gz
Compare VCFs bcftools stats file1.vcf.gz file2.vcf.gz
Sample check bcftools gtcheck -G 1 input.vcf.gz
Plot stats plot-vcfstats -p dir stats.txt

Common Errors

Error Cause Solution
No data Empty VCF Check if VCF has variants
plot-vcfstats not found Not installed Install with bcftools
Cannot open Invalid VCF Check file format

Related Skills

  • vcf-basics - View and query VCF files
  • filtering-best-practices - Evaluate filter impact
  • vcf-manipulation - Compare call sets
  • variant-calling - Assess calling quality

Expand your agent's capabilities with these related and highly-rated skills.

FreedomIntelligence/OpenClaw-Medical-Skills

vcf-annotator

Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

chemist-analyst

Analyzes events through chemistry lens using molecular structure, reaction mechanisms, thermodynamics, kinetics, and analytical techniques (spectroscopy, chromatography, mass spectrometry). Provides insights on chemical processes, material properties, reaction pathways, synthesis, and analytical methods. Use when: Chemical reactions, material analysis, synthesis planning, process optimization, environmental chemistry. Evaluates: Molecular structure, reaction mechanisms, yield, selectivity, safety, environmental impact.

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

bio-alignment-io

Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

sleep-analyzer

分析睡眠数据、识别睡眠模式、评估睡眠质量,并提供个性化睡眠改善建议。支持与其他健康数据的关联分析。

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

metabolomics-workbench-database

Access NIH Metabolomics Workbench via REST API (4,200+ studies). Query metabolites, RefMet nomenclature, MS/NMR data, m/z searches, study metadata, for metabolomics and biomarker discovery.

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

bio-hi-c-analysis-matrix-operations

Balance, normalize, and transform Hi-C contact matrices using cooler and cooltools. Apply iterative correction (ICE), compute expected values, and generate observed/expected matrices. Use when normalizing or transforming Hi-C matrices.

2,009 275
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results