Agent skill
bio-genome-assembly-scaffolding
Install this agent skill to your Project
npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/bio-genome-assembly-scaffolding
SKILL.md
name: bio-genome-assembly-scaffolding
description: Scaffold contigs into chromosome-level assemblies using Hi-C data with YaHS, 3D-DNA, SALSA2, and validate with BUSCO and contact maps. Use when scaffolding contigs to chromosome-level assemblies.
tool_type: cli
primary_tool: YaHS
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
Genome Scaffolding
Hi-C Data Preprocessing
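# Index the draft assembly: bwa index for alignment, .fai for pairtools --chroms-path
bwa index draft_assembly.fa
samtools faidx draft_assembly.fa
# Align Hi-C reads to the draft assembly (-5SP: Hi-C mode, no mate rescue or pairing)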
bwa mem -5SP -t 16 draft_assembly.fa hic_R1.fq.gz hic_R2.fq.gz | \
samtools view -@ 8 -bhS - > aligned.bam
# Filter for Hi-C contacts (pairtools)
pairtools parse --min-mapq 40 --walks-policy 5unique --max-inter-align-gap 30 \
--nproc-in 8 --nproc-out 8 --chroms-path draft_assembly.fa.fai aligned.bam | \
pairtools sort --nproc 8 | \
pairtools dedup --nproc 8 --mark-dups | \
pairtools split --output-pairs contacts.pairs.gz
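Before scaffolding, it can help to sanity-check the filtered contacts. The following is a minimal sketch, assuming the standard .pairs column order (readID, chrom1, pos1, chrom2, pos2, ...); the helper name `pairs_cis_trans` is hypothetical and not part of pairtools.

```bash
# Hypothetical helper (not a pairtools command): count same-contig (cis)
# and between-contig (trans) contacts in a .pairs stream read from stdin.
pairs_cis_trans() {
  awk -F'\t' '
    /^#/ { next }                          # skip .pairs header lines
    $2 == "!" || $4 == "!" { next }        # skip pairs with an unmapped side
    $2 == $4 { cis++; next }               # both ends on the same contig
    { trans++ }                            # ends on different contigs
    END { printf "cis=%d trans=%d\n", cis, trans }
  '
}

# Usage (assumes the contacts.pairs.gz produced above):
#   gzip -dc contacts.pairs.gz | pairs_cis_trans
```

A draft made of many short contigs will naturally show a large trans fraction, so the counts are most useful for comparing libraries or filter settings rather than as an absolute threshold.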
YaHS Scaffolding (Recommended)
# Index assembly
samtools faidx draft_assembly.fa
# Convert BAM to BED, sorted by read name (column 4) as YaHS expects
bedtools bamtobed -i aligned.bam | sort -k4 > aligned.bed
# Run YaHS
yahs draft_assembly.fa aligned.bed -o scaffolds
# Output files:
# scaffolds_scaffolds_final.fa - final scaffolds
# scaffolds_scaffolds_final.agp - AGP file
# scaffolds.bin - binary Hi-C alignment file (input to `juicer pre` for contact maps)
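The AGP file records how contigs were joined, so a quick tally of scaffolds, contigs, and gaps shows what the run did. This is a sketch that assumes the standard AGP layout (object name in column 1, component type in column 5, with `W` for contigs and `N` or `U` for gaps); `agp_summary` is a hypothetical helper, not a YaHS command.

```bash
# Hypothetical helper: summarise an AGP file given as the first argument.
agp_summary() {
  awk -F'\t' '
    /^#/ { next }                                   # skip AGP comment lines
    !($1 in seen) { seen[$1] = 1; scaffolds++ }     # first row of a new scaffold
    $5 == "W" { contigs++ }                         # sequence component
    $5 == "N" || $5 == "U" { gaps++ }               # gap component
    END { printf "scaffolds=%d contigs=%d gaps=%d\n", scaffolds, contigs, gaps }
  ' "$1"
}

# Usage (assumes the YaHS output name listed above):
#   agp_summary scaffolds_scaffolds_final.agp
```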
YaHS with Error Correction
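# Contig error correction is enabled by default, so no extra flag is needed
yahs draft_assembly.fa aligned.bed -o scaffolds
# To skip error correction (e.g. for an already curated assembly), add --no-contig-ec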
# Generate contact map for juicebox
juicer pre scaffolds.bin scaffolds_scaffolds_final.agp draft_assembly.fa.fai | \
sort -k2,2d -k6,6d -T ./ --parallel=8 -S 50G | \
awk 'NF' > scaffolds.pre.txt
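# Create .hic file (chromosome sizes come from the scaffold FASTA index)
samtools faidx scaffolds_scaffolds_final.fa
cut -f1,2 scaffolds_scaffolds_final.fa.fai > scaffolds.chrom.sizes
java -Xmx48G -jar juicer_tools.jar pre scaffolds.pre.txt scaffolds.hic scaffolds.chrom.sizes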
3D-DNA Pipeline
# Prepare input (requires Juicer aligned data)
# Run Juicer first to get merged_nodups.txt
# Run 3D-DNA
run-asm-pipeline.sh -r 2 draft_assembly.fa merged_nodups.txt
# Output: draft_assembly.final.fasta
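# After reviewing the assembly in Juicebox (JBAT), apply the curated .review.assembly
# (the FASTA argument is the original draft assembly, not the scaffolded output)
run-asm-pipeline-post-review.sh -r draft_assembly.final.review.assembly \
draft_assembly.fa merged_nodups.txt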
SALSA2 Scaffolding
# Run SALSA2
python run_pipeline.py -a draft_assembly.fa -l draft_assembly.fa.fai \
-b aligned.bed -e GATC -o salsa_output -m yes
# With multiple restriction enzymes
python run_pipeline.py -a draft_assembly.fa -l draft_assembly.fa.fai \
-b aligned.bed -e GATC,GANTC -o salsa_output -m yes -p yes
Generate Contact Map
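# Using cooler (contacts.pairs.gz is in draft-contig coordinates; for a scaffold-level map,
# realign the Hi-C reads to the scaffolds and use that .pairs file and .fai instead)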
cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 \
draft_assembly.fa.fai:10000 contacts.pairs.gz scaffolds.cool
# Balance matrix
cooler balance scaffolds.cool
# Multi-resolution (mcool)
cooler zoomify scaffolds.cool -o scaffolds.mcool
Visualize with HiGlass
# Convert a BED annotation (e.g. scaffold boundaries) to a HiGlass beddb track
clodius aggregate bedfile --chromsizes-filename chrom.sizes \
--output-file scaffolds.beddb scaffold_boundaries.bed
# Start a local HiGlass server on port 8888 with ~/hg-data mounted at /data
docker run --detach --publish 8888:80 \
--volume ~/hg-data:/data \
higlass/higlass-docker:latest
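The `chrom.sizes` and `scaffold_boundaries.bed` inputs used above are not produced by any earlier step. One way to derive them is from a FASTA index, sketched below. It assumes that `scaffold_boundaries.bed` is meant to hold one interval per scaffold spanning its full length; both helper names are hypothetical.

```bash
# Hypothetical helpers: derive HiGlass inputs from a samtools .fai index.

# Two-column chromosome sizes file: name, length.
fai_to_chromsizes() {
  cut -f1,2 "$1"
}

# One BED interval per scaffold covering its whole length (0-based start).
fai_to_bed() {
  awk -v OFS='\t' '{ print $1, 0, $2 }' "$1"
}

# Usage (assumes scaffolds.fa has been indexed with samtools faidx):
#   fai_to_chromsizes scaffolds.fa.fai > chrom.sizes
#   fai_to_bed scaffolds.fa.fai > scaffold_boundaries.bed
```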
Manual Curation (Juicebox)
# Load .hic file in Juicebox Assembly Tools (JBAT)
# Perform manual corrections:
# - Break misjoins
# - Order/orient scaffolds
# - Merge scaffolds
# Export corrected assembly
# File -> Export Assembly -> FASTA
Post-Scaffolding Gap Filling
# TGS-GapCloser for long-read gap filling
tgsgapcloser --scaff scaffolds.fa --reads ont_reads.fq.gz \
--output filled --thread 16 --ne
# LR_Gapcloser alternative
LR_Gapcloser.sh -i scaffolds.fa -l ont_reads.fq.gz -t 16 -o gapclosed.fa
Validate Scaffolding
# Check chromosome-scale contiguity
seqkit stats scaffolds.fa
# BUSCO on scaffolds
busco -i scaffolds.fa -l eukaryota_odb10 -o busco_scaffolds -m genome -c 16
# N50/L50 statistics
assembly-stats scaffolds.fa
# Compare pre/post scaffolding
quast.py draft_assembly.fa scaffolds.fa -o quast_comparison
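When only a FASTA index is at hand, N50 and L50 can also be computed directly from the sequence lengths. The sketch below assumes a samtools `.fai` file (name in column 1, length in column 2); `n50_from_fai` is a hypothetical helper, and its numbers should agree with those reported by the tools above.

```bash
# Hypothetical helper: N50/L50 from a .fai index given as the first argument.
# N50 is the length of the sequence at which the running total, taken from
# longest to shortest, first reaches half of the assembly size; L50 is the
# number of sequences needed to get there.
n50_from_fai() {
  sort -k2,2nr "$1" | awk '
    { len[NR] = $2; total += $2 }
    END {
      half = total / 2
      for (i = 1; i <= NR; i++) {
        run += len[i]
        if (run >= half) { printf "N50=%d L50=%d\n", len[i], i; exit }
      }
    }'
}

# Usage, comparing before and after scaffolding:
#   n50_from_fai draft_assembly.fa.fai
#   n50_from_fai scaffolds.fa.fai
```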
Check Telomeres
# Find telomeric repeats (vertebrate TTAGGG); -r enables regex, -P restricts to the forward strand
seqkit locate -i -r -P --bed -p '(TTAGGG){10,}' scaffolds.fa > telomeres_forward.bed
seqkit locate -i -r -P --bed -p '(CCCTAA){10,}' scaffolds.fa > telomeres_reverse.bed
# Count chromosomes with telomeres on both ends
# (the forward strand reads CCCTAA at the left end and TTAGGG at the right end)
samtools faidx scaffolds.fa
awk '$2 < 1000' telomeres_reverse.bed | cut -f1 | sort -u > left_telomeres.txt
awk 'NR==FNR{len[$1]=$2;next} $3 > len[$1]-1000' \
scaffolds.fa.fai telomeres_forward.bed | cut -f1 | sort -u > right_telomeres.txt
comm -12 left_telomeres.txt right_telomeres.txt > complete_chromosomes.txt
Rename to Chromosomes
# After manual curation, rename scaffolds to chromosomes
awk '/^>/{print ">chr" ++i; next}{print}' scaffolds.fa > chromosomes.fa
# Or with a mapping file (tab-separated: old name, new name); ^(\S+) matches the sequence ID only
seqkit replace -p '^(\S+)' -r '{kv}' -k name_mapping.tsv scaffolds.fa > chromosomes.fa
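The `name_mapping.tsv` file has to be written by hand or generated. A possible starting point is to number scaffolds from longest to shortest, sketched below. It assumes a samtools `.fai` index and that every listed scaffold should become a chromosome, so unplaced scaffolds should be removed from the result before use; `make_name_mapping` is a hypothetical helper.

```bash
# Hypothetical helper: build an old-name -> new-name table from a .fai index,
# numbering sequences chr1, chr2, ... in order of decreasing length
# (ties are broken by name so the output is deterministic).
make_name_mapping() {
  sort -k2,2nr -k1,1 "$1" | awk -v OFS='\t' '{ print $1, "chr" NR }'
}

# Usage:
#   make_name_mapping scaffolds.fa.fai > name_mapping.tsv
```

Review the table before renaming, since species with an established karyotype usually follow that numbering rather than assembly length.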
Related Skills
- genome-assembly/long-read-assembly - Generate initial contigs
- genome-assembly/assembly-polishing - Polish before scaffolding
- genome-assembly/assembly-qc - Validate final assembly
- hi-c-analysis/hic-data-io - Hi-C data processing