Agent skill

24-global-methylation-profile

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/24-global-methylation-profile

SKILL.md


name: global-methylation-profile description: This skill performs genome-wide DNA methylation profiling. It supports single-sample and multi-sample workflows to compute methylation density distributions, genomic feature distribution of the methylation profile, and sample-level clustering/PCA. Use it when you want to systematically characterize global methylation patterns from WGBS or similar per-CpG methylation call files.

Global DNA Methylation Profiling

Overview

Main steps include:

  • Refer to the Inputs & Outputs section to check available inputs and design the output structure.
  • Always prompt user for genome assembly used.
  • Always prompt user for which columns are methylation fraction/percent and coverage and strand.
  • Analyze the genomic feature distribution of methylations for each sample.
  • Compute and visualize genome-wide methylation density distributions.
  • For multi-sample datasets, prepare the matrix of methylation data.
  • Perform PCA and hierarchical clustering to assess sample similarity based on global methylation.
  • Never use MCP tools in this skill, use R scripts instead.

When to use this skill

Use the global-methylation-profiling skill when you want to:

  • Characterize global DNA methylation status of one or multiple samples (e.g. normal vs tumor, different cell types).
  • Compare broad methylation patterns across samples:
    • Are some samples globally hypo-/hyper-methylated?
    • Are certain chromosomes or genomic regions more strongly affected?
  • Explore genomic feature of your methylation dataset (e.g. promoter hypomethylation, gene body hypermethylation).
  • Perform unsupervised clustering/PCA to see if samples separate by condition based on genome-wide methylation patterns.

Inputs & Outputs

Inputs

<sample.bed

Outputs

bash
global_methylation_profile/
  stats/
    summary_statistics.tsv
    ...
  plots/
    sample1_genomic_feature_pie.pdf
    sample2_genomic_feature_pie.pdf
    ... # Other samples
    allSamples_methylation_density_overlay.pdf
    PCA_scatterplot.pdf
    sample_correlation_heatmap.pdf
    ...
  logs/
  temp/ # all the temp files

Decision Tree

Step 1: Prepare the sample meta data

r
library(methylKit)
# Example input: Bismark coverage files (chr, start, end, numCs, numTs, strand)
file.list <- list(
  "sample1.cov",
  "sample2.cov",
  "sample3.cov"
)

sample.id <- list("S1", "S2", "S3")
treatment <- c(0, 1, 1)  # e.g. 0 = control, 1 = treated

# Read methylation data
myobj <- methRead(
  location = file.list,
  sample.id = sample.id,
  assembly  = "hg38", # provided by user
  treatment = treatment,
  context   = "CpG",
  pipeline = list(
    fraction = FALSE,  # percMeth is 0–100, fraction is 0-1, depend on inputs
    chr.col = 1,
    start.col = 2,
    end.col = 3,
    strand.col = 6, # provided by user
    coverage.col = 10, # provided by user
    freqC.col = 11 # provided by user
  )
)

# Optional filtering: remove low / extremely high coverage CpGs
filtered.myobj <- filterByCoverage(
  myobj,
  lo.count = 10, lo.perc = NULL,
  hi.count = 99.9, hi.perc = TRUE
)

# Unite CpGs across samples (common CpG sites)
meth <- unite(filtered.myobj, destrand = TRUE)

Step 2: Analyze Genomic Feature Distribution of CpGs

Annotate CpGs with genomic features (promoter, exon, intron, intergenic, etc.) with genomation and summarize where CpGs (or methylated CpGs) are located for each sample

r
library(genomation)
library(TxDb.Hsapiens.UCSC.hg38.knownGene) # depend on user inputs
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
# exons
exons <- unlist(exonsBy(txdb))
names(exons) <- NULL
mcols(exons)$type <- "exon"
# introns
introns <- unlist(intronsByTranscript(txdb))
names(introns) <- NULL
mcols(introns)$type <- "intron"
# promoters
promoters.gr <- promoters(txdb, upstream = 2000, downstream = 200)
names(promoters.gr) <- NULL
mcols(promoters.gr)$type <- "promoter"
# TSS(1bp)
TSSes <- promoters(txdb, upstream = 1, downstream = 1)
names(TSSes) <- NULL
mcols(TSSes)$type <- "TSS"
# 3'UTR
utr3 <- unlist(threeUTRsByTranscript(txdb))
names(utr3) <- NULL
mcols(utr3)$type <- "UTR3"
# 5'UTR
utr5 <- unlist(fiveUTRsByTranscript(txdb))
names(utr5) <- NULL
mcols(utr5)$type <- "UTR5"

gene.obj <- GRangesList(
  promoters = promoters.gr,
  exons     = exons,
  introns   = introns,
  TSSes     = TSSes
  UTR5      = utr5,
  UTR3      = utr3,
  ... # other features
)

for (i in seq_along(filtered.myobj)) {
  sample_id <- filtered.myobj[[i]]@sample.id
  cpg.gr <- as(filtered.myobj[[i]], "GRanges")
  ann.gene <- annotateWithGeneParts(cpg.gr, gene.obj)
  feature.summary <- getTargetAnnotationStats(ann.gene, percentage = TRUE)
  out_tab <- as.data.frame(feature.summary)
  write.table(
    out_tab,
    file      = file.path(plot_dir, paste0(sample_id, "_feature_annotation_stats.tsv")),
    sep       = "\t", quote = FALSE, row.names = FALSE
  )
  pdf(file.path(plot_dir, paste0(sample_id, "_genomic_feature_distribution.pdf")))
  plotTargetAnnotation(
    ann.gene,
    main = paste("Genomic feature distribution of CpGs -", sample_id)
  )
  dev.off()
}

Step 3: Compute & visualize genome-wide methylation density distributions

r
# Convert to percent methylation matrix: rows = CpGs, cols = samples
meth.mat <- percMethylation(meth)  # values 0–100

df.long <- reshape2::melt(
  as.data.frame(meth.mat),
  variable.name = "Sample",
  value.name   = "Methylation"
)

ggplot(df.long, aes(x = Methylation, color = Sample)) +
  geom_density() +
  theme_bw() +
  xlab("Percent methylation") +
  ggtitle("Genome-wide methylation density across samples")

Step 4: PCA & Hierarchical Clustering of Multi-sample Methylation

  • Use CpG methylation profiles across samples to assess sample similarity and batch effects.
r
# Meth matrix: rows = CpGs, cols = samples (0–100)
meth.mat <- percMethylation(meth)

# (Optional) Filter CpGs by variability
cpg.sd <- apply(meth.mat, 1, sd, na.rm = TRUE)
keep.var <- cpg.sd > 0
meth.var <- meth.mat[keep.var, ]
if (sum(keep.var) > 10000) {
  keep.idx <- order(cpg.sd[keep.var], decreasing = TRUE)[1:10000]
  meth.var <- meth.var[keep.idx, ]
}

# Z-score transformation (per CpG) – helps clustering
meth.scaled <- t(scale(t(meth.var)))  # rows scaled

pca <- prcomp(t(meth.scaled), center = FALSE, scale. = FALSE)
pca.df <- data.frame(
  Sample = colnames(meth.scaled),
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  Treatment = factor(treatment, labels = c("Control", "Treatment"))
)

ggplot(pca.df, aes(x = PC1, y = PC2, color = Treatment, label = Sample)) +
  geom_point(size = 3) +
  geom_text(vjust = -1) +
  theme_bw() +
  ggtitle("PCA of global CpG methylation") +
  xlab(paste0("PC1 (", round(summary(pca)$importance[2, 1] * 100, 1), "%)")) +
  ylab(paste0("PC2 (", round(summary(pca)$importance[2, 2] * 100, 1), "%)"))

dist.samples <- dist(t(meth.scaled), method = "euclidean")
hc <- hclust(dist.samples, method = "complete")

plot(hc, main = "Hierarchical clustering of samples (methylation)",
     xlab = "", sub = "")

cor.samples <- cor(meth.var, use = "pairwise.complete.obs")

pheatmap(cor.samples,
         clustering_method = "complete",
         main = "Sample correlation based on CpG methylation")

Didn't find tool you were looking for?

Be as detailed as possible for better results