Differential Principal Component Analysis (dPCA) for ChIP-seq Hongkai Ji (hji@jhsph.edu) Department of Biostatistics The Bloomberg School of Public Health Johns Hopkins University
Functional Genomics Locations and Functions Maston, Evans & Green, Annu Rev Genomics Hum Genet, 2006, 7: 29-59
ChIP-seq Transcription Factor (TF) Gene motif
Motivation: how to compare multiple ChIP profiles between two biological conditions? Cell Type 1 Cell Type 2
Data Structure
Two conditions: Cell Type 1 and Cell Type 2.
Markers profiled in each cell type: Marker 1 (H3K4me3), Marker 2 (H3K27me3), …, Marker M (Myc).
Replicates per marker: Rep 1, …, Rep K1 in Cell Type 1; Rep 1, …, Rep K2 in Cell Type 2.
Loci (rows): Locus 1, Locus 2, …, Locus G.
Cell Type 1 intensities for locus g, marker m, replicate k: x_gmk ~ G(x; μ_1gm, σ^2)
Cell Type 2 intensities for locus g, marker m, replicate k: y_gmk ~ G(x; μ_2gm, σ^2)
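The data layout above can be turned into a locus-by-marker table of observed differences. The sketch below is only an illustration, not the author's code: the function name `difference_matrix`, the nested-list layout, and the toy numbers are all assumptions. It averages the replicates of each marker within each cell type and subtracts the two means.

```python
from statistics import mean

def difference_matrix(x, y):
    """Return D with D[g][m] = mean_k x[g][m][k] - mean_k y[g][m][k].

    x[g][m] and y[g][m] are lists of replicate intensities for locus g and
    marker m in Cell Type 1 and Cell Type 2 (hypothetical layout). The two
    cell types may have different numbers of replicates.
    """
    return [
        [mean(xg[m]) - mean(yg[m]) for m in range(len(xg))]
        for xg, yg in zip(x, y)
    ]

# Toy data: G = 2 loci, M = 2 markers, 2 replicates in Cell Type 1,
# 3 or 2 replicates in Cell Type 2.
x = [[[4.0, 6.0], [1.0, 1.0]],
     [[2.0, 2.0], [3.0, 5.0]]]
y = [[[1.0, 3.0, 2.0], [1.0, 1.0]],
     [[2.0, 2.0], [0.0, 2.0]]]

D = difference_matrix(x, y)
```

Each row of `D` is one locus and each column one marker, so a downstream analysis works on a G x M matrix.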
Modeling True Difference [Figure: sparse matrix in which most entries are 0 and a few are nonzero (*)]
Goals of Analysis
1. Estimate …
2. Infer
(2.a) Rank loci according to each component (based on u_gi);
(2.b) Test u_gi = 0?
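A minimal sketch of how these goals could be carried out on a G x M matrix of observed differences. Everything here is an assumption rather than the method on the slides: the function names are hypothetical, the leading component is found by power iteration on the uncentered Gram matrix (in practice `numpy.linalg.svd` would be the usual choice), and the test treats the loadings as fixed and the noise as normal with known σ, so that a score has standard error σ·sqrt(1/K1 + 1/K2).

```python
import math
import random

def leading_component(D, iters=200, seed=0):
    """Unit-length loading vector of the first component of D (power iteration)."""
    M = len(D[0])
    # Gram matrix C = D^T D (M x M), no centering (assumption).
    C = [[sum(row[a] * row[b] for row in D) for b in range(M)] for a in range(M)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(M)]
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(M)) for a in range(M)]
        norm = math.sqrt(sum(t * t for t in w))
        if norm == 0.0:  # D is all zeros; nothing to estimate
            break
        v = [t / norm for t in w]
    return v

def scores(D, v):
    """Score of each locus on the component: u_g = sum_m D[g][m] * v[m]."""
    return [sum(d * t for d, t in zip(row, v)) for row in D]

def two_sided_p(u, sigma, k1, k2):
    """Normal-theory p-value for u = 0, with loadings treated as fixed (assumption)."""
    se = sigma * math.sqrt(1.0 / k1 + 1.0 / k2)
    return math.erfc(abs(u) / se / math.sqrt(2.0))

# Toy difference matrix: 3 loci, 2 markers.
D = [[3.0, 3.0], [-2.0, -2.0], [0.1, -0.1]]
v = leading_component(D)
u = scores(D, v)
# (2.a) rank loci by the magnitude of their score on the component
ranking = sorted(range(len(u)), key=lambda g: -abs(u[g]))
# (2.b) test u_g = 0 for each locus (sigma, K1, K2 are made-up values)
pvals = [two_sided_p(ug, sigma=1.0, k1=2, k2=2) for ug in u]
```

The sign of a component is arbitrary, so the ranking uses absolute scores. Further components could be obtained by deflating `D` and repeating the iteration.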
Example: K562 vs. Huvec ENCODE Data G = 138,328 MYC motif sites in human genome; M = 18 data sets.
PC1 predicts MYC differential binding better than using each marker individually
Example: K562 vs. Huvec ENCODE Data
G = 138,328 MYC motif sites in human genome; M = 25 data sets.
PC1: 50%; FDR<5%: 65252
PC2: 14%; FDR<5%: 47960
[Figure: PC1 and PC2 across the 25 data sets. Axis labels: H3K27me3, H3K36me3, H4K20me1, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9me1, H3K4me3, H3K27ac, H3K9ac, DNase, FAIRE, CTCF, CTCF, Jun, Max, Pol2, Input, Input, CTCF, Input, Input, Pol2]
Implications TF Cell type 1 TF Cell type 2
Example: K562 vs. Huvec ENCODE Data
G = 24376 human promoters; M = 16 markers.
[Figure axis labels: H3K27me3, H3K36me3, H4K20me1, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9me1, H3K27ac, H3K4me3, H3K9ac, CTCF, Input, CTCF, Input]
PC1 predicts RNA-seq differential expression Cor = 0.6615
False Discovery Rate (FDR) [Figure: sparse matrix in which most entries are 0 and a few are nonzero (*)]
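The slides do not say which FDR procedure is used. As an illustration only, the sketch below applies the standard Benjamini-Hochberg adjustment to a list of per-locus p-values and keeps loci with adjusted value below 5%, matching the "FDR<5%" threshold quoted in the examples. The function name and the toy p-values are assumptions.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values for four loci (made-up numbers).
pvals = [0.001, 0.04, 0.03, 0.5]
qvals = benjamini_hochberg(pvals)
significant = [g for g, q in enumerate(qvals) if q < 0.05]
```

With these toy values only the first locus passes the 5% threshold, even though three raw p-values are below 0.05.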
Summary
• dPCA provides a way to concisely summarize differences between two cell types.
• Differential genes along the major PC have biological meaning.
• Future directions include modeling the signal shapes, multiple conditions, non-linearity, and establishing convergence rate.