ChIP-chip Data, Model and Analysis. Ying Nian Wu, Dept. of Statistics, UCLA. Joint with Ming Zheng, Leah Barrera, Bing Ren. ChIP-chip. A technology for isolation and identification of the DNA sequences occupied by specific DNA binding proteins (regulatory sequences) in living cells.
ChIP-chip Data, Model and Analysis. Ying Nian Wu, Dept. of Statistics, UCLA. Joint with Ming Zheng, Leah Barrera, Bing Ren
ChIP-chip • A technology for isolation and identification of the DNA sequences occupied by specific DNA binding proteins (regulatory sequences) in living cells. • Chromatin-immunoprecipitation and microarray analysis (chip) are combined to study protein-DNA interaction in vivo. • Also known as “genome-wide location analysis”.
ChIP-chip process Step 1: Bound transcription factors are cross-linked to DNA with formaldehyde.
ChIP-chip process (cont’d) Step 2: Sonication is used to break genomic DNA into small fragments (of various lengths that are difficult to measure, roughly 1-2 kb).
ChIP-chip process (cont’d) Step 3: A specific antibody is added to immunoprecipitate the DNA segments cross-linked with the target protein.
ChIP-chip process (cont’d) Step 4.1: The cross-linking between DNA and protein is reversed, and the DNA is amplified by LM-PCR and labeled with the fluorescent dye Cy5.
ChIP-chip process (cont’d) Step 4.2: As a negative control, a sample of DNA that is not enriched by the immunoprecipitation process is also amplified by LM-PCR and labeled with another dye, Cy3.
ChIP-chip process (cont’d) Step 5: Both IP-enriched and IP-unenriched samples are hybridized to the same oligonucleotide array.
ChIP-chip process (cont’d) Step 6: The microarray is scanned, Cy5 and Cy3 signal strengths are extracted, and log(Cy5/Cy3) is calculated after normalization. Ren, B. UCSD
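The ratio computation in Step 6 can be sketched as follows. The slides do not say which normalization or which logarithm base is used, so this sketch assumes base 2 and median-centering of the log ratios purely for illustration; the function name `log_ratios` and the intensity values are hypothetical.

```python
import math
from statistics import median

def log_ratios(cy5, cy3):
    """Return median-centered log2(Cy5/Cy3), one value per probe.

    Assumption: median-centering stands in for the unspecified
    normalization step; base 2 is also an assumption.
    """
    raw = [math.log2(a / b) for a, b in zip(cy5, cy3)]
    center = median(raw)
    return [r - center for r in raw]

# Hypothetical intensities for five neighboring probes.
cy5 = [200.0, 400.0, 1600.0, 400.0, 200.0]
cy3 = [100.0, 100.0, 100.0, 100.0, 100.0]
print(log_ratios(cy5, cy3))
```

After centering, background probes sit near zero, which matches the zero-mean background assumed later for testing.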
Summary of ChIP-chip • Protein bound to DNA • Sonication • Immunoprecipitation • Amplify DNA and add control • Hybridize to probes • Microarray analysis Ren, B. UCSD
ChIP-chip data One probe is one data point in the dataset. The x-axis represents the genomic position of the probe. The y-axis (the height) denotes the signal strength log(Cy5/Cy3) of each probe. SignalMap, NimbleGen Inc.
A closer look SignalMap, NimbleGen Inc.
Cy5 signal • The Cy5 signal strength at a point should be proportional to the probability that an IP-enriched segment contains that point.
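Single binding site scenario • Assume there is only one binding site at the origin. • For a segment to contribute to the signal at position $x > 0$ (symmetrically for $x < 0$): 1) this binding site is bound by protein, 2) no cut should occur between 0 and $x$. • Signal at $x$ is proportional to (approx): $P(\text{bound}) \cdot \exp\left(-\int_0^{x} \lambda(s)\,ds\right)$, where $\lambda(s)$ denotes the rate of sonication cuts at position $s$.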
Model derivation • Assume the cut rate $\lambda$ to be constant around the binding site. Then the Cy5 signal strength decreases exponentially with distance from the binding site. • Log(Cy5/Cy3) therefore decreases linearly with distance from the binding site: triangular shape.
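A small simulation can illustrate the exponential decay, assuming that sonication cuts along the DNA follow a Poisson process with a constant rate. The rate value, the trial count and the function name `no_cut_fraction` are hypothetical choices for this sketch.

```python
import math
import random

def no_cut_fraction(x, rate, trials=20000, seed=0):
    """Estimate P(no cut in (0, x)) when cuts follow a Poisson process."""
    rng = random.Random(seed)
    # In a Poisson process the distance to the first cut is exponential.
    hits = sum(1 for _ in range(trials) if rng.expovariate(rate) > x)
    return hits / trials

rate = 0.002  # hypothetical number of cuts per base pair
for x in (0, 250, 500, 750, 1000):
    estimate = no_cut_fraction(x, rate)
    # Compare the simulated fraction with exp(-rate * x); its log is linear in x.
    print(x, round(estimate, 3), round(math.exp(-rate * x), 3))
```

Because the survival probability is exp(-rate * x), its logarithm falls off linearly with distance, which is the straight arm of the triangle.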
Regression to fit triangle A simple case: probes are evenly spaced.
Best fitted triangle • With the left and right boundaries fixed, we can identify the slopes and intercept by regression. • Among the different combinations of left and right boundaries, choose the one with the minimum residual variance. • This is the best fitted triangle centered at the probe we are considering.
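The boundary search can be sketched as below for evenly spaced probes. This is a minimal illustration, not the Mpeak source code: it assumes a shared intercept at the center probe with separate left and right slopes, measures residual variance as RSS / (n - 3), and requires at least two probes on each side so that the variance is defined. The names `fit_triangle` and `best_triangle` and the window limits are hypothetical.

```python
def fit_triangle(y, center, left, right):
    """Least-squares fit of y[i] ~ a + bl*min(d, 0) + br*max(d, 0), d = i - center.

    Uses probes center-left .. center+right (left, right >= 1).
    Returns (a, bl, br, residual_variance).
    """
    idx = range(center - left, center + right + 1)
    ys = [y[i] for i in idx]
    u = [min(i - center, 0) for i in idx]  # left-arm regressor
    v = [max(i - center, 0) for i in idx]  # right-arm regressor
    n = len(ys)
    su, sv, sy = sum(u), sum(v), sum(ys)
    suu = sum(t * t for t in u)
    svv = sum(t * t for t in v)
    suy = sum(p * q for p, q in zip(u, ys))
    svy = sum(p * q for p, q in zip(v, ys))
    # u*v is always 0, so the 3x3 normal equations solve in closed form.
    a = (sy - su * suy / suu - sv * svy / svv) / (n - su * su / suu - sv * sv / svv)
    bl = (suy - a * su) / suu
    br = (svy - a * sv) / svv
    rss = sum((yy - (a + bl * uu + br * vv)) ** 2 for yy, uu, vv in zip(ys, u, v))
    return a, bl, br, rss / (n - 3)

def best_triangle(y, center, min_half=2, max_half=6):
    """Try every (left, right) pair; keep the smallest residual variance."""
    best = None
    for left in range(min_half, max_half + 1):
        for right in range(min_half, max_half + 1):
            if center - left < 0 or center + right >= len(y):
                continue  # window would run off the data
            fit = fit_triangle(y, center, left, right)
            if best is None or fit[3] < best[0]:
                best = (fit[3], left, right, fit)
    return best  # (residual_variance, left, right, (a, bl, br, variance))

# Hypothetical noise-free triangle peaking at probe 5 on a flat background.
y = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 2.5, 2.0, 1.5, 1.0, 0.5]
print(best_triangle(y, 5))
```

On this toy input, windows that reach into the flat background on the left fit worse than windows that stay on the triangle, which is how the search picks its range.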
Mpeak process • Sort the local maxima by signal strength, strongest first. • For the first local maximum, find the best fitted triangle in a small neighborhood and identify its center as a peak.
Mpeak process (cont’d) • For any other local maximum within the range of this triangle, if the difference between the two fitted values is small, mark it as a non-peak. • Continue this process until every local maximum has been considered or the remaining local maxima fall below a threshold.
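The strongest-first loop can be sketched as greedy peak selection with neighborhood suppression. This is a simplified stand-in for the procedure on the slides, not Mpeak itself: a fixed half-width replaces the range of the fitted triangle, and every weaker local maximum inside that range is suppressed without comparing fitted values. The function name `greedy_peaks`, the half-width and the threshold are hypothetical.

```python
def greedy_peaks(y, half_width=3, threshold=1.0):
    """Pick peaks strongest-first and suppress weaker local maxima nearby."""
    maxima = [i for i in range(1, len(y) - 1) if y[i - 1] < y[i] >= y[i + 1]]
    maxima.sort(key=lambda i: y[i], reverse=True)
    suppressed = set()
    peaks = []
    for i in maxima:
        if y[i] < threshold:
            break  # all remaining maxima are weaker still
        if i in suppressed:
            continue
        peaks.append(i)
        # Assumption: a fixed half-width stands in for the fitted triangle's range.
        suppressed.update(j for j in maxima if j != i and abs(j - i) <= half_width)
    return sorted(peaks)

# Hypothetical log-ratio values for fifteen evenly spaced probes.
y = [0.0, 0.2, 0.1, 1.0, 2.0, 3.0, 2.2, 2.4, 1.0, 0.3, 0.1, 1.5, 0.2, 0.4, 0.1]
print(greedy_peaks(y))  # [5, 11]
```

The bump at probe 7 sits on the shoulder of the stronger peak at probe 5 and is suppressed, while the maximum at probe 13 falls below the threshold.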
P value of peaks • Null hypothesis: the background signal in ChIP-chip data follows a normal distribution with mean 0. • The statistic used for testing is zero-mean and variance-stabilized. • Background signals are not independent: probes close to each other tend to be covered by the same segment.
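As a point of reference for the normal null hypothesis, the upper-tail p-value of a single zero-mean normal statistic can be computed as below. This is only a sketch of the tail calculation: the statistic itself is not reproduced here, the standard deviation `sigma` is assumed known, and the dependence between neighboring probes noted above is ignored. The function name `upper_tail_p` is hypothetical.

```python
import math

def upper_tail_p(value, sigma):
    """One-sided p-value of `value` under a N(0, sigma^2) null."""
    # P(Z > z) = 0.5 * erfc(z / sqrt(2)) for a standard normal Z.
    return 0.5 * math.erfc(value / (sigma * math.sqrt(2.0)))

for value in (0.0, 1.0, 2.0, 3.0):
    print(value, upper_tail_p(value, sigma=1.0))
```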
Result SignalMap, NimbleGen Inc.
Result • 9,328 promoters for known transcripts • 1,196 putative promoters for unknown transcripts. Kim, T.H. et al. (2005) A high-resolution map of active promoters in the human genome. Nature, 436, 876-880. Ren, B. UCSD
Multi-resolution Peak tree
Why use a model? • A promoter is characterized not only by a large probe signal, but also by a truncated triangle shape. • Identify the neighboring probes whose signals are caused by the same promoter, and pool their information to rank the potential binding sites. SignalMap, NimbleGen Inc.
Model justification • Intuitively, human vision recognizes the local shape, rather than a single probe, to detect peaks. • Model fitting improves detection: 1) the largest signal may not always be the tip of the best fitted triangle, 2) we can handle outliers caused by malfunctioning probes. • For window smoothing, if the window size is not chosen well, a local maximum of the window average may well be the bottom of a valley.
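The window-smoothing pitfall is easy to reproduce on a toy signal. The values below are hypothetical: two sharp peaks with a valley between them, smoothed with a centered moving average that is wide enough to cover both peaks at once. The function name `window_average` is an assumption of this sketch.

```python
def window_average(y, half):
    """Centered moving average; out[k] is the mean around index k + half."""
    width = 2 * half + 1
    return [sum(y[i - half:i + half + 1]) / width
            for i in range(half, len(y) - half)]

# Two peaks (indices 3 and 7) with a valley bottom at index 5.
y = [0.0, 0.0, 0.0, 3.0, 1.0, 0.5, 1.0, 3.0, 0.0, 0.0, 0.0]
half = 2
smoothed = window_average(y, half)
best = max(range(len(smoothed)), key=lambda k: smoothed[k]) + half
print(best, y[best])  # 5 0.5
```

The window average is largest at index 5, exactly where the raw signal is lowest between the two peaks.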
Model justification (cont’d) • The model gives us a sensible way to choose the range: • it enables us to pool many weak signals together if they form a good triangle, which reduces the chance of false negatives; • it prevents us from pooling too many weak signals together if they do not form a good triangle, which reduces the chance of false positives.
Model justification (cont’d) • Probabilistic approximation: Poisson process • Fact: two different slopes around the non-differentiable tip • Functional approximation: line segments locally • Gives a reasonable fit to the data • Not enough data for a more complex model • Not enough computational power to fit a more complex model within minutes
Software • Fast: ~10 seconds for ~400,000 probes with a regular PC. • Robust to noise (data shown later). • Software and source code publicly available: www.stat.ucla.edu/~zmdl/Mpeak
Chromosome structure Lodish, H. et al. Molecular Cell Biology.
Histone and transcription • Histone proteins need to be modified and DNA needs to be released for transcription to take place. LS3 class note, UCLA
Histone and transcription Ren, B., UCSD
Twin-peak phenomenon The promoter region lies between two binding sites of the modified histone protein, e.g., acetylated histone H3 (AcH3). ChIP-chip data for AcH3 show a twin-peak phenomenon, with a valley corresponding to the promoter region. LS3 class note, UCLA SignalMap, NimbleGen Inc.
Possible solutions • Fit a twin-peak shape to the data, based on the probability model for the two-binding-site scenario. • Use Witkin’s scale-space filtering to detect peaks and twin-peaks.
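The scale-space idea can be sketched by smoothing the signal with Gaussians of increasing width and watching how the local maxima merge. This is only an illustration of the principle, not Witkin's full method, which tracks extrema continuously across scales. The toy signal, the two scales and the function names are hypothetical.

```python
import math

def gaussian_smooth(y, sigma):
    """Smooth y with a truncated, normalized Gaussian kernel."""
    radius = int(3 * sigma) + 1
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    total = sum(kernel)
    n = len(y)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)  # clamp indices at the edges
            acc += (w / total) * y[j]
        out.append(acc)
    return out

def local_maxima(y):
    """Indices strictly higher than both neighbors."""
    return [i for i in range(1, len(y) - 1) if y[i - 1] < y[i] > y[i + 1]]

# Hypothetical twin peak: maxima at 10 and 14, valley at 12.
y = [0.0] * 8 + [1.0, 2.0, 3.0, 2.0, 1.5, 2.0, 3.0, 2.0, 1.0] + [0.0] * 8
for sigma in (0.5, 4.0):
    print(sigma, local_maxima(gaussian_smooth(y, sigma)))
```

At the fine scale the two peaks stay separate; at the coarse scale they merge into a single maximum over the valley, so comparing scales flags the region as a twin peak.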