CGH Data. BIOS 691-804. Chromosome Re-arrangements. Normal Human Variation. Array CGH Technology. Chromosome 8 (241 genes) in 10 cell lines and many tumor samples. Pre-processing CGHa Data. QA: Same as for expression Normalization Are values comparable across arrays?
CGH Data BIOS 691-804
Chromosome 8 (241 genes) in 10 cell lines and many tumor samples
Pre-processing CGHa Data • QA: Same as for expression • Normalization • Are values comparable across arrays? • Can noise be reduced? • Segmentation • Where do copy number aberrations start and stop? • Better estimates for how many copies
Normalization • Most copy numbers are 2 • Centering necessary • Dynamic range varies • Mixtures of tumor with normal • Saturation not usually a problem • Few instances of 10X copy • Dye bias sometimes strong • loess procedure unreliable
Centering • Where is the center (log ratio 0)? • Sometimes modal copy number is 3 • Variability in labeling and tissue extraction • CGH can’t give direct measures of counts • Most researchers set modal copy to log-ratio of 0 • Does it matter? • Take 3 as equivalent to 2 for comparison?
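A minimal sketch of the centering convention described above (set the modal copy number to a log-ratio of 0). The function name `center_on_mode` and the histogram-based mode estimate are assumptions made for illustration, not a method given in the slides; the bin count in particular is arbitrary.

```python
import numpy as np

def center_on_mode(log_ratios, bins=50):
    """Shift log ratios so the most populated histogram bin sits at 0.

    Returns the shifted values and the estimated mode (hypothetical helper).
    """
    x = np.asarray(log_ratios, dtype=float)
    finite = x[np.isfinite(x)]              # ignore missing probes when locating the mode
    counts, edges = np.histogram(finite, bins=bins)
    k = int(np.argmax(counts))              # index of the fullest bin
    mode = 0.5 * (edges[k] + edges[k + 1])  # centre of that bin
    return x - mode, mode
```

If the modal copy number of a sample is really 3 rather than 2, this shift makes 3 copies read as log-ratio 0, which is exactly the "does it matter?" question raised on the slide.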
Dynamic Range • Ratios of signal are often less (sometimes much less) than actual ratios of copy numbers between samples • (Figure from Bilke et al., Bioinformatics, 2005)
Fractional Copy Numbers • Often samples are mixtures of tumor and normal • Many tumors have two (or more) distinct clones with distinct karyotypes • Observed copy numbers may lie in between values corresponding to whole numbers
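To see why mixtures give in-between values, here is a small worked sketch. It assumes a simple linear mixing model (signal proportional to the average copy number of the mixed sample) and a diploid reference; `expected_log2_ratio` is a hypothetical helper, and the model ignores the additional signal compression mentioned under Dynamic Range.

```python
import math

def expected_log2_ratio(tumor_copies, tumor_fraction, normal_copies=2):
    """Expected log2 ratio for a tumor/normal mixture under linear mixing (assumed model)."""
    # Average copy number in the mixed sample
    mixed = tumor_fraction * tumor_copies + (1 - tumor_fraction) * normal_copies
    # Ratio against a normal (diploid) reference
    return math.log2(mixed / normal_copies)

# A 4-copy gain in a pure tumor versus a 50% tumor / 50% normal mixture
pure = expected_log2_ratio(4, 1.0)   # log2(4/2)
mixed = expected_log2_ratio(4, 0.5)  # log2(3/2), i.e. it looks like 3 copies
```

Under this model, a sample that is half normal tissue turns a true 4-copy gain into the log ratio expected for 3 copies.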
Probe Bias • If errors were random, a plot of self vs self log ratios would show no structure • Observed correlation > 60% • Clear bias! • Try to estimate it
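One simple way to "try to estimate it" is sketched below. It assumes several self-vs-self hybridizations are available and takes the per-probe median log ratio as the bias estimate; this choice and the name `estimate_probe_bias` are illustrative assumptions, not the procedure used in the course.

```python
import numpy as np

def estimate_probe_bias(self_self_log_ratios):
    """Per-probe bias from a (probes x arrays) matrix of self-vs-self log ratios.

    With no true copy number difference, any value shared across arrays
    for a probe is attributed to probe-specific bias (assumed model).
    """
    m = np.asarray(self_self_log_ratios, dtype=float)
    return np.nanmedian(m, axis=1)  # robust to an occasional outlying array

def remove_probe_bias(log_ratios, bias):
    """Subtract the estimated bias from one array's log ratios."""
    return np.asarray(log_ratios, dtype=float) - bias
```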
Segmentation • Individual probe values are noisy • Most aberrations are segments • Most segments have many probes • Average neighboring probe values to better estimate segment value – how far?
Segmentation • Issues: • How to identify where a segment starts or stops • How to find these points efficiently
How to Find Segments? • Could be large copy number change over short interval or small change over large • Look for jumps in running averages • Distribution of jumps between probes • DNACopy is Maximum Likelihood estimate of change points, using all intervals • StepGram is efficient computation of (subset of) t-scores
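The "look for jumps in running averages" idea can be sketched as follows. The window length and the function name `running_mean_jumps` are assumptions for illustration; this is neither the DNACopy nor the StepGram algorithm.

```python
import numpy as np

def running_mean_jumps(values, window=5):
    """Difference between the mean of the next `window` probes and the previous `window`.

    jumps[i] compares probes i..i+window-1 with probes i-window..i-1;
    positions too close to either end are left as NaN.
    """
    x = np.asarray(values, dtype=float)
    n = len(x)
    csum = np.concatenate(([0.0], np.cumsum(x)))  # csum[k] = sum of the first k values
    jumps = np.full(n, np.nan)
    for i in range(window, n - window + 1):
        left = (csum[i] - csum[i - window]) / window
        right = (csum[i + window] - csum[i]) / window
        jumps[i] = right - left
    return jumps
```

A short window reacts to large changes over short intervals, while a long window is needed to see small changes over long intervals, which is the trade-off raised in the first bullet.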
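Theory • Classical change-point test statistic • Let $X_1, \dots, X_n$ be values; let $S_i = X_1 + \dots + X_i$ be partial sums • Set $Z_B = \max_{1 \le i < n} |Z_i|$, where $Z_i = \left(\frac{1}{i} + \frac{1}{n-i}\right)^{-1/2} \left(\frac{S_i}{i} - \frac{S_n - S_i}{n-i}\right)$ • $Z_i$ are the (scaled) differences in levels before and after i • Now for segments ‘in middle’ • Let $Z_C = \max_{1 \le i < j \le n} |Z_{ij}|$, where $Z_{ij} = \left(\frac{1}{j-i} + \frac{1}{n-j+i}\right)^{-1/2} \left(\frac{S_j - S_i}{j-i} - \frac{S_n - S_j + S_i}{n-j+i}\right)$ • This is “Circular Binary Segmentation” • Implemented in DNACopy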
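A brute-force sketch of the two statistics, taking the usual circular binary segmentation form as an assumption: the scaled difference between the mean inside a candidate segment and the mean of the remaining probes, computed from partial sums. The noise standard deviation is assumed to be 1 (or the data pre-scaled), and the function names are hypothetical. The double loop is quadratic in the number of probes, so this is for illustration only.

```python
import numpy as np

def binary_statistic(x):
    """Max |Z_i| over single split points; returns (statistic, i) with the split after probe i."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = np.cumsum(x)                      # s[i-1] = S_i
    i = np.arange(1, n)
    z = (s[:-1] / i - (s[-1] - s[:-1]) / (n - i)) / np.sqrt(1 / i + 1 / (n - i))
    k = int(np.argmax(np.abs(z)))
    return float(abs(z[k])), k + 1

def circular_statistic(x):
    """Max |Z_ij| over arcs i < j; returns (statistic, (i, j)) for probes i+1..j."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = np.concatenate(([0.0], np.cumsum(x)))  # s[k] = S_k
    best, arc = 0.0, (0, 0)
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            m = j - i                                   # probes inside the arc
            inside = (s[j] - s[i]) / m
            outside = (s[n] - s[j] + s[i]) / (n - m)    # mean of everything else
            z = (inside - outside) / np.sqrt(1 / m + 1 / (n - m))
            if abs(z) > best:
                best, arc = abs(z), (i, j)
    return float(best), arc
```

The circular form treats the chromosome as a ring, so a segment "in the middle" is compared with both flanks at once instead of needing two separate splits.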
DNACopy • In Bioconductor • Does ML identification of segments recursively • Apply procedure within identified segments • Double-checks points near the boundary • Does permutation testing to estimate null distribution • Often data are not Normal
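The permutation step can be illustrated with a short sketch: shuffle the probe order, recompute the maximal statistic, and see how often the shuffled data beat the observed value. This is a simplified stand-in using the single-change-point statistic and unit noise variance, not DNACopy's actual implementation; `max_abs_z`, `permutation_p_value`, and the default of 1000 permutations are assumptions.

```python
import numpy as np

def max_abs_z(x):
    """Maximal scaled difference in means over all single split points."""
    n = len(x)
    s = np.cumsum(x)
    i = np.arange(1, n)
    z = (s[:-1] / i - (s[-1] - s[:-1]) / (n - i)) / np.sqrt(1 / i + 1 / (n - i))
    return float(np.max(np.abs(z)))

def permutation_p_value(x, n_perm=1000, seed=0):
    """Permutation p-value for 'no change point'; makes no Normality assumption."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    observed = max_abs_z(x)
    # Shuffling destroys any positional structure but keeps the value distribution
    null = np.array([max_abs_z(rng.permutation(x)) for _ in range(n_perm)])
    # Add-one correction keeps the estimate strictly positive
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```

Because the null distribution is built from the data themselves, the test remains usable when the log ratios are not Normal, which is the motivation given on the slide. A recursive segmenter would apply the same test again within each accepted segment.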
StepGram • DNACopy is slow! • Could try to compute only a fraction of possible scores • StepGram tries to find a subset of most likely scores to compute • Much faster! • Some inaccuracies • Doesn’t handle chromosome ends well
StepGram – Method 1 • Key Idea: • Don’t compute all possible t-scores • Compute only those likely to show significant change • Bound the estimated t-scores in future based on current t-scores