Multi-Sample analysis of microarray based copy-number aberration data. Copy Number Detection Meeting. March 6, 2006. Gregory R. Grant ggrant@pcbi.upenn.edu Mitchell Guttman mguttman@sas.upenn.edu.
Motivating Framework • The ability to map the location and magnitude of aberrations is important. • Aberration regions can be small. • We are interested in regions of copy number aberrations (CNA) that are recurrent across a class of samples. • Myc-N Amplification in high risk Neuroblastoma. • ErbB2 Amplification in higher risk Breast Cancer. • Both of these are highly correlated with prognosis.
Single Slide Methods • There are numerous single slide methods for determining aberration within an array. • These methods use multiple elements in a region as replicates for determining aberration in the region. • With a single slide this is the best one can do. • The resolution of detection is lower than the resolution of the array. • With multiple slides we can take a different strategy. • Be more liberal on the single slide calls. • Only believe the calls when we see them replicated across samples significantly often. • Finally, while there may be aberration present within a single array that is not present across samples, this aberration is unlikely to be due to a population effect.
Multiple Sample Analysis (MSA) • By using multiple samples as replicates, we are able to characterize genomic aberrations at a higher resolution (the resolution of the array). This also allows us to identify regions of importance to the population. • Use information from multiple samples to find aberrations characteristic of the class of samples. • Rather than looking across the genome, we look across experiments at each location. • This allows us to pick up regions of tight concordance regardless of their small size within a single experiment.
STAC Statistical Algorithm • Given a set of calls, STAC finds aberrations that are significantly concordant across samples. • STAC provides two statistical tests of significance: the footprint and the frequency. • Frequency measures the number of samples that overlap a particular clone. • Footprint measures how tight the overlap is. Figure: two example stacks of calls, one with footprint 7 and one with footprint 4; frequency = 5 in both cases. http://www.cbil.upenn.edu/STAC/
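As a rough illustration of the two statistics, the sketch below computes the frequency at a clone and the footprint of a stack of aberration intervals, taking the footprint to be the number of clone positions covered by at least one interval in the stack. The function names, the interval layout, and the toy stacks are hypothetical. This is not the STAC implementation, which also assesses the significance of these quantities.

```python
from typing import List, Tuple

# Inclusive (start_clone, end_clone) indices, one interval per sample.
# This layout is an assumption made for the illustration.
Interval = Tuple[int, int]


def frequency(stack: List[Interval], clone: int) -> int:
    """Number of intervals (samples) that cover the given clone."""
    return sum(1 for start, end in stack if start <= clone <= end)


def footprint(stack: List[Interval]) -> int:
    """Number of distinct clone positions covered by at least one interval."""
    covered = set()
    for start, end in stack:
        covered.update(range(start, end + 1))
    return len(covered)


# Two toy stacks of five intervals each, all covering clone 3.
loose = [(0, 3), (1, 4), (2, 5), (3, 6), (3, 3)]
tight = [(2, 3), (3, 4), (2, 4), (3, 5), (3, 3)]

print(frequency(loose, 3), footprint(loose))
print(frequency(tight, 3), footprint(tight))
```

Both toy stacks have the same frequency at clone 3, but the second is more tightly aligned and so has the smaller footprint.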
Motivating Dataset (Mies Lab) • Formalin-Fixed Paraffin-Embedded (FFPE) sample DNA. • Challenging case. • Laser-capture microdissected samples from FFPE, archived (10+ years), degraded tissue, with no exact normal analog. • Samples indirectly labeled due to the small quantity of DNA and the need for sufficient amplification. • Amplification based on human-specific degenerate oligo primers. • 2-channel BAC arrays made by the Penn Microarray Core based on the Weber library.
Making Calls and Processing Data • Ratios are formed for each clone with the reference (normal) intensity in the denominator and the experimental sample intensity in the numerator. • If a segment of DNA containing a clone is not altered, then ideally the ratio for that clone should be 1. • If (in one chromosome) a segment of DNA containing a clone is missing, then ideally the ratio should be 1/2. • If (in one chromosome) a segment of DNA containing a clone is duplicated, then ideally the ratio should be 3/2. • If the segment is tripled, then ideally the ratio should be 2. • Of course, data are noisy and subject to bias and artifacts.
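The ideal ratios above follow from dividing the copy count in the sample by the two copies in the reference. A minimal sketch of that arithmetic, assuming a diploid reference and a pure, noise-free sample (the function name `expected_ratio` is hypothetical):

```python
import math


def expected_ratio(sample_copies: int, reference_copies: int = 2) -> float:
    """Ideal sample/reference ratio for a clone, assuming a diploid reference."""
    return sample_copies / reference_copies


cases = [
    ("segment missing in one chromosome", 1),
    ("unaltered", 2),
    ("segment duplicated in one chromosome", 3),
    ("segment tripled in one chromosome", 4),
]

for label, copies in cases:
    ratio = expected_ratio(copies)
    # log2 ratios are a common working scale; their use here is illustrative only.
    print(f"{label}: ratio = {ratio:.2f}, log2 ratio = {math.log2(ratio):+.2f}")
```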
Processing Issues • Clone/Array quality issues • Clone mapping issues • Overlaps and inconsistencies • Unequally spaced clones • How to infer behavior at locations between clones • Tiling Paths • Clone-to-clone variation • Differing clone hybridization affinities and clone/dye interaction effects, etc. • Normalization • Removing dye-bias, etc. • Within array normalization • Between array normalization Figure: nature of clone coverage. Spacing is inconsistent due to both technical considerations and biological reality.
First Step: Develop a parameterized protocol for single slide calls. • Make calls per clone • Use normal/normal distribution • Make calls for each nucleotide covered by at least 1 clone • How to deal with overlapping clones. • How to deal with replicate (and potentially inconsistent) clones. • Extend the calls to regions with no coverage. • Develop method for extension from neighboring clones. • Determine how to divide regions flanked by inconsistent clones. • Standardize genome spacing for analysis. • Merging continuous genome into discrete regions. • How to deal with overlapping regions
Making clone-wise calls from raw data • Absolute threshold cutoffs.
Using Normal Controls • Using normal samples as controls. • A distribution is built from normal samples, analogous to the test channel of interest, hybridized against the same reference channel used for the experimental hybridizations. • Possible cutoff parameters using normal samples • Percentiles • Standard deviations • Z-scores • User specified • Given a fixed scheme (above), how can we find an “optimal” parameter setting?
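One way the standard-deviation scheme could look in practice is sketched below: each clone's value in a tumor sample is compared against the spread of that clone's values in the normal/normal hybridizations. Working on log ratios, using a per-clone distribution, and the name `call_clone` are all assumptions made for the illustration, not details given in the slides.

```python
from statistics import mean, stdev
from typing import List


def call_clone(tumor_log_ratio: float,
               normal_log_ratios: List[float],
               sd_cutoff: float) -> int:
    """Return +1 (gain), -1 (loss) or 0 (no call) for a single clone.

    The clone is called when it lies more than `sd_cutoff` standard
    deviations from the mean of the normal/normal values for that clone.
    """
    mu = mean(normal_log_ratios)
    sigma = stdev(normal_log_ratios)  # needs at least two normal samples
    if sigma == 0:
        return 0  # hypothetical choice: make no call if the normals do not vary
    z = (tumor_log_ratio - mu) / sigma
    if z > sd_cutoff:
        return 1
    if z < -sd_cutoff:
        return -1
    return 0


# Toy values for one clone across six normal/normal hybridizations.
normals = [-0.05, 0.02, 0.00, 0.04, -0.03, 0.02]

for cutoff in (1.0, 3.0):
    print(cutoff, [call_clone(x, normals, cutoff) for x in (0.58, 0.05, -1.0)])
```

A small deviation such as 0.05 is called at a liberal cutoff but not at a conservative one, which is why the choice of cutoff matters.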
Extending calls to regions with no coverage Note: We do not extend calls over gaps of any length, only over small spans. Regions longer than a specified length are cut out.
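The extension step might be sketched as follows. Calls are extended across an uncovered gap only when the gap is no longer than a specified length. When the flanking clones disagree, the gap is split at its midpoint, which is an assumption here since the slides do not say how such gaps are divided. The names `Segment`, `extend_calls`, and `max_gap` are hypothetical.

```python
from typing import List, Tuple

# (start, end, call) with half-open coordinates [start, end).
Segment = Tuple[int, int, int]


def extend_calls(clones: List[Segment], max_gap: int) -> List[Segment]:
    """Extend calls across short gaps between sorted, non-overlapping clones.

    Gaps longer than `max_gap` are left uncovered (cut out of the analysis).
    """
    if not clones:
        return []
    out = [clones[0]]
    for start, end, call in clones[1:]:
        prev_start, prev_end, prev_call = out[-1]
        gap = start - prev_end
        if 0 < gap <= max_gap:
            # Same call on both sides: the left segment absorbs the whole gap.
            # Different calls: split the gap at its midpoint (an assumption).
            cut = start if prev_call == call else prev_end + gap // 2
            out[-1] = (prev_start, cut, prev_call)
            start = cut
        out.append((start, end, call))
    return out


clones = [(0, 100, 1), (150, 250, 1), (300, 400, 0), (2000, 2100, 1)]
print(extend_calls(clones, max_gap=100))
```

In the toy input, the two short gaps are filled, while the long gap before the last clone is left without a call.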
Analysis • In an ideal situation we would believe every aberration call. • We would then ask the question: which aberrations occur concordantly across samples? • This is where the STAC statistic helps us out.
Finding a reasonable cutoff • For cutoff SD=1, we are definitely picking up false signal. • For cutoff SD=6, we are likely missing true signal. • Looking at one slide at a time, it is hard to tell what is a reasonable cutoff. Figure: a single array with calls made at 11 different cutoff values.
6 normals, 15 tumor samples, in parallel for 11 values of the SD cutoff: 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0
Figure panels: High Cutoff, Middle Cutoff, Low Cutoff
Methodology • Avoid making a decision on cutoffs. • Calculate significance at a range of cutoff values, using STAC at each cutoff. • Combine results using multiple testing correction. Figure: percent aberration plotted against SD cutoff values (less conservative to more conservative), with start and end points marked.
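A minimal sketch of how per-cutoff results might be combined is given below. It takes, for each location, the smallest p-value across cutoffs and applies a Bonferroni correction for the number of cutoffs tried. The slides do not say which multiple testing correction is used, so Bonferroni is an assumption, and the function name and toy p-values are hypothetical.

```python
from typing import Dict, List


def combine_across_cutoffs(pvalues_by_cutoff: Dict[float, List[float]]) -> List[float]:
    """Combine per-location p-values obtained at several SD cutoffs.

    For each location, the minimum p-value over cutoffs is multiplied by the
    number of cutoffs (Bonferroni, an assumed choice) and capped at 1.
    """
    if not pvalues_by_cutoff:
        return []
    per_cutoff = list(pvalues_by_cutoff.values())
    n_cutoffs = len(per_cutoff)
    n_locations = len(per_cutoff[0])
    if any(len(p) != n_locations for p in per_cutoff):
        raise ValueError("every cutoff must provide one p-value per location")
    return [
        min(1.0, n_cutoffs * min(p[i] for p in per_cutoff))
        for i in range(n_locations)
    ]


# Toy p-values for three locations at three cutoffs.
pvals = {
    1.0: [0.5, 0.01, 0.2],
    3.0: [0.4, 0.001, 0.6],
    6.0: [0.9, 0.02, 0.5],
}
combined = combine_across_cutoffs(pvals)
print(combined)
```

The point of the design is that no single cutoff has to be chosen in advance: a location is reported if it is significant at some cutoff after accounting for the number of cutoffs examined.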
Results • Chromosome 8 is important in breast cancer. • Provides fine resolution of aberration, rather than simply providing gross changes. • Able to characterize aberration at the resolution of the array. • Able to characterize important regions. • Myc, FGFR, etc. • Other regions previously uncharacterized.
MSA: Chromosome 8 • Able to characterize a 1Mb amplification of the FGFR oncogene • All single slide methods missed this. • Able to pick up the Myc oncogene amplification • Single-slide methods missed it despite its presence in every sample. • Also characterizes other regions. • Some of these regions the single slide methods were able to detect • Detected other smaller regions of aberration • Allows finer resolution mapping • Smaller regions are otherwise either missed or clumped together into larger regions of aberration. Figure labels: FGFR, MYC. Note: We are working on adding an implementation of the CBS algorithm to MSA, so that its single slide approach can be used within our Multiple Sample Approach.
Discussion • To our knowledge, there are no methods that combine preprocessing and analysis harnessing the power of multiple samples. • Because most methods are single array methods, integration between experiments is difficult to define. • MSA provides statistical analysis at higher resolution. • MSA works with “difficult” data: Based on Pinkel and Albertson scale of difficulty, our method has been tested, and works well, with 5/6 criteria.
Future Plans • Handle Affymetrix SNP Chip data. • Many of the ideas for leveraging multiple samples should also apply to the analysis of Affy SNP data. • We are currently working on this extension. • Release stand-alone GUI software package (CGH-MSA). • To be released this month. • www.cbil.upenn.edu/MSA • Incorporate single slide methods. • Extend the STAC algorithm beyond binary data to account for levels of change. • Estimate bias in non-controlled experiments.