Array-based Comparative Genomic Hybridization. Bastien JOB 2010-10-19. Structural Genomics: sequence variations (CGHa, SNPa, DNAseq, mutations…). Functional Genomics: gene expression / splicing… (GEa, Q-PCR, RNAseq…). Proteomics (antibody arrays, 2D EP + MS/MS, HPLC + MS/MS, …). Genome.
Array-based Comparative Genomic Hybridization Bastien JOB 2010-10-19
Structural Genomics: sequence variations (CGHa, SNPa, DNAseq, mutations…)
Functional Genomics: gene expression / splicing… (GEa, Q-PCR, RNAseq…)
Proteomics (antibody arrays, 2D EP + MS/MS, HPLC + MS/MS, …)
[Figure labels: Genome, Transcriptome, Proteome; DNA: gene; RNA; mRNA: transcript; protein; intron; exon; promoter, regulatory sequences; transcription; splicing, editing; translation; post-translational modification; miRNA; nucleus; cell membrane]
History and context Technical principle, classical designs Description of oligo CGH arrays Data preprocessing Bioinformatic analysis Cross-technology correlation
History and context
CGH array is a method that aims to identify copy-number variations in the genomic content of a test sample, by comparison to a reference sample, using an array of (at least) thousands of measurement points along the genome.
A bit of history of cytogenomics:
• [196x] : Karyotyping
• [1993] : Spectral karyotyping (SKY)
• [199x] : CGH (comparative genomic hybridization) on chromosomes
• [200x] : cDNA-based and BAC-based CGH array
• [2005] : oligo-based CGH array
In cancer:
• Profiling the patterns defined by these alterations for a patient or a pathology.
• Exploring associations between some of these patterns and clinical annotations.
Other uses:
• Developmental abnormalities, autism, diabetes, inter-individual CNVs (HapMap project), ...
It is an established method in the cancer research field, and is becoming established in the diagnostic field.
196x : Karyotype 1993 : SKY 199x : CGH on chr 200x : cDNA/BAC-based CGH array 2005 : Oligo-based CGH array
Rearrangements in tumors altering gene regulation MYC – IgH translocation in Burkitt lymphoma IMAGE CREDIT: Gregory Schuler, NCBI, NIH, Bethesda, MD, USA Also a common fusion in prostate cancer (Tomlins et al., Science 2005)
Chromosomal amplifications EGFR amplification in lung cancer as HSR (homogeneously staining region) EGFR amplification in lung cancer as several double minutes Varella-Garcia et al, J Clin Pathol 2009
Common alterations across tumors and pathologies
• Mutations activating / repressing pathways
• Breakpoints creating duplications / amplifications / deletions / fusions
• Known « master genes » like TP53, PTEN, CDKN2A/B, MYC, EGFR, FGF, …
• Some are tissue-specific, others more widespread
[Figure labels: duplicated genes (activation); deleted genes (repression)]
History and context Technical principle, classical designs Description of oligo CGH arrays Data preprocessing Bioinformatic analysis Cross-technology correlation
Designs (dual color)
• For dual-channel CGH arrays, most of the time: test sample DNA (tumor) in Cy5 -vs- reference DNA (normal) in Cy3
• Most often, a sex-matched commercial normal DNA is used as reference
  • Sex-matched: anomalies on gonosomes
  • « Outside » reference: polymorphisms (CNV, « copy number variations »)
• More rarely (cancer field): the same person's normal DNA
  • No polymorphism
  • Same origin ≈ same preparation
  • Some difficulties for blood DNA extraction
• Use of a « stable » cell line with a complete ploidy as a reference (e.g. Coriell NA10851)
• More complex designs can be performed (circular, …)
CGH array simplified process on the platform: from sample to analysis
Samples T (R) → DNA extraction → Qualification & quantitation → Fragmentation & labelling → Hybridization (oligo microarray) → Scan, signals acquisition & normalization → Segmentation & visualization → Bioinformatic analyses
History and context Technical principle, classical designs Description of oligo CGH arrays Data preprocessing Bioinformatic analysis Cross-technology correlation
Long oligo Agilent CGH arrays
G2 : 244K Agilent oligoarray. Spots : 60 µm (@ 5 µm/px)
G3 : 4 x 180K Agilent oligoarray. Spots : 30 µm (@ 2 µm/px)
Available formats (for Human)
• 2nd generation
  • 4 x 44K
  • 2 x 105K
  • 1 x 244K
• 3rd generation (current)
  • 8 x 60K
  • 4 x 180K
  • 2 x 400K
  • 1 x 1M
• Most formats also available for mouse and rat
• Possibility to design one's own custom array for any format
Short oligo Affymetrix SNP 6.0 array 4x 906,600 SNP probes 945,826 CN probes * • 25-mer oligos • ~700 b average interval • ~2 Kb real CN interval * ~200,000 CNVs
History and context Technical principle, classical designs Description of oligo CGH arrays Data preprocessing Bioinformatic analysis Cross-technology correlation
Simplified bioinformatics analysis pipeline
Steps: Signals acquisition, Quality controls, Normalization, Segmentation, Genomic profile, Description of the population, Identification of genomic regions of interest, Describing genomic contents
Tools shown: Feature Extraction v10.x, CBS, R, aCGH, STAC
Resources: Public databases + Clinical annotations
Spot position identification • By 2D intensity histograms • By a circle (fixed / variable diameter) • Adaptive segmentation by random seed propagation Credits : Pierre NEUVIAL (ENSAE) Current oligo generation : perfect disc-shaped spots.
Spot extraction
• Two methods:
• Intensity segmentation
  • Isolation of real signal from a local background
  • Needed for both signals
  • Needs a background correction method
  • Then a ratio can be computed
• Linear regression (Novikov, 2004)
  • (1) First linear regression on all intensities
  • (2) Identification of outliers
  • (3) Sequential removal of outlier pixels
  • (4) Unbiased linear regression on the kept pixels
  • Can only be used when background is fairly low and homogeneous.
  • The ratio is directly extracted as the slope.
[Figure: regression plot annotated with steps (1) to (4)]
Credits : Pierre NEUVIAL
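The regression steps (1) to (4) above can be sketched in a few lines. This is only an illustration under stated assumptions, not the published Novikov (2004) procedure: the function name `regression_ratio`, the 3-sigma outlier rule (`n_sigma`) and the iteration cap (`max_iter`) are hypothetical choices made for the example.

```python
import numpy as np

def regression_ratio(test_px, ref_px, n_sigma=3.0, max_iter=20):
    """Estimate the test/reference ratio of one spot as a regression slope.

    test_px, ref_px: pixel intensities of the same spot in the two channels.
    The outlier rule (residual > n_sigma * std) is an assumption of this sketch.
    """
    test_px = np.asarray(test_px, dtype=float)
    ref_px = np.asarray(ref_px, dtype=float)
    keep = np.ones(test_px.size, dtype=bool)
    for _ in range(max_iter):
        # (1) least-squares line test = slope * ref + intercept on kept pixels
        slope, intercept = np.polyfit(ref_px[keep], test_px[keep], 1)
        # (2) flag kept pixels that lie far from the fitted line
        resid = test_px - (slope * ref_px + intercept)
        sigma = resid[keep].std()
        outliers = keep & (np.abs(resid) > n_sigma * sigma)
        # stop when nothing is flagged or too few pixels would remain
        if not outliers.any() or keep.sum() - outliers.sum() < 3:
            break
        # (3) remove the flagged pixels and iterate
        keep &= ~outliers
    # (4) final regression on the kept pixels; the slope is the ratio
    slope, intercept = np.polyfit(ref_px[keep], test_px[keep], 1)
    return slope, keep
```

The intercept absorbs a constant background offset, which is consistent with the slide's remark that the approach needs a fairly low and homogeneous background.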
Array quality controls (from Agilent) General information and some parameters Grid positioning check Control of channels (signal, background, …) Control of outliers (number and position) Control of intensity distributions Control of the randomness of signals
QC : Spatial homogeneity controls Spatial representation of signals, background, log2(ratio), p-value, errors (…) Distribution of signals and log2(ratio)
NORMALIZATION Why ? Some biases can be removed by specific algorithms
Spatial biases: intensity gradients, block effects, print-tip bias, local bias. Most of these biases are linked to spotted arrays.
Spatial biases correction (example) Credits : Pierre NEUVIAL
CENTRALIZATION Why ? Data generated by this method are relative values (ratio of a test versus a reference) : we lack information about the « real » normality level.
Centralization : an obvious example Identifying the most probable normal genomic level is easy here, as we have a main central peak. [Figure axes: Frequency, Ratio, Log2(ratio), Chromosomes]
Centralization : a cancer example It is much more difficult here, due to the higher complexity of the distribution / profile… [Figure axes: Frequency, Ratio, Log2(ratio), Chromosomes]
Centralization : Comparing to the center of the distribution
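As a minimal sketch of centralization, one can shift the profile so that the main peak of the log2(ratio) distribution sits at zero. Taking the tallest histogram bin as "the center" is an assumption of this example (the slides do not specify the estimator), and the function name `centralize` and the `bins` parameter are hypothetical. In complex cancer profiles the tallest peak is not guaranteed to be the normal level.

```python
import numpy as np

def centralize(log2_ratio, bins=100):
    """Shift a log2(ratio) profile so its main histogram peak is at 0.

    Returns the centered profile and the estimated center.
    """
    log2_ratio = np.asarray(log2_ratio, dtype=float)
    finite = log2_ratio[np.isfinite(log2_ratio)]
    # histogram of the distribution; the tallest bin approximates the mode
    counts, edges = np.histogram(finite, bins=bins)
    peak = np.argmax(counts)
    center = 0.5 * (edges[peak] + edges[peak + 1])
    return log2_ratio - center, center
```

For a profile with one dominant peak (the "obvious example"), this recovers the offset directly; for a multi-peak tumor profile the choice of peak needs additional reasoning.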
GENOMIC PROFILE VISUALIZATION & DATA SEGMENTATION Why segment ? Data reduction : the data obtained are a list of hundreds of thousands of values. However, a genomic profile can be simplified to a limited list of segments considered as abnormal.
A normalized, centered, segmented genomic profile with called aberrations Example taken from a breast cancer profile
Challenge : identifying breakpoints
• Data consist of a continuous log2(ratio) distribution
• Two main difficulties:
  • The location of breakpoints is unknown by default
  • So is their number
• Two general models:
  • Homoscedastic (m)
  • Heteroscedastic (m, V)
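The two models can be written out as follows. This is a sketch that assumes m denotes the segment mean and V the variance (the slide does not define them); the Gaussian noise and the notation (y_i for the log2(ratio) of probe i in genomic order, t_k for the unknown breakpoints, K for the unknown number of segments) are added for illustration.

```latex
\begin{align*}
y_i &= m_k + \varepsilon_i, \qquad t_{k-1} < i \le t_k, \quad k = 1, \dots, K,
      \quad 0 = t_0 < t_1 < \dots < t_K = n \\
\text{Homoscedastic:} \quad & \varepsilon_i \sim \mathcal{N}(0, V)
      && \text{(segments differ only by their mean } m_k\text{)} \\
\text{Heteroscedastic:} \quad & \varepsilon_i \sim \mathcal{N}(0, V_k)
      && \text{(each segment has its own mean } m_k \text{ and variance } V_k\text{)}
\end{align*}
```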
Several segmentation methods available
• Initial methods: median smoothing, EM mixture clustering
• « Newer », well-known methods: HMM/EM, CBS
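Of the methods listed, median smoothing is the simplest to illustrate. The sketch below is a plain running median over probes in genomic order; the function name `median_smooth`, the odd `window` size and the edge padding are assumptions of this example, and it is not an implementation of CBS or of the HMM/EM approaches.

```python
import numpy as np

def median_smooth(log2_ratio, window=5):
    """Running median of a log2(ratio) profile (probes in genomic order)."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    x = np.asarray(log2_ratio, dtype=float)
    half = window // 2
    # repeat the edge values so the output has the same length as the input
    padded = np.pad(x, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    return np.median(windows, axis=1)
```

A running median removes isolated outlier probes while keeping sharp level changes, but it does not by itself return breakpoints or segments, which is one reason dedicated segmentation methods are used.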