
Array Informatics Mark Gerstein



Presentation Transcript


  1. Array Informatics Mark Gerstein

  2. CEGS Informatics Developing Tools and Technical Analyses Related to Genome Technologies • Main Genome Technologies • Tiling Arrays • Next Generation Sequencing • Main Applications • Transcript mapping • Protein-DNA Binding • CGH • Transitioning to Seq....

  3. Tools & Tech. Analyses for Processing of Genome Technology Data • Normalizing Arrays and Measuring & Correcting Artifacts • COP - Correcting positional artifacts [Yu et al. NAR '07] • Efficient Pseudomedian Calculation - for Tiling Array Scoring [Royce et al., BMC Bioinfo. '07] • Measuring Mismatch Effects [Seringhaus et al., BMC Genomics (submitted)] • Removing Seq. Effects [Royce et al., Bioinfo. '07] • NN Prediction of Probe Intensity - measuring & exploiting specific cross-hyb [Royce et al. NAR '07] • Simulating NextGen Sequencing • ChipSeqSim - simulating ChIP Seq [Zhang et al., PLoS CB '08]
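
For orientation on the pseudomedian item above: the pseudomedian (Hodges-Lehmann estimator) of a window of probe intensities is the median of all pairwise averages. The sketch below is the naive quadratic version, assuming pairs are taken with repetition; it is not the efficient algorithm of Royce et al., and the function name and sample window are illustrative only.

```python
from itertools import combinations_with_replacement
from statistics import median


def pseudomedian(values):
    """Hodges-Lehmann estimator: median of all pairwise (Walsh) averages."""
    if not values:
        raise ValueError("need at least one value")
    # Every pair (a, b) with a taken before or at the same index as b.
    walsh = [(a + b) / 2.0 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


# Hypothetical window of probe intensities with one outlier.
window = [0.1, 0.2, 0.3, 5.0]
score = pseudomedian(window)
```

The pairwise averaging dampens the pull of a single outlying probe far more than a plain mean would, at the cost of O(n^2) work per window in this naive form.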

  4. Tools & Tech. Analyses for Genome Structural Variation • Breakptr - HMM-based Array Segmentation for CNV detection [Korbel et al., PNAS '07] • MSB - Mean-shift-based Array Segmentation for CNV detection with extension to sequencing [Wang et al. Gen. Res. (submitted)] • PEMer - Paired-end Mapping for SV Detection with simulation calibration and breakpoint DB [Korbel et al., GenomeBiol. (submitted)] • Long-SV-Assembly Simulations [Du et al., Nat. Meth. (submitted)] • SD-CNV-CORR - Approach for correlating the occurrence of CNVs and SDs with genomic features (particularly repeats) [Kim et al., Genome Res. (submitted)]

  5. A Starting Point: Noisy Raw Signal from Tiling Arrays (Transcription) Johnson et al. (2005) TIG, 21, 93-102. (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu Li et al., PLOS one (2007)

  6. Specific & Non-specific Cross-Hyb. • Perfect match (PM): probe binding intended target • Specific cross-hyb.: probes binding non-PM targets with a small number of mismatches • Non-specific cross-hyb.: probes binding targets with many mismatches, due to general stickiness of oligos Specific Cross-hyb. Non-specific Cross-hyb. Perfect Match

  7. Non-Specific Cross Hyb. (Sequence Effects)

  8. Creation of Standardized Datasets for Quantifying Effect of Mismatches [Seringhaus et al., BMC Genomics (in press)] [Figure: normalized intensity, MM vs. PM, for human and yeast, by type of mispair (probe on array is first): GT v TG, CT v TC, GA v AG, CA v AC]

  9. Observing Non-specific Cross-hyb. (Probe sequence effects) [Figure panels: avg. intensity of all background probes with a C at position 4; avg. intensity of all background probes with a T at position 33] Source: Royce, T.E., et al (2007), Bioinformatics, 23, 988-97

  10. Iterated Quantile Normalization to Correct for Non-specific Cross-hyb. • Adapt Bolstad et al (2003) approach to tiling arrays • Force distributions with a given nt at each position to be same • Distributions at other positions now different so iterate • Also, robust adaptation of Naef & Magnasco (2003) T Royce et al (2007), Bioinformatics, 23, 988-97
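
The iteration described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the published implementation: it assumes equal-length probe sequences, uses the pooled intensities at each step as the reference distribution, and maps values by mid-rank quantile. The function names, the number of rounds, and the toy probes are all assumptions made for the example.

```python
def quantile_map(group_values, reference_sorted):
    """Replace each value by the reference quantile matching its within-group rank."""
    n = len(group_values)
    m = len(reference_sorted)
    order = sorted(range(n), key=lambda i: group_values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        q = (rank + 0.5) / n  # mid-rank quantile, strictly between 0 and 1
        out[i] = reference_sorted[min(int(q * m), m - 1)]
    return out


def iterated_quantile_normalize(seqs, intensities, rounds=3):
    """For each probe position, force the intensity distributions of probes
    carrying A, C, G or T there onto one reference; repeat over all positions
    for several rounds because fixing one position disturbs the others."""
    values = list(intensities)
    length = len(seqs[0])
    for _ in range(rounds):
        for pos in range(length):
            reference = sorted(values)  # pooled distribution at this step
            groups = {}
            for idx, seq in enumerate(seqs):
                groups.setdefault(seq[pos], []).append(idx)
            for idxs in groups.values():
                mapped = quantile_map([values[i] for i in idxs], reference)
                for i, v in zip(idxs, mapped):
                    values[i] = v
    return values


# Toy probes (hypothetical): probes starting with C are systematically brighter.
seqs = ["AC", "AG", "CC", "CG", "AC", "CG"]
raw = [1.0, 2.0, 10.0, 12.0, 1.5, 11.0]
normalized = iterated_quantile_normalize(seqs, raw)
```

A fixed number of rounds is used here for simplicity; a convergence check on the change between rounds would be a natural alternative.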

  11. Measuring Specific Cross-Hyb Source: Royce, T.E., et al (2007), Nucleic Acids Res., 35, e99

  12. Proof of principle test to “exploit” this • Using Cheng et al. (2005), predict gene expression levels (and profiles across tissues) for genes on part of chr. #6 • ...Based on closest cross-hyb tiles on part of chr. #7 • Then compare to measured levels and profile on #6 Source: Royce, T.E., et al (2007), Nucleic Acids Res., 35, e99 Figure from http://www.members.cox.net/amgough/Fanconi-genetics-genetics-primer.htm

  13. Nearest Nbr Search on Virtual Tiling Royce, T.E., et al (2007), Nucleic Acids Res., 35, e99

  14. Agreement between predicted tile expression profile and actual one • Correlated predicted profiles with the actual profiles of gene expression across cell lines • Much more correlation than expected by chance (dist. centered on 0) Source: Royce, T.E., et al (2007), Nucleic Acids Res., 35, e99

  15. Very Strong ROC Curve: Most genes are accurately detected using nearest-neighbor features' signals • Illustrates great magnitude of cross-hyb. on hi-density arrays • High feature density arrays inadvertently resurrecting generic n-mer concept (van Dam & Quake, 2003) • Suggests that tiling arrays could be exploited to create universal arrays • Gold-standard set of known expressed genes: how well do we find them? • A set of known positives was defined as the Refseq genes with at least 75% transfrag coverage. A set of known negatives was constructed by permuting the sequences in the set of known positives. For various thresholds, sensitivity and specificity were computed and then plotted. Royce, T. E. et al. Nucl. Acids Res. 2007 35:e99
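
The threshold sweep behind such a ROC curve can be sketched as below. This is a generic illustration, assuming each gene in the positive and negative sets has a single predicted score and that a gene is called "detected" when its score meets the threshold; the function name and the scores are made up for the example.

```python
def roc_points(pos_scores, neg_scores):
    """Return (threshold, sensitivity, specificity) for every distinct score."""
    thresholds = sorted(set(pos_scores) | set(neg_scores))
    points = []
    for t in thresholds:
        true_pos = sum(s >= t for s in pos_scores)  # positives called detected
        true_neg = sum(s < t for s in neg_scores)   # negatives correctly rejected
        points.append((t, true_pos / len(pos_scores), true_neg / len(neg_scores)))
    return points


# Hypothetical scores for known positives and permuted-sequence negatives.
positives = [0.9, 0.8, 0.4]
negatives = [0.5, 0.2, 0.1]
points = roc_points(positives, negatives)
```

Plotting sensitivity against (1 - specificity) for these points gives the ROC curve; lowering the threshold trades specificity for sensitivity.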

  16. CEGS Informatics Credits • Array Corrections: J Rozowsky, T Royce, M Seringhaus • PEMer, SD-CNV, BreakPtr: P Kim, J Korbel, J Du, X Mu, A Abyzov, N Carriero • Experimental: M Snyder, S Weissman, A Urban

  17. Computational Methods for SV Characterization Mark Gerstein

  18. Computational Methods for SV Characterization • Segmenting Array CGH data • Building a PEM pipeline • Correlating SVs and SDs with Repeats

  19. BreakPtr HMM • To get highest resolution on breakpoints need to smooth & segment the signal • BreakPtr: prediction of breakpoints, dosage and cross-hybridization using a system based on Hidden Markov Models http://breakptr.gersteinlab.org Korbel*, Urban* et al., PNAS (2007) [Figure: fluorescence log2 ratio (-0.5 to 0.5) along the DNA sequence; not to scale]

  20. High resolution of tiling arrays allows statistical integration of nucleotide sequence patterns • >4-fold enrichment of the breakpoints of copy number variants near segmental duplications (SDs) [e.g. Sharp et al., Am. J. Hum. Genet. 2005; 77:78-88].

  21. BreakPtr statistically integrates array signal and DNA sequence signatures (using a discrete-valued bivariate HMM) Korbel*, Urban* et al., PNAS (2007) [Figure labels: array values, sequence values; Duplication, Normal, Deletion; Transition A, Transition B, Transition A', Transition B']
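
To make the segmentation idea concrete, here is a deliberately simplified sketch: a three-state HMM (deletion, normal, duplication) decoded with the Viterbi algorithm. It is not BreakPtr itself. It assumes a single continuous signal with Gaussian emissions instead of the discrete-valued bivariate model, and every parameter (state means, standard deviation, stay probability) is a hand-picked assumption rather than an EM-trained value.

```python
import math

STATES = ("deletion", "normal", "duplication")
MEANS = {"deletion": -0.5, "normal": 0.0, "duplication": 0.5}  # assumed log2-ratio means
SIGMA = 0.2   # assumed emission standard deviation
STAY = 0.98   # assumed probability of remaining in the same state


def log_emission(state, x):
    """Log density of observing log2 ratio x in the given state (Gaussian)."""
    z = (x - MEANS[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))


def log_transition(a, b):
    """Log probability of moving from state a to state b."""
    if a == b:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(ratios):
    """Most likely state path for a sequence of log2 ratios."""
    score = {s: math.log(1 / len(STATES)) + log_emission(s, ratios[0]) for s in STATES}
    back = []
    for x in ratios[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_transition(p, s))
            new_score[s] = score[prev] + log_transition(prev, s) + log_emission(s, x)
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):  # trace back from the last probe
        state = pointers[state]
        path.append(state)
    return path[::-1]


def breakpoints(path):
    """Indices where the decoded state changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


# Hypothetical probe-level log2 ratios with a gain in the middle.
ratios = [0.02, -0.01, 0.03, 0.0, -0.02,
          0.48, 0.55, 0.5, 0.46, 0.52,
          0.01, -0.03, 0.0, 0.02, -0.01]
path = viterbi(ratios)
cuts = breakpoints(path)
```

The high stay probability is what smooths the call: isolated noisy probes are not worth two state switches, while a sustained shift is.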

  22. ‘Active’ approach for breakpoint identification: initial scoring with preliminary model, targeted validation (with sequencing), retraining, and rescoring • CNV breakpoints sequenced in ~10 cases following BreakPtr analysis; median resolution <300 bp • No improvement in accuracy with higher resolution (9nt tiling) • HMM optimized iteratively (using Expectation Maximization, EM) Korbel*, Urban* et al., PNAS (2007) [Figure labels: training data (normalized fluorescent intensity log2-ratios, SDs); gold standards; model parameter estimation; parameter optimization; breakpoint validation [sequencing]; maximum number of parameters per transition state for the core, intermediate A, intermediate B, and full models vs. log(number of CNVs available for parameter estimation)]

  23. Moving Beyond Arrays, Computational Methods for Next-Generation Sequencing: Paired End Mapping to Find SVs

  24. Overall Strategy for Analysis of NextGen Seq. Data to Detect Structural Variants [Korbel et al., Science ('07); Korbel et al., GenomeBiol. (submitted)]
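
The core idea of paired-end mapping can be illustrated with a small classifier: a read pair whose mapped span on the reference deviates from the expected fragment span, or whose orientation is unexpected, hints at a structural variant. The sketch below is a generic illustration and not the PEMer pipeline; the function name, the expected span, and the tolerance are placeholder assumptions, and the caller is assumed to have already decided whether the pair orientation matches the library.

```python
def classify_pair(span, orientation_ok, expected_span=3000, tolerance=1000):
    """Label one read pair mapped to a single chromosome.

    span           -- distance between the two mapped ends on the reference
    orientation_ok -- True if the ends map in the orientation the library expects
    expected_span, tolerance -- assumed library parameters (placeholders)
    """
    if not orientation_ok:
        return "inversion_candidate"
    if span > expected_span + tolerance:
        # Ends are farther apart on the reference than in the sample fragment.
        return "deletion_candidate"
    if span < expected_span - tolerance:
        # Ends are closer on the reference than in the sample fragment.
        return "insertion_candidate"
    return "concordant"


# Hypothetical mapped pairs: (span, orientation_ok).
pairs = [(3100, True), (8000, True), (500, True), (3000, False)]
labels = [classify_pair(span, ok) for span, ok in pairs]
```

In practice single discordant pairs are noisy, so candidates would normally be clustered and required to have support from several independent pairs before an SV is called.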

  25. Simulation strategy [Figure labels: Simulation, 454 sequencing, Experiment] [Korbel et al., GenomeBiol. (submitted)]

  26. Reconstruction efficiency at different coverage [Korbel et al., GenomeBiol. (submitted)]

  27. Building a Database of Variants: Complexities [Korbel et al., GenomeBiol. (submitted)]

  28. Analyzing Duplications in the Genome (SDs & CNVs) pers. photo, see streams.gerstein.info

  29. SEGMENTAL DUPLICATIONS AND COPY NUMBER VARIANTS ARE RELATED PHENOMENA AND HAVE BEEN CREATED BY SEVERAL DIFFERENT MECHANISMS • Copy Number Variants (CNV): intra-species variation • Segmental Duplications (SD): fixed mutations (differences to other species) • Fixation turns CNVs into SDs • NAHR (Non-allelic homologous recombination): flanking repeat (e.g. Alu, LINE…) • NHEJ (Non-homologous-end-joining): no (flanking) repeats; in some cases <4bp microhomologies

  30. PERFORM LARGE SCALE CORRELATION ANALYSIS TO DETECT REPEAT SIGNATURES OF SDs AND CNVs • If exact CNV breakpoints are known, we can calculate the enrichment of repeat elements relative to the genome or relative to the local environment • (1) Survey a range of genomic features • (2) Count the number of features in each genomic bin (100kb) • (3) Calculate correlations / enrichments using robust stats [Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ] [Figure labels: …ATCAAGG CCGGAA…; exact match; local environment]
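
The binning and correlation steps can be sketched as follows. This is an illustration only: the slide says "robust stats" without naming a statistic, so the Spearman rank correlation used here is an assumption, as are the function names, the toy chromosome length, and the feature positions.

```python
def bin_counts(positions, genome_length, bin_size=100_000):
    """Count features per fixed-size genomic bin (positions are 0-based)."""
    n_bins = (genome_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for p in positions:
        counts[p // bin_size] += 1
    return counts


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks).
    Assumes neither input is constant, otherwise the denominator is zero."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical feature start positions on a 500 kb toy chromosome.
sd_starts = [5, 150_000, 160_000, 450_000]
alu_starts = [10, 20_000, 110_000, 120_000, 170_000, 190_000, 310_000, 460_000, 470_000]
sd_bins = bin_counts(sd_starts, 500_000)
alu_bins = bin_counts(alu_starts, 500_000)
rho = spearman(sd_bins, alu_bins)
```

Rank-based correlation is less sensitive than Pearson correlation to a few bins with extreme counts, which matters when feature densities are highly skewed.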

  31. SDs ARE CORRELATED WITH ALUS AND OTHER SDs • The co-localization of Alu elements with SDs is highly significant. • Older SDs have a much higher association with Alus than younger SDs. • SDs can mediate NAHR and lead to the formation of CNVs • Such mechanisms (“preferential attachment”) are well studied in physics and should lead to a very skewed (“power-law”) distribution of SDs. • Hotspots [Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ] [Figure labels: Alu association with SDs by age, for SD bins 90-92%, 92-94%, 94-96%, 96-98%, 98-99%, >99%; occurrence vs. number of SDs in genomic bin]

  32. ASSOCIATIONS ARE DIFFERENT FOR SDs AND CNVs • CNVs ARE LESS ASSOCIATED WITH SDs THAN THE GENERAL SD TREND • SD association with repeats (Microsatellite, Pseudogenes, LINE, Alu): <0.001, <0.001, 0.046, 0.001 • CNV association with repeats (Microsatellite, Pseudogenes, LINE, Alu): <0.001, 0.92, 0.046, 0.001 [Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ] [Figure labels: CNV association with SDs; >99% SDs*; CNVs]

  33. AFTER THE ALU BURST, THE IMPORTANCE OF ALU ELEMENTS FOR GENOME REARRANGEMENT DECLINED RAPIDLY • About 40 million years ago there was a burst in retrotransposon activity • The majority of Alu elements stem from that time • This, in turn, led to rapid genome rearrangement via NAHR • The resulting SDs could create more SDs, but with Alu activity decaying, their creation slowed [Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ] [Figure labels: Alu, SD, NAHR, LINE, Microsatellite, Subtelomeres, Fragile sites, NHEJ, CNVs, SDs; Young: high seq-ID (%); Old: low seq-ID (%); Fixation; Aging (~40Mya); Alu Burst (40 MYA)]

  34. Future Directions • Simulations of SV Assembly • Analysis of Split Reads • Detailed Analysis of SV and CNVs with Genomic Features

  35. CEGS Informatics Credits • Array Corrections: J Rozowsky, T Royce, M Seringhaus • PEMer, SD-CNV, BreakPtr: P Kim, J Korbel, J Du, X Mu, A Abyzov, N Carriero • Experimental: M Snyder, S Weissman, A Urban
