Genome Rearrangements in Evolution and Cancer
Guillaume Bourque, Genome Institute of Singapore
HKU-Pasteur Research Centre, Hong Kong, August 28th, 2009
Outline • Genome Rearrangements in Evolution • [ ??? ] • Cancer genomics
High hopes
• Explain the physical clustering of gene families (regulation, editing or retention).
• Understand whether even longer linkage associations were preserved by chance or by selection (developmental or functional).
• Resolve the mammalian phylogeny using genomic segment exchanges as characters.
• Discover molecular fossils of precipitous genomic events.
• Identify genetic determinants of reproductive isolation, adaptation, survival and species formation.
O’Brien et al., Science 1999
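Comparing 2 sequences
GGCACAAATCCAAATCCAAATCCGGGTTGGGGTTGGGGTTGGGGTTGCGACACATTTGGCCTGTCGTCGTCCGTCGTC
GGCACAAATCCAAATCCAAATCCAATGTGTCGCAACCCCAACCCCAACCCCAACCCTGGCCTGTCGTCGTCCGTCGTC
Need to reverse complement: the two sequences share their flanks, and the middle segments match only after one of them is reverse complemented.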
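The comparison above can be checked directly. The following is a minimal Python sketch, with helper names (`reverse_complement`, `split_on_shared_flanks`) that are illustrative and not from the talk. It strips the flanks the two sequences share and tests whether the remaining middle segments are reverse complements of each other.

```python
# Illustrative check of the inverted segment; helper names are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string made of A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def split_on_shared_flanks(a: str, b: str):
    """Return (prefix, middle_a, middle_b, suffix) after removing shared flanks."""
    n = min(len(a), len(b))
    p = 0
    while p < n and a[p] == b[p]:
        p += 1
    s = 0
    while s < n - p and a[-1 - s] == b[-1 - s]:
        s += 1
    return a[:p], a[p:len(a) - s], b[p:len(b) - s], a[len(a) - s:]

seq_a = "GGCACAAATCCAAATCCAAATCCGGGTTGGGGTTGGGGTTGGGGTTGCGACACATTTGGCCTGTCGTCGTCCGTCGTC"
seq_b = "GGCACAAATCCAAATCCAAATCCAATGTGTCGCAACCCCAACCCCAACCCCAACCCTGGCCTGTCGTCGTCCGTCGTC"

prefix, mid_a, mid_b, suffix = split_on_shared_flanks(seq_a, seq_b)
is_inversion = reverse_complement(mid_a) == mid_b
print("middle segment is an inversion:", is_inversion)
```

In block notation, such a pair would be written as 1 2 3 versus 1 -2 3, which is the representation used on the following slides.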
If you have 3 sequences…
Pairwise comparisons: Seq_1 vs Seq_2, Seq_1 vs Seq_3, Seq_2 vs Seq_3 [dot plots over blocks 1 to 5]
Seq_1: 1 -2 3 4 5
Seq_2: 1 2 3 -4 5
Seq_3: 1 2 3 4 5
Rearrangement Phylogeny
A: 1 2 3 4 5
Seq_1: 1 -2 3 4 5 (inversion of block 2)
Seq_2: 1 2 3 -4 5 (inversion of block 4)
Seq_3: 1 2 3 4 5
Genome rearrangements
Reversal: 1 2 3 4 5 6 → 1 2 -5 -4 -3 6
Translocation: 1 2 3 45 6 → 1 2 65 3 4
Fusion / Fission: 1 2 3 4 5 6 ↔ 1 2 3 4 5 6
[chromosome boundaries appear only in the original figure]
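These operations are easy to state on signed block orders. The sketch below is a hedged illustration in Python: a genome is a list of chromosomes, each a list of signed integers. The function names and the split of blocks 1 to 6 into two chromosomes are assumptions made for the example, since the slide text does not preserve the chromosome boundaries.

```python
# Illustrative rearrangement operations on signed block orders.
# Function names and the example chromosome split are assumptions.

def reversal(chrom, i, j):
    """Reverse the order and flip the signs of blocks chrom[i..j] (0-based, inclusive)."""
    flipped = [-x for x in reversed(chrom[i:j + 1])]
    return chrom[:i] + flipped + chrom[j + 1:]

def fusion(genome, a, b):
    """Join chromosome b to the end of chromosome a; the merged chromosome goes last."""
    merged = genome[a] + genome[b]
    rest = [c for k, c in enumerate(genome) if k not in (a, b)]
    return rest + [merged]

def fission(genome, k, cut):
    """Split chromosome k into two chromosomes before position cut."""
    chrom = genome[k]
    return genome[:k] + [chrom[:cut], chrom[cut:]] + genome[k + 1:]

reversed_genome = reversal([1, 2, 3, 4, 5, 6], 2, 4)
print(reversed_genome)  # [1, 2, -5, -4, -3, 6], the reversal shown on the slide

fused = fusion([[1, 2, 3], [4, 5, 6]], 0, 1)
split = fission(fused, 0, 3)
```

Applying the same reversal twice restores the original order, which is why a reversal scenario can be read in either direction.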
Algorithms for sorting genomes
Polynomial-time algorithm for computing the rearrangement distance and a most parsimonious scenario between 2 unichromosomal genomes (Hannenhalli and Pevzner 1995). For example:
1 -6 -3 -7 2 -4 -5 8
1 -6 -3 -2 7 -4 -5 8
1 2 3 6 7 -4 -5 8
1 2 3 4 -7 -6 -5 8
1 2 3 4 5 6 7 8
Further developed for multi-chromosomal genomes (Tesler 2002) and multiple genomes (Bourque and Pevzner 2002).
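The example scenario can be replayed step by step. The sketch below is not the Hannenhalli-Pevzner algorithm; it only recovers, for each pair of consecutive lines in the example, the single reversal that turns one into the next. The name `find_reversal` is hypothetical.

```python
# Replays the scenario from the slide; this does NOT compute a shortest scenario.

def find_reversal(before, after):
    """Return (i, j) if reversing before[i..j] (0-based, inclusive) gives after, else None."""
    diff = [k for k in range(len(before)) if before[k] != after[k]]
    if not diff:
        return None
    i, j = diff[0], diff[-1]
    flipped = [-x for x in reversed(before[i:j + 1])]
    if before[:i] + flipped + before[j + 1:] == after:
        return (i, j)
    return None

scenario = [
    [1, -6, -3, -7, 2, -4, -5, 8],
    [1, -6, -3, -2, 7, -4, -5, 8],
    [1, 2, 3, 6, 7, -4, -5, 8],
    [1, 2, 3, 4, -7, -6, -5, 8],
    [1, 2, 3, 4, 5, 6, 7, 8],
]

steps = [find_reversal(a, b) for a, b in zip(scenario, scenario[1:])]
print(steps)  # [(3, 4), (1, 3), (3, 5), (4, 6)]
```

Each of the four steps is a single reversal, so the scenario sorts the first permutation into the identity in four reversals.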
Chromosome X: two-way similarities (PatternHunter) → synteny blocks (GRIMM-Synteny) → rearrangement scenario (MGR)
Mammalian phylogeny [tree figure; leaves: pig, cat, rat, mouse, human, cow, dog] Murphy et al., Science, 2005
Overview of the Results
• Nearly 20% of chromosome breakpoint regions were reused.
• Gene density is higher in evolutionary breakpoint regions.
• Segmental duplications populate the majority of primate-specific breakpoints.
Recovering true ancestral events
• Analyses of genome rearrangements are typically evaluated on:
  • Quality of the ancestral reconstructions
  • Ability to recover the correct topology
  • Total number of rearrangements in the scenario recovered (parsimony)
• We decided to focus on the accuracy of the rearrangements recovered
• Start by measuring accuracy using simulations and then apply the approach to real data sets
• Why?
  • Look for events that could have been involved in speciation
  • Look at sequence features associated with these events (e.g. repeats, genes, etc.)
  • Gain mechanistic insights into genome rearrangements
EMRAE :: Efficient Method to Recover Ancestral Events
• Relies on adjacencies conserved in a significant fraction of the genomes.
• Combines conserved adjacencies (and nearly conserved adjacencies) to predict rearrangement events.
• Applicable to uni- and multi-chromosomal genomes.
• Currently models inversions, translocations, fusions, fissions and transpositions, and is also amenable to insertions and deletions.
• Achieves high specificity with comparable sensitivity.
Conserved adjacencies
• Define an adjacency $a(c_i, c_{i+1})$ as an ordered pair of integers $c_i\ c_{i+1}$, or its inverse $-c_{i+1}\ -c_i$, found in a given genome.
• For a given edge $e$ of the phylogeny, which splits the genomes into two sets $S_A$ and $S_B$, if the adjacency $a$ is found in every genome of $S_A$ but not in any genome of $S_B$, we say that $a$ is a conserved adjacency of $S_A$.
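The definition can be illustrated on the three toy sequences from the earlier slides. This is a minimal sketch of the definition only, not of EMRAE itself: it treats each genome as a single linear chromosome, ignores chromosome ends, and uses hypothetical function names.

```python
# Sketch of the conserved-adjacency definition; not the EMRAE implementation.
# Assumes single linear chromosomes and ignores chromosome ends.

def canon(a, b):
    """The pair (a, b) and its inverse (-b, -a) denote the same adjacency."""
    return min((a, b), (-b, -a))

def adjacencies(genome):
    """Set of adjacencies of a signed block order, in canonical form."""
    return {canon(a, b) for a, b in zip(genome, genome[1:])}

def conserved_adjacencies(set_a, set_b):
    """Adjacencies present in every genome of set_a and in no genome of set_b."""
    in_all_a = set.intersection(*(adjacencies(g) for g in set_a))
    in_any_b = set().union(*(adjacencies(g) for g in set_b))
    return in_all_a - in_any_b

seq_1 = [1, -2, 3, 4, 5]
seq_2 = [1, 2, 3, -4, 5]
seq_3 = [1, 2, 3, 4, 5]

# Edge separating Seq_1 from the other two genomes.
conserved = conserved_adjacencies([seq_1], [seq_2, seq_3])
print(sorted(conserved))
```

For the edge leading to Seq_1, the two conserved adjacencies are the ones on either side of block 2, which is where the inversion of block 2 created new neighbours.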
Simulation results Higher specificity
Mammalian rearrangement events
• Predicted 1109 events at a 10 kb resolution:
  • 831 reversals
  • 237 transpositions
  • 15 translocations
  • 26 fusions/fissions
[figure legend: reversals, translocations, transpositions, fusions/fissions]
Human-specific breakpoints are enriched in SDs
• Human-specific breakpoint regions are significantly enriched in SDs as compared to size-matched random regions (p-value < 0.001).
• Indeed, 93.2% of the human-specific breakpoint regions (69 out of 74) contain SDs.
• This is true for only approximately 60% of size-matched random regions.
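A comparison against size-matched random regions can be sketched as a simple resampling test. The slides do not describe the exact procedure, so everything below is an assumption: the coordinates are toy values, regions are placed uniformly on a single linear genome, and the p-value is the usual empirical estimate with a pseudocount.

```python
# Hedged sketch of an enrichment test against size-matched random regions.
# Coordinates are toy values, not data from the study.
import random

def overlaps(region, intervals):
    """True if the half-open region (start, end) overlaps any interval."""
    start, end = region
    return any(s < end and start < e for s, e in intervals)

def fraction_overlapping(regions, intervals):
    """Fraction of regions that overlap at least one interval."""
    return sum(overlaps(r, intervals) for r in regions) / len(regions)

def size_matched_random(regions, genome_length, rng):
    """Draw one random region of the same size for each input region."""
    out = []
    for start, end in regions:
        size = end - start
        new_start = rng.randrange(0, genome_length - size + 1)
        out.append((new_start, new_start + size))
    return out

def empirical_p_value(observed, null_values):
    """One-sided empirical p-value with a pseudocount of one."""
    hits = sum(v >= observed for v in null_values)
    return (hits + 1) / (len(null_values) + 1)

rng = random.Random(0)
genome_length = 1_000_000
sds = [(10_000, 20_000), (500_000, 520_000)]
breakpoints = [(12_000, 15_000), (505_000, 510_000), (800_000, 803_000)]

observed = fraction_overlapping(breakpoints, sds)
null = [
    fraction_overlapping(size_matched_random(breakpoints, genome_length, rng), sds)
    for _ in range(999)
]
p_value = empirical_p_value(observed, null)
print(round(100 * 69 / 74, 1))  # 93.2, the percentage quoted on the slide
```

With 999 random sets, the smallest p-value this estimator can report is 0.001, so a claim of p < 0.001 needs more than 999 random sets.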
Homologous matching pairs of SDs are enriched in human-specific breakpoints
• Among the 74 human-specific breakpoints identified in this study, we observed 100 pairs of regions with matching pairs of SDs, compared with an average of 25 pairs in the random simulated data sets.
Primate reversals are associated with SDs
• The average percent identity of the SDs that are associated with reversals correlates with the relative age of these events.
• This helps confirm the direct link between SDs and many rearrangement events.
If not SDs, what?
• Extension from primate-specific reversals to all the predicted mammalian reversals
• We used BLAST to detect homology between breakpoints of the predicted reversals
• Many reversals are flanked by regions of high sequence identity (BLAST score > 1000)
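A score filter of this kind could look as follows. The slides do not say which BLAST output or which score was used, so this sketch assumes the standard 12-column tabular output (`-outfmt 6`), where the last column is the bit score, and it assumes that "score" means bit score. The sample lines and sequence names are made up for illustration.

```python
# Hedged sketch: keep BLAST hits whose bit score exceeds a threshold.
# Assumes 12-column tabular output; sample lines are invented.

def high_scoring_hits(lines, min_bitscore=1000.0):
    """Return (query, subject, percent_identity, bitscore) for hits above the threshold."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        bitscore = float(fields[11])
        if bitscore > min_bitscore:
            hits.append((fields[0], fields[1], float(fields[2]), bitscore))
    return hits

sample = [
    "bp1_left\tbp1_right\t97.5\t1200\t30\t0\t1\t1200\t1\t1200\t0.0\t2050",
    "bp2_left\tbp2_right\t88.0\t300\t36\t0\t1\t300\t1\t300\t1e-90\t340",
]
kept = high_scoring_hits(sample)
print(kept)
```

Only the first made-up hit passes the threshold here; the second is discarded because its bit score is below 1000.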
Homology flanking mammalian reversals
• We found that 58%, 29%, 24%, 42%, 47% and 20% of the human, chimp, rhesus, rat, mouse and dog reversals, respectively, are supported by regions with BLAST scores greater than 1000.
• What is the source of this homology? Is it expected?
• We restricted our analysis to the reversals with breakpoints defined within 100 kb and assessed the overlap between these regions of homology and repeats.
• We annotated each reversal to a particular repeat family when the overlap between the homologous segment identified and a repeat instance was greater than 50% and compared the results to matched simulated data sets.
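The 50% overlap rule can be sketched as below. The slide does not state what the overlap is measured against, so this example assumes it is the fraction of the homologous segment covered by a repeat instance. Coordinates are half-open and invented, and the family names are only placeholders.

```python
# Hedged sketch of the repeat-annotation rule; the denominator is an assumption.

def overlap_fraction(segment, repeat):
    """Fraction of segment (start, end) covered by repeat (start, end)."""
    s0, s1 = segment
    r0, r1 = repeat
    shared = max(0, min(s1, r1) - max(s0, r0))
    return shared / (s1 - s0)

def annotate(segment, repeats, threshold=0.5):
    """Return the family of the repeat with the largest overlap above threshold, else None."""
    best_family, best_frac = None, threshold
    for family, start, end in repeats:
        frac = overlap_fraction(segment, (start, end))
        if frac > best_frac:
            best_family, best_frac = family, frac
    return best_family

repeats = [("L1", 90, 140), ("Alu", 120, 200)]
family = annotate((100, 200), repeats)
print(family)  # Alu
```

A segment covered exactly 50% by a repeat is left unannotated, since the rule on the slide asks for an overlap greater than 50%.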
Outline • Genome Rearrangements in Evolution • [ ??? ] • Cancer genomics
Sequencing Revolution
• Sanger sequencing (1970s)
• Next-generation sequencing (2007-now): 454, Illumina, SOLiD
Data Explosion
• Sequencing is no longer the rate-limiting step
• This year, we expect:
  • 2X increase in CPU
  • 2X increase in memory
  • 10X increase in sequencing (estimate from Illumina and SOLiD) or even 100X increase (Helicos, Complete Genomics, etc.)
• Informatics challenges that we face now will only grow…
Paradigm Shift
• Things that are out:
  • Storing all primary data (images)
  • “All versus all” types of analysis
  • Single large repository (NCBI)
  • Careless data management (duplicated files, extra transferring steps, etc.)
• Things that are in:
  • Clusters and high performance storage
  • Cloud computing
  • Careful data management & planning
  • Bioinformaticians & IT engineers (even for relatively small labs)
Sequencing Human Genomes
• 2001: The Human Genome ($$$$$$)
• 2009: 1000 Genomes Project ($$$)
• 2011 (?): Your Genome ($)
New opportunities… in the study of: Evolution, Populations, Cancer
Outline • Genome Rearrangements in Evolution • [ ??? ] • Cancer genomics
Gene Identification Signature Ng, et al., Nature Methods, 2005
PET technology [figure labels: Cancer Cell, cDNA, PET, Human Genome]
Highly rearranged cancer genome Provided by Nalla Palanisamy, GIS
Impact of rearrangements on PETs [figure panels: Translocation, Inversion, Deletion; each compares Normal and Cancer]
GIS-PET MCF-7 Transcriptome
584,624 cDNA equivalents
135,757 unique PETs:
• One location (tag1): 92,928 PETs (69%)
• Unmappable (tag0): 9,732 PETs (7%)
• Multi-location: 33,097 PETs (24%)
Sequence-based clustering
All unmappable PETs (tag0)
Cluster based on sequence similarity
Align (5’ to 3’):
---GGAGCCGCGGCCGCC-------ACGATCCCAC-AGCCTC
----GAGCCGCGGCCGCC---AAGAACGATACCAC-AGCCTC
ATTGGAGCTGCGGCCGC--------ACGATCCCAC-AGCCTC
--TGGAGCCGCGGCCGCCGA-----ACGATCCCAC-AGCCTC
------GCGGCGGCCGCC---AAGAACGATCCCAC-AGCCCC
----GAGCCGCGGCCGCCG---AGCACGATCCCACTAGCCTC
Extract consensus: 5’ ATTGGAGCCGCGGCCGCCGA AGAACGATCCCACAGCCTC 3’
Map to human genome
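One simple way to extract a consensus from such an alignment is a per-column majority vote. The sketch below uses that rule, ignoring gap characters and skipping columns that contain only gaps. This is an assumption chosen for illustration: the slides do not state how the consensus was derived, and the result of this rule need not match the two consensus tags shown above.

```python
# Hedged sketch: per-column majority vote over non-gap characters.
# The rule is an assumption, not the method used in the talk.
from collections import Counter

def consensus(rows, gap="-"):
    """Majority character per column, ignoring gaps; all-gap columns are skipped."""
    width = max(len(r) for r in rows)
    out = []
    for col in range(width):
        letters = [r[col] for r in rows if col < len(r) and r[col] != gap]
        if letters:
            out.append(Counter(letters).most_common(1)[0][0])
    return "".join(out)

alignment = [
    "---GGAGCCGCGGCCGCC-------ACGATCCCAC-AGCCTC",
    "----GAGCCGCGGCCGCC---AAGAACGATACCAC-AGCCTC",
    "ATTGGAGCTGCGGCCGC--------ACGATCCCAC-AGCCTC",
    "--TGGAGCCGCGGCCGCCGA-----ACGATCCCAC-AGCCTC",
    "------GCGGCGGCCGCC---AAGAACGATCCCAC-AGCCCC",
    "----GAGCCGCGGCCGCCG---AGCACGATCCCACTAGCCTC",
]
cluster_consensus = consensus(alignment)
print(cluster_consensus)
```

Because gaps are ignored, a base present in a single read still enters the consensus; a stricter rule would also count gaps in the vote.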
Largest unmappable cluster: 77 unique PETs, 339 total PETs [figure: 5’ and 3’ tags mapping to BCAS4 (20q13) and BCAS3 (17q23)] …
Fusion transcript discovery pipeline Ruan et al. Genome Res, 2007
Genomic PET (gPET)
Genomic DNA fragmentation → PET library construction & sequencing → PET sequences mapping to reference genome
[figure: 1 kb and 10 kb fragments; PET mapping span distribution with a 1 kb peak and a 10 kb peak]
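Once genomic PETs are mapped, tag pairs whose mapping disagrees with the expected span can hint at rearrangements such as translocations, inversions and deletions. The sketch below is a simplified classifier written for illustration, not the pipeline used in the talk. The `Tag` type, the span window and the convention that both tags of a normal PET map to the same strand are all assumptions.

```python
# Hedged sketch of discordant-PET classification; all conventions are assumptions.
from dataclasses import dataclass

@dataclass
class Tag:
    chrom: str   # chromosome name
    pos: int     # mapped position on the reference
    strand: str  # "+" or "-"

def classify_pet(tag5: Tag, tag3: Tag, min_span: int, max_span: int) -> str:
    """Label a PET as concordant or as a candidate for a rearrangement class."""
    if tag5.chrom != tag3.chrom:
        return "translocation candidate"
    if tag5.strand != tag3.strand:
        return "inversion candidate"
    span = abs(tag3.pos - tag5.pos)
    if span > max_span:
        return "deletion candidate"
    if span < min_span:
        return "insertion candidate"
    return "concordant"

# Hypothetical window around a 10 kb span peak.
window = dict(min_span=8_000, max_span=12_000)
labels = [
    classify_pet(Tag("chr1", 1_000, "+"), Tag("chr1", 11_000, "+"), **window),
    classify_pet(Tag("chr1", 1_000, "+"), Tag("chr1", 61_000, "+"), **window),
    classify_pet(Tag("chr1", 1_000, "+"), Tag("chr1", 11_000, "-"), **window),
    classify_pet(Tag("chr17", 1_000, "+"), Tag("chr20", 11_000, "+"), **window),
]
print(labels)
```

In practice a single discordant PET is weak evidence; calls are usually based on clusters of PETs that support the same breakpoint.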