
Simultaneous Structural Variation discovery among multiple genomes


Presentation Transcript


    1. Simultaneous Structural Variation discovery (among multiple genomes)
       Fereydoun Hormozdiari*, Iman Hajirasouliha*, Andrew McPherson, Evan E. Eichler, S. Cenk Sahinalp
       Lab for Computational Biology, Simon Fraser University, Canada
       Eichler Lab, Genome Sciences, University of Washington, USA

    2. Human genetic variation
       - Single nucleotide (SNPs)
       - Few to ~50 bp (small indels, microsatellites)
       - >50 bp to several megabases (structural variants):
         - Deletions
         - Insertions: novel sequence; mobile elements (Alu, L1, SVA, etc.)
         - Segmental duplications: duplications of size ≥ 1 kbp and sequence similarity ≥ 90%
         - Inversions
         - Translocations
       Section: Introduction

    3. Paired-end mapping and SV detection

    4. Current PE-based methods
       - PEMer (Gerstein group): developed mainly for Roche/454 data; uses BLAT to map reads to unique locations.
       - Pindel (EBI and Leiden Medical Center), Spanner (Boston College) and BreakDancer (Wash U): focus only on the "best mapping".
       - MoDIL (U of Toronto) and its follow-up MoGUL.
       - GASV (Brown University).
       - VariationHunter (SFU/UW) and NovelSeq (SFU/UW).

    5. Maximum Parsimony SV Detection (Hormozdiari et al. 2009, 2010; Hajirasouliha et al. 2010; Ritz et al. 2010)
       - Objective: select the minimum number of SV events that explain all discordant paired-end reads.
       - Valid cluster: a set of discordant paired-end read alignments that all "support" the same potential SV.
       - Maximal valid cluster: a valid cluster to which no further discordant alignment can be added (Hormozdiari et al. 2009, Sindi et al. 2009).
       - Set-cover approach: yields an approximate maximum parsimony solution.

    6. How do we currently handle multiple genomes?
       An interesting problem is to identify SV events in multiple next-generation sequenced genomes simultaneously:
       - Members of a family, or closely related genomes.
       - Samples from the same tumor as it evolves, or from different patients with the same mechanisms.
       The classic answer is:
       - Detect SVs in each genome independently.
       - Check whether the genomes agree or disagree on the variations.
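
The "detect independently, then compare" baseline described on this slide can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how calls are compared, so the `(chromosome, start, end)` call format, the reciprocal-overlap criterion, and the 0.5 threshold are all hypothetical choices, not the authors' procedure.

```python
from typing import Dict, List, Set, Tuple

Call = Tuple[str, int, int]  # (chromosome, start, end); hypothetical call format


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Shared length divided by the longer call's length (0 if disjoint)."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return shared / max(a[2] - a[1], b[2] - b[1])


def merge_independent_calls(
    calls: Dict[str, List[Call]], min_overlap: float = 0.5
) -> List[Tuple[Call, Set[str]]]:
    """Group per-genome calls that overlap enough; record which genomes agree."""
    merged: List[Tuple[Call, Set[str]]] = []
    for genome, genome_calls in calls.items():
        for call in genome_calls:
            for representative, genomes in merged:
                if reciprocal_overlap(representative, call) >= min_overlap:
                    genomes.add(genome)  # this genome agrees with an earlier call
                    break
            else:
                merged.append((call, {genome}))  # no match: a new event
    return merged


# Toy per-genome call sets (made-up coordinates).
trio_calls = {
    "child": [("chr1", 100, 600)],
    "father": [("chr1", 120, 610)],
    "mother": [("chr2", 50, 300)],
}
for event, genomes in merge_independent_calls(trio_calls):
    print(event, sorted(genomes))
```

Note that each genome's calls are fixed before any comparison happens, so evidence from one genome cannot influence what is called in another.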

    7. How should we handle multiple genomes?
       - A paradigm shift from the current model, independent structural variation detection and merging (ISV&M).
       - Alternative: a new model in which genomic variation is detected among multiple genomes simultaneously.
       - Simultaneous Structural Variation among Multiple Genomes problem (SSV-MG): generalizes the maximum parsimony approach of VariationHunter to multiple donor genomes.

    8. Common-LAW (Common Loci structural Alteration discovery Widget)
       - Input: a set of discordant paired-end reads and the maximal valid clusters (as explained in VariationHunter).
       - Output: a selected set of maximal valid clusters to which each discordant paired-end read can be uniquely assigned, under the maximum parsimony criterion.
       - Common-LAW predicts common and unique SVs in multiple genomes by minimizing a weighted sum of structural differences between the genomes as well as one reference genome.
       Section: The formulation of SSV-MG

    9. The SSV-MG problem
       The weight of an SV event is a function of:
       - The expected genomic proximity of the assigned donor genomes.
       - The type, locus and length of the SV event.
       If an SV event is supported by many related genomes, its weight will be relatively small. Currently, the weights are computed fairly naïvely. In this setting, the SSV-MG problem asks for a set of SV events whose total weight is as small as possible.
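
To make the idea of "a shared event is relatively cheap" concrete, here is a toy weight function. It is purely an assumption for illustration: the slides do not give the actual weights, and the form used here (1 for the first supporting genome plus a smaller `extra_genome_cost` for each additional one) and both function names are hypothetical.

```python
from typing import FrozenSet, Iterable


def toy_event_weight(genomes: FrozenSet[str], extra_genome_cost: float = 0.25) -> float:
    """Hypothetical weight: 1 for the first genome, a smaller cost per extra genome."""
    if not genomes:
        return 0.0
    return 1.0 + extra_genome_cost * (len(genomes) - 1)


def total_weight(events: Iterable[FrozenSet[str]]) -> float:
    """Objective value of a candidate solution: the sum of its event weights."""
    return sum(toy_event_weight(genomes) for genomes in events)


# One event shared by the whole trio versus three separate single-genome events.
shared = [frozenset({"father", "mother", "child"})]
separate = [frozenset({"father"}), frozenset({"mother"}), frozenset({"child"})]
print(total_weight(shared), total_weight(separate))
```

Under this toy weighting, explaining the trio's reads with one shared event costs less than explaining them with three private events, which is the behaviour the slide describes.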

    10. Algorithmic Formulation of SSV-MG
        The aim is to find a unique assignment of each discordant read to exactly one of the maximal SV clusters, and to minimize the total weight of the selected SV events.

    11. SSV-MG for two donor genomes
        In the special case of only two donor genomes (associated with two colors, red and black), the cost function would be:

    12. Complexity and Algorithms for the SSV-MG Problem
        The SSV-MG problem is NP-hard. It is also NP-hard to approximate within a factor expressed in terms of n, the total number of discordant reads, and the maximum and minimum possible weights of an SV event. (Intuitively, the maximum is the weight of a multicolor SV event that has assigned reads from all the genomes, and the minimum is the weight of an SV event that has assigned reads from only one specific genome.)
        Section: Algorithms and heuristics for SSV-MG

    13. A greedy algorithm
        A simple greedy algorithm, the Simultaneous Set Cover method (SSC), gives the best possible approximation factor (asymptotically). At each iteration, the algorithm selects the one set (i.e. maximal valid cluster, i.e. SV event) that covers the maximum number of uncovered discordant reads.
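
The selection rule on this slide is the classic greedy set cover step, which can be sketched as below. The sketch assumes that the discordant reads of all genomes are pooled into one universe and that each cluster is given as a plain set of read identifiers; the cluster and read names are made up, and this is not the Common-LAW implementation.

```python
from typing import Dict, List, Set


def greedy_set_cover(clusters: Dict[str, Set[str]]) -> List[str]:
    """Repeatedly pick the cluster covering the most still-uncovered reads."""
    uncovered: Set[str] = set().union(*clusters.values()) if clusters else set()
    chosen: List[str] = []
    while uncovered:
        # Ties are broken by dictionary order (first cluster with the best count).
        best = max(clusters, key=lambda name: len(clusters[name] & uncovered))
        chosen.append(best)
        uncovered -= clusters[best]
    return chosen


# Reads are prefixed by genome (f = father, m = mother, c = child) for readability.
example_clusters = {
    "del_A": {"f1", "m1", "c1"},
    "del_B": {"c1", "c2"},
    "del_C": {"c2"},
}
print(greedy_set_cover(example_clusters))
```

The loop always terminates because the universe is built from the clusters themselves, so the best cluster covers at least one new read in every round.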

    14. A faster algorithm for limited read mapping loci
        Further improvement is possible in the special case where each discordant read maps to a small number of loci, for example when there are two donor genomes and each discordant read maps to exactly two locations. The approximation factor improves in this case. Idea: vertex cover!
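
One way to read the vertex cover idea: if every read maps to exactly two candidate clusters, each read can be seen as an edge between two cluster vertices, and choosing clusters that explain all reads means covering all edges. The sketch below shows the textbook maximal-matching 2-approximation for unweighted vertex cover. It ignores weights and genome colors, so it illustrates the reduction only and is not the algorithm from the talk; the input format is an assumption.

```python
from typing import Dict, Set, Tuple


def vertex_cover_2approx(read_to_clusters: Dict[str, Tuple[str, str]]) -> Set[str]:
    """Textbook 2-approximation: take both endpoints of each still-uncovered edge."""
    cover: Set[str] = set()
    for _read, (u, v) in read_to_clusters.items():
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


# Each read (edge) lists the two clusters (vertices) it could be assigned to.
reads = {"r1": ("A", "B"), "r2": ("B", "C"), "r3": ("C", "D")}
selected = vertex_cover_2approx(reads)
print(sorted(selected))
```

In this toy input the optimal cover is {B, C}, while the sketch returns four clusters, which is within the factor of two guaranteed by the matching argument.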

    15. Efficient heuristic methods
        - Simultaneous Set Cover with Weights (SSC-W): a greedy method similar to the weighted set cover algorithm.
        - Simultaneous Set Cover with Weights and Conflict Resolution (SSC-W-CR): employs the concept of conflict resolution (Hormozdiari et al., ISMB 2010) and takes the diploid nature of the human genome into consideration.

    16. Simultaneous set cover with weights (SSC-W)
        Selects the SV clusters iteratively, based on their "cost-effectiveness" value in each iteration. In a given iteration, the method selects the set with the best "cost-effectiveness" value, based on the maximum number of colors that can be assigned to the set in that iteration. The cost-effectiveness of an SV cluster s in iteration i is defined in terms of the weight of the subset of s that contains the newly covered elements.
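
A weighted greedy selection in the spirit of this slide can be sketched as follows. The exact cost-effectiveness formula is not preserved in the transcript, so the sketch substitutes the standard weighted set cover ratio (weight of the newly covered subset divided by the number of newly covered reads) together with a toy weight; `subset_weight`, `extra_genome_cost`, and the `(genome, read id)` read format are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Read = Tuple[str, str]  # (genome name, read id); hypothetical labels


def subset_weight(reads: Set[Read], extra_genome_cost: float = 0.25) -> float:
    """Toy weight (an assumption): 1 plus a small cost per additional genome."""
    genomes = {genome for genome, _ in reads}
    return 1.0 + extra_genome_cost * (len(genomes) - 1)


def weighted_greedy_cover(clusters: Dict[str, Set[Read]]) -> List[str]:
    """Each round, pick the cluster with the lowest weight per newly covered read."""
    uncovered: Set[Read] = set().union(*clusters.values()) if clusters else set()
    chosen: List[str] = []
    while uncovered:
        def ratio(name: str) -> float:
            new = clusters[name] & uncovered
            return subset_weight(new) / len(new) if new else float("inf")

        best = min(clusters, key=ratio)
        chosen.append(best)
        uncovered -= clusters[best]
    return chosen


example = {
    "shared_del": {("father", "f1"), ("mother", "m1"), ("child", "c1"), ("child", "c2")},
    "father_only": {("father", "f1"), ("father", "f2")},
    "child_only": {("child", "c1")},
}
print(weighted_greedy_cover(example))
```

In the example, the shared cluster is picked first because its four new reads cost 1.5 in total, and the father-only cluster is picked next for the one read that remains.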

    17. Alu insertion analysis of the YRI trio

    18. Simultaneous set cover with weights and Conflict Resolution (SSC-W-CR)
        - Utilizes the concept of a "conflict graph", based on the mathematical rules in VariationHunter-CR.
        - Extends it to multiple genomes: reads from the same genome may not be assigned to three clusters that form a triangle in the conflict graph.
        - Again selects the clusters iteratively, in a greedy manner, based on their cost-effectiveness.
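
The triangle rule can be checked with a small helper like the one below. It assumes the conflict graph is given as a symmetric adjacency dictionary and that the clusters already holding reads from a given genome are tracked in a set; how the conflict graph itself is built (the VariationHunter-CR rules) is not covered, and all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set


def creates_conflict_triangle(
    candidate: str, used_by_genome: Set[str], conflicts: Dict[str, Set[str]]
) -> bool:
    """True if the candidate and two clusters already used by this genome
    are pairwise in conflict, i.e. they form a triangle in the conflict graph."""
    neighbours = conflicts.get(candidate, set()) & used_by_genome
    return any(
        b in conflicts.get(a, set()) for a, b in combinations(sorted(neighbours), 2)
    )


# Symmetric toy conflict graph: A, B and C conflict pairwise; D conflicts only with A.
conflict_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
print(creates_conflict_triangle("C", {"A", "B"}, conflict_graph))
print(creates_conflict_triangle("D", {"A", "B"}, conflict_graph))
```

A greedy loop could call such a check before assigning a genome's reads to a candidate cluster and skip the candidate when the check returns true.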

    19. The parent-offspring trios
        Two high-coverage trios:
        - A Yoruba (from Ibadan, Nigeria) father-mother-child trio
        - A CEU (European ancestry, from Utah) father-mother-child trio
        Section: Experimental results

    20. Alu insertion analysis of the YRI trio

    21. Deletions in the CEU and YRI trios
        We predicted medium to large deletions (> 100 bp and < 1 Mbp) in both the YRI and CEU trios. The validated SV events reported in the recent study of the 1000 Genomes Project Consortium (Mills et al., Nature 2011) were used to test the quality of the predictions in the CEU and YRI trios.

        Individuals   ISV&M   SSC-W   SSC-W-CR
        NA12878       1349    1408    1723
        NA12891       1191    1236    1468
        NA12892       1351    1402    1814

        Number of     CEU (NA12878, NA12891, NA12892)           YRI (NA18506, NA18507, NA18508)
        predictions   ISV&M         SSC-W         SSC-W-CR      ISV&M         SSC-W         SSC-W-CR
        2000          728 (725)     755 (751)     1412 (1396)   1280 (1279)   1293 (1291)   1536 (1520)
        3000          1058 (1058)   1106 (1106)   1780 (1763)   1794 (1789)   1797 (1794)   2098 (2082)
        4000          1277 (1281)   1342 (1345)   2003 (1982)   2192 (2183)   2200 (2197)   2554 (2534)
        5000          1449 (1457)   1517 (1527)   2139 (2121)   2518 (2508)   2537 (2534)   2920 (2900)
        6000          1584 (1596)   1667 (1678)   2234 (2219)   2771 (2765)   2804 (2802)   3207 (3186)
        7000          1659 (1674)   1775 (1796)   2314 (2305)   2997 (2996)   3040 (3042)   3453 (3446)
        8000          1738 (1757)   1861 (1886)   2368 (2363)   3192 (3195)   3231 (3241)   3662 (3682)
        9000          1797 (1816)   1933 (1962)   2398 (2396)   3382 (3388)   3417 (3434)   3830 (3887)
        10000         1852 (1875)   2005 (2038)   2411 (2410)   3512 (3532)   3548 (3594)   3970 (4084)
        11000         1892 (1918)   2064 (2099)   2420 (2422)   3651 (3687)   3694 (3757)   4084 (4270)
        12000         1942 (1968)   2118 (2159)   2437 (2441)   3753 (3787)   3786 (3874)   4173 (4425)
        13000         1960 (1988)   2151 (2195)   2445 (2457)   3851 (3907)   3887 (4003)   4247 (4602)
        14000         1986 (2015)   2177 (2225)   2455 (2460)   3958 (4010)   3968 (4126)   4314 (4756)

    22. Conclusion
        - Introduction of a new combinatorial optimization problem (SSV-MG) for simultaneous SV discovery.
        - Approximation algorithms and efficient heuristics for the SSV-MG problem.
        - Improved SV predictions in two different high-coverage datasets using the proposed framework.
        - COMMON-LAW will be available at http://compbio.cs.sfu.ca/strvar

    23. Thank you!
        Acknowledgements: Cenk Sahinalp, Fereydoun Hormozdiari, Andrew McPherson, Faraz Hach, Phuong Dao, Farhad Hormozdiari, Deniz Yorukoglu, Reza Shahidi-Nejad, Marzieh Bakhshi, Lucas Swanson, and the 1000 Genomes Project Structural Variation subgroup.

    24. Outline
        - Introduction
          - Structural variation (SV) and paired-end sequencing
          - Current methods and their limitations
        - Our contribution
          - The Simultaneous Structural Variation – Multiple Genomes discovery problem (SSV-MG)
          - The mathematical formulation of the SSV-MG problem
          - Algorithms and efficient heuristics for SSV-MG
        - Experimental results
          - Deletions and Alu insertions, a Yoruba trio
          - Deletions, a CEPH trio from Utah (part of the 1000GP)

    25. Structural Variation (SV)
        - Using next-generation sequencing, thousands of genomes are now available.
        - A large number of structural variation (SV) events (>50 bp; deletions, insertions, duplications, inversions, transpositions) have been characterized in individual genomes.
        - SVs have been associated with diverse diseases, including autism, intellectual disability and Crohn's disease.
