
Rare-Allele Detection Using Compressed Se(que)nsing

Rare-Allele Detection Using Compressed Se(que)nsing. Or Zuk Broad Institute of MIT and Harvard orzuk@broadinstitute.org In collaboration with: Amnon Amir Dept. of Physics of Complex Systems, Weizmann Inst. of Science Noam Shental


Presentation Transcript


  1. Rare-Allele Detection Using Compressed Se(que)nsing Or Zuk Broad Institute of MIT and Harvard orzuk@broadinstitute.org In collaboration with: Amnon Amir Dept. of Physics of Complex Systems, Weizmann Inst. of Science Noam Shental Dept. of Computer Science, The Open University of Israel

  2. Outline • Motivation • The Problem • Compressed Sensing • Conclusion

  3. Rare recessive genetic diseases. Genotype → Phenotype: Normal → Healthy; Carrier → Healthy!; Affected → Sick

  4. Nationwide carrier screen

  5. Large scale carrier screen (rates vary across ethnic groups)

  6. Specific mutations: HEXA gene on chromosome 15; over 100 mutations are known

  7. Specific mutations - notation. Reference genome: …AGCGTTCT… “A”. Single-nucleotide polymorphisms (SNPs): …AGTGTTCT… “B”. Insertions/Deletions (InDels): …AGGTTCT “B”. Carrier test screen: amplify a sample of DNA and then test the fraction of B’s out of tested alleles: 0 for “AA”, 1/2 for “AB”

  8. Genome Wide Association Studies Cases Controls collect DNA samples Count: Statistical test, p-value BB AA AA AB AA AA AA AB AA AA AB AB AA BB AA AB AB AB Try ~10^5 – 10^6 different SNPs. Significant ones are called ‘discoveries’/‘associations’

  9. Published Genome-Wide Associations through 12/2009: 658 published GWA at p < 5x10^-8 [NHGRI GWAS Catalog www.genome.gov/GWAStudies]

  10. Goal: push further. What Associations are Detected? [T.A. Manolio et al. Nature 2009]

  11. Outline • Motivation • The Problem • Compressed Sensing • Conclusion

  12. Naïve Approach – One Test per Individual collect DNA samples Apply 9 independent tests AA AA AA AA AA AA AA AB AB fraction of B’s out of tested alleles 0 0 0 0 1/2 0 0 0 1/2 Problem: Rare alleles require profiling a high number of individuals. Still very costly

  13. Outline • Motivation • The Problem • Compressed Sensing • Conclusion

  14. Compressed Sensing Based Group Testing: measure the fraction of B’s in pooled samples with Next Generation Sequencing Technology, then infer/reconstruct the carriers with compressed sensing. A few tests instead of 9
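
The pooling idea on this slide can be illustrated with a minimal sketch. It assumes each individual contributes an equal amount of DNA to a pool and that pools are drawn at random; the function names `pooled_fractions` and `random_pools` are hypothetical and not taken from the talk. The genotype vector uses the 9-individual example from the naive approach (0 = AA, 1 = AB).

```python
import random

def pooled_fractions(x, pools):
    """Ideal (noise-free) fraction of B alleles in each pool.

    x[i] is the number of B alleles (0, 1 or 2) carried by individual i.
    pools[j] lists the individuals mixed into pool j.
    Assumption: every individual contributes the same amount of DNA.
    """
    fractions = []
    for members in pools:
        b_alleles = sum(x[i] for i in members)
        total_alleles = 2 * len(members)  # two alleles per (diploid) individual
        fractions.append(b_alleles / total_alleles)
    return fractions

def random_pools(n_individuals, n_pools, rng, p=0.5):
    """Bernoulli pooling design: each individual joins each pool with probability p."""
    pools = []
    for _ in range(n_pools):
        members = [i for i in range(n_individuals) if rng.random() < p]
        if not members:  # avoid an empty pool
            members = [rng.randrange(n_individuals)]
        pools.append(members)
    return pools

# Nine individuals, two of them heterozygous carriers (AB).
x = [0, 0, 0, 0, 1, 0, 0, 0, 1]

# Naive approach: one test per individual.
naive = pooled_fractions(x, [[i] for i in range(len(x))])

# Group testing: a few random pools instead of 9 separate tests.
rng = random.Random(0)
pools = random_pools(len(x), 4, rng)
y = pooled_fractions(x, pools)
```

Here `naive` reproduces the per-individual fractions of the naive approach, while `y` holds four pooled measurements from which the carriers would have to be reconstructed.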

  15. Results (example). f: sparsity. Can reconstruct over 10,000 people with no errors, using only 200 lanes [arXiv:0909.0400v1]

  16. Rare Allele Identification in a CS Framework # rare alleles individuals in the pool

  17. Compressed Sensing (CS) • The standard CS problem: • n variables • k << n equations • But: x is sparse • Matrix should obey certain properties (Restricted Isometry Property) • Example: random Gaussian or Bernoulli matrix • Then: Can reconstruct x uniquely with k = O(s log(n/s)) equations (a.k.a. ‘measurements’), where s is the number of non-zero entries of x • Can do so efficiently, even for large matrices (L1 minimization)
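
To make the reconstruction step concrete, here is a minimal sketch of sparse recovery with a random Gaussian matrix. It uses ISTA (iterative soft thresholding) on the L1-regularized least-squares objective, which is one common way to approximate L1 minimization; this is an assumption for illustration and not necessarily the solver used by the authors. The sizes `n`, `k`, the penalty `lam`, and all function names are hypothetical.

```python
import random

def matvec(A, v):
    """Compute A v for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def rmatvec(A, r):
    """Compute A^T r."""
    return [sum(A[j][i] * r[j] for j in range(len(A))) for i in range(len(A[0]))]

def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal operator of t * |v|)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def objective(A, y, x, lam):
    """0.5 * ||A x - y||^2 + lam * ||x||_1"""
    residual = [a - b for a, b in zip(matvec(A, x), y)]
    return 0.5 * sum(e * e for e in residual) + lam * sum(abs(v) for v in x)

def ista(A, y, lam=0.01, n_iter=2000):
    """Minimize the objective above by iterative soft thresholding."""
    # The squared Frobenius norm upper-bounds the largest eigenvalue of A^T A,
    # so its inverse is a safe (if conservative) step size.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * len(A[0])
    for _ in range(n_iter):
        residual = [a - b for a, b in zip(matvec(A, x), y)]
        grad = rmatvec(A, residual)
        x = [soft_threshold(xi - step * gi, step * lam) for xi, gi in zip(x, grad)]
    return x

# n variables, k << n equations, and a sparse x with two non-zero entries.
rng = random.Random(1)
n, k = 30, 15
A = [[rng.gauss(0, 1) / k ** 0.5 for _ in range(n)] for _ in range(k)]
x_true = [0.0] * n
x_true[3], x_true[17] = 1.0, 2.0
y = matvec(A, x_true)

x_hat = ista(A, y)
```

Comparing `x_hat` with `x_true` shows how well the sparse vector is recovered for a given choice of `k` and `lam`; a dedicated linear-programming or convex solver would typically converge faster than this plain loop.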

  18. Measuring Device – NGST Roche/454 Illumina Solexa Helicos Applied Biosystems SOLiD

  19. NGST Output. Output: “reads” (line = “read”). Illumina: a few million reads per lane. 454: almost 1 million. Read length: a few dozen to a few hundred

  20. NGST – Targeted Sequencing We measure the number of reads containing B out of the total number of reads. Here: 1/16

  21. Model Formulation. Ideal measurement: the fraction of “B” reads. NGST measurement: 1. Sampling noise: a finite number of reads, r, from each site gives an estimated frequency; r is itself a random variable. 2. Technical errors: read errors (0.5-1%) and DNA preparation errors. Parts of this modeling appeared in [P. Prabhu & I. Pe’er, Genome Research, July 2009]
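
The two noise sources on this slide can be simulated with a short sketch. Several details are assumptions made only for illustration: the number of reads r is drawn from a normal approximation around a hypothetical `mean_reads`, and a read error is modeled as a symmetric flip between “A” and “B”. The slide itself gives only the 0.5-1% error range and states that r is random.

```python
import random

def simulate_pool_measurement(f, mean_reads, read_error, rng):
    """Return (estimated B fraction, number of reads) for one pool.

    f          -- ideal fraction of B alleles in the pool
    mean_reads -- expected number of reads covering the site (hypothetical parameter)
    read_error -- probability that a read is miscalled (assumed symmetric A <-> B)
    """
    # Sampling noise: r is itself random (normal approximation, an assumption).
    r = max(1, round(rng.gauss(mean_reads, mean_reads ** 0.5)))
    b_reads = 0
    for _ in range(r):
        is_b = rng.random() < f          # the read comes from a B allele
        if rng.random() < read_error:    # technical error flips the call
            is_b = not is_b
        if is_b:
            b_reads += 1
    return b_reads / r, r

rng = random.Random(0)
# Ideal fraction 1/16 as in the targeted-sequencing example, with 1% read errors.
f_hat, r = simulate_pool_measurement(f=1 / 16, mean_reads=3000, read_error=0.01, rng=rng)
```

Repeating the call many times gives a feel for how far the estimated frequency `f_hat` scatters around the ideal fraction at a given read depth.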

  22. Unique Properties of this Application 1. Measurement noise is pool dependent. 2. The sensing matrix is known only up to noise: DNA preparation errors, potential technical problems. 3. Potential constraints on the matrix M: sparseness (total amount of DNA).

  23. Outline • Motivation • The Problem • Compressed Sensing • Conclusion

  24. Conclusions • Generic approach: puts together sequencing and CS to identify rare allele carriers. • The method naturally deals with all possible scenarios of multiple carriers and heterozygous or homozygous rare alleles. • Much higher efficiency than the naive approach. • Directions for improvement: • x is trinary (0,1,2): how does one incorporate this into the optimization? • Dependence among loci, prior information, … • Manuscript available on arXiv: • arXiv:0909.0400v1 [N. Shental, A. Amir and O. Zuk, in revision] • CS can be used for other problems in genomics. Examples we pursue: • Bacterial Community Reconstruction [A. Amir and O. Zuk, in revision] • Alternatively Spliced Isoforms Reconstruction [O. Zuk et al., in preparation]

  25. Thank You
