Rare-Allele Detection Using Compressed Se(que)nsing Or Zuk Broad Institute of MIT and Harvard orzuk@broadinstitute.org In collaboration with: Amnon Amir Dept. of Physics of Complex Systems, Weizmann Inst. of Science Noam Shental Dept. of Computer Science, The Open University of Israel
Outline • Motivation • The Problem • Compressed Sensing • Conclusion
Rare recessive genetic diseases Genotype Phenotype Normal Healthy Carrier Healthy! Affected Sick
Large scale carrier screen (rates vary across ethnic groups)
Specific mutations HEXA gene on chromosome 15 over 100 mutations are known
Specific mutations - notation …AGCGTTCT… “A” Reference genome …AGTGTTCT… “B” Single-nucleotide polymorphisms (SNPs) …AGGTTCT “B” Insertions/Deletions (InDels) Carrier test screen: amplify a sample of DNA and then test the fraction of B’s out of tested alleles: 0 for “AA”, 1/2 for “AB”
Genome Wide Association Studies Cases Controls collect DNA samples Count: Statistical test, p-value BB AA AA AB AA AA AA AB AA AA AB AB AA BB AA AB AB AB Try ~10^5 – 10^6 different SNPs. Significant ones are called ‘discoveries’ or ‘associations’
Published Genome-Wide Associations through 12/2009, 658 published GWA at p < 5×10^-8 [NHGRI GWAS Catalog www.genome.gov/GWAStudies]
Goal: push further What Associations are Detected? [T.A. Manolio et al. Nature 2009]
Naïve Approach – One Test per Individual collect DNA samples Apply 9 independent tests AA AA AA AA AA AA AA AB AB fraction of B’s out of tested alleles 0 0 0 0 1/2 0 0 0 1/2 Problem: rare alleles require profiling a large number of individuals, which is still very costly
Compressed Sensing Based Group Testing Next Generation Sequencing Technology fraction of B’s infer/reconstruct compressed sensing a few tests instead of 9
Results (example) f - sparsity Can reconstruct over 10,000 people with no errors, using only 200 lanes [arXiv:0909.0400v1]
Rare Allele Identification in a CS Framework # rare alleles individuals in the pool
Compressed Sensing (CS) • The standard CS problem: • n variables • k << n equations • But: x is sparse, with at most s nonzero entries • Matrix should obey certain properties (Restricted Isometry Property) • Example: random Gaussian or Bernoulli matrix • Then: Can reconstruct x uniquely with k = O(s log(n/s)) equations (a.k.a. ‘measurements’) • Can do so efficiently, even for large matrices (L1 minimization)
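The L1 minimization mentioned above is usually written as the following optimization problem. This is the standard textbook form, not necessarily the exact variant used in this work; the symbols y (the vector of k measurements) and ε (a noise tolerance) are notation assumed here, and M is the k × n sensing matrix.

```latex
% Noiseless case (basis pursuit): among all x consistent with the
% measurements, pick the one with the smallest L1 norm.
\hat{x} = \arg\min_{x \in \mathbb{R}^n} \|x\|_1
\quad \text{subject to} \quad M x = y

% Noisy case: allow a residual of size at most \epsilon.
\hat{x} = \arg\min_{x \in \mathbb{R}^n} \|x\|_1
\quad \text{subject to} \quad \|M x - y\|_2 \le \epsilon
```

Both problems are convex, which is why the reconstruction stays tractable even for large matrices.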
Measuring Device – NGST Roche/454 Illumina Solexa Helicos Applied Biosystems SOLiD
NGST Output output: “reads” Illumina: a few million reads per lane 454: almost 1 million Read length – a few dozen to a few hundred line = “read”
NGST – Targeted Sequencing We measure the number of reads containing B out of the total number of reads. Here: 1/16
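As a small illustration of this measurement, the sketch below counts the reads that carry the “B” base at a targeted site. The function name, the read strings, and the assumption that reads are already aligned so the site falls at a fixed index are all hypothetical; the sequences reuse the A/B example from the notation slide.

```python
from fractions import Fraction


def b_read_fraction(reads, site_index, b_base):
    """Fraction of reads showing the B base at the targeted site.

    Assumes every read is aligned so that the site sits at `site_index`;
    reads too short to cover the site are ignored. Raises
    ZeroDivisionError if no read covers the site.
    """
    covering = [read for read in reads if len(read) > site_index]
    b_count = sum(1 for read in covering if read[site_index] == b_base)
    return Fraction(b_count, len(covering))


# 15 reads with the reference allele "A" (C at index 2) and
# 1 read with the alternative allele "B" (T at index 2).
reads = ["AGCGTTCT"] * 15 + ["AGTGTTCT"]
print(b_read_fraction(reads, 2, "T"))  # 1/16
```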
Model Formulation Ideal measurement: the fraction of “B” reads. NGST measurement: 1. Sampling noise: only a finite number of reads, r, is obtained from each site, so the frequency is estimated; r is itself a random variable. 2. Technical errors: read errors (0.5-1%) and DNA preparation errors. Parts of this modeling appeared in [P. Prabhu & I. Pe’er, Genome Research, July 2009]
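The two noise sources can be sketched with a toy simulation. This is only an illustration under stated assumptions, not the model from the manuscript: the number of reads r is drawn from a normal approximation around a mean read count, read errors are taken to be symmetric (a “B” read as “A” as often as the reverse), and DNA preparation errors are left out. The function and parameter names are hypothetical.

```python
import math
import random


def noisy_b_fraction(true_fraction, mean_reads, read_error, rng):
    """Simulate one NGST estimate of the fraction of B reads at a site."""
    # Sampling noise: the number of reads r is itself random.
    # Assumption: normal approximation with variance equal to the mean.
    r = max(1, round(rng.gauss(mean_reads, math.sqrt(mean_reads))))

    # Technical noise: each read is misreported with probability
    # `read_error` (assumed symmetric between A and B).
    p_observed = (true_fraction * (1 - read_error)
                  + (1 - true_fraction) * read_error)

    # Each of the r reads independently shows B with probability p_observed.
    b_reads = sum(1 for _ in range(r) if rng.random() < p_observed)
    return b_reads / r


rng = random.Random(0)
estimate = noisy_b_fraction(1 / 16, mean_reads=4000, read_error=0.01, rng=rng)
print(estimate)
```

With a 1% read error the estimate is biased away from the true 1/16, which is why the read-error rate matters when the carrier fraction in a pool is itself small.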
Unique Properties of this Application 1. Measurement noise is pool dependent 2. The sensing matrix is known only up to noise: DNA preparation errors, potential technical problems 3. Potential constraints on the matrix M - sparseness: total amount of DNA
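One way to picture a sparseness constraint on M is a pooling design in which every pool contains the same small number of individuals. The sketch below builds such a 0/1 matrix at random and computes the ideal fraction of B alleles per pool; the fixed pool size, the equal-DNA-per-individual assumption, and all names are illustrative choices rather than the design used in the manuscript.

```python
import random


def sparse_pooling_matrix(num_pools, num_individuals, pool_size, rng):
    """Random 0/1 matrix with exactly `pool_size` ones in every row."""
    matrix = []
    for _ in range(num_pools):
        members = set(rng.sample(range(num_individuals), pool_size))
        matrix.append([1 if j in members else 0
                       for j in range(num_individuals)])
    return matrix


def ideal_pool_fractions(matrix, genotypes):
    """Ideal fraction of B alleles in each pool.

    genotypes[j] is the number of B alleles (0, 1 or 2) carried by
    individual j; each individual contributes two alleles and is
    assumed to contribute an equal amount of DNA.
    """
    return [sum(m * g for m, g in zip(row, genotypes)) / (2 * sum(row))
            for row in matrix]


rng = random.Random(0)
M = sparse_pooling_matrix(num_pools=5, num_individuals=20, pool_size=4, rng=rng)
genotypes = [0] * 20
genotypes[7] = 1  # a single heterozygous carrier ("AB")
print(ideal_pool_fractions(M, genotypes))
```

A pool that includes the carrier reads 1/8 in this setup (one B allele out of eight), and every other pool reads 0.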
Conclusions • Generic approach: puts together sequencing and CS to identify rare allele carriers. • The method naturally deals with all possible scenarios of multiple carriers and heterozygous or homozygous rare alleles. • Much higher efficiency than the naive approach. • Directions for improvement: • x is ternary (0,1,2): how does one incorporate this into the optimization? • Dependence among loci, prior information, … • Manuscript available on arXiv: • arXiv:0909.0400v1 [N. Shental, A. Amir and O. Zuk, in revision] • CS can be used for other problems in genomics. Examples we pursue: • Bacterial Community Reconstruction [A. Amir and O. Zuk, in revision] • Alternatively Spliced Isoforms Reconstruction [O. Zuk et al., in preparation]