Rare-Allele Detection Using Compressed Se(que)nsing Noam Shental Department of Computer Science, The Open University of Israel shental@openu.ac.il
Rare-Allele Detection Using Compressed Se(que)nsing Or Zuk Broad Institute of MIT and Harvard In collaboration with: Amnon Amir Department of Physics of Complex Systems, Weizmann Institute of Science Noam Shental Department of Computer Science, The Open University of Israel
Rare recessive genetic diseases
Genotype    Phenotype
Normal      Healthy
Carrier     Healthy!
Affected    Sick
Large scale carrier screen (rates vary across ethnic groups)
Published Genome-Wide Associations through 12/2009: 658 published GWA at p < 5×10⁻⁸ [NHGRI GWAS Catalog, www.genome.gov/GWAStudies]
What Associations are Detected? [T.A. Manolio et al. Nature 2009]
Specific mutations: HEXA gene on chromosome 15; over 100 mutations are known
Specific mutations - notation
…AGCGTTCT… “A”: reference genome
…AGTGTTCT… “B”: single-nucleotide polymorphisms (SNPs)
…AGGTTCT “B”: insertions/deletions (InDels)
Carrier test screen: amplify a sample of DNA and then test the fraction of B’s out of tested alleles: 0 for “AA”, 1/2 for “AB”.
Naïve approach – one test per individual
Collect DNA samples and apply 9 independent tests.
Genotypes: AA AA AA AA AA AA AA AB AB
Fraction of B’s out of tested alleles: 0 0 0 0 1/2 0 0 0 1/2
Compressed sensing based group testing
Pool the samples, measure the fraction of B’s in each pool with Next Generation Sequencing Technology, then infer/reconstruct the carriers via compressed sensing: a few tests instead of 9.
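The pooling idea can be illustrated with a toy sketch. Everything here is an assumption for illustration: the 5-pool design `POOLS`, the helper names, and the decoder, which is a plain exhaustive search over sparse genotype vectors and not the compressed-sensing solver used in the talk. Measurements are noise-free, and carriers sit at positions 5 and 9, as in the fractions of the naive 9-test example.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical pooling design: 5 pools (rows) x 9 individuals (columns).
# Each individual contributes DNA to exactly two pools.
POOLS = [
    [0, 0, 0, 1, 1, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 0],
]


def pool_fractions(pools, genotypes):
    """Ideal (noise-free) fraction of B alleles in each pool.

    genotypes[j] is the number of B alleles of individual j (0, 1 or 2);
    every individual contributes 2 alleles to each pool they are in.
    """
    return [
        Fraction(sum(m * g for m, g in zip(row, genotypes)), 2 * sum(row))
        for row in pools
    ]


def decode(pools, measured, max_carriers=2):
    """Exhaustive search: all genotype vectors with at most max_carriers
    non-zero entries that reproduce the measured fractions exactly."""
    n = len(pools[0])
    matches = []
    for k in range(max_carriers + 1):
        for support in combinations(range(n), k):
            for values in product((1, 2), repeat=k):
                x = [0] * n
                for j, v in zip(support, values):
                    x[j] = v
                if pool_fractions(pools, x) == measured:
                    matches.append(x)
    return matches


truth = [0, 0, 0, 0, 1, 0, 0, 0, 1]  # two heterozygous (AB) carriers
y = pool_fractions(POOLS, truth)      # 5 pooled measurements instead of 9
candidates = decode(POOLS, y)
print(candidates)
```

For this particular carrier pattern the search returns a single candidate, the true vector. That is not guaranteed for every pattern under this small design, and exhaustive search does not scale; it only shows why a few pooled measurements can suffice when carriers are rare.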
Example: arXiv 0909.0400v1
Rare allele identification in a CS framework # rare alleles individuals in the pool
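The slide's figure is not recoverable from the transcript. As a hedged sketch, the standard compressed-sensing setup it refers to can be written as follows; the symbols N, T, M̃ and ε are assumptions introduced here, and the ℓ1 program is the generic CS relaxation, not necessarily the exact solver used by the authors.

```latex
\begin{align*}
x &\in \{0,1,2\}^{N}
  && \text{number of rare alleles of each of } N \text{ individuals (sparse)} \\
M &\in \{0,1\}^{T \times N}, \quad T \ll N
  && M_{ij} = 1 \text{ if individual } j \text{ is in pool } i \\
y_i &= \frac{\sum_{j} M_{ij}\, x_j}{2 \sum_{j} M_{ij}}
  && \text{ideal fraction of B alleles in pool } i \\
\hat{x} &= \arg\min_{x} \|x\|_1
  \quad \text{s.t.} \quad \|\tilde{M} x - y\|_2 \le \epsilon,
  && \tilde{M}_{ij} = \frac{M_{ij}}{2 \sum_{k} M_{ik}}
\end{align*}
```

The factor 2 reflects that each individual carries two alleles, so a single heterozygous carrier tested alone gives a fraction of 1/2.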
Measuring device – NGST: Roche/454, Illumina Solexa, Helicos, Applied Biosystems SOLiD
NGST output: “reads” (line = “read”). Illumina: a few million reads per lane. 454: almost 1 million.
NGST – targeted sequencing We measure the number of reads containing B out of the total number of reads.
Model formulation
Ideal measurement: the fraction of “B” reads.
NGST measurement:
1. Sampling noise: only a finite number of reads, r, is obtained from each site, so the frequency is estimated; r is itself a random variable.
2. Technical errors: read errors (0.5-1%) and DNA preparation errors.
Parts of this modeling appeared in P. Prabhu & I. Pe’er, Genome Research, July 2009.
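A minimal simulation of this noise model, under stated assumptions: the read count r is drawn from a normal approximation to a Poisson count around a hypothetical `mean_reads`, read errors are modeled as a symmetric A/B flip with probability `read_error`, and DNA preparation errors are not modeled. The function names are illustrative and do not come from the slides.

```python
import random


def ideal_fraction(pool_row, genotypes):
    """Noise-free fraction of B alleles in one pool (2 alleles per person)."""
    members = sum(pool_row)
    b_alleles = sum(m * g for m, g in zip(pool_row, genotypes))
    return b_alleles / (2 * members)


def simulate_measurement(p_ideal, mean_reads, read_error, rng):
    """Estimated B frequency from a finite, random number of reads.

    Assumptions (not from the slides): r ~ Normal(mean, sqrt(mean)) rounded,
    as a stand-in for a Poisson count; each read is flipped A<->B with
    probability read_error.
    """
    r = max(1, round(rng.gauss(mean_reads, mean_reads ** 0.5)))
    # Probability that a single read is reported as B after read errors.
    p_obs = p_ideal * (1 - read_error) + (1 - p_ideal) * read_error
    b_reads = sum(rng.random() < p_obs for _ in range(r))
    return b_reads / r


rng = random.Random(0)
p = ideal_fraction([1, 1, 1, 1], [0, 1, 0, 0])  # one carrier in a pool of 4
print(p, simulate_measurement(p, 5000, 0.01, rng))
```

With few reads the estimate scatters visibly around the ideal value, and a read error of about 1% shifts it upward for rare alleles, which is why the noise level depends on the pool.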
Unique properties of this application
1. Measurement noise is pool dependent.
2. The sensing matrix is known only up to noise: DNA preparation errors, potential technical problems.
3. Potential constraints on the matrix M - sparseness: total amount of DNA.
Current work – Dor Yeshorim. In collaboration with Y. Erlich, CSHL. 8000 DNA samples.
Conclusions
• Generic approach that puts together sequencing and CS for identifying rare allele carriers.
• The method naturally deals with all possible scenarios of multiple carriers and heterozygous or homozygous rare alleles.
• Much higher efficiency than the naive approach.
• Directions for improvement:
  • x is trinary (0,1,2): how does one incorporate this into optimization?
  • Dependence among loci
Related Work • Erlich et al.
Other Applications
• Compressed Sensing / Sparse Reconstruction can be used for other problems in genomics.
• Other problems:
  • Bacterial Community Reconstruction