ISB
  Ravi Pandya | Bill Bolosky
  Microsoft
  June 28 2012
Genomics project
  Collaboration with UC Berkeley AMP Lab
    Dave Patterson, Armando Fox, Michael Jordan (ML), Taylor Sittler (UCSF Med), students, …
  Long term: Cancer genomics
    David Haussler (UC Santa Cruz): Cancer Genomics Hub / Cancer Genome Atlas (TCGA)
    500 TB (growing to 20 PB) of tumor/normal genomes at San Diego Supercomputer Center
  Near term: Genome sequencing pipeline
    Motivated by Archon Genomics X-Prize (September 2013)
    100 samples of DNA from centenarians (>105 years old)
    Sequence with best coverage, accuracy, and cost in 1 month
    Goal: 98% coverage, 99.9999% accuracy, $1000/genome
    Current tools (GATK, CLC) are not sufficient to meet the goal
Genomics pipeline
  Fast, accurate, scalable
  Apply state-of-the-art computer science to the sequencing problem
    Machine learning, distributed systems, high-performance computing
  Open source for Windows+Linux | Windows Azure cloud service
  SNAP (available now)
    Fast aligner using hash-based index of entire genome
    10-40x faster than BWA
  FLASH (in progress)
    Comprehensive probabilistic model
    Reference-based alignment + targeted de novo contig assembly + scaffold assembly
Genomics pipeline (diagram)
  Labels in the figure: SNAP; unaligned reads; aligned reads; hash clustering; FLASH; optimization; de novo assembly; scaffold assembly; call SNPs, indels, SVs
SNAP
  Reference genome: CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCA...GTTTAGCTCAAAGAG...
  Hash index of seed {locations}: AGCTCAAA GAAAGAA
  1. Look up seeds
  2. Map locations
  3. Score matches
  Read sequence: CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCAG
  ~15 core-hours for 30x coverage
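The three steps on the SNAP slide (look up seeds, map locations, score matches) can be illustrated with a minimal Python sketch. This is not SNAP's implementation: the function names, the seed length, and the mismatch limit are hypothetical, and candidates are scored with a plain Hamming distance as a stand-in for whatever scoring SNAP actually uses.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Map every seed (substring of length seed_len) to its reference positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, index, seed_len, max_mismatches=3):
    """Return (start, mismatches) of the best placement, or None if none qualifies."""
    candidates = set()
    for offset in range(len(read) - seed_len + 1):
        seed = read[offset:offset + seed_len]
        # 1. Look up the seed in the hash index.
        for pos in index.get(seed, ()):
            # 2. Map the seed hit to the read start it implies.
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # 3. Score every candidate location and keep the best one.
    best = None
    for start in sorted(candidates):
        distance = hamming(read, reference[start:start + len(read)])
        if distance <= max_mismatches and (best is None or distance < best[1]):
            best = (start, distance)
    return best


reference = "CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCAGTTTAGCTCAAAGAG"
index = build_seed_index(reference, seed_len=8)
read = "GGCTGCAGCTCGCTTTAACC"  # reference[11:31] with one substituted base
print(align_read(read, reference, index, seed_len=8))
```

Seeds that overlap the substituted base miss the index, but the remaining seeds still point at the correct start, which is why several seeds per read are tried.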
FLASH (diagram)
  Labels in the figure: SNAP aligner; genomic prior knowledge; machine learning models; sparse matrices; alignment; candidate assembly; depth; likelihood; separation; coverage; pair distance; overlap; optimize
Read alignment (diagram)
  Inputs: SNAP alignment; sequencing error; mutation frequency; variant databases
  Matrices: 1B reads x 3B bp genome x strands
  RGS = Read-Genome-Strand candidate assembly
  LRG = Likelihood of Read-Genome alignment
  Example matrix values in the figure: 1 1 1 1 0.9 0.6 0.7 0.2 0.8
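A matrix of 1B reads by 3B genome positions is only tractable if almost all entries are left implicit. The sketch below is one way to picture that, under loud assumptions: the dictionary layout, the name `best_placements`, and the rule "pick the single most likely placement per read" are all hypothetical and are not taken from FLASH.

```python
def best_placements(lrg):
    """Derive a candidate assembly (RGS) from sparse alignment likelihoods (LRG).

    lrg maps read_id -> {(position, strand): likelihood}; placements that are
    absent from the inner dict are implicitly zero, which keeps it sparse.
    Assumption: each read is assigned to its single most likely placement.
    """
    rgs = {}
    for read_id, candidates in lrg.items():
        if candidates:  # reads with no candidate placement stay unaligned
            rgs[read_id] = max(candidates, key=candidates.get)
    return rgs


lrg = {
    "read1": {(100, "+"): 0.9, (5000, "-"): 0.2},
    "read2": {(250, "+"): 0.6, (251, "+"): 0.7},
    "read3": {},  # no candidate placement found
}
print(best_placements(lrg))
```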
Coverage distribution (diagram)
  Inputs: assembly; sequencer characteristics; alignment data; assembly RGS
  Matrices: strands x 3B bp genome coverage
  GSC = Genome-Strand Coverage
  LC = Likelihood of Coverage
  Example values in the figure: 22 24 29 35 34; 0.1 0.12 0.14 0.12 0.1
Hash clustering
  Cluster unaligned reads with overlapping bases
  Starting point for assembling contigs
  1. Count seeds
  2. Bucket reads by seed
  3. Connect overlapping reads
  4. Cluster connected components
  Example seed counts in the figure: 2 3 1 1
  Example reads:
    CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC
    GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA
    CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA
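The four hash-clustering steps can be sketched in a few lines of Python. This is an illustration under assumptions, not the FLASH code: `cluster_reads` and `seeds_of` are hypothetical names, the seed length of 12 is arbitrary, a shared seed is treated as a proxy for overlap, and connected components are found with a simple union-find.

```python
from collections import Counter, defaultdict


def seeds_of(read, seed_len):
    """All distinct substrings of length seed_len in a read."""
    return {read[i:i + seed_len] for i in range(len(read) - seed_len + 1)}


def cluster_reads(reads, seed_len=12):
    """Group reads that are transitively linked by at least one shared seed."""
    read_seeds = [seeds_of(r, seed_len) for r in reads]

    # 1. Count how many reads contain each seed.
    counts = Counter(s for seeds in read_seeds for s in seeds)

    # 2. Bucket reads by seed, keeping only seeds shared by two or more reads.
    buckets = defaultdict(list)
    for read_id, seeds in enumerate(read_seeds):
        for s in seeds:
            if counts[s] > 1:
                buckets[s].append(read_id)

    # 3. Connect reads that fall into the same bucket (union-find).
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in buckets.values():
        root = find(members[0])
        for other in members[1:]:
            parent[find(other)] = root

    # 4. Collect connected components as clusters of read indices.
    clusters = defaultdict(list)
    for read_id in range(len(reads)):
        clusters[find(read_id)].append(read_id)
    return sorted(clusters.values())


reads = [
    "CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC",
    "GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA",
    "CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA",
]
print(cluster_reads(reads))
```

With the three reads from the slide, the first two share a long overlap and end up in one cluster, while the third stays on its own.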
Targeted de novo assembly (diagram)
  Labels in the figure: contig "genome"; genomic prior knowledge; machine learning models; calc; infer; update; candidate assembly; alignment; depth; likelihood; coverage; separation; pair distance; overlap; hash clusters; optimize
Scaffold assembly
  Maximum likelihood model
  Optimized reference contigs + de novo unaligned contigs
  Explore space of possible arrangements into a sample genome
  Optimize P(observed reads | candidate genome) = sequencing error + coverage depth + pair distance
  Incremental calculation using sparse matrix model
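One way to read the three-term objective is as a sum of log-likelihoods, so that candidate genomes can be compared by a single score. The sketch below assumes that reading. The specific distributions (independent per-base errors, Poisson coverage depth, Gaussian pair distance), the function name, and all default parameter values are assumptions chosen for illustration; the slides do not specify them.

```python
import math


def log_likelihood(mismatches, bases, depths, pair_distances,
                   error_rate=0.01, mean_depth=30.0,
                   mean_insert=300.0, sd_insert=30.0):
    """Score a candidate genome as the sum of three log-likelihood terms."""
    # Sequencing error: independent per-base errors (assumed model).
    sequencing = (mismatches * math.log(error_rate)
                  + (bases - mismatches) * math.log(1.0 - error_rate))

    # Coverage depth: Poisson with the expected depth (assumed model).
    coverage = sum(d * math.log(mean_depth) - mean_depth - math.lgamma(d + 1)
                   for d in depths)

    # Pair distance: Gaussian around the expected insert size (assumed model).
    pairs = sum(-0.5 * ((x - mean_insert) / sd_insert) ** 2
                - math.log(sd_insert * math.sqrt(2.0 * math.pi))
                for x in pair_distances)

    return sequencing + coverage + pairs


# Two hypothetical candidate arrangements of the same reads.
good = log_likelihood(mismatches=2, bases=1000, depths=[30, 31], pair_distances=[300])
bad = log_likelihood(mismatches=20, bases=1000, depths=[5, 60], pair_distances=[900])
print(good > bad)
```

Because the score is a sum of independent terms, changing one part of a candidate arrangement only requires recomputing the terms it touches, which fits the incremental calculation mentioned on the slide.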
Next steps? …
  SNAP
    Apply to more datasets / platforms / organisms
    Validate accuracy / coverage
  FLASH
    Use Kaviar for population priors
    Different approaches to assembly / structural variation
  Biology
    What interesting research could this enable – scale, speed, accuracy, analysis?