ISB

Ravi Pandya | Bill Bolosky
Microsoft
June 28 2012

Genomics project
  Collaboration with UC Berkeley AMP Lab
    Dave Patterson, Armando Fox, Michael Jordan (ML), Taylor Sittler (UCSF Med), students, …
  Long term: Cancer genomics
    David Haussler (UC Santa Cruz): Cancer Genomics Hub / Cancer Genome Atlas (TCGA)
    500 TB (growing to 20 PB) of tumor/normal genomes at San Diego Supercomputer Center
  Near term: Genome sequencing pipeline
    Motivated by Archon Genomics X-Prize (September 2013)
      100 samples of DNA from centenarians (>105 years old)
      Sequence with best coverage, accuracy, and cost in 1 month
    Goal: 98% coverage, 99.9999% accuracy, $1000/genome
    Current tools (GATK, CLC) are not sufficient to meet the goal

Genomics pipeline
  Fast, accurate, scalable
  Apply state-of-the-art computer science to the sequencing problem
    Machine learning, distributed systems, high-performance computing
  Open source for Windows+Linux | Windows Azure cloud service
  SNAP (available now)
    Fast aligner using hash-based index of entire genome
    10-40x faster than BWA
  FLASH (in progress)
    Comprehensive probabilistic model
    Reference-based alignment + targeted de novo contig assembly + scaffold assembly

Genomics pipeline
  [Pipeline diagram. Labels: Unaligned reads, SNAP, Aligned reads, Hash clustering, FLASH (Optimization, De novo assembly, Scaffold assembly), Call SNPs, indels, SVs]

SNAP
  Reference genome: CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCA...GTTTAGCTCAAAGAG...
  Hash index of seed → {locations}
    Example seeds: AGCTCAAA, GAAAGAA
  1. Lookup seeds
  2. Map locations
  3. Score matches
  Read sequence: CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCAG
  ~15 core-hours for 30x coverage
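The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not SNAP's actual implementation: the function names (`build_index`, `align`), the non-overlapping seed sampling, and the use of Hamming distance for scoring are all assumptions made for brevity (a production aligner would also handle indels, reverse complements, and paired reads).

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Hash index: every seed_len-mer in the reference -> list of positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align(read, reference, index, seed_len, max_mismatches=3):
    """Return (start, mismatches) of the best candidate, or None."""
    candidates = set()
    # 1. Lookup seeds: take non-overlapping seeds from the read (assumption).
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        # 2. Map locations: each hit implies a start position for the read.
        for pos in index.get(seed, ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # 3. Score matches: keep the candidate with the fewest mismatches.
    best = None
    for start in sorted(candidates):
        dist = hamming(read, reference[start:start + len(read)])
        if dist <= max_mismatches and (best is None or dist < best[1]):
            best = (start, dist)
    return best


if __name__ == "__main__":
    # Reference fragment and read prefix taken from the slide.
    reference = "CCCAGCTCAAAGGCTGCAGCACGCTTTAACCGAAAGAATGCA"
    index = build_index(reference, seed_len=8)
    read = "CCCAGCTCAAAGGCTGCAGCACGCTTTAAC"
    print(align(read, reference, index, seed_len=8))
```

Because candidate locations come only from exact seed hits, the expensive scoring step runs on a handful of positions rather than on the whole genome.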

FLASH
  [Architecture diagram. Labels: SNAP aligner, Genomic prior knowledge, Machine learning models, Sparse Matrices (Alignment, Candidate assembly, Depth, Likelihood, Separation, Coverage, Pair distance, Overlap), Optimize]

Read alignment
  [Diagram labels: SNAP alignment, Sequencing error, Mutation frequency, Variant databases, Candidate Assembly]
  [Two sparse matrices, 1B Reads x 3B bp Genome (x Strands); example entries: 1 1 1 1 and 0.9 0.6 0.7 0.2 0.8]
  RGS = Read-Genome-Strand candidate assembly
  LRG = Likelihood of Read-Genome alignment

Coverage distribution
  [Diagram labels: Assembly, Sequencer characteristics, Alignment data, Assembly RGS]
  [Two arrays over Strands x 3B bp Genome; example coverage entries: 22 24 29 35 34; example likelihood entries: 0.1 0.12 0.14 0.12 0.1]
  GSC = Genome-Strand Coverage
  LC = Likelihood of Coverage

Hash clustering
  Cluster unaligned reads with overlapping bases
  Starting point for assembling contigs
  1. Count seeds
  2. Bucket reads by seed
  3. Connect overlapping reads
  4. Cluster connected components
  [Diagram labels: 2 3 1 1]
  Example reads:
    CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC
    GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA
    CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA
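The four clustering steps can be sketched as follows. This is an illustrative sketch only: the names `seeds_of` and `cluster_reads` are hypothetical, sharing an exact seed is used as a stand-in for "overlapping" (a real implementation would presumably verify the overlap), and union-find is just one convenient way to extract connected components.

```python
from collections import Counter, defaultdict


def seeds_of(read, seed_len):
    """Set of all seed_len-mers contained in a read."""
    return {read[i:i + seed_len] for i in range(len(read) - seed_len + 1)}


def cluster_reads(reads, seed_len):
    """Group read indices into clusters of reads that share seeds."""
    # 1. Count seeds: in how many reads does each seed occur?
    counts = Counter(s for read in reads for s in seeds_of(read, seed_len))

    # 2. Bucket reads by seed, keeping only seeds seen in at least two reads.
    buckets = defaultdict(list)
    for i, read in enumerate(reads):
        for s in seeds_of(read, seed_len):
            if counts[s] >= 2:
                buckets[s].append(i)

    # 3. Connect reads that share a bucket (union-find with path halving).
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in buckets.values():
        for other in members[1:]:
            root_a, root_b = find(members[0]), find(other)
            if root_a != root_b:
                parent[root_b] = root_a

    # 4. Cluster connected components.
    clusters = defaultdict(list)
    for i in range(len(reads)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    # The three example reads from the slide.
    reads = [
        "CGCAGCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAAC",
        "GCTCAAAGGCTGCAGCACGCTTTGAAAGAATGCAGTTTAACCACGAGAACTGGA",
        "CCGATCGTTTGAATTAGATGTATTAGAGGTTAGTACCCTAGCCTAGTCGTAAGA",
    ]
    print(cluster_reads(reads, seed_len=16))
```

With the slide's reads, the first two share most of their bases and land in one cluster, while the third shares no 16-base seed with them and stays on its own.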

Targeted de novo assembly
  [Architecture diagram. Labels: Contig "genome", Genomic prior knowledge, Machine learning models, calc, infer, Update, Candidate assembly, Alignment, Depth, Likelihood, Coverage, Separation, Pair distance, Overlap, hash clusters, Optimize]

Scaffold assembly
  Maximum likelihood model
  Optimized reference contigs + de novo unaligned contigs
  Explore space of possible arrangements into a sample genome
  Optimize P(observed reads | candidate genome) = sequencing error + coverage depth + pair distance
  Incremental calculation using sparse matrix model
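One way to read the sum on this slide is as a log-likelihood that splits into three terms. The factorization below is an assumption (it treats the three factors as independent, which the slide does not state), and the symbols R (observed reads), G (candidate genome), and the three factor names are introduced here only for illustration:

```latex
\log P(R \mid G)
  = \log P_{\text{err}}(R \mid G)
  + \log P_{\text{cov}}(R \mid G)
  + \log P_{\text{pair}}(R \mid G)
```

Under this reading, a local change to the candidate genome would only alter the terms for reads touching that region, which would be consistent with the incremental calculation mentioned above.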

Next steps? …
  SNAP
    Apply to more datasets / platforms / organisms
    Validate accuracy / coverage
  FLASH
    Use Kaviar for population priors
    Different approaches to assembly / structural variation
  Biology
    What interesting research could this enable – scale, speed, accuracy, analysis?
