Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science. Jimmy Lin, Michael Schatz, and Ben Langmead, University of Maryland. Wednesday, June 10, 2009.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Cloud Computing @ Maryland
• Teaching
  • Cloud computing course (version 1.0): Spring 2008. Part of the Google/IBM Academic Cloud Computing Initiative
  • Cloud computing course (version 2.0): Fall 2008. Sponsored by Amazon Web Services through a teaching grant
• Research
  • Web-scale text processing
  • Statistical machine translation
  • Bioinformatics
Learning Translation Models
Parallel sentences (Spanish / English):
• Prodi ha erigido hoy un verdadero muro contra esas acciones, espero que el Sr. Moscovici lo haya comprendido bien, y realmente también espero que esta tendencia se rompa en los Consejos de Biarritz y de Niza, y se rectifique.
  Mr Prodi has put an emphatic stop to this kind of action, which has hopefully resonated with Mr Moscovici, and I truly hope that this trend can be broken and reversed at the Councils in Nice and Biarritz.
• Esas negociaciones sabemos que son muy difíciles y hacen temer un fracaso o un acuerdo de mínimos en Niza, lo que sería aún más grave y usted ya lo ha dicho, señor Ministro.
  These are, as we know, very tricky negotiations and raise fears of a setback or a watered-down agreement in Nice which, as you have already acknowledged, Mr Moscovici, would be even more serious.
Learned word and phrase pairs:
• Maria → Mary
• bofetada → slap
• una → a
• bruja → witch
• verde → green
• bruja verde → green witch
• no dio → did not
We built systems for “learning” translation models in Hadoop… sort of like the word count example, but with more math
Translation as a “Tiling” Problem
Source sentence: Maria no dio una bofetada a la bruja verde
[Figure: candidate English phrases tiled beneath the source sentence: Mary; not; give; a; slap; to; the; witch; green; did not; by; a slap; green witch; to the; no; did not give; the witch]
Example from Koehn (2006)
From Text to DNA Sequences
• Text processing: [0-9A-Za-z]+
• DNA sequence processing: [ATCG]+
Easier, right? (Nope, not really)
Michael Schatz (Ph.D. student, Computer Science; Spring 2008)
Ben Langmead (M.S. student, Computer Science; Fall 2008)
Strangely-Formatted Manuscript • Dickens: A Tale of Two Cities • Text written on a long spool It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
… With Duplicates • Dickens: A Tale of Two Cities • “Backup” on four more copies It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
Shredded Book Reconstruction
• Dickens accidentally shreds the manuscript
• How can he reconstruct the text?
  • 5 copies x 138,656 words / 5 words per fragment = 138k fragments
  • The short fragments from every copy are mixed together
  • Some fragments are identical
[Figure: each of the five copies is cut into five-word fragments at a different offset, e.g. "It was the best of" / "times, it was the worst" / "of times, it was the" / "age of wisdom, it was" / "the age of foolishness, …"]
Overlaps
• Generally prefer longer overlaps to shorter overlaps
• In the presence of error, we might allow the overlapping fragments to differ by a small amount
Overlaps with the fragment "It was the best of":
• "was the best of times," (4 word overlap)
• "of times, it was the" (1 word overlap)
• "of wisdom, it was the" (1 word overlap)
[Figure: pool of remaining five-word fragments, e.g. "age of wisdom, it was", "best of times, it was", "it was the age of", "it was the worst of", "the age of wisdom, it", "times, it was the age", "times, it was the worst", "was the age of foolishness,", "worst of times, it was"]
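The overlap counts on this slide can be reproduced with a plain suffix-prefix comparison. This is a minimal sketch, assuming fragments are compared word by word and must match exactly; the error-tolerant variant mentioned on the slide is not covered, and the function name `overlap` is illustrative only.

```python
def overlap(left, right):
    """Return the number of words in the longest suffix of `left`
    that is also a prefix of `right` (exact word matches only)."""
    a, b = left.split(), right.split()
    # Try the longest possible overlap first, so longer overlaps win.
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


print(overlap("It was the best of", "was the best of times,"))  # 4
print(overlap("It was the best of", "of times, it was the"))    # 1
print(overlap("It was the best of", "of wisdom, it was the"))   # 1
```

Checking the longest candidate first mirrors the slide's preference for longer overlaps over shorter ones.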
Greedy Assembly
The repeated sequence makes the correct reconstruction ambiguous
[Figure: fragments chained greedily by best overlap: "It was the best of", "was the best of times,", "the best of times, it", "best of times, it was", "of times, it was the", followed by either "times, it was the worst" or "times, it was the age"; the remaining fragments are still unplaced]
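The ambiguity can be seen by asking which fragments overlap best with a repeated fragment. The sketch below is not a full greedy assembler; it only shows the tie that a greedy step would have to break, assuming exact word-level overlaps. The helper names `overlap` and `best_successors` are hypothetical.

```python
def overlap(left, right):
    """Longest suffix of `left` (in words) that is a prefix of `right`."""
    a, b = left.split(), right.split()
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def best_successors(fragment, pool):
    """Return (best overlap, fragments achieving it) for `fragment`.
    More than one fragment in the result means the greedy choice is a tie."""
    scores = {other: overlap(fragment, other) for other in pool if other != fragment}
    best = max(scores.values(), default=0)
    if best == 0:
        return 0, []
    return best, sorted(f for f, s in scores.items() if s == best)


pool = [
    "It was the best of",
    "of times, it was the",
    "times, it was the worst",
    "times, it was the age",
]
# Both continuations overlap by 4 words, so greedy extension cannot tell them apart.
print(best_successors("of times, it was the", pool))
```

A greedy assembler would pick one of the tied fragments arbitrarily, which is how a repeat can lead it to a wrong reconstruction.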
[Figure: Subject genome → Sequencer → Reads. Example reads: GATGCTTACTATGCGGGCCCC, CGGTCTAATGCTTACTATGC, GCTTACTATGCGGGCCCCTT, TAATGCTTACTATGC, AATGCTTAGCTATGCGGGC, CGGTCTAGATGCTTACTATGC, CGGTCTAATGCTTAGCTATGC, ATGCTTACTATGCGGGCCCCTT, …]
DNA Sequencing
• Genome of an organism encodes genetic information in a long sequence of 4 DNA nucleotides: ATCG
  • Bacteria: ~5 million bp
  • Humans: ~3 billion bp
• Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300bp)
  • Shorter reads, but much higher throughput
  • Per-base error rate estimated at 1-2% (Simpson, et al., 2009)
• Recent studies of entire human genomes have used 3.3 (Wang, et al., 2008) & 4.0 (Bentley, et al., 2008) billion 36bp reads
  • ~144 GB of compressed sequence data
ATCTGATAAGTCCCAGGACTTCAGT
GCAAGGCAAACCCGAGCCCAGTTT
TCCAGTTCTAGAGTTTCACATGATC
GGAGTTAGTAAAAGTCCACATTGAG
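The read counts quoted on this slide translate into raw data volume with simple arithmetic. The sketch below is a back-of-the-envelope calculation, assuming the ~3 billion bp human genome size and the 1-2 Gbp per day throughput given above. "Coverage" (total bases divided by genome length) is a derived quantity introduced here for illustration, and all names are hypothetical.

```python
HUMAN_GENOME_BP = 3_000_000_000  # ~3 billion bp, as stated on the slide
READ_LENGTH_BP = 36              # read length used in both cited studies


def total_bases(n_reads, read_len=READ_LENGTH_BP):
    """Total number of sequenced bases across all reads."""
    return n_reads * read_len


def coverage(n_reads, read_len=READ_LENGTH_BP, genome_bp=HUMAN_GENOME_BP):
    """Average number of reads covering each genome position."""
    return total_bases(n_reads, read_len) / genome_bp


def machine_days(n_reads, gbp_per_day, read_len=READ_LENGTH_BP):
    """Days one machine needs at the given throughput (in Gbp per day)."""
    return total_bases(n_reads, read_len) / (gbp_per_day * 1_000_000_000)


for n_reads in (3_300_000_000, 4_000_000_000):
    print(n_reads, total_bases(n_reads), coverage(n_reads))
```

Under these assumptions, 4.0 billion 36bp reads amount to 144 billion bases, which would take a single machine roughly 72 to 144 days at 2 and 1 Gbp per day respectively.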
Human Genome A complete human DNA sequence was published in 2003, marking the end of the Human Genome Project • 11 years, cost $3 billion… your tax dollars at work!
Alignment
Subject reads: CTATGCGGGC, CTAGATGCTT, ATCTATGCGG, TCTAGATGCT, GCTTATCTAT, ATCTATGCGG, ATCTATGCGG, ATCTATGCGG, TTATCTATGC, CTATGCGGGC, GCTTATCTAT
Reference sequence: CGGTCTAGATGCTTAGCTATGCGGGCCCCTT
Subject reads: ATGCGGGCCC, CTAGATGCTT, CTATGCGGGC, TCTAGATGCT, ATCTATGCGG, CGGTCTAG, ATCTATGCGG, CTT, CGGTCT, TTATCTATGC, CCTT, CGGTC, GCTTATCTAT, GCCCCTT, GCTTATCTAT, CGG, GGCCCCTT
Reference sequence: CGGTCTAGATGCTTATCTATGCGGGCCCCTT
Reference: ATGAACCACGAACACTTTTTTGGCAACGATTTAT…
Query: ATGAACAAAGAACACTTTTTTGGCCACGATTTAT…
[Legend: Insertion, Deletion, Mutation]
CloudBurst
1. Map: Catalog K-mers
  • Emit every k-mer in the genome and non-overlapping k-mers in the reads
  • Non-overlapping k-mers sufficient to guarantee an alignment will be found
2. Shuffle: Coalesce Seeds
  • Hadoop internal shuffle groups together k-mers shared by the reads and the reference
  • Conceptually build a hash table of k-mers and their occurrences
3. Reduce: End-to-end alignment
  • Locally extend alignment beyond seeds by computing “match distance”
  • If read aligns end-to-end, record the alignment
[Figure: Map → Shuffle → Reduce. Inputs: Human chromosome 1, Read 1, Read 2, … Outputs: "Read 1, Chromosome 1, 12345-12365"; "Read 2, Chromosome 1, 12350-12370"]
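The three phases can be imitated in memory to show how the seeds flow through map, shuffle, and reduce. This is a toy sketch, not the CloudBurst implementation: it runs in a single process without Hadoop, counts mismatches only (no insertions or deletions) in place of the "match distance", and assumes all reads have the same length with a seed length of read length // (mismatches + 1), so that at least one seed must match exactly. All function names are hypothetical; the reference and the two reads are taken from the alignment slide.

```python
from collections import defaultdict


def map_phase(reference, reads, k):
    """Emit (k-mer, record) pairs: every k-mer of the reference,
    and only non-overlapping k-mers of each read."""
    for pos in range(len(reference) - k + 1):
        yield reference[pos:pos + k], ("ref", pos)
    for read_id, read in reads.items():
        for off in range(0, len(read) - k + 1, k):
            yield read[off:off + k], ("read", read_id, off)


def shuffle_phase(pairs):
    """Group records by k-mer, as the framework's shuffle would."""
    groups = defaultdict(list)
    for kmer, record in pairs:
        groups[kmer].append(record)
    return groups


def reduce_phase(groups, reference, reads, max_mismatches):
    """Extend each shared seed to an end-to-end alignment and keep it
    if the whole read matches with at most `max_mismatches` mismatches."""
    alignments = set()
    for records in groups.values():
        ref_positions = [r[1] for r in records if r[0] == "ref"]
        read_seeds = [r[1:] for r in records if r[0] == "read"]
        for pos in ref_positions:
            for read_id, off in read_seeds:
                read = reads[read_id]
                start = pos - off  # where the read would begin in the reference
                if start < 0 or start + len(read) > len(reference):
                    continue
                window = reference[start:start + len(read)]
                mismatches = sum(a != b for a, b in zip(read, window))
                if mismatches <= max_mismatches:
                    alignments.add((read_id, start, mismatches))
    return sorted(alignments)


def align(reference, reads, max_mismatches):
    """Run the three phases; returns (read id, start, mismatches) tuples."""
    if not reads:
        return []
    read_len = len(next(iter(reads.values())))
    k = read_len // (max_mismatches + 1)  # assumed seed length
    groups = shuffle_phase(map_phase(reference, reads, k))
    return reduce_phase(groups, reference, reads, max_mismatches)


reference = "CGGTCTAGATGCTTAGCTATGCGGGCCCCTT"
reads = {"r1": "CTAGATGCTT", "r2": "ATCTATGCGG"}
print(align(reference, reads, max_mismatches=1))
# [('r1', 4, 0), ('r2', 14, 1)]
```

Read r2 differs from the reference in one base, so only its second seed (TGCGG) matches exactly; the reduce step then recovers the full alignment from that single seed.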
Results from a small, 24-core cluster, with different number of mismatches Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
CloudBurst running times for mapping 7M reads to human chromosome 22 with at most 4 mismatches on EC2 Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
de Bruijn Graph Construction
• Dk = (V,E)
  • V = All length-k subfragments (k < l)
  • E = Directed edges between consecutive subfragments
    Nodes overlap by k-1 words
• Locally constructed graph reveals the global sequence structure
  • Overlaps implicitly computed
Original Fragment: It was the best of
Directed Edge: It was the best → was the best of
de Bruijn, 1946; Idury and Waterman, 1995; Pevzner, Tang, Waterman, 2001
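The construction can be sketched directly from the definition: slide a window of k words over each fragment and connect consecutive windows. This is a minimal sketch, assuming words are the alphabet (as in the Dickens example) and that edge multiplicities are ignored; the function name `de_bruijn` is illustrative.

```python
from collections import defaultdict


def de_bruijn(fragments, k):
    """Map each length-k subfragment to the set of subfragments that
    follow it; consecutive nodes overlap by k-1 words."""
    edges = defaultdict(set)
    for fragment in fragments:
        words = fragment.split()
        for i in range(len(words) - k):
            src = " ".join(words[i:i + k])
            dst = " ".join(words[i + 1:i + k + 1])
            edges[src].add(dst)
    return dict(edges)


graph = de_bruijn(
    ["It was the best of", "times, it was the worst", "times, it was the age"],
    k=4,
)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

No pairwise overlap computation is needed: fragments that share a length-k subfragment meet at the same node, and the node "times, it was the" ends up with two successors, which is where the repeat shows up as a branch.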
de Bruijn Graph Assembly
[Figure: de Bruijn graph over four-word nodes: "It was the best"; "was the best of"; "the best of times,"; "best of times, it"; "of times, it was"; "times, it was the"; "it was the worst"; "was the worst of"; "the worst of times,"; "worst of times, it"; "it was the age"; "was the age of"; "the age of wisdom,"; "age of wisdom, it"; "of wisdom, it was"; "wisdom, it was the"; "the age of foolishness"]
Compressed de Bruijn Graph
• Unambiguous non-branching paths replaced by single nodes
• An Eulerian traversal of the graph spells a compatible reconstruction of the original text
  • There may be many traversals of the graph
• Different sequences can have the same string graph
  • It was the best of times, it was the worst of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
[Figure: compressed graph with nodes "It was the best of times, it"; "it was the worst of times, it"; "of times, it was the"; "it was the age of"; "the age of wisdom, it was the"; "the age of foolishness"]
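That a graph can admit several Eulerian traversals is easy to check on a small example. The sketch below enumerates every path that uses each directed edge exactly once, by plain backtracking. The toy graph and its node labels are hypothetical stand-ins for a repeat with two detours, not the graph from the slide, and the approach only scales to tiny graphs.

```python
def eulerian_paths(edges, start):
    """Yield every node sequence that starts at `start` and uses each
    directed edge in `edges` exactly once."""
    def extend(node, remaining, path):
        if not remaining:
            yield path
            return
        tried = set()  # skip parallel edges that lead to the same node
        for i, (src, dst) in enumerate(remaining):
            if src == node and dst not in tried:
                tried.add(dst)
                rest = remaining[:i] + remaining[i + 1:]
                yield from extend(dst, rest, path + [dst])

    yield from extend(start, list(edges), [start])


# A repeated node visited three times, with two detours in between.
edges = [
    ("start", "repeat"),
    ("repeat", "worst"), ("worst", "repeat"),
    ("repeat", "age"), ("age", "repeat"),
    ("repeat", "end"),
]
for path in eulerian_paths(edges, "start"):
    print(" -> ".join(path))
```

Both orders of the two detours use every edge once, so the graph alone cannot say which reconstruction is the original.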
Bottom Line: Bioinformatics • Great use case of Hadoop • Interesting computer science problems • Help unravel life’s mysteries?
Questions? Comments? Thanks to the organizations who support our work: