400 likes | 518 Views
EDBT 2010 March 22-26 2010. Tien Huynh 1 , Michalis Vlachos 2 , Isidore Rigoutsos 3. Anchoring Millions of Distinct Reads on the Human Genome Within Seconds. 1 IBM TJ Watson Research, NY, USA 2 IBM Zurich Research Laboratory, CH 3 Thomas Jefferson University, USA. Introduction - Past.
E N D
EDBT 2010March 22-26 2010 Tien Huynh1, Michalis Vlachos2, Isidore Rigoutsos3 Anchoring Millions of Distinct Reads on the Human Genome Within Seconds 1IBM TJ Watson Research, NY, USA 2IBM Zurich Research Laboratory, CH 3 Thomas Jefferson University, USA
Introduction - Past • Bioinformatics Sequencing Evolution Human Genome 3.3B nucleotides 825MB raw data 20MB compressed 25,000 genes
Introduction – Past, Present, Future In 2007, it took 2 months to sequence the genome of DNA-co-discoverer James Watson. By 2013 it is likely that your personal genome could be read in the time it takes to boil an egg. It took 13 years for teams of scientists around the globe to first read the human genome – completing the project in 2001. “Science 2009”
C G A ? G A T A … millions C G T T A C G A C G T T A C G 20-75 nts What we want to do Question from industry: Within 24 hours, can we find FAST, ALL the positions of (newly sequenced) DNA fragments on a reference genome New generation Sequencer Reference genome
Applications • Cancer Research: help isolate cancer-initiating mutations • Better design of siRNA’s (short interfering RNA’s) • RNA sequences of ~21 nucleotides -> RNA interference • Therapeutic gene silencing • Cancer therapy by tumor-related gene targeting • Junk DNA analysis • Re-sequencing • etc
How can we achieve that? • Solve as hash join between the reference and the query database • Reference Genome/Database (eg human genome) • Static, one time • Query Database (produced fragments) • Dynamicrecreated on every experiment • How fast can we achieve this on a commodity desktop? • We call the techniqueQPick (=Quick Pick)
short-read generator TTT?AGGGGGATAGAGAATTAAAA?TTAG?AATT?CC output(with wildcards) head tail acggtttaTTTAGGGGGCCAAAAAAATTT head & tail separation cggtttatTTAGGGGGCCAAAAAAATTTA key extraction &bit packetization query set expansion(wildcards + reverse strand) hashtablekey bit packetization bit packetization / hashtable construction 000..011 001..001 aaaaaaaa 001..011 011..001 aaaaaaaa acaaaaaa agggaaaa … … hashtable join cggtttat 011..011 010..111 011..111 cggtttat 011..011 010..111 ctttttat ggtttgat ggtataaa ggtttttt … … tttttttt 101..000 tttttttt 101..000 target database query database genomic sequence (the database) ACGG…TTT k-mer extraction CGG…TTTA GG…TTTAG
QPick - Advantages • Disk Based. Small Memory Requirements • Other techniques run out of memory on large datasets… • 3GB RAM is sufficient to • Index • Search Human Genome • Query Millions of Short Fragments
QPick - Advantages • Highly Parallelizable • Single CPU/Core are sufficient.
C G A C G A T A T T A A C C G G A C G T T A C G T T QPick - Advantages • Flexibility • Rigid and wildcard matches wildcard wildcard query C G A ? G A T A ? C G A C G A T A G C G A C G A T A G A C G T T A C G A A reference genome ……..
QPick - Advantages • Completeness of results • Various competitors failed to return all matches C G A ? G A T A … QPick does not miss any matches C G T T A C G A C G T T A C G
Technical Highlights • Data Compression • Exploit small DNA alphabet • Fast bitwise operations • Take advantage of 64bit word comparisons • Simple implementation (hash based) • Data Pruning A C G T …0011100011… 64bits
C G A C G A T A C G A C G A T A C G C C G G A A C C G G A A T T A A T T T T T T Data Representation reference genome C G A C G A T A G C G A C G A T reference genome window Lmax
A A A A A A C C C C C C G G G G G G T T T T T T Codebook 10 10 10 01 11 01 11 01 11 00 00 00 01 00 00 A 01 C 10 G 11 T tail Data Representation - Head reference genome head tail A C window Binary: 0001101100011011000110110001 Decimal: 28,422,577 position28,422,577
Data Representation reference genome Advantages • Head = key for hashtable • Reduction in dataset size • We don’t have to explicitly store the head
. . . . A A C C G G T T 0100 1000 etc … 0000 1 position of thewindow(how many shiftswe have made since the beginning ofthe string…) [p & q = q] pattern p query q Data Representation - Tail reference genome tail head G T A C padded toproper length tail: 16 nts 16x4bits = one 64bit word Codebook 0001 A 0010 C 0100 G 1000 T 0000 .
30 symbols atagactaaaaaaa AAAAAAAATT...... 1 tagactaaaaaaaa AAAAAAATT ....... 2 … aaaaaaaaaaaaaa A T T . . . . . . . . . . ... 8 aaaaaaaaaaaaaa T T . . . . . . . . . . . . .. 9 aaaaaaaaaaaaat T . . . . . . . . . . . . . . . 10 Example reference genome ATAGACTAAAAAAAAAAAAAAATT …… 24 symbols padding
Up to now we have seen how to encode the reference genome Now we show how to encode the query sequences
A A C C G G T T Encoding the query DB Now, instead of one long sequence, many shorter ones A C . G T A C . G millions A C . G T . A C G T G
A A C C G G T T Encoding the query DB Now, instead of one long sequence, many shorter ones A C . G T A C . G millions A C . G T . A C G T G We do similar extractions in heads and tails head tail 14 symbols: key 16 symbols = 64bits
A C G T Encoding the query DB Differences between target db and query db: • Query sequences may contain wildcards(in the db only used for padding) A C . G T .
A A C C G G T T Encoding the query DB Differences between target db and query db: • Query sequences may contain wildcards(in the db only used for padding) • Query sequences may have variable length (compared to the fixed size n-grams extracted from the target db) A C . G T . A C . G A C . G T . A C G T G
A A C G T C A C G T G A C G T T A C G T Encoding the query DB - wildcards Treated differently depending where they appear • Head. Expanded to the possible symbols A C . G T
Encoding the query DB - wildcards Treated differently depending where they appear • Head. Expanded to the possible symbols • Tail. Encoded as binary wildcard 0000 A A C G T C A C G T A C . G T G A C G T T A C G T
C C G G A A C C G G A A T T A A T T Handling Forward & Reverse DNA strands reference genome forward strand C G A C G A T A G C G A C G A T backward strand If we explicitly encode the reverse strand we would be indexing twice as much data
C C G G A A C C G G A A T T A A T T T T G G C C T A A A G G C C Handling Forward & Reverse DNA strands reference genome forward strand C G A C G A T A G C G A C G A T backward strand Form complementary sequences
C C G G A A C C G G A A T T A A T T T T G G C C T A A A G G C C Handling Forward & Reverse DNA strands reference genome forward strand C G A C G A T A G C G A C G A T backward strand Form complementary sequences
C C G G A A C C G G A A T T A A T T T T G G C C T A G G C C T A A A G G C C Handling Forward & Reverse DNA strands reference genome forward strand C G A C G A T A G C G A C G A T backward strand Form complementary sequences
YES! 98.5% of buckets contain less than 10 entries Was our hash key fuction a correct one? …a bad key function would have led to many collisions…
Experiments • Comparison with other Short Sequence Anchoring Tools • QPick Performance • Number of Wildcards • Index Creation Time • Query Time • Datasets Mus Musculus Homo sapiens ftp://ftp.ensembl.org/pub/release-42/homo_sapiens_42_36d/data/fasta/dna/ ftp://ftp.ensembl.org/pub/release-49/fasta/mus_musculus/dna/
Up to 60x faster … COMPARISON Short Sequence Anchoring Tools
Comparison with Hash-Based techniques QPick – 16x faster than FetchGWI
Varying the wildcards • 5-6 seconds retrieval time for 0-1 wildcards • <60 sec for up to 4 wildcards
Full Human-Genome Search • Time to index • Reference Genome • Query sequences • Time to Search • 1M- 10M queries (short DNA fragments)
Full Genome Search – Index Time ~ 7 hours (on single core CPU) < 50 sec for query hashtables
Full Genome Search – Search Time • ~130sec to find 1M matches (single core)
Conclusions • QPick: Fast and complete search for short sequence fragments • Takes Advantage of: • Small DNA alphabet • Bit packetization • Hash Joins • Up to 60x faster the competitive techniques • Applications for …. • Future… A C G T …0011100011…