Anchoring Distinct Reads on Human Genome for Rapid Analysis

EDBT 2010March 22-26 2010 Tien Huynh1, Michalis Vlachos2, Isidore Rigoutsos3 Anchoring Millions of Distinct Reads on the Human Genome Within Seconds 1IBM TJ Watson Research, NY, USA 2IBM Zurich Research Laboratory, CH 3 Thomas Jefferson University, USA

Introduction - Past • Bioinformatics Sequencing Evolution Human Genome 3.3B nucleotides 825MB raw data 20MB compressed 25,000 genes

Introduction – Past, Present, Future In 2007, it took 2 months to sequence the genome of DNA-co-discoverer James Watson. By 2013 it is likely that your personal genome could be read in the time it takes to boil an egg. It took 13 years for teams of scientists around the globe to first read the human genome – completing the project in 2001. “Science 2009”

C G A ? G A T A … millions C G T T A C G A C G T T A C G 20-75 nts What we want to do Question from industry: Within 24 hours, can we find FAST, ALL the positions of (newly sequenced) DNA fragments on a reference genome New generation Sequencer Reference genome

Applications • Cancer Research: help isolate cancer-initiating mutations • Better design of siRNA’s (short interfering RNA’s) • RNA sequences of ~21 nucleotides -> RNA interference • Therapeutic gene silencing • Cancer therapy by tumor-related gene targeting • Junk DNA analysis • Re-sequencing • etc

How can we achieve that? • Solve as hash join between the reference and the query database • Reference Genome/Database (eg human genome) • Static, one time • Query Database (produced fragments) • Dynamicrecreated on every experiment • How fast can we achieve this on a commodity desktop? • We call the techniqueQPick (=Quick Pick)

short-read generator TTT?AGGGGGATAGAGAATTAAAA?TTAG?AATT?CC output(with wildcards) head tail acggtttaTTTAGGGGGCCAAAAAAATTT head & tail separation cggtttatTTAGGGGGCCAAAAAAATTTA key extraction &bit packetization query set expansion(wildcards + reverse strand) hashtablekey bit packetization bit packetization / hashtable construction 000..011 001..001 aaaaaaaa 001..011 011..001 aaaaaaaa acaaaaaa agggaaaa … … hashtable join cggtttat 011..011 010..111 011..111 cggtttat 011..011 010..111 ctttttat ggtttgat ggtataaa ggtttttt … … tttttttt 101..000 tttttttt 101..000 target database query database genomic sequence (the database) ACGG…TTT k-mer extraction CGG…TTTA GG…TTTAG

QPick - Advantages • Disk Based. Small Memory Requirements • Other techniques run out of memory on large datasets… • 3GB RAM is sufficient to • Index • Search Human Genome • Query Millions of Short Fragments

QPick - Advantages • Highly Parallelizable • Single CPU/Core are sufficient.

C G A C G A T A T T A A C C G G A C G T T A C G T T QPick - Advantages • Flexibility • Rigid and wildcard matches wildcard wildcard query C G A ? G A T A ? C G A C G A T A G C G A C G A T A G A C G T T A C G A A reference genome ……..

QPick - Advantages • Completeness of results • Various competitors failed to return all matches C G A ? G A T A … QPick does not miss any matches C G T T A C G A C G T T A C G

Technical Highlights • Data Compression • Exploit small DNA alphabet • Fast bitwise operations • Take advantage of 64bit word comparisons • Simple implementation (hash based) • Data Pruning A C G T …0011100011… 64bits

C G A C G A T A C G A C G A T A C G C C G G A A C C G G A A T T A A T T T T T T Data Representation reference genome C G A C G A T A G C G A C G A T reference genome window Lmax

A A A A A A C C C C C C G G G G G G T T T T T T Codebook 10 10 10 01 11 01 11 01 11 00 00 00 01 00 00 A 01 C 10 G 11 T tail Data Representation - Head reference genome head tail A C window Binary: 0001101100011011000110110001 Decimal: 28,422,577 position28,422,577

Data Representation reference genome Advantages • Head = key for hashtable • Reduction in dataset size • We don’t have to explicitly store the head

. . . . A A C C G G T T 0100 1000 etc … 0000 1 position of thewindow(how many shiftswe have made since the beginning ofthe string…) [p & q = q] pattern p query q Data Representation - Tail reference genome tail head G T A C padded toproper length tail: 16 nts  16x4bits = one 64bit word Codebook 0001 A 0010 C 0100 G 1000 T 0000 .

30 symbols atagactaaaaaaa AAAAAAAATT...... 1 tagactaaaaaaaa AAAAAAATT ....... 2 … aaaaaaaaaaaaaa A T T . . . . . . . . . . ... 8 aaaaaaaaaaaaaa T T . . . . . . . . . . . . .. 9 aaaaaaaaaaaaat T . . . . . . . . . . . . . . . 10 Example reference genome ATAGACTAAAAAAAAAAAAAAATT …… 24 symbols padding

Up to now we have seen how to encode the reference genome Now we show how to encode the query sequences

A A C C G G T T Encoding the query DB Now, instead of one long sequence, many shorter ones A C . G T A C . G millions A C . G T . A C G T G

A A C C G G T T Encoding the query DB Now, instead of one long sequence, many shorter ones A C . G T A C . G millions A C . G T . A C G T G We do similar extractions in heads and tails head tail 14 symbols: key 16 symbols = 64bits

A C G T Encoding the query DB Differences between target db and query db: • Query sequences may contain wildcards(in the db only used for padding) A C . G T .

A A C C G G T T Encoding the query DB Differences between target db and query db: • Query sequences may contain wildcards(in the db only used for padding) • Query sequences may have variable length (compared to the fixed size n-grams extracted from the target db) A C . G T . A C . G A C . G T . A C G T G

A A C G T C A C G T G A C G T T A C G T Encoding the query DB - wildcards Treated differently depending where they appear • Head. Expanded to the possible symbols A C . G T

Encoding the query DB - wildcards Treated differently depending where they appear • Head. Expanded to the possible symbols • Tail. Encoded as binary wildcard 0000 A A C G T C A C G T A C . G T G A C G T T A C G T

C C G G A A C C G G A A T T A A T T Handling Forward & Reverse DNA strands reference genome forward strand C G A C G A T A G C G A C G A T backward strand If we explicitly encode the reverse strand we would be indexing twice as much data

C C G G A A C C G G A A T T A A T T T T G G C C T A A A G G C C Handling Forward & Reverse DNA strands reference genome forward strand C G A C G A T A G C G A C G A T backward strand Form complementary sequences

C C G G A A C C G G A A T T A A T T T T G G C C T A G G C C T A A A G G C C Handling Forward & Reverse DNA strands reference genome forward strand C G A C G A T A G C G A C G A T backward strand Form complementary sequences

Overall search process ….

YES! 98.5% of buckets contain less than 10 entries Was our hash key fuction a correct one? …a bad key function would have led to many collisions…

Experiments • Comparison with other Short Sequence Anchoring Tools • QPick Performance • Number of Wildcards • Index Creation Time • Query Time • Datasets Mus Musculus Homo sapiens ftp://ftp.ensembl.org/pub/release-42/homo_sapiens_42_36d/data/fasta/dna/ ftp://ftp.ensembl.org/pub/release-49/fasta/mus_musculus/dna/

Up to 60x faster … COMPARISON Short Sequence Anchoring Tools

Comparison with Hash-Based techniques QPick – 16x faster than FetchGWI

Varying the wildcards • 5-6 seconds retrieval time for 0-1 wildcards • <60 sec for up to 4 wildcards

Full Human-Genome Search • Time to index • Reference Genome • Query sequences • Time to Search • 1M- 10M queries (short DNA fragments)

Full Genome Search – Index Time ~ 7 hours (on single core CPU) < 50 sec for query hashtables

Full Genome Search – Search Time • ~130sec to find 1M matches (single core)

Conclusions • QPick: Fast and complete search for short sequence fragments • Takes Advantage of: • Small DNA alphabet • Bit packetization • Hash Joins • Up to 60x faster the competitive techniques • Applications for …. • Future… A C G T …0011100011…

Anchoring Distinct Reads on Human Genome for Rapid Analysis

Anchoring Distinct Reads on Human Genome for Rapid Analysis

Presentation Transcript

2010-2011 Competitive NGO - Technical Assistance March 22, 2010

March, 26, 2010

March, 2010

.. : , , 22 2010 .

March 2010

March 2010

Independent/Special Schools March 26, 2010

Geneva, 26 March 2010

Week of March 26, 2010

CSUN 2010 March 26, 2010

March 2010

University of Essex 26 March 2010

March 2010

TES INDABA CONFERENCE 26 MARCH 2010

Monday, March 22, 2010

March 2010

OKAIRP March 26, 2010

March 2010

Clickers 26 th March 2010

EDBT 2010 March 22-26 2010

LBS#12 , 22 March 2010

March 2010