Using Network Processors in Genomics
Herbert Bos*† and Kaiming Huang*
{herbertb,khuang}@liacs.nl
* Leiden Universiteit, Netherlands
† Vrije Universiteit, Netherlands
http://www.liacs.nl/~herbertb/projects/biocomp/
H. Bos – Leiden University, 13/02/2004

Case study: BLAST
• search nucleotide/protein database for query
• BLAST discovers similarity rather than exact match
• two main phases:
  • scoring (registering where query and DNADB match)
  • alignment (dynamic programming)
• only the first phase on NPUs

Window matching
• naïve approach: roughly W*N*M comparisons
  • does not scale
• string search algorithms: Aho-Corasick
  • all windows matched at the same time
  • shifting genome one nucleotide at a time
  • matching algorithm transformed into a DFA
  • DFA may be quite large

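To make the cost of the naïve approach concrete, here is a minimal sketch that slides every query window over every genome offset and counts single-character comparisons. The slide does not define W, N and M; the sketch assumes W is the window size, N the genome length and M the query length. The function name `naive_window_match` is hypothetical and used only for illustration.

```python
def naive_window_match(query, genome, w):
    """Compare every length-w window of the query against every genome offset.

    Returns (hits, comparisons). hits is a list of
    (genome_offset, query_offset) pairs; comparisons counts single
    character comparisons, at most about w * len(genome) * len(query).
    """
    hits = []
    comparisons = 0
    for q in range(len(query) - w + 1):          # each query window
        for g in range(len(genome) - w + 1):     # each genome offset
            matched = True
            for k in range(w):                   # up to w character comparisons
                comparisons += 1
                if query[q + k] != genome[g + k]:
                    matched = False
                    break
            if matched:
                hits.append((g, q))
    return hits, comparisons


hits, comparisons = naive_window_match("acgccga", "tacgcga", 3)
print(sorted(hits))      # (genome offset, query offset) pairs
print(comparisons)       # grows with w * N * M in the worst case
```

The triple loop is what makes the approach impractical for a genome-sized input: every additional query window costs another full pass over the genome.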
Aho-Corasick
• Alphabet: acgt
• Window size: 3
• Query: acgccga
• Windows: {acg,cgc,gcc,ccg,cga}

[Figure: Aho-Corasick automaton for the windows {acg,cgc,gcc,ccg,cga}, states 0 to 12. Paths from the start state 0: 0 -a-> 1 -c-> 2 -g-> 3 (acg); 0 -c-> 4 -g-> 5 -c-> 6 (cgc); 5 -a-> 12 (cga); 4 -c-> 10 -g-> 11 (ccg); 0 -g-> 7 -c-> 8 -c-> 9 (gcc). Example input: tacgcga]

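The Aho-Corasick example can be reproduced with a short sketch. It builds the trie for the five windows, fills in failure transitions to obtain a full DFA table, and scans the example input tacgcga one nucleotide at a time. This is a textbook construction written for illustration, not the IXPBlast code; the function names `windows`, `build_automaton` and `scan` are hypothetical, and the input is assumed to contain only the characters a, c, g and t.

```python
from collections import deque


def windows(query, w):
    """All overlapping length-w windows of the query, in order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def build_automaton(patterns, alphabet="acgt"):
    """Build a DFA transition table from an Aho-Corasick trie."""
    goto = [{}]        # goto[state][char] -> next state
    out = [set()]      # patterns recognised on entering each state
    for p in patterns:                       # phase 1: plain trie
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)

    fail = [0] * len(goto)                   # phase 2: failure links, BFS order
    queue = deque()
    for ch in alphabet:
        if ch in goto[0]:
            queue.append(goto[0][ch])        # depth-1 states fall back to root
        else:
            goto[0][ch] = 0                  # root loops on unmatched characters
    while queue:
        s = queue.popleft()
        out[s] |= out[fail[s]]
        for ch in alphabet:
            if ch in goto[s]:                # real trie edge
                t = goto[s][ch]
                fail[t] = goto[fail[s]][ch]
                queue.append(t)
            else:                            # missing edge: copy from fallback
                goto[s][ch] = goto[fail[s]][ch]
    return goto, out


def scan(text, goto, out):
    """Feed the text through the DFA; return (start_offset, window) hits."""
    state, hits = 0, []
    for i, ch in enumerate(text):
        state = goto[state][ch]              # one table lookup per nucleotide
        for p in out[state]:
            hits.append((i - len(p) + 1, p))
    return hits


goto, out = build_automaton(windows("acgccga", 3))
print(len(goto))                   # number of DFA states
print(scan("tacgcga", goto, out))  # windows found in the example input
```

With the windows inserted in query order, the trie states are numbered 0 to 12 as in the figure. The scan performs exactly one table lookup per input character regardless of the number of windows, which is why all windows are matched at the same time; the price is the size of the table, which grows with the number of states times the alphabet size.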
IXPBlast Architecture
[Figure: block diagram. Labels: Gbps ports; NPU (IXP1200) with StrongARM control processor and six microengines (ME); scratch memory, SRAM and DRAM; PCI bus to a Pentium. Later builds of the slide show the Aho-Corasick DFA (states 0 to 12) stored in SRAM.]

IXPBlast: packet handling
[Figure: packet diagram, positions 0 to 31]
• packets read and processed in batches of 100,000
• “spilling” must be taken into account
• currently no feedback

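The slide does not say what “spilling” covers. One plausible reading, assumed here, is that a window can straddle the boundary between two packets or batches, so matching cannot restart from scratch at each boundary. The sketch below illustrates that idea in isolation: it carries the last W-1 characters of each chunk into the next one. The function name `scan_in_batches` is hypothetical, and a plain set lookup stands in for the DFA to keep the example self-contained.

```python
def scan_in_batches(batches, windows, w):
    """Scan an iterable of text chunks for any of the length-w windows.

    The last w-1 characters of each chunk are carried into the next one,
    so windows that straddle a chunk boundary are still found.
    Yields (absolute_offset, window) pairs.
    """
    wanted = set(windows)
    carry = ""
    consumed = 0                       # absolute offset of the start of `data`
    for chunk in batches:
        data = carry + chunk
        for i in range(len(data) - w + 1):
            piece = data[i:i + w]
            if piece in wanted:
                yield consumed + i, piece
        keep = min(w - 1, len(data))   # tail that may begin a straddling window
        consumed += len(data) - keep
        carry = data[len(data) - keep:]


# The example input tacgcga, split into three small chunks.
chunks = ["tac", "gcg", "a"]
found = list(scan_in_batches(chunks, ["acg", "cgc", "gcc", "ccg", "cga"], 3))
print(found)
```

With a DFA-based matcher the same effect can be had by keeping the current automaton state across chunk boundaries instead of re-reading the tail.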
Results
• 232 MHz IXP1200 ~ 1.8 GHz Pentium-4
• 1611-nucleotide query (MyD88)
• 1.4 GB genome (zebrafish)
• IXP1200: 90 sec with DFA
• IXP1200: 129 sec with “trie”
• P4: 132 sec with “trie”
• number of matches: 524856

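The timings above can be turned into rough scan rates with a few lines of arithmetic. The sketch assumes 1 GB = 10^9 bytes and that the whole 1.4 GB genome is scanned in each run; neither is stated on the slide, so the figures are indicative only.

```python
GENOME_BYTES = 1.4e9   # assumption: 1 GB = 10**9 bytes

# Wall-clock times in seconds, taken from the results slide.
runs = {
    "IXP1200, DFA": 90,
    "IXP1200, trie": 129,
    "P4, trie": 132,
}


def throughput_mb_per_s(seconds, size_bytes=GENOME_BYTES):
    """Average scan rate in MB/s (1 MB = 10**6 bytes)."""
    return size_bytes / seconds / 1e6


for name, seconds in runs.items():
    print(f"{name}: {throughput_mb_per_s(seconds):.1f} MB/s")

# Ratio of the slowest run (P4, trie) to the fastest (IXP1200, DFA).
speedup = runs["P4, trie"] / runs["IXP1200, DFA"]
print(f"IXP1200 DFA vs. P4 trie: {speedup:.2f}x")
```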
Conclusions
• NPUs are useful in other application domains
• Newer hardware is expected to perform much better
• “Throughput processors”
• Adapting our current approach to use BLAST tricks/heuristics

Network processors
• geared for high throughput
• used exclusively in network systems
• example: intrusion detection
  • similar to looking for genes in genomes
  • differences
[Figure: Radisys IXP1200 board]

Application domain: “Genomics”
• example: search genome for occurrence of “patterns”
• similar problems as IDS:
  • poor performance on GPPs
  • GPPs cannot exploit the parallelism
• throughput-driven
• how about FPGAs?
• how about clusters?
• NPU
  • easier to program than FPGAs
  • cheaper than cluster computing
  • “on the desktop”: IP never leaves the room