Hierarchy-conscious Data Structures for String Analysis

Hierarchy-conscious Data Structures for String Analysis Carlo Fantozzi PhD Student (XVI ciclo) Bioinformatics Course - June 25, 2002

Outline of the Talk • What is a memory hierarchy? • Computational models • Data structures We will describe data structures to efficiently deal with memory hierarchieswhile analyzing strings

ALU L1 inst L1 data L2 cache L3 cache main memory disk(s), network Memory Hierarchy • Imposed by technology • Increasing size • Decreasing speed • Orders of magnitude! • Relevant data must be keptin the faster levels • Who should do it?Maybe the programmer

Computational Models • No model describes the whole hierarchy • External memory: I/O model • RAM: limited capacity M • Disk: transfers in blocks of size B • Cost: number of I/O transfers only • Internal memory: ideal cache model • Cache  RAM, RAM  disk • Cache updates are automatic and optimal

(a,2) (b,1) (a,1) (b,1) (a,2) a b (b,2) a b b a b b a Basic Data Structures: Tries • Trie: tree built on a set of stringsEvery string has a path in the tree • PATRICIA trie: nonbranching nodes collapsed, only 1st character stored • n strings Q(n) space • Loss of information

B-trees • Balanced search tree (enforced) • ~n sorted keys per node, whichpartition the keys in the subtrees • One node fits one disk block • Searching for keys matching prefix xtakes O(logBN + k/B) I/Os • Updates: O(logBN) I/Os k1 k2 k3 … kn

String B-trees (Ferragina & Grossi) • Useful for “big” or “irregular” strings • The keys are pointers to strings • Unsorted strings are in a separate array • Who knows lexicographic order? • Each node contains a PATRICIA triebuilt on the corresponding strings • search(x): O(logBn + (|x|+k)/B) I/Os • update(x): O(logBn + |x|/B) I/Os

Cache-oblivious B-trees (1) • Developed by Bender, Demaine, Farach • The B-tree is stored in memoryaccording to the van Emde Boas layout h 1 1 2 3 4 5 recursively! 2 3 4 5 • Balancing rule; long-distance nodes • search(x): O(logBN + k/B) I/Os • update(x): O(logBN) I/Os for suitable B

Cache-oblivious B-trees (2) • Developed by Brodal, Fagerberg and Jacob • Build a dynamic, binary, balanced search tree • Embed it into a static, binary tree • Store it using the van Emde Boas layout • Same I/Os with a simpler data structure • Cache obliviousness: The ideal cache model can be simulated on a realistic, multilevel modelwith constant slowdown (expected)

Suffix Trees • Flat memory model: s.t. can be built in • linear time if |S| is finite • sorting time if |S| is infinite • I/O model: studied by Farach et al. • Build the odd tree (through aux suffix tree) • Build the even tree (use the odd tree) • Merge the two trees (through anchor nodes) • Sorting I/O complexity (optimal)

Suffix Arrays (1) • “Simplified suffix tree” • AS[i] = position of i-th suffix of S • search(x): O((|x|/B)logN + k/B) I/Os • Array construction, Manber & Myers: • Put suffixes into buckets • Stage i: sort according to first 2i symbols • Read buckets in order • Takes 8N space and O(NlogN) I/Os

Suffix Arrays (2) • Array construction, Gonnet et al. :cubic I/O complexity, but… • worst case analysis is too pessimistic • nearly all I/Os are bulk, i.e. sequential Gonnet’s algorithm beats the one of Manber & Myers in practice • Ferragina and Grossi: new algorithm • Better I/O complexity than Gonnet’s • Faster than Gonnet’s in practice

Hierarchy-conscious Data Structures for String Analysis