1 / 12

Hierarchy-conscious Data Structures for String Analysis

Hierarchy-conscious Data Structures for String Analysis. Carlo Fantozzi PhD Student (XVI ciclo). Bioinformatics Course - June 25, 2002. Outline of the Talk. What is a memory hierarchy? Computational models Data structures. We will describe data structures

Download Presentation

Hierarchy-conscious Data Structures for String Analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Hierarchy-conscious Data Structures for String Analysis Carlo Fantozzi PhD Student (XVI ciclo) Bioinformatics Course - June 25, 2002

  2. Outline of the Talk • What is a memory hierarchy? • Computational models • Data structures We will describe data structures to efficiently deal with memory hierarchieswhile analyzing strings

  3. ALU L1 inst L1 data L2 cache L3 cache main memory disk(s), network Memory Hierarchy • Imposed by technology • Increasing size • Decreasing speed • Orders of magnitude! • Relevant data must be keptin the faster levels • Who should do it?Maybe the programmer

  4. Computational Models • No model describes the whole hierarchy • External memory: I/O model • RAM: limited capacity M • Disk: transfers in blocks of size B • Cost: number of I/O transfers only • Internal memory: ideal cache model • Cache  RAM, RAM  disk • Cache updates are automatic and optimal

  5. (a,2) (b,1) (a,1) (b,1) (a,2) a b (b,2) a b b a b b a Basic Data Structures: Tries • Trie: tree built on a set of stringsEvery string has a path in the tree • PATRICIA trie: nonbranching nodes collapsed, only 1st character stored • n strings Q(n) space • Loss of information

  6. B-trees • Balanced search tree (enforced) • ~n sorted keys per node, whichpartition the keys in the subtrees • One node fits one disk block • Searching for keys matching prefix xtakes O(logBN + k/B) I/Os • Updates: O(logBN) I/Os k1 k2 k3 … kn

  7. String B-trees (Ferragina & Grossi) • Useful for “big” or “irregular” strings • The keys are pointers to strings • Unsorted strings are in a separate array • Who knows lexicographic order? • Each node contains a PATRICIA triebuilt on the corresponding strings • search(x): O(logBn + (|x|+k)/B) I/Os • update(x): O(logBn + |x|/B) I/Os

  8. Cache-oblivious B-trees (1) • Developed by Bender, Demaine, Farach • The B-tree is stored in memoryaccording to the van Emde Boas layout h 1 1 2 3 4 5 recursively! 2 3 4 5 • Balancing rule; long-distance nodes • search(x): O(logBN + k/B) I/Os • update(x): O(logBN) I/Os for suitable B

  9. Cache-oblivious B-trees (2) • Developed by Brodal, Fagerberg and Jacob • Build a dynamic, binary, balanced search tree • Embed it into a static, binary tree • Store it using the van Emde Boas layout • Same I/Os with a simpler data structure • Cache obliviousness: The ideal cache model can be simulated on a realistic, multilevel modelwith constant slowdown (expected)

  10. Suffix Trees • Flat memory model: s.t. can be built in • linear time if |S| is finite • sorting time if |S| is infinite • I/O model: studied by Farach et al. • Build the odd tree (through aux suffix tree) • Build the even tree (use the odd tree) • Merge the two trees (through anchor nodes) • Sorting I/O complexity (optimal)

  11. Suffix Arrays (1) • “Simplified suffix tree” • AS[i] = position of i-th suffix of S • search(x): O((|x|/B)logN + k/B) I/Os • Array construction, Manber & Myers: • Put suffixes into buckets • Stage i: sort according to first 2i symbols • Read buckets in order • Takes 8N space and O(NlogN) I/Os

  12. Suffix Arrays (2) • Array construction, Gonnet et al. :cubic I/O complexity, but… • worst case analysis is too pessimistic • nearly all I/Os are bulk, i.e. sequential Gonnet’s algorithm beats the one of Manber & Myers in practice • Ferragina and Grossi: new algorithm • Better I/O complexity than Gonnet’s • Faster than Gonnet’s in practice

More Related