Advanced Patricia Tree Indexing for Text Sequences

Learn about the PAT tree, an index for arbitrary character sequences in text proposed by Gonnet in 1983. It is based on the Patricia tree (Morrison, 1968) and was used to index the Oxford English Dictionary. The slides introduce SISTRINGS (Semi-Infinite Strings) and show how sorting them reveals the minimal distinguishing prefix of each one. They then step through building a digital trie from those prefixes and simplifying it by recording skipped bits. Applications covered are Prefix Searching, Longest Repetition Search, and the Most Frequent N-gram problem.

carolynk

Presentation Transcript


  1. PAT Trees
     • Index for arbitrary character sequences in text
     • Gonnet (1983), based on the Patricia Tree (Morrison 68)
     • Used for indexing the OED
     • SISTRINGS: Semi-Infinite Strings
           pos
       A   13219   .I rise on a point of order which …
       B   41131   .I rise on a point of objection to …
       B < A in sistring order
     • What if we encountered all sistrings and sorted them?

  2. STEP 1 : SISTRINGS

     STRING: .can.a.can.can.cans?

     char:     .  c  a  n  .  a  .  c  a  n  .  c  a  n  .  c  a  n  s  ?
     offset:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

     SISTRING               OFFSET
     .can.a.can.can.cans?   0
     can.a.can.can.cans?    1
     an.a.can.can.cans?     2
     n.a.can.can.cans?      3
     .a.can.can.cans?       4
     a.can.can.cans?        5
     .can.can.cans?         6
     can.can.cans?          7
     an.can.cans?           8
     n.can.cans?            9
     .can.cans?             10
     can.cans?              11
     an.cans?               12
     n.cans?                13
     .cans?                 14
     cans?                  15
     ans?                   16
     ns?                    17
     s?                     18
     ?                      19
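As a minimal sketch of Step 1, the sistring list can be generated by taking the suffix of the text at every offset. The names `TEXT` and `sistrings` are illustrative and not from the slides; the sketch also assumes each sistring simply stops at the end of the text rather than being padded out "semi-infinitely".

```python
TEXT = ".can.a.can.can.cans?"  # example string from the slide

def sistrings(text):
    """Return (offset, sistring) pairs, one for each starting position."""
    return [(offset, text[offset:]) for offset in range(len(text))]

if __name__ == "__main__":
    # Print the same two columns as the slide: sistring, then offset.
    for offset, s in sistrings(TEXT):
        print(f"{s:<22}{offset}")
```

Storing the offset alone is enough in practice, since the sistring can always be read back from the text starting at that position.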

  3. STEP 2 : Sort and find minimal distinguishing prefixes

     STRING: .can.a.can.can.cans?

     char:     .  c  a  n  .  a  .  c  a  n  .  c  a  n  .  c  a  n  s  ?
     offset:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

     SISTRING               OFFSET   MINIMAL DISTINGUISHING PREFIX
     .a.can.can.cans?       4        .a
     .can.a.can.can.cans?   0        .can.a
     .can.can.cans?         6        .can.can.
     .can.cans?             10       .can.cans
     .cans?                 14       .cans
     ?                      19       ?
     a.can.can.cans?        5        a.
     an.a.can.can.cans?     2        an.a
     an.can.cans?           8        an.can.
     an.cans?               12       an.cans
     ans?                   16       ans
     can.a.can.can.cans?    1        can.a
     can.can.cans?          7        can.can.
     can.cans?              11       can.cans
     cans?                  15       cans
     n.a.can.can.cans?      3        n.a
     n.can.cans?            9        n.can.
     n.cans?                13       n.cans
     ns?                    17       ns
     s?                     18       s
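The sort-and-prefix step can be sketched as follows. The idea assumed here is that a sistring's minimal distinguishing prefix is one character longer than the longest prefix it shares with either neighbour in sorted order. Function names are illustrative, and the sketch relies on the text ending in a unique character (`?`), so that no sistring is a prefix of another.

```python
TEXT = ".can.a.can.can.cans?"  # example string from the slide

def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def minimal_prefixes(text):
    """Return (offset, sistring, minimal distinguishing prefix) in sorted order."""
    # Python compares strings by code point, so '.' < '?' < letters, as on the slide.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    rows = []
    for k, offset in enumerate(order):
        s = text[offset:]
        # Only the sorted neighbours can share the longest prefix with s.
        neighbours = order[max(k - 1, 0):k] + order[k + 1:k + 2]
        shared = max((common_prefix_len(s, text[n:]) for n in neighbours), default=0)
        rows.append((offset, s, s[:shared + 1]))
    return rows

if __name__ == "__main__":
    for offset, s, prefix in minimal_prefixes(TEXT):
        print(f"{s:<22}{offset:<4}{prefix}")
```

Slicing a fresh suffix for every comparison is quadratic in the worst case; that is fine for a 20-character example but not how a production index would be built.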

  4. STEP 4 : Create a Digital Trie from Prefixes
     ( Label with substring beginning )
     [Figure: digital trie branching on the characters of the minimal distinguishing prefixes, with sistring offsets at the leaves]
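A minimal sketch of this step, assuming the trie is stored as nested dictionaries keyed by character with the sistring offset as the leaf label. Only the Step 2 rows beginning with `.`, `?` and `c` are used to keep it short; `build_trie` and `lookup` are illustrative names, not from the slides.

```python
# (offset: minimal distinguishing prefix) rows taken from the Step 2 table;
# only the rows beginning with '.', '?' and 'c' are included here.
PREFIXES = {
    4: ".a", 0: ".can.a", 6: ".can.can.", 10: ".can.cans", 14: ".cans",
    19: "?",
    1: "can.a", 7: "can.can.", 11: "can.cans", 15: "cans",
}

def build_trie(prefixes):
    """Nested dicts keyed by character; a leaf is the integer sistring offset."""
    root = {}
    for offset, prefix in prefixes.items():
        node = root
        for ch in prefix[:-1]:
            node = node.setdefault(ch, {})
        node[prefix[-1]] = offset  # label the leaf with where the substring begins
    return root

def lookup(trie, key):
    """Follow key character by character; return the leaf offset or None."""
    node = trie
    for ch in key:
        if not isinstance(node, dict) or ch not in node:
            return None
        node = node[ch]
    return node if isinstance(node, int) else None
```

Because minimal distinguishing prefixes are prefix-free, a leaf never has to share a slot with an internal node, which is why plain assignment at the last character is safe in this sketch.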

  5. STEP 5 : Simplify tree with use of skipped bits
     [Figure: simplified trie in which chains without branching are collapsed and nodes record the # of missing letters]
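One possible way to sketch the simplification: collapse every chain of single-child nodes and store, at the next branching node, how many characters were skipped. The `{"skip": ..., "children": ...}` layout and the names `compress` and `find` are assumptions for illustration. Since skipped characters are no longer stored, a search has to check its candidate against the text at the end.

```python
TEXT = ".can.a.can.can.cans?"  # example string from the slide

# Uncompressed trie for the four Step 2 prefixes that start with 'c':
# can.a (1), can.can. (7), can.cans (11), cans (15). Leaves are offsets.
TRIE = {"c": {"a": {"n": {
    ".": {"a": 1, "c": {"a": {"n": {".": 7, "s": 11}}}},
    "s": 15,
}}}}

def compress(node, skipped=0):
    """Collapse single-child chains, recording the number of skipped characters."""
    if not isinstance(node, dict):
        return node  # leaf: sistring offset
    if len(node) == 1:
        (child,) = node.values()
        return compress(child, skipped + 1)
    return {"skip": skipped,
            "children": {ch: compress(child) for ch, child in node.items()}}

def find(node, key):
    """Descend using skip counts, then verify the candidate offset against TEXT."""
    pos = 0
    while isinstance(node, dict):
        pos += node["skip"]
        if pos >= len(key) or key[pos] not in node["children"]:
            return None  # key too short to single out a leaf, or no branch
        node = node["children"][key[pos]]
        pos += 1
    return node if TEXT.startswith(key, node) else None
```

The final `startswith` check is what catches keys such as `cxn.a`, which follow the same branches as `can.a` but differ in a skipped position.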

  6. Patricia Tree
     Binary digital trie
     Convert characters to ASCII bits and make a trie
     [Figure: binary trie with 0/1 branches over the bit patterns of a, c, n]
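To illustrate the conversion to bits, the sketch below writes each character as 8 ASCII bits and finds the first bit position at which two keys differ, which is where a binary trie would branch between them. The 8-bit width and the function names are assumptions; the slide does not say how many bits per character it uses.

```python
def to_bits(s):
    """Concatenate the 8-bit ASCII code of every character in s."""
    return "".join(format(ord(ch), "08b") for ch in s)

def first_differing_bit(a, b):
    """Index of the first bit where a and b differ, or None if none is found."""
    for i, (x, y) in enumerate(zip(to_bits(a), to_bits(b))):
        if x != y:
            return i
    return None

if __name__ == "__main__":
    for ch in "acn":
        print(ch, to_bits(ch))
```

With bits as the alphabet every internal node has at most two children, so a node only needs the bit index to test and two child pointers.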

  7. Applications
     • Prefix Searching
       If no branch for next character, then fail.
       Follow c a n down the tree: O(height) ⇒ O(log n)
       Enumerate all leaves sharing prefix: O(return set)
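Prefix searching can be sketched on a plain, uncompressed trie of all sistrings. This is a simplification for illustration and not the compact PAT tree itself; the `None` key marking where a sistring ends and the function names are assumptions.

```python
TEXT = ".can.a.can.can.cans?"  # example string from the slide

def build_sistring_trie(text):
    """Uncompressed trie of every sistring; the None key holds the offset."""
    root = {}
    for offset in range(len(text)):
        node = root
        for ch in text[offset:]:
            node = node.setdefault(ch, {})
        node[None] = offset
    return root

def prefix_search(root, prefix):
    """Return the sorted offsets of all sistrings that start with prefix."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []  # no branch for the next character: fail
        node = node[ch]
    # Enumerate all leaves below the node reached by the prefix.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)
```

For example, `prefix_search(build_sistring_trie(TEXT), "can")` visits three nodes on the way down and then collects the four offsets where `can` occurs.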

  8. Applications
     • Longest Repetition Search
       .can.can: farthest internal node from root
       Simplify on-line calculation by storing a bit to show which direction the longest subtree goes
     • Most Frequent N-gram
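Both applications can be sketched without building the tree, using the sorted sistring order as a stand-in. The farthest internal node from the root corresponds to the longest prefix shared by two neighbouring sistrings in sorted order, and the most frequent n-gram is counted here with a plain `Counter` over length-n windows rather than by counting leaves under a tree node. Function names are illustrative.

```python
from collections import Counter

TEXT = ".can.a.can.can.cans?"  # example string from the slide

def longest_repetition(text):
    """Longest substring that occurs at least twice (occurrences may overlap)."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for a, b in zip(order, order[1:]):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        if n > len(best):
            best = text[a:a + n]
    return best

def most_frequent_ngram(text, n):
    """Return (ngram, count) for a most frequent substring of length n."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return counts.most_common(1)[0]
```

On the example string the longest repetition is `.can.can`, matching the slide, and the most frequent 4-gram is `.can` with four occurrences.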

  9. Problem with PAT Tree
     Enumerating subtree costly

  10. Create Suffix Arrays
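A minimal sketch of a suffix array for the same example, assuming the simplest construction (sort the offsets by their sistrings) and a binary search for lookups. The names `build_suffix_array` and `find` are illustrative, and the naive sort is not how large texts would be indexed.

```python
TEXT = ".can.a.can.can.cans?"  # example string from the slide

def build_suffix_array(text):
    """Offsets of all sistrings, listed in sorted sistring order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, pattern):
    """Sorted offsets of all occurrences of pattern, via two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first entry whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first entry whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

All sistrings sharing a prefix sit in one contiguous slice of the array, so enumerating the matches is a slice rather than a subtree walk.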
