
PAT Trees Index for arbitrary character sequence in text

carolynk

Presentation Transcript


  1. PAT Trees
     • Index for arbitrary character sequences in text
     • Gonnet (1983), based on the Patricia tree (Morrison 68)
     • Used for indexing the OED
     • SISTRINGS: Semi-Infinite Strings, one starting at each text position (pos)
         A  pos 13219  .I rise on a point of order which …
         B  pos 41131  .I rise on a point of objection to ….
       B < A in sistring order (first difference: "ob" sorts before "or")
     • What if we enumerated all sistrings and sorted them?

  2. STEP 1 : SISTRINGS
     STRING: .can.a.can.can.cans?

     pos:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
     char:  .  c  a  n  .  a  .  c  a  n  .  c  a  n  .  c  a  n  s  ?

     SISTRING               OFFSET
     .can.a.can.can.cans?    0
     can.a.can.can.cans?     1
     an.a.can.can.cans?      2
     n.a.can.can.cans?       3
     .a.can.can.cans?        4
     a.can.can.cans?         5
     .can.can.cans?          6
     can.can.cans?           7
     an.can.cans?            8
     n.can.cans?             9
     .can.cans?             10
     can.cans?              11
     an.cans?               12
     n.cans?                13
     .cans?                 14
     cans?                  15
     ans?                   16
     ns?                    17
     s?                     18
     ?                      19
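
Step 1 can be sketched in a few lines of Python. This is only an illustration: the names `TEXT` and `sistrings` are made up here, and because the example text is finite, each "semi-infinite" string simply runs to the end of the text.

```python
TEXT = ".can.a.can.can.cans?"  # the example string from the slide

def sistrings(text):
    """Yield (offset, sistring) pairs, one for every starting position."""
    for offset in range(len(text)):
        # A sistring starts at `offset` and runs to the end of the text.
        yield offset, text[offset:]

for offset, s in sistrings(TEXT):
    print(f"{offset:2d}  {s}")
```

A real index would store only the offsets and read the characters from the text on demand, rather than copying every suffix as this sketch does.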

  3. STEP 2 : Sort and find minimal distinguishing prefixes
     STRING: .can.a.can.can.cans?

     pos:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
     char:  .  c  a  n  .  a  .  c  a  n  .  c  a  n  .  c  a  n  s  ?

     SISTRING               OFFSET   MINIMAL DISTINGUISHING PREFIX
     .a.can.can.cans?        4       .a
     .can.a.can.can.cans?    0       .can.a
     .can.can.cans?          6       .can.can.
     .can.cans?             10       .can.cans
     .cans?                 14       .cans
     ?                      19       ?
     a.can.can.cans?         5       a.
     an.a.can.can.cans?      2       an.a
     an.can.cans?            8       an.can.
     an.cans?               12       an.cans
     ans?                   16       ans
     can.a.can.can.cans?     1       can.a
     can.can.cans?           7       can.can.
     can.cans?              11       can.cans
     cans?                  15       cans
     n.a.can.can.cans?       3       n.a
     n.can.cans?             9       n.can.
     n.cans?                13       n.cans
     ns?                    17       ns
     s?                     18       s
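
The sort and the prefix computation can be sketched as follows. The helper names are hypothetical. The sketch assumes plain character-code ordering (so "." < "?" < "a", as in the slide) and assumes that no sistring is a prefix of another, which holds here because the text ends in a unique "?".

```python
TEXT = ".can.a.can.can.cans?"

def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def minimal_prefixes(text):
    """Return (offset, minimal distinguishing prefix) in sorted sistring order."""
    ordered = sorted((text[i:], i) for i in range(len(text)))
    rows = []
    for k, (s, offset) in enumerate(ordered):
        # In sorted order, only the two neighbours can share the longest prefix.
        lcp = 0
        if k > 0:
            lcp = max(lcp, common_prefix_len(s, ordered[k - 1][0]))
        if k + 1 < len(ordered):
            lcp = max(lcp, common_prefix_len(s, ordered[k + 1][0]))
        # One character past the longest shared prefix is enough to be unique.
        rows.append((offset, s[:lcp + 1]))
    return rows

for offset, prefix in minimal_prefixes(TEXT):
    print(f"{offset:2d}  {prefix}")
```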

  4. STEP 4 : Create a Digital Trie from Prefixes
     [Figure: digital trie built from the minimal distinguishing prefixes. Each branch is labeled with a character, and each leaf is labeled with the offset at which its substring begins.]
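
A minimal sketch of this step, assuming the trie is represented as nested dictionaries (a choice made here for brevity, not something the slides specify). The `PREFIXES` table repeats the Step 2 prefixes; the entry for offset 8 (`an.can.`) follows from the Step 1 listing. `LEAF` and `build_trie` are hypothetical names.

```python
# Minimal distinguishing prefixes from Step 2, keyed by sistring offset.
PREFIXES = {
    4: ".a", 0: ".can.a", 6: ".can.can.", 10: ".can.cans", 14: ".cans",
    19: "?", 5: "a.", 2: "an.a", 8: "an.can.", 12: "an.cans", 16: "ans",
    1: "can.a", 7: "can.can.", 11: "can.cans", 15: "cans",
    3: "n.a", 9: "n.can.", 13: "n.cans", 17: "ns", 18: "s",
}

LEAF = "offset"  # marker key; cannot clash with the one-character edge keys

def build_trie(prefixes):
    """Insert each prefix character by character; the last node stores the offset."""
    root = {}
    for offset, prefix in prefixes.items():
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node[LEAF] = offset
    return root

trie = build_trie(PREFIXES)
print(sorted(trie))  # first-level branches
```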

  5. STEP 5 : Simplify tree with use of skipped bits
     [Figure: the simplified tree. Nodes marked x record the # of missing letters that were skipped; leaves keep their sistring offsets.]
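
One way to picture the simplification is to collapse every chain of one-child nodes and keep only a count of the letters skipped. The sketch below does this on a nested-dictionary trie for the four prefixes that start with "c". The representation, and the names `LEAF`, `build_trie`, `compress` and `C_PREFIXES`, are assumptions for illustration only.

```python
LEAF = "offset"  # marker key; edge keys are single characters

def build_trie(prefixes):
    """Plain digital trie: one node per character, offset stored at the leaf."""
    root = {}
    for offset, prefix in prefixes.items():
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node[LEAF] = offset
    return root

def compress(node):
    """Collapse one-child chains; each edge maps to (letters skipped, subtree)."""
    if LEAF in node:
        return node[LEAF]  # a leaf is represented by its offset
    out = {}
    for ch, child in node.items():
        skipped = 0
        # Walk through nodes that offer no choice, counting the letters skipped.
        while LEAF not in child and len(child) == 1:
            child = next(iter(child.values()))
            skipped += 1
        out[ch] = (skipped, compress(child))
    return out

C_PREFIXES = {1: "can.a", 7: "can.can.", 11: "can.cans", 15: "cans"}
compressed = compress(build_trie(C_PREFIXES))
print(compressed)
```

Because the skipped letters are no longer stored, a search that reaches a leaf still has to compare the query against the text at that offset.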

  6. Patricia Tree
     • Binary digital trie
     • Convert characters to ASCII bits and make a trie
     [Figure: binary trie with 0/1 branches over the ASCII bit patterns of a, c and n.]
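
The character-to-bits conversion can be sketched as below. `to_bits` and `first_diff_bit` are hypothetical helper names; the bit position at which two keys first differ is where a binary trie would branch between them.

```python
def to_bits(text):
    """Return the 8-bit ASCII codes of `text` as one string of 0s and 1s."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))

def first_diff_bit(a, b):
    """Index of the first bit where a and b differ, or None if none differs."""
    for i, (x, y) in enumerate(zip(to_bits(a), to_bits(b))):
        if x != y:
            return i
    return None

for ch in "acn":
    print(ch, to_bits(ch))
# a 01100001
# c 01100011
# n 01101110
```

For these three letters, "a" and "n" part ways at bit 4 and "a" and "c" at bit 6 (counting from 0 at the left).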

  7. Applications
     • Prefix Searching
       Follow the query (e.g. c a n) down from the root, one branch per character.
       If no branch for next character, then fail.
       Otherwise, enumerate all leaves sharing the prefix.
       Cost: O(height) + O(return set), with height O(log n)
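
Prefix searching can be sketched on an uncompressed trie that stores every sistring in full. That is a simplification chosen here to keep the example short (it takes quadratic space), not how a PAT tree is actually stored. The names `END`, `build_suffix_trie`, `leaves` and `prefix_search` are hypothetical.

```python
TEXT = ".can.a.can.can.cans?"
END = "offset"  # marker key; edge keys are single characters

def build_suffix_trie(text):
    """Insert every sistring in full and record its offset at the end."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[END] = i
    return root

def leaves(node):
    """Yield every offset stored in the subtree below `node`."""
    for key, child in node.items():
        if key == END:
            yield child
        else:
            yield from leaves(child)

def prefix_search(trie, query):
    """Descend one character at a time; fail as soon as a branch is missing."""
    node = trie
    for ch in query:
        if ch not in node:
            return []
        node = node[ch]
    return sorted(leaves(node))

trie = build_suffix_trie(TEXT)
print(prefix_search(trie, "can"))  # [1, 7, 11, 15]
print(prefix_search(trie, "cat"))  # []
```

The descent costs one step per query character, and the enumeration costs time proportional to the size of the subtree, which is the part the later slide calls costly.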

  8. Applications
     • Longest Repetition Search (e.g. .can.can)
       The farthest internal node from the root.
       Simplify on-line calculation by storing a bit to show which direction the longest subtree goes.
     • Most Frequent N-gram
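
The longest repetition can also be found without a tree: in sorted sistring order, the longest prefix shared by two neighbours is the longest repeated substring. The sketch below uses that sorted-order shortcut rather than the deepest-internal-node search the slide describes; the function names are hypothetical.

```python
TEXT = ".can.a.can.can.cans?"

def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def longest_repetition(text):
    """Longest substring occurring at least twice (occurrences may overlap)."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for a, b in zip(order, order[1:]):
        n = common_prefix_len(text[a:], text[b:])
        if n > len(best):
            best = text[a:a + n]
    return best

print(longest_repetition(TEXT))  # .can.can
```

In the example, `.can.can` starts at offsets 6 and 10, which matches the slide.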

  9. Problem with PAT Tree
     • Enumerating a subtree is costly

  10. Create Suffix Arrays
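
A suffix array keeps only the sistring offsets in sorted order, so every match for a prefix sits in one contiguous slice. A minimal sketch, with hypothetical function names and a deliberately naive construction by sorting whole suffixes:

```python
from bisect import bisect_left, bisect_right

TEXT = ".can.a.can.can.cans?"

def build_suffix_array(text):
    """Offsets of all sistrings, in sorted sistring order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, query):
    """Return the offsets whose sistring starts with `query`."""
    # Truncating sorted sistrings to len(query) keeps them in sorted order.
    # The key list is materialised only to keep the sketch short.
    keys = [text[i:i + len(query)] for i in sa]
    lo = bisect_left(keys, query)
    hi = bisect_right(keys, query)
    return sorted(sa[lo:hi])

sa = build_suffix_array(TEXT)
print(sa)
print(find(TEXT, sa, "can"))  # [1, 7, 11, 15]
```

A practical implementation would run the binary search directly against the text instead of building `keys`, and would use a faster construction algorithm.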
