PAT Trees
• Index for arbitrary character sequences in text
• Gonnet (1983), based on the Patricia tree (Morrison 68)
• Used for indexing the OED
• SISTRINGS: Semi-Infinite Strings
      pos
  A   13219   .I rise on a point of order which …
  B   41131   .I rise on a point of objection to ….
• B < A in sistring order
• What if we took all sistrings and sorted them?
STEP 1 : SISTRINGS
STRING: .can.a.can.can.cans?

 .  c  a  n  .  a  .  c  a  n  .  c  a  n  .  c  a  n  s  ?
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

SISTRING                OFFSET
.can.a.can.can.cans?    0
can.a.can.can.cans?     1
an.a.can.can.cans?      2
n.a.can.can.cans?       3
.a.can.can.cans?        4
a.can.can.cans?         5
.can.can.cans?          6
can.can.cans?           7
an.can.cans?            8
n.can.cans?             9
.can.cans?              10
can.cans?               11
an.cans?                12
n.cans?                 13
.cans?                  14
cans?                   15
ans?                    16
ns?                     17
s?                      18
?                       19
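A minimal sketch of Step 1, assuming the text is held in memory as an ordinary Python string. The function name `sistrings` is only an illustrative choice, not something from the slides.

```python
TEXT = ".can.a.can.can.cans?"

def sistrings(text):
    """Return (offset, sistring) pairs, one per starting position in text."""
    return [(i, text[i:]) for i in range(len(text))]

# Print the table: one sistring per offset, 0 through len(TEXT) - 1.
for offset, s in sistrings(TEXT):
    print(f"{s:<24}{offset}")
```

A real index would store only the offsets and read the characters from the text on demand, since materializing every sistring takes quadratic space.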
STEP 2 : Sort and find minimal distinguishing prefixes
STRING: .can.a.can.can.cans?

 .  c  a  n  .  a  .  c  a  n  .  c  a  n  .  c  a  n  s  ?
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

SISTRING                OFFSET  MINIMAL DISTINGUISHING PREFIX
.a.can.can.cans?        4       .a
.can.a.can.can.cans?    0       .can.a
.can.can.cans?          6       .can.can.
.can.cans?              10      .can.cans
.cans?                  14      .cans
?                       19      ?
a.can.can.cans?         5       a.
an.a.can.can.cans?      2       an.a
an.can.cans?            8       an.can.
an.cans?                12      an.cans
ans?                    16      ans
can.a.can.can.cans?     1       can.a
can.can.cans?           7       can.can.
can.cans?               11      can.cans
cans?                   15      cans
n.a.can.can.cans?       3       n.a
n.can.cans?             9       n.can.
n.cans?                 13      n.cans
ns?                     17      ns
s?                      18      s
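The sort and the prefix column can be sketched as follows. This assumes plain character-code ordering (so `.` < `?` < letters, as in the table) and uses the observation that, in sorted order, a sistring's minimal distinguishing prefix is one character longer than the longest prefix it shares with either neighbour. The helper names are illustrative.

```python
TEXT = ".can.a.can.can.cans?"

def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def distinguishing_prefixes(text):
    """Return (offset, minimal distinguishing prefix) pairs in sistring order."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    rows = []
    for k, i in enumerate(order):
        s = text[i:]
        left = lcp(s, text[order[k - 1]:]) if k > 0 else 0
        right = lcp(s, text[order[k + 1]:]) if k + 1 < len(order) else 0
        # One character past the longest overlap with a sorted neighbour.
        rows.append((i, s[:max(left, right) + 1]))
    return rows

for offset, prefix in distinguishing_prefixes(TEXT):
    print(f"{TEXT[offset:]:<24}{offset:<8}{prefix}")
```

This works here because the text ends in a unique character (`?`), so no sistring is a prefix of another.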
STEP 4 : Create a Digital Trie from Prefixes
(Label each leaf with the offset where its substring begins)
[Figure: digital trie over the minimal distinguishing prefixes; leaves are labeled with sistring offsets]
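One possible way to build such a trie, sketched with nested dictionaries. The representation (a `dict` per internal node, an `int` offset per leaf) and the name `build_trie` are assumptions made for illustration; the prefix table is the one from Step 2, one entry per offset 0 to 19.

```python
# offset -> minimal distinguishing prefix (from the Step 2 table)
PREFIXES = {
    4: ".a", 0: ".can.a", 6: ".can.can.", 10: ".can.cans", 14: ".cans",
    19: "?", 5: "a.", 2: "an.a", 8: "an.can.", 12: "an.cans", 16: "ans",
    1: "can.a", 7: "can.can.", 11: "can.cans", 15: "cans",
    3: "n.a", 9: "n.can.", 13: "n.cans", 17: "ns", 18: "s",
}

def build_trie(prefixes):
    """Insert each prefix character by character; the last character points at the offset."""
    root = {}
    for offset, prefix in prefixes.items():
        node = root
        for ch in prefix[:-1]:
            node = node.setdefault(ch, {})   # internal node: one branch per character
        node[prefix[-1]] = offset            # leaf: offset where the sistring begins
    return root

trie = build_trie(PREFIXES)
print(sorted(trie))            # branching characters at the root
print(trie["c"]["a"]["n"])     # subtree for everything starting with "can"
```

Because no minimal distinguishing prefix is a prefix of another, a leaf never has to share a slot with an internal node.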
STEP 5 : Simplify tree with use of skipped bits
(Each node records the # of missing letters)
[Figure: the digital trie with single-branch paths collapsed; leaves are labeled with sistring offsets]
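A rough sketch of the simplification, under the assumption that the trie is stored as nested dictionaries with integer offsets at the leaves. Nodes with a single child are dropped, and each surviving branching node becomes a pair `(skip, children)` where `skip` counts the letters skipped just above it. The tuple layout and function names are hypothetical, and only the sistrings starting with `.` are used to keep the example small.

```python
# offset -> minimal distinguishing prefix, restricted to sistrings starting with "."
DOT_PREFIXES = {4: ".a", 0: ".can.a", 6: ".can.can.", 10: ".can.cans", 14: ".cans"}

def build_trie(prefixes):
    """Nested-dict trie: dict for internal nodes, int offset for leaves."""
    root = {}
    for offset, prefix in prefixes.items():
        node = root
        for ch in prefix[:-1]:
            node = node.setdefault(ch, {})
        node[prefix[-1]] = offset
    return root

def compress(node, skipped=0):
    """Collapse single-child chains; return an offset or (skip, {char: subtree})."""
    if not isinstance(node, dict):
        return node                       # leaf: keep the offset
    if len(node) == 1:
        (child,) = node.values()
        return compress(child, skipped + 1)   # no decision here, so skip this letter
    return (skipped, {ch: compress(child) for ch, child in node.items()})

compressed = compress(build_trie(DOT_PREFIXES))
print(compressed)
```

Since skipped letters are never compared during a descent, a search on the simplified tree has to confirm its result against the text at the offset it reaches.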
Patricia Tree
• Binary digital trie
• Convert characters to ASCII bits and make a trie
[Figure: binary trie with 0/1 branches over the ASCII bits of a, c, n]
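To make the character-to-bits step concrete, here is a small sketch that assumes 8-bit ASCII codes written most significant bit first. The helper names are illustrative. The index of the first differing bit is where two keys would part ways in a binary trie.

```python
def to_bits(s):
    """ASCII characters -> string of 0s and 1s, 8 bits per character."""
    return "".join(format(byte, "08b") for byte in s.encode("ascii"))

def first_differing_bit(a, b):
    """Index of the first bit where a and b differ, or None if none differs in the shared length."""
    for i, (x, y) in enumerate(zip(to_bits(a), to_bits(b))):
        if x != y:
            return i
    return None

for ch in "acn":
    print(ch, to_bits(ch))

print(first_differing_bit("a", "c"))
print(first_differing_bit("a", "n"))
```

All three letters share the leading bits 0110, so a binary trie over them has no real decision to make until bit 4, which is exactly the kind of path that skip counts remove.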
Applications
• Prefix Searching
  • Follow one branch per character of the prefix, e.g. c a n
  • If no branch for next character, then fail.
  • Enumerate all leaves sharing the prefix
  • Cost: descent O(height), typically O(log n); enumeration O(return set)
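Prefix search can be sketched on a plain character trie. To keep the code short, this assumes an uncompressed trie holding every sistring in full, with the key `None` marking a leaf. A real PAT tree would use the simplified tree with skip counts, so this shows the search logic rather than the storage layout.

```python
TEXT = ".can.a.can.can.cans?"

def build_sistring_trie(text):
    """Uncompressed trie of all sistrings; node[None] holds the starting offset."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[None] = i
    return root

def prefix_search(trie, prefix):
    """Return the sorted offsets of all sistrings that start with prefix."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []                 # no branch for the next character: fail
        node = node[ch]
    found, stack = [], [node]
    while stack:                      # enumerate every leaf below the reached node
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_sistring_trie(TEXT)
print(prefix_search(trie, "can"))
print(prefix_search(trie, "cat"))
```

The descent costs one step per character of the prefix, while the enumeration visits the whole subtree, which is why the size of the return set dominates for short prefixes.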
Applications
• Longest Repetition Search
  • e.g. .can.can
  • Farthest internal node from root
  • Simplify on-line calculation by storing a bit to show which direction the longest subtree goes
• Most Frequent N-gram
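The longest repetition can also be found without building the tree. The sketch below swaps in a different but equivalent technique: it sorts the sistring offsets and takes the longest common prefix of adjacent entries, which corresponds to the deepest internal node of the tree. The function name is illustrative.

```python
TEXT = ".can.a.can.can.cans?"

def longest_repetition(text):
    """Longest substring occurring at least twice (occurrences may overlap)."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for a, b in zip(order, order[1:]):
        n = 0
        # Extend the match between two neighbouring sistrings as far as it goes.
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        if n > len(best):
            best = text[a:a + n]
    return best

print(longest_repetition(TEXT))   # .can.can (shared by offsets 6 and 10)
```

Only adjacent pairs need checking because any two sistrings sharing a prefix have every sistring between them in sorted order sharing it as well.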
Problem with PAT Tree
• Enumerating a subtree is costly