An overview of PAT trees, an index over arbitrary character sequences in text described by Gonnet (1983) and built on the Patricia tree (Morrison, 1968); the structure was used to index the Oxford English Dictionary. The slides introduce sistrings (semi-infinite strings), sort them to find minimal distinguishing prefixes, build a digital trie from those prefixes, and simplify it by recording how many positions are skipped. Applications covered are prefix searching, longest repetition search, and the most frequent n-gram problem.
PAT Trees
• Index for arbitrary character sequences in text
• Gonnet (1983), based on the Patricia tree (Morrison 68)
• Used for indexing the OED
• SISTRINGS: Semi-Infinite Strings
      pos
  A   13219   .I rise on a point of order which …
  B   41131   .I rise on a point of objection to …
• B < A in sistring order ("objection" sorts before "order")
• What if we encountered all sistrings and sorted them?
STEP 1 : SISTRINGS
STRING: .can.a.can.can.cans?

CHAR      .  c  a  n  .  a  .  c  a  n  .  c  a  n  .  c  a  n  s  ?
OFFSET    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

SISTRING               OFFSET
.can.a.can.can.cans?   0
can.a.can.can.cans?    1
an.a.can.can.cans?     2
n.a.can.can.cans?      3
.a.can.can.cans?       4
a.can.can.cans?        5
.can.can.cans?         6
can.can.cans?          7
an.can.cans?           8
n.can.cans?            9
.can.cans?             10
can.cans?              11
an.cans?               12
n.cans?                13
.cans?                 14
cans?                  15
ans?                   16
ns?                    17
s?                     18
?                      19
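The table above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `sistrings` is an illustrative choice, and each sistring simply stops at the end of the text rather than being padded to be truly semi-infinite.

```python
def sistrings(text):
    """Return (offset, sistring) pairs, one for every starting position."""
    return [(i, text[i:]) for i in range(len(text))]

TEXT = ".can.a.can.can.cans?"

for offset, s in sistrings(TEXT):
    # Left-align the sistring so the offsets line up in a column.
    print(f"{s:<22} {offset}")
```

Storing only the offsets and slicing on demand would avoid copying the text once per position; the copies are kept here for readability.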
STEP 2 : Sort and find minimal distinguishing prefixes
STRING: .can.a.can.can.cans?

CHAR      .  c  a  n  .  a  .  c  a  n  .  c  a  n  .  c  a  n  s  ?
OFFSET    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

SISTRING               OFFSET  MINIMAL DISTINGUISHING PREFIX
.a.can.can.cans?       4       .a
.can.a.can.can.cans?   0       .can.a
.can.can.cans?         6       .can.can.
.can.cans?             10      .can.cans
.cans?                 14      .cans
?                      19      ?
a.can.can.cans?        5       a.
an.a.can.can.cans?     2       an.a
an.can.cans?           8       an.can.
an.cans?               12      an.cans
ans?                   16      ans
can.a.can.can.cans?    1       can.a
can.can.cans?          7       can.can.
can.cans?              11      can.cans
cans?                  15      cans
n.a.can.can.cans?      3       n.a
n.can.cans?            9       n.can.
n.cans?                13      n.cans
ns?                    17      ns
s?                     18      s
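One way to compute this step, sketched in Python under the assumption that plain character-code ordering is the intended sort order (it places "." before "?" before the letters, as in the table). After sorting, a sistring's closest competitors are its two neighbours, so its minimal distinguishing prefix is one character longer than the longest prefix it shares with either of them. The helper names are illustrative.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def minimal_prefixes(text):
    """Return (offset, prefix) pairs in sorted sistring order."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    rows = []
    for k, i in enumerate(order):
        s = text[i:]
        shared = 0
        if k > 0:  # compare with the previous sistring in sorted order
            shared = max(shared, common_prefix_len(s, text[order[k - 1]:]))
        if k + 1 < len(order):  # and with the next one
            shared = max(shared, common_prefix_len(s, text[order[k + 1]:]))
        rows.append((i, s[:shared + 1]))
    return rows

TEXT = ".can.a.can.can.cans?"
for offset, prefix in minimal_prefixes(TEXT):
    print(f"{TEXT[offset:]:<22} {offset:<7} {prefix}")
```

Because the final "?" occurs only once, no sistring of this text is a prefix of another, so every row gets a proper distinguishing prefix.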
STEP 4 : Create a Digital Trie from Prefixes
( Label each leaf with the offset at which its substring begins )
[Figure: digital trie over the minimal distinguishing prefixes, one branch per character, leaves labelled with sistring offsets]
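A digital trie of this kind can be sketched with nested dictionaries. The representation below is an assumption made for illustration: each node is a `dict` keyed by character, and a leaf stores its sistring offset under the key `LEAF`. The prefixes are recomputed by brute force so that the block runs on its own.

```python
LEAF = None  # dictionary key under which a leaf stores its sistring offset

def distinguishing_prefixes(text):
    """Yield (prefix, offset): the shortest prefix of each sistring that
    occurs at no other position in text (brute force, fine for a demo)."""
    for i in range(len(text)):
        for k in range(i + 1, len(text) + 1):
            prefix = text[i:k]
            hits = sum(text.startswith(prefix, j) for j in range(len(text)))
            if hits == 1:
                yield prefix, i
                break

def build_trie(pairs):
    """Insert every prefix character by character; label the leaf."""
    root = {}
    for prefix, offset in pairs:
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node[LEAF] = offset
    return root

def count_leaves(node):
    return sum(1 if key is LEAF else count_leaves(child)
               for key, child in node.items())

TEXT = ".can.a.can.can.cans?"
trie = build_trie(distinguishing_prefixes(TEXT))
print(sorted(trie))        # characters branching from the root
print(count_leaves(trie))  # one leaf per sistring
```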
STEP 5 : Simplify tree with use of skipped bits
[Figure: the trie from Step 4 with non-branching chains collapsed; each remaining node records the # of missing letters, and leaves keep their sistring offsets]
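The simplification can be sketched as collapsing every chain of single-child nodes and remembering how many letters were skipped. The encoding here is hypothetical: a compressed edge is a pair `(skip, subtree)`, where `subtree` is either a leaf offset or a dictionary of further edges. To stay short, the example uses only the "." and "?" rows of the Step 2 table.

```python
LEAF = None  # dictionary key under which a leaf stores its sistring offset

def build_trie(pairs):
    root = {}
    for prefix, offset in pairs:
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node[LEAF] = offset
    return root

def compress(node, skipped=0):
    """Return (skip, subtree) with all single-child chains removed."""
    if LEAF in node:                 # reached a leaf: keep its offset
        return skipped, node[LEAF]
    if len(node) == 1:               # no real choice here: skip this letter
        (child,) = node.values()
        return compress(child, skipped + 1)
    # Branching node: keep it and compress each outgoing edge.
    return skipped, {ch: compress(child) for ch, child in node.items()}

# Prefixes taken from the Step 2 table (rows starting with "." or "?").
PAIRS = [(".a", 4), (".can.a", 0), (".can.can.", 6),
         (".can.cans", 10), (".cans", 14), ("?", 19)]

skip, tree = compress(build_trie(PAIRS))
print(tree["."][1]["c"])  # edge for ".c": 2 letters ("an") skipped before branching
```

Because letters are skipped, a search that reaches a leaf still has to compare the query against the text at that offset before reporting a match.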
Patricia Tree
• Binary digital trie
• Convert characters to ASCII bits and make a trie
[Figure: binary trie over the ASCII bit strings of a, c and n, branching on 0 / 1]
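A small sketch of the binary version, assuming 8-bit ASCII codes written most significant bit first. Internal nodes are the hypothetical tuples `(bit_index, zero_branch, one_branch)`; recording the absolute bit index is one way of expressing how many bits are skipped between branching points.

```python
def to_bits(s):
    """Concatenate the 8-bit ASCII codes of the characters in s."""
    return "".join(format(ord(ch), "08b") for ch in s)

def build_binary_trie(keys, bit=0):
    """keys maps distinct, equal-length bit strings to labels."""
    if len(keys) == 1:
        return next(iter(keys.values()))
    # Skip bits on which all remaining keys agree.
    while len({k[bit] for k in keys}) == 1:
        bit += 1
    zero = {k: v for k, v in keys.items() if k[bit] == "0"}
    one = {k: v for k, v in keys.items() if k[bit] == "1"}
    return (bit,
            build_binary_trie(zero, bit + 1),
            build_binary_trie(one, bit + 1))

for ch in "acn":
    print(ch, to_bits(ch))

tree = build_binary_trie({to_bits(ch): ch for ch in "acn"})
print(tree)
```

For these three letters the first four bits are identical, so the root tests bit 4, which separates n from a and c; a second test on bit 6 separates a from c.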
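Applications
• Prefix Searching
  • Follow the branches for the query characters, e.g. c a n
  • If no branch for next character, then fail
  • Otherwise enumerate all leaves sharing the prefix
  • Cost: O(height) to follow the prefix, which is O(log n) when the tree is balanced, plus O(size of return set) to enumerate the leaves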
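The search procedure can be sketched on an uncompressed trie that holds every sistring in full. That is a simplification chosen for readability (it uses far more nodes than a PAT tree would), and the names `build_suffix_trie` and `prefix_search` are illustrative.

```python
LEAF = None  # dictionary key under which a leaf stores its sistring offset

def build_suffix_trie(text):
    """Insert every sistring in full (quadratic space; demo only)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[LEAF] = i
    return root

def prefix_search(root, query):
    """Return the sorted offsets of all sistrings that start with query."""
    node = root
    for ch in query:
        if ch not in node:   # no branch for the next character: fail
            return []
        node = node[ch]
    # Enumerate every leaf below the node reached by the prefix.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is LEAF:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

TEXT = ".can.a.can.can.cans?"
root = build_suffix_trie(TEXT)
print(prefix_search(root, "can"))   # offsets where "can" begins
print(prefix_search(root, "cat"))   # no such branch
```

The descent touches one node per query character, while the enumeration visits the whole subtree below the match.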
Applications
• Longest Repetition Search
  • The longest repetition is spelled out by the internal node farthest from the root, e.g. .can.can
  • Simplify on-line calculation by storing a bit to show which direction the longest subtree goes
• Most Frequent N-gram
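Both applications can be illustrated without building the tree, by working on the sorted list of sistrings, which corresponds to the leaves of the tree read from left to right. This is a stand-in technique, not the tree traversal itself: the deepest internal node corresponds to the longest prefix shared by two adjacent sorted sistrings, and the number of leaves below a depth-n node corresponds to the length of a run of equal n-character prefixes.

```python
from itertools import groupby

def longest_repetition(text):
    """Longest substring that starts at two or more offsets."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        if n > len(best):
            best = a[:n]
    return best

def most_frequent_ngram(text, n):
    """Return (ngram, count) for the most frequent n-character substring.
    Ties are resolved in favour of the n-gram that sorts first."""
    grams = sorted(text[i:i + n] for i in range(len(text) - n + 1))
    runs = ((gram, len(list(run))) for gram, run in groupby(grams))
    return max(runs, key=lambda pair: pair[1])

TEXT = ".can.a.can.can.cans?"
print(longest_repetition(TEXT))      # .can.can (offsets 6 and 10, overlapping)
print(most_frequent_ngram(TEXT, 4))  # ('.can', 4)
```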
Problem with PAT Tree
• Enumerating a subtree is costly