Finite State Automata and Tries
Sambhav Jain, IIIT Hyderabad

Think!
• How do you store a dictionary in a computer?
• How do you search for an entry in that dictionary?
• Say every word is exactly 10 characters long and each character can be any letter from 'a' to 'z'.
  E.g. aaaaaaaaaa, abcdefghij, ... etc.
  Language = [a-z]{10} (as a regular expression)

A Simple Way
• aaaaaaaaaa
• aaaaaaaaab
• aaaaaaaaac
• ...
• zzzzzzzzzz
A linear sorted list of entries

A Simple Way
• aaaaaaaaaa, aaaaaaaaab, aaaaaaaaac, ..., zzzzzzzzzz
• Number of entries = 26^10 = 1.41167096 × 10^14
• Characters to be stored = 10 × 26^10 ≈ 1.41 × 10^15
• Each character takes 1 byte, so about 1.4 PB (roughly 1,400 TB)

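The size of the flat list can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes 1 byte per character and decimal units (1 TB = 10^12 bytes), and the variable names are illustrative.

```python
# Rough storage estimate for the flat sorted list of all 10-letter words.
ALPHABET_SIZE = 26   # letters 'a'..'z'
WORD_LENGTH = 10     # every word has exactly 10 characters

num_words = ALPHABET_SIZE ** WORD_LENGTH   # 26^10 distinct words
num_chars = num_words * WORD_LENGTH        # 10 characters per word
terabytes = num_chars / 10**12             # assumes 1 byte per character

print(f"{num_words:,} words")
print(f"{num_chars:.3e} characters, about {terabytes:,.0f} TB")
```
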
Smart Way!
a b c d ... w x y z
a b c d ... w x y z
...
a b c d ... w x y z
• Total storage = 26 × 10 = 260 bytes
• Traverse 10 nodes

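The layered structure can be sketched as one row of 26 letters per character position, walked one level per input character. This is a minimal illustration, not the slide author's implementation; `LEVELS` and `accepts` are names chosen here.

```python
import string

WORD_LENGTH = 10
# One row of 26 letters per character position: 26 x 10 = 260 stored letters.
LEVELS = [set(string.ascii_lowercase) for _ in range(WORD_LENGTH)]

def accepts(word: str) -> bool:
    """Walk one level per character; accept only if all 10 levels match."""
    if len(word) != len(LEVELS):
        return False
    return all(ch in level for ch, level in zip(word, LEVELS))

print(accepts("abcdefghij"))  # a 10-letter lowercase word
print(accepts("abc"))         # too short
```

Because every level allows every letter, this recognizes exactly the language [a-z]{10} while storing 260 letters instead of one entry per word.
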
Does it work for Natural Language?
• Oxford Advanced English Learner 20th Edition
• A quarter of a million distinct English words, excluding inflections, and words from technical and regional vocabulary not covered by the OED
• After inflections?
  • eat, eats, eaten, eating ...
• What after multiple inflections?
  • beauty, beautiful, beautifully ...

Example (Store & Search)
e
|
a
|
t
|-- s
|-- e -- n
`-- i -- n -- g

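Example
[Figure: the trie for eat, eats, eaten, eating extended with a leading b node. Letters shown: b e a t s e i n g n]

Example
[Figure: the trie extended further with branches starting at b and f. Letters shown: b f e a a s t s e i n g n]

Example
[Figure: the trie extended further with a branch w-r-i-t-e. Letters shown: b f e a a s t s e i n g n w r i t e]
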
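Storing and searching words in a trie can be sketched as follows. This is a minimal illustration: the class and method names are chosen here, and the word list is an assumption based on the letters visible in the example figures.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a stored word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Store a word, creating nodes only where the path does not exist yet."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):
        """Follow one node per character; succeed only on a marked word end."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

trie = Trie()
for w in ["eat", "eats", "eaten", "eating", "beat", "write"]:
    trie.insert(w)

print(trie.search("eaten"), trie.search("eatin"))
```

The four forms of "eat" share the path e-a-t, so the common prefix is stored once and a lookup costs one step per character of the query.
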
Inflectional morphology
• Deals with the word forms of a root when there is no change in lexical category.
• Each word form gives different values of features like gender, number, person, etc.

Paradigm
• For a given root, there are many word forms with different features.
• Ex. Forms of the Hindi root laDakA (boy)

Paradigm
- 'laDakoM' is plural with oblique case
  - given by the feature structure {num=pl, case=obl}
- 'laDake' stands for two feature structures
  + singular oblique (Ex. laDake ne kahA ...)
    - where oblique means 'laDake' is followed by a postposition marker
  + plural direct case (Ex. laDake Aye)

Paradigm
o Paradigms
  - What operation is done on the root to obtain word forms
  - Model using pairs: (delete string, add string)

       | direct   oblique
    ---|------------------
    sg | (O,O)    (A,e)
    pl | (A,e)    (A,oM)

o List roots with the paradigms they follow:
  - ghoDA follows paradigm laDakA
  - charkhA follows paradigm laDakA
  - laDakA follows paradigm laDakA

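The (delete string, add string) model can be sketched as a small lookup table applied to a root. This is an illustration only: it assumes that O in the table denotes the empty string, and the names `LADAKA_PARADIGM`, `ROOT_TO_PARADIGM` and `word_forms` are chosen here.

```python
# Paradigm for 'laDakA' as (delete string, add string) pairs.
# "" stands for the empty string, written O in the table (assumption).
LADAKA_PARADIGM = {
    ("sg", "direct"): ("", ""),
    ("sg", "oblique"): ("A", "e"),
    ("pl", "direct"): ("A", "e"),
    ("pl", "oblique"): ("A", "oM"),
}

# Roots listed with the paradigm they follow.
ROOT_TO_PARADIGM = {
    "laDakA": LADAKA_PARADIGM,
    "ghoDA": LADAKA_PARADIGM,
    "charkhA": LADAKA_PARADIGM,
}

def word_forms(root):
    """Generate every word form of a root from its paradigm."""
    forms = {}
    for features, (delete, add) in ROOT_TO_PARADIGM[root].items():
        if not root.endswith(delete):
            raise ValueError(f"{root!r} does not end in {delete!r}")
        stem = root[:len(root) - len(delete)]  # strip the delete string
        forms[features] = stem + add           # append the add string
    return forms

for features, form in word_forms("ghoDA").items():
    print(features, form)
```
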
[Figure: character trie over Hindi word forms. One branch spells l-a-D-a-k and another spells k-a-p-a-D; each ends in the same suffix branches A, e and o-M.]

Abstracting out suffixes
[Figure: the same trie with the repeated suffix branches replaced by a pointer #1; the k-a-p-a-D branch and the l-a-D-a-k branch both end in #1.]
#1: Corresponds to the paradigm for 'laDakA'

- Suffix trie (forward)

          #1
          |
    -------------
    |     |     |
    e     o     A
          |
          M

Can we further optimize our search?
- Use knowledge of paradigms
- Use suffix tree

• Store the suffix tree in main memory
• Store the rest, categorized by paradigm, on hard disk
• Do a backward search in the suffix tree
• Identify the paradigm
• Search only in that paradigm set
• E.g. if '-ing' occurs, you won't first be searching words like home, cat, god ...

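The steps above can be sketched as a reversed suffix trie that is scanned from the end of the word. This is a simplified illustration under stated assumptions: the suffixes come from the laDakA paradigm, the in-memory dictionary `ROOTS_BY_PARADIGM` is a hypothetical stand-in for the per-paradigm sets kept on disk, and all names are chosen here.

```python
# Suffix -> (paradigm name, string to restore on the stem), from the laDakA paradigm.
SUFFIXES = {
    "A": ("laDakA", "A"),
    "e": ("laDakA", "A"),
    "oM": ("laDakA", "A"),
}
# Hypothetical stand-in for the per-paradigm root lists stored on disk.
ROOTS_BY_PARADIGM = {"laDakA": {"laDakA", "ghoDA", "charkhA"}}

def build_reversed_suffix_trie(suffixes):
    root = {}
    for suffix, info in suffixes.items():
        node = root
        for ch in reversed(suffix):      # store suffixes back to front
            node = node.setdefault(ch, {})
        node[None] = info                # key None marks the end of a suffix
    return root

SUFFIX_TRIE = build_reversed_suffix_trie(SUFFIXES)

def analyze(word):
    """Return (root, paradigm) candidates found by scanning the word backward."""
    results = []
    node = SUFFIX_TRIE
    for i in range(len(word) - 1, -1, -1):
        node = node.get(word[i])
        if node is None:
            break
        if None in node:                 # a complete suffix has been matched
            paradigm, restore = node[None]
            root = word[:i] + restore
            # Search only inside the identified paradigm's set of roots.
            if root in ROOTS_BY_PARADIGM[paradigm]:
                results.append((root, paradigm))
    return results

print(analyze("laDakoM"))
print(analyze("ghoDe"))
```
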
Finite State Automata
• A trie is a data structure
• An FSA is the computational approach
• Slight difference in representation
  • Characters are put on edges rather than nodes

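[Figure: the same lexicon drawn as an FSA with characters on the edges. Two paths leave the start state, spelling l-a-D-a-k and k-a-p-a-D; both join a shared state through edges labelled 0. From there, edges e and A lead to final states (+), and edge o followed by edge M leads to a final state (+).]
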
FSA
o A deterministic finite-state machine is formally:
  - Q: A finite set of states (Ex.: {q0, q1, q2})
  - SIGMA: A finite input alphabet (Ex.: {a, b, c})
  - Start state: A state in Q from which the machine starts (Ex.: q0)
  - F: A set of accepting states (Ex.: {q2})
  - DELTA(q, i): A transition function or transition matrix where:
    - q MEMBER Q, i MEMBER SIGMA,
    - DELTA(q, i) MEMBER Q
    Thus, DELTA: Q x SIGMA --> Q

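The formal definition maps directly onto a small data structure. The sketch below reuses the example sets from the definition (states q0, q1, q2, alphabet {a, b, c}, start q0, accepting {q2}); the transition table itself is a made-up example, and a missing transition is treated as rejection, which is an assumption added here because the table is left partial.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    states: set      # Q
    alphabet: set    # SIGMA
    start: str       # start state, a member of Q
    accepting: set   # F
    delta: dict      # DELTA as a table: (state, symbol) -> state

    def accepts(self, text):
        """Run the machine; accept if it ends in an accepting state."""
        state = self.start
        for symbol in text:
            key = (state, symbol)
            if key not in self.delta:  # no transition: reject (assumption)
                return False
            state = self.delta[key]
        return state in self.accepting

# Hypothetical machine recognizing 'a', then any number of 'b', then 'c'.
dfa = DFA(
    states={"q0", "q1", "q2"},
    alphabet={"a", "b", "c"},
    start="q0",
    accepting={"q2"},
    delta={("q0", "a"): "q1", ("q1", "b"): "q1", ("q1", "c"): "q2"},
)

print(dfa.accepts("abbc"), dfa.accepts("ab"))
```
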
RECOGNITION Problem
• Till now we were handling only the RECOGNITION problem
• If the FSA reaches a final state at the end of the input string, then EXIST
• Else NOT

But we seek analyzed output
• We want the machine to tell
  • Root
  • Gender
  • Number
  • Person
  • Case
  • Etc. ...

Finite State Transducer
An FST is like the finite state automaton defined earlier, except each arc is labelled by a pair of symbols i:o, where
  i: symbol in the input string
  o: symbol output by the FST when the arc is taken
+ Ex. arc in a finite state transducer corresponding to 'e' in 'laDake'

         e : ((+pl, +direct), (+sg, -direct))
  q1 +----------------->--------------------+ q2

  Pair of symbols i : o
  - i is: 'e'
  - o is: '((+pl, +direct), (+sg, -direct))'
+ Ex. Morph Analyzer: Match the input with i; if successful, go ahead and produce o in the output

o Formally: Finite state transducer
  - Q: Finite set of states q0, ..., qN
  - SIGMA_IN: Finite set of input symbols
  - SIGMA_OUT: Finite set of output symbols
  - q0: Start state (q0 IN Q)
  - F: Set of final accepting states (F SUBSET Q)
  - DELTA(q, i:o): For every state q, gives the set of states that can be reached from q with i in SIGMA_IN and o in SIGMA_OUT.

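A transducer of this kind can be sketched as a table from (state, input symbol) to a set of (output, next state) arcs. The sketch below is a hypothetical morph analyzer for the forms of 'laDakA': the state numbers, the bracketed output format and the class name `FST` are assumptions made for illustration, and the feature values follow the paradigm slides ('laDake' is singular oblique or plural direct).

```python
from collections import defaultdict

class FST:
    def __init__(self, start, finals):
        self.start = start
        self.finals = set(finals)
        # (state, input symbol) -> list of (output symbol, next state)
        self.arcs = defaultdict(list)

    def add_arc(self, src, inp, out, dst):
        self.arcs[(src, inp)].append((out, dst))

    def transduce(self, text):
        """Return the output of every path that consumes text and ends in a final state."""
        paths = [(self.start, [])]
        for symbol in text:
            paths = [
                (dst, outputs + [out])
                for state, outputs in paths
                for out, dst in self.arcs.get((state, symbol), [])
            ]
        return ["".join(outputs) for state, outputs in paths if state in self.finals]

fst = FST(start=0, finals={6})
for i, ch in enumerate("laDak"):
    fst.add_arc(i, ch, ch, i + 1)            # copy stem letters to the output
fst.add_arc(5, "A", "A[+sg,+direct]", 6)
fst.add_arc(5, "e", "A[+sg,-direct]", 6)     # 'e' is ambiguous: two arcs
fst.add_arc(5, "e", "A[+pl,+direct]", 6)
fst.add_arc(5, "o", "A", 7)
fst.add_arc(7, "M", "[+pl,-direct]", 6)

print(fst.transduce("laDake"))
print(fst.transduce("laDakoM"))
```

Keeping a list of arcs per (state, input) pair is what lets the ambiguous 'e' yield two analyses; a plain recognizer would only report that 'laDake' exists.
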
Example
• on board

Tools for FSA
• Lex
• OpenFST (www.openfst.org/)
• AT&T FSM Toolkit (http://www2.research.att.com/~fsmtools/fsm/)