
Finite State Automata and Tries

Sambhav Jain, IIIT Hyderabad



  1. Finite State Automata and Tries. Sambhav Jain, IIIT Hyderabad

  2. Think!
     • How would you store a dictionary in a computer?
     • How would you search for an entry in that dictionary?
     • Say every word is exactly 10 characters long and each character can be any letter from 'a' to 'z', e.g. aaaaaaaaaa, abcdefghij, ...
     • As a regular expression: Language = [a-z]{10}

  3. A Simple Way
     • aaaaaaaaaa
     • aaaaaaaaab
     • aaaaaaaaac
     • ...
     • zzzzzzzzzz
     A linear sorted list of entries.

  4. A Simple Way
     • aaaaaaaaaa
     • aaaaaaaaab
     • aaaaaaaaac
     • ...
     • zzzzzzzzzz
     Words to be stored = 26^10 = 1.41167096 × 10^14.
     Even at a single byte per word this is ~141 TB; at 1 byte per character (10 bytes per word) it is about 1.41 × 10^15 bytes, i.e. ~1.4 PB.
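The size of the flat list can be checked with a few lines of arithmetic. This is only a sanity check; the constant names are illustrative and the TB/PB figures use decimal units (10^12 and 10^15 bytes).

```python
ALPHABET_SIZE = 26   # letters 'a'..'z'
WORD_LENGTH = 10     # every word has exactly 10 characters

# Number of distinct words in the language [a-z]{10}.
num_words = ALPHABET_SIZE ** WORD_LENGTH

# Storage under two accounting choices (both are assumptions about the layout).
bytes_if_one_byte_per_word = num_words
bytes_if_one_byte_per_char = num_words * WORD_LENGTH

print(num_words)                            # 141167095653376
print(bytes_if_one_byte_per_word / 1e12)    # size in TB
print(bytes_if_one_byte_per_char / 1e15)    # size in PB
```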

  5. Smart Way!
     [Figure: rows of the letters a b c d ... w x y z, one row per character position; a word is a path that picks one letter from each row.]

  6. Smart Way!
     [Figure: rows of the letters a b c d ... w x y z, one row per character position; a word is a path that picks one letter from each row.]
     • Total storage = 26 x 10 = 260 bytes
     • Traverse 10 nodes
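A minimal sketch of the layered idea, assuming one level per character position and one node per letter at each level. The names `levels` and `accepts` are made up for this illustration; the structure only works this compactly because every letter is allowed at every position.

```python
import string

LEVELS = 10  # one level per character position

# Each level holds the 26 letters allowed at that position.
levels = [set(string.ascii_lowercase) for _ in range(LEVELS)]

def accepts(word):
    """Visit one node per level; reject on wrong length or unknown letter."""
    if len(word) != LEVELS:
        return False
    return all(ch in level for ch, level in zip(word, levels))

# 26 nodes on each of 10 levels.
total_nodes = sum(len(level) for level in levels)

print(total_nodes)
print(accepts("abcdefghij"), accepts("abc"))
```

Lookup touches exactly 10 nodes, matching the "traverse 10 nodes" figure on the slide.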

  7. Does it work for Natural Language?
     • Oxford Advanced English Learner 20th Edition
     • A quarter of a million distinct English words, excluding inflections, and words from technical and regional vocabulary not covered by the OED
     • What about inflections?
       eat, eats, eaten, eating, ...
     • What about multiple inflections?
       beauty, beautiful, beautifully, ...

  8. Example (Store & Search)
     [Figure: trie for e-a-t with branches s, e-n and i-n-g, i.e. eat, eats, eaten, eating.]

  9. Example
     [Figure: the same trie extended with entries starting with 'b'.]

  10. Example
     [Figure: the same trie extended with entries starting with 'b' and 'f'.]

  11. Example
     [Figure: the same trie extended with a branch w-r-i-t-e.]
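The store-and-search examples above can be sketched as a small trie. This is a minimal illustration with characters on nodes reached through a `children` dictionary; the class and method names are chosen for this sketch, and only the word forms of "eat" mentioned earlier are inserted.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Store a word, reusing nodes for any prefix already present."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        """Search by following one child per character."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


trie = Trie()
for w in ["eat", "eats", "eaten", "eating"]:
    trie.insert(w)

print(trie.contains("eaten"))  # stored word
print(trie.contains("eate"))   # only a prefix, not a stored word
```

The shared prefix "eat" is stored once, which is the saving the trie offers over a flat list.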

  12. Inflectional morphology
     • Deals with the word forms of a root when there is no change in lexical category.
     • Each word form gives different values of features such as gender, number, person, etc.

  13. Paradigm
     • For a given root, there are many word forms with different features.
     • Ex. forms of the Hindi root laDakA (boy)

  14. Paradigm
     - 'laDakoM' is plural with oblique case
       - given by the feature structure {num=pl, case=obl}
     - 'laDake' stands for two feature structures
       + singular oblique (Ex. laDake ne kahA ...)
         - where oblique means 'laDake' is followed by a postposition marker
       + plural direct case (Ex. laDake Aye)

  15. Paradigm
     o Paradigms
       - What operation is done on the root to obtain word forms
       - Model using pairs: (delete string, add string), where 0 stands for the empty string

            | direct   oblique
         ---|------------------
         sg | (0,0)    (A,e)
         pl | (A,e)    (A,oM)

     o List roots with the paradigms they follow:
       - ghoDA follows paradigm laDakA
       - charkhA follows paradigm laDakA
       - laDakA follows paradigm laDakA
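The (delete string, add string) model above can be sketched directly as a lookup table. This is an illustrative sketch: the dictionary layout and the function name `word_forms` are assumptions, and the empty pair is written as two empty strings.

```python
# Paradigm table: features -> (delete string, add string).
PARADIGMS = {
    "laDakA": {
        ("sg", "direct"): ("", ""),
        ("sg", "oblique"): ("A", "e"),
        ("pl", "direct"): ("A", "e"),
        ("pl", "oblique"): ("A", "oM"),
    }
}

# Which paradigm each root follows (taken from the list above).
ROOT_PARADIGM = {"laDakA": "laDakA", "ghoDA": "laDakA", "charkhA": "laDakA"}


def word_forms(root):
    """Apply every (delete, add) pair of the root's paradigm to the root."""
    forms = {}
    for features, (delete, add) in PARADIGMS[ROOT_PARADIGM[root]].items():
        if not root.endswith(delete):
            raise ValueError(f"{root!r} does not end in {delete!r}")
        stem = root[: len(root) - len(delete)]  # strip the deleted suffix
        forms[features] = stem + add            # attach the added suffix
    return forms


for features, form in word_forms("ghoDA").items():
    print(features, form)
```

Adding a new root then costs one entry in `ROOT_PARADIGM` instead of four separately stored word forms.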

  16. [Figure: trie over Hindi word forms with branches starting l-a-D-... and k-a-p-a-D-...; the suffix branches (A, e, o-M) are repeated under each root.]

  17. Abstracting out suffixes
     [Figure: the same trie with the repeated suffix branches replaced by a pointer #1 at the end of each root.]
     #1: Corresponds to the paradigm for 'laDakA'

  18. Suffix trie (forward)

              #1
               |
        --------------
        |      |     |
        e      o     A
               |
               M

  19. Can we further optimize our search?
     - Use knowledge of paradigms
     - Use the suffix tree

  20. • Store the suffix tree in main memory
     • Store the rest of the entries, categorized by paradigm, on hard disk
     • Do a backward search in the suffix tree
     • Identify the paradigm
     • Search only in that paradigm's set
     • E.g. if '-ing' occurs, you will not start by searching words like home, cat, god, ...
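The steps above can be sketched as follows. Everything here is a toy assumption made for illustration: the suffix table, the paradigm names "verb" and "noun", the stem sets standing in for on-disk storage, and the "$" key used to mark the end of a suffix.

```python
def build_suffix_trie(suffix_table):
    """Store each suffix reversed so it can be matched from the word's end."""
    root = {}
    for suffix, paradigm in suffix_table:
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(paradigm)  # "$" marks a complete suffix
    return root


def candidate_splits(word, suffix_trie):
    """Walk the word backward; return (stem, paradigm), longest suffix first."""
    node, results = suffix_trie, []
    for i in range(len(word), -1, -1):
        for paradigm in node.get("$", []):
            results.append((word[:i], paradigm))
        if i == 0 or word[i - 1] not in node:
            break
        node = node[word[i - 1]]
    return results[::-1]


# In-memory suffix trie (toy data).
SUFFIX_TRIE = build_suffix_trie(
    [("", "verb"), ("s", "verb"), ("ing", "verb"), ("", "noun"), ("s", "noun")]
)

# Stand-in for the per-paradigm sets kept on disk (toy data).
STEMS_BY_PARADIGM = {"verb": {"eat", "write"}, "noun": {"home", "cat", "god"}}


def lookup(word):
    """Search a stem only in the set of the paradigm its suffix points to."""
    return [(stem, p) for stem, p in candidate_splits(word, SUFFIX_TRIE)
            if stem in STEMS_BY_PARADIGM[p]]


print(lookup("eating"))
print(lookup("cats"))
```

For "eating", the '-ing' match sends the stem "eat" to the verb set first; the noun set is consulted only for the fallback split with an empty suffix.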

  21. Finite State Automata
     • A trie is a data structure
     • An FSA is the computational approach
     • Slight difference in representation
     • Characters are put on edges rather than on nodes

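  22. [Figure: FSA with states drawn as '+' and characters on edges. From the start state, one path spells l-a-D-a-k and another spells k-a-p-a-D; both join a shared state through edges labelled 0. From that state, edge e leads to an accepting state (+), edge o followed by M leads to an accepting state (+), and edge A leads to an accepting state (+).]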

  23. FSA
     o A deterministic finite-state machine is formally defined by
       - Q: a finite set of states (Ex.: {q0, q1, q2})
       - SIGMA: a finite input alphabet (Ex.: {a, b, c})
       - Start state: a state in Q from which the machine starts (Ex.: q0)
       - F: a set of accepting states, F SUBSET Q (Ex.: {q2})
       - DELTA(q, i): a transition function or transition matrix, where
         - q MEMBER Q, i MEMBER SIGMA,
         - DELTA(q, i) MEMBER Q
       Thus, DELTA: Q x SIGMA --> Q

  24. RECOGNITION Problem
     • Until now we were handling only the RECOGNITION problem
     • If the FSA reaches a final state at the end of the input string, then EXIST
     • Else NOT
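A minimal sketch of recognition with a deterministic FSA, using the example sets {q0, q1, q2}, {a, b, c} and accepting set {q2} from the definition. The transitions themselves are invented for this illustration, and treating a missing transition as rejection is an assumption of the sketch (the formal DELTA is total).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DFA:
    states: frozenset     # Q
    alphabet: frozenset   # SIGMA
    start: str            # start state
    accepting: frozenset  # F
    delta: dict           # DELTA as a table: (state, symbol) -> state

    def accepts(self, text):
        """Return True (EXIST) if the input ends in an accepting state."""
        state = self.start
        for symbol in text:
            key = (state, symbol)
            if key not in self.delta:  # assumption: undefined move = reject
                return False
            state = self.delta[key]
        return state in self.accepting


# Hypothetical transitions: q0 -a-> q1, q1 -b-> q2, q2 -c-> q2.
dfa = DFA(
    states=frozenset({"q0", "q1", "q2"}),
    alphabet=frozenset({"a", "b", "c"}),
    start="q0",
    accepting=frozenset({"q2"}),
    delta={("q0", "a"): "q1", ("q1", "b"): "q2", ("q2", "c"): "q2"},
)

for text in ["ab", "abcc", "a", "ba"]:
    print(text, "EXIST" if dfa.accepts(text) else "NOT")
```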

  25. But we seek analyzed output
     • We want the machine to tell us the
       • Root
       • Gender
       • Number
       • Person
       • Case
       • Etc.

  26. Finite State Transducer
     An FST is like the finite state automaton defined earlier, except that each arc is labelled by a pair of symbols i:o, where
       i: symbol in the input string
       o: symbol output by the FST when the arc is taken
     + Ex. arc in a finite state transducer corresponding to 'e' in 'laDake'

                 e : ((+pl, +direct), (+sg, -direct))
          q1 +----------------->--------------------+ q2

       Pair of symbols i : o
       - i is: 'e'
       - o is: '((+pl, +direct), (+sg, -direct))'
     + Ex. Morph Analyzer: match the input with i; if successful, go ahead and produce o in the output

  27. o Formally: Finite state transducer
       - Q: finite set of states q0, ..., qN
       - SIGMA_IN: finite set of input symbols
       - SIGMA_OUT: finite set of output symbols
       - q0: start state (q0 IN Q)
       - F: set of final accepting states (F SUBSET Q)
       - DELTA(q, i:o): for every state q, gives the set of states that can be reached from q with i in SIGMA_IN and o in SIGMA_OUT.
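A minimal sketch of a transducer as a morphological analyzer for the forms of laDakA. The state names, the `add_arc` and `transduce` helpers, and the output notation `A{num=...,case=...}` are assumptions of this sketch; the ambiguous 'e' is modelled as two arcs with the same input symbol rather than one arc carrying both feature structures.

```python
arcs = {}  # DELTA as a table: (state, input symbol) -> [(output, next state)]


def add_arc(src, i, o, dst):
    """Add an arc labelled i:o from src to dst."""
    arcs.setdefault((src, i), []).append((o, dst))


# Stem arcs copy the input to the output: l:l, a:a, D:D, a:a, k:k.
STEM = "laDak"
for pos, ch in enumerate(STEM):
    add_arc(pos, ch, ch, pos + 1)

END = "end"          # single accepting state
s = len(STEM)        # state reached after the stem
add_arc(s, "A", "A{num=sg,case=dir}", END)
add_arc(s, "e", "A{num=sg,case=obl}", END)  # 'e' is ambiguous:
add_arc(s, "e", "A{num=pl,case=dir}", END)  # two arcs, two analyses
add_arc(s, "o", "", "oM")                   # emit nothing until 'M' is seen
add_arc("oM", "M", "A{num=pl,case=obl}", END)


def transduce(text, start=0, finals=frozenset({END})):
    """Return every output emitted on a path that consumes text and accepts."""
    paths = [(start, "")]
    for symbol in text:
        paths = [(dst, out + o)
                 for state, out in paths
                 for o, dst in arcs.get((state, symbol), [])]
    return [out for state, out in paths if state in finals]


print(transduce("laDake"))
print(transduce("laDakoM"))
```

Each output is the root followed by its feature structure, which is the analyzed output the recognizer alone could not provide.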

  28. Example
     • On board

  29. Tools for FSA
     • Lex
     • OpenFST (www.openfst.org/)
     • AT&T FSM Toolkit (http://www2.research.att.com/~fsmtools/fsm/)
