210 likes | 355 Views
LING/C SC/PSYC 438/538. Lecture 13 10/7 Sandiway Fong. Administrivia. Homework 3 graded Homework 4 out today, due next Tuesday by midnight. Homework 4. Article247_3400.txt. Raw data: line breaks. (ANC – American National Corpus: 100 million words)
E N D
LING/C SC/PSYC 438/538 Lecture 13 10/7 Sandiway Fong
Administrivia • Homework 3 • graded • Homework 4 • out today, due next Tuesday by midnight
Homework 4 Article247_3400.txt Raw data: line breaks • (ANC – American National Corpus: 100 million words) • Genre: journal, (Slate Magazine article from 1998)
Homework 4 • Write a Perl program that takes the supplied raw text and produces html-style formatted output. • Example: <s>Will There Be Life After Greenspan?</s> <p> <s>If you blinked you missed it, but for a short while yesterday morning the stock and bond markets dived after a rumor that Alan Greenspan was resigning hit the Street.</s> <s>The story was quickly ... well, it wasn't exactly refuted, since Greenspan didn't say actually say "I'm not resigning," but it was rejected as unlikely, and both markets rebounded nicely.</s> <p>
Homework 4 • 2nd corpus file (also from Slate.com): Article559_10072013.txt
Homework 4 • Instructions: • i.e. your program should produce reformatted raw text as • <p> • <s>sentence 1</s> • <s>sentence 2</s> • … • Use <p> as a paragraph separator. • Each <s>..</s> should occupy exactly one line of your output. • Leading and trailing spaces of a sentence should be deleted, e.g. • <s> If you blinked you missed it, … • vs. <s>If you blinked you missed it, … • Print out the total number of lines detected for each article • Print out the total number of paragraphs for each article • Submit your program and its output on Article247_3400.txt and Article559_10072013.txt
Applications of FSA • Let’s take a look at one • Efficient String Matching
String Matching • Example: • Pattern (P): aaba • Corpus (C): aaabbaabab • Naïve algorithm: • compare Pattern against Corpus from left to right a character at a time • P: aaba • C: aaabbaabab • P: _aaba • C: aaabbaabab • P: __aaba • C: aaabbaabab • P: ___aaba • C: aaabbaabab • P: ____aaba • C: aaabbaabab • P: _____aaba • C: aaabbaabab Matched!
Knuth-Morris-Pratt (KMP) • can do better (i.e. use fewer comparisons) if we use a FSA to represent the Pattern • plus extra links for restarts in the event of a failure • Example • Pattern: aaba all backpointers are restarts for failed character a a a a b b a a 0 0 1 1 2 2 3 3 4 4 ^a ^b ^a ^a • Suppose the alphabet was limited to just {a,b}, then restarts can be for the character following the failed character b a b b
Knuth-Morris-Pratt (KMP) • Example: • Pattern: aaba • Corpus {a,b}: aaabbaabab • KMP: a a b a 0 1 2 3 4 • P: aaba • C: aaabbaabab • P: _aaba • C: aaabbaabab • P: _____aaba • C: aaabbaabab • Other possible algorithms: • e.g. Boyer-Moore (start match from the back of the pattern) b a b b
Regular Languages • Three formalisms, same expressive power • Regular expressions • Finite State Automata • Regular Grammars We’ll look at this case using a logic programming language: Prolog
SWI Prolog • Install on your laptop: • On Mac or Linux • Use terminal/shell • Executable is /opt/local/bin/swipl http://www.swi-prolog.org/download/stable
Chomsky Hierarchy Chomsky Hierarchy • division of grammar into subclasses partitioned by “generative power/capacity” • Type-0 General rewrite rules • Turing-complete, powerful enough to encode any computer program • can simulate a Turing machine • anything that’s “computable” can be simulated using a Turing machine • Type-1 Context-sensitive rules • weaker, but still very power • anbncn • Type-2 Context-free rules • weaker still • anbn Pushdown Automata (PDA) • Type-3 Regular grammar rules • very restricted • Regular Expressionsa+b+ • Finite State Automata (FSA) finite state machine tape read/write head Turing machine: artist’s conception from Wikipedia
FSA Regular Expressions Regular Grammars Chomsky Hierarchy Type-1 Type-3 Type-2 DCG = Type-0
Prolog Grammar Rule System • known as “Definite Clause Grammars” (DCG) • based on type-2 (context-free grammars) • but with extensions • (powerful enough to encode the hierarchy all the way up to type-0) • Prolog was originally designed (1970s) to also support natural language processing • we’ll start with the bottom of the hierarchy • i.e. the least powerful • regular grammars (type-3)
Definite Clause Grammars (DCG) • Background • a “typical” formal grammar contains 4 things • <N,T,P,S> • a set of non-terminal symbols (N) • these symbols will be expanded or rewritten by the rules • a set of terminal symbols (T) • these symbols cannot be expanded • production rules (P) of the form • LHS RHS • In regular and CF grammars, LHS must be a single non-terminal symbol • RHS: a sequence of terminal and non-terminal symbols: possibly with restrictions, e.g. for regular grammars • a designated start symbol (S) • a non-terminal to start the derivation • Language • set of terminal strings generated by <N,T,P,S> • e.g. through a top-down derivation
Background a “typical” formal grammar contains 4 things <N,T,P,S> a set of non-terminal symbols (N) a set of terminal symbols (T) production rules (P) of the form LHS RHS a designated start symbol (S) Example grammar (regular) S aB BaB BbC B b CbC Cb Notes: Start symbol: S Non-terminals: {S,B,C} (uppercase letters) Terminals: {a,b} (lowercase letters) Definite Clause Grammars (DCG)
Example Formal grammar Prolog format S aBs--> [a],b. BaBb--> [a],b. BbCb--> [b],c. B bb--> [b]. CbCc--> [b],c. Cbc--> [b]. Notes: Start symbol: S Non-terminals: {S,B,C} (uppercase letters) Terminals: {a,b} (lowercase letters) Prolog format: both terminals and non-terminal symbols begin with lowercase letters variables begin with an uppercase letter (or underscore) --> is the rewrite symbol terminals are enclosed in square brackets (list notation) nonterminals don’t have square brackets surrounding them the comma (,: and) represents the concatenation symbol a period (.) is required at the end of every DCG rule DefiniteClause Grammars (DCG)
Regular Grammars • Regular or Chomsky hierarchy type-3grammars • are a class of formal grammars with a restricted RHS • LHS → RHS “LHS rewrites/expands to RHS” • all rules contain only a single non-terminal, and (possibly) a single terminal) on the right hand side • Canonical Forms: x --> y, [t].x --> [t]. (left recursive) or x --> [t], y.x --> [t]. (right recursive) • where x and y are non-terminal symbols and • t (enclosed in square brackets) represents a terminal symbol. • Note: • can’t mix these two forms (and still have a regular grammar)! • can’t have both left and right recursive rules in the same grammar Terminology: or “left/right linear”
Definite Clause Grammars (DCG) • What language does our regular grammar generate? • by writing the grammar in Prolog, • we have a ready-made recognizer program • no need to write a separate grammar rule interpreter (in this case) • Example queries • ?-s([a,a,b,b,b],[]).Yes • ?-s([a,b,a],[]).No • Note: • Query uses the start symbol s with two arguments: • (1) sequence (as a list) to be recognized and • (2) the empty list [] 1. s --> [a],b. 2. b--> [a],b. 3. b--> [b],c. 4. b --> [b]. 5. c --> [b],c. 6. c --> [b]. one or more a’s followed by one or more b’s Prolog lists: In square brackets, separated by commas e.g. [a] [a,b,c]
Definite Clause Grammars (DCG) • Tree representations • Examples • ?-s([a,a,b,b,b],[]). • ?-s([a,a,b],[]).