Computational linguistics September 4, 2002 CSE 467/567
RE review (Perl syntax)
• single-character disjunction: [aeiou]
• ranges: [0-9]
• negation: [^aeiou]
• concatenation (sequence): /cat/
• matching zero or one: /cats?/
• Kleene * and +: /[ab]+/ matches ‘a’, ‘b’, ‘aa’, ‘ab’, ‘ba’, ‘bb’, etc.
• wildcard: /c.t/ matches “cat”, “cbt”, “cct”, …
• anchors: ^, $, \b, \B
• grouping: ()
• disjunction: |
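The constructs in this review can be tried out directly in Perl. The following is a minimal sketch, not part of the original slides: the helper name matching and the sample word list are made up for illustration. Several patterns are anchored with ^ and $ so that they must match the whole word rather than any substring.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): return the words in @words
# that the compiled pattern matches.
sub matching {
    my ($pattern, @words) = @_;
    return grep { $_ =~ $pattern } @words;
}

# Illustrative sample data, chosen only to exercise the patterns below.
my @sample = qw(cat cats cut ct scatter dog 42);

my @demos = (
    [ 'disjunction [aeiou]', qr/[aeiou]/       ],  # contains a vowel
    [ 'range [0-9]',         qr/[0-9]/         ],  # contains a digit
    [ 'negation [^aeiou]',   qr/^[^aeiou]+$/   ],  # only non-vowel characters
    [ 'zero or one: s?',     qr/^cats?$/       ],  # "cat" or "cats"
    [ 'wildcard c.t',        qr/c.t/           ],  # c, any character, t
    [ 'anchor \b',           qr/\bcat\b/       ],  # "cat" as a whole word
    [ 'grouping and |',      qr/^(cat|dog)s?$/ ],  # either word, optional s
);

for my $demo (@demos) {
    my ($label, $pattern) = @$demo;
    printf "%-22s %s\n", $label, join(' ', matching($pattern, @sample));
}

# Kleene plus: one or more characters drawn from the set {a, b}.
print join(' ', matching(qr/^[ab]+$/, qw(a ab bba abc))), "\n";
```

Note that /c.t/ does not match "ct", because the wildcard must consume exactly one character, while it does match inside "scatter", because an unanchored pattern may match anywhere in the string.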
Replacement
In addition to matching, we can do replacements when a match is found.
Example: To replace the British spelling “colour” with the American spelling “color”, we can write:
s/colour/color/
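As written, s/colour/color/ replaces only the first match in the string it is applied to. A minimal sketch of a slightly more general version follows; the function name americanize is hypothetical, and the /g flag, the \b anchor, and the captured first letter are additions that go beyond the slide's example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): return a copy of $text with
# "colour" respelled as "color" wherever it begins a word.
sub americanize {
    my ($text) = @_;
    # Copy first, then substitute on the copy so the argument is untouched.
    # ([Cc]) captures the first letter so capitalisation is preserved;
    # /g replaces every occurrence, not just the first one.
    (my $out = $text) =~ s/\b([Cc])olour/$1olor/g;
    return $out;
}

print americanize("Colours: the colour of a colourful flag"), "\n";
```

Because the pattern is anchored only at the start of the word, derived forms such as "colourful" and "colours" are rewritten as well.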
Eliza
• Published by Weizenbaum in 1966
• Modelled a Rogerian therapist
• Had no intelligence – worked by pattern matching and replacement
• Had some people convinced that it really understood!
• demo at http://chayden.net/eliza/Eliza.shtml
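To make "pattern matching and replacement" concrete, here is a minimal sketch of an Eliza-style responder. The rules, the response templates, and the function name respond are invented for illustration and are not taken from Weizenbaum's program: each rule pairs a regular expression with a template, and the text captured by the first matching rule is substituted into the reply.

```perl
use strict;
use warnings;

# Illustrative rules only (an assumption, not Weizenbaum's original script).
# Each rule is [ pattern, reply template with one %s slot ].
my @rules = (
    [ qr/\bI am (.+)/i,            'Why do you say you are %s?'  ],
    [ qr/\bI feel (.+)/i,          'How long have you felt %s?'  ],
    [ qr/\bmy (mother|father)\b/i, 'Tell me more about your %s.' ],
);

# Hypothetical helper: return the reply for the first rule that matches,
# or a stock phrase when no rule applies.
sub respond {
    my ($input) = @_;
    $input =~ s/[.!?]+\s*$//;    # drop trailing punctuation
    for my $rule (@rules) {
        my ($pattern, $template) = @$rule;
        # In list context a successful match returns the captured groups.
        if (my ($captured) = $input =~ $pattern) {
            return sprintf($template, $captured);
        }
    }
    return 'Please go on.';
}

print respond("I am sad."), "\n";
print respond("My mother hates me"), "\n";
print respond("Hello"), "\n";
```

Nothing here models meaning: the reply is assembled purely from the matched text, which is the point the slide makes about the program having no intelligence.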
Wordcount program
• Unix wordcount program (wc) counts lines, words and characters
• Determining probabilities of words has many applications:
  • augmentative communication
  • context-sensitive spelling error correction
  • speech recognition
  • handwriting recognition
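The three counts that wc reports can be approximated with regular expressions. This is a rough sketch under stated assumptions, not a reimplementation of wc: the function name wc_counts is hypothetical, a "word" is taken to be a maximal run of non-whitespace characters, and length counts characters of the Perl string rather than bytes on disk.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): return the number of lines,
# words and characters in a string.
sub wc_counts {
    my ($text) = @_;
    # "() =" forces list context, so each global match yields all of its
    # matches and the scalar assignment stores how many there were.
    my $lines = () = $text =~ /\n/g;     # newline characters
    my $words = () = $text =~ /\S+/g;    # runs of non-whitespace
    my $chars = length $text;
    return ($lines, $words, $chars);
}

my ($lines, $words, $chars) = wc_counts("the cat sat\non the mat\n");
print "$lines $words $chars\n";
```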
Counting words in a corpus (preview)

    #!/usr/bin/perl
    # From the Perl book, page 39
    $/ = "";    # Enable paragraph mode.
    # The original also set $* = 1 to enable multi-line patterns. $* is no
    # longer supported in modern Perl, and no pattern below needs it.

    # Now read each paragraph and split into words. Record each
    # instance of a word in the %wordcount associative array.
    $total = 0;
    while (<>) {
        s/-\n//g;       # Dehyphenate hyphenations (across lines)
        s/<s>//g;       # Remove <s>
        tr/A-Z/a-z/;    # Canonicalize to lowercase.
        @words = split(/\W*\s+\W*/, $_);
        foreach $word (@words) {
            next if $word eq "";    # Skip the empty field split can leave at the start.
            $wordcount{$word}++;    # Increment the entry.
            $total++;
        }
    }

    # Now print out all the entries in the %wordcount array
    foreach $word (sort keys(%wordcount)) {
        printf "(%8.6f%%) %20s occurs %3d time(s)\n",
            (100 * $wordcount{$word} / $total), $word, $wordcount{$word};
    }
    # $total counts every word occurrence; the hash keys are the distinct words.
    printf "Total number of words is %d.\n", $total;
    printf "Total number of distinct words is %d.\n", scalar(keys %wordcount);
Your turn!
• regular expressions
• finite automata
• regular languages