RE review (Perl syntax)
• single-character disjunction: [aeiou]
• ranges: [0-9]
• negation: [^aeiou]
• conjunction: /cat/
• matching zero or one: /cats?/
• Kleene * and +: /[ab]+/ matches ‘a’, ‘b’, ‘aa’, ‘ab’, ‘ba’, ‘bb’, etc.
• wildcard: /c.t/ matches “cat”, “cbt”, “cct”, …
• anchors: ^, $, \b, \B
/projects/CSE467/Resources/Code/Perl
CSE 467/567
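
The operators reviewed above can be tried out directly in Perl. The following is a minimal sketch, not taken from the slides: the helper `matches` and the test strings are made up for illustration, and the `^` and `$` anchors in the Kleene example are added so that the whole string has to consist of a's and b's.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): 1 if $text matches $pattern, else 0.
sub matches {
    my ($pattern, $text) = @_;
    return $text =~ $pattern ? 1 : 0;
}

# Each entry: pattern, test string, what the pair illustrates.
my @checks = (
    [ qr/[aeiou]/,  'sky',         'disjunction: no vowel in the string' ],
    [ qr/[0-9]/,    'room 42',     'range: a digit is present' ],
    [ qr/[^aeiou]/, 'aeiou',       'negation: every character is a vowel' ],
    [ qr/cats?/,    'cat',         'zero or one: the s is optional' ],
    [ qr/^[ab]+$/,  'abba',        'Kleene +: one or more of a, b' ],
    [ qr/c.t/,      'ct',          'wildcard: . needs exactly one character' ],
    [ qr/\bcat\b/,  'concatenate', '\b: cat is not a separate word here' ],
    [ qr/\Bcat\B/,  'concatenate', '\B: cat sits inside a word' ],
);

for my $check (@checks) {
    my ($pattern, $text, $note) = @$check;
    my $result = matches($pattern, $text) ? 'match' : 'no match';
    print "$text: $result ($note)\n";
}
```

Without anchors, a pattern succeeds if it matches anywhere in the string, which is why /cats?/ would also match inside ‘concatenate’.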
Conjunction
Two regular expressions are conjoined by juxtaposition (placing the expressions side by side). This operation is more commonly called concatenation. Examples:
• /a/ matches ‘a’
• /m/ matches ‘m’
• /am/ matches ‘am’ but not ‘a’ or ‘m’ alone
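
A short sketch of the /am/ example, assuming a hypothetical helper `matches_am` and a few made-up test strings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text contains 'a' immediately followed by 'm'.
sub matches_am {
    my ($text) = @_;
    return $text =~ /am/ ? 1 : 0;
}

# 'ham' matches because an unanchored pattern may match anywhere in the string;
# 'ma' does not, because juxtaposition fixes the order of the two parts.
for my $text ('am', 'a', 'm', 'ham', 'ma') {
    my $result = matches_am($text) ? 'match' : 'no match';
    print "$text: $result\n";
}
```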
Disjunction
We have already seen disjunction of characters using the square bracket notation. General disjunction is expressed using the vertical bar (|), also called the pipe symbol. This form of disjunction allows us to match any one of several alternative patterns, not just single characters as with the [ ] form. For example, /cat|dog/ matches ‘cat’ or ‘dog’.
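
To make the contrast between the two forms of disjunction concrete, here is a small sketch. The helper names and the sample strings are assumptions chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: general disjunction over whole alternative patterns.
sub mentions_pet {
    my ($text) = @_;
    return $text =~ /cat|dog|goldfish/ ? 1 : 0;
}

# Hypothetical helper: bracket disjunction chooses between single characters only.
sub has_c_or_d {
    my ($text) = @_;
    return $text =~ /[cd]/ ? 1 : 0;
}

for my $text ('my dog barks', 'my bird sings', 'a cold day') {
    printf "%-14s pet word: %d, c or d: %d\n",
        $text, mentions_pet($text), has_c_or_d($text);
}
```

‘my bird sings’ and ‘a cold day’ both contain a ‘c’ or a ‘d’ but none of the three alternatives, so only the bracket form matches them.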
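Grouping
• Parentheses, ‘(’ and ‘)’, are used to group subpatterns of a larger pattern.
• Ex: /[Gg](ee|oo)se/ matches ‘geese’, ‘Geese’, ‘goose’ and ‘Goose’.
• Because | has lower precedence than juxtaposition, the alternatives must sit inside one group: /[Gg](ee)|(oo)se/ would instead match ‘Gee’, ‘gee’ or ‘oose’.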
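
Where the parentheses go matters, because | binds more loosely than juxtaposition. The sketch below compares /[Gg](ee|oo)se/ with /[Gg](ee)|(oo)se/ on a few made-up words; the helper names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Alternatives inside one group: [Gg], then 'ee' or 'oo', then 'se'.
sub grouped_match {
    my ($text) = @_;
    return $text =~ /[Gg](ee|oo)se/ ? 1 : 0;
}

# Alternation at the top level: either '[Gg]ee' or 'oose'.
sub split_match {
    my ($text) = @_;
    return $text =~ /[Gg](ee)|(oo)se/ ? 1 : 0;
}

for my $word ('goose', 'Geese', 'Gee', 'moose') {
    printf "%-6s grouped: %d, split: %d\n",
        $word, grouped_match($word), split_match($word);
}
```

‘Gee’ and ‘moose’ are accepted only by the second pattern, which shows that it does not express "goose or geese".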
Replacement
In addition to matching, we can do replacements when a match is found. Example: to replace the British spelling ‘colour’ with the American spelling ‘color’, we can write:
s/colour/color/
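
A minimal sketch of the substitution operator in use. The helper names and the sample sentence are assumptions; the /g modifier, which is not shown on the slide, makes the substitution apply to every match rather than only the first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace only the first occurrence, as s/colour/color/ does.
sub americanize_first {
    my ($text) = @_;
    $text =~ s/colour/color/;    # works on the local copy, not the caller's string
    return $text;
}

# Hypothetical helper: replace every occurrence by adding the /g modifier.
sub americanize_all {
    my ($text) = @_;
    $text =~ s/colour/color/g;
    return $text;
}

my $sentence = 'one colour, two colours';
print americanize_first($sentence), "\n";
print americanize_all($sentence), "\n";
```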
Registers: saving matches
• To save a match from part of a pattern, to reuse it later on, Perl provides registers
• Registers are named \#, where # is the number of the register; the text matched by the n-th parenthesized group is saved in register n
• Ex.
  DE DO DO DO DE DA DA DA
  IS ALL I WANT TO SAY TO YOU
• /(D[AEO].)*/ will match the first line
• /(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3/ matches it more specifically (written without spaces, since a space inside a pattern must itself be matched unless the /x modifier is used)
• This pattern also matches strings like DA DE DE DE DA DO DO DO
• \s matches a whitespace character
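
The register pattern can be checked with a short script. This is a sketch under assumptions: the pattern is written without spaces between its parts, and the helper `parse_line` is hypothetical. Outside the pattern, Perl exposes the saved registers as $1, $2 and $3.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group 1 is the opening syllable, group 2 the repeated second syllable
# (with its leading separator), group 3 the repeated closing syllable.
my $song = qr/(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3/;

# Hypothetical helper: returns the three registers, or an empty list if no match.
sub parse_line {
    my ($line) = @_;
    return ($1, $2, $3) if $line =~ $song;
    return;
}

for my $line ('DE DO DO DO DE DA DA DA',
              'DA DE DE DE DA DO DO DO',
              'DE DO DO DA DE DA DA DA') {
    my @registers = parse_line($line);
    if (@registers) {
        print "$line: registers [", join('|', @registers), "]\n";
    }
    else {
        print "$line: no match\n";
    }
}
```

The third line fails because its fourth syllable does not repeat what register 2 saved.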
For more information
• Perl regular expression tutorial (perlretut): http://perldoc.perl.org/perlretut.html
• Perl regular expression reference page (perlre): http://perldoc.perl.org/perlre.html
Eliza
• Published by Weizenbaum in 1966
• Modelled a Rogerian therapist
• Had no intelligence: worked by pattern matching and replacement
• Had some people convinced that it really understood!
• demo at http://chayden.net/eliza/Eliza.shtml
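
The idea of answering by pattern matching and replacement can be sketched in a few lines. The rules below are invented for illustration and are not Weizenbaum's original script; `respond` is a hypothetical helper that reuses the captured text in its reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up rules: a pattern plus a reply builder that receives the captured text.
my @rules = (
    [ qr/\bI am (.+)/i,            sub { "Why do you say you are $_[0]?" } ],
    [ qr/\bI feel (.+)/i,          sub { "How long have you felt $_[0]?" } ],
    [ qr/\bmy (mother|father)\b/i, sub { "Tell me more about your $_[0]." } ],
);

# Hypothetical helper: the first rule that matches decides the reply.
sub respond {
    my ($input) = @_;
    $input =~ s/[.!?]+\s*$//;    # drop trailing punctuation before matching
    for my $rule (@rules) {
        my ($pattern, $reply) = @$rule;
        return $reply->($1) if $input =~ $pattern;
    }
    return "Please go on.";      # fallback when no rule applies
}

for my $line ('I am sad.', 'My mother never calls.', 'Hello there') {
    print "> $line\n", respond($line), "\n";
}
```

Nothing in the script models meaning; the replies only echo fragments of the input.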
Wordcount program
• Unix wordcount program (wc) counts lines, words and characters
• Determining counts & probabilities of words has many applications:
  • augmentative communication
  • context-sensitive spelling error correction
  • speech recognition
  • handwriting recognition
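
What wc reports can be approximated in Perl as follows. This is a rough sketch, not the wc implementation: `wc_counts` is a hypothetical helper, words are taken to be whitespace-separated runs, and `length` counts characters, which equals the byte count only for single-byte text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (lines, words, characters) for a string.
sub wc_counts {
    my ($text) = @_;
    my $lines = ($text =~ tr/\n//);    # tr with an empty replacement only counts
    my @words = split ' ', $text;      # split on runs of whitespace
    my $chars = length $text;
    return ($lines, scalar @words, $chars);
}

my ($lines, $words, $chars) = wc_counts("the cat\nsat on the mat\n");
print "lines: $lines, words: $words, characters: $chars\n";
```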
Counting words in a corpus (preview)

#!/usr/bin/perl
# From Perl book, page 39 (lightly updated).
use strict;
use warnings;

$/ = "";    # Enable paragraph mode.
# The original also set $* = 1 to enable multi-line patterns. That variable
# has been removed from Perl, and none of the patterns below need it.

# Now read each paragraph and split into words. Record each
# instance of a word in the %wordcount associative array.
my %wordcount;
my $total = 0;
while (<>) {
    s/-\n//g;       # Dehyphenate hyphenations (across lines)
    s/<s>//g;       # Remove <s>
    tr/A-Z/a-z/;    # Canonicalize to lowercase.
    my @words = split(/\W*\s+\W*/, $_);
    foreach my $word (@words) {
        next if $word eq "";    # split can yield an empty leading field
        $wordcount{$word}++;    # Increment the entry.
        $total++;
    }
}

# Now print out all the entries in the %wordcount array
foreach my $word (sort keys(%wordcount)) {
    printf "(%8.6f%%) %20s occurs %3d time(s)\n",
        (100 * $wordcount{$word} / $total), $word, $wordcount{$word};
}
printf "Total number of words is %d.\n", $total;
printf "Total number of distinct words is %d.\n", scalar(keys %wordcount);