
LING 388: Language and Computers

LING 388: Language and Computers. Sandiway Fong Lecture 6: 9/19. homework 2 acknowledgements mailed out today @ 2:45pm (no wifi: apologies for the delay) if you didn’t get an email, please resubmit will be reviewed on Thursday. this thursday lab class meet in SBS 224 homework 3.


Presentation Transcript


  1. LING 388: Language and Computers Sandiway Fong Lecture 6: 9/19

  2. Administrivia • homework 2 • acknowledgements mailed out today @ 2:45pm (no wifi: apologies for the delay) • if you didn’t get an email, please resubmit • will be reviewed on Thursday • this Thursday • lab class: meet in SBS 224 • homework 3

  3. Today’s Topic • Regular Expressions (RE)

  4. Regular Expressions • (formally) equivalent to • finite state automata (FSA), and • regular grammars • [diagram: FSA, Regular Expressions, Regular Grammars] • used in • string pattern matching • typically for a single word form • search text: unix (e)grep, perl, microsoft word • caution: • differences in notation and implementation

  5. Regular Expressions • Regular Expressions • shorthand for describing sets of strings • String • sequence of zero or more characters • (typically, unbroken by spaces) • Examples • aaa • john • mary45 • mary 45 • NT$ • ε (empty string)

  6. Regular Expressions • Regular Expressions • shorthand • stringⁿ • exactly n occurrences of string • n = 0,1,2,3,... • examples • a⁴b³ = aaaabbb • (uv)² = uvuv • ((ab)²(ba)²)² = ababbabaababbaba • Note: • parentheses are used to group sequences of characters (strings)
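
The "exactly n occurrences" shorthand on this slide can be checked mechanically. The sketch below is an illustration only, not part of the lecture: it assumes Python, and the helper name `power` is hypothetical.

```python
# Illustrative only: Python's * operator on strings repeats a string exactly
# n times, which mirrors the slide's "exactly n occurrences of string".

def power(s: str, n: int) -> str:
    """Return s repeated exactly n times; n = 0 gives the empty string."""
    return s * n

a4b3 = power("a", 4) + power("b", 3)
uv2 = power("uv", 2)
# Parentheses group: first build (ab)^2 (ba)^2, then repeat the whole group twice.
nested = power(power("ab", 2) + power("ba", 2), 2)

print(a4b3)    # aaaabbb
print(uv2)     # uvuv
print(nested)  # ababbabaababbaba
```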

  7. Regular Expressions • shorthand for describing sets of strings • string+ • set of one or more occurrences of string • i.e. the set {string¹, string², string³, ...} • Note: set is infinite • examples • a+ = {a, aa, aaa, aaaa, aaaaa, …} • (abc)+ = {abc, abcabc, abcabcabc, …}

  8. Regular Expressions • shorthand for describing sets of strings • string* • set of zero or more occurrences of string • i.e. the set {string⁰, string¹, string², string³, ...} • string⁰ = ε (the empty string) • examples • a* = {ε, a, aa, aaa, aaaa, …} • (abc)* = {ε, abc, abcabc, …} • Note: • a a* = a+ • a {ε, a, aa, aaa, aaaa, …} = {a, aa, aaa, aaaa, aaaaa, …} • Language = a set of strings
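
Because the sets described by + and * are infinite, one way to inspect them is to enumerate only the members up to a length bound. The sketch below assumes Python's `re` module as a stand-in for the tools named in the lecture; the helper `language` and the bounds are hypothetical choices for illustration.

```python
import re
from itertools import product

def language(pattern: str, alphabet: str, max_len: int) -> list:
    """List every string over `alphabet` of length <= max_len that the
    pattern matches in full (not just as a substring)."""
    matches = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(pattern, s):
                matches.append(s)
    return matches

plus = language(r"a+", "a", 4)
star = language(r"a*", "a", 4)
a_then_star = language(r"aa*", "a", 4)
abc_star = language(r"(abc)*", "abc", 6)

print(plus)         # ['a', 'aa', 'aaa', 'aaaa']
print(star)         # ['', 'a', 'aa', 'aaa', 'aaaa']
print(a_then_star)  # ['a', 'aa', 'aaa', 'aaaa']
print(abc_star)     # ['', 'abc', 'abcabc']
```

Within the bound, `aa*` and `a+` give the same list, which is the slide's identity a a* = a+; the empty string shows up as `''` only for the starred patterns.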

  9. Regular Expressions • Wildcard Characters • match a range of characters • . (period) • matches any single character • examples • .+ed = set of all strings of length 3 or greater containing ed and having at least one character preceding it • matches: worked, bed, pre-education • does not match: ed, education • .*fix = set of all strings of length 3 or greater containing fix • matches: prefix, infix, infixed, suffix, fix
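
The wildcard examples on this slide can be tried out directly. This is a minimal sketch assuming Python's `re` module, where `re.search` looks for the pattern anywhere in the word (the "containing" reading used on the slide); the word lists are taken from the slide.

```python
import re

ed_words = ["worked", "bed", "pre-education", "ed", "education"]
fix_words = ["prefix", "infix", "infixed", "suffix", "fix"]

# .+ed needs at least one character before "ed" somewhere in the word.
ed_hits = [w for w in ed_words if re.search(r".+ed", w)]

# .*fix allows zero characters before "fix", so a bare "fix" also matches.
fix_hits = [w for w in fix_words if re.search(r".*fix", w)]

print(ed_hits)   # ['worked', 'bed', 'pre-education']
print(fix_hits)  # ['prefix', 'infix', 'infixed', 'suffix', 'fix']
```

In Python the period does not match a newline unless the `re.DOTALL` flag is given, which is one instance of the implementation differences the lecture cautions about.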

  10. Regular Expressions • Wildcard Characters • match a range of characters • [characters] (list of matching characters) • matches any single character in the list • examples • [s,z]ation • organization • organisation • (in grep and Perl the list is written without the comma, [sz]; a comma inside the brackets is itself a matching character) • [a-z] • any character in the range lowercase a to z • Note: not uppercase • [0-9] • any digit • ASCII (American Standard Code for Information Interchange) chart: computers only understand numbers
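
The bracketed lists and ranges can be exercised as follows. This is a sketch assuming Python's `re` module, whose bracket syntax follows grep and Perl, so the list is written `[sz]` without a comma; the test words are made up for illustration.

```python
import re

# A character list: exactly one of the listed characters at that position.
spelling = re.compile(r"organi[sz]ation")
candidates = ["organization", "organisation", "organication"]
spellings = [w for w in candidates if spelling.fullmatch(w)]

# Ranges are defined by character codes, which is where the ASCII chart comes in.
lower = re.findall(r"[a-z]", "Ab3cD")    # lowercase letters only
digits = re.findall(r"[0-9]", "Ab3cD9")  # digits only
codes = {c: ord(c) for c in "azAZ09"}

# In this syntax a comma inside the brackets is just another listed character.
comma_match = bool(re.fullmatch(r"[s,z]", ","))

print(spellings)    # ['organization', 'organisation']
print(lower)        # ['b', 'c']
print(digits)       # ['3', '9']
print(codes)        # {'a': 97, 'z': 122, 'A': 65, 'Z': 90, '0': 48, '9': 57}
print(comma_match)  # True
```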

  11. Regular Expressions • One of the most popular programs for searching files and returning lines that match a regular expression pattern is called GREP • name comes from Unix ed command g/re/p • “search globally for lines matching the regular expression, and print them” • [Source: http://en.wikipedia.org/wiki/Grep]

  12. Regular Expressions: grep • excerpts from the manpage • The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line. • The symbol \b matches the empty string at the edge of a word. • The symbols \< and \> respectively match the empty string at the beginning and end of a word. • terminology • word • unbroken sequence of digits, underscores and letters
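
The anchors from the manpage excerpt match positions rather than characters. The sketch below assumes Python's `re` module, which supports ^, $ and \b; it deliberately avoids \< and \>, which are grep notation and are not assumed to carry over. The sample lines are made up for illustration.

```python
import re

lines = ["bed time", "a warm bed", "bedbug", "bed"]

# ^ anchors the match at the beginning of the line.
starts_with_bed = [l for l in lines if re.search(r"^bed", l)]

# $ anchors the match at the end of the line.
ends_with_bed = [l for l in lines if re.search(r"bed$", l)]

# \b on both sides requires "bed" to stand as a whole word.
whole_word_bed = [l for l in lines if re.search(r"\bbed\b", l)]

print(starts_with_bed)  # ['bed time', 'bedbug', 'bed']
print(ends_with_bed)    # ['a warm bed', 'bed']
print(whole_word_bed)   # ['bed time', 'a warm bed', 'bed']
```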

  13. Regular Expressions: grep • Excerpts from the manpage • A regular expression may be followed by one of several repetition operators: • ? The preceding item is optional and matched at most once. • * The preceding item will be matched zero or more times. • + The preceding item will be matched one or more times. • {n} The preceding item is matched exactly n times • {n,} The preceding item is matched n or more times. • {n,m} The preceding item is matched at least n times, but not more than m times.
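
The repetition operators listed above can be compared side by side on one set of candidate strings. This is a minimal sketch assuming Python's `re` module, which accepts the same ?, {n}, {n,} and {n,m} notation; the helper `matched` is hypothetical.

```python
import re

def matched(pattern: str, candidates: list) -> list:
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["", "a", "aa", "aaa", "aaaa", "aaaaa"]

optional = matched(r"a?", candidates)        # at most once
exactly_3 = matched(r"a{3}", candidates)     # exactly 3 times
at_least_3 = matched(r"a{3,}", candidates)   # 3 or more times
two_to_four = matched(r"a{2,4}", candidates) # at least 2, at most 4 times

print(optional)     # ['', 'a']
print(exactly_3)    # ['aaa']
print(at_least_3)   # ['aaa', 'aaaa', 'aaaaa']
print(two_to_four)  # ['aa', 'aaa', 'aaaa']
```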

  14. Regular Expressions: GNU grep • Excerpts from the manpage • concatenation • Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions. • disjunction • Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.

  15. Regular Expressions: Examples • Regular Expression • gupp(y|ies) • examples • guppy • guppies • Regular Expression • beds? • examples • bed • beds
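
Both examples combine concatenation with either disjunction or optionality. The sketch below assumes Python's `re` module and uses full matching so that near misses are rejected; the non-matching test words are made up for illustration.

```python
import re

# Concatenation of "gupp" with a disjunction: either "y" or "ies".
guppy = re.compile(r"gupp(y|ies)")
# Concatenation of "bed" with an optional "s".
bed = re.compile(r"beds?")

guppy_hits = [w for w in ["guppy", "guppies", "guppys", "gupp"] if guppy.fullmatch(w)]
bed_hits = [w for w in ["bed", "beds", "bedss", "be"] if bed.fullmatch(w)]

print(guppy_hits)  # ['guppy', 'guppies']
print(bed_hits)    # ['bed', 'beds']
```

Without the parentheses, `guppy|ies` would be read as "guppy" or "ies", since disjunction binds more loosely than concatenation.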

  16. Regular Expressions: Examples • Example • \b99 matches • 99 in “there are 99 bottles …” • but not • 99 in “there are 299 bottles …” • Note: • in $99, the $ is not a word character, so there is a word boundary between $ and 99, and \b99 will match 99 here • word • unbroken sequence of digits, underscores and letters
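
The \b99 example can be reproduced as below. This is a sketch assuming Python's `re` module, whose \b uses the same notion of a word (letters, digits, underscores); the last two test strings are additions for illustration.

```python
import re

pattern = re.compile(r"\b99")
texts = [
    "there are 99 bottles",   # space before 99: word boundary
    "there are 299 bottles",  # digit 2 before 99: no boundary
    "it costs $99",           # $ is not a word character: boundary
    "route_99",               # underscore is a word character: no boundary
]

hits = [t for t in texts if pattern.search(t)]

print(hits)  # ['there are 99 bottles', 'it costs $99']
```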

  17. Regular Expressions: Examples • Example (sheeptalk) • ba! • baa! • baaa! … • regular expression • baa*! • or, equivalently, ba+!
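
Since regular expressions are formally equivalent to finite state automata, sheeptalk can also be recognized by a small hand-written machine. The sketch below is an illustration under assumptions: the state numbering and transition table are not from the slides, and Python's `re` module stands in for grep.

```python
import re

def accepts_sheeptalk(s: str) -> bool:
    """Hypothetical FSA for b a+ !
    States: 0 -b-> 1 -a-> 2 -a-> 2 -!-> 3, with 3 the only accepting state."""
    transitions = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 2, (2, "!"): 3}
    state = 0
    for ch in s:
        if (state, ch) not in transitions:
            return False  # no arc for this character: reject
        state = transitions[(state, ch)]
    return state == 3

samples = ["ba!", "baa!", "baaa!", "b!", "ba", "baa!!", "aba!"]

accepted = [s for s in samples if accepts_sheeptalk(s)]

# The automaton and both regular expressions agree on every sample.
agree = all(
    accepts_sheeptalk(s)
    == bool(re.fullmatch(r"baa*!", s))
    == bool(re.fullmatch(r"ba+!", s))
    for s in samples
)

print(accepted)  # ['ba!', 'baa!', 'baaa!']
print(agree)     # True
```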

  18. Regular Expressions: Microsoft Word • terminology: • wildcard search

  19. Regular Expressions: Microsoft Word

  20. Next Time • In the lab, we’ll be doing some regular expression exercises using Microsoft Word
