
Lecture 4: Matching Things. Regular Expressions

Methods in Computational Linguistics II, with reference to Matt Huenerfauth’s Language Technology material. Topics: regular expressions and a snippet on speech recognition.


Presentation Transcript


  1. Lecture 4: Matching Things. Regular Expressions • Methods in Computational Linguistics II • with reference to Matt Huenerfauth’s Language Technology material

  2. Today • Regular Expressions • Snippet on Speech Recognition • At least half of it.

  3. Regular Expressions • Can be viewed as a way to: • Specify search patterns over a text string • Design a particular kind of machine, a Finite State Automaton (FSA) (we probably won’t cover this today) • Define a formal “language”, i.e. a set of strings

  4. Uses of Regular Expressions • Simple but powerful tools for large corpus analysis and ‘shallow’ processing • What word is most likely to begin a sentence? • What word is most likely to begin a question? • Are you more or less polite than the people you correspond with?
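As a rough illustration of the first question on slide 4, the sketch below counts which word most often begins a sentence. The toy corpus, the pattern, and the name `first_words` are assumptions made for illustration, and the sentence-boundary rule (start of text, or whitespace after `.`, `!` or `?`) is deliberately crude.

```python
import re
from collections import Counter

# Toy corpus (hypothetical); a real study would read a large text collection.
corpus = "The dog barked. The cat ran away. Did the dog follow? The cat hid."

# Capture a word at the start of the text, or right after
# sentence-final punctuation followed by whitespace.
pattern = r"(?:^|(?<=[.!?])\s+)(\w+)"

first_words = Counter(w.lower() for w in re.findall(pattern, corpus))
print(first_words.most_common(1))
```

On real text this rule would be fooled by abbreviations such as "Dr.", so it is a starting point rather than a sentence splitter.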

  5. Definitions • Regular Expression: Formula in algebraic notation for specifying a set of strings • String: Any sequence of characters • Regular Expression Search • Pattern: specifies the set of strings we want to search for • Corpus: the texts we want to search through

  6. Simple Example

  7. More Examples

  8. And still more examples

  9. Optionality and Repetition • /[Ww]oodchucks?/ • /colou?r/ • /he{3}/ • /(he){3}/ • /(he){3,}/
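The patterns on slide 9 can be tried directly in Python. This is a minimal sketch: the helper name `matches` and the test strings are assumptions, and the last pattern is read as `(he){3,}` (three or more repetitions of "he"), which is also an assumption about what the slide intended.

```python
import re

def matches(pattern, string):
    """True if the whole string matches the pattern."""
    return re.fullmatch(pattern, string) is not None

# ? makes the preceding item optional
print(matches(r"[Ww]oodchucks?", "woodchuck"))   # optional final s
print(matches(r"colou?r", "color"))              # optional u

# {3} applies only to the item right before it
print(matches(r"he{3}", "heee"))      # h followed by three e's
print(matches(r"(he){3}", "hehehe"))  # the group "he" three times

# {3,} means three or more
print(matches(r"(he){3,}", "hehehehehe"))
```

The contrast between `he{3}` and `(he){3}` shows why grouping matters: without parentheses the counter binds to a single character.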

  10. Character Groups • Some groups of characters are used very frequently, so the RE language includes shorthands for them
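The table for slide 10 did not survive in this transcript. As a sketch, here are three shorthands that Python's `re` module provides (`\d`, `\w`, `\s`); whether these are exactly the ones the slide listed is an assumption, and the sample text is made up.

```python
import re

text = "Room 42, Floor 3"

digits = re.findall(r"\d+", text)   # \d: any digit
words = re.findall(r"\w+", text)    # \w: letter, digit or underscore
spaces = re.findall(r"\s", text)    # \s: any whitespace character

print(digits)
print(words)
print(len(spaces))
```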

  11. Special Characters • These enable the matching of multiple occurrences of a pattern
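The slide's list of special characters is missing here, so the sketch below assumes it covered the two standard repetition operators, `*` (zero or more) and `+` (one or more). The sample string is invented.

```python
import re

text = "b ba baa baaa"

star = re.findall(r"ba*", text)  # b followed by zero or more a's
plus = re.findall(r"ba+", text)  # b followed by one or more a's

print(star)
print(plus)
```

The bare "b" is matched by `ba*` but not by `ba+`, which is the whole difference between the two operators.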

  12. Escape Characters • Sometimes you want to use an asterisk “*” as an asterisk and not as a modifier.
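A small sketch of the asterisk case from slide 12, using a backslash to escape it and the standard-library helper `re.escape` to escape a whole string. The sample strings are assumptions.

```python
import re

text = "5 * 3 = 15"

# \* matches a literal asterisk instead of acting as a repetition operator
literal_stars = re.findall(r"\*", text)
print(literal_stars)

# Unescaped, a*b means "zero or more a's, then b"
print(re.fullmatch(r"a*b", "aaab") is not None)

# re.escape builds a pattern that matches the string literally
escaped = re.escape("a*b")
print(re.search(escaped, "xa*by") is not None)
print(re.search(escaped, "aaab") is not None)
```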

  13. RE Matching in Python NLTK • Set up: • import re • from nltk.util import re_show • sent = 'colourless green ideas sleep furiously' • re_show(pattern, string) • shows where the pattern matches
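If NLTK is not installed, the idea of marking where a pattern matches can be imitated with the standard library alone. The function `show_matches` below is a hypothetical stand-in written for this example, not NLTK's own implementation; it wraps each match in braces.

```python
import re

def show_matches(pattern, string, left="{", right="}"):
    """Return the string with every match of the pattern wrapped in braces."""
    return re.sub(pattern, lambda m: left + m.group(0) + right, string)

sent = "colourless green ideas sleep furiously"

print(show_matches(r"ee", sent))
print(show_matches(r"ou?r", sent))
```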

  14. Substitutions • Replace every l with an s • re.sub('l', 's', sent) • 'cosoursess green ideas sseep furioussy' • re.sub('green', 'red', sent) • 'colourless red ideas sleep furiously'

  15. Findall • re.findall(pattern, sent) • will return all of the substrings that match the pattern • re.findall('(green|sleep)', sent) • ['green', 'sleep']
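One detail worth seeing in code is how parentheses change what `re.findall` returns: with no group it returns whole matches, with one group it returns that group, and with several groups it returns tuples. The patterns below are chosen only for illustration.

```python
import re

sent = "colourless green ideas sleep furiously"

no_group = re.findall(r"green|sleep", sent)      # whole matches
one_group = re.findall(r"(green|sleep)", sent)   # contents of the single group
two_groups = re.findall(r"(\w)(ee)", sent)       # one tuple per match
non_capturing = re.findall(r"\w(?:ee)\w", sent)  # (?:...) groups without capturing

print(no_group)
print(one_group)
print(two_groups)
print(non_capturing)
```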

  16. Match • Matches from the beginning of the string • re.match(pattern, string) • Returns: a Match object or None (if not found) • Match objects contain information about the search

  17. Methods in Match

  18. More Match Methods
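The method tables for slides 17 and 18 are missing from this transcript. The sketch below shows several methods that Match objects in Python's `re` module do provide (`group`, `groups`, `start`, `end`, `span`); which of them the slides actually listed is an assumption, as is the example pattern.

```python
import re

m = re.match(r"(\w+) (\w+)", "colourless green ideas")

whole = m.group()        # the entire matched text
first = m.group(1)       # text of the first parenthesised group
both = m.groups()        # all groups as a tuple
bounds = (m.start(), m.end())  # where the whole match begins and ends
second_span = m.span(2)  # (start, end) of the second group

print(whole, first, both, bounds, second_span)
```

Because `re.match` returns None when nothing matches, real code would normally check the result before calling these methods.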

  19. Search • re.search(pattern, string) • Finds the pattern anywhere in the string. • re.search(r'\d+', ' 1034 ').group() • '1034' • re.search(r'\d+', ' abc123 ').group() • '123'
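To make the contrast with `re.match` concrete, here is a short sketch that runs both functions on the slide's second example string. The variable names are assumptions.

```python
import re

text = " abc123 "

at_start = re.match(r"\d+", text)   # only tries at position 0, so this is None
anywhere = re.search(r"\d+", text)  # scans forward until the pattern matches

print(at_start)
if anywhere is not None:
    print(anywhere.group(), anywhere.span())
```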

  20. Splitting • 'text can be made into lists'.split() • re.split(pattern, string) • uses the pattern to identify the split point • re.split(r'\d+', 'I want 4 cats and 13 dogs') • ['I want ', ' cats and ', ' dogs'] • re.split(r'\s*\d+\s*', 'I want 4 cats and 13 dogs') • ['I want', 'cats and', 'dogs']

  21. Joining • ' '.join(['lists', 'can', 'be', 'made', 'into', 'strings']) • This simple formatting can be helpful for reporting results or merging information
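Splitting and joining are often used together, for instance to normalise irregular whitespace. A minimal sketch, with an invented input string:

```python
import re

messy = "lists   can be\tmade into\nstrings"

words = re.split(r"\s+", messy)  # split on any run of whitespace
tidy = " ".join(words)           # rejoin with single spaces

print(words)
print(tidy)
```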

  22. Stemming with Regular Expressions
      import re

      def stem(word):
          # Non-greedy (.*?) keeps the stem short so the optional suffix can match
          regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
          stem, suffix = re.findall(regexp, word)[0]
          return stem
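To see what the stemming pattern does, the sketch below applies the same regular expression to a few words and returns the suffix as well as the stem. The name `split_suffix` and the word list are assumptions added for this example.

```python
import re

SUFFIX_PATTERN = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'

def split_suffix(word):
    """Return (stem, suffix); suffix is '' when none of the endings applies."""
    stem, suffix = re.findall(SUFFIX_PATTERN, word)[0]
    return stem, suffix

for word in ["processing", "cats", "happily", "movement", "lies", "language"]:
    print(word, split_suffix(word))
```

The result for "lies" (stem "l") shows the limits of this approach: the pattern strips endings without knowing whether what remains is a word.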

  23. Play with some code

  24. Snippet on Speech Recognition
