Strings and regular expressions
Day 10
LING 681.02 Computational Linguistics
Harry Howard
Tulane University
Course organization
• http://www.tulane.edu/~ling/NLP/
• NLTK is installed on the computers in this room!
• How would you like to use the Provost's $150?
• Please become a fan of Tulane Linguistics on Facebook.
LING 681.02, Prof. Howard, Tulane University
NLPP §3 Processing raw text §3.2 Strings: Text processing at the lowest level
Syntax of single-line strings
• Strings are enclosed in single quotes. If a single quote is one of the characters, enclose the string in double quotes or escape the single quote with a backslash:
    'Monty Python'
    "Monty Python's Flying Circus"
    'Monty Python\'s Flying Circus'
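The quoting options above can be checked in a short sketch. The variable names are illustrative and not from the slides.

```python
# Three ways to write string literals, two of which contain an apostrophe.
plain = 'Monty Python'                           # no apostrophe: single quotes suffice
double_quoted = "Monty Python's Flying Circus"   # double quotes enclose the single quote
escaped = 'Monty Python\'s Flying Circus'        # a backslash escapes the single quote

# The double-quoted and escaped spellings denote the same string value.
print(double_quoted == escaped)  # True
print(len(plain))                # 12
```

The choice of quote style affects only how the literal is typed, not the string that results.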
Syntax of multi-line strings
• A sequence of strings can be joined into a single one with …
  • a backslash at the end of each line:
        'first half'\
        'second half'
    = 'first halfsecond half'
  • parentheses to open and close the sequence:
        ('first half'
         'second half')
    = 'first halfsecond half'
  • triple double quotes to open and close the sequence and maintain line breaks:
        """first half
        second half"""
    = 'first half\nsecond half'
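A minimal sketch of the three joining styles, using repr() to make the newline visible. The variable names are illustrative.

```python
# Adjacent string literals are joined by Python into one string.
with_backslash = 'first half' \
                 'second half'

with_parens = ('first half'
               'second half')

# Triple quotes keep the line break as a newline character.
with_triple = """first half
second half"""

print(repr(with_backslash))  # 'first halfsecond half'
print(repr(with_parens))     # 'first halfsecond half'
print(repr(with_triple))     # 'first half\nsecond half'
```

Note that the first two styles insert no space between the halves; a space has to be typed inside one of the literals if it is wanted.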
Basic operations
• Concatenation (+)
    >>> 'really' + 'really'
    'reallyreally'
• Repetition (*)
    >>> 'really' * 4
    'reallyreallyreallyreally'
Your Turn p. 88 !!!
Printing strings
• Make a couple of string assignments:
    harry = 'Harry Potter'
    prince = 'Half-Blood Prince'
• Inspection of a variable produces Python's representation of its value:
    >>> harry
    'Harry Potter'
• Printing a variable produces its value:
    >>> print harry
    Harry Potter
• What do you expect?
    >>> print harry + prince
    >>> print harry, prince
    >>> print harry, 'and the', prince
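The difference between inspecting and printing can be sketched with repr(), which returns the representation that the interpreter echoes. This sketch uses the function-call form print(...), which runs under Python 3; the slides themselves use the Python 2 print statement.

```python
harry = 'Harry Potter'

# repr() returns the representation shown on inspection, quotes included.
representation = repr(harry)
print(representation)   # 'Harry Potter'

# print shows the bare value, without quotes.
print(harry)            # Harry Potter

# The representation is two characters longer: the opening and closing quote.
print(len(representation) - len(harry))  # 2
```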
Using indices
• Every character of a string is indexed from 0 (counting from the left) and from -1 (counting from the right):
    >>> harry[0]
    'H'
    >>> harry[-1]
    'r'
    >>> harry[:2]
    'Ha'
    >>> harry[-12:-10]
    'Ha'
    >>> for char in prince:
    ...     print char,
    H a l f - B l o o d   P r i n c e
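The indexing rules can be verified with a few assignments. This is a sketch in which every name other than harry is illustrative.

```python
harry = 'Harry Potter'

# Positive indices count from the left starting at 0;
# negative indices count from the right starting at -1.
first = harry[0]                    # 'H'
last = harry[-1]                    # 'r'
also_last = harry[len(harry) - 1]   # same character as harry[-1]

# A slice [start:stop] includes start and excludes stop.
prefix = harry[:2]                  # 'Ha'
same_prefix = harry[-12:-10]        # 'Ha', because len(harry) == 12
surname = harry[6:]                 # 'Potter'

print(first, last, prefix, same_prefix, surname)
```

Because the stop index is excluded, harry[:2] yields two characters, not three.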
More string operations
• See Table 3-2
Strings vs. lists
• Both are sequences and so support joining by concatenation and separation by slicing.
• But they are different types, so a string cannot be concatenated with a list.
• Granularity
  • Strings have a single level of resolution, the individual character, which is good for writing to screen or file.
  • Lists can have any level of resolution we want (character, morpheme, word, phrase, sentence, paragraph), which is good for NLP.
• So the second step in the NLP pipeline is to tokenize a string into a list.
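The contrast can be sketched as follows. The example sentence is arbitrary, and str.split() stands in as a naive whitespace tokenizer; it is an assumption for illustration, not the tokenizer the course uses.

```python
sentence = 'Monty Python and the Holy Grail'   # arbitrary example string

# Slicing a string yields characters; slicing a list yields whole tokens.
tokens = sentence.split()      # naive whitespace tokenization
print(sentence[:5])            # Monty
print(tokens[:2])              # ['Monty', 'Python']

# Concatenating a string with a list raises TypeError.
try:
    sentence + tokens
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)                # False

# Converting explicitly moves between the two types.
rejoined = ' '.join(tokens)
print(rejoined == sentence)    # True
```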
NLPP §3 Processing raw text §3.3 Text processing with Unicode
Unicode
• The standard for representing characters that go beyond ASCII
• Let's skip this until we really need it.
NLPP §3 Processing raw text §3.4 Regular expressions for detecting word formats
Getting started
• To use regular expressions in Python, we need to import the re library.
• We also need a list of words to search.
  • We'll use the Words Corpus again (Section 2.4).
  • We will preprocess it to remove any proper names.
    >>> import re
    >>> import nltk
    >>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
Different terminologies
• In the textbook, regex = «ed$»
• In re, regex = 'ed$' (i.e. a string)
Searching
• re.search(p, s)
  • p is a pattern – what we are looking for, and
  • s is a candidate string for matching the pattern.
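A small sketch of what re.search gives back: a match object when the pattern is found somewhere in the string, and None otherwise. The two test words are illustrative.

```python
import re

# re.search(p, s) returns a match object if p matches anywhere in s,
# and None otherwise, so the result can be used directly as a condition.
hit = re.search('ed$', 'walked')
miss = re.search('ed$', 'editor')   # 'ed' is present, but not at the end

print(hit is not None)   # True
print(miss is None)      # True
print(hit.group())       # ed
print(hit.start())       # 4
```

Because None counts as false, a list comprehension can filter on `if re.search(p, w)` without comparing the result to anything.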
Some examples
• Find words ending in -ed:
    >>> [w for w in wordlist if re.search('ed$', w)]
• Find a word that fits a certain group of blanks in a crossword puzzle that is 8 letters long, with j as the 3rd letter and t as the 6th letter:
    >>> [w for w in wordlist if re.search('^..j..t..$', w)]
• Find the strings email or e-mail:
    >>> [w for w in wordlist if re.search('^e-?mail$', w)]
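The three searches can be tried without the corpus installed. In this sketch, wordlist is a small hypothetical stand-in for the NLTK word list, chosen so that each pattern has something to accept and something to reject.

```python
import re

# Hypothetical stand-in for the NLTK word list, so the sketch runs without corpus data.
wordlist = ['abjectly', 'adjuster', 'email', 'e-mail',
            'walked', 'editor', 'objected', 'mail']

# $ anchors the pattern at the end of the word.
ends_in_ed = [w for w in wordlist if re.search('ed$', w)]

# ^ and $ together force the whole word to be exactly 8 characters;
# each . stands for any single character.
crossword = [w for w in wordlist if re.search('^..j..t..$', w)]

# ? makes the preceding hyphen optional.
email = [w for w in wordlist if re.search('^e-?mail$', w)]

print(ends_in_ed)   # ['walked', 'objected']
print(crossword)    # ['abjectly', 'adjuster', 'objected']
print(email)        # ['email', 'e-mail']
```

Without the ^ and $ anchors, the crossword pattern would also accept longer words that merely contain a matching 8-character stretch.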
Next time More on RegEx