Session 2.5 Wharton Summer Tech Camp: Regex, Data Acquisition
Agenda • 1: REGEX Intro • 2: Data Acquisition
What is Regular Expression (RE)? • RE or REGEX is a way to describe string patterns to computers • Basically, an advanced “Find” and “Find and Replace” • Originated in theoretical computer science • For the interested: “Formal Language Theory”, “Chomsky hierarchy”, “Automata theory” • The same theory guides programming language design • Popularized by Perl, ubiquitous in Unix • Almost all programming languages support REGEX, and the syntax is mostly the same across them
What is Regular Expression (RE)? • Given a text T, an RE matches the parts of T that fit the pattern it describes • RE(T) = matched parts of T • Then you can do whatever you wish with the matched part • A regular expression can be complicated and can consist of multiple patterns • You can match multiple patterns at the same time • With the matched part of T, you can process it or substitute part of it with something else
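As a rough illustration of "match a part of T, then do something with it", here is a minimal Python sketch. The sample text and the date pattern are made up for this example, and the pattern notation itself is explained in the slides that follow.

```python
import re

text = "Order 66 shipped on 2013-07-15 to Philadelphia"

# A pattern for dates written as YYYY-MM-DD:
# four digits, a dash, two digits, a dash, two digits.
date_pattern = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"

# Step 1: find the part of the text that the pattern describes.
match = re.search(date_pattern, text)
matched_part = match.group(0) if match else None
print(matched_part)

# Step 2: substitute the matched part with something else.
redacted = re.sub(date_pattern, "<DATE>", text)
print(redacted)
```

The same two steps (find, then use or replace) are what most of the data-cleaning uses later in this session come down to.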
Well, Google is a text search tool, albeit for different purposes. The power comes from the fact that by learning regex, you are essentially learning to represent complex text patterns to computers efficiently. The data may be too big or too tedious for humans to go through. Learn their language and tell computers what to do!
True (paraphrased) quotes from some doctoral students and faculty before I introduced them to REGEX • “I despise aggregating data from the AMT. It took me a week to go through them all” • “[Grunt noise]. I had to filter out IP addresses from surveys by hand and it took me forever” • “I have this data with many different ways of representing the same variables and need to do ‘fuzzy’ matching but don’t know a good way to do this”
Reasons to use regex • Regular expressions are very useful for data cleaning and aggregating • Very useful in basic web scraping • Text data is everywhere: “If you take ‘text’ in the widest possible sense, perhaps 90% of what you do is 90% text processing” (Programming Perl book) • Once you learn regex, you can use it in any language since the implementations are similar • Learning regex is one of the first steps in learning NLP (natural language processing) • You are learning a language of the machines
Usage Examples • You get an output from Amazon Mechanical Turk (or Qualtrics) and need to extract and aggregate the data and make it usable by R or Stata • You can check survey responses for quality control, e.g., whether participants are paying attention, at a massive scale. A related use in web development is checking whether an input format is correct (password requirements). • You want to scrape simple information from a website for your project • One simple algorithm in NLP is matching and counting words. Regex can do that. • You want to obtain email addresses for your evil spamming purposes. You can do that, but don’t. • Etc. Many possibilities for increasing productivity
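The "matching and counting words" use case can be sketched in a few lines of Python. This is only one simple way to do it: the helper name count_words and the sample sentence are made up for illustration, and the pattern treats a word as a run of lowercase letters and apostrophes, which is an assumption that will not suit every data set.

```python
import re
from collections import Counter

def count_words(text):
    """Count how often each word occurs, ignoring case and punctuation."""
    # Lowercase first, then pull out every run of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

sample = "The cat saw the other cat. The end."
counts = count_words(sample)

# The two most frequent words and their counts.
print(counts.most_common(2))
```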
But it takes some time to master. You will need to practice with a cheat sheet next to you. Literally, this is a language (“regular language”) you are learning. Just like any language, this one has vocabulary and grammar to learn.
Tools to practice REGEX • There are great tools to practice regex • Website: RegExr http://gskinner.com/RegExr/ • If you have a Mac: Reggy http://reggyapp.com/ • If you have Windows: RegexBuddy http://www.regexbuddy.com/
Basics of REGEX • Can represent strings literally or symbolically • Literal representations are not powerful but are convenient for small tasks • Symbolic representation is the workhorse • There are a few concepts you need to learn to use this representation • There are also many characters with special meanings in REGEX, e.g., . ^ $ * + ? { } [ ] \ | ( ) • Cheat sheet: http://cloud.github.com/downloads/tartley/python-regex-cheatsheet/cheatsheet.pdf
Literal Matching • Match strings literally. • String = “I am a string” • RE= “string” • Matched string = “string” That’s it
Literal Matching & Quantifiers • Symbolic matching has many special characters to learn. • Quantifiers are one concept • + means match whatever comes immediately before it (here, the single character a) 1 or more times • "ba" matches only "ba" • "ba+" matches ba, baa, baaa, baaaa, etc. • ? means match whatever comes before it 0 or 1 time • "ba?" matches b or ba • * means match whatever comes before it 0 or more times • "ba*" matches b or ba or baa or baaa and so on
More Quantifiers • {start,end} means match whatever comes before it at least “start” and at most “end” times • "ba{1,3}" matches ba, baa, baaa • "ba{2,}" matches baa, baaa, baaaa and so on (no upper limit)
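The quantifier rules above can be checked directly in Python. The sketch below uses re.fullmatch so that a pattern only counts as matching when it describes the whole candidate string; the helper name full_matches and the candidate list are made up for this illustration.

```python
import re

candidates = ["b", "ba", "baa", "baaa", "baaaa"]
patterns = [r"ba", r"ba+", r"ba?", r"ba*", r"ba{1,3}", r"ba{2,}"]

def full_matches(pattern, strings):
    """Return the strings that the pattern describes in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s)]

# Print, for each pattern, which of the candidates it fully matches.
for pattern in patterns:
    print(pattern, full_matches(pattern, candidates))
```

Using re.search instead would report a hit whenever the pattern occurs anywhere inside the string, so "ba" would then also be found inside "baaa".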
Special Meta characters • As you’ve seen, some characters have special meanings • . ^ $ * + ? { } [ ] \ | ( ) • . means any one character except the newline character \n • ^ dictates that the pattern should only match if it occurs at the beginning • String= “the book” RE= “book” YES RE= “^book” NO • $ is similar to ^ but for the end • [] defines a set of characters, any one of which may match; ranges are allowed, so [0-9] means any single digit from 0 to 9 • () is used for grouping • Used to group patterns • Can be used to capture (remember) a certain part of the match • | is used as “OR”: (5|4) matches 5 or 4 • \ is the special character to rule them all: it escapes any special meta character so that it is matched as is. \. matches an actual period . • [^stuff] means match any one character that is not in the set: [^9] matches any single character but 9
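A small Python sketch of the meta characters listed above. The sample strings (including the address alice@example.com) are invented for the illustration.

```python
import re

s = "the book"

# ^ and $ anchor the pattern to the start and end of the string.
anywhere = re.search(r"book", s) is not None    # "book" occurs somewhere in s
at_start = re.search(r"^book", s) is not None   # s does not start with "book"
at_end = re.search(r"book$", s) is not None     # s does end with "book"

# . matches any single character; \. matches a literal period only.
any_char = re.findall(r".", "a.b")
periods = re.findall(r"\.", "v1.2.3")

# [] matches one character from a set; [^ ] matches one character NOT in the set.
digits = re.findall(r"[0-9]", "room 42")
not_nine = re.findall(r"[^9]", "1929")

# | means OR.
four_or_five = re.findall(r"5|4", "3 4 5 6")

# () captures parts of the match so they can be read back afterwards.
m = re.search(r"([a-z]+)@([a-z.]+)", "mail alice@example.com now")
user, domain = m.group(1), m.group(2)

print(anywhere, at_start, at_end)
print(any_char, periods, digits, not_nine, four_or_five)
print(user, domain)
```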
Hey Jude Hey Jude, don't make it bad Take a sad song and make it better Remember to let her under your skin Then you'll begin to make it (better ){6}, oh (na ){7}, (na ){4}, Hey Jude
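The last line of the lyric is itself a regex: a group in parentheses followed by a count repeats the whole group, not just one character. A quick Python check, with the variable names chosen only for this sketch:

```python
import re

chorus_pattern = r"(better ){6}, oh (na ){7}, (na ){4}, Hey Jude"

# Spell the line out by hand: 6 x "better ", 7 x "na ", 4 x "na ".
spelled_out = "better " * 6 + ", oh " + "na " * 7 + ", " + "na " * 4 + ", Hey Jude"

# One "na" short in the first run: the pattern should no longer fit.
too_short = "better " * 6 + ", oh " + "na " * 6 + ", " + "na " * 4 + ", Hey Jude"

matches_full_line = re.fullmatch(chorus_pattern, spelled_out) is not None
matches_short_line = re.fullmatch(chorus_pattern, too_short) is not None
print(matches_full_line, matches_short_line)
```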
Special Vocabulary Shortcuts • Some vocabulary is so common that shortcuts were made • \d matches any digit [0-9] • \w any alphanumeric character or underscore [a-zA-Z0-9_] • \s whitespace: space, tab, newline, etc. [ \t\n\r\f\v] • notice the space at the beginning • \W any character that is not alphanumeric or underscore [^a-zA-Z0-9_] • \S guess? • \D again?
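The shortcuts can be tried out on a small made-up string. Note that the lowercase and uppercase versions are complements of each other, which also answers the two "guess?" questions above.

```python
import re

# A sample string containing a space, a comma, a tab and a newline.
sample = "Room 42,\tFloor_3\n"

digits = re.findall(r"\d", sample)        # single digits
words = re.findall(r"\w+", sample)        # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)        # single whitespace characters

non_words = re.findall(r"\W", sample)     # everything \w does not match
non_spaces = re.findall(r"\S+", sample)   # runs of non-whitespace
non_digits = re.findall(r"\D+", sample)   # runs of non-digits

print(digits, words, spaces)
print(non_words, non_spaces, non_digits)
```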
Flags • Change the way regex works • i ignore case • s changes the way . works. Usually . matches anything except the newline \n; this flag makes . match everything • m multiline. Changes the way ^ and $ work with newlines. Usually ^ and $ match strictly at the start or end of the string, but this flag makes them match at the start and end of each line.
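In Python's re module the three flags above are spelled re.IGNORECASE (re.I), re.DOTALL (re.S) and re.MULTILINE (re.M). A minimal sketch of their effect on a made-up two-line string:

```python
import re

text = "First line\nsecond LINE"

# i: ignore case
case_sensitive = re.findall(r"line", text)
ignore_case = re.findall(r"line", text, flags=re.IGNORECASE)

# s: let . match the newline too, so .* can span both lines
without_dotall = re.search(r"First.*second", text)
with_dotall = re.search(r"First.*second", text, flags=re.DOTALL)

# m: let ^ match at the start of every line, not just the start of the string
string_start_only = re.findall(r"^\w+", text)
every_line_start = re.findall(r"^\w+", text, flags=re.MULTILINE)

print(case_sensitive, ignore_case)
print(without_dotall is None, with_dotall is not None)
print(string_start_only, every_line_start)
```

Several flags can be combined with |, for example flags=re.IGNORECASE | re.MULTILINE.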
REGEX in python • Python library re • import re • The function used is re.search(pattern, string, flags=0) • Scans through string looking for a location where the regular expression pattern produces a match, and returns a corresponding match object. Returns None if no position in the string matches the pattern. • Pattern: specifies what is to be matched • String: the actual string to match against • Flags: options that change the way regex works; again, flag "i" (written re.IGNORECASE or re.I in Python) says ignore case.
REGEX in python • re.search(pattern, string, flags=0) • re.findall(pattern, string, flags=0) • Pattern: always wrap the pattern with r"" in Python. r"" says to treat everything between the quotes as a raw string; this is particular to Python because of the way it interprets backslashes in ordinary strings.
import re
s = "This is an example string"
matchedobject = re.search(r"This", s)  # a match object: "This" is found
matchedobject = re.search(r"this", s)  # None: matching is case-sensitive by default
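To show what comes back from the two functions, here is a short sketch that reuses the example string from the slide. The variable names hit, miss and relaxed are chosen only for this illustration.

```python
import re

s = "This is an example string"

# re.search returns a match object for the first match, or None.
hit = re.search(r"This", s)
miss = re.search(r"this", s)

# Always check for None before using the match object.
if hit:
    print(hit.group(0), hit.start(), hit.end())  # matched text and its position
print(miss)

# With the ignore-case flag the lowercase pattern matches too.
relaxed = re.search(r"this", s, flags=re.IGNORECASE)
print(relaxed.group(0))

# re.findall returns a list with every non-overlapping match.
all_is = re.findall(r"is", s)
print(all_is)
```

re.search stops at the first match, so it suits yes/no checks; re.findall is the one to reach for when you want to collect or count every occurrence.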
Regex is easy to learn but hard to master • Example of complex regex • The regex in the next slide is taken from http://ex-parrot.com/~pdw/Mail-RFC822-Address.html • It validates email addresses based on the RFC822 grammar, which is now obsolete. • It’s not written by hand. It’s produced by combining a set of simpler regexes.
NO+!+ (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(? 
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? 
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:( ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(? :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n) ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*)
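Nobody writes something like the above in one go; it is assembled from small, readable pieces. The sketch below shows the same idea at a much smaller scale. It is not related to the RFC822 pattern: it builds a pattern for IPv4 addresses (the kind of thing one might filter out of survey data) from a pattern for a single number between 0 and 255. The names OCTET, IPV4 and log_line are made up for this illustration.

```python
import re

# One number from 0 to 255, written as alternatives:
# 250-255 | 200-249 | 100-199 | 0-99
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four octets joined by literal dots; \b keeps the match from starting
# or ending in the middle of a longer run of digits.
IPV4 = r"\b" + r"\.".join([OCTET] * 4) + r"\b"

log_line = "Logins from 192.168.0.1 and 8.8.8.8, not 999.1.1.1"
addresses = re.findall(IPV4, log_line)
print(addresses)
```

Here (?: ) groups without capturing, so re.findall returns whole addresses rather than individual groups.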
Lab • Try some REGEX tutorials • http://regexone.com/ • http://www.regexlab.com/ • http://www.regular-expressions.info/tutorial.html • The scripts I uploaded • Play around with the regex tool • 5-10 minutes
THE BIGGEST concern for doctoral students doing empirical work (year 2-4), excluding the quals/prelims: “WHERE AND HOW DO I GET DATA?!” Mr. Data: “I believe what you are experiencing is frustration”
Data sources • Companies • Wharton Organizations • Scraping the Web • APIs: application programming interfaces
We are going to use the following for the next session • Download WGET and make sure it works • You may already have wget if you use a Mac (in Terminal, type wget) • http://www.gnu.org/software/wget/ • Get Firefox Developer’s Toolbox • Data acquisition (Wharton, Company, Scraping, API)