The Digitization in the Humanities Workshop @ Rice University April 5-7, 2013. Text Extraction using Regular Expressions. Shih-Pei Chen Project Manager, China Biographical Database, Harvard University. Downloads for today. Slides and sample texts (a package) On OWL-Space Text editor(s)
Downloads for today • Slides and sample texts (a package) • On OWL-Space • Text editor(s) • Mac users: please install TextWrangler • PC users: EmEditor, UltraEdit, or both • CBDB Regex Machine • http://isites.harvard.edu/icb/icb.do?keyword=k16229&pageid=icb.page515758 -- download the CBDBRegexMachine_July2012.zip on this page
The China Biographical Database – Modeling Life Histories– from anecdote to data Biography Prosopography Social Network Analysis Geospatial Analysis
Big Data • What are you going to do with the large amount of text on the Web? • Is there information you want to search for? • Is there something you want to analyze? • CBDB experience: we use regular expressions to extract biographical data from thousands of historical records (in their full texts)
What regular expressions can do for you • Beyond keyword search • Search for written variations • Search for patterns • Search and replace => tagging • You don’t have to learn programming in order to use regular expressions • Just use a text editor which supports regex
Today • Part 1: Learn regular expressions • Hands-on exercises: matching regexes against some texts in a text editor • Part 2: A real-world task • Using regexes + Search and Replace in a text editor • Part 3: CBDB Regex Machine • Using a graphical user interface to design regexes and test them against a text, then tagging the matches with XML tags
Regular expressions • Are a powerful way of describing patterns of strings • You describe the pattern, and the machine matches it against the text (a string of letters, digits, and symbols)
Automata • Imagine a belt sending characters in line: • The string in line (the input): abcde • It can match this pattern: abcde • It can also match this pattern: bc (only the substring “bc” of the input is matched: a partial match)
Comparing the input against the regex character by character • Input: abcde • Regex: abcde • Match!
Comparing the input against the regex character by character • Input: abcde • Regex: bc • Match! • Behind the scenes: The robot picks up a in the input and finds that a does not match b, the first character in the regex. The robot then throws a out and picks up the next character in the input, which is b. This time the robot finds that the two b’s match each other.
Comparing the input against the regex character by character • Input: abcde • Regex: bd • No match!
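The three cases on these slides can also be reproduced outside a text editor. The following is a minimal sketch using Python's standard `re` module (Python is an assumption here; the workshop itself uses text editors). `re.search` scans the input for the pattern anywhere, which mirrors the partial-match behaviour described above.

```python
import re

text = "abcde"  # the input string from the slides

# The pattern abcde matches the whole input.
full = re.search(r"abcde", text)

# The pattern bc matches only a substring of the input (a partial match).
partial = re.search(r"bc", text)

# b is never immediately followed by d, so bd does not match.
missing = re.search(r"bd", text)

print(full.group(0))     # the matched text for the first pattern
print(partial.span())    # start and end offsets of the partial match
print(missing)           # None signals "no match"
```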
Switch to a good text editor • Text editors which support regex • Windows: EmEditor or UltraEdit (neither is free) • Mac: TextWrangler (free)
What can you describe using regular expressions? Regular expressions – the syntax
Examples come from: Regular-expressions.info Characters • Literal characters • abcde , bc , bd (string match) • Non-Printable Characters • \t (tab), \r (carriage return), \n (line feed) • Line breaks: \r (Mac), \n (Unix), or \r\n (Windows) • Special characters (reserved characters / metacharacters) • [ ] \ ^ $ . | ? * + ( )
Exercise #1 • Download and install one of the above text editors. • Download the “regex text.txt” file. Open it in your text editor. • Call up the “Search” or “Find” function in your editor, and try the regexes in Exercise#1 to see which regexes can be matched.
Character Classes – what can appear at a certain position? • gr[ae]y can match gray or grey • Characters in [ ] form a class (bag of characters) • gr[ae]y will not match graay or graey! • Common character classes • [a-z] , [A-Z] , [a-zA-Z] , [0-9] • Exercise#2 • Input: gray or grey • Regex: gr[ae]y
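To see why a class stands for exactly one character, here is a small sketch in Python's `re` module (an assumed stand-in for the editor's search box). `fullmatch` requires the entire word to fit the pattern, so the words with two vowels are rejected.

```python
import re

# [ae] matches exactly one character, which must be a or e.
pattern = re.compile(r"gr[ae]y")

candidates = ["gray", "grey", "graay", "graey"]

# Map each candidate word to whether the whole word fits the pattern.
results = {word: bool(pattern.fullmatch(word)) for word in candidates}
print(results)
```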
Shorthand Character Classes • \d (digit) : shorthand of [0-9] • \D (non-digit character) • \w (word character): [A-Za-z0-9_] • \W (non-word character) • \s (whitespace character): [ \t\r\n] (white space, tab, carriage return, line feed) • \S (non-whitespace character)
Negated Character Classes • Any character except these • [^aeiou] : not one of a, e, i, o, u • [^\d] : not digit • [^\s] : not white space
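A short sketch of the shorthand and negated classes, again in Python's `re` module as an assumed test bench; the sample string is made up for illustration. One caveat: in Python 3, `\d` and `\w` also match non-ASCII digits and letters unless the `re.ASCII` flag is given, so they are slightly broader than `[0-9]` and `[A-Za-z0-9_]`.

```python
import re

sample = "Tel: 555 0199\tRoom B12"  # hypothetical sample text

digits = re.findall(r"\d", sample)            # every single digit
words = re.findall(r"\w+", sample)            # runs of word characters
fields = re.split(r"\s+", sample)             # split on spaces and tabs
not_vowels = re.findall(r"[^aeiou]", "audio")  # characters that are not vowels

print(digits)
print(words)
print(fields)
print(not_vowels)
```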
Dot . • . can match any single character (almost) • Except the newline character => . is shorthand for [^\n] (Unix), [^\r] (Mac), [^\r\n] (Windows) • Exercise#3
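The dot's behaviour around line breaks differs between tools. As a sketch, in Python's `re` module (assumed here as the test bench) the dot matches any character except the line feed `\n` by default:

```python
import re

# The last candidate has a line feed where the dot would have to match.
text = "gray grey gr8y gr\ny"

# The dot accepts a letter or a digit, but not the line feed.
matches = re.findall(r"gr.y", text)
print(matches)
```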
Optional and Repeat operators • 3 operators for expressing repetition • ? : zero or one time (optional) • + : repeat one or more times • * : repeat zero or more times • Repeat a certain number of times: • \d{1,4} : one to four digits • \d{1,} : one digit or more (equivalent to \d+ ) • Exercise#4
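The operators above can be compared side by side. This is an illustrative sketch in Python's `re` module with a made-up sample string; note how `\d{1,4}` splits a six-digit number into two matches because it accepts at most four digits at a time.

```python
import re

text = "color colour 7 42 2013 123456"  # hypothetical sample text

optional_u = re.findall(r"colou?r", text)      # ? makes the u optional
one_or_more = re.findall(r"\d+", text)         # + takes whole digit runs
one_to_four = re.findall(r"\d{1,4}", text)     # at most four digits per match
open_ended = re.findall(r"\d{1,}", text)       # same meaning as \d+
zero_or_more = re.findall(r"ab*", "a ab abbb")  # * allows zero b's

print(optional_u, one_or_more, one_to_four, zero_or_more)
```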
Examples come from: Regular-expressions.info Alternation (list of words) • Useful when you have a list of words, and you want to find the occurrence of each • cat|dog|mouse|fish : find any one of the four • regex|regular expression : find either regex or regular expression • Exercise#5
Examples come from: Regular-expressions.info Capturing writing variations • Suppose you want to find all the occurrences mentioning regular expressions, but it can be written as “regular expression(s)” or “regex(es)”. • Use this pattern to find them all: reg(ular expressions?|ex(es)?)
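Alternation and the variation pattern can be checked with a quick sketch in Python's `re` module (an assumed test bench; the sentences are invented). Because the variation pattern contains capturing groups, the sketch uses `finditer` and `group(0)` to get each whole match.

```python
import re

# Alternation: find any word from the list.
animals = re.findall(r"cat|dog|mouse|fish",
                     "The dog chased the cat; the fish watched.")

# One pattern covering regex, regexes, regular expression(s).
variation = re.compile(r"reg(ular expressions?|ex(es)?)")
sentence = "A regex is a regular expression; regexes are regular expressions."
found = [m.group(0) for m in variation.finditer(sentence)]

print(animals)
print(found)
```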
What can regular expressions do for you • Provide better full-text search • Find a word without worrying about its variations • Find specific info written in regular forms: • dates, phone numbers, email addresses, HTML/XML tags, quotes, all-capital abbreviations… • Find two words near each other • Perform formatting tasks on a text • Automate tagging
Examples come from: Regular-expressions.info Find information written in regular forms • Exercise #6: finding dates in the form mm/dd/yy • \d\d.\d\d.\d\d • \d\d[- /.]\d\d[- /.]\d\d • [0-1]\d[- /.][0-3]\d[- /.]\d\d • (0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d • Exercise #7: finding texts within double quotes • ".*" • "[^"\r\n]*" • "[^"]*"
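The progression from loose to strict patterns is easier to see on a concrete string. The sketch below uses Python's `re` module and an invented sentence containing one impossible date. It also shows why `".*"` is too greedy for quotes: it runs from the first quotation mark to the last one on the line.

```python
import re

text = 'Filed 04/07/13, revised 13/45/99, heard 12-31-12. He said "yes" and "no".'

# Loose: any two digits in each position, so 13/45/99 slips through.
loose = re.findall(r"\d\d[- /.]\d\d[- /.]\d\d", text)

# Strict: month 01-12 and day 00-31 (day 00 is still accepted by this pattern).
strict_pattern = r"(0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d"
strict = [m.group(0) for m in re.finditer(strict_pattern, text)]

# Greedy dot-star spans both quotations; the negated class stops at each closing quote.
greedy = re.findall(r'".*"', text)
per_quote = re.findall(r'"[^"]*"', text)

print(loose, strict, greedy, per_quote)
```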
Examples come from: Regular-expressions.info Grouping and back references • Exercise #8: finding HTML/XML tags • <([a-z]+)\b[^>]*>.*?</\1> • <date format="mmddyy">04/07/13</date> • () : capturing group • \1 : back reference to the 1st captured group • If there is more than one pair of (), use \2, \3, etc. • The whole matched string is referenced as \0
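A sketch of the back reference at work, in Python's `re` module with an invented input string. The `\1` forces the closing tag to repeat whatever name the first group captured, so the mismatched `<b>…</i>` pair is skipped.

```python
import re

tag_pattern = re.compile(r"<([a-z]+)\b[^>]*>.*?</\1>")

text = ('Born <date format="mmddyy">04/07/13</date> in '
        '<place>Houston</place>; <b>mismatch</i>.')

# Collect (tag name, whole matched element) for each well-paired tag.
elements = [(m.group(1), m.group(0)) for m in tag_pattern.finditer(text)]
print(elements)
```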
Examples come from: Regular-expressions.info Formatting task • Trimming unnecessary white spaces • Replace [ \t]{2,} with a single space • Delete leading whitespace within a line: replace ^[ \t]+ with nothing (empty string) • Trim trailing whitespace of a line: replace [ \t]+$ with nothing (empty string) • Transform a text to a list of words • Append a line break after each word • Replace uppercase letters -> lowercase • Replace punctuation symbols with nothing • Count the frequency of each word in MS Excel
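The formatting steps can be chained in a script. This is a sketch under stated assumptions: Python's `re` module stands in for the editor, `collections.Counter` stands in for the Excel counting step, and the sample text is invented. The `re.MULTILINE` flag makes `^` and `$` apply to each line, which is how most editors treat them.

```python
import re
from collections import Counter

raw = "  The  cat   sat.\t\nThe cat,   the CAT!  \n"  # hypothetical messy text

# Collapse runs of two or more spaces or tabs into a single space.
step1 = re.sub(r"[ \t]{2,}", " ", raw)
# Delete leading whitespace on every line.
step2 = re.sub(r"^[ \t]+", "", step1, flags=re.MULTILINE)
# Trim trailing whitespace on every line.
cleaned = re.sub(r"[ \t]+$", "", step2, flags=re.MULTILINE)

# Lowercase, strip punctuation, split into words, and count them.
words = re.sub(r"[^\w\s]", "", cleaned.lower()).split()
counts = Counter(words)

print(repr(cleaned))
print(counts.most_common())
```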
Automate tagging • Idea: Find dates via some regex, and then surround each of the matches with tags: <date>some date</date> • Replace our date pattern: (0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d with: <date>\0</date> • Try it in the date exercise • Once you can tag useful info in a text, it will be easy to pull it out.
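The same search-and-replace can be scripted. In this sketch with Python's `re` module (an assumption; the slides use an editor), the whole match is written `\g<0>` in the replacement string, because Python does not treat `\0` as "the whole match" the way many editors do. The sample sentence is invented.

```python
import re

date_pattern = r"(0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d"
text = "Workshop held 04/05/13 to 04/07/13."

# Wrap every matched date in <date>...</date>.
tagged = re.sub(date_pattern, r"<date>\g<0></date>", text)

# Once tagged, the dates are easy to pull back out.
extracted = re.findall(r"<date>(.*?)</date>", tagged)

print(tagged)
print(extracted)
```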
Resources for regular expressions • Regular-expressions.info • http://www.regular-expressions.info/ • Profhacker article: “Finding the Women of Heimskringla with Regular Expressions” • http://chronicle.com/blogs/profhacker/finding-the-women-of-heimskringla-with-regular-expressions/38631 • <oo>→<dh> Digital humanities article: • http://dh.obdurodon.org/regex.html
Our texts today • Get familiar with it • Use regex to do some search • Search in files • Then use the techniques to prepare the text for Regex Machine
Texts for today: Old Bailey Proceedings • You can find samples in today’s package under “Old Bailey Proceedings” • Or, you can download them on your own: • select all and copy • paste it to a text editor • save it as UTF-8 without BOM (byte order mark)
Old Bailey’s Proceeding: the HTML presentation
Try some search • Search for: t\d{8}-\d{3} • Replace it with: <refNo>\0</refNo>
Exercise: Preparing your text in a specific format (to feed to some software)
How to convert? Observe! • Goal: to make each case a single line • Patterns? • Every case begins with a line of “Reference Number” and ends before the next “Reference Number” • We need to remove all the line breaks • Tricky things: does the text contain the XML reserved characters &, <, >, …?
Conversion Steps: Search and Replace + regexes • Replace the XML reserved characters: • & => &amp; • < => &lt; • > => &gt; • Get rid of “285.”: ^\d{3}\. => nothing (empty string) • Replace all the line breaks (\r, \n, \r\n) with nothing • Reassign the line breaks by “Reference number:” • Reference number: => \rReference number: • Optional: Get rid of “See original” • The order above is crucial
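The conversion can be sketched as one small function. Everything here is illustrative: Python's `re` module stands in for the editor, the function name `prepare` and the two sample cases are hypothetical, `\n` is used as the reinserted line break where the slide uses `\r`, and the final `lstrip` (which drops the empty first line) is an addition not on the slide. The `&` replacement comes first so that the `&` in `&lt;` and `&gt;` is not escaped a second time.

```python
import re

def prepare(text: str) -> str:
    """Turn each case into a single line, following the slide's order of steps."""
    # 1. Escape XML reserved characters; & must be handled before < and >.
    text = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    # 2. Drop three-digit case numbers such as "285." at the start of a line.
    text = re.sub(r"^\d{3}\.", "", text, flags=re.MULTILINE)
    # 3. Remove every line break (Windows, classic Mac, and Unix styles).
    text = re.sub(r"\r\n|\r|\n", "", text)
    # 4. Put a line break back in front of each "Reference number:".
    text = text.replace("Reference number:", "\nReference number:")
    return text.lstrip("\n")

# Hypothetical two-case sample with Windows line breaks.
raw = ("Reference number: t18500107-285\r\n"
       "285. JOHN SMITH & CO <theft>\r\nGuilty.\r\n"
       "Reference number: t18500107-286\r\n"
       "286. MARY JONES\r\nNot guilty.\r\n")

lines = prepare(raw).split("\n")
print(lines)
```

Removing line breaks with nothing, as the slide instructs, glues the last word of one line to the first word of the next; replacing them with a space instead is a possible variation.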
Credit: ElifYamagil What does the Regex Machine do? • A graphical user interface (GUI) that enables people who do not have programming skills to • graphically design patterns • match them against a corpus of texts • see results immediately via a user-friendly color-coding scheme (quick feedback) • export to XML => automates (part of) the tagging procedure
Downloading CBDB RegexMachine • Regex Machine (on CBDB website) • http://isites.harvard.edu/icb/icb.do?keyword=k16229&pageid=icb.page515758 -- download the CBDBRegexMachine_July2012.zip on this page • Prerequisites: • Make sure your machine has Java Runtime Environment (JRE) installed. If not, you can download it here: http://www.java.com/en/download/
Run the Regex Machine • Double click the CBDBRegexMachine.jar • In the “Select Your User Director” window, select the folder where you put your text files. • Tip: don’t double click the folder! Single click is all you need.
GUI List of “terms” List of active regex Your Text Info Box
Open the text we just prepared • File → Open. Select your text file.
Create Active Regexes • First regex: capture the reference number • Example: t18500107-285 • Pattern: t\d{8}-\d{3} • It’s always good to test it first in a text editor • Create it in Regex Machine • Think first: is it one unit? Does it contain different parts?
1. Click 2. Click 3. Fill in your regex and give it a name
4. Give the whole regex a name. Then choose a color! 5. Click on the Regex. Matches are highlighted!
Export to XML 6. File → Export 7. Set records per file to 1000 8. An XML file should then be generated in the same folder as the text file!
XML header added. The number is now tagged with the Handle you specified! Each line is surrounded by the tag <bio> with line number.
Try another regex • Second regex: capture the “Reference number:” and the number • Example: Reference Number: t18500107-285 • Pattern: Reference Number: t\d{8}-\d{3} • Create it in Regex Machine • Think first: Do you want it to be tagged as a whole? Should the match contain different parts?
Using multiple groups in an Active Regex • Add another Active Regex. Create two groups: • Group #1: Reference Number: • Group #2: t\d{8}-\d{3}
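The idea of splitting one match into two separately tagged parts can be imitated with capturing groups. This is only a sketch in Python's `re` module: the tag names `label` and `refNo` and the sample line are hypothetical, and the output is not meant to reproduce the XML that the Regex Machine actually exports.

```python
import re

# Group 1 is the fixed label, group 2 is the reference number itself.
pattern = re.compile(r"(Reference Number:) (t\d{8}-\d{3})")

line = "Reference Number: t18500107-285 JOHN SMITH, theft."  # invented sample

match = pattern.search(line)
label, ref_no = match.group(1), match.group(2)

# Tag each group separately instead of tagging the match as a whole.
tagged = pattern.sub(r"<label>\1</label> <refNo>\2</refNo>", line)

print(label, ref_no)
print(tagged)
```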