Introduction to Regular Expressions • Christine Moulen, MIT Libraries • ELUNA 2014
What is a regular expression? • Regular expressions are: • A language or syntax that lets you specify patterns for matching, e.g., filenames or strings • Used to identify the files or lines you want to work with • Used inside substitution functions to change the contents of a string
Command line examples • ls 14* • * is a wildcard here, not regex • 14 followed by zero or more of any character • ls 14[0-1][0-9]* • [0-1] and [0-9] are character classes, written the same way in shell patterns as in regex; each matches a single character from the range 0 to 1 or 0 to 9, respectively • ls 14[0-1][0-9][0-3][0-9]* • 6 digits that look like a date YYMMDD, mostly: the ranges also admit impossible values such as month 19 or day 39
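The three shell patterns above can be tried out without touching a real directory. The following is a minimal sketch in Python, whose fnmatch module implements shell-style wildcards; the file names are hypothetical and only the patterns come from the slide.

```python
import fnmatch

# Hypothetical file names, invented for illustration.
names = [
    "140312.log", "141225.log", "141299.log", "141939.log",
    "142001.log", "14.txt", "150101.log",
]

def glob_filter(pattern, candidates):
    """Return the candidates that the shell-style pattern matches in full."""
    return [n for n in candidates if fnmatch.fnmatchcase(n, pattern)]

any_14 = glob_filter("14*", names)
month_like = glob_filter("14[0-1][0-9]*", names)
date_like = glob_filter("14[0-1][0-9][0-3][0-9]*", names)

print(any_14)
print(month_like)
print(date_like)
```

In this sample, 141939.log survives the last pattern even though month 19 and day 39 do not exist, which is the "mostly" in the slide.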
More command line examples • mv [b-z]* $data_scratch • An alphabetical class, which depending on your system's locale and collation order might match the lowercase letters from b through z, OR a mix of upper and lower case: b C c D d ... Z z • grep 'MIT01$' sysnos.txt • Find lines that end ($) with MIT01 • ^ can be used to match at the beginning of a line
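The two anchors can be illustrated with a small grep-like filter. This is a sketch in Python rather than the shell; the sample lines and the ^00 pattern are assumptions made up for the example, while MIT01$ is the pattern from the slide.

```python
import re

ENDS_WITH_MIT01 = re.compile(r"MIT01$")   # $ anchors the match at the end of the line
STARTS_WITH_00 = re.compile(r"^00")       # ^ anchors the match at the start (hypothetical pattern)

def grep(pattern, lines):
    """Return the lines in which the compiled pattern is found."""
    return [line for line in lines if pattern.search(line)]

# Hypothetical contents of sysnos.txt.
sample_lines = [
    "000123456MIT01",
    "000123457MIT50",
    "MIT01 000123458",
    "100123459MIT01",
]

at_end = grep(ENDS_WITH_MIT01, sample_lines)
at_start = grep(STARTS_WITH_00, sample_lines)
print(at_end)
print(at_start)
```

The third sample line contains MIT01 but not at the end, so the anchored pattern skips it.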
UNIX/Linux editors • In vi, you can use regular expressions with the s/// substitution operator • With emacs, use M-x query-replace-regexp • Replace $ (the end-of-line anchor) with MIT01, e.g. :%s/$/MIT01/ in vi • This takes a list of system numbers and makes it valid input to an Aleph service by adding the library code to the end of each line
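Outside an editor, the same "replace the end of the line" idea might look like the sketch below. It is written in Python as an illustration only; the function name and the sample system numbers are hypothetical.

```python
import re

def add_library_code(text, code="MIT01"):
    """Append the library code to every line by substituting at the end-of-line anchor."""
    # splitlines() drops the newlines, so $ matches exactly once per line.
    return "\n".join(re.sub(r"$", code, line) for line in text.splitlines())

# Hypothetical list of system numbers, one per line.
sysnos = "000123456\n000123457\n000123458"
print(add_library_code(sysnos))
```

Nothing is removed from the line: $ matches a position rather than a character, so the substitution only inserts text.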
Matching example in Perl • Look through a MARC file in Aleph sequential format for lines with tag 260 • 001234567 260 L $$aCambridge$$bMIT Press • if ($matched =~ m/^\d{9}\s260.+/) { ... } • $matched is the while loop variable holding the line we're working on • =~ is the binding operator: it applies a match (m), substitution (s), or transliteration (tr) to the variable on its left • m// is the match operator
m/^\d{9}\s260.+/ • ^ start at the beginning of the line • \d Perl-speak for the digits character class • {9} a quantifier. Find exactly 9 of \d • \s Perl-speak for the whitespace char class • 260 the MARC tag I'm looking for • . any character • + a quantifier. Find 1 or more of .
Working with MARC fields • Look for deleted records • LDR position 05 is d • $my_LDR =~ /LDR L .....d/ • Look for e-resource records • $my_245 =~ /\$\$h\[electronic resource\]/ • Look for OCLC numbers • $my_035 =~ /(\(OCoLC\)\d{8,10})/ • Note the double use of () here: the outer, unescaped parentheses capture the match, while the escaped \( and \) match literal parentheses in the data
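The three field tests can be checked against sample data. This is a sketch in Python using the slide's patterns unchanged; every record line below is invented for illustration and follows the spacing shown on the slides.

```python
import re

DELETED = re.compile(r"LDR L .....d")                     # five characters (positions 00-04), then d
ERESOURCE = re.compile(r"\$\$h\[electronic resource\]")   # literal $$h[electronic resource]
OCLC = re.compile(r"(\(OCoLC\)\d{8,10})")                 # capture "(OCoLC)" plus 8 to 10 digits

# Hypothetical field lines.
ldr_deleted = "001234567 LDR L 00720dam a2200229 a 4500"
ldr_active = "001234568 LDR L 00720cam a2200229 a 4500"
field_245 = "001234567 245 L $$aSample title$$h[electronic resource]"
field_035 = "001234567 035 L $$a(OCoLC)123456789"

is_deleted = DELETED.search(ldr_deleted) is not None
is_active_flagged = DELETED.search(ldr_active) is not None
is_eresource = ERESOURCE.search(field_245) is not None
oclc_number = OCLC.search(field_035).group(1)

print(is_deleted, is_active_flagged, is_eresource, oclc_number)
```

The captured OCLC value keeps its "(OCoLC)" prefix because the escaped parentheses sit inside the capturing pair.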
Counting up records at the end of a script
if ($hash{$tmp} =~ m/SKIP/ || $hash{$tmp} =~ m/NEW/) {
    $new_count++ if (m/ FMT L /);
    $skip_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP/);
    $bre_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP Brief/);
    $bks_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP Books24x7/);
    $eebo_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP EEBO/);
    $epda_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP EPDA/);
    $sta_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP STA/);
}
Substitution example in Perl • We have a browse index of URLs • An Aleph browse index only sorts the first 69 characters of the field • When we have many URLs from the same site, we need to get the unique part closer to the beginning • The following is an SFX OpenURL from the MARCit! service
This OpenURL ... • http://owens.mit.edu/sfx_local?url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&rft.object_id=3710000000092335&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&
... becomes this. • http://owens.mit.edu/sfx_local?rft.object_id=3710000000092335&url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&
The substitution expression • $my_856 =~ s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/; • s is the substitution operator • s/pattern/replacement/ substitutes the replacement for whatever the pattern matches • Parentheses are used here to group different sections of the pattern so that they can be rearranged in the replacement
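s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/ • Four groups: $1 is everything through sfx_local?, $2 is the parameters before rft.object_id, $3 is the rft.object_id parameter itself, and $4 is the rest of the URL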
s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/ • Now change the order from $1$2$3$4 to $1$3$2$4
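The group reordering can be sketched as follows. The example is in Python, where the replacement refers to groups as \1 to \4 instead of $1 to $4; the pattern drops the backslashes before = and & and writes \d{1,} as \d+, which should be equivalent. The function name is hypothetical and the URL is a shortened version of the one on the slides.

```python
import re

# Group 1: through "sfx_local?"; group 2: parameters before the object id;
# group 3: the rft.object_id parameter; group 4: everything after it.
REORDER = re.compile(r"(^.*sfx_local\?)(.*)(rft\.object_id=\d+&)(.*$)")

def move_object_id_first(url):
    """Move the rft.object_id parameter to the front of the query string."""
    return REORDER.sub(r"\1\3\2\4", url, count=1)

# Shortened form of the OpenURL from the slides, assembled from its parts.
base = "http://owens.mit.edu/sfx_local?"
before = "url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&sfx.ignore_date_threshold=1&"
object_id = "rft.object_id=3710000000092335&"
after = "svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&"

reordered = move_object_id_first(base + before + object_id + after)
print(reordered)
```

A URL without an rft.object_id parameter does not match the pattern at all, so it comes back unchanged.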
Parsing thesis notes • Thesis degree, year, and department are stored in a single free text MARC field 502 • We have applied some structure to this, but it has varied over time • In DSpace, we want to get these 3 bits into separate fields, so the note is parsed on the way from MARC to Dublin Core
Parsing thesis notes • $MIT = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.'; • ? is the zero-or-one quantifier • | is alternation: match the pattern on either side of it • $Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f'; • A few small character classes to allow for case variation, plus an alternation for Department vs. Dept.
Parsing thesis notes • $Month = 'January|February|March|April|May|June|July|August|September|October|November|December'; • match any one month name when $Month is used inside a pattern
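The building blocks above can be tested on their own before they are combined. The sketch below is in Python, with the three strings copied from the slides; MONTH_YEAR is a hypothetical pattern added only to show interpolation.

```python
import re

MIT = r"Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\."
DEPT = r"[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f"
MONTH = ("January|February|March|April|May|June|July|August|"
         "September|October|November|December")

# The parentheses around the interpolated alternation matter: without them,
# \s+(\d{4}) would attach only to the last alternative, "December".
MONTH_YEAR = re.compile(r"(" + MONTH + r")\s+(\d{4})")

mit_forms = [s for s in ["M.I.T.", "M. I. T.", "Massachusetts Institute of Technology", "MIT"]
             if re.fullmatch(MIT, s)]
dept_forms = [s for s in ["Department of", "department Of", "Dept. of", "Dept of"]
              if re.fullmatch(DEPT, s)]
month_year = MONTH_YEAR.search("Dept. of Linguistics and Philosophy, February 2004.").groups()

print(mit_forms, dept_forms, month_year)
```

This is why the slides always write ($MIT) and ($Dept) with parentheses when the variables are used inside a larger pattern.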
Thesis. 1975. Sc.D.--Massachusetts Institute of Technology. Dept. of Mechanical Engineering • /^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o
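To see which part of the note lands in which group, the pattern can be run against the sample note. This is a sketch in Python with the slide's pattern assembled by string concatenation; Perl's /o modifier has no counterpart here and is simply omitted.

```python
import re

MIT = r"Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\."
DEPT = r"[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f"

# Groups: 1 year, 2 degree, 3 institution, 4 "Dept. of" wording, 5 department name.
THESIS_YEAR_FIRST = re.compile(
    r"^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--(" + MIT + r")\.?\s+(" + DEPT + r")?\s*(.+)$"
)

note = ("Thesis. 1975. Sc.D.--Massachusetts Institute of Technology. "
        "Dept. of Mechanical Engineering")

m = THESIS_YEAR_FIRST.match(note)
year, degree, department = m.group(1), m.group(2), m.group(5)
print(year, degree, department)
```

The three values correspond to the separate fields wanted in DSpace: year, degree, and department.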
More thesis examples • Massachusetts Institute of Technology. Dept. of Economics. Thesis. 1968. Ph.D. • Massachusetts Institute of Technology, Dept. of Civil Engineering, Thesis. 1965. Sc. D. • /^($MIT)(\.|,)?\s+($Dept)?\s*([\w\s\.,]+)\s+Thesis.\s*(\d{4})\.?\s*(.*)$/o
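The institution-first pattern can be checked against both sample notes. The sketch is in Python with the slide's pattern unchanged; the parse function and its rstrip cleanup are assumptions, added because the department group keeps the punctuation that precedes "Thesis".

```python
import re

MIT = r"Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\."
DEPT = r"[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f"

# Groups: 1 institution, 2 separator, 3 "Dept. of", 4 department, 5 year, 6 degree.
INSTITUTION_FIRST = re.compile(
    r"^(" + MIT + r")(\.|,)?\s+(" + DEPT + r")?\s*([\w\s\.,]+)\s+Thesis.\s*(\d{4})\.?\s*(.*)$"
)

def parse(note):
    """Return (department, year, degree), or None when the note does not match."""
    m = INSTITUTION_FIRST.match(note)
    if m is None:
        return None
    # Hypothetical cleanup: drop the trailing "." or "," left on the department name.
    return m.group(4).rstrip(".,"), m.group(5), m.group(6)

first = parse("Massachusetts Institute of Technology. Dept. of Economics. Thesis. 1968. Ph.D.")
second = parse("Massachusetts Institute of Technology, Dept. of Civil Engineering, "
               "Thesis. 1965. Sc. D.")
print(first)
print(second)
```

The pattern handles both the period-separated and the comma-separated note.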
More thesis examples • Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 1973. • Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics an Astronautics. • Thesis. (M.S.)--Sloan School of Management, 1983. • Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Mechanical Engineering, 1951. • Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Linguistics and Philosophy, February 2004. • /^Thesis\.?\s*\(([^\)]*)\)(\s*--?\s*|\s+)?(($MIT)[\.,]?)?\s*($Dept)?\s*(.*)(,\s+(\d{4}))?\.?$/o
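One detail may be worth checking when adapting this pattern: (.*) is greedy and the year group after it is optional, so as written the department group appears to swallow a trailing year. The Python sketch below compares the slide's pattern with a variant that uses the non-greedy (.*?); the variant is an assumption for illustration and not part of the original slides.

```python
import re

MIT = r"Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\."
DEPT = r"[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f"

HEAD = (r"^Thesis\.?\s*\(([^\)]*)\)(\s*--?\s*|\s+)?((" + MIT + r")[\.,]?)?\s*("
        + DEPT + r")?\s*")
TAIL = r"(,\s+(\d{4}))?\.?$"

AS_ON_SLIDE = re.compile(HEAD + r"(.*)" + TAIL)    # greedy department group
NON_GREEDY = re.compile(HEAD + r"(.*?)" + TAIL)    # hypothetical variant

note = ("Thesis (Ph. D.)--Massachusetts Institute of Technology, "
        "Dept. of Aeronautics and Astronautics, 1973.")
sloan = "Thesis. (M.S.)--Sloan School of Management, 1983."

greedy = AS_ON_SLIDE.match(note)
lazy = NON_GREEDY.match(note)
lazy_sloan = NON_GREEDY.match(sloan)

# Group 1 is the degree, group 6 the department text, group 8 the year.
print(greedy.group(6), greedy.group(8))
print(lazy.group(6), lazy.group(8))
print(lazy_sloan.group(1), lazy_sloan.group(6), lazy_sloan.group(8))
```

With either version, a note ending in "February 2004" keeps the month and year inside the department text, because the year group expects the four digits directly after the comma.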
More thesis examples • Thesis (Ph. D.)--Joint Program in Oceanography/Applied Ocean Science and Engineering (Massachusetts Institute of Technology, Dept. of Earth, Atmospheric, and Planetary Sciences; and the Woods Hole Oceanographic Institution), 2013. • /^Thesis\.?\s*\(([^\)]*)\)(\s*--(Joint Program in ([\w\.\s\/]+)\((($MIT)[\.,]?)?\s*($Dept)?\s*([\w,;\s]+)\)))(,\s+(\d{4}))?\.?$/o
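The joint-program pattern can be checked the same way. For this sample note to match, the program-name class has to include "/", since \w, \. and \s do not cover the slash in "Oceanography/Applied". The Python sketch below compares a class without the slash against one with it; the helper function and variable names are hypothetical.

```python
import re

MIT = r"Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\."
DEPT = r"[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f"

def joint_pattern(program_class):
    """Build the joint-program pattern around the given program-name character class."""
    return re.compile(
        r"^Thesis\.?\s*\(([^\)]*)\)(\s*--(Joint Program in (" + program_class + r"+)\(((" + MIT
        + r")[\.,]?)?\s*(" + DEPT + r")?\s*([\w,;\s]+)\)))(,\s+(\d{4}))?\.?$"
    )

WITHOUT_SLASH = joint_pattern(r"[\w\.\s]")
WITH_SLASH = joint_pattern(r"[\w\.\s/]")

note = ("Thesis (Ph. D.)--Joint Program in Oceanography/Applied Ocean Science and "
        "Engineering (Massachusetts Institute of Technology, Dept. of Earth, Atmospheric, "
        "and Planetary Sciences; and the Woods Hole Oceanographic Institution), 2013.")

no_slash_match = WITHOUT_SLASH.match(note)
m = WITH_SLASH.match(note)

# Group 4 is the program name, group 8 the department text, group 10 the year.
program, department, year = m.group(4).strip(), m.group(8), m.group(10)
print(no_slash_match)
print(program, "|", department, "|", year)
```

The program name is stripped because its character class includes whitespace and so also captures the space before the opening parenthesis.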
Questions? orbitee@mit.edu