Regular Expressions in MarcEdit • Terry Reese, Gray Family Chair for Innovative Library Resources
Topics • MarcEdit Regular Expression Support Information • Understanding .NET Regular Expressions • Major components of the language • Understanding grouping mechanisms and references • MarcEdit specific regular expression quirks • What is included in Regular Expression evaluations • In the Replace Function • In the Subfield Function • In the Delete Field Function • Getting Regular Expression Help
Getting started • Files we are going to be working with: • data_file1.mrc • test.mrk
MarcEdit Regular Expression Support • Functions that presently support regular expressions • Extract/Delete Selected MARC Records • Find Function (MarcEditor) • Replace Function (MarcEditor) • Delete Field Function (MarcEditor) • Edit Subfield Function (MarcEditor)
MarcEdit Regular Expression Support • When processing regular expressions, MarcEdit makes entire fields or subfields available for processing • e.g., in the Delete Field function, all data starting from =[field number] is part of the text that can be queried • By default, MarcEdit's regular expressions deal with one field at a time (i.e., a pattern cannot find data across fields by default) • Before version 5.x, MarcEdit used a custom regular expression engine • From 5.x on, MarcEdit's regular expression support is defined by Microsoft .NET's Regular Expression object • This object uses a syntax that looks Perl-like but has some differences
MarcEdit Regular Expression Support • When working with regular expressions with the Replace Function, MarcEdit will remember the last 10 replacements. This should help with trial and error. • When dealing with Regular Expressions or any global replacements, MarcEdit has a Special Undo function that will undo your last global update.
Microsoft’s Regular Expression language • Concepts: • Character escapes • Anchors • Character classes • Grouping • Quantifiers • Substitutions • URL: http://msdn.microsoft.com/en-us/library/az24scfc
How we use Regular Expressions in MarcEdit • The most important parts of the regular expression language are: • Character escapes: \d, \r, \n, \$, \x## • Character classes: [] and [^] • Grouping elements: () • Anchors: ^ and $ • Quantifiers: *, ?, +, {#} • Substitutions: $#
Examples • Looking at example.txt using the Replace function: • Add a period to the 500 if it is missing • Add a $h of [cartographic resource] between the $a and $c • Split the 856 into two fields, breaking on the $u
Example 1 • Add a period to the 500 if it is missing • Find What: (=500  ..)(.*[^.]$) • Replace With: $1$2. • Explanation: • (=500  ..) • Matches the 500 field. The two blanks are there because the mnemonic format always has two blank characters after the field tag. The two periods stand for any character and cover the indicators; to match exact indicators, put those values in place of the periods. • (.*[^.]$) • Matches the rest of the field, but only when the last character in the field isn’t a period.
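The pattern above can be tried outside MarcEdit. The following is a minimal sketch in Python, whose `re` module is close to, but not the same as, the .NET engine MarcEdit uses. The sample field lines are made up for illustration, and Python writes the substitutions as `\1\2` where MarcEdit writes `$1$2`.

```python
import re

# Hypothetical mnemonic-format lines; real data would come from a .mrk file.
fields = [
    r"=500  \\$aIncludes index",
    r"=500  \\$aIncludes bibliographical references.",
]

# Tag, two blanks, two indicator characters, then any text whose
# last character is not a period.
pattern = re.compile(r"(=500  ..)(.*[^.]$)")

# MarcEdit replacement "$1$2." is written "\1\2." in Python.
fixed = [pattern.sub(r"\1\2.", f) for f in fields]

for line in fixed:
    print(line)
```

Only the first line changes; the second already ends in a period, so the pattern does not match it.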
Example 2 • Add a $h of [cartographic resource] between the $a and $c • Find What: (=245.{4})(\$a.*)(/\$c.*) • (=245.{4}) • Match the 245 field, accepting any value in the next 4 characters • (\$a.*) • Select everything within subfield a • (/\$c.*) • Select the / and subfield c (and any data after it) • Replace With: $1$2$$h[cartographic resource] $3 • In the replacement, $$ inserts a literal $
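As a rough check of Example 2, here is a Python sketch that uses the `(/\$c.*)` form of the third group given in the explanation. The 245 field is an invented sample. Python does not treat `$` specially in a replacement string, so MarcEdit's `$$h` becomes a plain `$h`.

```python
import re

# Hypothetical 245 field in mnemonic format.
field = r"=245  10$aMap of Oregon /$cOregon Dept. of Transportation."

# Group 1: tag plus the next four characters (two blanks, two indicators).
# Group 2: subfield a, up to the slash. Group 3: the slash and subfield c.
pattern = re.compile(r"(=245.{4})(\$a.*)(/\$c.*)")

# MarcEdit: $1$2$$h[cartographic resource] $3
result = pattern.sub(r"\1\2$h[cartographic resource] \3", field)
print(result)
```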
Example 3 • Split the 856 into two fields, breaking on the $u • Find What: (=856.{4})(\$u.*[^$])(\$u.*) • (=856.{4}) • Matches the 856 field • (\$u.*[^$]) • Match the first $u, stopping at the end of that subfield • (\$u.*) • Match the remainder of the field • Replace With: $1$2\n=856 41$3
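The split can be sketched in Python as follows. The URLs are placeholders, and the replacement assumes the new field should carry two blanks after the tag, as the mnemonic format does elsewhere in these slides.

```python
import re

# Hypothetical 856 field holding two $u subfields.
field = r"=856  41$uhttp://example.org/a$uhttp://example.org/b"

# Group 2 is the first $u, group 3 is everything from the second $u on.
pattern = re.compile(r"(=856.{4})(\$u.*[^$])(\$u.*)")

# MarcEdit: $1$2\n=856 41$3 ; Python also turns \n into a line break here.
result = pattern.sub(r"\1\2\n=856  41\3", field)
lines = result.split("\n")
for line in lines:
    print(line)
```

Because `.*` is greedy, a field with three or more $u subfields would break at the last $u, not the first.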
Lcase/ucase • MarcEdit’s regular expression engine includes two extension functions for switching the case of characters • lcase & ucase • Usage: (=450.{4})(\$a.)(.*) • $1$2lcase($3) • Example: Find the 500 with all upper case characters and convert everything but the first letter in the sentence to lower case.
Example (Lcase) • Find the 500 with all upper case characters and convert the case of all values but the first letter in the sentence to lower case. • Find What: (=500.{4})(\$a.)([A-Z .]*) • Replace With: $1$2lcase($3)
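`lcase` is a MarcEdit extension, so there is no direct equivalent in other regex engines. In Python the same effect can be approximated with a replacement function, as in this sketch (the sample note text is invented):

```python
import re

# Hypothetical 500 field typed entirely in upper case.
field = r"=500  \\$aINCLUDES INDEX."

# Group 2 keeps "$a" plus the first letter; group 3 is the upper-case tail.
pattern = re.compile(r"(=500.{4})(\$a.)([A-Z .]*)")

def lower_tail(match):
    # Stand-in for MarcEdit's $1$2lcase($3).
    return match.group(1) + match.group(2) + match.group(3).lower()

result = pattern.sub(lower_tail, field)
print(result)
```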
Multi-Field Replacements • By default, MarcEdit handles one field at a time when evaluating regular expressions • When you need to evaluate a pattern against multiple fields, add /m to the end of the Find expression in the Replace Function in the MarcEditor • This is a special function added to the MarcEdit regular expression engine
Example • Using test.mrk • The file has multiple 028 fields. The first field has a $a and $b, the second only a $a. Copy the $b to the second 028, but only if the two fields are consecutive
Multi-Line Example • The file has multiple 028 fields. The first field has a $a and $b, the second only a $a. Copy the $b to the second 028, but only if the two fields are consecutive • Find What: (=028.{4}\$a[^\$]+)(\$b[^\$]+)(\r?\n)(=028.{4}\$a[^\$\r\n]+)(\r?\n)/m • Replace With: $1$2$3$4$2$3
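The trailing /m is MarcEdit's own switch and is not part of the regular expression itself. A Python sketch can imitate the multi-field behaviour by running the pattern, without /m, against the whole record text at once. The record below is invented for illustration.

```python
import re

# Hypothetical record fragment: two consecutive 028 fields, then a 245.
text = (
    "=028  02$a12345$bAcme Records\n"
    "=028  02$a67890\n"
    "=245  10$aTitle\n"
)

# Groups: first 028 up to $b, the $b itself, a line break,
# the second 028 (only a $a), and its line break.
pattern = re.compile(
    r"(=028.{4}\$a[^\$]+)(\$b[^\$]+)(\r?\n)(=028.{4}\$a[^\$\r\n]+)(\r?\n)"
)

# MarcEdit: $1$2$3$4$2$3 (group 2, the $b, is emitted twice).
result = pattern.sub(r"\1\2\3\4\2\3", text)
print(result)
```

If another field sat between the two 028 lines, the pattern would not match and nothing would be copied, which is the "only if consecutive" condition.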
Move Data Example • You can use regular expressions to re-order data within a field • Taking example 3, move the $h in the 245 to be resituated between the $a and $c – taking into account punctuation
Move Data Example • Taking example 3, move the $h in the 245 to be resituated between the $a and $c, taking punctuation into account • Find What: (=245.{4})(\$a.*[^/\$])(/\$c.*[^$])(\$h.*) • Replace With: $1$2$4 $3
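Reordering is only a matter of emitting the groups in a different order. A Python sketch, using an invented 245 whose $h sits after the $c:

```python
import re

# Hypothetical 245 with the $h at the end of the field.
field = (
    r"=245  10$aMap of Oregon /$cOregon Dept. of Transportation."
    r"$h[cartographic resource]"
)

# Group 2: subfield a. Group 3: slash and subfield c. Group 4: subfield h.
pattern = re.compile(r"(=245.{4})(\$a.*[^/\$])(/\$c.*[^$])(\$h.*)")

# MarcEdit: $1$2$4 $3 (the $h moves in front of the slash).
result = pattern.sub(r"\1\2\4 \3", field)
print(result)
```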
Delete Field Function • The delete field function exposes all the data in the field to be acted upon as a regular expression. • i.e. =856 .* • So the first value in the Delete Field evaluation is an =, not the subfield data • The reason to do this is to allow for explicit evaluations of indicators.
Delete Field Examples • Using example2.mrk • Delete 856 fields that have a first indicator of 4 and do not have a second indicator of one (1) • Using test.mrk • Delete all 6xx fields that have diacritics (values with {})
Delete Example 1 • Delete 856 fields that have a first indicator of 4 and do not have a second indicator of one (1) • Field: 856 • Find: ^=856.{2}(4[02-9])|^=856.{2}5 • Keeps 1 field
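To see which fields this Find expression selects, the pattern can be run over a few sample lines in Python. The three fields below are invented and are not the contents of example2.mrk; the sketch treats a match as "delete this field".

```python
import re

# Hypothetical 856 fields that differ only in the second indicator.
fields = [
    r"=856  40$uhttp://example.org/a",
    r"=856  41$uhttp://example.org/b",
    r"=856  42$uhttp://example.org/c",
]

# The Find expression from the slide, unchanged.
pattern = re.compile(r"^=856.{2}(4[02-9])|^=856.{2}5")

# A field is deleted when the pattern matches, so keep the rest.
kept = [f for f in fields if not pattern.search(f)]
print(kept)
```

Only the field with indicators 41 survives, because `[02-9]` excludes the digit 1.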
Delete Example 2 • Delete all 6xx fields that have diacritics (values with {}) • Field: 6xx • Find: \{.*\} • Removes one field
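A comparable sketch for the diacritics case follows. The `6xx` field selector belongs to MarcEdit's dialog and not to the regular expression, so the Python version imitates it with a separate tag check. The sample fields, including the `{eacute}` mnemonic, are illustrative.

```python
import re

# Hypothetical fields; braces mark a diacritic in mnemonic format.
fields = [
    r"=245  10$aCaf{eacute} culture",
    r"=650  \0$aCaf{eacute}s",
    r"=650  \0$aMaps",
]

def is_6xx(field):
    # Stand-in for choosing "6xx" in the Field box.
    return re.match(r"=6\d\d", field) is not None

# The Find expression from the slide.
diacritic = re.compile(r"\{.*\}")

kept = [f for f in fields if not (is_6xx(f) and diacritic.search(f))]
print(kept)
```

The 245 keeps its diacritic because it falls outside the 6xx range.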
Edit Subfield Function • The Edit Subfield function exposes data starting with the subfield code • i.e., $atest would expose atest • Can be used for both control data and variable field data • Regular expressions can be used both for replacing and for removing data • The lcase and ucase functions can be used with the Edit Subfield function
Edit Subfield Function Example • Using example.mrk • Make the 049$a all upper case
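The slide does not give the expressions for this exercise. One way to approach it, sketched in Python under the assumption that the function exposes the subfield code followed by its data (as in the atest example), is to capture the code separately and upper-case only the rest. The value `aore` is hypothetical.

```python
import re

# Hypothetical exposed text for 049$a: subfield code "a" plus the data.
exposed = "aore"

# Group 1 is the subfield code, group 2 the data to upper-case.
pattern = re.compile(r"^(.)(.*)$")

def upper_data(match):
    # Stand-in for a MarcEdit replacement along the lines of $1ucase($2).
    return match.group(1) + match.group(2).upper()

result = pattern.sub(upper_data, exposed)
print(result)
```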
Getting Regular Expression Help • The MarcEdit Listserv has a number of regular expression experts that provide a lot of help to users looking for it • http://metis3.gmu.edu/cgi-bin/wa?A0=MARCEDIT-L