Perl Regular Expressions in SAS 9 Ruth Yuee Zhang, CFE Jan 10, 2005
Outline • Introduction • SAS Syntax • Meta-Characters • Examples
Introduction: Regular Expressions
• A powerful tool for manipulating text data, available in, e.g., Perl, Java, PHP, and Emacs
• Locate a pattern in text strings
• Obtain the position of the pattern
• Extract a substring
• Substitute one string for another
Introduction
• SAS Regular Expressions: RX functions (RXPARSE, RXMATCH, RXCHANGE, etc.)
• Perl Regular Expressions: PRX functions (PRXPARSE, PRXMATCH, PRXCHANGE, PRXPOSN, PRXDEBUG, etc.)
SAS Syntax: Function PRXPARSE
PRXPARSE (perl-regular-expression);
Compiles a Perl regular expression and returns a pattern identifier to be used later by the other Perl regular expression functions.
• perl-regular-expression: the Perl regular expression to compile.
SAS Syntax
data _null_;
  ** create the regular expression only once **;
  retain myregex;
  if _N_ = 1 then myregex = PRXPARSE("/cat/");
  /* matches the lower-case text "cat" anywhere in the string */
  input string $30.;
  ** matching the regular expression **;
  position = PRXMATCH(myregex, string);
  put position=;
  datalines;
It is a cat
Does not match a CAT
cat in the beginning
;
run;
Position: 9 0 1
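The same lookup can be tried outside SAS with Python's `re` module, which shares this part of Perl syntax. This is only an illustrative sketch, not the SAS API: the helper name `prxmatch_like` is hypothetical, and it adds 1 to Python's 0-based offset to mimic the 1-based position that PRXMATCH returns.

```python
import re

def prxmatch_like(pattern, text):
    """Hypothetical helper: 1-based position of the first match, or 0 if none."""
    m = re.search(pattern, text)
    return m.start() + 1 if m else 0

lines = ["It is a cat", "Does not match a CAT", "cat in the beginning"]
positions = [prxmatch_like(r"cat", s) for s in lines]
print(positions)  # [9, 0, 1]
```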
Meta-Characters Position Characters ^ and $
• “^cat”: matches the beginning of a string; matches “cat” and “cats” but not “the cat”
• “cat$”: matches the end of a string; matches “the cat” and “cat” but not “cat in the hat”
• “^cat$”: a string that starts and ends with “cat” -- that could only be “cat” itself!
• “cat”: a string that has the text “cat” in it. Matches “cat”, “cats”, “the cat”, “catch”
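The anchor behavior described above can be checked with a small Python sketch. It uses the `re` module rather than SAS, on the assumption that `^` and `$` behave the same way for single-line strings like these; the word list and the helper name `matching` are made up for illustration.

```python
import re

words = ["cat", "cats", "the cat", "cat in the hat", "catch"]

def matching(pattern):
    """Return the sample words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

starts = matching(r"^cat")    # text begins with "cat"
ends = matching(r"cat$")      # text ends with "cat"
exact = matching(r"^cat$")    # text is exactly "cat"
anywhere = matching(r"cat")   # "cat" appears somewhere

print(starts)    # ['cat', 'cats', 'cat in the hat', 'catch']
print(ends)      # ['cat', 'the cat']
print(exact)     # ['cat']
print(anywhere)  # all five words
```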
Meta-Characters
• “\d”: matches a digit 0 to 9. “\d\d\d” matches any three-digit number (123, 389)
• “\w”: matches a word character: any upper- or lower-case letter, digit, or underscore (but not a blank). “\w\w\w” matches any three-letter word
• “\s”: matches a white-space character, such as a blank or a tab. “\d\s\w” matches “1 a”, “6 x”.
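A quick way to confirm what these character classes accept is to test them against whole strings. The sketch below uses Python's `re.fullmatch` as a stand-in for SAS; the test strings are invented, and for plain ASCII input like this the classes are assumed to behave as in Perl.

```python
import re

# Each entry: (pattern, text, whether the whole text is expected to match)
checks = [
    (r"\d\d\d", "389", True),    # three digits
    (r"\d\d\d", "38a", False),   # a letter is not a digit
    (r"\w\w\w", "a_1", True),    # letter, underscore, and digit are word characters
    (r"\w", " ", False),         # a blank is not a word character
    (r"\d\s\w", "1 a", True),    # digit, blank, word character
    (r"\d\s\w", "6\tx", True),   # a tab also counts as white space
]

results = [re.fullmatch(p, t) is not None for p, t, _ in checks]
print(results)  # [True, False, True, False, True, True]
```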
Meta-Characters Quantifiers *, + and ?
• “c(at)*”: matches a string that has a “c” followed by zero or more “at” (“c”, “cat”, “catatat”)
• “c(at)+”: same, but there's at least one “at” (“cat”, “catat”, etc.)
• “c(at)?”: same, but there's zero or one “at” (“c”, “cat”)
• “c?a+t$”: a possible “c” followed by one or more “a” ending with “t” (“cat”, “at”, “aaat”)
Meta-Characters Quantifiers
• “\d{3}”: matches exactly three digits and is equivalent to “\d\d\d”
• “\w{3,}”: matches three or more word characters and is equivalent to “\w\w\w+” (“cat”, “_NULL_”)
• “\w{3,5}”: matches at least three but no more than five word characters (“cat”, “cats”, “catch”)
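The quantifiers `*`, `+`, `?`, and `{n,m}` can be compared side by side with a short Python sketch. It relies on `re.fullmatch`, so each pattern must cover the entire candidate string; the candidate lists and the helper name `full` are illustrative assumptions, not part of the SAS examples.

```python
import re

def full(pattern, text):
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

candidates = ["c", "cat", "catatat", "ca"]
star = [s for s in candidates if full(r"c(at)*", s)]      # zero or more "at"
plus = [s for s in candidates if full(r"c(at)+", s)]      # one or more "at"
optional = [s for s in candidates if full(r"c(at)?", s)]  # zero or one "at"

words = ["at", "cat", "cats", "catch", "_NULL_"]
three_plus = [w for w in words if full(r"\w{3,}", w)]      # at least three word characters
three_to_five = [w for w in words if full(r"\w{3,5}", w)]  # three to five word characters

print(star)           # ['c', 'cat', 'catatat']
print(plus)           # ['cat', 'catatat']
print(optional)       # ['c', 'cat']
print(three_plus)     # ['cat', 'cats', 'catch', '_NULL_']
print(three_to_five)  # ['cat', 'cats', 'catch']
```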
Meta-Characters
• “.”: matches exactly one character, whatever it is. “c.t” matches “cat”, “cut”, “cot”, “cit”.
• “c(a|u)t”: matches “cat”, “cut”
• “c[auo]t”: matches “cat”, “cut”, “cot”
• “[a-e]”: matches one of the letters “a” to “e”. “c[a-e]t” matches “cat”, “cbt”, “cct”
• “[^abc]”: matches any single character except “a”, “b”, or “c”. “c[^abc]t” matches “cut”, “cot” but not “cat”, “cbt”
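The differences between the wildcard, alternation, and bracket forms show up clearly when all of them are run over one word list. This is a Python `re` sketch under the assumption that these constructs behave as in Perl; the word list and the helper name `matches` are made up for the comparison.

```python
import re

words = ["cat", "cut", "cot", "cit", "cbt", "cct"]

def matches(pattern):
    """Return the sample words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

any_char = matches(r"c.t")        # any single middle character
either = matches(r"c(a|u)t")      # "a" or "u"
listed = matches(r"c[auo]t")      # one of a, u, o
ranged = matches(r"c[a-e]t")      # one letter from a through e
negated = matches(r"c[^abc]t")    # anything except a, b, c

print(any_char)  # all six words
print(either)    # ['cat', 'cut']
print(listed)    # ['cat', 'cut', 'cot']
print(ranged)    # ['cat', 'cbt', 'cct']
print(negated)   # ['cut', 'cot', 'cit']
```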
Ex #1 A Simple Search
  ** create the regular expression only once **;
  retain myregex;
  if _N_ = 1 then do;
    myregex = PRXPARSE("/m[ea]th[ea][dt]one?/i");
    /* "e?": zero or one "e"
       "i": ignore case when matching */
  end;
  ** create a flag of whether matching or not **;
  myflag = min(PRXMATCH(myregex, drugname), 1);
Matched: methadone, Metheton, methadon, mathatone, METHEDONE, METHADON
Function PRXMATCH
PRXMATCH (pattern-id, string);
Returns the first position in the string where the regular expression match is found. If the pattern is not found, it returns 0.
• pattern-id: the value returned from the PRXPARSE function.
• string: the variable that you are interested in.
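The 0/1 flag from Ex #1 can be reproduced in Python as a rough cross-check of the pattern. The sketch assumes `re.IGNORECASE` plays the role of the `/i` modifier; "morphine" is added to the list only as a made-up non-matching control.

```python
import re

# Same pattern as Ex #1; re.IGNORECASE stands in for the trailing /i
pattern = re.compile(r"m[ea]th[ea][dt]one?", re.IGNORECASE)

names = ["methadone", "Metheton", "methadon", "mathatone",
         "METHEDONE", "METHADON", "morphine"]

# 1 when the pattern is found anywhere in the name, 0 otherwise
flags = {name: int(pattern.search(name) is not None) for name in names}

for name, flag in flags.items():
    print(name, flag)
```

Only "morphine" should come back with a flag of 0.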
Ex #2 Validating the format
• A sample of the data:
  Hydro-Chlorothiazide 25.5
  Ziagen 200mg
  Zerit mg
  Insulin 20 cc
  Dapsone 100 g
  Kaletra 3 tabs
• Improperly formatted data:
  Hydro-Chlorothiazide 25.5
  Zerit mg
Ex #2 Validating the format
  ** create the regular expression **;
  myregex = PRXPARSE("/^\D+\d{1,4}\.?\d{0,4}\s?(tabs?|caps?|cc|m?g)/i");
  /* "^\D+": starts with a group of non-digits
     "\d{1,4}": followed by one to four digits
     "\.?": an optional period
     "\d{0,4}": may be followed by up to four more digits
     "\s?": an optional space
     "(tabs?|caps?|cc|m?g)": units of measure: tab, tabs, cap, caps, cc, mg, g
     "/i": ignore the case */
  ** catch poorly formatted data **;
  if PRXMATCH(myregex, medication) = 0;
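To see which sample records the validation pattern rejects, the check can be sketched in Python. This uses `re.search` in place of PRXMATCH and is meant only as a cross-check of the pattern, not as the SAS program itself.

```python
import re

# Same pattern as Ex #2; re.IGNORECASE stands in for the trailing /i
pattern = re.compile(r"^\D+\d{1,4}\.?\d{0,4}\s?(tabs?|caps?|cc|m?g)", re.IGNORECASE)

records = [
    "Hydro-Chlorothiazide 25.5",
    "Ziagen 200mg",
    "Zerit mg",
    "Insulin 20 cc",
    "Dapsone 100 g",
    "Kaletra 3 tabs",
]

# Keep only the records where no match is found (the poorly formatted ones)
bad = [r for r in records if pattern.search(r) is None]
print(bad)  # ['Hydro-Chlorothiazide 25.5', 'Zerit mg']
```

The first record fails because it has a dose but no unit, the second because it has a unit but no dose.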
Ex #3 Extracting Text To extract what the patients are reporting to the investigators: Patient reported headache and nausea. MD noticed rash. Pt. Rptd. Backache. Patient reported seeing spots. Elevated pulse and labored breathing. Headache. Extracted field: headache and nausea Backache seeing spots
Function CALL PRXPOSN
CALL PRXPOSN (pattern-id, capture-buffer-number, start <, length>);
Returns the position and length for a capture buffer. Used in conjunction with PRXPARSE and PRXMATCH.
• pattern-id: the value returned from the PRXPARSE function.
• capture-buffer-number: a number indicating which capture buffer is to be evaluated.
• start: the value of the first position where the particular capture buffer is found.
• length: the length of the found pattern.
Ex #3 Extracting Text
  ** create the regular expression **;
  myregex = PRXPARSE("/(reported|rptd?\.?)(.*?\.)/i");
  /* "(reported|rptd?\.?)": 1st capture buffer. Captures the word "reported", "rpt", "rpt.", "rptd", or "rptd."
     "(.*?\.)": 2nd capture buffer. Any characters up to and including the first period;
                the lazy ".*?" stops at the first period, whereas a greedy ".*" would run to the last period in the comment
     "/i": ignore the case */
Ex #3 Extracting Text
  ** only call PRXPOSN if matching **;
  if PRXMATCH(myregex, comments) then do;
    /* get the position and length of the match for the 2nd capture buffer */
    call PRXPOSN(myregex, 2, pos, len);
    /* extract the substring excluding the end period */
    pt_comments = substr(comments, pos, len-1);
  end;
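The capture-buffer extraction can be sketched in Python, where `group(2)` plays the role of the second capture buffer. Two assumptions are made here: the comment records are split as listed (the slide does not show where each record ends), and the second group uses a lazy `.*?` so that it stops at the first period.

```python
import re

# Group 1: "reported", "rpt", "rpt.", "rptd", or "rptd."
# Group 2: everything up to and including the first period (lazy .*?)
pattern = re.compile(r"(reported|rptd?\.?)(.*?\.)", re.IGNORECASE)

# Assumed record boundaries for the sample comments
comments = [
    "Patient reported headache and nausea. MD noticed rash.",
    "Pt. Rptd. Backache.",
    "Patient reported seeing spots.",
    "Elevated pulse and labored breathing.",
    "Headache.",
]

extracted = []
for text in comments:
    m = pattern.search(text)
    if m:
        # Drop the closing period, then trim the blank left after the keyword
        extracted.append(m.group(2)[:-1].strip())

print(extracted)  # ['headache and nausea', 'Backache', 'seeing spots']
```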
Ex #4 Substitute one string for another
Replace all the following by “Multi-vitamin”:
  multivitamin
  multi-vitamin
  multi-vita
  multivit
  multi-vit
  multi vitamin
Function CALL PRXCHANGE
CALL PRXCHANGE (pattern-id, times, old-string <, new-string <, result-length <, truncation-value <, number-of-changes>>>>);
To substitute one string for another.
• times: the number of times to search for and replace a string.
• old-string: the string that you want to replace.
Ex #4 Substitute one string for another
  ** create regular expression **;
  myregex = PRXPARSE("s/multi[- ]?vita?(min)?/Multi-vitamin/");
  /* "s/": indicates that the regular expression will be used in a substitution
     "[- ]?": optional "-" or space
     "a?": optional "a"
     "(min)?": optional "min" */
Ex #4 Substitute one string for another
  ** using the myregex id created above **;
  call PRXCHANGE(myregex, -1, drugname);
  /* "-1" indicates that the pattern should be changed at every occurrence */
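The substitution can be checked against all six spellings with a Python sketch. `re.sub` replaces every occurrence by default, which corresponds to passing -1 as the `times` argument; this illustrates the pattern only and is not the SAS call.

```python
import re

# Match side of the s/.../.../ expression from Ex #4
pattern = re.compile(r"multi[- ]?vita?(min)?")

variants = ["multivitamin", "multi-vitamin", "multi-vita",
            "multivit", "multi-vit", "multi vitamin"]

# Replace every occurrence in each string
cleaned = [pattern.sub("Multi-vitamin", v) for v in variants]
print(cleaned)  # six copies of 'Multi-vitamin'
```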
Function CALL PRXNEXT
CALL PRXNEXT (pattern-id, start, stop, string, position, length);
To locate the nth occurrence of a pattern. Each call identifies the next occurrence of the pattern.
• start: the starting position to begin the search
• stop: the last position in the string for the search
• string: the string to be searched
• position: the starting position of the nth occurrence of the pattern
• length: the length of the pattern
Ex #5 Finding digits in random positions
  ** create the regular expression **;
  myregex = PRXPARSE("/\d+/");
  ** "\d+": look for one or more digits **;
  start = 1;
  stop = length(string);
  ** get the position and length of the first occurrence **;
  call PRXNEXT(myregex, start, stop, string, pos, len);
Ex #5 Finding digits in random positions
  array x[5];
  ** continue until no more digits are found (pos=0) **;
  do i = 1 to 5 while (pos gt 0);
    ** extract the current occurrence **;
    x[i] = input(substr(string, pos, len), 9.);
    ** get the position and length of the next occurrence **;
    call PRXNEXT(myregex, start, stop, string, pos, len);
  end;
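The repeated-search loop has a close analogue in Python's `re.finditer`, which yields one match per occurrence. In the sketch below the input string is hypothetical, the limit of five mirrors the `x[5]` array, and positions are shifted to 1-based to resemble what CALL PRXNEXT reports.

```python
import re

text = "12 Broad street, Apt 7, Route 730"  # hypothetical input string
max_values = 5  # mirrors the five-element array x[5]

found = []
for m in re.finditer(r"\d+", text):
    if len(found) == max_values:
        break
    # (1-based position, length, numeric value), like pos, len, and x[i]
    found.append((m.start() + 1, len(m.group()), int(m.group())))

print(found)  # [(1, 2, 12), (22, 1, 7), (31, 3, 730)]
```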
Ex #6 Locating zip codes
String:
  John Smith 12 Broad street Flemington, NJ 08822
  Philip Judson Apt #1, Building 7 777 Route 730 Kerrville, TX 78028
  Dr. Roger Alan 44 Commonwealth Ave. Boston, MA 02116-7364
Zip_code:
  08822
  78028
  02116-7364
Function CALL PRXSUBSTR
CALL PRXSUBSTR (pattern-id, string, start <, length>);
Returns the starting position and the length of the match.
• string: the string to be searched
• start: the starting position of the pattern
• length: the length of the substring
Ex #6 Locating zip codes
  ** create the regular expression **;
  myregex = PRXPARSE("/ \d{5}(-\d{4})?/");
  /* match a blank followed by 5 digits followed by either nothing or a dash and 4 digits
     "\d{5}": matches 5 digits
     "-": matches a dash
     "\d{4}": matches 4 digits
     "?": matches zero or one of the preceding subexpression */
Ex #6 Locating zip codes
  call PRXSUBSTR(myregex, string, start, length);
  ** only extract the substring if the pattern is found **;
  ** the start position is after the blank **;
  if start gt 0 then zip_code = substrn(string, start+1, length-1);
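The zip code lookup can be verified on the three sample strings with a Python sketch. Slicing off the first character of the match corresponds to the `start+1` and `length-1` adjustment that skips the leading blank. Note that the pattern takes the first blank followed by five digits, so an address with a five-digit street number would need a stricter pattern.

```python
import re

# A blank, five digits, then an optional dash and four digits
pattern = re.compile(r" \d{5}(-\d{4})?")

addresses = [
    "John Smith 12 Broad street Flemington, NJ 08822",
    "Philip Judson Apt #1, Building 7 777 Route 730 Kerrville, TX 78028",
    "Dr. Roger Alan 44 Commonwealth Ave. Boston, MA 02116-7364",
]

zip_codes = []
for address in addresses:
    m = pattern.search(address)
    if m:
        # Skip the leading blank that is part of the match
        zip_codes.append(m.group(0)[1:])

print(zip_codes)  # ['08822', '78028', '02116-7364']
```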
References
• SAS Functions by Example, Ron Cody
• An Introduction to Regular Expressions with Examples from Clinical Data, Richard Pless, Ovation Research Group, Highland Park, IL
• How Regular Expressions Really Work, Jack N. Shoemaker, Greensboro, NC