
Perl Regular Expressions in SAS 9

Perl Regular Expressions in SAS 9. Ruth Yuee Zhang, CFE Jan 10, 2005. Outline. Introduction SAS Syntax Meta-Characters Examples. Introduction. Regular Expressions A powerful tool for manipulating text data. Eg. Perl, Java, PHP and Emacs. Locate a pattern in text strings

Presentation Transcript


  1. Perl Regular Expressions in SAS 9 Ruth Yuee Zhang, CFE Jan 10, 2005

  2. Outline • Introduction • SAS Syntax • Meta-Characters • Examples

  3. Introduction: Regular Expressions
     • A powerful tool for manipulating text data, supported in, e.g., Perl, Java, PHP and Emacs.
     • Locate a pattern in text strings
     • Obtain the position of the pattern
     • Extract a substring
     • Substitute one string for another

  4. Introduction
     • SAS Regular Expressions: the RX functions (RXPARSE, RXMATCH, RXCHANGE, etc.)
     • Perl Regular Expressions: the PRX functions (PRXPARSE, PRXMATCH, PRXCHANGE, PRXPOSN, PRXDEBUG, etc.)

  5. SAS Syntax: Function PRXPARSE
     PRXPARSE (perl-regular-expression);
     Defines a Perl regular expression and returns a pattern-id to be used later by the other Perl regular expression functions.
     • perl-regular-expression: the Perl regular expression to define.

  6. SAS Syntax
     data _null_;
       ** create the regular expression only once **;
       if _N_ = 1 then myregex = PRXPARSE("/cat/");
       ** matches the lower-case text "cat" anywhere in the string **;
       retain myregex;
       input string $30.;
       ** matching the regular expression **;
       position = PRXMATCH(myregex, string);
       put position=;
       datalines;
     It is a cat
     Does not match a CAT
     cat in the beginning
     ;
     run;
     Position: 9 0 1

  7. Meta-Characters: Position Characters ^ and $
     • “^cat”: matches at the beginning of a string; matches “cat” and “cats” but not “the cat”
     • “cat$”: matches at the end of a string; matches “the cat” and “cat” but not “cat in the hat”
     • “^cat$”: a string that starts and ends with “cat”, which can only be “cat” itself
     • “cat”: a string that has the text “cat” anywhere in it; matches “cat”, “cats”, “the cat”, “catch”

  8. Meta-Characters
     • “\d”: matches a digit 0 to 9. “\d\d\d” matches any three consecutive digits (123, 389)
     • “\w”: matches any word character: an upper- or lower-case letter, a digit or an underscore (not a blank). “\w\w\w” matches any three consecutive word characters, such as a three-letter word
     • “\s”: matches a white-space character, such as a blank or a tab. “\d\s\w” matches “1 a”, “6 x”

  9. Meta-Characters: Quantifiers *, + and ?
     • “c(at)*”: matches a string that has a “c” followed by zero or more “at” (“c”, “cat”, “catatat”)
     • “c(at)+”: same, but there is at least one “at” (“cat”, “catat”, etc.)
     • “c(at)?”: same, but there is zero or one “at” (“c”, “cat”)
     • “c?a+t$”: an optional “c” followed by one or more “a”, ending with “t” (“cat”, “at”, “aaat”)

  10. Meta-Characters: Quantifiers
      • “\d{3}”: matches any three consecutive digits and is equivalent to “\d\d\d”
      • “\w{3,}”: matches three or more word characters and is equivalent to “\w\w\w+” (“cat”, “_NULL_”)
      • “\w{3,5}”: matches at least three but no more than five word characters (“cat”, “cats”, “catch”)
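The quantifier rules above can be tried out in a short DATA step. This is only a sketch: the word list is illustrative, the "^" and "$" anchors are added here so that the whole word has to match, and passing the pattern text directly to PRXMATCH (instead of a pattern-id from PRXPARSE) is assumed to be available, as it is in SAS 9.1 and later.

```sas
data _null_;
  length word $10;
  /* illustrative test words, not taken from the slides */
  do word = 'c', 'cat', 'catatat', 'at', 'aaat';
    /* STRIP removes the trailing blanks that pad a character variable, */
    /* so that "$" can match right after the last letter                */
    star = prxmatch('/^c(at)*$/', strip(word)) > 0;  /* zero or more "at" */
    plus = prxmatch('/^c(at)+$/', strip(word)) > 0;  /* one or more "at"  */
    opt  = prxmatch('/^c(at)?$/', strip(word)) > 0;  /* zero or one "at"  */
    tail = prxmatch('/c?a+t$/',   strip(word)) > 0;  /* ends in a...t     */
    put word= star= plus= opt= tail=;
  end;
run;
```

Each flag is 1 when the word matches the pattern and 0 otherwise, so the log shows which words each quantifier accepts.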

  11. Meta-Characters
      • “.”: matches exactly one character of any kind. “c.t” matches “cat”, “cut”, “cot”, “cit”
      • “c(a|u)t”: matches “cat”, “cut”
      • “c[auo]t”: matches “cat”, “cut”, “cot”
      • “[a-e]”: matches one of the letters “a” to “e”. “c[a-e]t” matches “cat”, “cbt”, “cct”
      • “[^abc]”: matches any one character except “a”, “b” or “c”. “c[^abc]t” matches “cut”, “cot” but not “cat”, “cbt”
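A few of the position characters and character classes above, checked with string constants. This is a minimal sketch that again assumes PRXMATCH accepts the pattern text directly (SAS 9.1 and later). Constants are used because a character variable is padded with trailing blanks, which would stop a pattern such as “cat$” from matching unless the value is trimmed first.

```sas
data _null_;
  p1 = prxmatch('/^cat/',     'cats');            /* 1: "cat" at the start   */
  p2 = prxmatch('/^cat/',     'the cat');         /* 0: not at the start     */
  p3 = prxmatch('/cat$/',     'the cat');         /* 5: "cat" at the end     */
  p4 = prxmatch('/cat$/',     'cat in the hat');  /* 0: not at the end       */
  p5 = prxmatch('/c[^abc]t/', 'cut');             /* 1: "u" is not a, b or c */
  p6 = prxmatch('/c[^abc]t/', 'cat');             /* 0: "a" is excluded      */
  p7 = prxmatch('/\d\s\w/',   '1 a');             /* 1: digit, blank, letter */
  put p1= p2= p3= p4= p5= p6= p7=;
run;
```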

  12. Ex #1 A Simple Search
      ** create the regular expression only once **;
      retain myregex;
      if _N_ = 1 then do;
        myregex = PRXPARSE("/m[ea]th[ea][dt]one?/i");
        /* "e?": zero or one "e"
           "i": ignore case when matching */
      end;
      ** create a flag: 1 if the pattern matches, 0 otherwise **;
      myflag = min(PRXMATCH(myregex, drugname), 1);
      Matched: methadone, Metheton, methadon, mathatone, METHEDONE, METHADON
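For reference, Ex #1 as a complete DATA step. This is a sketch under assumptions: the data set name `drugs`, the variable length and the non-matching row "morphine" are made up here for illustration.

```sas
data drugs;
  length drugname $20;
  retain myregex;
  /* compile the pattern once, on the first iteration */
  if _n_ = 1 then myregex = prxparse('/m[ea]th[ea][dt]one?/i');
  input drugname $;
  /* PRXMATCH returns a position (0 if not found); MIN caps it at 1 */
  myflag = min(prxmatch(myregex, drugname), 1);
  drop myregex;
  datalines;
methadone
Metheton
mathatone
METHADON
morphine
;
run;

proc print data=drugs noobs;
run;
```

The first four rows get `myflag = 1`; "morphine" gets 0.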

  13. Function PRXMATCH
      PRXMATCH (pattern-id, string);
      Returns the first position in the string where the regular expression match is found. If the pattern is not found, it returns 0.
      • pattern-id: the value returned from the PRXPARSE function.
      • string: the character variable or value to be searched.

  14. Ex #2 Validating the format
      • A sample of the data:
        Hydro-Chlorothiazide 25.5
        Ziagen 200mg
        Zerit mg
        Insulin 20 cc
        Dapsone 100 g
        Kaletra 3 tabs
      • Improperly formatted data:
        Hydro-Chlorothiazide 25.5
        Zerit mg

  15. Ex #2 Validating the format
      ** create the regular expression **;
      myregex = PRXPARSE("/^\D+\d{1,4}\.?\d{0,4}\s?(tabs?|caps?|cc|m?g)/i");
      /* "^\D+": starts with a group of non-digits
         "\d{1,4}": followed by one to four digits
         "\.?": an optional period
         "\d{0,4}": may be followed by up to four more digits
         "\s?": an optional space
         "(tabs?|caps?|cc|m?g)": units of measure: tab, tabs, cap, caps, cc, mg, g
         "/i": ignore case */
      ** catch poorly formatted data **;
      if PRXMATCH(myregex, medication) = 0;
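Ex #2 put together as a runnable step, using the sample data from the previous slide. The data set name `badformat` and the variable length are assumptions of this sketch.

```sas
data badformat;
  length medication $40;
  retain myregex;
  if _n_ = 1 then
    myregex = prxparse('/^\D+\d{1,4}\.?\d{0,4}\s?(tabs?|caps?|cc|m?g)/i');
  infile datalines truncover;
  input medication $40.;
  /* subsetting IF: keep only the rows that do NOT fit the format */
  if prxmatch(myregex, medication) = 0;
  drop myregex;
  datalines;
Hydro-Chlorothiazide 25.5
Ziagen 200mg
Zerit mg
Insulin 20 cc
Dapsone 100 g
Kaletra 3 tabs
;
run;

proc print data=badformat noobs;
run;
```

Only "Hydro-Chlorothiazide 25.5" (no unit) and "Zerit mg" (no dose) remain in the output data set.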

  16. Ex #3 Extracting Text
      To extract what the patients are reporting to the investigators:
        Patient reported headache and nausea. MD noticed rash.
        Pt. Rptd. Backache.
        Patient reported seeing spots.
        Elevated pulse and labored breathing. Headache.
      Extracted field:
        headache and nausea
        Backache
        seeing spots

  17. Function CALL PRXPOSN
      CALL PRXPOSN (pattern-id, capture-buffer-number, start <, length>);
      Returns the position and length for a capture buffer. Used in conjunction with PRXPARSE and PRXMATCH.
      • pattern-id: the value returned from the PRXPARSE function.
      • capture-buffer-number: a number indicating which capture buffer is to be evaluated.
      • start: the position where the particular capture buffer begins.
      • length: the length of the text matched by the capture buffer.

  18. Ex #3 Extracting Text
      ** create the regular expression **;
      myregex = PRXPARSE("/(reported|rptd?\.?)(.*?\.)/i");
      /* "(reported|rptd?\.?)": 1st capture buffer. Captures "reported", "rpt", "rpt.", "rptd" or "rptd."
         "(.*?\.)": 2nd capture buffer. Any characters up to and including the first period
                    (the "?" after ".*" makes it non-greedy, so it stops at the first period rather than the last)
         "/i": ignore case */

  19. Ex #3 Extracting Text
      ** only call PRXPOSN if matching **;
      if PRXMATCH(myregex, comments) then do;
        /* get the position and length of the 2nd capture buffer */
        CALL PRXPOSN(myregex, 2, pos, len);
        /* extract the substring, excluding the final period */
        pt_comments = substr(comments, pos, len - 1);
      end;
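Ex #3 as one runnable step on the four sample comments. This sketch makes two choices of its own: the second capture buffer is non-greedy, `(.*?\.)`, so that it stops at the first period, and `\s*` is added between the two buffers so that the extracted text does not begin with a blank. The data set name and variable lengths are also assumptions.

```sas
data reported;
  length comments $80 pt_comments $80;
  retain myregex;
  if _n_ = 1 then
    myregex = prxparse('/(reported|rptd?\.?)\s*(.*?\.)/i');
  infile datalines truncover;
  input comments $80.;
  /* only call PRXPOSN when the pattern was found */
  if prxmatch(myregex, comments) then do;
    call prxposn(myregex, 2, pos, len);            /* 2nd capture buffer   */
    pt_comments = substr(comments, pos, len - 1);  /* drop the end period  */
  end;
  drop myregex pos len;
  datalines;
Patient reported headache and nausea. MD noticed rash.
Pt. Rptd. Backache.
Patient reported seeing spots.
Elevated pulse and labored breathing. Headache.
;
run;

proc print data=reported noobs;
run;
```

The last comment contains no form of "reported", so its `pt_comments` stays blank.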

  20. Ex #4 Substitute one string for another
      Replace all of the following by “Multi-vitamin”:
        multivitamin
        multi-vitamin
        multi-vita
        multivit
        multi-vit
        multi vitamin

  21. Function CALL PRXCHANGE
      CALL PRXCHANGE (pattern-id, times, old-string <, new-string <, result-length <, truncation-value <, number-of-changes>>>>);
      Substitutes one string for another.
      • times: the number of times to search for and replace a string.
      • old-string: the string in which the replacement is made. If new-string is omitted, the result is written back to old-string.

  22. Ex #4 Substitute one string for another
      ** create regular expression **;
      myregex = PRXPARSE("s/multi[- ]?vita?(min)?/Multi-vitamin/");
      /* "s/": indicates that the regular expression will be used in a substitution
         "[- ]?": optional "-" or space
         "a?": optional "a"
         "(min)?": optional "min" */

  23. Ex #4 Substitute one string for another
      ** using the myregex id created above **;
      CALL PRXCHANGE(myregex, -1, drugname);
      /* "-1" indicates that the pattern should be changed at every occurrence */
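Ex #4 as a complete step on the six spellings listed earlier. This is a sketch: the data set name is made up, and `drugname` is given a length of 40 here so that the longer replacement text fits without truncation.

```sas
data vitamins;
  length drugname $40;
  retain myregex;
  if _n_ = 1 then
    myregex = prxparse('s/multi[- ]?vita?(min)?/Multi-vitamin/');
  infile datalines truncover;
  input drugname $40.;
  /* -1: replace every occurrence; the result is written back to drugname */
  call prxchange(myregex, -1, drugname);
  drop myregex;
  datalines;
multivitamin
multi-vitamin
multi-vita
multivit
multi-vit
multi vitamin
;
run;

proc print data=vitamins noobs;
run;
```

The pattern has no "/i" option, so it only matches the lower-case spellings shown; every row becomes "Multi-vitamin".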

  24. Ex #5 Finding digits in random positions

  25. Function CALL PRXNEXT
      CALL PRXNEXT (pattern-id, start, stop, string, position, length);
      Locates the nth occurrence of a pattern. Each call identifies the next occurrence of the pattern and moves start past it.
      • start: the starting position to begin the search
      • stop: the last position in the string for the search
      • string: the string to be searched
      • position: the starting position of the nth occurrence of the pattern (0 if there are no more)
      • length: the length of the matched text

  26. Ex #5 Finding digits in random positions
      ** create the regular expression **;
      myregex = PRXPARSE("/\d+/");
      ** "\d+": look for one or more digits **;
      start = 1;
      stop = length(string);
      ** get the position and length of the first occurrence **;
      call PRXNEXT(myregex, start, stop, string, pos, len);

  27. Ex #5 Finding digits in random positions
      array x[5];
      ** continue until no more digits are found (pos=0) **;
      do i = 1 to 5 while (pos gt 0);
        ** extract the current occurrence **;
        x[i] = input(substr(string, pos, len), 9.);
        ** get the position and length of the next occurrence **;
        call PRXNEXT(myregex, start, stop, string, pos, len);
      end;
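The two slides of Ex #5 combined into one step. The input lines are invented for illustration (the sample data for this example is not in the transcript), and the data set name is likewise an assumption.

```sas
data digits;
  length string $60;
  array x[5];                      /* creates numeric x1-x5 */
  retain myregex;
  if _n_ = 1 then myregex = prxparse('/\d+/');
  infile datalines truncover;
  input string $60.;
  start = 1;
  stop  = length(string);
  /* first occurrence; PRXNEXT advances start past each match */
  call prxnext(myregex, start, stop, string, pos, len);
  do i = 1 to 5 while (pos > 0);
    x[i] = input(substr(string, pos, len), 9.);
    call prxnext(myregex, start, stop, string, pos, len);
  end;
  drop myregex start stop pos len i;
  datalines;
abc 12 def 345
no digits here
7 and 88 and 999
;
run;

proc print data=digits noobs;
run;
```

At most five numbers per line are kept, because the array has five elements; unused elements stay missing.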

  28. Ex #6 Locating zip codes
      String:
        John Smith 12 Broad street Flemington, NJ 08822
        Philip Judson Apt #1, Building 7 777 Route 730 Kerrville, TX 78028
        Dr. Roger Alan 44 Commonwealth Ave. Boston, MA 02116-7364
      Zip_code:
        08822
        78028
        02116-7364

  29. Function CALL PRXSUBSTR
      CALL PRXSUBSTR (pattern-id, string, start <, length>);
      Returns the starting position and the length of the match.
      • string: the string to be searched
      • start: the starting position of the pattern
      • length: the length of the substring

  30. Ex #6 Locating zip codes
      ** create the regular expression **;
      myregex = PRXPARSE("/ \d{5}(-\d{4})?/");
      /* match a blank followed by 5 digits, followed by either nothing or a dash and 4 digits
         "\d{5}": matches 5 digits
         "-": matches a dash
         "\d{4}": matches 4 digits
         "?": matches zero or one of the preceding subexpression */

  31. Ex #6 Locating zip codes
      call PRXSUBSTR(myregex, string, start, length);
      ** only extract the substring if the pattern is found **;
      ** the zip code starts one position after the matched blank **;
      if start gt 0 then zip_code = substrn(string, start + 1, length - 1);
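Ex #6 as a runnable step on the three sample addresses, each read as a single line. This is a sketch: the data set name and variable lengths are assumptions, and the length variable is called `len` here only to keep it visually distinct from the LENGTH statement.

```sas
data zips;
  length string $120 zip_code $10;
  retain myregex;
  if _n_ = 1 then myregex = prxparse('/ \d{5}(-\d{4})?/');
  infile datalines truncover;
  input string $120.;
  call prxsubstr(myregex, string, start, len);
  /* skip the leading blank that is part of the match */
  if start > 0 then zip_code = substrn(string, start + 1, len - 1);
  drop myregex start len;
  datalines;
John Smith 12 Broad street Flemington, NJ 08822
Philip Judson Apt #1, Building 7 777 Route 730 Kerrville, TX 78028
Dr. Roger Alan 44 Commonwealth Ave. Boston, MA 02116-7364
;
run;

proc print data=zips noobs;
  var zip_code;
run;
```

Shorter numbers such as "12", "777" and "730" are skipped because the pattern requires exactly a blank followed by five digits.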

  32. References
      • SAS Functions by Example, Ron Cody
      • An Introduction to Regular Expression with Examples from Clinical Data, Richard Pless, Ovation Research Group, Highland Park, IL
      • How Regular Expression Really Work, Jack N. Shoemaker, Greensboro, NC
