Perl Regular expression: string manipulation
substr function
• $string = substr($string2, START, LENGTH)
• START is zero-based; returns LENGTH characters of $string2 beginning at START
• $string2 is not changed
• $str2 = "Hi There";
• $str = substr($str2, 3, 2);
• Result: $str is "Th" (the 4th and 5th characters)
• substr($string, START, LENGTH) = $string2
• replaces the LENGTH characters beginning at START with $string2
• $str2 = "Hi There"; $str = "hi";
• substr($str2, 3, 3) = $str; # insert and replace
• Result: $str2 is "Hi hire"
• substr($str2, 3, 0) = $str; # insert only
• Result: $str2 is "Hi hihire"
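The same operations can also be written with the four-argument form of substr, which takes the replacement as its last argument instead of being assigned to. This is a minimal sketch; the variable names are illustrative and not from the slides.

```perl
use strict;
use warnings;

my $text = "Hi There";

# Two characters starting at zero-based position 3.
my $piece = substr($text, 3, 2);

# Four-argument form: replace 3 characters at position 3 in one call.
my $replaced = $text;
substr($replaced, 3, 3, "hi");

# A length of 0 inserts without removing anything.
my $inserted = $replaced;
substr($inserted, 3, 0, "hi");

# A negative start counts back from the end of the string.
my $tail = substr($text, -5);

print "$piece | $replaced | $inserted | $tail\n";
```

The original string in $text is never modified because each change is made on a copy.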
index and rindex
• index STRING, SUBSTRING [, POSITION]
• returns the zero-based position of the first occurrence of SUBSTRING in STRING, else -1
• with POSITION, the search starts at that position
• rindex STRING, SUBSTRING [, POSITION]
• returns the position of the last occurrence of SUBSTRING, else -1
• with POSITION, that is the rightmost position that may be returned
• $pos = index $str, $str2;
• returns the position where $str2 is found in $str
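A short sketch of both functions, including a loop that uses the optional position argument to find every occurrence. The strings and variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $haystack = "There there Jim";
my $needle   = "here";

my $first = index($haystack, $needle);    # first occurrence
my $last  = rindex($haystack, $needle);   # last occurrence
my $none  = index($haystack, "Fred");     # not present

# Find every occurrence by restarting the search just past the previous hit.
my @positions;
my $pos = index($haystack, $needle);
while ($pos != -1) {
    push @positions, $pos;
    $pos = index($haystack, $needle, $pos + 1);
}

print "first=$first last=$last none=$none all=@positions\n";
```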
example of substr and index
• $str = "There there Jim";
• $sstr = "Jim";
• $replace = "Fred";
• substr($str, (index $str, $sstr), 3) = $replace;
• replaces Jim with Fred in $str
• Result: $str is "There there Fred"
• The substitution operator is an easier way to do this.
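One caveat with combining substr and index: when the substring is not found, index returns -1, and substr treats -1 as the last character. The sketch below guards against that; replace_first is a hypothetical helper name, not something from the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the first literal occurrence of $find, if any.
sub replace_first {
    my ($text, $find, $replacement) = @_;
    my $pos = index($text, $find);
    # Without this check, a position of -1 would address the last character.
    return $text if $pos < 0;
    substr($text, $pos, length($find), $replacement);
    return $text;
}

my $changed   = replace_first("There there Jim", "Jim", "Fred");
my $unchanged = replace_first("There there Jim", "Bob", "Fred");

print "$changed\n$unchanged\n";
```

Using length($find) instead of a hard-coded 3 keeps the call correct when the search string changes.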
grep
• LIST = grep EXPR, LIST
• LIST = grep BLOCK LIST
• like map, each element is assigned to $_ in turn and EXPR or BLOCK is evaluated; the elements for which the result is true are returned
• @new = grep /[a-zA-Z]/, @lines;
• NOTE: $_ is an alias, so altering $_ alters the original list
• @list = qw(barney fred dino wilma);
• @greplist = grep { s/^[bfd]// } @list;
• Result: @greplist is ("arney", "red", "ino")
• Result: @list is ("arney", "red", "ino", "wilma")
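If the original list should stay intact, one common approach is to select with a plain match in grep and then transform copies with map. This is a sketch under that assumption; the variable names are illustrative.

```perl
use strict;
use warnings;

my @list = qw(barney fred dino wilma);

# Select with a plain match: nothing is modified.
my @selected = grep { /^[bfd]/ } @list;

# Transform copies; "my $copy = $_" protects the original element.
my @trimmed = map { my $copy = $_; $copy =~ s/^[bfd]//; $copy } @selected;

print "@list\n@selected\n@trimmed\n";
```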
s/// Operator (Substitution)
• $str =~ s/pattern to match/replacement/;
• finds the first match and replaces it
• $str =~ s/pattern to match/replacement/g;
• finds all matches and replaces each of them
• Simple substitution
• $str = "3 dogs bit 1 dog";
• $str =~ s/dog/cat/;
• Result: $str is "3 cats bit 1 dog"
• $str =~ s/dog/cat/g;
• Result: $str is "3 cats bit 1 cat"
s/// Operator (Substitution) (2)
• s/pattern//;
• removes the text matched by the pattern
• $str = "abad";
• $str =~ s/a//g;
• Result: $str is "bd"
• From the substr and index slide:
• $str =~ s/$sstr/$replace/;
• or
• $str =~ s/Jim/Fred/;
case insensitive substitution
• /i ignores case when matching
• $str = "Dog, dog, dOg";
• $str =~ s/DOG/cat/ig;
• Result: $str is "cat, cat, cat"
• $str = "Dog, dog, dOg";
• $str =~ s/DOG/cAt/ig;
• Result: $str is "cAt, cAt, cAt"
• The replacement string is inserted exactly as written; /i affects only the match.
examples
• $str = "fred xxx barney"; # starting value for each example below
• $str =~ s/x/boom/;
• Result: $str is "fred boomxx barney"
• $str =~ s/x/boom/g;
• Result: $str is "fred boomboomboom barney"
• $str =~ s/x+/boom/;
• Result: $str is "fred boom barney"
alternation and group matching
• | allows alternation (an "or" between patterns)
• $str = "Wilma Flintstone";
• $str =~ s/Fred|Wilma|Pebbles/Dino/g;
• Result: $str is "Dino Flintstone"
• Replaces all instances of Fred or Wilma or Pebbles with Dino.
• $str = "1st time winner";
• $str =~ s/(1st|2nd|3rd) time/Last place/;
• $1 is the group match, "1st"; the entire match is "1st time"
• Result: $str is "Last place winner"
single character substitution
• Using []
• $str =~ s/[abc]/d/; # replace the first a, b, or c with d
• $str =~ s/[Fred]/x/g;
• If $str was "Fred", afterwards it would be "xxxx"
• $str =~ s/[^aeiouAEIOU]/_/g;
• replaces every character that is not a vowel with an _
• Common mistake:
• $str =~ s/[a-z]/[A-Z]/g;
• intended to replace each lower case letter with its upper case letter, but the replacement side is a literal string (not a pattern)
• if $str = "hi", then the result would be "[A-Z][A-Z]"
• NOTE: $str = uc $str; # upper cases a string
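A small sketch contrasting the mistake above with three ways that do change case: tr///, uc, and the \U escape in a replacement. The variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "hi there";

# The mistake: the replacement is the literal text "[A-Z]".
(my $mistake = $str) =~ s/[a-z]/[A-Z]/g;

# tr/// maps characters one for one, so ranges work on both sides.
(my $with_tr = $str) =~ tr/a-z/A-Z/;

# uc upper cases the whole string.
my $with_uc = uc $str;

# \U in the replacement upper cases the captured text (vowels only here).
(my $vowels_up = $str) =~ s/([aeiou])/\U$1/g;

print "$mistake\n$with_tr\n$with_uc\n$vowels_up\n";
```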
matching quantifiers
• $str =~ s/a{3}/b/;
• the first instance of aaa is replaced with b
• $str = "aaaaa"; # starting value for each example below
• $str =~ s/a{3,}/b/; # greedy (maximal) match
• Result: $str is "b"
• $str =~ s/a{3,}?/b/; # non-greedy (minimal) match
• Result: $str is "baa"; only three a's are replaced to make a minimal match
• $str =~ s/(a{3,}?)(a*)/b/;
• Result: $str is "b"; $1 is "aaa"; $2 is "aa"
• $str =~ s/(a{3,})(a*)/b/;
• Result: $str is "b"; $1 is "aaaaa"; $2 is ""
• $str =~ s/(a{3,}?)(a*?)/b/; # minimal match on both
• Result: $str is "baa"; $1 is "aaa"; $2 is ""
matching quantifiers (2)
• $str = "aaaaab"; # starting value for each example below
• $str =~ s/a{3,}?b/c/;
• Result: $str is "c". Why? To make the match succeed at all, the pattern has to take every a up to the b.
• + matches 1 or more times and ? matches 0 or 1 time; both are greedy (maximal) by default
• $str =~ s/(a+)(b?)/c/;
• Result: $str is "c"; $1 is "aaaaa"; $2 is "b"
• $str =~ s/(a+?)(b??)/c/; # minimal match
• Result: $str is "caaaab"; $1 is "a"; $2 is ""
matching quantifiers (3)
• Perl doesn't always do what you expect.
• $str = "ddogg";
• $str =~ s/d.*g/cat/;
• Result: $str is "cat" # maximal match, makes sense
• $str = "ddogg";
• $str =~ s/d.*?g/cat/;
• Result: $str is "catg" # minimal match, but not the shortest possible one: the match starts at the leftmost d, so "ddog" is replaced rather than "dog"
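The behaviour above follows from the engine preferring the leftmost starting point before it considers match length. One way to get the shorter match, sketched here as an illustration, is to forbid a second d inside the match with a negated character class.

```perl
use strict;
use warnings;

my $greedy = "ddogg";
$greedy =~ s/d.*g/cat/;          # longest match from the first d

my $lazy = "ddogg";
$lazy =~ s/d.*?g/cat/;           # shortest match, still from the first d

my $tight = "ddogg";
$tight =~ s/d[^d]*?g/cat/;       # no d allowed inside, so only "dog" matches

print "$greedy $lazy $tight\n";
```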
matching quantifiers (4)
• More examples (with the $_ variable), each starting from:
• $_ = "a xxx c xxxxx c xxx d";
• s/x{1,}/d/g; produces "a d c d c d d"
• s/x{1,}?/d/g; produces "a ddd c ddddd c ddd d"
• s/x{1,2}/d/g; produces "a dd c ddd c dd d"
• s/x{1,3}/d/g; produces "a d c dd c d d"
• s/x{2,2}/d/g; produces "a dx c ddx c dx d"
• or s/x{2}/d/g;
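A quick way to check results like these is to run each pattern against a fresh copy and record the return value of s///g, which in scalar context is the number of replacements made. This sketch uses the shorthand x+ for x{1,}; the hash names are illustrative.

```perl
use strict;
use warnings;

my $source = "a xxx c xxxxx c xxx d";
my (%count, %result);

for my $pattern ('x+', 'x+?', 'x{1,2}', 'x{1,3}', 'x{2}') {
    my $copy = $source;
    # In scalar context s///g returns how many replacements it made.
    $count{$pattern}  = ($copy =~ s/$pattern/d/g);
    $result{$pattern} = $copy;
    printf "%-7s %2d  %s\n", $pattern, $count{$pattern}, $copy;
}
```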
Anchoring
• $str = "Fred Flintstone Fred"; # starting value for each example below
• $str =~ s/Fred/Wilma/g;
• replaces all instances of Fred with Wilma
• $str =~ s/Fred$/Wilma/g;
• only the last instance is replaced, giving "Fred Flintstone Wilma", even with the /g flag
• $str =~ s/^Fred/Wilma/g;
• only the first instance is replaced, giving "Wilma Flintstone Fred", even with the /g flag
• $str = "abcd";
• $str =~ s/^[abc]+/d/;
• Result: $str is "dd"
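With a multi-line string, the /m flag changes what the anchors mean: ^ and $ then also match at embedded newlines, so /g can replace once per line. A minimal sketch with an illustrative three-line string:

```perl
use strict;
use warnings;

my $lines = "Fred one\nFred two\nFred three";

# Without /m, ^ matches only at the very start of the string.
(my $first_only = $lines) =~ s/^Fred/Wilma/g;

# With /m, ^ also matches right after each embedded newline.
(my $every_line = $lines) =~ s/^Fred/Wilma/mg;

print "$first_only\n--\n$every_line\n";
```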
Parentheses as memory
• s/a(.)b(.)c\2d\1/a mess/;
• "adbecedd" is converted to "a mess"
• "adbecdde" is not converted
• s/a(.*)b\1c/a mess/;
• "addbddc" changes to "a mess"
• "adddbddc" is not changed
• To keep the captured text, use $1 .. $9 in the replacement (\1 .. \9 are backreferences for use inside the pattern)
• s/a(.*)b\1c/What is this: $1/;
• "addbddc" is converted to "What is this: dd"
• again $1 is "dd"
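A short sketch of the two roles of a capture group: \1 refers back to it inside the pattern, and $1 carries the captured text into the replacement or into later code. Variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "addbddc";

# In list context a match returns its captured groups.
my ($captured) = $str =~ /a(.*)b\1c/;

# \1 inside the pattern, $1 inside the replacement.
(my $kept = $str) =~ s/a(.*)b\1c/What is this: $1/;

# Two groups can be swapped by reversing them in the replacement.
(my $swapped = "Flintstone Fred") =~ s/(\w+) (\w+)/$2 $1/;

print "$captured\n$kept\n$swapped\n";
```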
metasymbols
• a very common substitution
• s/\s+/ /g; # replace each run of whitespace with a single space
• " a b\t c" changes to " a b c"
• remove doubled word characters
• $str = "11aabbdccaa";
• $str =~ s/(\w)\1/$1/g;
• Result: $str is "1abdca"
• remove any doubled character
• $str = "11 ,,aa";
• $str =~ s/(.)\1/$1/g;
• Result: $str is "1 ,a"
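Note that (.)\1 only collapses pairs, so a run of three or more identical characters is not reduced to one. The sketch below compares it with (.)\1+, and adds a trim step to the whitespace cleanup; the sample strings are illustrative.

```perl
use strict;
use warnings;

# Pairs only: "aaaa" becomes "aa" and "bbb" becomes "bb".
(my $pairs = "aaaa bbb") =~ s/(.)\1/$1/g;

# \1+ consumes the whole run, leaving one character per run.
(my $runs = "aaaa bbb") =~ s/(.)\1+/$1/g;

# Collapse whitespace runs, then trim both ends.
(my $tidy = "  a   b\t c  ") =~ s/\s+/ /g;
$tidy =~ s/^\s+|\s+$//g;

print "[$pairs] [$runs] [$tidy]\n";
```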
Metasymbols (2)
• \U upper cases until \E, and \L lower cases until \E
• Example
• s/a(.*)b\1c/What is this: \U$1\E/;
• "addbddc" is converted to "What is this: DD"
• s/a(.*)b\1c/What is this: \L$1\E/;
• "addbddc" is converted to "What is this: dd"
• \Q ... \E quotes the text in between, so regex metacharacters there are matched literally
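\Q matters most when a pattern comes from a variable that may contain metacharacters. In this sketch the search text "1+1" is illustrative: unquoted, + is a quantifier, so the pattern matches "11" instead of the literal text.

```perl
use strict;
use warnings;

my $text    = '1+1 is not 11';
my $literal = '1+1';

# Unquoted: "1+1" means one or more 1s followed by a 1, which matches "11".
(my $as_regex = $text) =~ s/$literal/two/;

# Quoted: \Q...\E escapes the +, so the literal text is matched.
(my $as_literal = $text) =~ s/\Q$literal\E/two/;

# \U...\E upper cases only the captured part of the replacement.
(my $shouted = "addbddc") =~ s/a(.*)b\1c/What is this: \U$1\E!/;

print "$as_regex\n$as_literal\n$shouted\n";
```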
Exercise 10
• What is the outcome of the following substitutions? Use $_ = "ad dog cd"
• s/dog//;
• while (/  /) { s/  / /g; }
• s/(\w+)\s+(\w+)/$2 $1/g;
• s/(.+)d/Dd/g;
• s/(.+?)d/Dd/g;
• s/(\S+)/=$1=/g;
• Write a substitution to change each vowel to an X.
s/// flags
• Same as the match operator:
• /i ignore case
• /m let ^ and $ match next to an embedded \n
• /s let . match a newline
• /x ignore whitespace in the pattern and permit comments
• Flags with a substitution-specific meaning:
• /g replace globally, i.e. all occurrences
• /e evaluate the right side as an expression
• in other words, Perl runs the right side as Perl code and uses its return value as the replacement
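A brief sketch of /s and /x in use. The date string and the US-style output format are illustrative assumptions; the s{...}{...} delimiters are used so the replacement can contain slashes.

```perl
use strict;
use warnings;

my $two_lines = "a\nb";

# Without /s the dot does not match a newline; with /s it does.
my $dot_plain = ($two_lines =~ /a.b/)  ? 1 : 0;
my $dot_s     = ($two_lines =~ /a.b/s) ? 1 : 0;

# /x lets the pattern be spaced out and commented.
my $date = "2024-03-15";
(my $us_style = $date) =~ s{
    (\d{4}) - (\d{2}) - (\d{2})    # year, month, day
}{$2/$3/$1}x;

print "$dot_plain $dot_s $us_style\n";
```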
/e flag
• s/(\d+)/sprintf("%#x",$1)/ge;
• converts all numbers to hex
• "2581" would be converted to "0xa15"
• the leap year test again, with the ternary operator:
• s/(\d+)/ $1 % 4 ? "$1 (not a leap year)" : $1 % 100 ? "$1 (a leap year)" : $1 % 400 ? "$1 (not a leap year)" : "$1 (a leap year)" /gxe;
• "2000" is changed to "2000 (a leap year)"
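With /e the replacement can also call a subroutine, which tends to be easier to read than a long chained ternary. This is a sketch; is_leap_year is a hypothetical helper name and the labels are shortened for the example.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the same divisibility rule as the ternary chain.
sub is_leap_year {
    my ($year) = @_;
    return ($year % 4 == 0 && $year % 100 != 0) || $year % 400 == 0;
}

my $years = "1900 2000 2023 2024";
(my $labelled = $years)
    =~ s/(\d+)/ $1 . (is_leap_year($1) ? " (leap)" : " (not leap)") /ge;

# The hex conversion, applied to every number in the string.
(my $hex = "2581 and 255") =~ s/(\d+)/sprintf("%#x", $1)/ge;

print "$labelled\n$hex\n";
```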
tr/// Operator (Transliteration)
• same as in sed; y/// can be used instead of tr///
• DOES NOT use pattern matching; instead it scans character by character and replaces each occurrence of a character with its replacement
• tr/SEARCHLIST/REPLACEMENTLIST/cds;
• Example:
• $str = "AABBCCDDEE";
• $str =~ tr/ABC/XYZ/;
• Result: $str is "XXYYZZDDEE"
• $str =~ tr/DE/!/; # if the replacement list is too short, its last character is used as many times as needed
• Result: $str is "XXYYZZ!!!!"
tr/// Operator (Transliteration) (2)
• Duplicates in the search list are ignored (the first mapping wins)
• $str = "AABBCCDDEE";
• $str =~ tr/AAB/xyz/;
• Result: $str is "xxzzCCDDEE"
• /c complements the search list, i.e. it means the characters not in the search list
• $str = "AABBCCDDEE";
• $str =~ tr/ABC/x/c;
• Result: $str is "AABBCCxxxx"
tr/// Operator (Transliteration) (3)
• /d deletes characters that are found but have no replacement
• This changes how tr handles a short replacement list: the unpaired characters are removed instead of reusing the last replacement character
• $str = "AABBCCDDEE";
• $str =~ tr/ABC/xy/d;
• Result: $str is "xxyyDDEE"
• $str =~ tr/DE//d;
• Result: $str is "xxyy"
tr/// Operator (Transliteration) (4)
• /s squeezes each run of the same replaced character down to a single character
• $str = "AABBCCDDEE";
• $str =~ tr/ABC/xyz/s;
• Result: $str is "xyzDDEE"
• tr/// returns the number of characters found/replaced.
• $str = "AABBCCDDEE";
• $count = ($str =~ tr/ABC/xyz/);
• Result: $count is 6; $str is "xxyyzzDDEE"
• $str = "AABBCCDDEE";
• $count = ($str =~ tr/ABC//);
• Result: $count is 6; $str is still "AABBCCDDEE"
• With no replacement list (and no /d), it just counts the characters and makes no replacements. Note that s/// with an empty replacement would have removed them.
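Counting with an empty replacement list is a common idiom. The sketch below counts separators in an illustrative comma-separated string and also uses the /r flag, which returns a changed copy instead of modifying the variable; /r is an assumption about the environment, since it needs Perl 5.14 or later.

```perl
use strict;
use warnings;

my $csv = "fred,barney,,wilma";

# Empty replacement list: count the commas, change nothing.
my $commas = ($csv =~ tr/,//);
my $fields = $commas + 1;

# /r (Perl 5.14+) returns the transliterated copy and leaves $csv alone.
my $semicolons = ($csv =~ tr/,/;/r);

print "$commas commas, $fields fields, copy: $semicolons\n";
```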
More tr/// Examples
• $str = "AABBCCDDEE";
• $str =~ tr/D//d; # delete found characters
• Result: $str is "AABBCCEE"
• $str = "AABBCCDDEE";
• $str =~ tr/ABD/xy/ds; # delete D, replace A with x and B with y, and squeeze duplicate replacements
• Result: $str is "xyCCEE"
• $str =~ tr/a-zA-Z//dc;
• removes any non-letters from $str
• $str =~ tr/A-Za-z/N-ZA-Mn-za-m/;
• rotates the letters by 13 places (ROT13) for simple obfuscation
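Because the alphabet has 26 letters, applying the 13-place rotation twice restores the original text. A minimal sketch, with rot13 as a hypothetical wrapper name around the transliteration above:

```perl
use strict;
use warnings;

# Hypothetical wrapper: rotate letters by 13 places, leave everything else alone.
sub rot13 {
    my ($text) = @_;
    $text =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return $text;
}

my $secret = rot13("Hello, World!");
my $plain  = rot13($secret);

# Keep only the letters, using /c (complement) with /d (delete).
(my $letters = "Hello, World!") =~ tr/a-zA-Z//cd;

print "$secret\n$plain\n$letters\n";
```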
Exercise 11
• What is the outcome of the following transliterations? Use $_ = "fred and barney"
• tr/abcde/ABCDE/;
• tr/a-z/ABCDE/d;
• $count = tr/a-z/A-Z/;
• tr/a-z/_/c;
• tr/a-m/X/s;
• tr/aeiou/X/cs;
• $count = tr/aeiou//c;
• Change the letters b, d, and r to X and count the number of changes.
Q & A