Matching in list context (Chapter 11 continued) @array = ( $str =~ /pattern/ );
• In list context, if the pattern contains grouped expressions, the match stores the list of captured values (the special variables $1, $2, …) in @array.
• If there are no grouped expressions, @array receives (1) when the match succeeds and the empty list () when it fails. (A failed match returns the empty list in either case.)
• The following assigns ("cat chow", "cat", "chow") to @array.
• @array = ("Purina cat chow" =~ /((cat|dog|ferret) (food|chow))/);
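The three cases above can be checked with a short script. This is a minimal sketch; the variable names @groups, @hit, and @miss are chosen here for illustration and do not come from the slides.

```perl
use strict;
use warnings;

# Grouped pattern: the list holds $1, $2, $3, numbered by opening parenthesis.
my @groups = ("Purina cat chow" =~ /((cat|dog|ferret) (food|chow))/);
print join(" | ", @groups), "\n";

# No groups, successful match: the one-element list (1).
my @hit = ("Purina cat chow" =~ /cat/);

# No groups, failed match: the empty list.
my @miss = ("Purina cat chow" =~ /dog/);

print "hit has ", scalar(@hit), " element(s), miss has ", scalar(@miss), "\n";
```

Because a failed match yields the empty list, testing `if (@array)` after the assignment tells you whether the match succeeded.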
The g command modifier causes matching to be done globally: it doesn't quit after finding the first match.
• @array = ($str =~ /pattern/g);
• Global matching is simplest when there are no grouped expressions in the pattern, because it then returns the list of all whole matches. If the pattern does contain groups, the list holds the captured values from every match instead.
• The following assigns the list ("an ", "amp") to @array.
• @array = ("an example" =~ /a../g);
• In contrast, the following assigns the one-element list ("an ") to @array.
• @array = ("an example" =~ /(a..)/);
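A small sketch comparing the slide's two statements with a third case that combines groups and the g modifier. The string "a=1,b=2" and the names @whole, @first, and @pairs are made up for illustration.

```perl
use strict;
use warnings;

# Global match, no groups: every whole match is returned.
my @whole = ("an example" =~ /a../g);

# One group, no g modifier: only the first match's capture is returned.
my @first = ("an example" =~ /(a..)/);

# Groups AND the g modifier: the captures of every match are returned,
# flattened into one list (not the whole matches).
my @pairs = ("a=1,b=2" =~ /(\w)=(\d)/g);

print scalar(@whole), " ", scalar(@first), " ", scalar(@pairs), "\n";
```

The third case is why groups and global matching need care together: the list length is the number of matches times the number of groups.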
The following statement parses out all of the HTML tags and stores the list ("<h1>", "</h1>") in the @tags array.
• @tags = ("<h1>Title</h1>" =~ /<.+?>/g);
• Suppose $document is a (perhaps long) string that contains some text document, and suppose we want to pull out all the social security numbers from the document. If we assume social security numbers look like 123-45-6789, then a solution is
• @soc_numbers = ($document =~ /\d{3}-\d{2}-\d{4}/g);
• But what if the social security numbers are inconsistent in that some are missing the dashes? Then a solution is
• @soc_numbers = ($document =~ /\d{3}-?\d{2}-?\d{4}/g);
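To see the difference between the two social security patterns, here is a sketch run against a made-up $document (the names and numbers are invented sample data). The third pattern, with \b word boundaries, is an addition not on the slides, shown as one possible way to avoid matching inside a longer run of digits.

```perl
use strict;
use warnings;

# Hypothetical sample text; the last number is a 13-digit order number.
my $document = "Alice: 123-45-6789. Bob: 987654321. "
             . "Carol: 111-22-3333. Order no. 5550001234567.";

# Dashes required: finds only the two dashed numbers.
my @strict = ($document =~ /\d{3}-\d{2}-\d{4}/g);

# Dashes optional: also finds Bob's number, but picks up the first
# nine digits of the order number as well.
my @loose = ($document =~ /\d{3}-?\d{2}-?\d{4}/g);

# Assumption: requiring word boundaries rejects the order number.
my @bounded = ($document =~ /\b\d{3}-?\d{2}-?\d{4}\b/g);

print scalar(@strict), " ", scalar(@loose), " ", scalar(@bounded), "\n";
```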
Two very useful functions that take patterns and return lists.
We have used split often, even in the decoding routine, where we split on a one-character delimiter.
• @nameValuePairs = split(/&/, $datastring);
• A string with more complicated delimiting patterns can also be split. In the following case, a delimiter is one or more colons.
• $str = "23:22::455:98:::85";
• @numbers = split(/:+/, $str);
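A quick sketch of why the + quantifier matters here. The comparison with a single-colon delimiter (@single) is added for illustration.

```perl
use strict;
use warnings;

my $str = "23:22::455:98:::85";

# One or more colons count as a single delimiter.
my @numbers = split(/:+/, $str);

# Exactly one colon per delimiter: the doubled and tripled colons
# produce empty strings between them.
my @single = split(/:/, $str);

print scalar(@numbers), " fields vs ", scalar(@single), " fields\n";
```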
grep (named after the Unix tool, whose name comes from "global regular expression print") is different from split in that you send it an array rather than a string. It "filters" the array based upon the regular expression. That is, only those array elements which match the pattern are returned.
• Suppose @domains contains some large number of named Web addresses. One simple call to grep can keep only those addresses in the ".edu" domain, for example
• @edu_sites = grep (/\.edu/, @domains);
• Note: The period had to be escaped since it is a metacharacter.
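The note about escaping the period can be demonstrated with a small sketch. The addresses in @domains are invented sample data.

```perl
use strict;
use warnings;

# Hypothetical addresses for illustration.
my @domains = ("www.example.edu", "schedule.example.com",
               "cs.example.edu",  "www.example.org");

# Escaped period: a literal "." must precede "edu".
my @edu_sites = grep(/\.edu/, @domains);

# Unescaped period: any character may precede "edu", so the
# "hedu" inside "schedule" also matches.
my @too_many = grep(/.edu/, @domains);

print "@edu_sites\n";
print "@too_many\n";
```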
Example: Analyzing log files. A typical HTTP access log. See accesslog.txt.
The 10 different fields are actually standard.
• Results when we split out the first line (around delimiting spaces).
• @fields = split (/\s+/, $line);
Log file analysis can get very elaborate, and there are many commercial and free software packages available for it.
• For a simple example, we count the total number of hits (lines in the access log) and the total number of unique hits (different IP addresses).
• Notice that requesting one page can result in numerous lines in the access log, since all of the image transfers are separate HTTP transactions. (Some hit counters you find actually report the number of lines in the file!)
• Counting lines is easy. To count the number of unique IP addresses, we add IP addresses to a hash as the keys. Thus a new hash entry can only originate from a new IP address. We then count the number of keys in the hash.
• See source file hitcount.pl
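The counting idea described above might look roughly like the following. This is a sketch only, not the contents of hitcount.pl: the log lines in @log are invented sample data (using documentation-range IP addresses), and the names $hits, %seen, and $unique are assumptions. It relies on the IP address being the first space-delimited field.

```perl
use strict;
use warnings;

# Hypothetical log lines; a real script would read them from the log file.
my @log = (
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.1 - - [10/Oct/2000:13:55:37 -0700] "GET /logo.gif HTTP/1.0" 200 512',
    '198.51.100.7 - - [10/Oct/2000:14:02:11 -0700] "GET /index.html HTTP/1.0" 200 2326',
);

my $hits = 0;
my %seen;                                # keys are the IP addresses seen so far

foreach my $line (@log) {
    $hits++;                             # every line is one hit
    my ($ip) = split(/\s+/, $line);      # first field is the IP address
    $seen{$ip} = 1;                      # a repeated IP reuses its existing key
}

my $unique = scalar(keys %seen);
print "Total hits: $hits\n";
print "Unique hits: $unique\n";
```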
The substitution operator
• $scalar_variable =~ s/pattern/replacement_string/command_modifiers;
• The binding operator "binds" the substitution onto the string.
• The substitution operator s/// takes two arguments (in contrast to the match operator m//).
• It attempts to find a match for the pattern in the $scalar_variable and, if successful, replaces the match with the replacement_string.
• Thus, the scalar variable is altered if a successful match is found. In contrast, the match operator does not alter the string onto which it is bound.
The following attempts to replace "the" with "my".
• $str = "the cat in the hat";
• $str =~ s/the/my/;
• This causes $str to contain "my cat in the hat".
• By default, only the left-most occurrence is replaced.
• The g (global) command modifier causes substitutions to be made globally.
• $str = "the cat in the hat";
• $str =~ s/the/my/g;
• This causes $str to contain "my cat in my hat".
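A short sketch confirming that a match leaves the string alone while a substitution alters it. It also uses the value returned by s///, the number of substitutions made, which the slides do not mention; the variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "the cat in the hat";

# A plain match does not alter the bound string.
my $found = ($str =~ /the/) ? 1 : 0;
my $after_match = $str;

# Non-global substitution: only the left-most "the" is replaced.
my $n_first = ($str =~ s/the/my/);
my $after_first = $str;

# Global substitution on a fresh copy: every "the" is replaced.
$str = "the cat in the hat";
my $n_all = ($str =~ s/the/my/g);

print "$after_first ($n_first)\n";
print "$str ($n_all)\n";
```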
The following results in $str having the value "puppy ferret category" (non-global substitution).
• $str = "puppy dog category";
• $str =~ s/(cat|dog)/ferret/;
• A similar global substitution results in $str containing "puppy ferret ferretegory".
• $str = "puppy dog category";
• $str =~ s/(cat|dog)/ferret/g;
• The following replaces all whitespace characters with the empty string, resulting in $str containing "hello".
• $str = "h e l l o";
• $str =~ s/\s//g;
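The "ferretegory" result happens because the pattern matches "cat" inside a longer word. One possible refinement, not taken from the slides, is to require word boundaries with \b. A minimal sketch:

```perl
use strict;
use warnings;

my $original = "puppy dog category";

# Global substitution as on the slide: "cat" inside "category" is replaced too.
(my $anywhere = $original) =~ s/(cat|dog)/ferret/g;

# Assumption: with \b on both sides, only whole words are replaced.
(my $whole_words = $original) =~ s/\b(cat|dog)\b/ferret/g;

print "$anywhere\n";
print "$whole_words\n";
```

The `(my $copy = $original) =~ s/.../.../` form copies the string first and then substitutes in the copy, so $original stays unchanged.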
Captured matches can be included in the replacement string.
• $str = "puppy dog category";
• $str =~ s/(\w+)/$1s/g;
• This results in $str having the value "puppys dogs categorys".
• There is only one set of grouping parentheses used in this example, so we only need to use $1.
• As each match is found, $1 is assigned that new match. Thus, $1 may be reused several times during a global substitution.
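A sketch of the same substitution, written with ${1} so that the boundary between the variable name and the literal "s" is explicit. The second statement, which swaps two captured words, is an added illustration using a made-up string.

```perl
use strict;
use warnings;

# Append "s" to every word; ${1} is the same variable as $1.
(my $plural = "puppy dog category") =~ s/(\w+)/${1}s/g;

# Two groups: $1 and $2 can appear in any order in the replacement.
(my $swapped = "hello world") =~ s/(\w+) (\w+)/$2 $1/;

print "$plural\n";
print "$swapped\n";
```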
The transliteration operator
• $scalar =~ tr/search_characters/replacement_characters/;
• This replaces the search characters with the corresponding replacement characters.
• It's usually used with single characters.
• $str = "the cat in the hat";
• $str =~ tr/a/u/;
• The result is "the cut in the hut".
• Transliteration can be done using substitutions, but tr automatically replaces globally and works on literal characters rather than patterns, which means you don't have to escape regular expression metacharacters.
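A brief sketch of tr in action. Beyond the slide's example, it shows two standard features not covered above: tr returns the number of characters it matched, and it accepts character ranges such as a-z.

```perl
use strict;
use warnings;

my $str = "the cat in the hat";

# Every "a" becomes "u"; no g modifier is needed.
(my $changed = $str) =~ tr/a/u/;

# With an empty replacement list, tr leaves the string as is
# and simply returns how many characters matched.
my $count = ($str =~ tr/a//);

# Ranges map character by character: a->A, b->B, and so on.
(my $upper = $str) =~ tr/a-z/A-Z/;

print "$changed\n$count\n$upper\n";
```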
Example: Inspired by news sites which display parts of stories and provide links pointing to the full stories. See partialcontent.cgi
Each story is a text file (.news)
• Paragraphs must be separated by at least one blank line (\n\n)
• The program reads the directory and prints the first two paragraphs of only the .news files.
Acquiring only the .news stories from the directory is straightforward, especially with the power of grep.
• opendir(D, "$storyDataDir");
• @storyFiles = readdir(D);
• closedir(D);
• @storyFiles = grep (/\.news$/ , @storyFiles);
• The period is escaped because it is a metacharacter, and the $ anchor keeps only names that end in .news.
• We then loop over the .news files and process each one.
• foreach $file (@storyFiles) {
• if(open(STORY, "$storyDataDir$file")) {
• my @wholeStory = <STORY>;
• close(STORY);
• # join whole story into one string
• my $story = join("", @wholeStory);
We can then extract all of the paragraphs with one global match!
• @paragraphs = ($story =~ /((?:.|\n)+?\n\s*\n)/g);
• The inner group is written as (?: ), which groups without capturing, so that the list holds exactly one element per paragraph. A second capturing group would add an extra element to the list for every match.
• It's then trivial to print the first two paragraphs.
• But the pattern certainly needs clarification.
• First we need to identify the space between paragraphs.
• \n\s*\n   ## matches one or more consecutive blank lines
•           ## That is, two newline characters with zero or more whitespace characters in between.
• Since quantifiers are greedy, the pattern will not stop after finding the first in a sequence of blank lines.
Now we match paragraph content.
• (.|\n)+   ## one or more of any character
•           ## (the wildcard doesn't match \n characters)
• Now the whole pattern which matches a paragraph.
• /(.|\n)+?\n\s*\n/   ## one or more of anything,
•                     ## then one or more blank lines
• Notes:
• One might be tempted to identify paragraphs as one or more wildcard characters (.+). But that would miss parts of paragraphs containing an inadvertent hard return (\n) between sentences.
• The extra metacharacter (?) specifies non-greedy matching. Otherwise, the pattern would not stop after the first paragraph.
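The notes above can be tried out on a small made-up $story. This sketch is not taken from partialcontent.cgi. It writes the inner group as (?:.|\n), the non-capturing form of (.|\n), so that a global match in list context returns one element per paragraph; the @both list shows what happens when the inner group captures as well.

```perl
use strict;
use warnings;

# Hypothetical story: the first paragraph contains a hard return.
my $story = "First paragraph,\nwith a hard return.\n\n"
          . "Second paragraph.\n\n\n"
          . "Third paragraph.\n\n";

# One outer capturing group, non-capturing inner group.
my @paragraphs = ($story =~ /((?:.|\n)+?\n\s*\n)/g);

# Two capturing groups: two list elements per match.
my @both = ($story =~ /((.|\n)+?\n\s*\n)/g);

# Wildcard only: the match cannot cross the hard return,
# so the first line of the first paragraph is lost.
my @dot_only = ($story =~ /(.+?\n\s*\n)/g);

print scalar(@paragraphs), " paragraphs, ", scalar(@both), " elements\n";
print "First two paragraphs:\n", $paragraphs[0], $paragraphs[1];
```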
There are still two subtle pitfalls regarding the structure of the news files.
• A sequence of two or more blank lines (\n\n\n) at the beginning of the file will cause the first \n to be matched as the first paragraph. (That is not a problem for multiple blank lines between paragraphs, since the \s* in \n\s*\n is greedy.)
• If there are no blank lines after the last paragraph in the file, the last paragraph will not be matched (hence not captured). That doesn't affect this application as long as there are three or more paragraphs in a file.
• How would you fix those problems?
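One possible answer, offered as a sketch rather than the intended solution: strip leading whitespace from the story, then split on the blank-line separator instead of matching paragraphs. The sample $story and the names $trimmed and @naive are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical story with both problems: blank lines at the start,
# and no blank line after the last paragraph.
my $story = "\n\n\nFirst paragraph.\n\n"
          . "Second paragraph,\nsecond line.\n\n\n"
          . "Last paragraph, no blank line after.";

# The matching approach: the first element is only newlines,
# and the last paragraph is missing.
my @naive = ($story =~ /((?:.|\n)+?\n\s*\n)/g);

# Assumed fix: remove leading whitespace, then split on blank lines.
(my $trimmed = $story) =~ s/\A\s+//;
my @paragraphs = split(/\n\s*\n/, $trimmed);

print scalar(@naive), " elements from matching\n";
print scalar(@paragraphs), " paragraphs from split\n";
```

Note that split discards the separators, so each paragraph comes back without its trailing blank lines and they must be added again when printing.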