Module 4b: Perl for Web Log Analysis. 152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453 "http://www.yisou.com/search?p=data+mining&source=toolbar_yassist_button&pid=400740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif HTTP/1.1" 200 784 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
Perl - introduction • A full-featured, fast, and easy to use scripting language • Very powerful pattern-matching facilities • More powerful than gawk; very popular for web programming and CGI files • Many Perl tutorials, e.g. learn.perl.org/ www.perl.com/pub/a/2000/10/begperl1.html www.perlmonks.org/index.pl?node=Tutorials
Perl – historical note • Perl is often expanded as "Practical Extraction and Report Language" • Developed by Larry Wall • Perl 1.0 was released on Usenet (comp.sources.misc) in 1987 • Perl became one of the most popular web programming languages, thanks to powerful text manipulation and quick development. • Perl is widely known as "the duct tape of the Internet".
Perl - running • First Perl script (on Unix), file1.pl:
  #!/usr/local/bin/perl -w
  print "Hi there!\n";
Note: On Windows, the first line is usually #!c:/Perl/bin/perl.exe -w
Run it:
  % file1.pl
Result: Hi there!
Perl for Windows • ActivePerl – ready-to-install Perl distribution • Runs on Windows, Linux, Mac OS, and other operating systems • Free download www.activestate.com/Products/ActivePerl/
Perl basics • Two data types: numbers and strings • Perl uses many special characters $, @, %, as part of its syntax • Perl variables: • Scalars (simple variables, things) start with $, e.g. $count • Arrays (lists) start with @, e.g. @array1 • Hashes (associative arrays) start with % • Usual control structures • Full introduction to Perl is beyond the scope of this module
What does this code do? @P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{ @p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord ($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&& close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print Answer: We do NOT want to know !
The Tao of Coding • Human time is MUCH more precious than computer time • It is much better (and faster) to develop programs using methods that AVOID mistakes than try to find bugs in badly written programs
Perl style: understandability first • Perl allows you to write tricky programs to save a few lines of text • AVOID this approach • Use careful, step-by-step development • Test after every step • A good program should be easy to understand • Only after you have an understandable program, and only if you need it, should you improve efficiency
Perl coding • Variables can be declared implicitly by their first use, e.g. $oldvar=$nevar+27 • If $nevar was not assigned before, it is undefined and is treated as zero in arithmetic • Danger! This can lead to hard-to-find errors (what if the variable was misspelled and was supposed to be $newvar?) • Much better to declare variables explicitly, e.g. my $newvar = 0; • Explicit declaration is enforced by the pragma use strict
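The misspelling danger can be seen in a small experiment. This is only a sketch: the helper name try_snippet is made up for illustration, and it uses string eval so that the failing snippet does not stop the whole script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compile and run a snippet, report the outcome.
sub try_snippet {
    my ($code) = @_;
    my $result = eval $code;    # string eval compiles the snippet at run time
    return defined $result ? "ok: $result" : "error: $@";
}

# Without strict: the misspelled $nevar silently counts as 0.
my $loose = try_snippet(
    'no strict; no warnings; our $newvar = 5; our $oldvar = $nevar + 27; $oldvar'
);

# With strict: the same typo is rejected when the snippet is compiled.
my $strict = try_snippet(
    'use strict; my $newvar = 5; my $oldvar = $nevar + 27; $oldvar'
);

print "$loose\n";
print $strict;
```

The first snippet quietly returns 27 instead of 32; the second one fails with a "Global symbol" error that names the misspelled variable.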
Sample log file • We will again use file d100.log – first 100 lines from the Nov 16, 2005 KDnuggets log file. • We will give useful code examples. • You are encouraged to try the code examples in this lecture on this file. • You should get the same answers!
Perl for parsing a web log file Program 0: logparse0.pl - read and print log file
  #!c:/Perl/bin/perl.exe -w
  use strict;
  while (<>) {
      my $line = $_;   # current line
      print $line;
  }
Perl regular expressions, 1 • Usage: $var =~/ regex / where regex is a regular expression. E.g. $line =~ /google/ will match all lines containing "google" Note: / delimit regular expression, so / can't be used inside (unless escaped like this \/ )
Perl log parsing, 1 Check how many lines refer to google
  #!c:/Perl/bin/perl.exe -w
  use strict;
  my $cnt=0;
  while (<>) {
      my $line = $_;
      if ($line =~/google/) {$cnt++;}
  }
  print " $cnt lines matched google";
Applying this code to d100.log, you get: 2 lines matched google
Perl regular expressions, 2 Special characters:
. : matches any one character (except newline)
a* : matches zero or more repeats of "a"
a+ : matches 1 or more repeats of "a"
\S : matches any non-white space character
^ : anchor – matches beginning of string
$ : anchor – matches end of string
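A quick way to check these rules is to try each one on a tiny string. The test strings below are invented for illustration; they are not from d100.log.

```perl
use strict;
use warnings;

# Each entry: [ test string, pattern, what it demonstrates ]
my @cases = (
    [ 'cat', qr/c.t/,   '. matches one character' ],
    [ 'ct',  qr/ca*t/,  'a* allows zero repeats of a' ],
    [ 'ct',  qr/ca+t/,  'a+ needs at least one a' ],
    [ 'a b', qr/^\S+$/, '\S does not match the space' ],
);

my @results;
for my $case (@cases) {
    my ($string, $regex, $note) = @$case;
    my $outcome = ($string =~ $regex) ? 'match' : 'no match';
    push @results, $outcome;
    print "'$string': $outcome ($note)\n";
}
```

The first two cases match and the last two do not, which is a useful reminder that ^ and $ force the pattern to cover the whole string.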
Log parse 2: IP address • IP address is the first item on the log line. • In almost all log files it is followed by " - - ", representing missing "ident_user" and "auth_user" fields • Regular expression for matching these 3 fields: $line =~ /^(\S+) - - /;
Perl regex: parentheses capture match variables • Perl regex items enclosed in parentheses () correspond to special match variables. • Variable $1 contains value matched by regular expression in the first parentheses, etc
Perl regex: match variables Note: The first line (the path to Perl) is probably different on your machine
  #!c:/Perl/bin/perl.exe -w
  use strict;
  my $cnt=0;
  while (<>) {
      my $line = $_;
      if ($line =~ /^(\S+) - - /) {
          my $ip = $1;
          print "ip $ip\n";
          $cnt++;
      } else {
          print "bad line $line\n";
      }
  }
  print " processed $cnt log lines\n";
This program shows how to assign the IP address to the variable $ip; it also shows error processing if the match is not successful.
Perl regular expression 4: brackets • Brackets [ ] allow you to match any one of the characters inside • Example: • [cmt]an will match can, man or tan, • will not match ban or dan.
Perl regular expression 4b: brackets [^ ] [^x] will match any character except x • (note: here ^ is not the beginning-of-string anchor) Example: [^:]* will match the longest run of characters that does not include a colon. Example: if $date is 16/Nov/2005:031415 , then after $date =~ /([^:]*):.*/ the [^:]* part has matched 16/Nov/2005 . Because it was enclosed in (), the match result is stored in $1
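The $date example can be run directly. A minimal sketch, using the same value as on the slide:

```perl
use strict;
use warnings;

my $date = '16/Nov/2005:031415';
my $day  = '';

# [^:]* takes everything up to the first colon; () stores it in $1.
if ($date =~ /([^:]*):.*/) {
    $day = $1;
}
print "$day\n";    # prints 16/Nov/2005
```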
Parsing log: Date, Time • Date, Time is specified in the log as [DD/Mon/YYYY:HH:MM:SS timezone] Matching regular expression \[([^:]+):(..):(..):(..) -0500\]
Parsing log: Date, Time Matching regular expression in detail \[([^:]+):(..):(..):(..) -0500\]
\[ and \] match the literal brackets [ and ]
[^:] matches any single character except :
([^:]+) will match DD/Mon/YYYY ; value in $1
first (..) will match HH (hours); value in $2
second (..) will match MM ; value in $3
third (..) will match SS ; value in $4
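Applied to the beginning of the first sample log line, the expression can be tried like this (a sketch; the variable names are chosen here for illustration):

```perl
use strict;
use warnings;

my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140';
my ($date, $hh, $mm, $ss) = ('', '', '', '');

if ($line =~ /\[([^:]+):(..):(..):(..) -0500\]/) {
    ($date, $hh, $mm, $ss) = ($1, $2, $3, $4);
}
print "date=$date hours=$hh minutes=$mm seconds=$ss\n";
```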
Parsing log: Time Zone • The time zone is relative to GMT • The time zone in the log file is for the SERVER, not for the visitor, so it is nearly always the same throughout the log • but it changes during daylight saving time • In our test log file the time zone is -0500, US Eastern time zone
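Since the zone changes with daylight saving time, a log that spans the whole year may need a pattern that does not hard-code -0500. One possible variation is sketched below; it assumes the zone is always a sign followed by four digits, and the second timestamp is a made-up example, not a line from the sample log.

```perl
use strict;
use warnings;

# Assumption: the time zone is written as + or - followed by four digits.
my $stamp_regex = qr/\[([^:]+):(..):(..):(..) ([+-]\d{4})\]/;

my @stamps = (
    '[16/Nov/2005:16:32:50 -0500]',    # as in the sample log
    '[16/Jul/2005:16:32:50 -0400]',    # hypothetical daylight saving entry
);

my @zones;
for my $stamp (@stamps) {
    if ($stamp =~ $stamp_regex) {
        push @zones, $5;
        print "$1 $2:$3:$4 zone $5\n";
    }
}
```

With this change the time zone becomes capture $5, so any captures that follow it in a longer expression shift by one.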
Parsing log: Request Regular expression for parsing the Request field: "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"
" at both ends : opening and closing quotes
(GET|HEAD|POST|OPTIONS) : method
(\S+) : URL, captures any string of 1 or more non-blanks
HTTP(\S+) : HTTP version, usually ignored
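A short check of the Request expression on the beginning of the first sample line (a sketch; the variable names are illustrative):

```perl
use strict;
use warnings;

my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140';
my ($method, $url, $version) = ('', '', '');

if ($line =~ /"(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"/) {
    ($method, $url, $version) = ($1, $2, $3);
}
print "method=$method url=$url version=$version\n";
```

Note that the third capture includes the slash (it holds /1.1), because the slash follows the literal HTTP in the pattern.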
Parsing log: Status code and Object size Status (Response) code is always a 3-digit number, followed by space, so it can be matched with (\d\d\d) Object size is either a number or "-" followed by space. Simplest regex to match it is (\S+)
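The two fields can be tried out as follows. This is a sketch with two assumptions: the second line is a hypothetical variant of the sample line with status 304 and size "-", and counting "-" as 0 bytes is a choice made here, not something the log format prescribes.

```perl
use strict;
use warnings;

my @lines = (
    # from the sample log
    '252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"',
    # hypothetical variant with no object size
    '252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 304 - "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"',
);

my @parsed;
for my $line (@lines) {
    # The leading quote and space tie the match to the end of the Request field.
    if ($line =~ /" (\d\d\d) (\S+) /) {
        my ($status, $size) = ($1, $2);
        my $bytes = ($size eq '-') ? 0 : $size;    # assumption: "-" means 0 bytes
        push @parsed, "$status:$bytes";
        print "status $status, bytes $bytes\n";
    }
}
```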
Parsing log: Referrer The Referrer is a string enclosed in double quotes "…" Can have anything inside except for a double quote. Can also be "-" in case of a direct request. Not documented, but can be "" (nothing between the quotes). Referrer can be matched by: "([^"]*)"
" at both ends : opening and closing quotes
[^"] : anything except a double quote
* : appearing zero or more times
Parsing log: User agent User agent is also a string enclosed in double quotes "…", that can have anything inside except for a double quote. It can also be "-". User agent can be matched by: "([^"]+)"
" at both ends : opening and closing quotes
[^"] : anything except a double quote
+ : appearing one or more times
Parsing a web log line: putting it all together The matching is done by the following (should be all on one line)
if ($line =~ /^(\S+) - - \[([^:]+):(..):(..):(..) -0500\] "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)" (\d\d\d) (\S+) "([^"]*)" "([^"]+)"/ ) { … }
Full code is in program weblog_parse.pl
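To see which match variable holds which field, the combined expression can be applied to the first sample log line. This is only a sketch and not the contents of weblog_parse.pl; the hash %field and its key names are invented for illustration.

```perl
use strict;
use warnings;

my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"';

my %field;
if ($line =~ /^(\S+) - - \[([^:]+):(..):(..):(..) -0500\] "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)" (\d\d\d) (\S+) "([^"]*)" "([^"]+)"/) {
    # $8 (the HTTP version) is skipped, as it is usually ignored.
    %field = (
        ip       => $1,
        date     => $2,
        time     => "$3:$4:$5",
        method   => $6,
        url      => $7,
        status   => $9,
        size     => $10,
        referrer => $11,
        agent    => $12,
    );
} else {
    print "bad line $line\n";
}

for my $name (qw(ip date time method url status size referrer agent)) {
    print "$name: $field{$name}\n" if exists $field{$name};
}
```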
Perl arrays • Perl array is an ordered list of items • Array names begin with @ • Array initialization: @days=("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
Perl arrays, num of items • When referring to a single array item, name begins with "$". E.g. we print the first array item (index 0) using print $days[0] ; • The index of the last item in an array is $#array, so $#days is 6 • The number of items is scalar(@days), which is 7
Perl array iteration • Iterating over entire array foreach $day (@days) {print $day,"\n" } ; • is the same as for ($n=0; $n <7; $n++) { print $days[$n],"\n" } ;
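The array slides can be combined into one small script that also runs under use strict (hence the added my declarations). A minimal sketch:

```perl
use strict;
use warnings;

my @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");

my $last_index = $#days;          # index of the last item
my $count      = scalar(@days);   # number of items
print "last index $last_index, count $count\n";

# Loop over the items directly.
my @by_foreach;
foreach my $day (@days) {
    push @by_foreach, $day;
}

# Loop over the indexes; visits the same items in the same order.
my @by_index;
for (my $n = 0; $n < $count; $n++) {
    push @by_index, $days[$n];
}

print "@by_foreach\n";
print "@by_index\n";
```

Using $count as the loop limit instead of a literal 7 keeps the loop correct if the array changes.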
Perl hash • Hash is unordered list of key, value pairs. • Hash names begin with % • Hash initialization: %capitals=("USA", "Washington D.C.", "France", "Paris", "China", "Beijing") ;
Perl hash reference • Referring to a single hash item, name begins with "$". • To get capital of China from %capitals we use $capitals{"China"} • To add the capital of UK, we use • $capitals{"UK"} = "London" ;
Perl hash iteration Iteration over the entire hash foreach $country (keys %capitals) {print "$country capital $capitals{$country}\n"; }
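The hash slides can be tried together as below. This sketch adds my so that it runs under use strict, and it sorts the keys, which is an addition here: keys returns the entries in no particular order, so sorting makes the output repeatable.

```perl
use strict;
use warnings;

my %capitals = ("USA", "Washington D.C.", "France", "Paris", "China", "Beijing");
$capitals{"UK"} = "London";    # add one more key, value pair

my @report;
foreach my $country (sort keys %capitals) {
    push @report, "$country capital $capitals{$country}";
}
print "$_\n" for @report;
```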
Additional tools for Web log analysis • Perl for web log analysis www.oreilly.com/catalog/perlwsmng/chapter/ch08.html • Some web log analysis tools • Analog www.analog.cx/ • AWStats awstats.sourceforge.net/ • Webalizer www.mrunix.net/webalizer/ • FTPweblog www.nihongo.org/snowhare/utilities/ftpweblog/