
Module 4b: Perl for Web Log Analysis

bettyclark

Presentation Transcript


  1. Module 4b: Perl for Web Log Analysis
     152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
     252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453 "http://www.yisou.com/search?p=data+mining&source=toolbar_yassist_button&pid=400740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
     252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
     252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif HTTP/1.1" 200 784 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"

  2. Perl - introduction
     • A full-featured, fast, and easy-to-use scripting language
     • Very powerful pattern-matching facilities
     • More powerful than gawk; very popular for web programming and CGI scripts
     • Many Perl tutorials, e.g.
       learn.perl.org/
       www.perl.com/pub/a/2000/10/begperl1.html
       www.perlmonks.org/index.pl?node=Tutorials

  3. Perl – historical note
     • The name Perl is often expanded as Practical Extraction and Report Language
     • Developed by Larry Wall
     • Perl 1.0 was released on Usenet in 1987
     • Perl became one of the most popular web programming languages, due to powerful text manipulation and quick development
     • Perl is widely known as "the duct tape of the Internet"

  4. Perl - running
     • First Perl script (on Unix), file1.pl:
         #!/usr/local/bin/perl -w
         print "Hi there!\n";
       Note: On Windows, the first line is usually #!c:/Perl/bin/perl.exe -w
     • Running it:
         % file1.pl
       Result: Hi there!

  5. Perl for Windows
     • ActivePerl – a ready-to-install Perl distribution
     • Runs on Windows, Linux, Mac OS, and other operating systems
     • Free download: www.activestate.com/Products/ActivePerl/

  6. Perl basics
     • Two data types: numbers and strings
     • Perl uses many special characters ($, @, %) as part of its syntax
     • Perl variables:
       • Scalars (simple variables, things) start with $, e.g. $count
       • Arrays (lists) start with @, e.g. @array1
       • Hashes (associative arrays) start with %
     • Usual control structures
     • A full introduction to Perl is beyond the scope of this module

  7. What does this code do?
     @P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{ @p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord ($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&& close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print
     Answer: We do NOT want to know!

  8. The Tao of Coding
     • Human time is MUCH more precious than computer time
     • It is much better (and faster) to develop programs using methods that AVOID mistakes than to try to find bugs in badly written programs

  9. Perl style: understandability first
     • Perl allows you to write tricky programs to save a few lines of text
     • AVOID this approach
     • Use careful, step-by-step development
     • Test after every step
     • A good program should be easy to understand
     • Only after you have an understandable program, and only if you need to, should you improve efficiency

  10. Perl coding
      • Variables can be declared implicitly by their first use, e.g. $oldvar=$nevar+27
      • If $nevar was not assigned before, it is undefined and is treated as zero in a numeric expression
      • Danger! This can lead to hard-to-find errors (what if the variable was misspelled and was supposed to be $newvar?)
      • Much better to declare variables explicitly, e.g. my $newvar = 0;
      • Explicit declaration is enforced by the use strict pragma
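
The misspelling danger described on this slide can be checked directly. The sketch below is only an illustration, not part of the original lecture code: it compiles a snippet containing the misspelled $nevar inside a string eval, so that the script can report the compile error raised under use strict instead of stopping. The helper name check_typo is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compile a snippet that misspells $newvar as $nevar.
# A string eval lets this script keep running and inspect the error text.
sub check_typo {
    my $code = 'use strict; my $newvar = 0; my $oldvar = $nevar + 27; 1;';
    my $ok   = eval $code;
    return $ok ? "compiled" : "rejected: $@";
}

# Under use strict the undeclared $nevar is reported at compile time.
print check_typo(), "\n";
```

Without use strict the same snippet would compile silently and $oldvar would quietly become 27.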

  11. Sample log file
      • We will again use the file d100.log: the first 100 lines from the Nov 16, 2005 KDnuggets log file.
      • We will give useful code examples. You are encouraged to try the code examples in this lecture on this file.
      • You should get the same answers!

  12. Perl for parsing a web log file
      Program 0: logparse0.pl - read and print log file
        #!c:/Perl/bin/perl.exe -w
        use strict;
        while (<>) {
          my $line = $_;   # current line
          print $line;
        }

  13. Perl regular expressions, 1
      • Usage: $var =~ /regex/ where regex is a regular expression.
      • E.g. $line =~ /google/ is true for every line that contains "google"
      • Note: the / characters delimit the regular expression, so / can't be used inside it unless escaped like this: \/
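
As a small illustration of the delimiter note, the sketch below matches the path /jobs/ in two ways: with escaped slashes, and with Perl's m{...} form, which picks different delimiters so that / needs no escaping. The sub names are hypothetical and the m{...} variant goes beyond what the slide covers.

```perl
use strict;
use warnings;

# Match a literal /jobs/ using the default delimiters; each / is escaped.
sub has_jobs_escaped {
    my ($text) = @_;
    return $text =~ /\/jobs\// ? 1 : 0;
}

# Same test with m{...} delimiters, so the slashes can be written as-is.
sub has_jobs_braces {
    my ($text) = @_;
    return $text =~ m{/jobs/} ? 1 : 0;
}

my $request = 'GET /jobs/ HTTP/1.1';
print "escaped: ", has_jobs_escaped($request), "\n";
print "braces:  ", has_jobs_braces($request), "\n";
```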

  14. Perl log parsing, 1
      Check how many lines refer to google
        #!c:/Perl/bin/perl.exe -w
        use strict;
        my $cnt=0;
        while (<>) {
          my $line = $_;
          if ($line =~ /google/) {$cnt++;}
        }
        print " $cnt lines matched google";
      Applying this code to d100.log, you get:
        2 lines matched google

  15. Perl regular expressions, 2
      Special characters:
      • .  : matches any one character (except a newline)
      • a* : matches zero or more repeats of "a"
      • a+ : matches one or more repeats of "a"
      • \S : matches any non-whitespace character
      • ^  : anchor, matches the beginning of the string
      • $  : anchor, matches the end of the string
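
The special characters can be tried out on short strings. This is a minimal sketch with made-up test strings (only the IP address and file name come from the sample log); the helper matches is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if $text matches the regex in $pattern, else 0.
sub matches {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

my @cases = (
    [ 'j.bs',   '/jobs/' ],                      # . stands for any one character
    [ 'go*gle', 'ggle' ],                        # o* allows zero repeats of "o"
    [ 'go+gle', 'ggle' ],                        # o+ needs at least one "o"
    [ '^152',   '152.152.98.11' ],               # ^ anchors at the start
    [ '^152',   '252.152.98.11' ],               # 152 is present, but not at the start
    [ 'gif$',   '/images/KDnuggets_logo.gif' ],  # $ anchors at the end
);

for my $case (@cases) {
    my ($pattern, $text) = @$case;
    my $result = matches($pattern, $text) ? "match" : "no match";
    printf "%-8s %-28s %s\n", $pattern, $text, $result;
}
```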

  16. Log parse 2: IP address
      • The IP address is the first item on the log line.
      • In almost all log files it is followed by " - - ", representing the missing "ident_user" and "auth_user" fields
      • Regular expression for matching these 3 fields:
          $line =~ /^(\S+) - - /;

  17. Perl regex: parentheses capture match variables
      • Perl regex items enclosed in parentheses () correspond to special match variables.
      • Variable $1 contains the value matched by the regular expression in the first parentheses, $2 the value matched in the second, and so on

  18. Perl regex: match variables
      Note: The first line of the Perl script is probably different on your machine
        #!c:/Perl/bin/perl.exe -w
        use strict;
        my $cnt=0;
        while (<>) {
          my $line = $_;
          if ($line =~ /^(\S+) - - /) {
            my $ip = $1;
            print "ip $ip\n";
            $cnt++;
          }
          else {
            print "bad line $line\n";
          }
        }
        print " processed $cnt log lines\n";
      This program shows how to assign the IP address to the variable $ip; it also shows error processing if the match is not successful

  19. Perl regular expression 4: brackets
      • Brackets [ ] allow you to match any one of the characters inside
      • Example: [cmt]an will match can, man or tan, but will not match ban or dan.

  20. Perl regular expression 4b: brackets [^ ]
      • [^x] will match any character except x (note: here ^ is not the beginning-of-text anchor)
      • Example: [^:]* will match any string that does not include a colon (:).
      • Example: if $date is 16/Nov/2005:031415, then after $date =~ /([^:]*):.*/ the [^:]* part will match 16/Nov/2005. Because it was enclosed in (), the match result is stored in $1.
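
The colon example can be run as a short sketch. The sub name date_part is hypothetical; the input strings follow the timestamp layout of the sample log.

```perl
use strict;
use warnings;

# Hypothetical helper: return everything before the first colon,
# or undef when the string contains no colon at all.
sub date_part {
    my ($stamp) = @_;
    return $stamp =~ /^([^:]*):/ ? $1 : undef;
}

print date_part('16/Nov/2005:16:32:50'), "\n";   # prints 16/Nov/2005
```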

  21. Parsing log: Date, Time
      • Date and time are specified in the log as [DD/Mon/YYYY:HH:MM:SS timezone]
      • Matching regular expression:
          \[([^:]+):(..):(..):(..) -0500\]

  22. Parsing log: Date, Time
      Matching regular expression in detail:
        \[([^:]+):(..):(..):(..) -0500\]
      • \[ and \] match the literal brackets
      • [^:] matches any single character except : , so [^:]+ matches a run of one or more such characters
      • ([^:]+) will match DD/Mon/YYYY; value in $1
      • first (..) will match HH (hours); value in $2
      • second (..) will match MM (minutes); value in $3
      • third (..) will match SS (seconds); value in $4

  23. Parsing log: Time Zone
      • The time zone is given as an offset relative to GMT
      • The time zone in the log file is for the SERVER, not for the visitor, so it is nearly always the same throughout the log
      • but it changes during daylight saving time
      • In our test log file the time zone is -0500, the US Eastern time zone
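
Because the offset changes during daylight saving time, a regex with a hard-coded -0500 would reject lines logged in summer. One possible variation, which is an assumption here and not the lecture's own regex, replaces the literal offset with ([-+]\d{4}) and captures it as a fifth value. The sub name parse_timestamp is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split a [DD/Mon/YYYY:HH:MM:SS zone] field into
# (date, hour, minute, second, zone). Returns an empty list on failure.
sub parse_timestamp {
    my ($field) = @_;
    if ($field =~ /\[([^:]+):(..):(..):(..) ([-+]\d{4})\]/) {
        return ($1, $2, $3, $4, $5);
    }
    return;
}

my @parts = parse_timestamp('[16/Nov/2005:16:32:50 -0500]');
print join(" | ", @parts), "\n";   # prints 16/Nov/2005 | 16 | 32 | 50 | -0500
```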

  24. Parsing log: Request
      Regular expression for parsing the Request field:
        "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"
      • " at each end: opening and closing quotes
      • (GET|HEAD|POST|OPTIONS): method
      • (\S+): URL, captures any string of 1 or more non-blanks
      • HTTP(\S+): HTTP version, usually ignored
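
A short sketch of the Request regex applied to the request from the first sample log line. The sub name parse_request is hypothetical. Note that the third capture holds whatever follows the letters HTTP, so it includes the slash.

```perl
use strict;
use warnings;

# Hypothetical helper: return (method, url, version) for a quoted request,
# or an empty list when the method is not one of the four listed.
sub parse_request {
    my ($text) = @_;
    if ($text =~ /"(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"/) {
        return ($1, $2, $3);
    }
    return;
}

my ($method, $url, $version) = parse_request('"GET /jobs/ HTTP/1.1"');
print "method=$method url=$url version=$version\n";
# prints method=GET url=/jobs/ version=/1.1
```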

  25. Parsing log: Status code and Object size
      • Status (Response) code is always a 3-digit number, followed by a space, so it can be matched with (\d\d\d)
      • Object size is either a number or "-", followed by a space. The simplest regex to match it is (\S+)
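
A minimal sketch of these two fields, assuming the text to parse starts at or before the closing quote of the request. Converting a "-" size to 0 is an illustrative choice, not something the slide prescribes, and the sub name parse_status_size is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return (status, size) taken from the text that
# follows the closing quote of the request; a size of "-" becomes 0.
sub parse_status_size {
    my ($text) = @_;
    if ($text =~ /" (\d\d\d) (\S+) /) {
        my ($status, $size) = ($1, $2);
        $size = 0 if $size eq '-';
        return ($status, $size);
    }
    return;
}

my ($status, $size) = parse_status_size('"GET /jobs/ HTTP/1.1" 200 15140 "-"');
print "status=$status size=$size\n";   # prints status=200 size=15140
```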

  26. Parsing log: Referrer
      • The Referrer is a string enclosed in double quotes "…"
      • Can have anything inside except for a double quote
      • Can also be "-" in case of a direct request.
      • Not documented, but can be "" (nothing between the quotes).
      • Referrer can be matched by:
          "([^"]*)"
        • " at each end: opening and closing quotes
        • [^"]: anything except a double quote
        • *: appearing zero or more times

  27. Parsing log: User agent
      • User agent is also a string enclosed in double quotes "…" that can have anything inside except for a double quote.
      • It can also be "-".
      • User agent can be matched by:
          "([^"]+)"
        • " at each end: opening and closing quotes
        • [^"]: anything except a double quote
        • +: appearing one or more times
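
The two quoted fields can be pulled from the end of a line as sketched below. Anchoring with \s*$ is an added assumption, used so that the quoted request earlier in the line is not mistaken for the referrer; the sub name parse_referrer_agent is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return (referrer, user agent) from the last two
# quoted fields of a log line, or an empty list if they are missing.
sub parse_referrer_agent {
    my ($line) = @_;
    if ($line =~ /"([^"]*)" "([^"]+)"\s*$/) {
        return ($1, $2);
    }
    return;
}

my $tail = '200 784 "http://www.kdnuggets.com/" '
         . '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"';
my ($referrer, $agent) = parse_referrer_agent($tail);
print "referrer: $referrer\n";
print "agent:    $agent\n";
```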

  28. Parsing a web log line: putting it all together
      The matching is done by the following (should be all on one line):
        if ($line =~ /^(\S+) - - \[([^:]+):(..):(..):(..) -0500\] "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)" (\d\d\d) (\S+) "([^"]*)" "([^"]+)"/ ) { … }
      Full code is in program weblog_parse.pl
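
The program weblog_parse.pl is not reproduced in this transcript. As a stand-in, here is a minimal sketch that applies the combined regex to the first sample log line and stores the twelve captures in a hash. The sub name parse_line and the field names are hypothetical.

```perl
use strict;
use warnings;

# The combined regex from the slide, kept on one line.
my $LOG_RE = qr/^(\S+) - - \[([^:]+):(..):(..):(..) -0500\] "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)" (\d\d\d) (\S+) "([^"]*)" "([^"]+)"/;

# Hypothetical field names, one per capture group, in order.
my @FIELDS = qw(ip date hour min sec method url http status size referrer agent);

# Return a hash reference of named fields, or undef for a non-matching line.
sub parse_line {
    my ($line) = @_;
    my @values = $line =~ $LOG_RE or return;
    my %rec;
    @rec{@FIELDS} = @values;
    return \%rec;
}

my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"';

my $rec = parse_line($line);
print "$rec->{ip} requested $rec->{url} at $rec->{hour}:$rec->{min}, status $rec->{status}\n";
# prints 152.152.98.11 requested /jobs/ at 16:32, status 200
```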

  29. Perl arrays
      • A Perl array is an ordered list of items
      • Array names begin with @
      • Array initialization:
          @days=("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");

  30. Perl arrays, num of items
      • When referring to a single array item, the name begins with "$". E.g. we print the first array item (index 0) using print $days[0];
      • $#array is the index of the last item, so $#days is 6
      • The number of items is scalar(@array), so scalar(@days) is 7
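
The difference between the last index and the item count is easy to confirm with a short sketch. The sub names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers: report the last index and the item count of a list.
sub last_index { my @items = @_; return $#items; }
sub item_count { my @items = @_; return scalar @items; }

my @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");
print "first item: $days[0]\n";                    # prints first item: Sun
print "last index: ", last_index(@days), "\n";     # prints last index: 6
print "item count: ", item_count(@days), "\n";     # prints item count: 7
```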

  31. Perl array iteration
      • Iterating over the entire array:
          foreach $day (@days) { print $day, "\n"; }
      • is the same as
          for ($n = 0; $n < 7; $n++) { print $days[$n], "\n"; }
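
Under use strict the loop variables must be declared with my. The sketch below writes both loop styles that way and collects the items instead of printing them, so the two results can be compared. The sub names are hypothetical, and the indexed loop uses the array's own length in place of the literal 7.

```perl
use strict;
use warnings;

# Hypothetical helper: visit every item with foreach.
sub list_foreach {
    my @items = @_;
    my @seen;
    foreach my $item (@items) { push @seen, $item; }
    return join(",", @seen);
}

# Hypothetical helper: visit every item with a C-style indexed loop.
sub list_indexed {
    my @items = @_;
    my @seen;
    for (my $n = 0; $n < @items; $n++) { push @seen, $items[$n]; }
    return join(",", @seen);
}

my @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");
print list_foreach(@days), "\n";
print list_indexed(@days), "\n";
```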

  32. Perl hash
      • A hash is an unordered collection of key, value pairs.
      • Hash names begin with %
      • Hash initialization:
          %capitals=("USA", "Washington D.C.", "France", "Paris", "China", "Beijing");

  33. Perl hash reference
      • When referring to a single hash item, the name begins with "$".
      • To get the capital of China from %capitals we use $capitals{"China"}
      • To add the capital of the UK, we use
          $capitals{"UK"} = "London";

  34. Perl hash iteration
      Iteration over the entire hash:
        foreach $country (keys %capitals) {
          print "$country capital $capitals{$country}\n";
        }
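
Hashes fit log analysis naturally, for example for counting requests per IP address. The sketch below is an illustration only: it reuses the IP regex from slide 16 on three log lines that have been shortened from the sample log, and the sub name count_by_ip is hypothetical. The keys are sorted because hash order is not predictable.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many lines start with each IP address.
sub count_by_ip {
    my @lines = @_;
    my %hits;
    for my $line (@lines) {
        if ($line =~ /^(\S+) - - /) {
            $hits{$1}++;
        }
    }
    return \%hits;
}

# Shortened versions of the sample log lines (referrer and agent dropped).
my @sample = (
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140',
    '252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453',
    '252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145',
);

my $hits = count_by_ip(@sample);
foreach my $ip (sort keys %$hits) {
    print "$ip $hits->{$ip}\n";
}
```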

  35. Additional tools for Web log analysis
      • Perl for web log analysis: www.oreilly.com/catalog/perlwsmng/chapter/ch08.html
      • Some web log analysis tools:
        • Analog: www.analog.cx/
        • AWstats: awstats.sourceforge.net/
        • Webalizer: www.mrunix.net/webalizer/
        • FTPweblog: www.nihongo.org/snowhare/utilities/ftpweblog/
