Learn about Gawk, a powerful text processing language similar to AWK. Run commands, process fields, and analyze log files effectively with Gawk. Explore tutorials and examples for in-depth understanding.
3b: Gawk for Web Log Analysis
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453 "http://www.yisou.com/search?p=data+mining&source=toolbar_yassist_button&pid=400740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif HTTP/1.1" 200 784 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
Gawk - introduction
• A very powerful text processing and pattern matching language
• gawk is the GNU version of awk
• Syntax similar to C
• See http://www.gnu.org/software/gawk/ for the manual
• Many awk/gawk tutorials, e.g.
  • http://www.cs.hmc.edu/qref/awk.html
  • http://www.cs.ucsb.edu/~sherwood/awk/
Note: The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of awk was written in 1977.
Gawk - running
• Several ways of running from the Unix prompt:
% gawk 'commands' file
% cat file | gawk 'commands'
% cat file | gawk -f prog.gawk
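The three invocations above can be compared on a tiny input. This is a minimal sketch, assuming a POSIX shell; sample.txt and prog.gawk are hypothetical files created only for this illustration, and the first line falls back to the system awk if gawk is not installed.

```shell
# Portability assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

tmpdir=$(mktemp -d)
printf 'alpha 1\nbeta 2\n' > "$tmpdir/sample.txt"   # hypothetical input file
printf '{ print $1 }\n' > "$tmpdir/prog.gawk"       # hypothetical program file

# 1. Program on the command line, input file as an argument
out1=$(gawk '{ print $1 }' "$tmpdir/sample.txt")
# 2. Program on the command line, input from a pipe
out2=$(cat "$tmpdir/sample.txt" | gawk '{ print $1 }')
# 3. Program read from a file with -f, input from a pipe
out3=$(cat "$tmpdir/sample.txt" | gawk -f "$tmpdir/prog.gawk")

echo "$out1"
rm -r "$tmpdir"
```

All three forms run the same program, so out1, out2, and out3 hold the same text.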
Gawk - fields and records
• Gawk divides the file into records and fields
• Each line is a record (by default)
• Fields are delimited by a special character
  • Default: white space (blank or tab)
  • Can be changed with the -F option
  • E.g. to have comma as a delimiter, use gawk -F"," 'commands' file.csv
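A short sketch of how the delimiter changes the fields. The CSV content below is made up for illustration, and the first line is a portability fallback rather than part of the lecture.

```shell
# Portability assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

csv=$(mktemp)
printf 'ann,34,Boston\nbob,27,New York\n' > "$csv"   # made-up sample data

# With -F"," the third field is the city, even when it contains a blank.
cities=$(gawk -F"," '{ print $3 }' "$csv")
echo "$cities"

# With the default whitespace delimiter, the second line splits into
# only two fields: "bob,27,New" and "York".
nf=$(gawk 'NR == 2 { print NF }' "$csv")
echo "$nf"

rm "$csv"
```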
Gawk fields and variables
Fields are accessed with the $ prefix. Special variables:
• $1 is the first field, $2 is the second…
• $0 is a special field which is the entire line
• NF is a special variable: the number of fields in the current record
• NR is a special variable: the current record number
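To see these variables on real data, the sketch below feeds gawk the first line of the sample log shown later in this lecture. The temporary file and the fallback on the first line are assumptions added for the illustration.

```shell
# Portability assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

log=$(mktemp)
printf '%s\n' 'ip1664.com - - [16/Nov/2005:00:00:43 -0500] "GET /robots.txt HTTP/1.0" 200 173 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"' > "$log"

# Record number, number of fields, host ($1), requested file ($7), status ($9)
summary=$(gawk '{ print NR, NF, $1, $7, $9 }' "$log")
echo "$summary"

# $0 is the whole line, unchanged
whole=$(gawk '{ print $0 }' "$log")

rm "$log"
```

With whitespace splitting, this line has 13 fields because the date, the request, and the user agent each contain blanks.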
Gawk conditions
gawk -F"d" 'condition' file
• gawk processes each line of file, using the delimiter d (default is whitespace) to split each line into fields.
• When no action is given, the default action is to print the entire line, and it is applied only to lines for which condition is true.
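A minimal sketch of a condition with no action. The four log lines are made up in the same layout as the KDnuggets log (they are not taken from d100.log), and the first line is a portability fallback.

```shell
# Portability assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

log=$(mktemp)
cat > "$log" <<'EOF'
host1 - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "agent"
host2 - - [16/Nov/2005:00:00:02 -0500] "GET /missing.html HTTP/1.1" 404 300 "http://www.google.com/search?q=x" "agent"
host3 - - [16/Nov/2005:00:00:03 -0500] "POST /form.htm HTTP/1.1" 200 500 "http://www.kdnuggets.com/" "agent"
host4 - - [16/Nov/2005:00:00:04 -0500] "GET / HTTP/1.1" 200 2000 "-" "agent"
EOF

# No action is given, so gawk prints each whole line whose $9 equals 404.
matched=$(gawk '$9 == 404' "$log")
echo "$matched"

rm "$log"
```

Only the host2 line is printed, in full.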
Sample log file
• We will use file d100.log: the first 100 lines from the Nov 16, 2005 KDnuggets log file.
• We will give useful code examples; for a full gawk introduction, see the gawk manual and tutorials
• You are encouraged to try the code examples in this lecture on this file
• You should get the same answers!
Sample log file d100.log
ip1664.com - - [16/Nov/2005:00:00:43 -0500] "GET /robots.txt HTTP/1.0" 200 173 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
ip1664.com - - [16/Nov/2005:00:00:43 -0500] "GET /gpspubs/sigkdd-kdd99-panel.html HTTP/1.0" 200 14199 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
ip2283.unr - - [16/Nov/2005:00:01:02 -0500] "GET /dmcourse/data_mining_course/assignments/assignment-3.html HTTP/1.1" 200 8090 "http://www.google.com/search?hl=en&q=use+of+data+cleaning+in+data+mining&spell=1" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
ip2283.unr - - [16/Nov/2005:00:01:03 -0500] "GET /dmcourse/dm.css HTTP/1.1" 200 155 "http://www.kdnuggets.com/dmcourse/data_mining_course/assignments/assignment-3.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
ip1389.net - - [16/Nov/2005:00:02:46 -0500] "GET /gpspubs/kdd99-est-ben-lift/sld021.htm HTTP/1.1" 200 1385 "http://www.google.com/search?hs=JnE&hl=en&lr=&client=opera&rls=en&q=lift+curve&btnG=Search" "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1389.net - - [16/Nov/2005:00:02:46 -0500] "GET /gpspubs/kdd99-est-ben-lift/img021.gif HTTP/1.1" 200 7465 "http://www.kdnuggets.com/gpspubs/kdd99-est-ben-lift/sld021.htm" "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1389.net - - [16/Nov/2005:00:02:47 -0500] "GET /favicon.ico HTTP/1.1" 200 899 "http://www.kdnuggets.com/gpspubs/kdd99-est-ben-lift/sld021.htm" "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1946.com - - [16/Nov/2005:00:02:49 -0500] "GET /news/2001/n10/15i.html HTTP/1.0" 200 4214 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
…
Example 1: Lines with status not equal to 200
• Status code is field $9 in the log file
• How many lines had a status code other than 200:
% gawk '$9 != 200' d100.log | wc -l
Result: 27
Note: to count lines with status code equal to 200, use '$9 == 200', not '$9 = 200'. A single = assigns 200 to $9, and because the assigned value is non-zero, the condition is then true for every line.
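The difference between == and = is easy to check on a few lines. The sketch below uses four made-up log lines in the same layout (not d100.log), so the counts differ from the result above; the first line is a portability fallback.

```shell
# Portability assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

log=$(mktemp)
cat > "$log" <<'EOF'
host1 - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "agent"
host2 - - [16/Nov/2005:00:00:02 -0500] "GET /missing.html HTTP/1.1" 404 300 "http://www.google.com/search?q=x" "agent"
host3 - - [16/Nov/2005:00:00:03 -0500] "POST /form.htm HTTP/1.1" 200 500 "http://www.kdnuggets.com/" "agent"
host4 - - [16/Nov/2005:00:00:04 -0500] "GET / HTTP/1.1" 200 2000 "-" "agent"
EOF

not200=$(gawk '$9 != 200' "$log" | wc -l | tr -d ' ')   # comparison: 1 line
is200=$(gawk '$9 == 200' "$log" | wc -l | tr -d ' ')    # comparison: 3 lines
oops=$(gawk '$9 = 200' "$log" | wc -l | tr -d ' ')      # assignment: all 4 lines

echo "$not200 $is200 $oops"
rm "$log"
```

The assignment form also rewrites $9 in the printed lines, so the 404 line is shown with status 200.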
Example 2: Count referrals from Google
• Gawk has powerful pattern matching
  • variable ~ "pattern"
• Example: how many log lines had a referral (field $11 in the log line) from google:
% gawk '$11 ~ "google"' d100.log | wc -l
Result: 2
Example 3: complex condition
• How many hits had GET method and status 404?
  • (status 404 is an error code)
• Method is field $6 in the log, but because the request is enclosed in double quotes, $6 is "GET with the leading quote attached. A pattern match avoids having to deal with the quote:
% gawk '$6 ~ "GET" && $9 == 404' d100.log | wc -l
Result: 1
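The leading quote in $6 can be seen directly. A sketch on four made-up lines in the same layout (not d100.log); the first line is a portability fallback.

```shell
# Portability assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

log=$(mktemp)
cat > "$log" <<'EOF'
host1 - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "agent"
host2 - - [16/Nov/2005:00:00:02 -0500] "GET /missing.html HTTP/1.1" 404 300 "http://www.google.com/search?q=x" "agent"
host3 - - [16/Nov/2005:00:00:03 -0500] "POST /form.htm HTTP/1.1" 200 500 "http://www.kdnuggets.com/" "agent"
host4 - - [16/Nov/2005:00:00:04 -0500] "GET / HTTP/1.1" 200 2000 "-" "agent"
EOF

# Field 6 of the first line, leading double quote included
method=$(gawk 'NR == 1 { print $6 }' "$log")
echo "$method"

exact=$(gawk '$6 == "GET"' "$log" | wc -l | tr -d ' ')      # 0: the quote is missing
quoted=$(gawk '$6 == "\"GET"' "$log" | wc -l | tr -d ' ')   # 3: quote escaped as \"
get404=$(gawk '$6 ~ "GET" && $9 == 404' "$log" | wc -l | tr -d ' ')   # 1

echo "$exact $quoted $get404"
rm "$log"
```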
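Example 4a: Counting ".html" requests
• The requested file is field $7. We can use this condition to match files that end in .html
• Note: $ in the pattern matches the end of the string
% gawk '$7 ~ ".html$"' d100.log | wc -l
Result: 21
Note: an unescaped . in a pattern matches any single character, so this pattern would also accept a name ending in, say, xhtml. To require a literal dot, write the pattern as /\.html$/.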
Example 4b: Counting htm or html requests
• Some files may also end in .htm, so we can use
% gawk '$7 ~ ".html$|.htm$"' d100.log | wc -l
Result: 22
Example 4c: Counting directory requests
• Some requests can be for a directory, e.g. a request for the homepage www.kdnuggets.com/ would have the string "GET / HTTP/1.1".
• We can count these requests by
% gawk '$7 ~ "/$"' d100.log | wc -l
Result: 6
Example 4d: Counting all HTML pages
• Or count html, htm, and directory pages together by
% gawk '$7 ~ "(html|htm|/)$"' d100.log | wc -l
Result: 28
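The patterns in Examples 4a to 4d leave the dot unescaped or omit it. The sketch below contrasts that with an escaped dot, using a made-up list of request paths placed in field $1 for brevity (in the log they would be in $7). The stricter patterns are a suggested variant, not the lecture's own code, and the first line is a portability fallback.

```shell
# Portability assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

paths=$(mktemp)
printf '%s\n' /a.html /b.htm /xhtml /c.css / > "$paths"   # made-up request paths

# Unescaped dot: "." matches any character, so /xhtml also matches.
loose=$(gawk '$1 ~ ".html$"' "$paths" | wc -l | tr -d ' ')

# Escaped dot: only names that really end in ".html".
strict=$(gawk '$1 ~ /\.html$/' "$paths" | wc -l | tr -d ' ')

# .htm, .html, or a trailing slash (directory request).
pages=$(gawk '$1 ~ /(\.html?|\/)$/' "$paths" | wc -l | tr -d ' ')

echo "$loose $strict $pages"
rm "$paths"
```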
Gawk computations
• A more general form of gawk statements is
gawk '{statements;…}' file
• The statements are executed for each line of file
• Statements include the usual conditionals, loops, etc.
• Details are in the gawk manual and tutorials
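As a small illustration of statements inside braces, the sketch below labels each request as ok or error with an if/else. The log lines are made up in the same layout (not d100.log), and the first line is a portability fallback.

```shell
# Portability assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

log=$(mktemp)
cat > "$log" <<'EOF'
host1 - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "agent"
host2 - - [16/Nov/2005:00:00:02 -0500] "GET /missing.html HTTP/1.1" 404 300 "http://www.google.com/search?q=x" "agent"
host3 - - [16/Nov/2005:00:00:03 -0500] "POST /form.htm HTTP/1.1" 200 500 "http://www.kdnuggets.com/" "agent"
EOF

# The block runs once per line: print the requested file and a label.
labels=$(gawk '{ if ($9 == 200) print $7, "ok"; else print $7, "error" }' "$log")
echo "$labels"

rm "$log"
```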
Example 5: External referrers
• Example: Print referrers to html pages, excluding direct access (where referrer is "-")
• Note: $11 includes the surrounding double quotes, so to test whether it is "-" we need to escape each double quote as \"
• Code: (all on one line)
% gawk '{if ($7~"html$" && $11!="\"-\"") print $11}' d100.log | wc -l
Result: 7
Gawk statements: BEGIN, END
• To execute statements before reading the first line, we use the BEGIN keyword
• To execute statements after the last line is read, we use the END keyword
gawk 'BEGIN{stat1;…}{stat2;…}END{stat3;…}' file
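A minimal sketch of the three parts working together: BEGIN prints a header, the main block prints one row per line, and END prints a footer. The log lines are made up in the same layout (not d100.log), and the first line is a portability fallback.

```shell
# Portability assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

log=$(mktemp)
cat > "$log" <<'EOF'
host1 - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "agent"
host2 - - [16/Nov/2005:00:00:02 -0500] "GET /missing.html HTTP/1.1" 404 300 "http://www.google.com/search?q=x" "agent"
host3 - - [16/Nov/2005:00:00:03 -0500] "POST /form.htm HTTP/1.1" 200 500 "http://www.kdnuggets.com/" "agent"
host4 - - [16/Nov/2005:00:00:04 -0500] "GET / HTTP/1.1" 200 2000 "-" "agent"
EOF

# BEGIN runs before the first line, END after the last one.
# In END, NR still holds the number of the last record read.
report=$(gawk 'BEGIN { print "status size" } { print $9, $10 } END { print NR, "lines read" }' "$log")
echo "$report"

rm "$log"
```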
Example 6
• Sum all the object sizes for status code 200
% gawk '{if ($9 == 200) sumsize+=$10} END{print sumsize}' d100.log
Result: 396460
Note: we did not initialize sumsize; an uninitialized variable is treated as zero when used as a number (and as an empty string when used as text)
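The behavior of an uninitialized variable can be checked on a small input. The sketch below uses four made-up lines in the same layout (not d100.log), so the sum differs from the result above; the first line is a portability fallback.

```shell
# Portability assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

log=$(mktemp)
cat > "$log" <<'EOF'
host1 - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "agent"
host2 - - [16/Nov/2005:00:00:02 -0500] "GET /missing.html HTTP/1.1" 404 300 "http://www.google.com/search?q=x" "agent"
host3 - - [16/Nov/2005:00:00:03 -0500] "POST /form.htm HTTP/1.1" 200 500 "http://www.kdnuggets.com/" "agent"
host4 - - [16/Nov/2005:00:00:04 -0500] "GET / HTTP/1.1" 200 2000 "-" "agent"
EOF

# Sum of sizes for status 200: 1000 + 500 + 2000
sum200=$(gawk '{ if ($9 == 200) sumsize += $10 } END { print sumsize }' "$log")

# No line has status 500, so sumsize is never assigned and prints as empty text.
none=$(gawk '{ if ($9 == 500) sumsize += $10 } END { print sumsize }' "$log")

# Adding 0 forces a numeric context, so the same case prints 0.
zero=$(gawk '{ if ($9 == 500) sumsize += $10 } END { print sumsize + 0 }' "$log")

echo "sum200=$sum200 none=[$none] zero=$zero"
rm "$log"
```

Printing sumsize + 0 is a common way to make sure a number is printed even when no line matched.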