2: Web Server Log
An extract from KDnuggets web log:
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453 "http://www.yisou.com/search?p=data+mining&source=toolbar_yassist_button&pid=400740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif HTTP/1.1" 200 784 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
Web Server Log – An Example
Diagram labels: KDnuggets.com Server; http://www.kdnuggets.com/jobs/; Page contents; Web server log
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET … HTTP/1.1" 200
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /gps.html HTTP/1.1" 200
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200
…
Web (Server) Log – In Depth
A sample web log line:
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
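As a rough illustration of how the sample line breaks into the fields discussed on the following slides, here is a minimal Python sketch. The pattern `COMBINED_RE` and the helper `parse_line` are hypothetical names introduced for this example, not part of any log tool, and the pattern assumes that no field contains an escaped double quote.

```python
import re

# Hypothetical pattern for the log layout shown above. The last two
# fields (referrer, user agent) are optional so older logs still match.
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) (?P<name>\S+) (?P<login>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of raw string fields, or None if the line does not match."""
    m = COMBINED_RE.match(line)
    return m.groupdict() if m else None

sample = (
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 '
    '"http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"'
)
fields = parse_line(sample)
```

All values come back as strings; converting the status, size, and timestamp is left to later steps.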
Web log field: IP 152.152.98.11 IP address - can be converted to host name, such as xyz.example.com
Web log fields: Name, Login - The name of the remote user (usually omitted and replaced by a dash “-”) - Login of the remote user (also usually omitted and replaced by a dash “-”)
Web log field: Date/Time/TZ [16/Nov/2005:16:32:50 -0500] Date: DD/Mon/YYYY Time: HH:MM:SS Time Zone: (+|-)HH00 relative to GMT; -0500 is US EST
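The date/time/time-zone layout above maps directly onto a `strptime` format string. The sketch below uses only the Python standard library; `parse_log_time` is a hypothetical helper name, and `%b` assumes an English locale for month abbreviations such as "Nov".

```python
from datetime import datetime, timezone

def parse_log_time(field):
    """Parse '[DD/Mon/YYYY:HH:MM:SS +-HHMM]' into a timezone-aware datetime."""
    return datetime.strptime(field.strip("[]"), "%d/%b/%Y:%H:%M:%S %z")

ts = parse_log_time("[16/Nov/2005:16:32:50 -0500]")
utc = ts.astimezone(timezone.utc)  # same instant, expressed in GMT/UTC
```

Converting to UTC is one way to compare entries from servers logging in different time zones.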
Web log field: Request "GET /jobs/ HTTP/1.1" Method: GET, HEAD, POST, OPTIONS, … URL: relative to domain HTTP protocol: e.g. HTTP/1.0 or HTTP/1.1 Note: the request is recorded as sent, so it may contain errors, hacks, and any strange thing you can imagine
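Because the request is recorded as sent, a parser should not assume it always has three well-formed parts. A minimal sketch of a tolerant split follows; `split_request` is a hypothetical helper, and treating anything that is not "method URL HTTP/x" as malformed is an assumption made for this example.

```python
def split_request(request):
    """Split 'METHOD URL PROTOCOL'; return None for anything malformed."""
    parts = request.split()
    if len(parts) == 3 and parts[2].startswith("HTTP/"):
        return {"method": parts[0], "url": parts[1], "protocol": parts[2]}
    return None  # e.g. empty requests, binary junk, probing attempts

good = split_request("GET /jobs/ HTTP/1.1")
bad = split_request("-")
```

Keeping the malformed requests in a separate bucket, rather than dropping them, makes it possible to inspect them later.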
Web log field: Status code 200 Status (Response) code. Most important ones are: • 200 – OK (most frequent, hopefully) • 206 – partial content • 301 – permanently redirected (e.g. access to /courses is redirected to /courses/ ) • 302 – temporarily redirected • 304 – not modified • 404 – not found • …
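A status-code breakdown is a simple counting exercise. The sketch below tallies codes individually and by their leading digit; the list `codes` is made-up sample data, and `STATUS_CLASSES` and `status_class` are hypothetical names for this example.

```python
from collections import Counter

# Standard HTTP status classes keyed by the first digit of the code.
STATUS_CLASSES = {
    "2": "success",
    "3": "redirection",
    "4": "client error",
    "5": "server error",
}

def status_class(code):
    """Map a status code such as '304' to its class name."""
    return STATUS_CLASSES.get(str(code)[:1], "other")

codes = ["200", "200", "304", "404", "301", "206"]  # made-up sample data
by_code = Counter(codes)
by_class = Counter(status_class(c) for c in codes)
```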
Web log field: Object size 15140 size of the object returned to the client, in bytes Can also be “-” if status code is 304 (not modified)
Web log field: Referrer http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N URL the visitor came from (here it was a Google query for “salary for data mining”, 2nd page of results – starting from 10) Referrer can also be a static page, internal (same domain) or external (different domain), or “-” in case of a direct request (e.g. type-in, bookmark) Referrer analysis is very valuable
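A first step in referrer analysis is to classify each referrer and pull out the search terms when there are any. The sketch below uses the standard `urllib.parse` module; `referrer_info` is a hypothetical helper, and looking only at the `q` and `p` parameters is an assumption based on the two search referrers in the sample log (other search engines may use different parameter names).

```python
from urllib.parse import urlsplit, parse_qs

def referrer_info(referrer, own_host="www.kdnuggets.com"):
    """Classify a referrer as direct, internal, or external and extract search terms."""
    if referrer in ("", "-"):
        return {"kind": "direct", "host": None, "query": None}
    parts = urlsplit(referrer)
    params = parse_qs(parts.query)  # decodes '+' into spaces
    # Assumption: 'q' (Google) and 'p' (Yisou) carry the search terms.
    terms = params.get("q") or params.get("p")
    kind = "internal" if parts.hostname == own_host else "external"
    return {"kind": kind, "host": parts.hostname, "query": terms[0] if terms else None}

google = referrer_info(
    "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N"
)
internal = referrer_info("http://www.kdnuggets.com/")
direct = referrer_info("-")
```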
Web log field: User agent "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" User agent (browser) http://en.wikipedia.org/wiki/User_agent Almost all browser user agents start with Mozilla, for historic reasons. In many cases there is additional information: Browser type, version: MSIE 6.0 - Internet Explorer 6.0 OS: Windows NT 5.1 (Windows XP; SV1 indicates SP2) with .NET Framework 1.1 installed
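The extra information sits in the parenthesised, semicolon-separated part of the string. The sketch below is a deliberately naive split that only recognises the MSIE and Windows tokens from the sample; `agent_tokens` and `describe_agent` are hypothetical helpers, and real user-agent strings vary far more than this handles.

```python
import re

def agent_tokens(agent):
    """Return the semicolon-separated tokens inside the first parentheses."""
    m = re.search(r"\(([^)]*)\)", agent)
    return [t.strip() for t in m.group(1).split(";")] if m else []

def describe_agent(agent):
    """Pick out a browser token and an OS token, if present."""
    tokens = agent_tokens(agent)
    browser = next((t for t in tokens if t.startswith("MSIE ")), None)
    os_name = next((t for t in tokens if t.startswith("Windows ")), None)
    return {"browser": browser, "os": os_name}

ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
info = describe_agent(ua)
```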
Web Usage Mining • Basic • Totals • Simple • Request level breakdowns • Advanced • Visit level analysis • Target pages; Conversion analysis
Web Log Analysis Programs • Free • Analog, awstats, webalizer • Google analytics • Commercial • WebTrends, WebSideStory, … www.kdnuggets.com/software/web-mining.html
Web Usage Mining - Basic • Totals for each component • Hits – total number of requests • Files – number of GETs • Pages – number of HTML pages • Sites – unique IP addresses • Response codes • Kbytes – total Kbytes transferred • User Agents
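The totals above can be sketched in a few lines once the log is parsed into records. Several choices here are assumptions for illustration, not webalizer's exact rules: Files is counted as requests with status 200 (consistent with the deck's statement that Hits minus Files is the number of non-200 requests), and Pages is approximated as status-200 URLs ending in `.html`, `.htm`, or `/`. The first four records come from the sample log extract; the fifth (a 304) is made up to show the Hits/Files gap.

```python
def totals(records, page_suffixes=(".html", ".htm", "/")):
    """Compute basic monthly-style totals from parsed log records."""
    ok = [r for r in records if r["status"] == 200]
    return {
        "hits": len(records),                      # every request
        "files": len(ok),                          # assumption: status 200 only
        "pages": sum(                              # assumption: HTML-like URLs
            1 for r in ok if r["url"].split("?")[0].endswith(page_suffixes)
        ),
        "sites": len({r["ip"] for r in records}),  # unique IP addresses
        "kbytes": sum(r["size"] for r in records) / 1024,
    }

records = [
    {"ip": "152.152.98.11", "url": "/jobs/", "status": 200, "size": 15140},
    {"ip": "252.113.176.247", "url": "/", "status": 200, "size": 12453},
    {"ip": "252.113.176.247", "url": "/kdr.css", "status": 200, "size": 145},
    {"ip": "252.113.176.247", "url": "/images/KDnuggets_logo.gif", "status": 200, "size": 784},
    {"ip": "252.113.176.247", "url": "/kdr.css", "status": 304, "size": 0},  # made-up record
]
summary = totals(records)
```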
Example: KDnuggets.com Nov 2005 totals Monthly Statistics (from webalizer) Q: What is the meaning of the difference between Hits and Files?
Example: KDnuggets.com Nov 2005 totals, 2 Monthly stats for Files by Status Code Answer: the difference between Hits and Files is the number of requests with status code not 200.
Difference between Files and Pages • Q: What is the meaning of the difference between Files and Pages?
Difference between Files and Pages • A: the difference between Files and Pages is the number of non-HTML files (e.g. images, JavaScript, etc.) • In the November 2005 KDnuggets log, HTML files were about 1/3 of all requests • However, this data does not separate bot requests (which are heavily weighted towards HTML pages)
Notes: web log formats • We used a web log in the standard Apache format (the "combined" log format, which ends with the referrer and user agent fields) • Some old logs use a different format without the last 2 fields (the "common" log format), but these are now rare.