Logs Miner : Portal for Data Mining Web Access Logs

Presented by Andrew Wong at the 9th Annual IUG meeting, HKU Library, 8 December 2009.


Presentation Transcript


  1. Logs Miner: Portal for Data Mining Web Access Logs. Presented by Andrew Wong, 9th Annual IUG meeting at HKU Library, 8 December 2009.

  2. Agenda: Definitions; Motivations; Architecture of Logs Miner; Logs Miner user interface; Logs Miner reports; Benefits; Future development.

  3. Definitions. Web data mining: “application of data mining methodologies, techniques, and models to variety of data forms, structures, and usage patterns that comprise the World Wide Web” (Markov, Z. & Larose, D. T. 2007). Three scopes of Web data mining: Web content mining; Web structure mining; Web log mining.

  4. Definitions. Web log mining: discovers user access patterns from Web usage logs; also called web usage mining. Three processing stages: pre-processing; pattern discovery; pattern analysis.

  5. Purposes of web log mining: identify and classify different groups of patrons; understand the search patterns of different groups of patrons; adapt web user interfaces to suit users' needs; provide statistical data for collection management.

  6. Web logs. Web logs provide a large amount of information on user actions:
     lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
     lbnxyz.ust.hk - - [16/Nov/2009:12:03:27 +0800] "GET /catalog/?s=brandy&feed=rss HTTP/1.1" 304 - "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=10486796160015392754)"
     lbz222.ust.hk - - [16/Nov/2009:12:03:30 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
     lbz333.ust.hk - - [16/Nov/2009:12:03:33 +0800] "GET /catalog/?s=brandy HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
     lbz444ust.hk - - [16/Nov/2009:12:03:35 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"

  7. Web logs. A single entry:
     lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"

  8. Various types of web log
     Common Log Format: usually used by Apache web server logs and Apache Tomcat logs, e.g. Library web server, INNOPAC, SmartCAT, Institutional Repository. Example:
     lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
     Microsoft IIS Log Format: e.g. ILLiad, Class Registration Form. Example:
     2009-07-20 01:22:44 GET /ce/ - 66.249.71.201 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - 401 1891 0
     Fields included: remote host; date; time; HTTP request; status code; transfer volume (bytes); referrer; user agent.
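
The fields listed on this slide can be pulled out of an Apache-style entry with a regular expression. The following is a minimal Python sketch, not the parser that AWStats or Logs Miner uses; the names `COMBINED` and `parse_line` are made up for illustration, and it assumes the layout of the sample line above (which carries the referrer and user agent fields, i.e. the NCSA combined layout named on slide 12).

```python
import re
from datetime import datetime

# Pattern for entries laid out like the sample line on the slide.
# Assumption: host, identity, user, [timestamp], "request", status, bytes,
# "referrer", "user agent", separated by single spaces.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    rec = match.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # A "-" in the size field means no body was sent (e.g. a 304 response).
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    parts = rec["request"].split()
    rec["method"] = parts[0] if len(parts) >= 2 else None
    rec["url"] = parts[1] if len(parts) >= 2 else None
    return rec


sample = (
    'lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" '
    '200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) '
    'Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"'
)
record = parse_line(sample)
print(record["host"], record["url"], record["status"], record["bytes"])
```

IIS and streaming-server entries use different field orders, so each would need its own pattern; a line that does not fit the pattern simply yields `None` here.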

  9. Various types of web log
     Microsoft Streaming Server: e.g. streaming video. Example:
     143.89.160.133 2009-09-02 10:21:20 - /arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv 0 6 5 200 {3300AD50-2C39-46c0-AE0A-41B7139D4722} 11.0.5721.5251 en-US WMFSDK/11.0.5721.5251_WMPlayer/11.0.5721.5268 - wmplayer.exe 11.0.5721.5145 Windows_XP 5.1.0.2600 Pentium 3816 216613290 2830093 rtsp TCP - - - 2244972 2244972 398 398 0 0 0 0 0 0 1 1 100 143.89.105.168 lbms07.ust.hk 1 0 - 245 file://C:\wmhome\hkust\arc-open\oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv mms://stream.ust.hk/arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv - - 0
     Fields only for streaming server: video codec; audio codec; duration; client’s player.

  10. Web logfile analysis tools. Tools used to analyze web access logs: AccessWatch v1.33; Analog 6.0; Pwebstats; RefStats 1.2; INNOPAC Millennium Web Report – Search Statistics. Others: AWStats; Sawmill Analytics; Webalizer.

  11. Motivations: create a portal for storing and analyzing all the different web access logs; provide an interface for querying web access logs to generate dynamic statistical reports.

  12. AWStats as core: able to analyze different log formats, including Apache NCSA combined log files, IIS log files (W3C), and streaming server log files; feasible to analyze non-standardized log formats; works from the command line and from a browser as CGI; a web interface is built to query the data (Logs Miner); raw log data is pre-processed, with large-scale queries run in a cron job.

  13. AWStats as core: unlimited log file size; reports the number of unique visitors and visits; provides plug-ins to expand functionality; open source.

  14. Requirements for AWStats. Web log files: raw data must contain web log components such as client IP address, status code, HTTP request field…… Any OS platform that supports Perl.

  15. System configuration of Logs Miner: PC-level workstations; CentOS release 5.4; Apache web server 2.0; Perl v5.8.8; AWStats 6.9.

  16. Logs Miner architecture (diagram). Components: AWStats; Logs Miner UI. Input: raw logs from Library web server, INNOPAC, SmartCAT, Institutional repository, Digital archives ….. Outputs: AWStats reports; customized reports; access statistics. Stages: preprocessing; pattern discovery and pattern analysis.

  17. Logs Miner user interface. A portal for mining web access log data and retrieving information about the usage of multiple web applications. Built on top of AWStats, an open source log analyzer. Currently set up to analyze more than 20 library servers and applications, including Library Web Server, INNOPAC, Institutional Repository, Digital Archives, SmartCAT, ILLiad, Streaming Video Server, etc.

  18. Logs Miner user interface (URL: https://lbnx16.ust.hk/mining): includes 20+ applications; provides three types of report; filters by URL or host; generates yearly or monthly reports; query box supports regular expressions.

  19. Logs Miner user interface (URL: https://lbnx16.ust.hk/mining): tips for constructing a query string.

  20. Three types of reports: AWStats reports; access statistics, filtered by URL / host; customized reports.

  21. AWStats report

  22. AWStats report

  23. AWStats report. Reports the number of unique visitors and the number of visits. These numbers exclude visits from robots.
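
To make the robot exclusion concrete, here is a rough Python sketch that drops requests whose user agent looks like a robot and then counts unique hosts. AWStats relies on its own robot database; the keyword pattern `ROBOT_HINT` below is only an illustrative assumption, and the sample requests reuse hosts and user agents from the log excerpts on slides 6 and 8.

```python
import re

# Assumption: a crude keyword heuristic standing in for a real robot list.
ROBOT_HINT = re.compile(r"bot|crawler|spider|feedfetcher", re.IGNORECASE)


def is_robot(agent):
    """Guess whether a user agent string belongs to a robot."""
    return bool(ROBOT_HINT.search(agent))


# (host, user agent) pairs taken from the sample log lines.
requests = [
    ("lbz000.ust.hk", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) "
                      "Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"),
    ("lbnxyz.ust.hk", "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; "
                      "1 subscribers; feed-id=10486796160015392754)"),
    ("lbz333.ust.hk", "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) "
                      "Gecko/20091102 Firefox/3.5.5"),
    ("lbz333.ust.hk", "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) "
                      "Gecko/20091102 Firefox/3.5.5"),
    ("66.249.71.201", "Mozilla/5.0+(compatible;+Googlebot/2.1;"
                      "++http://www.google.com/bot.html)"),
]

human_requests = [(host, agent) for host, agent in requests if not is_robot(agent)]
unique_visitors = len({host for host, _ in human_requests})
print(len(human_requests), "requests from", unique_visitors, "unique visitors")
```

A keyword heuristic like this misses robots that present a browser-like user agent, which is one reason a maintained robot list is preferable in practice.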

  24. AWStats report

  25. AWStats report Created by plugins: geoip

  26. AWStats report. Work in progress: HKUST's iPhone application for receiving Library information and searching SmartCAT.

  27. Access statistics report. Query box supports regular expressions.

  28. Access statistics report – filtered by URL

  29. Access statistics report – filtered by Host
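
The idea behind an access statistics query (a regular expression applied to either the URL or the host, with results grouped by month) can be sketched in Python as follows. This is an illustration under assumptions, not the Logs Miner implementation: the function `monthly_counts` and the record layout are hypothetical, and the sample records reuse values from the log excerpt on slide 6.

```python
import re
from collections import Counter
from datetime import datetime


def monthly_counts(records, pattern, field="url"):
    """Count records whose `field` matches `pattern`, grouped by YYYY-MM."""
    rx = re.compile(pattern)
    counts = Counter()
    for rec in records:
        if rx.search(rec[field]):
            counts[rec["time"].strftime("%Y-%m")] += 1
    return dict(counts)


# Already-parsed entries (hypothetical record layout).
records = [
    {"host": "lbz000.ust.hk", "url": "/catalog/",
     "time": datetime(2009, 11, 16, 12, 3, 26)},
    {"host": "lbnxyz.ust.hk", "url": "/catalog/?s=brandy&feed=rss",
     "time": datetime(2009, 11, 16, 12, 3, 27)},
    {"host": "lbz222.ust.hk", "url": "/stream/xml/stream.xml",
     "time": datetime(2009, 11, 16, 12, 3, 30)},
    {"host": "lbz333.ust.hk", "url": "/catalog/?s=brandy",
     "time": datetime(2009, 11, 16, 12, 3, 33)},
]

# Filter by URL: everything under /catalog/.
by_url = monthly_counts(records, r"^/catalog/")
# Filter by host: hosts named lbz<digits>.ust.hk.
by_host = monthly_counts(records, r"^lbz\d+\.ust\.hk$", field="host")
print(by_url, by_host)
```

Grouping on `%Y` instead of `%Y-%m` would give the yearly variant of the same report.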

  30. Example (1) – Usage of a database

  31. Example (1) – Usage of a database

  32. Example (1) – Usage of a database

  33. Example (2) – Usage of a document of HKUST Institutional Repository

  34. Example (2) – Usage of a document of HKUST Institutional Repository

  35. Example (2) – Usage of a document of HKUST Institutional Repository

  36. Example (3) – Access by particular group. Number of accesses to the Library web page from Library public workstations.

  37. Example (3) – Access by particular group

  38. Example (3) – Access by particular group

  39. Example (4) – Exclude particular group. Number of accesses to Digital Archives from the HKUST campus, excluding HKUST Library staff.

  40. Example (4) – Exclude particular group

  41. Example (4) – Exclude particular group
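
A query of the kind described in Example (4), including one group of hosts while excluding another, might look like the Python sketch below. The host naming scheme is an assumption made purely for illustration: campus machines are taken to end in `.ust.hk`, and `libstaff` is a hypothetical prefix for Library staff machines. The actual query strings used in Logs Miner are not given in the transcript.

```python
import re

# Assumption: campus hosts end in .ust.hk.
INCLUDE = re.compile(r"\.ust\.hk$")
# Hypothetical naming convention for Library staff workstations.
EXCLUDE = re.compile(r"^libstaff\d*\.")


def keep(host):
    """True for campus hosts that are not staff workstations."""
    return bool(INCLUDE.search(host)) and not EXCLUDE.search(host)


# The same rule expressed as a single pattern with a negative lookahead,
# which is how it could be typed into a regex-only query box.
SINGLE = re.compile(r"^(?!libstaff\d*\.).*\.ust\.hk$")

hosts = ["lbz000.ust.hk", "libstaff01.ust.hk", "example.com", "lbz333.ust.hk"]
selected = [h for h in hosts if keep(h)]
selected_single = [h for h in hosts if SINGLE.match(h)]
print(selected)
```

Both forms select the same hosts; the two-pattern version is easier to read, while the lookahead version fits interfaces that accept only one expression.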

  42. Example (5) – Number of virtual visits. A virtual visit is defined as a user’s request on the library’s website in order to use one of the services provided by the library. One Key Performance Indicator: virtual visits per capita. Includes the main web applications: Library web server; INNOPAC; SmartCAT (next-generation catalog); HKUST Institutional Repository; Digital Archives; HKUST ILLiad.

  43. Example (5) – Number of virtual visits. Reports the number of visits. A visit: a unique IP accesses a page and then requests other pages, with less than an hour between any two consecutive requests.

  44. Example (5) – Number of virtual visits (diagram): successive requests, each within an hour of the previous one, count as one visit.

  45. Example (5) – Number of virtual visits
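
The visit definition from slide 43 can be turned into a small counting routine. The Python sketch below is one possible reading of that rule, not AWStats code: a host's request opens a new visit when it is the host's first request or when an hour or more has passed since its previous request. Treating a gap of exactly one hour as a new visit is an assumption, the function name `count_visits` is hypothetical, and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta

# Maximum idle time inside a single visit, per the one-hour rule on the slide.
VISIT_GAP = timedelta(hours=1)


def count_visits(requests):
    """Count visits in an iterable of (host, datetime) pairs."""
    last_seen = {}
    visits = 0
    for host, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(host)
        # First request from this host, or idle for an hour or more: new visit.
        if previous is None or when - previous >= VISIT_GAP:
            visits += 1
        last_seen[host] = when
    return visits


sample = [
    ("lbz000.ust.hk", datetime(2009, 11, 16, 12, 3)),
    ("lbz000.ust.hk", datetime(2009, 11, 16, 12, 40)),  # 37 min later: same visit
    ("lbz000.ust.hk", datetime(2009, 11, 16, 13, 30)),  # 50 min later: same visit
    ("lbz000.ust.hk", datetime(2009, 11, 16, 15, 0)),   # 90 min later: new visit
    ("lbz333.ust.hk", datetime(2009, 11, 16, 12, 5)),   # different host: own visit
]
total_visits = count_visits(sample)
print(total_visits)
```

Note that the first three requests from `lbz000.ust.hk` span almost 90 minutes yet form one visit, because the rule looks at the gap between consecutive requests rather than at the time since the visit began.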

  46. Customized reports. Built-in customized reports provide a full picture of page visit figures for similar pages. From the HKUST Library Web Server (http://library.ust.hk): Sitemap; Databases List; Course Guides; Database Guides; Subject Guides.

  47. Customized reports. Subset: Sitemap; Databases List; Course Guides; Database Guides; Subject Guides.

  48. Customized reports HKUST library web sitemap

  49. Customized reports

  50. Customized reports. Add more customized report templates: E-Journal list; Library Forms; ……
