Logs Miner : Portal for Data Mining Web Access Logs. Presented by Andrew Wong 9th Annual IUG meeting at HKU Library 8 December 2009. Agenda. Definitions Motivations Architecture of Logs Miner Logs Miner User Interface Logs Miner reports Benefits Future development. Definitions.
Logs Miner: Portal for Data Mining Web Access Logs
Presented by Andrew Wong
9th Annual IUG meeting at HKU Library
8 December 2009
Agenda
• Definitions
• Motivations
• Architecture of Logs Miner
• Logs Miner User Interface
• Logs Miner reports
• Benefits
• Future development
Definitions
Web data mining -- “application of data mining methodologies, techniques, and models to variety of data forms, structures, and usage patterns that comprise the World Wide Web” (Markov, Z. & Larose, D. T. 2007)
Three scopes of Web data mining:
• Web content mining
• Web structure mining
• Web log mining
Definitions
Web log mining
• Discovers user access patterns from Web usage logs
• Is also called web usage mining
• Three processing stages: pre-processing, pattern discovery, pattern analysis
Purposes of web log mining
• Identify and classify different groups of patrons
• Understand the search patterns of different groups of patrons
• Adapt web user interfaces to suit users' needs
• Provide statistical data for collection management
Web logs
• Web logs record a large amount of information about user actions
lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
lbnxyz.ust.hk - - [16/Nov/2009:12:03:27 +0800] "GET /catalog/?s=brandy&feed=rss HTTP/1.1" 304 - "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=10486796160015392754)"
lbz222.ust.hk - - [16/Nov/2009:12:03:30 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
lbz333.ust.hk - - [16/Nov/2009:12:03:33 +0800] "GET /catalog/?s=brandy HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
lbz444.ust.hk - - [16/Nov/2009:12:03:35 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
Web logs
lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
Various types of web log
Common Log Format: usually used by Apache web server logs and Apache Tomcat logs, e.g. Library web server, INNOPAC, SmartCAT, Institutional Repository (the sample below also carries the referrer and user agent fields of the combined variant)
lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
Microsoft IIS Log Format, e.g. ILLiad, Class Registration Form
2009-07-20 01:22:44 GET /ce/ - 66.249.71.201 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - 401 1891 0
• Both include:
• Remote host field
• Date field
• Time field
• HTTP request field
• Status code field
• Transfer volume (bytes)
• Referrer field
• User agent field
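The field list above can be made concrete with a small parser. The following is a minimal Python sketch, not part of Logs Miner or AWStats: the regular expression and the names `LOG_PATTERN` and `parse_line` are illustrative assumptions, written only to match the Apache-style sample line shown on this slide.

```python
import re
from datetime import datetime

# Assumed pattern for the Apache-style sample line on this slide
# (host, ident, user, timestamp, request, status, bytes, referrer, agent).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the bytes field means no body was sent (e.g. status 304).
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = ('lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] '
          '"GET /catalog/ HTTP/1.1" 200 20283 "-" '
          '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) '
          'Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"')
record = parse_line(sample)
print(record["host"], record["status"], record["size"], record["request"])
```

The IIS sample orders its fields differently, so it would need its own pattern.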
Various types of web log
Microsoft Streaming Server, e.g. Streaming video
143.89.160.133 2009-09-02 10:21:20 - /arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv 0 6 5 200 {3300AD50-2C39-46c0-AE0A-41B7139D4722} 11.0.5721.5251 en-US WMFSDK/11.0.5721.5251_WMPlayer/11.0.5721.5268 - wmplayer.exe 11.0.5721.5145 Windows_XP 5.1.0.2600 Pentium 3816 216613290 2830093 rtsp TCP - - - 2244972 2244972 398 398 0 0 0 0 0 0 1 1 100 143.89.105.168 lbms07.ust.hk 1 0 - 245 file://C:\wmhome\hkust\arc-open\oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv mms://stream.ust.hk/arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv - - 0
• Fields only for streaming server:
• Video codec
• Audio codec
• Duration
• Client’s player
Web Logfile analysis tools Tools used to analyze web access logs AccessWatch v1.33 Analog 6.0 Pwebstats RefStats 1.2 INNOPAC Millennium Web Report – Search Statistics Others: AWStats Sawmill Analytics Webalizer
Motivations
• Create a portal for storing and analyzing all the different web access logs
• Provide an interface for querying web access logs to generate dynamic statistical reports
AWStats as core
• Able to analyze different log formats, including Apache NCSA combined log files, IIS log files (W3C), and streaming server log files
• Can also analyze non-standardized log formats
• Works from the command line and from a browser as a CGI script
• Build a web interface to query the data (Logs Miner)
• Pre-process the raw log data, running large-scale queries in a cron job
AWStats as core
• Unlimited log file size
• Reports the number of unique visitors and the number of visits
• Supports plugins that extend its functionality
• Open source
Requirements for AWStats
• Web log files: the raw data must contain web log components such as client IP address, status code, HTTP request field……
• Any OS platform that supports Perl
System configuration of Logs Miner: PC-level workstations CentOS release 5.4 Apache web server 2.0 PERL v.5.8.8 AWStats 6.9
Logs Miner architecture
[Diagram: raw logs (Library web server, INNOPAC, SmartCAT, Institutional repository, Digital archives …..) are preprocessed by AWStats; the Logs Miner UI covers pattern discovery and pattern analysis and delivers AWStats reports, customized reports, and access statistics]
Logs Miner user interface
• A portal for mining web access log data and retrieving information about the usage of multiple web applications
• Built on top of AWStats, an open source log analyzer
• Currently set up to analyze more than 20 library servers and applications, including Library Web Server, INNOPAC, Institutional Repository, Digital Archives, SmartCAT, ILLiad, Streaming Video Server, etc.
Logs Miner user interface
URL: https://lbnx16.ust.hk/mining
• Includes 20+ applications
• Provides three types of report
• Filters by URL or host
• Generates yearly or monthly reports
• Query box that supports regular expressions
Logs Miner user interface
URL: https://lbnx16.ust.hk/mining
Tips for constructing a query string
Three types of reports AWStats reports Access statistics - filtered by URL / Host Customized reports
AWStats report
• Reports the number of
• unique visitors
• visits
• These numbers exclude visits from robots
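The idea of leaving robots out of the visitor count can be sketched as follows. This Python fragment is an illustration only: AWStats relies on its own robot detection, and the short marker list and the function names here (`ROBOT_MARKERS`, `is_robot`, `unique_human_visitors`) are hypothetical stand-ins, as are the simplified hits.

```python
# Hypothetical substrings treated as signs of a robot user agent.
# This is a stand-in list, not the one AWStats uses.
ROBOT_MARKERS = ("bot", "crawler", "spider", "feedfetcher")

def is_robot(agent):
    """True if the user agent string contains any robot marker."""
    agent = agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)

def unique_human_visitors(hits):
    """Set of hosts with at least one non-robot hit; hits are (host, agent)."""
    return {host for host, agent in hits if not is_robot(agent)}

# Simplified (host, user agent) pairs modeled on the sample log lines.
hits = [
    ("lbz000.ust.hk", "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.5.5"),
    ("lbnxyz.ust.hk", "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)"),
    ("66.249.71.201", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
    ("lbz333.ust.hk", "Mozilla/5.0 (Windows; U; Windows NT 6.0) Firefox/3.5.5"),
    ("lbz333.ust.hk", "Mozilla/5.0 (Windows; U; Windows NT 6.0) Firefox/3.5.5"),
]
visitors = unique_human_visitors(hits)
print(len(visitors), sorted(visitors))
```

Repeated hits from the same host count once, which is why unique visitors and visits are reported as separate figures.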
AWStats report Created by plugins: geoip
AWStats report Work in progress HKUST's iPhone Application for receiving Library information and searching on SmartCAT
Access statistics report
Query box that supports regular expressions
Example (2) – Usage of a document of HKUST Institutional Repository
Example (3) – Access by particular group
Number of accesses to the Library web page from Library public workstations
Example (4) – Exclude particular group
Number of accesses to Digital Archives from the HKUST campus, excluding HKUST Library staff
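Examples (3) and (4) both come down to matching hosts and URLs against regular expressions. The sketch below shows one way such a filter could look in Python. It is not the Logs Miner query code, and every pattern and host name in it (`^/archives/`, `lbstaff\d+`, `pc101.ust.hk`, and so on) is a made-up assumption for illustration.

```python
import re

def count_accesses(hits, url_pattern, include_host, exclude_host=None):
    """Count (host, url) hits whose URL and host match the include
    patterns and whose host does not match the exclude pattern."""
    url_re = re.compile(url_pattern)
    include_re = re.compile(include_host)
    exclude_re = re.compile(exclude_host) if exclude_host else None
    count = 0
    for host, url in hits:
        if not url_re.search(url):
            continue
        if not include_re.search(host):
            continue
        if exclude_re is not None and exclude_re.search(host):
            continue
        count += 1
    return count

# Hypothetical hits: (remote host, requested URL).
hits = [
    ("pc101.ust.hk", "/archives/item/1"),
    ("lbstaff07.ust.hk", "/archives/item/1"),  # assumed staff workstation name
    ("pc102.ust.hk", "/catalog/"),             # a different application
    ("example.com", "/archives/item/2"),       # off campus
    ("pc103.ust.hk", "/archives/"),
]

campus_all = count_accesses(hits, r"^/archives/", r"\.ust\.hk$")
campus_no_staff = count_accesses(hits, r"^/archives/", r"\.ust\.hk$",
                                 exclude_host=r"^lbstaff\d+\.")
print(campus_all, campus_no_staff)
```

Keeping the include and exclude patterns separate avoids negative lookahead in a single expression, which is easier to get right in a query box.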
Example (5) – Number of virtual visits
A virtual visit is defined as a user’s request on the library’s website in order to use one of the services provided by the library.
One Key Performance Indicator: virtual visits per capita
Includes the main web applications:
• Library web server
• INNOPAC
• SmartCAT (next-generation catalog)
• HKUST Institutional Repository
• Digital Archives
• HKUST ILLiad
Example (5) – Number of virtual visits
• Reports the number of visits
• A visit is a series of requests from one unique IP address. For example, if a unique IP accesses a page and then requests three other pages without an hour passing between any of the requests, all four requests count as one visit.
Example (5) – Number of virtual visits
[Diagram: successive requests, each within an hour of the previous one, count as a single visit]
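The one-hour rule can be expressed as a short counting routine. This Python sketch assumes the rule exactly as stated on the slides (a gap of an hour or more between two requests from the same IP starts a new visit); the names `VISIT_TIMEOUT` and `count_visits` and the sample addresses are hypothetical, and AWStats' own implementation may differ in detail.

```python
from datetime import datetime, timedelta

# Assumed timeout, taken from the one-hour rule on the slide.
VISIT_TIMEOUT = timedelta(hours=1)

def count_visits(requests):
    """Count visits in a list of (ip, timestamp) requests.

    A request opens a new visit when the same IP has not been seen
    before, or was last seen an hour or more earlier.
    """
    last_seen = {}
    visits = 0
    for ip, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(ip)
        if previous is None or when - previous >= VISIT_TIMEOUT:
            visits += 1
        last_seen[ip] = when
    return visits

day = datetime(2009, 11, 16)
requests = [
    ("10.0.0.1", day.replace(hour=12, minute=0)),
    ("10.0.0.1", day.replace(hour=12, minute=30)),  # 30 min gap, same visit
    ("10.0.0.1", day.replace(hour=13, minute=10)),  # 40 min gap, same visit
    ("10.0.0.1", day.replace(hour=13, minute=50)),  # 40 min gap, same visit
    ("10.0.0.1", day.replace(hour=15, minute=30)),  # 100 min gap, new visit
    ("10.0.0.2", day.replace(hour=12, minute=5)),   # different IP, new visit
]
total_visits = count_visits(requests)
print(total_visits)
```

Note that the first four requests span almost two hours yet form one visit, because the rule looks at the gap between consecutive requests and not at the total duration.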
Customized reports Built-in customized reports to provide a full picture of page visit figures of similar pages From HKUST Library Web Server (http://library.ust.hk) Sitemap Databases List Course Guides Database Guides Subject Guides
Customized reports • SubSet: • Sitemap • Databases List • Course Guides • Database Guides • Subject Guides
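Grouping similar pages into subsets amounts to mapping each requested URL to a category and tallying the hits. The Python sketch below illustrates this under loud assumptions: the path prefixes in `SUBSETS` are invented for the example and are not the real paths on the HKUST Library web server, and `subset_counts` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical path prefixes for each subset named on the slide.
SUBSETS = {
    "Sitemap": "/sitemap",
    "Databases List": "/databases",
    "Course Guides": "/guides/course",
    "Database Guides": "/guides/database",
    "Subject Guides": "/guides/subject",
}

def subset_counts(urls):
    """Tally page requests per subset; URLs matching no subset are skipped."""
    counts = Counter({name: 0 for name in SUBSETS})
    for url in urls:
        for name, prefix in SUBSETS.items():
            if url.startswith(prefix):
                counts[name] += 1
                break
    return counts

# Made-up requests for illustration.
urls = [
    "/sitemap.html",
    "/databases/a-z",
    "/guides/course/phys101",
    "/guides/course/chem",
    "/guides/subject/history",
    "/catalog/",
]
report = subset_counts(urls)
for name in SUBSETS:
    print(name, report[name])
```

New report templates could then be added by extending the mapping with more entries.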
Customized reports HKUST library web sitemap
Customized reports
Add more customized report templates:
• E-Journal list
• Library Forms
• ……