Design and Implementation of a Web Log Preprocessing System Supporting Path Completion
Batchimeg, AI lab., 2005.04.19
Outline
• Introduction
• Background
• Related work
• Proposed System
• Experiment and Result
• Conclusion and Future work
Introduction
• My research area: Web log preprocessing
• Web log mining process: Web site visitors (viewing news, e-mail, download, shopping, auction) → Web log data saved in the Web server → Preprocessing → DB → Pattern discovery → Pattern analysis → Data analysis
• Logged data: IP, OS, agent, time, URL, refer page, date, cookie, method, status, UserID, bytes, …
• Analysis tools: visualization tools, knowledge query, intelligent agents
Background (1/4)
• Log format:
- Client IP: 210.126.19.93
- Date: 23/Jan/2005
- Access time: 13:37:12
- Method: GET (to request a page), POST, HEAD (send to server)
- Protocol: HTTP/1.1
- Status code: 200 (success); 301 (redirect); 401, 500 (errors)
- Size of file: 2705
- Agent type: Mozilla/4.0
- Operating system: Windows NT
• Example: a visitor (210.126.19.93) viewed a news article and then sent it to a friend.
http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225 → http://www.olloo.mn/modules.php?name=News&file=friend&op=FriendSend&sid=8225
• The corresponding log record:
210.126.19.93 - - [23/Jan/2005:13:37:12 -0800] "GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" 200 2705 "http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
• … 285014 records in the log
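The fields listed above can be pulled out of a log record with a regular expression. The following is a minimal sketch, assuming the server writes the common "combined" log format shown in the example record; the names `LOG_PATTERN` and `parse_line` are illustrative and do not come from the original system.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "method url protocol" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the textual fields that later steps need as typed values.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# The example record from the slide, with plain ASCII quotes.
SAMPLE_LINE = (
    '210.126.19.93 - - [23/Jan/2005:13:37:12 -0800] '
    '"GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" '
    '200 2705 '
    '"http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)
```

Calling `parse_line(SAMPLE_LINE)` yields a record whose `ip`, `referrer` and `agent` fields feed the user and session identification steps.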
Background (2/4): User identification, session identification
• User identification means identifying each user who accesses the Web site. Users are identified by IP + browser (or UserID + IP + OS, or a cookie).
• Session identification finds each user's access pattern and frequently accessed paths.
• Preprocessing steps: Log cleaning → User identification → Session identification → Path completion → Formatting
• After user identification (IP | Browser | Visited pages):
202.131.3.100 | Mozilla/5.0 (Windows NT) | A,B,C,D,F,A,L
202.131.3.100 | Mozilla/4.0 (Win2000) | A,B,G,L
210.126.19.93 | Mozilla/4.0 (Windows NT) | N,O
• After session identification (IP | Browser | Session):
202.131.3.100 | Mozilla/5.0 (Windows NT) | A,B,C,D,F
202.131.3.100 | Mozilla/5.0 (Windows NT) | A,L
202.131.3.100 | Mozilla/4.0 (Win2000) | A,B,G,L
210.126.19.93 | Mozilla/5.0 (Windows NT) | N,O
Background (3/4): Server log and caching
• Missed page views at the server: if the client had to request every Web page from the server, browsing would be slower.
• The solution to this problem is caching: clients and proxy servers save local copies of pages.
• Pages revisited with "back" and "forward" are answered from the cache, so these requests are never logged by the server.
[Figure: client, cache and server exchanging requests for pages P3 to P6; requests answered by the cache never reach the server log]
Background (4/4): Path completion
• Not all requested pages are recorded in the Web log, because of caching.
• Preprocessing steps: Log cleaning → User identification → Session identification → Path completion → Formatting
• Path completion uses the site's topological structure (Before → After):
A,B,C,D,F → A,B,C,D,C,B,F
A,L → A,L
A,B,G,I → A,B,A,G,I
N,O → N,O
[Figure: topological structure of the site, pages A.html to Q.html]
Related work
[Table: comparison of related approaches; O = used, X = not used]
Proposed System (1/7): Preprocessing
• Construct the Web site's topological structure from the Web log data on the server (find the hyperlink relations between Web pages)
• Data cleaning (eliminate irrelevant information)
• User identification and session identification (identify each user and find each user's access pattern)
• Path completion
• User grouping
Why preprocessing?
• Preprocessing can take 60-80% of the time spent analyzing the data.
• An incomplete preprocessing task can easily result in invalid patterns and wrong conclusions.
Proposed System (2/7): Make the site topological structure
• The topological structure helps with data preprocessing and analysis:
- user identification
- path completion
Goal of the proposed system
• Discover similar user groups, relevant page groups and frequently accessed paths
Proposed System (3/7): Algorithm for making the topological structure
Flow chart:
• Begin. Repeat while the end of the log file has not been reached.
• Find "http" data in the record. If found, enter the URL into URL_Queue.
• While URL_Queue is not empty: get the head of the queue, define its depth, and add the link to Topo_Str_DB.
• Is there another record? If yes, continue with the next record; if no, end.
Proposed System (4/7): Make the topological structure
• Input: URL path and link
• Output: complete sitemap (tree) with link, path, depth and referrers
• Depth of each page:
0: Index.html (A)
1: Sport.html, L.html
2: Sport/News/Mongolia.html, Sport/Team/
3: Sport/Team/football.html
• Queue (depth and page):
0. Index.html (A)
1. L.html (referrer)
2. Sport/Team/football.html
2. Sport/News/Mongolia.html
1. Sport.html
2. Sport/Team/
3. Sport/Team/football.html
2. Sport/Advice/
. . .
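One way to picture this step is to collect (referrer, URL) pairs from the log and then assign each page a depth from the root page. The sketch below is an illustration only: it assumes depth means the shortest link distance from the root, it uses a simplified, hypothetical subset of the pages named on the slide, and the function names `build_topology` and `page_depths` are not from the original system.

```python
from collections import defaultdict, deque

def build_topology(pairs):
    """Map each referrer page to the set of pages it links to."""
    links = defaultdict(set)
    for referrer, url in pairs:
        # Skip records without a referrer and self-references.
        if referrer and referrer != url:
            links[referrer].add(url)
    return links

def page_depths(links, root):
    """Breadth-first walk from the root; depth = shortest link distance."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for child in sorted(links.get(page, ())):
            if child not in depths:
                depths[child] = depths[page] + 1
                queue.append(child)
    return depths

# Hypothetical (referrer, URL) pairs, simplified from the slide's example.
EXAMPLE_PAIRS = [
    ("Index.html", "L.html"),
    ("Index.html", "Sport.html"),
    ("L.html", "Sport/News/Mongolia.html"),
    ("Sport.html", "Sport/Team/"),
    ("Sport/Team/", "Sport/Team/football.html"),
]
```

With these pairs, `page_depths(build_topology(EXAMPLE_PAIRS), "Index.html")` places `Sport/Team/football.html` at depth 3.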
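Proposed System (5/7): User identification
Flow chart of the user identification algorithm (for similar user groups):
• Begin. Repeat while the end of the log DB has not been reached.
• Is the IP not yet in IPSet? If so, save the IP, agent and OS, assign the record to the User Set and increase the user counter.
• If the IP is already in IPSet, check whether the current IP's agent and OS are the same as the saved ones.
• Is there another record? If yes, continue with the next record; if no, end.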
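The idea behind this flow chart, treating each distinct combination of IP and agent as one user, can be sketched as a simple grouping. This is a minimal illustration, assuming the OS is part of the agent string as in the example log record; `identify_users` is a hypothetical name, and the records are the example values from the Background (2/4) slide.

```python
from collections import defaultdict

def identify_users(records):
    """Group visited pages by (IP, agent); each key counts as one user."""
    users = defaultdict(list)
    for ip, agent, page in records:
        users[(ip, agent)].append(page)
    return dict(users)

# Example records (IP, agent, page) in log order.
EXAMPLE_RECORDS = (
    [("202.131.3.100", "Mozilla/5.0 (Windows NT)", p) for p in "ABCDF"]
    + [("202.131.3.100", "Mozilla/4.0 (Win2000)", p) for p in "ABGL"]
    + [("202.131.3.100", "Mozilla/5.0 (Windows NT)", p) for p in "AL"]
    + [("210.126.19.93", "Mozilla/4.0 (Windows NT)", p) for p in "NO"]
)
```

Here `identify_users(EXAMPLE_RECORDS)` returns three users, even though only two distinct IP addresses appear in the log.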
Proposed System (6/7): Session identification
Flow chart of the session identification algorithm:
• Begin. Repeat while the end of the log DB has not been reached.
• Is the IP not in the User Set?
• Is the refer page empty? If yes, start a new session.
• Is the time taken > 25.5? If yes, start a new session; if no, append the page to the current session.
• Is there another record? If yes, continue with the next record; if no, go to path completion.
• End.
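The two session-splitting tests in the flow chart (empty refer page, time gap above 25.5) can be sketched as follows. This is an illustration under stated assumptions: the slide gives no unit for 25.5, so minutes are assumed here; an empty referrer is taken to be either an empty string or "-"; and `split_sessions` is a hypothetical name.

```python
from datetime import datetime, timedelta

# Assumption: the 25.5 threshold on the slide is in minutes.
SESSION_TIMEOUT = timedelta(minutes=25.5)

def split_sessions(requests, timeout=SESSION_TIMEOUT):
    """Split one user's requests into sessions.

    requests: list of (timestamp, page, referrer) tuples sorted by time.
    """
    sessions = []
    last_time = None
    for timestamp, page, referrer in requests:
        starts_new = (
            last_time is None                     # first request of this user
            or referrer in ("", "-")              # empty refer page
            or timestamp - last_time > timeout    # gap longer than the timeout
        )
        if starts_new:
            sessions.append([])
        sessions[-1].append(page)
        last_time = timestamp
    return sessions
```

For example, requests for A and B one minute apart, followed 39 minutes later by A and L, give the two sessions A,B and A,L.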
Proposed System (7/7): Path completion
Flow chart of the path completion algorithm:
• Begin. Repeat while the end of the session set has not been reached.
• Does a page in the session contain (link to) the next page in that session? If yes, check the next page.
• If no, search for that page in the site map and complete the path.
• End.
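A common way to realize this step is to assume that a missing link means the visitor pressed "back" until reaching a page that does link to the next request, and to re-insert those cached page views. The sketch below follows that assumption rather than the original implementation; `complete_path` is a hypothetical name, and `EXAMPLE_LINKS` is an assumed link map chosen to be consistent with the before/after examples on the Background (4/4) slide.

```python
def complete_path(session, links):
    """Insert the backtracked pages that the server never logged.

    session: list of pages in request order.
    links:   dict mapping a page to the set of pages it links to.
    """
    if not session:
        return []
    completed = [session[0]]
    history = [session[0]]  # pages reachable with the "back" button
    for page in session[1:]:
        if page not in links.get(history[-1], set()):
            # Walk back through the history until a page links to `page`.
            for depth in range(len(history) - 2, -1, -1):
                if page in links.get(history[depth], set()):
                    # Pages revisited from the cache, most recent first.
                    completed.extend(reversed(history[depth:-1]))
                    del history[depth + 1:]
                    break
        history.append(page)
        completed.append(page)
    return completed

# Assumed hyperlink structure (not taken from the original figure).
EXAMPLE_LINKS = {
    "A": {"B", "G", "L"},
    "B": {"C", "F"},
    "C": {"D"},
    "G": {"I"},
    "N": {"O"},
}
```

If no earlier page links to the next request, the sketch leaves the gap as it is instead of guessing a path.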
Experiment (1/4): www.olloo.mn
• Raw log data
• URLs in the Web server log
Experiment (2/4): Topological structure
Experiment (3/4): Data cleaning
Experiment (4/4)
Result
• Path completion
• User group
These results help to discover similar user groups, relevant page groups and frequently accessed paths in Web usage mining (WUM).
Interface of Path Completion Preprocessing System (PCPS)
• Start a new project.
• Give the project name and folder.
• Add the log file to the project.
• Choose the log file to add.
• The system asks whether to remove the image files, separating the files that should be analyzed from the files that should be cleaned.
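The cleaning choice offered by this dialog, dropping image requests and keeping page requests, can be sketched as a filter on the requested URL. This is only an illustration: the extension list is an assumption, not the system's actual configuration, and `is_page_request` and `clean_log` are hypothetical names.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed set of image extensions to clean from the log.
IGNORED_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico"}

def is_page_request(url):
    """True if the URL should be kept for analysis."""
    path = urlsplit(url).path  # drop the query string before checking
    extension = posixpath.splitext(path)[1].lower()
    return extension not in IGNORED_EXTENSIONS

def clean_log(urls):
    """Keep only the requests that are not image files."""
    return [url for url in urls if is_page_request(url)]
```

A request such as `/modules.php?name=News&file=article` is kept, while a request for a `.gif` file is removed.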
Interface of Path Completion Preprocessing System (PCPS), continued
• Cleaned log and information: the pages and files selected for analysis
• Topological structure
• Browser
• System
Comparison of other preprocessing approaches with the Proposed System (O: used, X: not used)
Conclusion
• My work focuses on the preprocessing stage of Web log mining and improves pattern discovery.
• 3061 - 2812 = 249 users neglected.
• This paper presented a new approach and a practicable algorithm.
• This approach can achieve better precision than some existing approaches.