Data and Web Mining
825368 Paolo Gobbo
Smart Miner: A New Framework for Mining Large Scale Web Usage Data
Bayir, Toroslu, Cosar, Fidan
Data Mining on Web
Web Mining: discover and retrieve useful and interesting patterns from large web datasets
• web content mining: real data in web pages (text and multimedia documents)
• web structure mining: data describing the organization of the content (hyperlink structure)
• web usage mining: data describing the pattern of usage of web pages (web log records)
PreProcessing
[Diagram of the preprocessing pipeline]
• INPUT: Access Log, Referrer Log, Agent Log, Registration, Site File
• PREPROCESSING: Data Cleaning, User Identification, Session Identification, Path Completion
• Other components: Site Crawler, Site Topology, SQL Query, User Session File, Transaction Identification, Transaction File
Session Identification
Partitioning each user's activities into sequences (sessions) of entries from web request logs
• time oriented heuristics: temporal boundaries (session length, page-stay)
• navigation oriented heuristics: links between web pages
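Sequential Mining
Association mining that takes the order of transactions into account: given a set of data sequences, find all sequences with a user-specified minimum support
• items: I = {i1, i2, …, in}
• itemset/element: X ⊆ I, a non-empty set of items
• sequence: s = <e1 e2 … er>, where each ej is an itemset
• sequence size: number of itemsets/elements
• sequence length: number of items
• subsequence: <a1 … an> is a subsequence of <b1 … bm> if there are indices j1 < … < jn with ai ⊆ bji for every i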
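The definitions of size, length, and subsequence can be made concrete with a short sketch. It assumes a sequence is represented as a Python list of sets; the function names are illustrative and not taken from the slides.

```python
from typing import List, Set

ItemSequence = List[Set[int]]  # a sequence is an ordered list of itemsets


def size(seq: ItemSequence) -> int:
    """Sequence size: number of itemsets/elements."""
    return len(seq)


def length(seq: ItemSequence) -> int:
    """Sequence length: total number of items over all elements."""
    return sum(len(element) for element in seq)


def is_subsequence(a: ItemSequence, b: ItemSequence) -> bool:
    """True if every element of a is a subset of a distinct element of b,
    with the matching elements of b appearing in the same order."""
    pos = 0
    for element in a:
        # Greedily advance to the first element of b that contains it.
        while pos < len(b) and not element <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next element of a must match strictly later in b
    return True


t = [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]
print(size(t), length(t))
print(is_subsequence([{3}, {4, 5}, {8}], t))
print(is_subsequence([{3}, {5}], [{3, 5}]))
```

Note that <{3},{5}> is not a subsequence of <{3,5}>: two separate elements cannot both be matched to a single element.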
Sequential Mining algorithms
Phases
• Sort Phase: transforms customer transactions into customer sequences
• Large Itemset Phase: generates the set of large itemsets
• Transformation Phase: represents customer sequences based on large itemsets
• Sequence Phase: derives large k-sequences from large (k-1)-sequences
• Maximal Phase: prunes non-maximal sequences
Algorithms: APrioriAll, APrioriSome, GSP
Smart-SRA session
• timestamp ordering rule (time oriented): applies to the session
• topology rule (navigation oriented): applies to the path in the web site
• maximality rule: applies to the path in the web site
Smart Miner
DATA STREAM → SMART-SRA SESSION CONSTRUCTION (Candidate Session → Smart Session) → SEQUENTIAL MINING (Sequential AprioriAll) → FREQUENT ACCESS PATTERN
Smart Miner: First Phase Smart SRA
Candidate session construction
• time oriented heuristics
• session length
• page-stay
• no backward movement
Example request log
Page       P1   P20  P13  P49  P34  P23
TimeStamp  0    6    9    12   14   15
[Second table, partly garbled in extraction: Page P49 P13 P20 P23; TimeStamp 0 5 9 10]
[Figure: Web Site Graph over pages P1, P13, P20, P23, P34, P49, and the resulting Candidate Session]
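The time oriented part of candidate session construction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the parameter names, and the threshold values (30 and 10, in the same unit as the timestamps) are assumptions, and the "no backward movement" rule is not modeled. The request log is the one from the slide.

```python
from typing import List, Tuple

Request = Tuple[str, int]  # (page, timestamp)


def candidate_sessions(
    requests: List[Request],
    max_session_length: int,
    max_page_stay: int,
) -> List[List[Request]]:
    """Split a time-ordered request log of one user into candidate sessions.

    A new session starts when the total session duration or the time spent
    on the previous page exceeds its threshold.
    """
    sessions: List[List[Request]] = []
    current: List[Request] = []
    for page, ts in requests:
        if current:
            too_long = ts - current[0][1] > max_session_length
            too_idle = ts - current[-1][1] > max_page_stay
            if too_long or too_idle:
                sessions.append(current)
                current = []
        current.append((page, ts))
    if current:
        sessions.append(current)
    return sessions


log = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15)]
# Hypothetical thresholds: session length 30, page-stay 10.
print(candidate_sessions(log, max_session_length=30, max_page_stay=10))
```

With these assumed thresholds the whole log stays in one candidate session; lowering the page-stay threshold to 5 would split it after P1, because the gap between timestamps 0 and 6 exceeds 5.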
Smart Miner: Second Phase Smart SRA
Smart session construction
• time oriented heuristics
• inherited session length
• re-check page-stay
• no backward movement
• maximality
• topology rule
Candidate Session
Page       P1   P20  P13  P49  P34  P23
TimeStamp  0    6    9    12   14   15
Smart Sessions
[P1, P13, P34, P23]
[P1, P13, P49, P23]
[P1, P20, P23]
[Figure: Web Site Graph over pages P1, P13, P20, P23, P34, P49]
Smart Miner: Second Phase Smart SRA
SMART SESSION RECONSTRUCTION
foreach CandSession in CandSessionSet
  NewSessionSet = {}
  while CandSession ≠ Ø
    TSessionSet = {}; TPageSet = {}
    // pages with no incoming link
    foreach Pagei in CandSession
      StartPageFlag = TRUE
      foreach Pagej in CandSession with j<i
        if (Link[Pagej,Pagei] and TimeDiff(Pagej,Pagei) ≤ σ) then StartPageFlag = FALSE
      endfor
      if StartPageFlag then TPageSet = TPageSet U {Pagei}
    endfor
    CandSession = CandSession - TPageSet
    if NewSessionSet = {} then
      // session set construction
      foreach Pagei in TPageSet
        TSessionSet = TSessionSet U {[Pagei]}
    else
      // session set extension
      foreach Pagei in TPageSet
        foreach Sessionj in NewSessionSet
          if (Link[Last(Sessionj),Pagei] and TimeDiff(Last(Sessionj),Pagei) ≤ σ) then
            TSession = Sessionj
            TSession.mark = UNEXTENDED
            TSession = TSession • Pagei
            TSessionSet = TSessionSet U {TSession}
            Sessionj.mark = EXTENDED
          endif
        endfor
      endfor
    endif
    // keep the sessions that were not extended
    foreach Sessionj in NewSessionSet
      if Sessionj.mark ≠ EXTENDED then TSessionSet = TSessionSet U {Sessionj}
    endfor
    NewSessionSet = TSessionSet
  endwhile
endfor
Session Construction Example
Iteration  CandidateSession                    TPageSet      NewSessionSet
1          [P1, P20, P13, P49, P34, P23]       {P1}          [P1]
2          [P20, P13, P49, P34, P23]           {P20, P13}    [P1, P20], [P1, P13]
3          [P49, P34, P23]                     {P49, P34}    [P1, P13, P34], [P1, P13, P49], [P1, P20]
4          [P23]                               {P23}         [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
[Figure: Web Site Graph over pages P1, P13, P20, P23, P34, P49]
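The iterations in the example can be reproduced with a short sketch of the smart session construction step. Two things here are assumptions rather than content from the slides: the link set is inferred from the smart sessions in the example (the graph figure did not survive extraction), and σ = 10 is a hypothetical page-stay threshold. The function name is illustrative.

```python
from typing import List, Set, Tuple

Request = Tuple[str, int]  # (page, timestamp)


def smart_sessions(
    candidate: List[Request], links: Set[Tuple[str, str]], sigma: int
) -> List[List[str]]:
    """Build maximal link-respecting paths from one candidate session."""
    remaining = list(candidate)
    sessions: List[List[Request]] = []
    while remaining:
        # Pages with no incoming link from an earlier page within sigma.
        start_pages = [
            (page, ts)
            for i, (page, ts) in enumerate(remaining)
            if not any(
                (prev, page) in links and ts - prev_ts <= sigma
                for prev, prev_ts in remaining[:i]
            )
        ]
        remaining = [r for r in remaining if r not in start_pages]
        if not sessions:
            # First iteration: each start page opens a session.
            new_sessions = [[p] for p in start_pages]
        else:
            new_sessions = []
            extended = set()
            for page, ts in start_pages:
                for idx, session in enumerate(sessions):
                    last_page, last_ts = session[-1]
                    if (last_page, page) in links and ts - last_ts <= sigma:
                        new_sessions.append(session + [(page, ts)])
                        extended.add(idx)
            # Sessions that could not be extended are kept unchanged.
            new_sessions += [s for i, s in enumerate(sessions) if i not in extended]
        sessions = new_sessions
    return [[page for page, _ in s] for s in sessions]


# Assumed topology, inferred from the resulting sessions in the example.
LINKS = {
    ("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
    ("P49", "P23"), ("P34", "P23"), ("P20", "P23"),
}
CANDIDATE = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15)]
for session in smart_sessions(CANDIDATE, LINKS, sigma=10):
    print(session)
```

Under these assumptions the function yields the three smart sessions of iteration 4, in some order.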
Sequential APrioriAll
Pruning is applied during candidate sequence generation, before calculating support
• topological constraint: for every consecutive pair of pages in a sequence, the former must have a hyperlink to the latter
• string matching constraint: session S supports a pattern P if and only if P is a subsequence of S not violating string matching
  • <1,2,3> supports <1,2>
  • <1,2,3> does not support <1,3>
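Both constraints are easy to express as predicates. The sketch below reads the string matching constraint as "the pattern occurs as a contiguous run of the session", which is what the two examples on the slide imply; the function names and the sample link set are illustrative assumptions.

```python
from typing import Sequence, Set, Tuple


def satisfies_topology(pattern: Sequence[int], links: Set[Tuple[int, int]]) -> bool:
    """Topological constraint: each page must link to the next one."""
    return all((a, b) in links for a, b in zip(pattern, pattern[1:]))


def supports(session: Sequence[int], pattern: Sequence[int]) -> bool:
    """String matching constraint: pattern occurs contiguously in session."""
    m = len(pattern)
    return any(
        list(session[i:i + m]) == list(pattern)
        for i in range(len(session) - m + 1)
    )


print(supports([1, 2, 3], [1, 2]))  # <1,2,3> supports <1,2>
print(supports([1, 2, 3], [1, 3]))  # <1,2,3> does not support <1,3>
print(satisfies_topology([1, 2, 3], {(1, 2), (2, 3)}))
```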
Support
Support(I, S) = |{ s ∈ S : I is a substring of s }| / |S|
• I : pattern
• S : user reconstructed sessions
• computed in one scan through the transaction database by keeping candidate session in hashmap
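A one-scan support computation might look like the sketch below. It assumes the hashmap is keyed by candidate pattern and that support is the fraction of sessions containing the pattern as a contiguous run; both readings are assumptions, as are the function name and the sample data.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Pattern = Tuple[int, ...]


def support_one_scan(
    sessions: List[Sequence[int]], candidates: Iterable[Sequence[int]]
) -> Dict[Pattern, float]:
    """Compute the support of all candidates in a single pass over sessions."""
    counts: Dict[Pattern, int] = {tuple(c): 0 for c in candidates}
    lengths = {len(c) for c in counts}
    for session in sessions:
        seen = set()  # count each candidate at most once per session
        for k in lengths:
            for i in range(len(session) - k + 1):
                window = tuple(session[i:i + k])
                if window in counts:  # hashmap lookup
                    seen.add(window)
        for window in seen:
            counts[window] += 1
    total = len(sessions)
    if total == 0:
        return {c: 0.0 for c in counts}
    return {c: n / total for c, n in counts.items()}


sessions = [[1, 2, 3], [1, 2], [2, 3, 1, 2], [3]]
print(support_one_scan(sessions, [(1, 2), (2, 3), (1, 3)]))
```

Each session is visited once, and every window of a relevant length is checked against the hashmap instead of testing each candidate against each session separately.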
Sequential Apriori Algorithm
SEQUENTIAL APRIORI
INPUT:
  minimum support frequency : δ
  reconstructed sessions : S
  topology information : Link
  set of all web pages : P
OUTPUT:
  set of maximal frequent patterns : Max

// length-1 candidate pattern generation
L1 = {}
for i = 1 to |P| do
  if Support([Pi],S) > δ then L1 = L1 U {[Pi]}
for k = 1 to N-1 do
  if Lk = Ø then
    Halt                                // no further generation
  else
    // length-(k+1) candidate pattern generation
    Lk+1 = {}
    foreach Ii in Lk                    // joining step
      foreach Pj in P
        if Link[Last(Ii),Pj] then       // pruning step: topological rule
          T = Ii • Pj                   // append page
          if Support(T,S) > δ then      // support rule
            T.maximal = true            // maximality rule
            Ii.maximal = false
            V = [T2,T3,…,T|T|]
            if V in Lk then V.maximal = false
            Lk+1 = Lk+1 U {T}
          endif
        endif
      endfor
    endfor
  endif
endfor
// union of the sets of maximal patterns
Max = {}
for k = 1 to N-1 do
  Max = Max U {S | S in Lk and S.maximal = true}
endfor
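The pseudocode translates fairly directly into Python. The sketch below is one possible reading, not the authors' code: it loops until no new level is produced instead of using a bound N, tracks maximal patterns in a set rather than with a `maximal` flag, and assumes a non-negative threshold. The sample sessions, link set, and threshold are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Pattern = Tuple[str, ...]


def sequential_apriori(
    sessions: List[Sequence[str]],
    pages: Sequence[str],
    links: Set[Tuple[str, str]],
    min_support: float,
) -> Set[Pattern]:
    """Return the maximal frequent patterns that respect the link topology."""

    def support(pattern: Pattern) -> float:
        k = len(pattern)
        hits = sum(
            any(tuple(s[i:i + k]) == pattern for i in range(len(s) - k + 1))
            for s in sessions
        )
        return hits / len(sessions)

    # Length-1 patterns.
    level = {(p,) for p in pages if support((p,)) > min_support}
    maximal = set(level)
    while level:
        next_level: Set[Pattern] = set()
        for pattern in level:
            for page in pages:
                if (pattern[-1], page) in links:          # topological rule
                    candidate = pattern + (page,)
                    if support(candidate) > min_support:  # support rule
                        next_level.add(candidate)
                        maximal.discard(pattern)          # prefix not maximal
                        maximal.discard(candidate[1:])    # suffix not maximal
        maximal |= next_level
        level = next_level
    return maximal


PAGES = ["P1", "P13", "P20", "P23", "P34", "P49"]
LINKS = {
    ("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
    ("P49", "P23"), ("P34", "P23"), ("P20", "P23"),
}
SESSIONS = [
    ["P1", "P13", "P34", "P23"],
    ["P1", "P13", "P49", "P23"],
    ["P1", "P20", "P23"],
    ["P1", "P13", "P34", "P23"],
]
print(sequential_apriori(SESSIONS, PAGES, LINKS, min_support=0.4))
```

With the hypothetical threshold of 0.4, only the path through P34 is frequent end to end, so it is the single maximal pattern.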
Accuracy Metric
Compares the frequent maximal patterns of the agent simulator with the frequent maximal patterns of the heuristic
• recall
• precision
• accuracy
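Recall and precision over two pattern sets can be sketched with the usual set-based definitions. That the slide uses exactly these definitions is an assumption, since its formulas did not survive extraction; the accuracy formula is not reproduced for the same reason. The function name and the sample pattern sets are illustrative.

```python
from typing import Iterable, Tuple

Pattern = Tuple[int, ...]


def recall_precision(
    simulator_patterns: Iterable[Pattern], heuristic_patterns: Iterable[Pattern]
) -> Tuple[float, float]:
    """Set-based recall and precision of the heuristic's patterns,
    taking the agent simulator's patterns as ground truth."""
    truth = set(simulator_patterns)
    found = set(heuristic_patterns)
    common = truth & found
    recall = len(common) / len(truth) if truth else 0.0
    precision = len(common) / len(found) if found else 0.0
    return recall, precision


truth = {(1, 2), (2, 3), (3, 4)}
found = {(1, 2), (2, 3), (5, 6), (7, 8)}
print(recall_precision(truth, found))
```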
Agent Simulator
Parameters
• STP (Session Termination Probability): probability of terminating the session
• LPP (Link from Previous page Probability): probability of referring the next page from one of the previously accessed pages, except the most recently accessed one
• LPC (Link from Current page Probability): probability of referring the next page from the most recently visited page
• NIP (New Initial page Probability): probability of selecting one of the starting pages of the web site during navigation
Simulated Data
Web topology
• number of web pages: from 10 to 1000
• number of users: from 1000 to 10000
Agent simulator parameters
• 49 different cases
• NIP/STP: 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0
• LPC/LPP: 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0
Support parameter
• values: 0.001, 0.0025, 0.005, 0.0075, 0.01
Runs of agent simulator
• 10 different random runs
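The figure of 49 cases matches 7 × 7, which suggests the cases are all pairings of the seven NIP/STP ratios with the seven LPC/LPP ratios. That pairing is an assumption; the slide only gives the two value lists and the total. A short check:

```python
from itertools import product

RATIOS = [0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0]

# Assumed grid: every NIP/STP ratio combined with every LPC/LPP ratio.
cases = list(product(RATIOS, RATIOS))
print(len(cases))
```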
Results on Simulated Data
[Plots: results versus New Initial Page Probability (NIP) and versus Session Termination Probability (STP)]
Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA
Results on Simulated Data
[Plot]
Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA
Real Data
AGMLAB's company web site
• 4 months of user activity
• 3801 users
• 30-minute session time-out
• 10 web pages
• densely connected link graph
User Activity
• action tracking program
• cookies
• cookie information recorded to a server log file
Results on Real Data
[Plot]
Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA
Scalability
• Performance with 50 nodes
• Performance on 100 GB of data
MAP/REDUCE paradigm
• each node processes a block of the session database
• each node computes the local frequency of each candidate pattern
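The map and reduce steps can be illustrated in plain Python, without any cluster framework. This is a single-process sketch of the idea only; the function names, the block layout, and the sample data are assumptions and say nothing about how the authors deployed it.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

Pattern = Tuple[int, ...]


def contains(session: Sequence[int], pattern: Pattern) -> bool:
    """True if pattern occurs as a contiguous run in session."""
    m = len(pattern)
    return any(
        tuple(session[i:i + m]) == pattern for i in range(len(session) - m + 1)
    )


def map_block(block: List[Sequence[int]], candidates: List[Pattern]) -> Counter:
    """Map step: local frequency of each candidate within one block."""
    local: Counter = Counter()
    for session in block:
        for pattern in candidates:
            if contains(session, pattern):
                local[pattern] += 1
    return local


def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """Reduce step: sum the local frequencies into global frequencies."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


blocks = [[[1, 2, 3], [1, 2]], [[2, 3, 1, 2], [3]]]  # two hypothetical nodes
candidates = [(1, 2), (2, 3)]
print(reduce_counts(map_block(b, candidates) for b in blocks))
```

Because the counts are additive, each block can be processed independently and the partial results merged in any order.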
References
• M.A. Bayir, I.H. Toroslu, A. Cosar, G. Fidan, Smart Miner: A New Framework for Mining Large Scale Web Usage Data, 2009
• R. Cooley, B. Mobasher, J. Srivastava, Data Preparation for Mining World Wide Web, 1999
• J. Srivastava, R. Cooley, M. Deshpande, P.N. Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, 2000
• M.G. Da Costa Jr., Z. Gong, Web Structure Mining: An Introduction, 2005
• J.J. Jung, Semantic PreProcessing of Web Request Streams for Web Usage Mining, 2005
• R. Agrawal, R. Srikant, Mining Sequential Patterns, 1995
GSP
GSP: GENERALIZED SEQUENTIAL PATTERN
C1 = Init_Pass
L1 = {<{f}> | f in C1, with minimum support}
for (k=2; Lk-1 ≠ Ø; k++) do begin
  Ck = Candidate-gen-SPM(Lk-1)
  foreach sequence s in the database D do
    foreach candidate c in Ck
      if (c in s) then update candidate c
  Lk = candidates c in Ck with minimum support
end
result = Uk(Lk)

CANDIDATE-GEN-SPM
// join step
foreach p in Lk-1
  foreach q in Lk-1
    if (p without its first item = q without its last item) then Ck = Ck U {p1,…,pk-1,qk-1}
// prune step
foreach s in Ck
  if exists(r | r is a (k-1)-subsequence of s ˄ r not in Lk-1) then Ck = Ck - s
GSP Example
L3-sequences      Candidate 4-sequences (join step)   Candidate 4-sequences (prune step)
<{1,2},{4}>       <{1,2},{4,5}>                       <{1,2},{4,5}>
<{1,2},{5}>       <{1,2},{4},{6}>
<{1},{4,5}>
<{1,4},{6}>
<{2},{4,5}>
<{2},{4},{6}>
<{1},{4},{6}>
APrioriAll
APRIORIALL
L1 = {large 1-sequences}
for (k=2; Lk-1 ≠ Ø; k++) do begin
  Ck = Apriori-generate(Lk-1)
  foreach sequence c in the database D do
    update candidates in Ck that are contained in c
  Lk = candidates in Ck with minimum support
end
result = maximal sequences in Uk(Lk)

APRIORI-GENERATE
// join step
foreach p in Lk-1
  foreach q in Lk-1
    if (p.x1=q.x1) ˄ (p.x2=q.x2) ˄ … ˄ (p.xk-2=q.xk-2) then Ck = Ck U {<p.x1,…,p.xk-1,q.xk-1>}
// prune step
foreach s in Ck
  if exists(r | r is a (k-1)-subsequence of s ˄ r not in Lk-1) then Ck = Ck - s
APrioriAll Example
L3-sequences   Candidate 4-sequences (join step)   Candidate 4-sequences (prune step)
<1,2,3>        <1,2,3,4>                           <1,2,3,4>
<1,2,4>        <1,2,4,3>
<1,3,4>        <1,3,4,5>
<1,3,5>        <1,3,5,4>
<2,3,4>
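The join and prune steps can be checked against this example with a small sketch. Sequences are represented as tuples of integers, and self-joins (p = q) are skipped so that the output matches the table; that exclusion is an assumption, as are the function names.

```python
from itertools import combinations
from typing import Set, Tuple

Seq = Tuple[int, ...]


def join_step(large_prev: Set[Seq]) -> Set[Seq]:
    """Join p and q when they agree on everything except the last item."""
    return {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p != q and p[:-1] == q[:-1]  # p != q is an assumed restriction
    }


def prune_step(candidates: Set[Seq], large_prev: Set[Seq]) -> Set[Seq]:
    """Drop candidates that have a (k-1)-subsequence outside large_prev."""
    return {
        c
        for c in candidates
        if all(sub in large_prev for sub in combinations(c, len(c) - 1))
    }


L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
joined = join_step(L3)
pruned = prune_step(joined, L3)
print(sorted(joined))
print(sorted(pruned))
```

`combinations` keeps the original order of the items, so each tuple it yields is a (k-1)-subsequence obtained by deleting one item.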
APrioriSome
APRIORISOME
// Forward Phase
L1 = {large 1-sequences}; C1 = L1; last = 1
for (k=2; Ck-1 ≠ Ø; k++) do begin
  if (Lk-1 known) then Ck = Apriori-generate(Lk-1)
  else Ck = Apriori-generate(Ck-1)
  if (k = next(last)) then
    foreach sequence c in the database D do
      update candidates in Ck that are contained in c
    Lk = candidates in Ck with minimum support; last = k
end
// Backward Phase
for (k--; k>=1; k--) do begin
  if (Lk not found) then
    delete all sequences in Ck contained in some Li, i>k
    foreach sequence c in the database D do
      update candidates in Ck that are contained in c
    Lk = candidates in Ck with minimum support
  else
    delete all sequences in Lk contained in some Li, i>k
end
result = maximal sequences in Uk(Lk)