Benefits of InterSite Pre-Processing and Clustering Methods in E-Commerce Domain
Sergiu Chelcea, Alzennyr Da Silva, Yves Lechevallier, Doru Tanasa, Brigitte Trousse
AxIS Research Team, INRIA Sophia Antipolis and Rocquencourt
Motivations
To show, on the clickstream dataset proposed for the ECML/PKDD 2005 Discovery Challenge:
- the benefits of our InterSite pre-processing method, proposed by Tanasa in his PhD thesis (2005)
- the benefits of a new crossed clustering method, developed by Lechevallier & Verde and published in (2003, 2004), applied to Web logs
2 main viewpoints: user and web site load
Plan
1. Intersite Data Pre-Processing
- introduction of the user's intersite visit « Group of SessionIDs »
- first statistical Intersite analysis
2. Crossed Clustering Approach
- confusion table with classes of time periods and classes of product types
- analysis on the most used shop: shop 4
3. Conclusions
Data pre-processing
Table 1. Format of page requests
Initial data:
Table 2. Number of requests per shop
Data pre-processing
Tanasa & Trousse (IEEE Intelligent Systems, 2004)
Tanasa's thesis (2005)
Data pre-processing
• Data fusion, data cleaning
Table 3. Transformed log lines
• Data structuring
• SessionID: a single visit on each shop
• Towards the notion of a user's intersite visit:
• we group the SessionIDs that belong to a single user (same IP) into a « Group of SessionIDs »
• we compare the Referer with the URLs previously accessed (in a reasonable time window)
• 522,410 SessionIDs grouped into 397,629 Groups, equivalent to a 23.88% reduction
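The grouping rule on this slide can be sketched roughly as follows. This is only an illustration, not the InterSite implementation: the tuple layout, the function name `group_sessions`, the exact-match comparison between Referer and URL, and the 30-minute window are all assumptions (the slide only says "a reasonable time window").

```python
from collections import defaultdict

def group_sessions(requests, window=1800):
    """Assign each SessionID to a group (hypothetical helper).

    requests: iterable of (timestamp_s, ip, session_id, url, referer),
    sorted by timestamp. window is an assumed time limit in seconds.
    """
    group_of = {}               # session_id -> group id
    recent = defaultdict(list)  # ip -> [(timestamp, url, session_id)]
    next_group = 0
    for ts, ip, sid, url, referer in requests:
        if sid not in group_of:
            # First request of this SessionID: look for a recent request
            # from the same IP whose URL matches the Referer.
            match = None
            for prev_ts, prev_url, prev_sid in reversed(recent[ip]):
                if ts - prev_ts > window:
                    break
                if prev_url == referer:
                    match = prev_sid
                    break
            if match is not None:
                group_of[sid] = group_of[match]
            else:
                group_of[sid] = next_group
                next_group += 1
        recent[ip].append((ts, url, sid))
    return group_of

# Made-up sample: B continues A on another shop, D arrives too late.
sample = [
    (0,    "ip1", "A", "shop1/p1", ""),
    (10,   "ip1", "A", "shop1/p2", "shop1/p1"),
    (20,   "ip1", "B", "shop2/p1", "shop1/p2"),
    (30,   "ip2", "C", "shop1/p1", ""),
    (5000, "ip1", "D", "shop3/p1", "shop1/p2"),
]
groups = group_sessions(sample)
n_groups = len(set(groups.values()))
```

On this toy input, four SessionIDs collapse into three groups, which is the kind of reduction the slide reports for the full dataset.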
Relational DB model Data summarisation
Data pre-processing
Fig. 1. Visits per day and hour: (a) globally, (b) multi-shop
• Low number of new visits on Saturdays and Sundays during lunch time
• High number of new visits on Tuesdays and Wednesdays
• Same results for (a) and (b)
Crossed Clustering Approach for Time Periods/Product Analysis
Method developed by Yves Lechevallier & Rosanna Verde (2003, 2004)
Data: selection of ls pages in shop 4 (the most used)
Crossed Clustering Approach for Time Periods/Product Analysis
Method developed by Yves Lechevallier & Rosanna Verde (2003, 2004)
Relational DB model: a crossed table is easily added
Line: an individual (weekday, one hour); 7 days x 24 hours = 168 individuals
Column: a multi-categorical variable representing the number of products requested by users in the specific time slice
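A minimal sketch of how such a crossed table could be built from request records. The record layout, the function name `crossed_table`, the product-type labels, and the dates are made up for illustration; the slides build the table from the relational DB instead.

```python
from collections import Counter
from datetime import datetime

def crossed_table(requests):
    """Build a (weekday, hour) x product-type count table (hypothetical helper).

    requests: iterable of (datetime, product_type).
    Returns (rows, cols, table), with 7 x 24 = 168 rows.
    """
    counts = Counter()
    product_types = set()
    for when, ptype in requests:
        counts[(when.weekday(), when.hour, ptype)] += 1
        product_types.add(ptype)
    cols = sorted(product_types)
    rows = [(day, hour) for day in range(7) for hour in range(24)]
    table = [[counts[(day, hour, p)] for p in cols] for day, hour in rows]
    return rows, cols, table

# Made-up requests; 23 January 2004 was a Friday (weekday() == 4).
sample = [
    (datetime(2004, 1, 23, 17, 30), "refrigerator"),
    (datetime(2004, 1, 23, 17, 45), "refrigerator"),
    (datetime(2004, 1, 24, 9, 5), "washing machine"),
]
rows, cols, table = crossed_table(sample)
friday_17h = rows.index((4, 17))
```

Each row is one of the 168 individuals and each column one product type, so `table[friday_17h]` holds the counts for Fridays between 17:00 and 18:00.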
Crossed Clustering Approach for Time Periods/Product Analysis
Table 4. Quantity of products requested by weekday x hour and registered on shop 4
Crossed Clustering Approach for Time Periods/Product Analysis
Table 5. Confusion table (highlighted cell: 57.7%)
Crossed Clustering Approach for Time Periods/Product Analysis
Example of one surprising result: the class Product 5 is defined by a single product type, « Free standing combi refrigerators », consulted predominantly on Fridays from 17:00 to 20:00 (class Period 6). 57.7% of the requests for this product type fall in this period.
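A share such as the one quoted above is a column percentage of the confusion table: the count in one (period class, product class) cell divided by the total for that product class. The sketch below uses toy counts, not the values of Table 5, and the function name `column_shares` is hypothetical.

```python
def column_shares(table, col):
    """Percentage of column `col` falling in each row of a count table."""
    total = sum(row[col] for row in table)
    return [100.0 * row[col] / total for row in table]

# Toy confusion table: rows = period classes, columns = product classes.
toy = [
    [40, 10],
    [35, 30],
    [25, 60],
]
shares = column_shares(toy, 1)
```

Here the third period class receives 60% of the requests for the second product class.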
Conclusions
1. Intersite Data Pre-Processing
- structuring into users' intersite visits « Group of SessionIDs »
- first statistical Intersite analysis
- anomalies and recommendations for the dataset
2. Crossed Clustering Approach
- first application of such a method to time periods of Web logs and to the e-commerce domain
- promising results
Data pre-processing
Inconsistency problems:
• table kategorie: repeated entries, and different entries with the same ID
• for some page types (dt, df), the given parameter actually represented a specific product, not the product description (from the products table)
• extra parameters equivalent to the given ones for some page types: e.g. for the ct page type, id is equivalent to the given c parameter
• missing values (descriptions) in tables: 3 values in the product table and 64 in the category table
• multiple-site SessionIDs: 13 cross-server visits had the same SessionID on the visited sites (up to 4 sites); the SessionID should change on each new site
• multiple-IP SessionIDs: 3690 visits (SessionIDs) were made from more than one IP (anonymization proxies?)
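The last two checks can be sketched as a single pass over the cleaned log. The dictionary keys `session_id`, `ip` and `shop`, and the function name, are assumptions for illustration, not the dataset's actual column names.

```python
from collections import defaultdict

def multi_valued_sessions(requests, field):
    """Return SessionIDs seen with more than one value of `field`."""
    seen = defaultdict(set)
    for req in requests:
        seen[req["session_id"]].add(req[field])
    return {sid: vals for sid, vals in seen.items() if len(vals) > 1}

# Made-up log lines.
sample = [
    {"session_id": "s1", "ip": "10.0.0.1", "shop": 4},
    {"session_id": "s1", "ip": "10.0.0.2", "shop": 4},  # same session, new IP
    {"session_id": "s2", "ip": "10.0.0.3", "shop": 4},
    {"session_id": "s2", "ip": "10.0.0.3", "shop": 7},  # same session, new shop
    {"session_id": "s3", "ip": "10.0.0.4", "shop": 1},
]
multi_ip = multi_valued_sessions(sample, "ip")
multi_shop = multi_valued_sessions(sample, "shop")
```

Counting the keys of each result gives the number of multiple-IP and multiple-site SessionIDs for the sample.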