Data Mining to Measure and Improve the Success of Web Sites. - Myra Spiliopoulou and Carsten Pohle. Santosh Vishwanath, Department of Computer Science and Engineering, http://santoshv23.tripod.com/
My presentation is going to focus on: • Introducing various concepts related to Data Mining and web sites. • Related work. • Notion of “success” for web sites. • Navigation pattern discovery (our main focus).
My presentation is going to focus on: • A procedure for evaluating success on the basis of navigation patterns. • Improving the success of web sites. • Case study of the “SchulWeb” site. • Conclusions. • References.
Introduction : • Web as an international marketplace. • Requirement to improve the quality of the website continuously to meet the demands and the requirements of the user. • Far away from intelligent web sites in the AI sense. • Methods to determine the success of a website :
Introduction : • Study the satisfaction of the visitors. • Drawbacks : • A sample is used instead of the whole population. • The sample is local, whereas the web is global. • The criteria are pre-specified. • Setting up the environment is costly.
Introduction : • A format for studying success should satisfy the following: • It considers all the visitors of the site. • It is performed frequently, on a regular basis. • It points to the shortcomings of the site and to ways of alleviating them.
Introduction : • So we use the method of analyzing the web usage patterns. • Data Mining is by nature appropriate for this kind of analysis.
What is Data Mining? • Data Mining can be defined as the discovery of useful summaries from a usually large amount of data. • Data Mining refers to the process of extracting trends and predicting future patterns from large quantities of data by applying various algorithms to the data.
What is Data Mining for the Web? • We define Data Mining for the web or in short Web Mining as the extraction of interesting and potentially useful patterns and implicit information from activities related to the World Wide Web. • There are essentially 3 domains that pertain to web mining: • Web Content Mining. • Web Structure Mining. • Web Usage Mining.
How can Data Mining be used for Web Mining? • The activities of the user are recorded in the web server log. • This web server log data can be mined to obtain useful information about a particular site. • The data mining paradigm provides the methodology for analyzing it. • We need an appropriate model of navigation behavior so that the discovered patterns can provide the indicators necessary for improving the site.
Related work: • Monitoring the website is as old as the web itself. • Was used by web administrators for studying and maintaining the web server’s load. • Modern tools support the computation of statistics that can serve as a basis for the analysis of success of a website. • There are 2 domains involved in this work • Success measures of the website. • Data Mining techniques for analyzing the web usage.
Measuring the Success of a Website: • Sullivan’s definition • Quality of service • Quality of navigation in the modes supported by the site • Accessibility of a page • Eighmey’s definition • Measures a set of factors • Measures are obtained through a questionnaire • Dreze and Zufryden • Conjoint measurement • Finds the important attributes of web sites to focus on
Measuring the Success of a Website: • Berthon et al.’s definition : • Reflects the success of a site with respect to the objective goals of the owner. • Proposes 2 measures : • Contact efficiency. • Conversion efficiency. • Success is the site's efficiency in “converting” visitors into customers, and this can be measured without the involvement of the users.
Advances in Web Usage Mining: • Knowledge is obtained through the analysis of web usage. • Related but unlinked web pages are discovered by clustering pages that the site’s visitors access together. • Many other techniques helped new users get oriented, but they neither took the objectives of the site itself into account nor gave hints on how the site could be improved. • OLAP techniques were applied to web usage mining.
Advances in Web Usage Mining: • The discovery of web usage patterns with conventional mining techniques involved finding frequently accessed paths, for example by discovering association rules. • Problems with data mining • Modeling navigation patterns as sequences or associations oversimplifies the problem. • Statistical measures like frequency are too simple for web usage. • In the later portion we will see a complete framework for modeling success and navigation behavior and for combining the two to improve the success of a website.
Concepts related to the authors’ proposal: • Define “site success” as the efficiency of the site's components in helping the users achieve the site’s goals. • Specify the context in which the analysis takes place, i.e. the site's goal towards which success is measured. • Pages reflecting a site’s goal • Action Page -- e.g. the form for buying books • Target Page -- e.g. the order completed page.
Concepts related to the author’s proposal: • Service-based Concept hierarchies
Concepts related to the authors’ proposal: • Service-based Concept hierarchies (contd) • Market Basket Analysis uses individual products. • The authors propose to model the site’s services that generate the URLs.
Success as Contact and Conversion Efficiency: • Berthon et al. • Contact efficiency, conversion efficiency • Short-time visitor and active visitor. • The authors use sessions instead of visitors as the unit for measuring success towards a site's goals. • All users accessing the site are “visitors”. • An “active session” is a session containing at least one activity towards fulfilling the site’s goal. All other sessions are termed “inactive.”
Success as Contact and Conversion Efficiency: • A “customer session” is a session in which the user has achieved the site’s goal. • Advantages • Does not depend upon length of stay or number of pages visited. • site’s goal is the center of focus
Success as Contact and Conversion Efficiency: • Contact Efficiency inside a site: • Let Sessions denote all sessions recorded in the log and let A be an action page of the web site. Then the contact efficiency of A is given by: $\mathrm{contacteff}(A) = \frac{\mathrm{card}(\{\{ s \in \mathit{Sessions} \mid A \in s \}\})}{\mathrm{card}(\mathit{Sessions})}$ where card() denotes cardinality and {{...}} denotes a multiset. • Identifies the impact of each action page on the overall success of the site. • Detects pages with low efficiency.
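The contact efficiency formula can be sketched in a few lines of Python. This is only an illustration, assuming each session is a list of page names. The function name `contact_efficiency` and the sample log are hypothetical and not part of the authors' tooling.

```python
from typing import Sequence

def contact_efficiency(sessions: Sequence[Sequence[str]], action_page: str) -> float:
    """Fraction of all recorded sessions that contain the action page."""
    if not sessions:
        return 0.0  # assumption: an empty log is treated as efficiency 0
    containing = sum(1 for s in sessions if action_page in s)
    return containing / len(sessions)

# Hypothetical log: each inner list is one session (a multiset element).
example_sessions = [
    ["home", "search_form", "result"],
    ["home", "about"],
    ["search_form"],
    ["home"],
]
```

Calling `contact_efficiency(example_sessions, "search_form")` divides the two sessions that contain the page by the four sessions in the log.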
Success as Contact and Conversion Efficiency: • Relative Contact Efficiency of a site: • The “relative contact efficiency of an action page” A is the ratio of sessions containing this page to the cardinality of the multiset of active sessions, called aSessions: $\mathrm{Rcontacteff}(A) = \frac{\mathrm{card}(\{\{ s \in \mathit{aSessions} \mid A \in s \}\})}{\mathrm{card}(\mathit{aSessions})}$ where card() denotes cardinality and {{...}} denotes a multiset. • Measures the relative importance of each action page within a site.
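A minimal sketch of the relative variant follows. It assumes that a session is "active" when it contains at least one action page. That is one reading of "activity towards fulfilling the site's goal", and the names used are hypothetical.

```python
from typing import Sequence, Set

def relative_contact_efficiency(
    sessions: Sequence[Sequence[str]],
    action_page: str,
    action_pages: Set[str],
) -> float:
    """Share of active sessions that contain the given action page."""
    # Assumption: a session is active if it touches any action page.
    active = [s for s in sessions if any(p in action_pages for p in s)]
    if not active:
        return 0.0
    return sum(1 for s in active if action_page in s) / len(active)

# Hypothetical log with two action pages.
demo_log = [
    ["home", "by_town"],
    ["by_type"],
    ["home"],              # inactive: no action page
    ["by_town", "by_type"],
]
demo_actions = {"by_town", "by_type"}
```

Here three of the four sessions are active and two of those contain `by_town`, so the inactive session no longer dilutes the ratio.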
Success as Contact and Conversion Efficiency: • Conversion Efficiency of a site: • We define the “conversion efficiency of a page P towards a target page T over a group of paths G from P to T” as the ratio of the cardinality of G to the cardinality of all active sessions containing P: $\mathrm{conveff}(P, T, G) = \frac{\mathrm{card}(G)}{\mathrm{card}(\{\{ s \in \mathit{aSessions} \mid P \in s \}\})}$ • A path is composed of successive accesses. • Estimates the success of an arbitrary page in helping/guiding the users towards a target page.
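The conversion efficiency can be sketched the same way. As a simplifying assumption, G is taken here to be every active session in which T occurs somewhere after the first occurrence of P. The original definition lets the analyst choose G, for example only short paths. All names are hypothetical.

```python
from typing import Sequence

def conversion_efficiency(
    active_sessions: Sequence[Sequence[str]], page: str, target: str
) -> float:
    """card(G) / card(active sessions containing `page`)."""
    with_page = [s for s in active_sessions if page in s]
    if not with_page:
        return 0.0
    # Assumption: G = sessions where `target` appears after the first `page`.
    paths = [s for s in with_page if target in s[s.index(page) + 1:]]
    return len(paths) / len(with_page)

# Hypothetical active log.
demo_active = [
    ["P", "x", "T"],   # path from P to T
    ["P", "y"],        # contains P, never reaches T
    ["z", "T"],        # does not contain P, ignored
    ["P", "T"],        # path from P to T
    ["T", "P"],        # T occurs before P, not a path from P to T
]
```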
The Knowledge Discovery Process of Success Analysis: The knowledge discovery process is typically modeled as a series of steps, namely • specification of the problem • gathering and preparation of the relevant data • analysis of the data with data mining techniques • evaluation of the results according to previously established measures • interpretation of the results • action according to strategic decisions.
Navigation Pattern Discovery: • Previous sections discussed the efficiency of individual pages. • The log of individual requests registered by the web server must be transformed into sessions, and navigation patterns must be discovered. Efficiency is estimated on the basis of these patterns. • The transformation of the web server log into a log of sessions appropriate for mining and the process of navigation pattern discovery are performed in the framework of the Web Utilization Miner WUM (http://wum.wiwi.hu-berlin.de/)
Navigation Pattern Discovery: • WUM :
A model of Web Usage Patterns: • A “session” is a sequence of consecutive URL requests performed by the same visitor. • The boundaries of a session are determined either by duration or by content. • The authors use duration, as is done in WUM. • Replace individual URLs with abstract service descriptors that reflect the search strategies, format types and layouts used by the visitors.
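A duration-based session split might be sketched as follows. This is not WUM's implementation. It assumes that a session ends when the idle gap between two requests of the same visitor exceeds a threshold, and the 30-minute default is an assumption rather than a value given in the slides. The record layout `(visitor, timestamp_seconds, url)` is also hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Request = Tuple[str, int, str]  # (visitor id, timestamp in seconds, url)

def sessionize(requests: List[Request], max_gap_seconds: int = 1800) -> List[List[str]]:
    """Group requests per visitor and split wherever the idle gap is too long."""
    by_visitor: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for visitor, ts, url in requests:
        by_visitor[visitor].append((ts, url))

    sessions: List[List[str]] = []
    for visitor_hits in by_visitor.values():
        visitor_hits.sort()  # chronological order per visitor
        current: List[str] = []
        last_ts = None
        for ts, url in visitor_hits:
            if last_ts is not None and ts - last_ts > max_gap_seconds:
                sessions.append(current)  # gap too long: close the session
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions

# Hypothetical request log.
demo_requests = [
    ("a", 0, "/x"), ("a", 100, "/y"), ("a", 5000, "/z"), ("b", 50, "/x"),
]
```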
A model of Web Usage Patterns: • (ParamA,1)(ShortList,1)(ShortList,2)(TextOnlyDescr,1)(TextOnlyDescr,2) • (ParamA,1)(LongList,1)(ParamA&B,1)(LongList,2)(TextOnlyDescr,1) • (ParamA,1)(LongList,1)(ButtonX,1)(LongList,2)
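The (page, occurrence) pairs in the sessions above can be produced from a plain sequence of service descriptors with a running counter. The sketch assumes the second component counts how many times the page has appeared so far in the session, which matches the examples shown. The function name is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

def with_occurrences(session: List[str]) -> List[Tuple[str, int]]:
    """Annotate each page with its occurrence number within the session."""
    seen: Counter = Counter()
    annotated = []
    for page in session:
        seen[page] += 1
        annotated.append((page, seen[page]))
    return annotated

first_session = ["ParamA", "ShortList", "ShortList", "TextOnlyDescr", "TextOnlyDescr"]
```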
Generalized sequences and Navigation Patterns: • A session generally belongs to one user, but we are interested in the behavior of multiple users. • We are interested not only in the frequent patterns but also in the less frequent ones. • A generalized sequence or “g-sequence” is a vector comprising page occurrences and wildcards. • The “navigation pattern of a g-sequence” is then the group of subsequences matching the g-sequence. • Example g-sequence: (ParamA,1) [0;3] (TextOnlyDescr,1)
Generalized sequences and Navigation Patterns: • Let U be the set of (page, occurrence number) pairs recorded in a web site. • Let L be a sequence log over elements of U. • Let g be a g-sequence over elements of U as well. • The “hits” of g, hits(g), is the number of sequences in L that are matched by g.
Generalized sequences and Navigation Patterns: • Let $g = g_1 * g_2 * \dots * g_n$ be a g-sequence. For each $i = 1, \dots, n-1$ and for each $j > i$, the “confidence of $g_j$ towards $g_i$” is the ratio of the number of sequences containing $g_1 * g_2 * \dots * g_j$ to the number of sequences containing $g_1 * g_2 * \dots * g_i$: $\mathrm{confidence}(g_j, g_i) = \frac{\mathrm{hits}(g_1 * g_2 * \dots * g_j)}{\mathrm{hits}(g_1 * g_2 * \dots * g_i)}$ • The conversion efficiency is then a confidence: $\mathrm{conveff}(S, T, G) = \mathrm{confidence}(T, S)$
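The definitions of hits and confidence can be tried out on the three example sessions shown earlier. The sketch below is not WUM's algorithm. It assumes a g-sequence is given as a list of (page, occurrence) elements plus one wildcard range `(low, high)` between each pair of neighbours, where the range bounds the number of pages skipped. All function names are hypothetical.

```python
from typing import List, Tuple

Element = Tuple[str, int]   # (page, occurrence number)
Gap = Tuple[int, int]       # wildcard [low; high] between two elements

def matches(session: List[Element], elements: List[Element], gaps: List[Gap]) -> bool:
    """True if the session contains the elements in order, respecting the gaps."""
    def match_from(pos: int, k: int) -> bool:
        # session[pos] is already matched to elements[k]
        if k == len(elements) - 1:
            return True
        low, high = gaps[k]
        for skipped in range(low, high + 1):
            nxt = pos + 1 + skipped
            if nxt < len(session) and session[nxt] == elements[k + 1]:
                if match_from(nxt, k + 1):
                    return True
        return False
    return any(session[i] == elements[0] and match_from(i, 0)
               for i in range(len(session)))

def hits(log: List[List[Element]], elements: List[Element], gaps: List[Gap]) -> int:
    """Number of sequences in the log matched by the g-sequence."""
    return sum(1 for s in log if matches(s, elements, gaps))

def confidence(log, elements, gaps, i: int, j: int) -> float:
    """Confidence of g_j towards g_i (1-based indices, j > i)."""
    base = hits(log, elements[:i], gaps[:i - 1])
    return hits(log, elements[:j], gaps[:j - 1]) / base if base else 0.0

# The three example sessions from the usage-pattern slide.
example_log = [
    [("ParamA", 1), ("ShortList", 1), ("ShortList", 2), ("TextOnlyDescr", 1), ("TextOnlyDescr", 2)],
    [("ParamA", 1), ("LongList", 1), ("ParamA&B", 1), ("LongList", 2), ("TextOnlyDescr", 1)],
    [("ParamA", 1), ("LongList", 1), ("ButtonX", 1), ("LongList", 2)],
]
g_elements = [("ParamA", 1), ("TextOnlyDescr", 1)]
g_gaps = [(0, 3)]
```

With this representation, `confidence(example_log, g_elements, g_gaps, 1, 2)` plays the role of conveff for (ParamA,1) towards (TextOnlyDescr,1).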
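Generalized sequences and Navigation Patterns: • [Figure: navigation pattern of the g-sequence g = (ParamA,1) [0;3] (TextOnlyDescr,1), drawn as a tree whose nodes carry hit counts. Root: (ParamA,1), 3. First branch: (ShortList,1), 1 → (ShortList,2), 1 → (TextOnlyDescr,1), 1. Second branch: (LongList,1), 2 → (ParamA&B,1), 1 → (LongList,2), 1 → (TextOnlyDescr,1), 1.] • The conversion efficiency of (ParamA,1) towards the target page (TextOnlyDescr,1) over the paths G matching g is given by $\mathrm{conveff}((\mathrm{ParamA},1), (\mathrm{TextOnlyDescr},1), G) = \frac{\mathrm{hits}(g)}{\mathrm{hits}((\mathrm{ParamA},1))} = \frac{1 + 1}{3} = \frac{2}{3}$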
Pattern Templates and Pattern Discovery • We seek all the patterns that have certain properties, such as a minimal frequency within a population. • In WUM we can specify complex constraints on pattern structure using the mining language MINT. • MINT supports templates. • A template is a vector of variables and wildcards; the variables are constrained by the statistics and content of events (here, page occurrences).
Pattern Templates and Pattern Discovery • Example of a template in MINT:
select t
from node as x y, template # x [0;3] y as t
where y.url contains "descr"
and y.occurrence = 1
and (y.support / x.support) >= 0.2
and x.support >= 30
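To make the constraints in the template concrete, here is a rough Python analogue. It is not MINT and does not claim to reproduce WUM's semantics. It assumes that `x.support` is the number of sessions containing x, and that `y.support` is the number of sessions in which y follows x with at most three pages in between. The substring test ignores case, which is also an assumption, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Element = Tuple[str, int]  # (page, occurrence number)

def find_patterns(
    log: List[List[Element]],
    max_gap: int = 3,
    url_substring: str = "descr",
    min_confidence: float = 0.2,
    min_support: int = 30,
) -> Dict[Tuple[Element, Element], float]:
    """Return (x, y) pairs satisfying the constraints, with support(y)/support(x)."""
    x_support: Counter = Counter()
    pair_support: Counter = Counter()
    for session in log:
        seen_x, seen_pairs = set(), set()
        for i, x in enumerate(session):
            seen_x.add(x)
            # y may follow x after 0..max_gap intermediate pages
            for j in range(i + 1, min(i + max_gap + 2, len(session))):
                page, occurrence = session[j]
                if url_substring in page.lower() and occurrence == 1:
                    seen_pairs.add((x, session[j]))
        x_support.update(seen_x)        # each session counts once per x
        pair_support.update(seen_pairs)  # and once per (x, y) pair
    return {
        pair: count / x_support[pair[0]]
        for pair, count in pair_support.items()
        if x_support[pair[0]] >= min_support
        and count / x_support[pair[0]] >= min_confidence
    }

# The three example sessions from the usage-pattern slide.
tiny_log = [
    [("ParamA", 1), ("ShortList", 1), ("ShortList", 2), ("TextOnlyDescr", 1), ("TextOnlyDescr", 2)],
    [("ParamA", 1), ("LongList", 1), ("ParamA&B", 1), ("LongList", 2), ("TextOnlyDescr", 1)],
    [("ParamA", 1), ("LongList", 1), ("ButtonX", 1), ("LongList", 2)],
]
```

On a three-session log the support threshold of 30 filters everything out, so the threshold has to be lowered, e.g. `find_patterns(tiny_log, min_support=1)`, to see any pairs.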
A procedure to evaluate success on the basis of navigation patterns • Which patterns should be discovered as a basis for computing the success values? • This involves a human expert's experience, background knowledge and intuition. • Contact Efficiency • Conversion Efficiency • Comparisons between customer and non-customer sessions.
Contact Efficiency of Action Pages: • Compute it as discussed earlier. • Improvement of contact efficiency involves the identification of rarely reached target pages. • If a page C is often used but visitors rarely go on to the target page, C should be redesigned.
Conversion Efficiency of all Sessions: • Active sessions only (remove inactive sessions). • Efficiency over • All paths • Long paths • Short paths. • A heuristic procedure is followed. • The heuristic indicates that a frequent page may show low conversion efficiency because it is • Unintuitive to users • Followed by a poorly designed page
Analysis of Customer and Non-Customer Sessions: • Partition the log into a “customer log” of customer sessions and a “non-customer log” of non-customer sessions. • Process the customer log to discover frequent navigation patterns that lead from an action page to a target page. • Compare them with patterns in the non-customer log that start from the same action page and have similar frequency. • Inspect the differences in navigation behavior between the two kinds of sessions.
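The partition-and-compare step can be sketched as follows. As a simplifying assumption, a session counts as a customer session when it contains a target page, and "navigation behavior" is reduced to which page directly follows a given action page. The real analysis compares full navigation patterns, and all names here are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

def partition(active_sessions: List[List[str]], target_pages: Set[str]
              ) -> Tuple[List[List[str]], List[List[str]]]:
    """Split active sessions into (customer log, non-customer log)."""
    customer, non_customer = [], []
    for s in active_sessions:
        (customer if any(p in target_pages for p in s) else non_customer).append(s)
    return customer, non_customer

def next_page_counts(log: List[List[str]], action_page: str) -> Counter:
    """Count which page directly follows the action page in a log."""
    counts: Counter = Counter()
    for s in log:
        for here, nxt in zip(s, s[1:]):
            if here == action_page:
                counts[nxt] += 1
    return counts

# Hypothetical active log.
demo_active_log = [
    ["by_town", "list", "descr"],
    ["by_town", "help"],
    ["by_type", "descr"],
]
customer_log, non_customer_log = partition(demo_active_log, {"descr"})
```

Comparing `next_page_counts(customer_log, "by_town")` with `next_page_counts(non_customer_log, "by_town")` shows where the two groups diverge after the same action page.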
The SchulWeb Site : • The study was conducted on a real site called “SchulWeb” (www.schule.de). • The goal is to improve the search for schools. • The KDD steps discussed above were applied to SchulWeb’s log.
The SchulWeb Site : • Action Page : Query Strategy • Federal State (FS) • School Type (ST) • Arbitrary Text (T) • Target Page : Describing a school • An active session is a customer session if the target page was accessed and the stay exceeded 7 minutes. • According to the active log there were 725 customer sessions and 549 non-customer sessions.
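The customer-session rule used for SchulWeb can be expressed as a small predicate. The slides do not say how the "stay" is measured. This sketch assumes it is the time between the first and last request of the session, and the record layout `(timestamp_seconds, page)` is hypothetical.

```python
from typing import List, Set, Tuple

def is_customer_session(
    session: List[Tuple[int, str]],       # chronologically ordered (timestamp, page)
    target_pages: Set[str],
    min_stay_seconds: int = 7 * 60,        # 7 minutes, as stated in the slides
) -> bool:
    """True if a target page was accessed and the stay exceeded the minimum."""
    if not session:
        return False
    reached_target = any(page in target_pages for _, page in session)
    # Assumption: stay = time between the first and last request of the session.
    stay = session[-1][0] - session[0][0]
    return reached_target and stay > min_stay_seconds

long_visit = [(0, "search"), (200, "school_descr"), (500, "school_descr")]
short_visit = [(0, "search"), (100, "school_descr")]
no_target = [(0, "search"), (1000, "search")]
```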
The SchulWeb Site : • Used MINT to query the customer log and evaluate the conversion efficiency.
The SchulWeb Site : • Used MINT to query the non-customer log and evaluate the conversion efficiency.
The SchulWeb Site : • Results • Users are reluctant to use the search engines on the site. • They are interested in schools in a given area. • Whether a user becomes a “customer” depends on whether the result is shown on the first two result pages. • Make regional search efficient and easy to access.
The SchulWeb Site : • Improvements to be made. • Town was made the default button, so the option of searching by town became clear at first glance. • The search interface was changed to support more town-based searches. • After modifications : • Further mining sessions were conducted to study the results after the changes.
The SchulWeb Site : • Impact of changes on all active sessions :
Conclusions • There is more to KDD than Data Mining. • Data Mining is just a part of it. • The model satisfied all the requirements that are needed to efficiently calculate the success of a website. • The navigational behavior of users was studied using WUM. • Heuristics were proposed to discover useful patterns in the log and to use the patterns to measure and improve the success of web sites.
Future Work : • Construct concept hierarchies through a convenient user interface. • Add a batch mode with the query generator for more efficient processing. • Reflect more characteristics of site usage. • Extend WUM to better support the post-mining phase. • Refine the heuristic procedures used.