Web Usage Mining: Processes and Applications Qiaoyuan Jiang CSE 8331 November 24, 2003
Outline • Brief overview of Web mining • Web usage mining • Application areas of Web usage mining • Future research directions • Conclusions
Web Mining • Web Mining is the application of data mining techniques to discover and retrieve useful information and patterns from the World Wide Web documents and services [Etzioni, 1996].
Web Mining Categories • Web Content Mining: extracting knowledge from the content of the Web • Web Structure Mining: discovering the model underlying the link structures of the Web • Web Usage Mining: discovering users’ navigation patterns and predicting users’ behavior
Web Usage Mining Processes • Preprocessing: conversion of the raw data into the data abstractions (users, sessions, episodes, clickstreams, and pageviews) needed to apply the data mining algorithms. • Pattern Discovery: the key component of WUM, which draws on algorithms and techniques from data mining, machine learning, statistics, pattern recognition, and related research areas. • Pattern Analysis: validation and interpretation of the mined patterns
Web Usage Mining - Preprocessing • Data Cleaning: remove outliers and/or irrelevant data • User Identification: associate page references with different users • Session Identification: divide all pages accessed by a user into sessions • Path Completion: add important page access records that are missing from the access log due to browser and proxy server caching • Formatting: format the sessions according to the type of data mining to be performed.
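The user and session identification steps above can be sketched as follows. This is a minimal illustration, not the method from the slides: it assumes users are identified by IP address alone, that a new session starts after 30 minutes of inactivity (a commonly used but assumed threshold), and that log records are already parsed into hypothetical `(user, timestamp_seconds, url)` tuples.

```python
from collections import defaultdict

# Assumed inactivity threshold in seconds; the slides do not specify one.
SESSION_TIMEOUT = 30 * 60


def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp, url) records into per-user lists of sessions."""
    # User identification: here simply "same identifier means same user".
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    # Session identification: split each user's hits on long inactivity gaps.
    sessions = defaultdict(list)
    for user, hits in by_user.items():
        hits.sort()
        current, last_ts = [], None
        for ts, url in hits:
            if last_ts is not None and ts - last_ts > timeout:
                sessions[user].append(current)
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions[user].append(current)
    return dict(sessions)


# Hypothetical, already-cleaned log records.
log = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.1", 120, "/products.html"),
    ("10.0.0.2", 200, "/index.html"),
    ("10.0.0.1", 4000, "/index.html"),
]
sessions = sessionize(log)
```

Real logs usually need more than this (for example user-agent or cookie information to separate users behind one proxy), and path completion is not attempted here.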
Web Usage Mining - Pattern Discovery Tasks • Statistical Analysis • Clustering • Classification • Association Rules • Sequential Patterns • Dependency Modeling
Web Usage Mining - Pattern Discovery Tasks (Cont.) • Statistical Analysis: frequency analysis, mean, median, etc. • Improve system performance • Provide support for marketing decisions • Simplify site modification tasks • Clustering: • Clustering of users helps to discover groups of users with similar navigation patterns => provide personalized Web content • Clustering of pages helps to discover groups of pages having related content => search engines
Web Usage Mining - Pattern Discovery Tasks (Cont.) • Classification: the technique of mapping a data item into one of several predefined classes • Develop profiles of users belonging to a particular class or category • Association Rules: discover correlations among pages accessed together by a client • Help restructure the Web site • Page prefetching • Develop e-commerce marketing strategies
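To make the association-rule task concrete, here is a small sketch that computes support and confidence for single-page rules of the form "page A => page B" over a set of sessions. It is a brute-force illustration under assumed inputs (the session data and thresholds are hypothetical), not an efficient miner such as Apriori.

```python
from itertools import permutations


def pair_rules(sessions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for page pairs."""
    n = len(sessions)
    page_sets = [set(s) for s in sessions]
    pages = set().union(*page_sets)

    def support(items):
        # Fraction of sessions that contain every page in `items`.
        return sum(1 for s in page_sets if items <= s) / n

    rules = []
    for a, b in permutations(sorted(pages), 2):
        sup = support({a, b})
        if sup < min_support:
            continue
        conf = sup / support({a})  # confidence of the rule a => b
        if conf >= min_confidence:
            rules.append((a, b, sup, conf))
    return rules


# Hypothetical sessions, each a list of pages accessed together.
sessions = [
    ["/home", "/cart", "/checkout"],
    ["/home", "/cart"],
    ["/home", "/about"],
    ["/cart", "/checkout"],
]
rules = pair_rules(sessions, min_support=0.5, min_confidence=0.9)
```

With these thresholds only the rule "/checkout => /cart" survives: both pages co-occur in half of the sessions, and every session containing "/checkout" also contains "/cart".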
Web Usage Mining - Pattern Discovery Tasks (Cont.) • Sequential Patterns: extract frequently occurring inter-session patterns such that the presence of a set of items is followed by another item in time order • Predict future user visit patterns => placing ads or recommendations • Page prefetching • Dependency Modeling: determine whether there are any significant dependencies among the variables in the Web domain • Predict future Web resource consumption • Develop business strategies to increase sales • Improve navigational convenience for users
Web Usage Mining - Pattern Analysis • Pattern Analysis is the final stage of WUM, which involves the validation and interpretation of the mined patterns • Validation: eliminate the irrelevant rules or patterns and extract the interesting rules or patterns from the output of the pattern discovery process • Interpretation: the output of mining algorithms is mainly in mathematical form and not suitable for direct human interpretation
Web Usage Mining - Pattern Analysis Methodologies and Tools • Visualization: helps people understand both real and abstract concepts • WebViz: the Web is visualized as a directed graph • Query mechanism: allows analysts to extract only relevant and useful patterns by specifying constraints. • WEBMINER • On-Line Analytical Processing (OLAP): enables analysts to perform ad hoc analysis of data in multiple dimensions for decision-making • WebLogMiner
WEBMINER Query Example • Finds all ARs with a minimum support of 1% and a minimum confidence of 90%. The analyst is only interested in clients from the “.edu” domain and in data later than Nov. 1st, 2003, where the page accesses start with URL A and contain B and C in that order: SELECT association-rules(A*B*C*) FROM log.data WHERE date>=031101 AND domain="edu" AND support = 1.0 AND confidence = 90.0
Application Areas for Web Usage Mining • Personalized: discover the preferences and needs of individual Web users in order to provide a personalized Web site for certain types of users • Impersonalized: examine general user navigation patterns in order to understand how general users use the site • System Improvement • Site Modification • Business Intelligence • Web Characterization
System Improvement • High performance of a Web application is expected since it directly affects user satisfaction • WUM provides a key to understanding Web traffic behavior • Applications • Develop policies for Web caching, network transmission, load balancing, or data distribution • Detect intrusion, fraud, and attempted break-ins to the system
Site Modification • Besides its content, the structure of a Web site is another crucial attribute for attracting users • WUM can provide detailed feedback on users’ navigation behavior, which can be used to redesign the Web site structure for users’ navigational convenience • Adaptive Web site project [Perkowitz & Etzioni, 1998-1999]
Business Intelligence • Information on how customers are using a Web site is critical for marketers of e-commerce businesses • WUM can support business process optimization and marketing decisions • Business intelligence includes personalization for C2B systems
Usage Characterization • Mining general usage patterns (not focused on any specific users or Web sites) helps in the study of how browsers are used and how users interact with a browser interface. • Makes it possible to look at the dynamics of the Web and how it is growing.
Personalization • Choosing among thousands of options is a challenge for Web users • Goal: provide users with dynamic content tailored to their individual interests • Form: recommending one or more items or pages to a user, based on the user’s profile and usage behavior, or the patterns of past visitors who have similar profiles. • Performance Measurement: • Effectiveness: accuracy + coverage • Scalability
Applications of Personalization • Customizing access to information sources • Filtering news or e-mails • Recommendation services for the browsing process • Tutoring systems • Search • More ...
3 Phases of Personalization • Data preparation and transformation: data cleaning, filtering, transaction identification • Pattern discovery: discover usage patterns • Recommendation: generate personalized content for a user based on matching the user’s session (online process).
Personalization Techniques – Collaborative Filtering (CF) • Pattern discovery: an online kNN algorithm is applied to user profiles in a given domain to match people who have the same taste. • Recommendation: pages or items that are of interest to the k nearest neighbors are assumed to be of interest to the active user as well. • Drawbacks: • Online process => lack of scalability • Static user profiles => low quality of recommendations
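A minimal sketch of the kNN idea above, under stated assumptions: each user profile is reduced to the set of pages the user has visited, similarity is the Jaccard coefficient (one of several possible choices; the slides do not name a measure), and the profile data is hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of pages."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(active, profiles, k=2):
    """Recommend pages visited by the k users most similar to `active`."""
    target = profiles[active]
    # Rank all other users by similarity to the active user, keep the top k.
    neighbors = sorted(
        ((jaccard(target, pages), user)
         for user, pages in profiles.items() if user != active),
        reverse=True,
    )[:k]
    # Score each unseen page by the summed similarity of neighbors who saw it.
    scores = {}
    for sim, user in neighbors:
        for page in profiles[user] - target:
            scores[page] = scores.get(page, 0.0) + sim
    return sorted(scores, key=lambda p: (-scores[p], p))


# Hypothetical user profiles: user id -> set of visited pages.
profiles = {
    "u1": {"/a", "/b", "/c"},
    "u2": {"/a", "/b", "/d"},
    "u3": {"/a", "/c", "/e"},
    "u4": {"/x", "/y"},
}
suggestions = recommend("u1", profiles, k=2)
```

Every call compares the active user against all stored profiles, which is one way to see the scalability drawback listed above.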
Personalization Techniques – Clustering • Technique: clustering user transactions and pageviews. • Advantages: • User preference is automatically learned from usage data and therefore up-to-date. • Better scalability through clustering • Drawbacks: • Low accuracy
Personalization Techniques – Association Rules (ARs) • Technique: • For each user, create a transaction containing all the items the user has ever accessed. • Find all rules that satisfy the given support and confidence. • For each active user, find all the rules supported by the user. Items predicted by these rules are the candidate recommendations • Drawbacks: • All association rules must be discovered prior to generating recommendations. This can be improved by generating ARs in real time from a subset of transactions within the active user’s neighborhood • High support => better scalability and accuracy, low coverage.
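The recommendation step above can be sketched as follows, assuming the rules have already been mined offline. The rule representation `(antecedent, consequent, confidence)`, the example rules, and the choice to rank candidates by confidence are all assumptions made for illustration.

```python
def recommend_from_rules(visited, rules):
    """Return candidate items predicted by rules the active user supports."""
    best = {}
    for antecedent, consequent, confidence in rules:
        # A rule is supported when the user has accessed its whole antecedent.
        if antecedent <= visited and consequent not in visited:
            best[consequent] = max(best.get(consequent, 0.0), confidence)
    # Rank candidates by the strongest rule that predicts them.
    return sorted(best, key=lambda item: (-best[item], item))


# Hypothetical pre-mined rules: (antecedent, consequent, confidence).
rules = [
    (frozenset({"/camera"}), "/memory-card", 0.8),
    (frozenset({"/camera", "/tripod"}), "/camera-bag", 0.9),
    (frozenset({"/laptop"}), "/mouse", 0.7),
    (frozenset({"/tripod"}), "/camera", 0.6),
]
candidates = recommend_from_rules({"/camera", "/tripod"}, rules)
```

A user whose items match no rule antecedent gets an empty list, which illustrates the coverage problem that comes with a high support threshold.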
Personalization Techniques – Sequential Patterns (SPs) • Technique: Markov Model • Advantages: • Better accuracy: SPs contain more precise information about user navigation behavior. • Drawbacks: • Low recommendation coverage • More suitable for predictive tasks, e.g., Web prefetching
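As an illustration of the Markov model technique, the sketch below estimates first-order transition probabilities from sessions and predicts the most likely next page. The first-order assumption and the session data are choices made for this example; the slides do not specify the order of the model.

```python
from collections import Counter, defaultdict


def train(sessions):
    """Estimate P(next page | current page) from ordered sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for page, followers in counts.items()
    }


def predict_next(model, page):
    """Return the most probable next page, or None if `page` was never left."""
    options = model.get(page)
    if not options:
        return None
    # Iterate in sorted order so ties are broken alphabetically.
    return max(sorted(options), key=options.get)


# Hypothetical sessions, each an ordered list of page requests.
sessions = [
    ["/home", "/search", "/item"],
    ["/home", "/search", "/cart"],
    ["/home", "/about"],
    ["/search", "/item"],
]
model = train(sessions)
next_page = predict_next(model, "/home")
```

Here `predict_next` returns `None` for any page with no observed successor, so the model can only make predictions along paths it has already seen; this is one way the low coverage listed above can show up.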
Personalization Techniques – Hybrid Models • Hybrid models automatically switch among different personalization models based on the localized degree of hyperlink connectivity. • High connectivity degree => non-SP models • Low connectivity degree and deeper navigation path => SP models • Performance: better than any individual model
Future Research Directions • Usage Mining on Semantic Web • Help to build semantic Web • With semantic Web, WUM can be improved • Multimedia Web Data Mining • Representation, problem solving and learning from Multimedia data is indeed a challenge
Future Research Directions (Cont.) • Soft Computing Technology for Web Mining • Fuzzy logic: dealing with imprecise and conceptual data. Used in clustering Web log data and mining ARs. • Neural network: • Adaptive to new data and information • Suitable for parallel processing • Robust to missing, confusing, ill-defined data • Capable of modeling non-linear decision boundaries • Effective for learning user profiles • Genetic algorithm: randomized search and optimization guided by evaluation criteria. • Efficient, adaptive, robust, parallel processing • Used in search and query optimization and in predicting user preferences
Future Research Directions (Cont.) • Analysis of Discovered Patterns • Research on efficient, flexible and powerful analysis tools • More Applications • Temporal evolutions of usage behavior • Improving Web services • Detect credit card fraud • Privacy issues