Web Mining Research: A Survey Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007
Outline • Introduction • Web Mining • Web Content Mining • Web Structure Mining • Web Usage Mining • Conclusion & Exam Questions
Four Problems • Finding relevant information • Low precision • Unindexed information • Creating new knowledge out of available information on the web • Personalizing the information • Catering to personal preference in content and presentation • Learning about the consumers • What does the customer want to do? • Using web data to effectively market products and/or services
Other Approaches • Web mining is not the only approach • Database approach (DB) • Information retrieval (IR) • Natural language processing (NLP) • In-depth syntactic and semantic analysis • Web document community • Standards, manually appended meta-information, maintained directories, etc
Direct vs Indirect Web Mining • Web mining techniques can be used to solve the information overload problems: • Directly • Attack the problem with web mining techniques • E.g. newsgroup agent classifies news as relevant • Indirectly • Used as part of a bigger application that addresses problems • E.g. used to create index terms for a web search service
The Research • Converging research from: Database, information retrieval, and artificial intelligence (specifically NLP and machine learning) • Paper focuses on research from the machine learning point of view
Web Mining: Definition • “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.” • Can be viewed as four subtasks • Not the same as Information Retrieval • Not the same as Information Extraction
Web Mining: Subtasks • Resource finding • Retrieving intended documents • Information selection/pre-processing • Select and pre-process specific information from selected documents • Generalization • Discover general patterns within and across web sites • Analysis • Validation and/or interpretation of mined patterns
Web Mining: Not IR or IE • Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible • Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)
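The IR goal stated above (retrieve all relevant documents, and as few non-relevant ones as possible) is usually measured with recall and precision. The sketch below is illustrative only: the function name `precision_recall` and the document IDs are made up for the example and do not come from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5"]
p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Two of the four retrieved documents are relevant, and two of the three relevant documents were found, so a system can score well on one measure while doing poorly on the other.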
Web Mining: Not IR or IE • Information extraction (IE) aims to extract the relevant facts from given documents while IR aims to select the relevant documents • IE systems for the general Web are not feasible • Most focus on specific Web sites or content
Web Mining and Machine Learning • As a broad subfield of artificial intelligence, machine learning is concerned with the development of algorithms and techniques that allow computers to "learn". • Web mining is not the same as learning from the Web. • Some applications of machine learning on the Web are not Web mining • Web mining also uses methods other than machine learning • However, there is a close relationship between Web mining and machine learning.
Web Mining Categories • Web Content Mining • Discovering useful information from web contents/data/documents. • IR view for finding • DB view for modeling • Web Structure Mining • Discovering the model underlying link structures (topology) on the Web • E.g. discovering authorities and hubs • Web Usage Mining • Make sense of data generated by surfers • Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
Web Content Data Structure • Unstructured – free text • Semi-structured – HTML • More structured – Table or Database generated HTML pages • Multimedia data – receive less attention than text or hypertext
Web Mining: The Agent Paradigm • User Interface Agents • information retrieval agents, information filtering agents, & personal assistant agents. • Distributed Agents • distributed agents for knowledge discovery or data mining. • Problem solving by a group of agents • Mobile Agents
Web Mining: The Agent Paradigm • Content-based approach • The system searches for items that match based on an analysis of the content using the user preferences. • Collaborative approach • The system tries to find users with similar interests • Recommendations given based on what similar users did
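The collaborative approach can be sketched in a few lines. This is a minimal illustration, not a method from the paper: it assumes each user is described by the set of pages they visited, uses Jaccard overlap as the similarity measure, and the names `jaccard`, `recommend`, and the sample data are all hypothetical.

```python
def jaccard(a, b):
    """Overlap between two sets of visited pages, between 0 and 1."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, visits, k=1):
    """Suggest pages seen by the k most similar users but not by `target`."""
    scored = [
        (jaccard(visits[target], pages), user)
        for user, pages in visits.items()
        if user != target
    ]
    scored.sort(reverse=True)  # most similar users first
    suggestions = set()
    for similarity, user in scored[:k]:
        if similarity > 0:  # ignore users with nothing in common
            suggestions |= set(visits[user]) - set(visits[target])
    return sorted(suggestions)


# Made-up browsing histories, for illustration only.
visits = {
    "alice": {"/news", "/sports", "/weather"},
    "bob": {"/news", "/sports", "/finance"},
    "carol": {"/recipes", "/travel"},
}
print(recommend("alice", visits))
```

A content-based system would instead compare page contents against the user's stated or learned preferences, without looking at other users.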
Web Content Mining: IR View • Unstructured Documents • Bag of words, or phrase-based feature representation • Features can be boolean or frequency based • Features can be reduced using different feature selection techniques • Word stemming, combining morphological variations into one feature
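A bag-of-words representation with boolean or frequency features might look like the sketch below. The stemmer here is a deliberately crude suffix stripper used only as a stand-in for a real one (such as the Porter stemmer); the function names and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip one common suffix; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, boolean=False):
    """Map each stemmed term to its count, or to 1 if boolean=True."""
    counts = Counter(crude_stem(w) for w in tokenize(text))
    if boolean:
        return {term: 1 for term in counts}
    return dict(counts)


doc = "Mining the Web: web miners mined links and linked pages"
freq_features = bag_of_words(doc)
bool_features = bag_of_words(doc, boolean=True)
print(freq_features)
```

Stemming merges "mining" and "mined" into one feature, and "links" and "linked" into another, which reduces the feature count before any further feature selection is applied.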
Web Content Mining: IR View • Semi-Structured Documents • Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks) • Uses common data mining methods (whereas unstructured might use more text mining methods)
Web Content Mining: DB View • Tries to infer the structure of a Web site or transform a Web site to become a database • Better information management • Better querying on the Web • Can be achieved by: • Finding the schema of Web documents • Building a Web warehouse • Building a Web knowledge base • Building a virtual database
Web Content Mining: DB View • Mainly uses the Object Exchange Model (OEM) • Represents semi-structured data (some structure, no rigid schema) by a labeled graph • Process typically starts with manual selection of Web sites for content mining • Main application: building a structural summary of semi-structured data (schema extraction or discovery)
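To make the idea of a labeled graph without a rigid schema concrete, the sketch below models semi-structured data as nested Python dicts whose keys act as edge labels, and then collects the distinct label paths as a crude structural summary. This is a simplification under stated assumptions: real OEM data is a graph and may contain shared nodes or cycles, which this tree-shaped example does not handle, and the sample data and the name `label_paths` are invented for illustration.

```python
def label_paths(node, prefix=()):
    """Collect every distinct path of edge labels reachable from `node`."""
    paths = set()
    if isinstance(node, dict):
        for label, child in node.items():
            # A label may lead to one object or to a list of objects.
            children = child if isinstance(child, list) else [child]
            path = prefix + (label,)
            paths.add(path)
            for c in children:
                paths |= label_paths(c, path)
    return paths


# Invented data: the two entries do not share the same structure.
site = {
    "restaurant": [
        {"name": "A", "address": {"street": "Main St", "city": "X"}},
        {"name": "B", "address": "Y", "phone": "000"},
    ]
}
summary = label_paths(site)
for path in sorted(summary):
    print(".".join(path))
```

The second entry stores its address as a plain string and adds a phone field, yet both entries are summarized by the same small set of label paths, which is the kind of information schema discovery tries to recover.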
Web Structure Mining • Interested in the structure between Web documents (not within a document) • Inspired by the study of social networks and citation analysis • Example: PageRank – Google • Application: Discovering micro-communities in the Web • Measuring the “completeness” of a Web site
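As an illustration of link-based ranking, here is a minimal power-iteration sketch of PageRank. It is a textbook-style simplification, not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the four-page graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a dict mapping page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no out-links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Made-up link graph, for illustration only.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C is linked from three other pages and ends up with the highest score, while D has no incoming links and receives only the baseline share.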
Web Usage Mining • Tries to predict user behavior from interaction with the Web • Wide range of data (logs) • Web client data • Proxy server data • Web server data • Two common approaches • Map usage data into relational tables before using adapted data mining techniques • Use log data directly by utilizing special pre-processing techniques
Web Usage Mining • Typical problems: Distinguishing among unique users, server sessions, episodes, etc., in the presence of caching and proxy servers • Usage Mining often uses some background or domain knowledge • E.g. site topology, Web content, etc.
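One common pre-processing heuristic for the user and session problem is sketched below. It is an assumption-laden illustration, not the paper's procedure: users are approximated by the (IP address, user agent) pair, a gap of more than 30 minutes starts a new session, and the log tuples are invented. Caching and shared proxies can still defeat this heuristic, which is exactly the difficulty the slide points out.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; a conventional but arbitrary choice


def sessionize(entries, timeout=SESSION_TIMEOUT):
    """Group (timestamp, ip, user_agent, url) hits into per-user sessions."""
    by_user = defaultdict(list)
    for ts, ip, agent, url in sorted(entries):
        # Approximate a "user" by IP plus user agent.
        by_user[(ip, agent)].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                # Long pause: close the current session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


# Invented log entries: timestamps are seconds from an arbitrary start.
log = [
    (0, "10.0.0.1", "Firefox", "/index.html"),
    (120, "10.0.0.1", "Firefox", "/products.html"),
    (200, "10.0.0.1", "Safari", "/index.html"),
    (4000, "10.0.0.1", "Firefox", "/index.html"),
]
sessions = sessionize(log)
for user, pages in sessions:
    print(user, pages)
```

The same IP address yields three sessions here: two for the Firefox client, split by the long pause, and one for the Safari client.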
Web Usage Mining • Two main categories: • Learning a user profile (personalized) • Web users would be interested in techniques that learn their needs and preferences automatically • Learning user navigation patterns (impersonalized) • Information providers would be interested in techniques that improve the effectiveness of their Web site or bias users toward the goals of the site
Conclusions • Tried to resolve confusion with regard to the term Web Mining • Differentiated from IR and IE • Suggested three Web mining categories: • Content, Structure, and Usage Mining • Briefly described approaches for the three categories • Explored connection with agent paradigm
Exam Question #1 • Question: Outline the main characteristics of Web information. • Answer: Web information is huge, diverse, and dynamic.
Exam Question #2 • Question: How can data mining techniques be used in Web information analysis? Give at least two examples. • Answer: • Classification: classification on server logs, using a decision tree or Naïve Bayes classifier, can discover the profiles of users belonging to a particular class • Clustering: clustering can be used to group users exhibiting similar browsing patterns. • Association analysis: association analysis can be used to relate pages that are most often referenced together in a single server session.
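The association analysis example in the answer above can be sketched as a simple pair count over sessions. This is a minimal illustration rather than a full association rule miner such as Apriori: it only finds page pairs whose support (the fraction of sessions containing both pages) meets a threshold, and the function name, threshold, and session data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(sessions, min_support=0.5):
    """Return page pairs that co-occur in at least min_support of sessions."""
    pair_counts = Counter()
    for pages in sessions:
        # Count each unordered pair at most once per session.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    n = len(sessions)
    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }


# Made-up server sessions, for illustration only.
example_sessions = [
    ["/home", "/cart", "/checkout"],
    ["/home", "/cart"],
    ["/home", "/blog"],
    ["/cart", "/checkout"],
]
pairs = frequent_pairs(example_sessions)
for pair, support in sorted(pairs.items()):
    print(pair, support)
```

With a support threshold of 0.5, only the pairs that appear together in at least two of the four sessions are reported.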
Exam Question #3 • Question: What are the three main areas of interest for Web mining? • Answer: (1) Web Content (2) Web Structure (3) Web Usage