Web Mining Research: A Survey

Web Mining Research: A Survey
Authors: Raymond Kosala and Hendrik Blockeel
ACM SIGKDD, July 2000
Presented by Shan Huang, 4/24/2007
Revised and presented by Fan Min, 4/22/2009
Revised and presented by Nima [Poornima Shetty]
Date: 12/06/2011
Course: Data Mining [CS332]
Computer Science Department, University of Vermont

Outline
• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions

Introduction
• With the huge amount of information available online, the World Wide Web is a fertile area for data mining research.
• The WWW is a popular and interactive medium for circulating information today.
• The Web is huge, diverse, and dynamic.
• These three properties raise scalability, multimedia data, and temporal issues, respectively.

Four Problems
• Finding relevant information
  • Low precision and unindexed information
• Creating new knowledge out of the information available on the Web
  • A data-triggered process
• Personalizing the information
  • Personal preferences in content and presentation of the information
• Learning about the consumers
  • What does the customer want to do?

Other Approaches
Web mining is NOT the only approach
• Database approach (DB)
• Information retrieval (IR)
• Natural language processing (NLP)
• Web document community

Direct vs. Indirect Web Mining
• Web mining techniques can be used to solve the information overload problems:
  • Directly: address the problem with web mining techniques. E.g. a newsgroup agent classifies whether a news item is relevant.
  • Indirectly: used as part of a bigger application that addresses the problems. E.g. used to create index terms for a web search service.

The Research
• Converging research from: database, information retrieval, and artificial intelligence (specifically NLP and machine learning)
• The survey attempts to organize the existing research in a structured way, from the machine learning point of view

Web Mining: Definition
• “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
• Can be viewed as four subtasks

Web Mining: Subtasks
• Resource finding
  • Retrieving the intended web documents
• Information selection and pre-processing
  • Select and pre-process specific information from the retrieved documents
  • A kind of transformation of the original data retrieved in the IR process
  • This transformation could be a kind of pre-processing
• Generalization
  • Discover general patterns within and across web sites
• Analysis
  • Validation and/or interpretation of the mined patterns

Web Mining and Information Retrieval
• Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
• Goal: indexing text and searching for useful documents in a collection.
• Research in IR: modeling, document classification and categorization, user interfaces, data visualization, filtering, etc.
• Web document classification, which is a Web mining task, could be part of an IR system (e.g. indexing for a search engine)
• Viewed in this respect, Web mining is part of the (Web) IR process.
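The IR goal stated above (retrieve all relevant documents, and as few non-relevant ones as possible) is commonly quantified as recall and precision. The sketch below is only an illustration; the function name and the document ids are made up for this example and do not come from the survey.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = share of retrieved documents that are relevant
    recall    = share of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments for one query.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, so the two measures pull in different directions: returning more documents can raise recall while lowering precision.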

Web Mining and Information Extraction
• Information Extraction (IE): transforming a collection of documents into information that is more easily understood and analyzed.
• Building IE systems manually for the general Web is not feasible
• Most IE systems focus on specific Web sites or on specific content to extract

Compare IR and IE
• IR aims to select relevant documents
• IE aims to extract the relevant facts from given documents
• IR views the text in a document just as a bag of unordered words
• IE is interested in the structure or representation of a document

Web Mining and The Agent Paradigm
• Web mining is often viewed from or implemented within an agent paradigm.
• Web mining has a close relationship with intelligent agents.
• User Interface Agents
  • Information retrieval agents, information filtering agents, and personal assistant agents.
• Distributed Agents
  • Concerned with problem solving by a group of agents.
  • Distributed agents for knowledge discovery or data mining.
• Mobile Agents

Web Mining and The Agent Paradigm (contd.)
• Two frequently used approaches for developing intelligent agents:
  • Content-based approach
    • The system searches for items that match the user preferences, based on an analysis of the item content.
  • Collaborative approach
    • The system tries to find users with similar interests to give recommendations to.
    • Analyzes the user profiles and sessions or transactions.
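The collaborative approach can be illustrated with a small sketch: find the user whose profile is most similar to the target user, then suggest what that neighbour has and the target lacks. Cosine similarity, the function names, and the toy profiles are all assumptions chosen for illustration; the survey does not prescribe a particular similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse profiles (dict: item -> rating)."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(target, profiles):
    """Return (nearest user, items that user rated which target has not)."""
    others = [name for name in profiles if name != target]
    nearest = max(others, key=lambda name: cosine(profiles[target], profiles[name]))
    unseen = sorted(set(profiles[nearest]) - set(profiles[target]))
    return nearest, unseen


# Made-up user profiles: page or item id -> interest score.
profiles = {
    "ann": {"a": 5, "b": 3},
    "bob": {"a": 4, "b": 3, "c": 5},
    "eve": {"d": 5, "e": 4},
}
print(recommend("ann", profiles))
```

A content-based agent would instead compare the target user's preferences against features of the items themselves, without looking at other users.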

Web Mining Categories
• Web Content Mining
  • Discovering useful information from web page contents/data/documents.
• Web Structure Mining
  • Discovering the model underlying the link structures (topology) on the Web, e.g. discovering authorities and hubs
• Web Usage Mining
  • Extraction of interesting knowledge from logging information produced by web servers.
  • Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.

Web Content Data Structure
Web content consists of several types of data
• Text, image, audio, video, hyperlinks.
• Unstructured – free text
• Semi-structured – HTML
• More structured – data in tables or database-generated HTML pages
Note: much of the Web content data is unstructured text data.

Web Content Mining: IR View
• Unstructured Documents
  • Bag of words to represent unstructured documents
    • Takes a single word as a feature
    • Ignores the sequence in which words occur
  • Features could be
    • Boolean: a word either occurs or does not occur in a document
    • Frequency based: frequency of the word in a document
  • Variations of the feature selection include
    • Removing case, punctuation, infrequent words, and stop words
  • Features can be reduced using different feature selection techniques:
    • Information gain, mutual information, cross entropy.
    • Stemming, which reduces words to their morphological roots.
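A minimal sketch of the bag-of-words representation described above, producing both Boolean and frequency-based features. The stop-word list, the `min_df` threshold for dropping infrequent words, and the sample sentences are illustrative assumptions; stemming is left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase, drop punctuation and digits, and remove stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def bag_of_words(docs, min_df=1):
    """Return (vocabulary, Boolean matrix, frequency matrix).

    Word order is ignored; words occurring in fewer than min_df
    documents are treated as infrequent and dropped.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(w for c in counts for w in c)
    vocab = sorted(w for w, n in doc_freq.items() if n >= min_df)
    freq = [[c[w] for w in vocab] for c in counts]
    boolean = [[int(x > 0) for x in row] for row in freq]
    return vocab, boolean, freq


docs = ["The Web is huge, and the Web is dynamic.", "Mining the Web"]
vocab, boolean, freq = bag_of_words(docs)
print(vocab)    # ['dynamic', 'huge', 'mining', 'web']
print(boolean)  # [[1, 1, 0, 1], [0, 0, 1, 1]]
print(freq)     # [[1, 1, 0, 2], [0, 0, 1, 1]]
```

The only difference between the two feature types shows up in the `web` column of the first document: the Boolean feature records presence (1), the frequency feature records the count (2).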

Web Content Mining: IR View
• Semi-Structured Documents
  • Uses richer representations for features
    • Due to the additional structural information in the hypertext document (typically HTML and hyperlinks)
  • Uses common data mining methods (whereas unstructured documents might use more text mining methods)
  • Applications:
    • Hypertext classification or categorization and clustering,
    • learning relations between web documents,
    • learning extraction patterns or rules, and
    • finding patterns in semi-structured data.

Web Content Mining: DB View
• The database techniques on the Web are related to the problems of managing and querying the information on the Web.
• The DB view tries to infer the structure of a Web site or to transform a Web site into a database
  • Better information management
  • Better querying on the Web
• Can be achieved by:
  • Finding the schema of Web documents
  • Building a Web warehouse
  • Building a Web knowledge base
  • Building a virtual database

Web Content Mining: DB View
• The DB view mainly uses the Object Exchange Model (OEM)
  • Represents semi-structured data by a labeled graph
  • The data in the OEM is viewed as a graph, with objects as the vertices and labels on the edges
  • Each object is identified by an object identifier (oid) and has a value that is either atomic or complex
• The process typically starts with manual selection of Web sites for doing Web content mining
• Main applications:
  • The task of finding frequent substructures in semi-structured data
  • The task of creating a multi-layered database
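The OEM description above (objects as vertices, labels on edges, values either atomic or complex) can be sketched as a small data structure. This is an illustrative assumption about one possible encoding, not the representation used by any particular OEM system; the class name, helper, and oids such as `&1` are made up.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class OEMObject:
    """One vertex of the labeled graph.

    value is atomic (str, int, float) or complex: a list of
    (label, OEMObject) pairs, i.e. labeled outgoing edges.
    """
    oid: str
    value: Union[str, int, float, list]

    def is_atomic(self):
        return not isinstance(self.value, list)


def edge_labels(obj, seen=None):
    """Collect every edge label reachable from obj (safe on shared objects and cycles)."""
    seen = set() if seen is None else seen
    if obj.is_atomic() or obj.oid in seen:
        return set()
    seen.add(obj.oid)
    found = set()
    for label, child in obj.value:
        found.add(label)
        found |= edge_labels(child, seen)
    return found


# A tiny example graph describing one paper.
title = OEMObject("&2", "Web Mining Research: A Survey")
year = OEMObject("&3", 2000)
paper = OEMObject("&1", [("title", title), ("year", year)])
print(sorted(edge_labels(paper)))  # ['title', 'year']
```

Because the schema lives in the edge labels rather than in a fixed table definition, two `paper` objects may carry different sets of labels, which is what makes the data semi-structured.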

Web Structure Mining
• Interested in the structure of the hyperlinks within the Web
• Inspired by the study of social networks and citation analysis
• Can discover specific types of pages (such as hubs, authorities, etc.) based on the incoming and outgoing links.
• Applications:
  • Discovering micro-communities in the Web,
  • measuring the “completeness” of a Web site
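One common way to score hubs and authorities from incoming and outgoing links is the iterative HITS scheme; the slide itself does not name an algorithm, so the sketch below is an assumption about how such pages could be found. The page names and the tiny link graph are made up.

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority scores.

    links: dict mapping each page to the list of pages it links to.
    A good authority is linked to by good hubs; a good hub links to
    good authorities. Scores are L2-normalised after every round.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum the authority scores of all pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical link graph: h1 and h2 point to both a1 and a2, x only to a1.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "x": ["a1"]}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # a1
```

In this toy graph `a1` ends up as the top authority because it has the most incoming links from strong hubs, while `h1` and `h2` tie as the best hubs.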

Web Usage Mining
• Tries to predict user behavior from interaction with the Web
• Wide range of data (logs)
  • Web client data
  • Proxy server data
  • Web server data
• Two common approaches
  • Map the usage data of the Web server into relational tables before applying adapted data mining techniques
  • Use the log data directly by utilizing special pre-processing techniques
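The first approach (mapping server usage data into relational tables) might look like the following sketch. It assumes the server writes the Common Log Format and uses an in-memory SQLite table; the table name `hits`, its columns, and the sample log lines are made up for illustration.

```python
import re
import sqlite3

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def load_log(lines):
    """Parse log lines into a relational table; malformed lines are skipped."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            db.execute(
                "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
                (m["host"], m["time"], m["method"], m["path"], int(m["status"])),
            )
    return db


# Made-up sample log lines.
sample = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /papers.html HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2000:14:01:12 -0700] "GET /index.html HTTP/1.0" 404 -',
    'not a log line',
]
db = load_log(sample)
rows = db.execute("SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY path").fetchall()
print(rows)  # [('/index.html', 2), ('/papers.html', 1)]
```

Once the data is in tables, standard queries and mining techniques (for example, association rules over pages requested together) can be applied without any log-specific code.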

Web Usage Mining
• Typical problems:
  • Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
• Usage mining often uses some background or domain knowledge, e.g. site topology, Web content, etc.
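A frequently used heuristic for separating server sessions is an inactivity timeout: requests from the same user key belong to one session until the gap between consecutive requests exceeds a threshold. The sketch below assumes a 30-minute timeout and a pre-computed user key (for example host plus user agent); both are assumptions, and, as the slide notes, caching and proxy servers can make any such key unreliable.

```python
from collections import defaultdict


def sessionize(requests, timeout=1800):
    """Split requests into sessions.

    requests: iterable of (user_key, unix_time, path).
    Returns a list of (user_key, [(unix_time, path), ...]) sessions;
    a new session starts when the gap between two consecutive
    requests of the same user exceeds `timeout` seconds.
    """
    by_user = defaultdict(list)
    for user, ts, path in requests:
        by_user[user].append((ts, path))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()
        current = [hits[0]]
        for prev, cur in zip(hits, hits[1:]):
            if cur[0] - prev[0] > timeout:
                sessions.append((user, current))
                current = []
            current.append(cur)
        sessions.append((user, current))
    return sessions


# Made-up request stream: u1 returns after a long pause.
requests = [("u1", 0, "/a"), ("u1", 600, "/b"), ("u1", 5000, "/c"), ("u2", 100, "/a")]
for user, hits in sessionize(requests):
    print(user, [path for _, path in hits])
```

Site topology is one kind of background knowledge that can refine this: a request for a page not reachable from the previous page suggests a cached page view or a different user behind the same proxy.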

Web Usage Mining
• Applications: two main categories
  • Learning a user profile (personalized)
    • Web users would be interested in techniques that learn their needs and preferences automatically
  • Learning user navigation patterns (impersonalized)
    • Information providers would be interested in techniques that improve the effectiveness of their Web site

Conclusions
• Surveyed the research in the area of Web mining.
• Suggested three Web mining categories
  • Content, Structure, and Usage Mining
  • And then situated some of the research with respect to these categories
• Explored the connection between the Web mining categories and the related agent paradigm

Exam Question #1
• Question: Outline the main characteristics of Web information.
• Answer: Web information is huge, diverse, and dynamic.

Exam Question #2
• Question: Define Web mining.
• Answer: Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.

Exam Question #3
• Question: What are the three main areas of interest for Web mining?
• Answer: (1) Web Content (2) Web Structure (3) Web Usage

Thank you!
