Web Mining Research: A Survey

Web Mining Research: A Survey • Raymond Kosala and Hendrik Blockeel • ACM SIGKDD, July 2000 • Presented by Shan Huang, 4/24/2007

Outline • Introduction • Web Mining • Web Content Mining • Web Structure Mining • Web Usage Mining • Conclusion & Exam Questions

Four Problems • Finding relevant information • Low precision • Unindexed information • Creating new knowledge out of available information on the web • Personalizing the information • Catering to personal preference in content and presentation • Learning about the consumers • What does the customer want to do? • Using web data to effectively market products and/or services

Other Approaches • Web mining is not the only approach • Database approach (DB) • Information retrieval (IR) • Natural language processing (NLP) • In-depth syntactic and semantic analysis • Web document community • Standards, manually appended meta-information, maintained directories, etc.

Direct vs Indirect Web Mining • Web mining techniques can be used to solve the information overload problems: • Directly • Attack the problem with web mining techniques • E.g. newsgroup agent classifies news as relevant • Indirectly • Used as part of a bigger application that addresses problems • E.g. used to create index terms for a web search service

The Research • Converging research from: Database, information retrieval, and artificial intelligence (specifically NLP and machine learning) • Paper focuses on research from the machine learning point of view

Web Mining: Definition • “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.” • Can be viewed as four subtasks • Not the same as Information Retrieval • Not the same as Information Extraction

Web Mining: Subtasks • Resource finding • Retrieving intended documents • Information selection/pre-processing • Select and pre-process specific information from selected documents • Generalization • Discover general patterns within and across web sites • Analysis • Validation and/or interpretation of mined patterns

Web Mining: Not IR or IE • Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible • Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)
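
The IR goal stated above (retrieve all relevant documents, and as few non-relevant ones as possible) is usually measured with recall and precision. A minimal sketch, assuming a single query and using hypothetical document IDs that do not come from the paper:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5"]
p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.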

Web Mining: Not IR or IE • Information extraction (IE) aims to extract the relevant facts from given documents while IR aims to select the relevant documents • IE systems for the general Web are not feasible • Most focus on specific Web sites or content

Web Mining and Machine Learning • As a broad subfield of artificial intelligence, machine learning is concerned with the development of algorithms and techniques that allow computers to "learn". • Web mining is not the same as learning from the Web. • Some applications of machine learning on the Web are not Web Mining • Web Mining also uses methods other than machine learning • However, there is a close relationship between Web mining and machine learning.

Web Mining Categories • Web Content Mining • Discovering useful information from web contents/data/documents. • IR view for finding • DB view for modeling • Web Structure Mining • Discovering the model underlying link structures (topology) on the Web • E.g. discovering authorities and hubs • Web Usage Mining • Making sense of data generated by surfers • Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.

Web Content Data Structure • Unstructured – free text • Semi-structured – HTML • More structured – Table or Database generated HTML pages • Multimedia data – receives less attention than text or hypertext

Web Mining: The Agent Paradigm • User Interface Agents • information retrieval agents, information filtering agents, & personal assistant agents. • Distributed Agents • distributed agents for knowledge discovery or data mining. • Problem solving by a group of agents • Mobile Agents

Web Mining: The Agent Paradigm • Content-based approach • The system searches for items that match based on an analysis of the content using the user preferences. • Collaborative approach • The system tries to find users with similar interests • Recommendations given based on what similar users did
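
The collaborative approach can be sketched as follows. This is only an illustration under stated assumptions: users are compared by cosine similarity over their ratings, and the user names, page names, ratings, and the `recommend` helper are all hypothetical rather than taken from the paper.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two {item: rating} dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(target, ratings, k=1):
    """Recommend items the target has not seen, taken from the k most similar users."""
    neighbours = sorted(
        ((cosine(ratings[target], r), user) for user, r in ratings.items() if user != target),
        reverse=True,
    )
    seen = set(ratings[target])
    scores = {}
    for similarity, user in neighbours[:k]:
        for item, rating in ratings[user].items():
            if item not in seen:
                # Weight each neighbour's rating by how similar that neighbour is.
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical ratings, for illustration only.
ratings = {
    "alice": {"page_a": 5, "page_b": 4},
    "bob": {"page_a": 5, "page_b": 5, "page_c": 4},
    "carol": {"page_d": 5, "page_e": 3},
}
recs = recommend("alice", ratings)
print(recs)
```

A content-based system would instead compare item descriptions against the user's stated or learned preferences, without looking at other users.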

Web Content Mining: IR View • Unstructured Documents • Bag of words, or phrase-based feature representation • Features can be boolean or frequency based • Features can be reduced using different feature selection techniques • Word stemming, combining morphological variations into one feature
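
A minimal sketch of the bag-of-words representation described above, with either boolean or frequency features. The `naive_stem` function is a deliberately crude stand-in for a real stemmer (it only strips "ing" and "s"), and the sample sentence is made up for illustration.

```python
import re
from collections import Counter


def naive_stem(word):
    """Crude suffix stripping for illustration; not a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, boolean=False):
    """Map a document to {term: feature}, ignoring word order."""
    tokens = [naive_stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(tokens)
    if boolean:
        # Boolean features record only presence, not frequency.
        return {term: 1 for term in counts}
    return dict(counts)


doc = "Web pages link to other pages; linking creates links"
freq = bag_of_words(doc)
present = bag_of_words(doc, boolean=True)
print(freq)
```

The variants "link", "linking", and "links" collapse into one feature, which is the effect stemming is meant to have. Feature selection would then drop the terms that carry little information.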

Web Content Mining: IR View • Semi-Structured Documents • Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks) • Uses common data mining methods (whereas unstructured might use more text mining methods)

Web Content Mining: DB View • Tries to infer the structure of a Web site or transform a Web site to become a database • Better information management • Better querying on the Web • Can be achieved by: • Finding the schema of Web documents • Building a Web warehouse • Building a Web knowledge base • Building a virtual database

Web Content Mining: DB View • Mainly uses the Object Exchange Model (OEM) • Represents semi-structured data (some structure, no rigid schema) by a labeled graph • Process typically starts with manual selection of Web sites for content mining • Main application: building a structural summary of semi-structured data (schema extraction or discovery)
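
To make the idea of a structural summary concrete, here is a rough sketch. It assumes a simplification: the semi-structured data is a tree of nested dictionaries rather than a general labeled graph, and the summary is taken to be the set of distinct label paths. The `label_paths` helper and the sample data are hypothetical.

```python
def label_paths(node, prefix=()):
    """Collect every distinct path of edge labels reachable from the root."""
    paths = set()
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            paths.add(path)
            # A label may lead to one object or to a list of objects.
            children = child if isinstance(child, list) else [child]
            for c in children:
                paths |= label_paths(c, path)
    return paths


# Hypothetical data: two entries with different, irregular structure.
site = {
    "restaurant": [
        {"name": "A", "address": {"street": "x", "city": "y"}},
        {"name": "B", "address": "plain string", "phone": "123"},
    ]
}
summary = label_paths(site)
for path in sorted(summary):
    print(".".join(path))
```

The two entries disagree on what an address looks like and on whether a phone exists, yet the summary still lists every label path that occurs somewhere, which is the kind of schema information the slide refers to.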

Web Structure Mining • Interested in the structure between Web documents (not within a document) • Inspired by the study of social networks and citation analysis • Example: PageRank – Google • Application: Discovering micro-communities in the Web • Measuring the “completeness” of a Web site
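
As an illustration of link-based ranking, the following is a simplified PageRank computed by power iteration. It is a sketch, not Google's implementation: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank over a {page: [linked pages]} dictionary."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph, for illustration only.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page "c" ranks highest because three pages link to it, while "d" ranks lowest because nothing links to it.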

Web Usage Mining • Tries to predict user behavior from interaction with the Web • Wide range of data (logs) • Web client data • Proxy server data • Web server data • Two common approaches • Map usage data into relational tables before using adapted data mining techniques • Use log data directly by utilizing special pre-processing techniques

Web Usage Mining • Typical problems: Distinguishing among unique users, server sessions, episodes, etc., in the presence of caching and proxy servers • Usage Mining often uses some background or domain knowledge • E.g. site topology, Web content, etc.
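
One common pre-processing heuristic for separating users and sessions can be sketched as below. The assumptions are mine, not the paper's: a user is approximated by the (IP address, user agent) pair, a gap of more than 30 minutes starts a new session, and the log records are invented.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed threshold, not from the paper


def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group (ip, agent, timestamp, url) records into per-user sessions."""
    by_user = defaultdict(list)
    for ip, agent, timestamp, url in records:
        # Approximate a unique user by IP address plus user agent.
        by_user[(ip, agent)].append((timestamp, url))
    sessions = []
    for user, hits in by_user.items():
        hits.sort()
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                # Long gap: close the current session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


# Hypothetical log records: (ip, user agent, seconds since start, url).
records = [
    ("10.0.0.1", "Firefox", 0, "/index"),
    ("10.0.0.1", "Firefox", 120, "/products"),
    ("10.0.0.1", "Opera", 130, "/index"),
    ("10.0.0.1", "Firefox", 4000, "/index"),
]
sessions = sessionize(records)
for user, urls in sessions:
    print(user, urls)
```

The heuristic is imperfect in exactly the ways the slide warns about: users behind the same proxy with the same browser are merged, and page views served from a cache never reach the server log at all.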

Web Usage Mining • Two main categories: • Learning a user profile (personalized) • Web users would be interested in techniques that learn their needs and preferences automatically • Learning user navigation patterns (impersonalized) • Information providers would be interested in techniques that improve the effectiveness of their Web site or bias the users towards the goals of the site

Conclusions • Tried to resolve confusion with regard to the term Web Mining • Differentiated from IR and IE • Suggested three Web mining categories: • Content, Structure, and Usage Mining • Briefly described approaches for the three categories • Explored connection with agent paradigm

Exam Question #1 • Question: Outline the main characteristics of Web information. • Answer: Web information is huge, diverse, and dynamic.

Exam Question #2 • Question: How can data mining techniques be used in Web information analysis? Give at least two examples. • Answer: • Classification: a decision tree or Naïve Bayes classifier can be applied to server logs to discover the profiles of users belonging to a particular class • Clustering: clustering can be used to group users exhibiting similar browsing patterns. • Association Analysis: association analysis can be used to relate pages that are most often referenced together in a single server session.
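
The association analysis example can be sketched by counting how often two pages appear in the same server session. This is a minimal illustration that only computes pair support; the session data, the `frequent_pairs` helper, and the 0.5 support threshold are assumptions for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(sessions, min_support=0.5):
    """Return {(page_a, page_b): support} for pairs seen together often enough."""
    total = len(sessions)
    pair_counts = Counter()
    for session in sessions:
        # Count each unordered pair at most once per session.
        for pair in combinations(sorted(set(session)), 2):
            pair_counts[pair] += 1
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


# Hypothetical sessions: the pages requested in each server session.
sessions = [
    ["/index", "/products", "/cart"],
    ["/index", "/products"],
    ["/index", "/about"],
    ["/products", "/cart"],
]
pairs = frequent_pairs(sessions)
print(pairs)
```

Support here is the fraction of sessions containing both pages. A fuller association-rule miner would also compute confidence and consider itemsets larger than pairs.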

Exam Question #3 • Question: What are the three main areas of interest for Web mining? • Answer: (1) Web Content (2) Web Structure (3) Web Usage

Thank you!
