DSCI 5240 Graduate Presentation. Xxxxxx. Research paper: Web Mining Research: A Survey. SIGKDD Explorations, June 2000, Volume 2, Issue 1. Authors: R. Kosala and H. Blockeel
Outline • Introduction • Web Mining • Web Content Mining • Web Structure Mining • Web Usage Mining • Conclusion
Introduction • The World Wide Web is a popular and interactive medium for disseminating information • Information users may encounter four problems: 1. Finding relevant information: search results suffer from (a) low precision and (b) low recall 2. Creating new knowledge out of the information available on the web (a data-triggered process) 3. Personalizing the information: people differ in the content and presentation they prefer 4. Learning about consumers or individual users: mass customization or even personalization
Web Mining • Definition: web mining is the overall process of discovering potentially useful and previously unknown information or knowledge from web data • Four subtasks: • Resource finding: retrieving the intended web documents • Information selection and pre-processing: selecting and pre-processing specific information from the retrieved documents • Generalization: discovering general patterns • Analysis: validating and/or interpreting the mined patterns
Web Mining • Web Mining and Information Retrieval (IR) • Definition: IR is the automatic retrieval of all relevant documents while retrieving as few non-relevant documents as possible • Goal: indexing and searching for useful documents • Web Mining and Information Extraction (IE) • IE aims to transform a collection of documents into information that is more readily digested and analyzed • Comparing IR and IE: (a) aims (b) fields
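The IR definition above corresponds to the two standard measures, recall (how many of the relevant documents were retrieved) and precision (how many of the retrieved documents are relevant). A minimal sketch, assuming documents are identified by hypothetical string IDs that do not come from the slides:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).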
Web Mining • Web Mining and the Agent Paradigm • Web mining is often viewed from, or implemented within, an agent paradigm: • User interface agents • Distributed agents • Mobile agents • Two approaches are used to develop intelligent agents: • Content-based approach • Collaborative approach
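The slide names the collaborative approach without detailing it. As a rough illustration only, the sketch below assumes the usual reading: recommend to a user the items liked by the most similar other user. The ratings, user names, page IDs, and the choice of cosine similarity are all assumptions made for this example, not taken from the paper.

```python
from math import sqrt

# Hypothetical ratings: user -> {page: rating}.
ratings = {
    "ann": {"p1": 5, "p2": 4, "p3": 1},
    "bob": {"p1": 4, "p2": 5, "p4": 5},
    "cat": {"p1": 1, "p3": 5, "p5": 4},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings):
    """Suggest pages the most similar other user rated but `user` has not."""
    others = [(cosine(ratings[user], r), name)
              for name, r in ratings.items() if name != user]
    _, nearest = max(others)
    return sorted(set(ratings[nearest]) - set(ratings[user]))


print(recommend("ann", ratings))
```

A content-based agent would instead compare the content of pages against a profile of the user's own preferences, without looking at other users.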
Web Content Mining • Definition: discovering useful information from web page contents, data, and documents • Several types of data: text, image, audio, video, hyperlinks • Types of data structure: 1. Unstructured: free text 2. Semi-structured: HTML 3. More structured: data in tables or database-generated HTML pages
Web Content Mining • IR view • Unstructured documents: • A bag of words represents each unstructured document • Features: Boolean (term present or absent) or frequency-based (term counts) • Variations of the feature selection • The number of features can be reduced using different feature selection techniques • Semi-structured documents: • Use richer representations for features • Use common data mining methods
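The bag-of-words representation mentioned above can be sketched as follows. This is a minimal illustration, assuming a simple lowercase alphanumeric tokenizer and two made-up documents; real systems typically add stop-word removal, stemming, and feature selection.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(docs):
    """Return the vocabulary plus Boolean and frequency vectors per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    boolean_vectors, frequency_vectors = [], []
    for doc in docs:
        counts = Counter(tokenize(doc))
        frequency_vectors.append([counts[t] for t in vocab])
        boolean_vectors.append([1 if counts[t] else 0 for t in vocab])
    return vocab, boolean_vectors, frequency_vectors


# Made-up documents, for illustration only.
docs = ["web mining mines the web", "usage mining uses logs"]
vocab, boolean, frequency = bag_of_words(docs)
print(vocab)
print(boolean[0], frequency[0])
```

Word order is discarded: each document becomes a vector over the shared vocabulary, holding either presence flags or counts.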
Web Content Mining • DB view: tries to infer the structure of a web site, or to transform a web site into a database • Methods: • Finding the schema of web documents • Building a web warehouse • Building a web knowledge base • Building a virtual database
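As a toy illustration of turning web content into a database, the sketch below extracts the rows of an HTML table and loads them into an in-memory SQLite table. The page snippet, the `papers` table, and its columns are hypothetical; the methods listed above address far less regular pages than this.

```python
import sqlite3
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment.
page = """<table>
<tr><th>title</th><th>year</th></tr>
<tr><td>Web Mining Research: A Survey</td><td>2000</td></tr>
</table>"""

parser = TableRows()
parser.feed(page)
header, *records = parser.rows  # first row holds the column names

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT, year INTEGER)")
conn.executemany("INSERT INTO papers VALUES (?, ?)", records)
rows = conn.execute("SELECT title, year FROM papers").fetchall()
print(rows)
```

Once the data sits in relational form, it can be queried with ordinary SQL instead of keyword search.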
Web Structure Mining • Interested in the structure of the hyperlinks within the web • Inspired by the study of social networks and citation analysis • Discovers specific types of pages based on their incoming and outgoing links • Applications: • Discovering micro-communities in the web • Measuring the completeness of a web site
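The simplest link-based signal is how many links point into and out of each page. The sketch below counts both on a hypothetical four-page site; it is only a starting point and much simpler than the link-analysis methods used in practice.

```python
from collections import defaultdict

# Hypothetical link graph: page -> pages it links to.
links = {
    "home": ["about", "papers", "links"],
    "about": ["home"],
    "papers": ["home", "links"],
    "links": ["home", "about", "papers"],
}


def degree_counts(links):
    """Return {page: (incoming_links, outgoing_links)}, ignoring duplicate links."""
    out_deg = {page: len(set(targets)) for page, targets in links.items()}
    in_deg = defaultdict(int)
    for targets in links.values():
        for target in set(targets):
            in_deg[target] += 1
    pages = set(out_deg) | set(in_deg)
    return {p: (in_deg[p], out_deg.get(p, 0)) for p in pages}


degrees = degree_counts(links)
most_cited = max(degrees, key=lambda p: degrees[p][0])
print(degrees, most_cited)
```

A page with many incoming links resembles a heavily cited paper in citation analysis, while a page with many outgoing links behaves more like a directory.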
Web Usage Mining • Tries to predict user behavior from interaction with the web • Wide range of data • Two commonly used approaches: • Map the usage data of the web server into relational tables before applying an adapted data mining technique • Use the log data directly by applying special pre-processing techniques • Problems: • Distinguishing among unique users, server sessions, and episodes in the presence of caching and proxy servers • Usage mining often relies on background or domain knowledge • Applications
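A common pre-processing step for usage data is grouping log records into sessions. The sketch below is one possible heuristic, assuming pre-parsed records of (host, timestamp, URL) and a 30-minute inactivity timeout; both the record layout and the timeout are assumptions, not taken from the paper.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout

# Hypothetical pre-parsed log records: (host, timestamp, url).
log = [
    ("10.0.0.1", "2000-06-01 10:00:00", "/index.html"),
    ("10.0.0.2", "2000-06-01 10:01:00", "/index.html"),
    ("10.0.0.1", "2000-06-01 10:05:00", "/papers.html"),
    ("10.0.0.1", "2000-06-01 11:30:00", "/index.html"),
]


def sessionize(records, gap=SESSION_GAP):
    """Group requests per host, starting a new session after `gap` of inactivity."""
    parsed = sorted(
        (host, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), url)
        for host, ts, url in records
    )
    sessions = []
    last_seen = {}  # host -> (last timestamp, index of that host's open session)
    for host, ts, url in parsed:
        previous = last_seen.get(host)
        if previous is None or ts - previous[0] > gap:
            sessions.append({"host": host, "pages": []})
            index = len(sessions) - 1
        else:
            index = previous[1]
        sessions[index]["pages"].append(url)
        last_seen[host] = (ts, index)
    return sessions


sessions = sessionize(log)
for session in sessions:
    print(session)
```

Treating each host as one user is exactly what caching and proxy servers undermine: several people behind one proxy collapse into a single host, and cached page views never reach the server log.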
Conclusions • Survey of research in the area of web mining • Three web mining categories: content, structure, and usage mining • Connections between the web mining categories and the related agent paradigm