pg 01 Web Mining Research: A Survey
Authors: Raymond Kosala & Hendrik Blockeel
Presenter: Ryan Patterson
CS332 Data Mining, April 23rd 2014
pg 02 Outline
• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Review
• Exam Questions
pg 04 Introduction
“The Web is huge, diverse, and dynamic . . . we are currently drowning in information and facing information overload.”
Web users encounter problems:
• Finding relevant information
• Creating new knowledge out of the information available on the Web
• Personalization of the information
• Learning about consumers or individual users
pg 06 Web Mining
“Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.”
Web mining subtasks:
• Resource finding
• Information selection and pre-processing
• Generalization
• Analysis
pg 07 Web Mining
Information Retrieval & Information Extraction
• Information Retrieval (IR)
  • the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
• Information Extraction (IE)
  • transforming a collection of documents into information that is more readily digested and analyzed
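The IR goal stated on this slide (retrieve all relevant documents, as few non-relevant ones as possible) is commonly quantified with precision and recall. Those two measures are not named on the slide, so treat the sketch below as an illustration under that assumption; the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the result list of one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

"Retrieving all relevant documents" corresponds to high recall, and "as few of the non-relevant as possible" corresponds to high precision.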
pg 08 Live demo
pg 10 Web Content Mining
Information Retrieval View
Unstructured Documents
• Most approaches use a “bag of words” representation to generate document features
  • ignores the sequence in which the words occur
• Document features can be reduced with selection algorithms
  • e.g. information gain
• Possible alternative document feature representations:
  • word positions in the document
  • phrases/terms (e.g. “annual interest rate”)
Semi-Structured Documents
• Utilize additional structural information gleaned from the document
  • HTML markup (intra-document structure)
  • HTML links (inter-document structure)
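A minimal sketch of the two ideas for unstructured documents: a bag-of-words representation, and information gain as a feature-selection score. The whitespace tokenizer, the tiny corpus, and the class labels are all assumptions made for illustration; none of them come from the slides.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a document to word counts; word order is discarded."""
    return Counter(text.lower().split())


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, word):
    """Entropy reduction obtained by splitting docs on presence of `word`."""
    with_word = [y for d, y in zip(docs, labels) if word in bag_of_words(d)]
    without = [y for d, y in zip(docs, labels) if word not in bag_of_words(d)]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_word, without) if part)
    return entropy(labels) - remainder


# Hypothetical labeled corpus, for illustration only.
docs = ["low annual interest rate", "interest rate rises",
        "football match tonight", "match report football"]
labels = ["finance", "finance", "sport", "sport"]

# Word order does not matter in this representation.
print(bag_of_words("interest rate") == bag_of_words("rate interest"))
# A word that separates the classes scores higher than one that does not.
print(information_gain(docs, labels, "interest"),
      information_gain(docs, labels, "tonight"))
```

Keeping only the highest-scoring words is one way to reduce the feature set before learning.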
pg 11 Web content mining, IR unstructured documents
pg 12 Web content mining, IR semi structured documents
pg 13 Web Content Mining
Database View
“the Database view tries . . . to transform a Web site to become a database so that . . . querying on the Web become[s] possible.”
• Uses the Object Exchange Model (OEM)
  • represents semi-structured data by a labeled graph
• Database view algorithms typically start from manually selected Web sites
  • site-specific parsers
• Database view algorithms produce:
  • document-level schemas or DataGuides
    • structural summary of semi-structured data
  • frequent substructures (sub-schemas)
  • multi-layered databases
    • each layer is obtained by generalizations on lower layers
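A rough sketch of the "labeled graph" and "structural summary" ideas. It assumes tree-shaped data held in nested Python dicts and lists, and it collects the distinct label paths from the root. This is only in the spirit of a DataGuide: real OEM data is a general graph, and real DataGuide construction is more involved. The sample data is hypothetical.

```python
def label_paths(node, prefix=()):
    """Collect every distinct label path from the root of a nested structure."""
    paths = set()
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(node, list):
        # A list models several child objects reached by the same edge label.
        for child in node:
            paths |= label_paths(child, prefix)
    return paths


# Hypothetical semi-structured data: the two entries have different fields.
site = {"restaurant": [
    {"name": "Chef Chu", "address": {"city": "Los Altos"}},
    {"name": "Saigon", "phone": "555-0100"},
]}

summary = sorted(".".join(path) for path in label_paths(site))
for line in summary:
    print(line)
```

Each path appears once in the summary no matter how many objects share it, which is what makes the summary usable as a schema for querying.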
pg 14 Web content mining, Database view
pg 16 Web Structure Mining
“. . . we are interested in the structure of the hyperlinks within the Web itself”
• Inspired by the study of social networks and citation analysis
  • based on incoming & outgoing links we could discover specific types of pages (such as hubs, authorities, etc.)
• Some algorithms calculate the quality/relevancy of each Web page
  • e.g. PageRank
• Others measure the completeness of a Web site
  • measuring frequency of local links on the same server
  • interpreting the nature of hierarchy of hyperlinks on one domain
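A minimal sketch of a PageRank-style computation by power iteration over a small link graph. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page graph are all assumptions chosen for illustration; the slides name the algorithm but give no details.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages; `links` maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares along its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: page "c" has the most incoming links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Only the link structure is used here, not the page content, which is what separates structure mining from content mining.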
pg 18 Web Usage Mining
“. . . focuses on techniques that could predict user behavior while the user interacts with the Web.”
• Web usage is mined by parsing Web server logs
  • mapped into relational tables → data mining techniques applied
  • log data utilized directly
• Proxy servers and caching of Web data (by users or their ISPs) decrease server log accuracy
• Two applications:
  • personalized - user profile or user modeling in adaptive interfaces
  • impersonalized - learning user navigation patterns
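A minimal sketch of the first step, parsing server log lines into table-like rows that data mining techniques can then work on. It assumes the logs follow the Common Log Format; the slides do not say which format is used, and the sample lines are made up.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format):
# host ident authuser [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_log(lines):
    """Turn log lines into row dictionaries, skipping lines that do not match."""
    rows = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


# Made-up log lines, for illustration only.
sample = [
    '10.0.0.1 - - [23/Apr/2014:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [23/Apr/2014:10:00:09 +0000] "GET /courses.html HTTP/1.0" 200 2211',
    '10.0.0.2 - - [23/Apr/2014:10:01:30 +0000] "GET /index.html HTTP/1.0" 200 1043',
    'malformed line',
]

rows = parse_log(sample)
hits_per_path = Counter(row["path"] for row in rows)

# Group requested pages by host as a crude stand-in for a navigation path.
pages_by_host = {}
for row in rows:
    pages_by_host.setdefault(row["host"], []).append(row["path"])

print(hits_per_path.most_common())
print(pages_by_host)
```

Grouping by host also shows the accuracy problem from the slide: all users behind one proxy share a host address, and pages served from a cache never reach the log at all.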
pg 20 Review
• Web mining
  • 4 subtasks
  • IR & IE
• Web content mining
  • primarily intra-page analysis
  • IR view vs DB view
• Web structure mining
  • primarily inter-page analysis
• Web usage mining
  • primarily analysis of server activity logs
pg 21 Web mining categories
pg 24 Exam Question 1
Q: Of the following Web mining paradigms:
• Information Retrieval
• Information Extraction
Which does a traditional Web search engine (google.com, bing.com, etc.) attempt to accomplish? Briefly support your answer.
A: Information Retrieval. The search engine attempts to provide a list of documents ranked by their relevancy to the search query.
pg 26 Exam Question 2
Q: State one common problem hampering accurate Web usage mining. Briefly support your answer.
A: Either of the following decreases server log accuracy:
• Users connecting to a Web site through a proxy server
• Users (or their ISPs) utilizing Web data caching
Accurate server logs are required for accurate Web usage mining.
pg 28 Exam Question 3
Q: What is the phrase associated with the most popular method for Web content mining algorithms to generate document features from unstructured documents?
A: “Bag of words” representation.