Web Mining

Web Mining By: Pawan Singh, Piyush Arora, Pooja Mansharamani, Pramod Singh, Praveen Kumar

Outline • Introduction • Web Mining • Web Content Mining • Web Structure Mining • Web Usage Mining • Conclusion & Exam Questions

Four Problems • Finding relevant information • Low precision, which is due to the irrelevance of many of the search results. This makes the relevant information difficult to find. • Low recall, which is due to the inability to index all the information available on the web. This makes unindexed but relevant information difficult to find. • Creating new knowledge out of available information on the web • While the problem above is a query-triggered process (retrieval oriented), this problem is a data-triggered process.

Personalizing the information • Catering to personal preference in content and presentation (associated with the type and presentation of the information) • Learning about the consumers • What does the customer want to do? • Using web data to effectively market products and/or services
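
The precision and recall terms used in the first problem can be made concrete with a small calculation. The sketch below assumes we already know which documents are relevant; the document ids and the helper name `precision_recall` are hypothetical and only illustrate the two ratios.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved vs. truly relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    # Precision: fraction of the returned results that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: fraction of all relevant documents that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search result: 4 documents returned, 5 relevant ones exist.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]
precision, recall = precision_recall(retrieved, relevant)
```

Here half of the returned documents are irrelevant (low precision), and three relevant documents were never returned, for example because they were not indexed (low recall).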

Other Approaches • Web mining is NOT the only approach • Database approach (DB) • Information retrieval (IR) • Natural language processing (NLP) • In-depth syntactic and semantic analysis • Web document community • Standards, manually appended meta-information, maintained directories, etc.

Direct vs. Indirect Web Mining • Web mining techniques can be used to solve the information overload problems: • Directly: attack the problem with web mining techniques, e.g. a newsgroup agent classifies news as relevant • Indirectly: used as part of a bigger application that addresses the problems, e.g. used to create index terms for a web search service

The Research • Converging research from: Database, information retrieval, and artificial intelligence (specifically NLP and machine learning) • Focusing on research from the machine learning point of view

Web Mining: Definition • “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.” • Can be viewed as four subtasks • Not the same as Information Retrieval • Not the same as Information Extraction

Web Mining: Subtasks • Resource finding • Retrieving intended documents • Information selection/pre-processing • Select and pre-process specific information from retrieved web resources. • Generalization • Discover general patterns within and across web sites • Analysis • Validation and/or interpretation of mined patterns

Web Mining: Not IR • Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible • Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)

Web Mining: Not IE • Information extraction (IE) aims to extract the relevant facts from given documents while IR aims to select relevant documents. • IE systems for the general Web are not feasible • Most focus on specific Web sites or content

IE - IR • Information Retrieval • Automatic retrieval of relevant documents • Primary Goals: • Indexing text • Searching for useful documents in a collection • “Bag of unordered words” • “Web document classification” task is an instance of IR • Information Extraction • Extract relevant facts from documents • Primary Goals: • Transform collection of retrieved documents to information. • Structure of representation of a document • “Web document classification” task is an instance of IR • IE has a higher level of granularity • Result: • Structured database • Compression or summary of text or documents
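
The "bag of unordered words" view of a document used in IR can be sketched as a word-count table. This is a minimal illustration, assuming a simple lowercase alphanumeric tokenizer; the function name `bag_of_words` and the sample sentences are hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to word counts, discarding word order entirely."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Two hypothetical documents with the same words in a different order.
doc_a = "Web mining discovers knowledge from Web data"
doc_b = "data from Web knowledge discovers mining Web"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)
same_representation = bag_a == bag_b
```

Both documents get the same representation, which is why this model suits indexing and searching but not extracting structured facts, where the structure of the document matters.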

Types of IE • IE from unstructured texts (classical) • Unstructured: free texts, e.g. news stories • Basic to deep linguistic pre-processing • IE from semi-structured texts (structural) • Semi-structured: HTML • Uses meta-information, e.g. HTML tags • Wrapper induction • Machine learning used to build systems (semi-)automatically
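
To illustrate how HTML tags can serve as meta-information for extraction, here is a minimal hand-written wrapper built on Python's standard `html.parser`. It is only a sketch: wrapper induction would learn such extraction rules from examples, whereas here the rule (one record per table row, one field per cell) is fixed by hand. The class name `TableRowWrapper` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TableRowWrapper(HTMLParser):
    """Turn every <tr> into a record whose fields are the <td> cell texts."""

    def __init__(self):
        super().__init__()
        self.rows = []          # extracted records
        self._row = None        # record currently being built, if any
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only text inside a table cell is treated as a fact to extract.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(tuple(self._row))
            self._row = None


# Hypothetical semi-structured page.
page = (
    "<table>"
    "<tr><td>Widget</td><td>9.99</td></tr>"
    "<tr><td>Gadget</td><td>24.50</td></tr>"
    "</table>"
)
wrapper = TableRowWrapper()
wrapper.feed(page)
wrapper.close()
records = wrapper.rows
```

The result is a small structured table of (name, price) records, the kind of output that distinguishes IE from plain document retrieval.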

Web Mining and Machine Learning • Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn". • Web mining is NOT learning from the Web. • Some applications of machine learning on the web are NOT Web Mining • Methods used for Web Mining are NOT limited to machine learning • There is a close relationship between web mining and machine learning

Web Mining and Machine Learning • Machine learning techniques support web mining because they can be applied to the processes within web mining. • For example, recent research shows that applying machine learning techniques could improve the text classification process compared to traditional IR techniques. • In short, web mining intersects with the application of machine learning on the web.

Web Mining Categories • Web Content Mining • Discovering useful information from web contents/data/documents. • Web Structure Mining • Discovering the model underlying link structures (topology) on the Web. E.g. discovering authorities and hubs • Web Usage Mining • Make sense of data generated by surfers • Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
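
The "authorities and hubs" example under Web Structure Mining can be sketched with the iterative scheme usually called HITS; the slides do not name the algorithm, so treat that as an assumption. A good hub links to good authorities, and a good authority is linked to by good hubs. The tiny link graph and page names below are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a graph given as {page: [targets]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to the page.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages the page links to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        # Normalize both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth


# Hypothetical graph: two directory-like pages pointing at content pages.
links = {"portal": ["a", "b", "c"], "blog": ["a", "b"], "a": [], "b": [], "c": []}
hub, auth = hits(links)
best_hub = max(hub, key=hub.get)
```

In this graph "portal" comes out as the strongest hub, and pages "a" and "b", which both hubs point to, score higher as authorities than "c".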

Web Content Data Structure • Unstructured – free text • Semi-structured – HTML • More structured – Table or Database generated HTML pages • Multimedia data – receive less attention than text or hypertext

Web Structure Mining • Interested in the structure between Web documents (not within a document) • Example: PageRank – Google • Application: Discovering micro-communities in the Web • Measuring the “completeness” of a Web site
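
The PageRank example can be sketched as a simple power iteration over the link structure between documents. This is a simplified illustration, not Google's production method: the damping factor of 0.85 is the commonly cited value and is an assumption here, and the three-page site is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Return a {page: rank} dict for a graph given as {page: [targets]}."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[node] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical three-page site.
web = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
}
ranks = pagerank(web)
top_page = max(ranks, key=ranks.get)
```

Only links between documents are used, never the content within a document, which is what makes this a structure mining technique.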

Web Usage Mining • Tries to predict user behavior from interaction with the Web • Wide range of data (logs) • Web client data • Proxy server data • Web server data • Two common approaches • Map usage data into relational tables before using adapted data mining techniques • Use log data directly by utilizing special pre-processing techniques
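
The first approach, mapping usage data into relational tables, can be sketched with Python's built-in `re` and `sqlite3` modules. The example assumes web server log lines in the Common Log Format; the sample lines, the table name `hits`, and the column names are hypothetical.

```python
import re
import sqlite3

# Pattern for Common Log Format lines (assumed input format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

sample_log = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [10/Oct/2023:13:57:12 +0000] "GET /products.html HTTP/1.1" 200 512',
    'malformed line',
]


def parse_line(line):
    """Return one relational row for a log line, or None if it does not parse."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return (m["host"], m["time"], m["method"], m["path"], int(m["status"]))


# Pre-processing: drop unparseable lines, then load the rest into a table.
rows = [r for r in map(parse_line, sample_log) if r is not None]
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
)
conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", rows)

# Once the data is relational, ordinary queries can summarize usage.
page_counts = conn.execute(
    "SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
```

With the log in a table, adapted data mining techniques (or plain SQL, as here) can be applied; the second approach would instead run specialized pre-processing directly on the raw log lines.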

Thank you!
