Web Mining By: Pawan Singh, Piyush Arora, Pooja Mansharamani, Pramod Singh, Praveen Kumar
Outline • Introduction • Web Mining • Web Content Mining • Web Structure Mining • Web Usage Mining • Conclusion & Exam Questions
Four Problems • Finding relevant information • Low precision, which is due to the irrelevance of many of the search results. This makes it difficult to find the relevant information. • Low recall, which is due to the inability to index all the information available on the web. This makes it difficult to find relevant information that is unindexed. • Creating new knowledge out of the information available on the web • While the problem above is a query-triggered process (retrieval oriented), this problem is a data-triggered process.
Personalizing the information • Catering to personal preferences in content and presentation (associated with the type and presentation of the information) • Learning about the consumers • What does the customer want to do? • Using web data to effectively market products and/or services
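The low-precision and low-recall problems named under "Finding relevant information" can be made concrete with a small calculation. The sketch below is illustrative only: the function name `precision_recall` and the document ids are hypothetical, not taken from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 results returned, 5 relevant pages exist, 2 overlap.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7", "d8", "d9"})
print(p, r)
```

Here half of the returned results are irrelevant (low precision), and three relevant pages were never returned, for instance because they were not indexed (low recall).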
Other Approaches Web mining is NOT the only approach • Database approach (DB) • Information retrieval (IR) • Natural language processing (NLP) • In-depth syntactic and semantic analysis • Web document community • Standards, manually appended meta-information, maintained directories, etc.
Direct vs. Indirect Web Mining • Web mining techniques can be used to solve the information overload problems: • Directly: attack the problem with web mining techniques, e.g. a newsgroup agent that classifies news as relevant • Indirectly: used as part of a bigger application that addresses the problems, e.g. used to create index terms for a web search service
The Research • Converging research from: Database, information retrieval, and artificial intelligence (specifically NLP and machine learning) • Focusing on research from the machine learning point of view
Web Mining: Definition • “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.” • Can be viewed as four subtasks • Not the same as Information Retrieval • Not the same as Information Extraction
Web Mining: Subtasks • Resource finding • Retrieving intended documents • Information selection/pre-processing • Select and pre-process specific information from retrieved web resources. • Generalization • Discover general patterns within and across web sites • Analysis • Validation and/or interpretation of mined patterns
Web Mining: Not IR • Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible • Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)
Web Mining: Not IE • Information extraction (IE) aims to extract the relevant facts from given documents while IR aims to select relevant documents. • IE systems for the general Web are not feasible • Most focus on specific Web sites or content
IE - IR • Information Retrieval • Automatic retrieval of relevant documents • Primary Goals: • Indexing text • Searching for useful documents in a collection • “Bag of unordered words” • The “Web document classification” task is an instance of IR • Information Extraction • Extract relevant facts from documents • Primary Goals: • Transform a collection of retrieved documents into information • Structure or representation of a document • The “Web document classification” task is an instance of IR • IE has a higher level of granularity • Result: • Structured database • Compression or summary of text or documents
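The “bag of unordered words” view used in IR can be sketched in a few lines. This is a minimal illustration, assuming a simple lowercase alphanumeric tokenizer; the function name `bag_of_words` and the sample sentences are hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to term counts, discarding word order entirely."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Two texts with the same words in a different order get the same representation.
a = bag_of_words("Web mining discovers knowledge from Web data")
b = bag_of_words("data from Web: knowledge discovers mining Web")
print(a == b)
```

Because order is discarded, this representation supports indexing and searching but not the extraction of facts, which is where IE differs.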
Types of IE • IE from unstructured texts (Classical) • Unstructured → free text, e.g. news stories • Basic to deep linguistic pre-processing • IE from semi-structured texts (Structural) • Semi-structured → HTML • Uses meta-information, e.g. HTML tags • Wrapper induction • Machine learning used to build systems (semi-)automatically
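To illustrate how structural IE uses HTML tags as meta-information, here is a hand-written wrapper built on Python's standard `html.parser` module. It is only a sketch: wrapper induction would learn such extraction rules from examples rather than have them coded by hand, and the class name `TableWrapper`, the sample page, and the field names are all hypothetical.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # a new row starts a new record
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")  # a new cell starts a new field

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical product page fragment.
page = (
    "<table>"
    "<tr><td>Laptop</td><td>799</td></tr>"
    "<tr><td>Phone</td><td>499</td></tr>"
    "</table>"
)
wrapper = TableWrapper()
wrapper.feed(page)
wrapper.close()

# Turn the extracted cells into structured records.
records = [{"product": name, "price": int(price)} for name, price in wrapper.rows]
print(records)
```

The wrapper works only for pages that share this table layout, which matches the point that most IE systems focus on specific Web sites or content.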
Web Mining and Machine Learning • Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn". • Web mining is NOT learning from the Web. • Some applications of machine learning on the web are NOT Web Mining • Methods used for Web Mining are NOT limited to machine learning • There is a close relationship between web mining and machine learning
Web Mining and Machine Learning • Machine learning techniques support web mining because they can be applied to the processes within web mining. • For example, recent research shows that applying machine learning techniques can improve the text classification process compared to traditional IR techniques. • In short, web mining intersects with the application of machine learning on the web.
Web Mining Categories • Web Content Mining • Discovering useful information from web contents/data/documents. • Web Structure Mining • Discovering the model underlying link structures (topology) on the Web. E.g. discovering authorities and hubs • Web Usage Mining • Make sense of data generated by surfers • Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
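The “authorities and hubs” example mentioned under Web Structure Mining can be sketched as an iterative computation in the style of the HITS algorithm. This is a simplified illustration, not a production implementation: the function name `hits`, the fixed iteration count, and the tiny link graph are assumptions.

```python
def hits(links, iterations=50):
    """Compute hub and authority scores for a graph given as {page: [linked pages]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical graph: two link pages pointing at three content pages.
links = {"portal": ["a", "b", "c"], "blog": ["a", "b"], "a": [], "b": [], "c": []}
hub, auth = hits(links)
print(max(hub, key=hub.get), sorted(auth, key=auth.get, reverse=True))
```

In this toy graph, "portal" emerges as the strongest hub, and pages "a" and "b", which are linked from both hubs, score higher as authorities than "c".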
Web Content Data Structure • Unstructured – free text • Semi-structured – HTML • More structured – Table or Database generated HTML pages • Multimedia data – receive less attention than text or hypertext
Web Structure Mining • Interested in the structure between Web documents (not within a document) • Example: PageRank – Google • Application: Discovering micro-communities in the Web • Measuring the “completeness” of a Web site
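As an illustration of mining the link structure between documents, the following is a simplified power-iteration sketch of PageRank. The damping factor of 0.85 is the commonly cited value and is an assumption here, as are the function name `pagerank`, the iteration count, and the three-page graph; this should not be read as Google's actual implementation.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Rank pages of a graph given as {page: [linked pages]} by power iteration."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Pages without outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in nodes if not links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in nodes}
        for src in nodes:
            targets = links.get(src, [])
            for t in targets:
                # Each page passes its rank in equal shares to the pages it links to.
                new[t] += damping * rank[src] / len(targets)
        rank = new
    return rank


# Hypothetical three-page site.
links = {"home": ["about", "docs"], "about": ["home"], "docs": ["home", "about"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The scores depend only on links between pages, not on anything within a page, which is the distinction the slide draws.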
Web Usage Mining • Tries to predict user behavior from interaction with the Web • Wide range of data (logs) • Web client data • Proxy server data • Web server data • Two common approaches • Map usage data into relational tables before using adapted data mining techniques • Use log data directly by utilizing special pre-processing techniques
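The first approach, mapping usage data into relational tables, might look like the sketch below. It assumes web server logs in Common Log Format and a 30-minute inactivity timeout for splitting sessions; both are assumptions rather than details from the slides, and the names `parse_line`, `sessionize`, and the sample log lines are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed input: Common Log Format lines from a web server.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Turn one log line into a (host, timestamp, url, status) row, or None."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
    return (m["host"], ts, m["url"], int(m["status"]))


def sessionize(rows, timeout=timedelta(minutes=30)):
    """Group rows into per-host sessions, split when the gap exceeds the timeout."""
    sessions = []
    last = {}  # host -> (timestamp of last request, current session's url list)
    for host, ts, url, _status in sorted(rows, key=lambda r: (r[0], r[1])):
        prev = last.get(host)
        if prev is None or ts - prev[0] > timeout:
            session = []
            sessions.append((host, session))
        else:
            session = prev[1]
        session.append(url)
        last[host] = (ts, session)
    return sessions


log = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:13:57:02 +0000] "GET /products.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [10/Oct/2023:14:01:10 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:15:30:00 +0000] "GET /cart.html HTTP/1.1" 200 128',
]
rows = [r for r in map(parse_line, log) if r is not None]
sessions = sessionize(rows)
print(sessions)
```

The `rows` list plays the role of a relational table of requests, and the resulting sessions could then be fed to standard data mining techniques such as frequent-pattern or sequence mining.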