Web Mining Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu
Web Mining • (Etzioni, 1996) • Web mining: the use of data mining techniques to automatically discover and extract information from Web documents and services. • (Kosala and Blockeel, July 2000) • “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.” • Research areas of Web mining: a field that integrates several research areas, including • Database (DB) • Information retrieval (IR) • The sub-areas of machine learning (ML) • Natural language processing (NLP)
Web Mining: Subtasks • Resource Finding • Task of retrieving the intended web documents • Information Selection & Pre-processing • Automatically selecting and pre-processing specific information from the retrieved web resources • Generalization • Automatic discovery of patterns in web sites • Analysis • Validation and / or interpretation of mined patterns
Web Mining: Not IR or IE • Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible • Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine) • Information extraction (IE) aims to extract the relevant facts from given documents while IR aims to select the relevant documents • IE systems for the general Web are not feasible • Most focus on specific Web sites or content
Mining the World-Wide Web • The WWW is a huge source of information that includes the following kinds of content • Information services: • news, advertisements, consumer information, • financial management, education, government, e-commerce, etc. • Hyper-link information • Access and usage information • Web Site contents and Organization • Growing and changing very rapidly: Broad diversity of user communities • Nevertheless, only a very small portion of the information on the Web is actually useful to Web users. • How can we find high-quality Web pages on a specific topic? • This is why data mining techniques that operate on Web data are needed.
Challenges on WWW Interactions • Finding relevant information: searching for the information that matters • Creating knowledge from the information available • Personalization of the information: providing information suited to each individual's needs • Learning about customers / individual users: developing learning models for the customer base as a whole or for individual customers • Web Mining can play an important role!
Web Mining: more challenging • Searches for • Web access patterns • Web structures • Regularity and dynamics of Web contents • Problems • The “abundance” problem • Limited coverage of the Web: hidden Web sources, majority of data in DBMS • Limited query interface based on keyword-oriented search • Limited customization to individual users • Dynamic and semistructured
Web Mining Taxonomy • Web Mining • Web Content Mining • Web Structure Mining • Web Usage Mining
Web Content Mining • Discovery of useful information from web contents / data / documents • Web data contents: • text, image, audio, video, metadata and hyperlinks. • Information Retrieval View (Structured + Semi-Structured) • Assist / improve information finding • Filter information for users based on user profiles • Database View • Model data on the web • Integrate them for more sophisticated queries
Issues in Web Content Mining • Developing intelligent tools for IR • Finding keywords and key phrases • Discovering grammatical rules and collocations • Hypertext classification/categorization • Extracting key phrases from text documents • Learning extraction models/rules • Hierarchical clustering • Predicting (words) relationship • Developing Web query systems • WebOQL, XML-QL • Mining multimedia data • Mining satellite images (Fayyad, et al. 1996) • Mining images to identify small volcanoes on Venus (Smyth, et al. 1996)
Web Structure Mining • To discover the link structure of the hyperlinks at the inter-document level and generate a structural summary of the Website and Web page. • Direction 1: based on the hyperlinks, categorizing the Web pages and generated information. • Direction 2: discovering the structure of the Web document itself. • Direction 3: discovering the nature of the hierarchy or network of hyperlinks in the Website of a particular domain.
Web Structure Mining • Finding authoritative Web pages • Retrieving pages that are both relevant to a given topic and of high quality • Hyperlinks can infer the notion of authority • The Web consists of hyperlinks that connect pages to one another • These hyperlinks carry a latent evaluation expressed through the annotations of page authors • A hyperlink pointing to another Web page can be read as the page author's endorsement of that page
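The idea that hyperlinks confer authority can be illustrated with a small hub/authority iteration in the style of HITS, the kind of link analysis usually associated with the Clever line of work. This is a minimal sketch rather than the method of any system named in these slides; the graph representation (a dict of outgoing links), the function name `hits`, and the fixed iteration count are assumptions made for the example.

```python
from math import sqrt


def hits(links, iterations=50):
    """Return (hub, authority) scores for a link graph.

    links: dict mapping a page to the list of pages it links to
    (hypothetical input format chosen for this sketch).
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Toy graph: pages "a" and "b" both point to "c", and "c" points to "d".
example_links = {"a": ["c"], "b": ["c"], "c": ["d"]}
hub_scores, auth_scores = hits(example_links)
```

In the toy graph, "c" ends up with the highest authority score because two pages endorse it, while "a" and "b" receive no authority because nothing links to them.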
Web Structure Mining • Web pages categorization (Chakrabarti, et al., 1998) • Discovering micro communities on the web • Clever system (Chakrabarti, et al., 1999), • Google (Brin and Page, 1998) • Schema Discovery in Semistructured Environment
Web Usage Mining • Web usage mining is also known as Web log mining • It applies mining techniques to discover interesting usage patterns from the secondary data derived from users' interactions while surfing the web
Web Usage Mining • Applications • Target potential customers for electronic commerce • Enhance the quality and delivery of Internet information services to the end user • Improve Web server system performance • Identify potential prime advertisement locations • Facilitate personalization/adaptive sites • Improve site design • Detect fraud/intrusion • Predict user’s actions (allows prefetching)
Problems with Web Logs • Identifying users – Clients may have multiple streams – Clients may access web from multiple hosts – Proxy servers: many clients/one address – Proxy servers: one client/many addresses • Data not in log – POST data (e.g., CGI request) not recorded – Cookie data stored elsewhere • Proxy server: a server (cache) that, on receiving a request for the content at some Internet address, first looks for that address in its store of previously fetched content; if the content is there, it is returned immediately, and if not, it is fetched from the server at that address, copied into the store, and then returned to the requester.
Cont… • Missing data • Pages may be cached • Referring page requires client cooperation • When does a session end? • Use of forward and backward pointers • Typically a 30 minute timeout is used • Web content may be dynamic • May not be able to reconstruct what the user saw • Use of spiders and automated agents, which automatically request web pages • Like most data mining tasks, web log mining requires preprocessing • To identify users • To match sessions to other data • To fill in missing data • Essentially, to reconstruct the click stream
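The 30 minute timeout rule for ending a session can be sketched as follows. The example assumes that user identification has already been solved (which, as noted above, is itself a hard problem) and that each request is a hypothetical `(user_id, timestamp, url)` tuple; the function name `sessionize` is likewise made up for this illustration.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # the typical timeout mentioned above


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group requests into per-user sessions.

    requests: iterable of (user_id, timestamp, url) tuples (assumed format).
    Returns a dict mapping user_id to a list of sessions, each a list of URLs.
    """
    sessions = {}
    last_seen = {}
    for user, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        # Start a new session on the first request of a user,
        # or when the gap since the previous request exceeds the timeout.
        if user not in last_seen or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions


t0 = datetime(1999, 11, 10, 10, 0)
example_requests = [
    ("u1", t0, "/"),
    ("u1", t0 + timedelta(minutes=10), "/a"),
    ("u1", t0 + timedelta(minutes=50), "/b"),  # 40 minute gap: new session
    ("u2", t0 + timedelta(minutes=5), "/"),
]
example_sessions = sessionize(example_requests)
```

A pure timeout heuristic cannot see cached pages or distinguish several people behind one proxy, so it only approximates the real click stream.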
Log Data - Simple Analysis • Statistical analysis of users • Length of path • Viewing time • Number of page views • Statistical analysis of site • Most common pages viewed • Most common invalid URL
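These simple statistics amount to counting. The sketch below assumes already parsed log records given as `(uri, status)` pairs and sessions given as lists of URLs; treating status 404 as "invalid URL" and status 200 as a successful page view is an assumption of the example, not something the slides specify.

```python
from collections import Counter


def simple_site_stats(records, top_n=3):
    """Return (most common pages, most common invalid URLs).

    records: iterable of (uri, status) pairs (assumed format).
    """
    records = list(records)
    pages = Counter(uri for uri, status in records if status == 200)
    invalid = Counter(uri for uri, status in records if status == 404)
    return pages.most_common(top_n), invalid.most_common(top_n)


def mean_path_length(sessions):
    """Average number of page views per session."""
    return sum(len(s) for s in sessions) / len(sessions)


example_records = [
    ("/", 200), ("/a", 200), ("/", 200),
    ("/old", 404), ("/old", 404), ("/x", 404),
]
top_pages, top_invalid = simple_site_stats(example_records)
```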
Web Log – Data Mining Applications • Association rules • Find pages that are often viewed together • Clustering • Cluster users based on browsing patterns • Cluster pages based on content • Classification • Relate user attributes to patterns
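Finding pages that are often viewed together can be sketched by counting page pairs across sessions and reporting support and confidence. This is a deliberately small pairwise version, not a full association rule miner such as Apriori; the session format (lists of URLs), the function name, and the `min_support` threshold are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def page_pair_rules(sessions, min_support=0.5):
    """Return rules (a, b, support, confidence) meaning "a implies b".

    support    = fraction of sessions containing both pages
    confidence = sessions containing both / sessions containing a
    """
    n = len(sessions)
    page_count = Counter()
    pair_count = Counter()
    for session in sessions:
        pages = sorted(set(session))  # ignore repeat views within a session
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support >= min_support:
            rules.append((a, b, support, count / page_count[a]))
            rules.append((b, a, support, count / page_count[b]))
    return rules


example_sessions = [
    ["/", "/products", "/cart"],
    ["/", "/products"],
    ["/", "/about"],
    ["/products", "/cart"],
]
example_rules = page_pair_rules(example_sessions)
```

In the example, every session that contains "/cart" also contains "/products", so the rule from "/cart" to "/products" has confidence 1.0 at support 0.5.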
Web Logs • Web servers have the ability to log all requests • Web server log formats: • Most use the Common Log Format (CLF) • New, Extended Log Format allows configuration of log file • Generate vast amounts of data
Common Log Format • Remotehost: browser hostname or IP address • Remote log name of the user • (almost always "-", meaning "unknown") • Authuser: authenticated username • Date: date and time of the request • "request": exact request line from the client • Status: the HTTP status code returned • Bytes: the content-length of the response
Fields • Client IP: 128.101.228.20 • Remote log name and authenticated user ID: - - • Time/Date: [10/Nov/1999:10:16:39 -0600] • Request: "GET / HTTP/1.0" • Status: 200 • Bytes: - • Referrer: "-" • Agent: "Mozilla/4.61 [en] (WinNT; I)"
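The field values above can be assembled into a single log line and parsed with a regular expression. The sketch below is one possible parser, not a reference implementation: the assembled line, the pattern, and the function name `parse_line` are assumptions, and the optional referrer and agent fields are handled as an extension to the seven Common Log Format fields. Month names are parsed with `%b`, which assumes an English locale.

```python
import re
from datetime import datetime

# Seven CLF fields, optionally followed by quoted referrer and agent fields.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # "-" means that no byte count was logged.
    record["bytes"] = None if record["bytes"] == "-" else int(record["bytes"])
    record["date"] = datetime.strptime(record["date"], "%d/%b/%Y:%H:%M:%S %z")
    # The request line is "METHOD URI PROTOCOL" when well formed.
    parts = record["request"].split()
    record["method"], record["uri"], record["protocol"] = (
        parts if len(parts) == 3 else (None, None, None)
    )
    return record


# Line assembled from the field values listed above (the assembly is an assumption).
example_line = (
    '128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] '
    '"GET / HTTP/1.0" 200 - "-" "Mozilla/4.61 [en] (WinNT; I)"'
)
example_record = parse_line(example_line)
```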
Web Usage Mining • Commonly used approaches (Borges and Levene, 1999) • Maps the log data into relational tables before an adapted data mining technique is performed. • Uses the log data directly by utilizing special pre-processing techniques. • Typical problems • Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers (McCallum, et al., 2000; Srivastava, et al., 2000).
Request • Method: GET – Other common methods are POST and HEAD • URI: / • – This is the file that is being accessed. When a directory is specified, it is up to the Server to decide what to return. Usually, it will be the file named “index.html” or “home.html” • Protocol: HTTP/1.0
Status • Status codes are defined by the HTTP protocol. • Common codes include: – 200: OK – 3xx: Some sort of Redirection – 4xx: Some sort of Client Error – 5xx: Some sort of Server Error
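When summarizing a log it is often enough to bucket status codes by their first digit, as the list above does. A minimal sketch, with the function name and the class labels chosen for this example:

```python
def status_class(code):
    """Map an HTTP status code to a coarse class based on its first digit."""
    classes = {
        2: "success",       # e.g. 200 OK
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "other")
```

For instance, `status_class(404)` falls into the client error bucket, which is where invalid URLs typically show up.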
Web Mining Taxonomy • Web Mining • Web Content Mining: Web Page Content Mining, Search Result Mining • Web Structure Mining • Web Usage Mining: General Access Pattern Tracking, Customized Usage Tracking
Mining the World Wide Web: Web Content Mining • Web Page Content Mining • Web Page Summarization • WebOQL (Mendelzon et al., 1998) …: • Web structuring query languages; • Can identify information within given web pages • (Etzioni et al., 1997): Uses heuristics to distinguish personal home pages from other web pages • ShopBot (Etzioni et al., 1997): Looks for product prices within web pages
Mining the World Wide Web: Web Content Mining • Search Result Mining • Search Engine Result Summarization • Clustering Search Result (Leouski and Croft, 1996; Zamir and Etzioni, 1997): • Categorizes documents using phrases in titles and snippets
Mining the World Wide Web: Web Structure Mining • Using Links • PageRank (Brin et al., 1998) • CLEVER (Chakrabarti et al., 1998) • Use interconnections between web pages to give weight to pages. • Using Generalization • MLDB (1994) • Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.
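Using interconnections between pages to give weight to pages can be illustrated with a small power-iteration version of PageRank. This is a sketch of the commonly described formulation, not a reproduction of any production system; the damping factor of 0.85, the uniform handling of pages without outgoing links, and the dict-based graph format are assumptions of the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a dict of PageRank scores that sum to 1.

    links: dict mapping a page to the list of pages it links to
    (hypothetical input format chosen for this sketch).
    """
    pages = sorted(set(links) | {q for t in links.values() for q in t})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Toy graph: "a" and "b" link to "c", and "c" links back to "a".
example_rank = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In the toy graph, "c" collects weight from two pages and passes it on to "a", while "b", which nobody links to, keeps only the baseline share.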
Mining the World Wide Web: Web Usage Mining • General Access Pattern Tracking • Web Log Mining (Zaïane, Xin and Han, 1998) • Uses KDD techniques to understand general access patterns and trends. • Can shed light on better structure and grouping of resource providers.
Mining the World Wide Web: Web Usage Mining • Customized Usage Tracking • Adaptive Sites (Perkowitz and Etzioni, 1997) • Analyzes access patterns of each user at a time. • Web site restructures itself automatically by learning from user access patterns.
Web Content Mining • Agent-based Approaches: • Intelligent Search Agents • Information Filtering/Categorization • Personalized Web Agents • Database Approaches: • Multilevel Databases • Web Query Systems
Intelligent Search Agents • Locating documents and services on the Web: • WebCrawler, Alta Vista (http://www.altavista.com): scan millions of Web documents and create index of words (too many irrelevant, outdated responses) • MetaCrawler: mines robot-created indices • Retrieve product information from a variety of vendor sites using only general information about the product domain: • ShopBot
Intelligent Search Agents (Cont’d) • Rely either on pre-specified domain information about particular types of documents, or on hard-coded models of the information sources to retrieve and interpret documents: • Harvest • FAQ-Finder • Information Manifold • OCCAM • Parasite • Learn models of various information sources and translate these into the agent's own concept hierarchy: • ILA (Internet Learning Agent)
Information Filtering/Categorization • Using various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. • HyPursuit: uses semantic information embedded in link structures and document content to create cluster hierarchies of hypertext documents, and structure an information space • BO (Bookmark Organizer): combines hierarchical clustering techniques and user interaction to organize a collection of Web documents based on conceptual information
Personalized Web Agents • Agents in this category learn user preferences and discover Web information sources based on these preferences, and on those of other individuals with similar interests (using collaborative filtering) • WebWatcher • PAINT • Syskill&Webert • GroupLens • Firefly • others
Multiple Layered Web Architecture • Layer n: more generalized descriptions • ... • Layer 1: generalized descriptions • Layer 0
Multilevel Databases • At the higher levels, meta data or generalizations are • extracted from lower levels • organized in structured collections, i.e. relational or object-oriented databases. • At the lowest level, semi-structured information is • stored in various Web repositories, such as hypertext documents
Multilevel Databases (Cont’d) • (Han et al.): • use a multi-layered database where each layer is obtained via generalization and transformation operations performed on the lower layers • (Khosla et al.): • propose the creation and maintenance of meta-databases at each information providing domain and the use of a global schema for the meta-database
Multilevel Databases (Cont’d) • (King et al.): • propose the incremental integration of a portion of the schema from each information source, rather than relying on a global heterogeneous database schema • The ARANEUS system: • extracts relevant information from hypertext documents and integrates it into higher-level derived Web hypertexts, which are generalizations of the notion of database views
Multi-Layered Database (MLDB) • A multiple layered database model • based on semi-structured data hypothesis • queried by NetQL using a syntax similar to the relational language SQL • Layer-0: • An unstructured, massive, primitive, diverse global information-base. • Layer-1: • A relatively structured, descriptor-like, massive, distributed database by data analysis, transformation and generalization techniques. • Tools to be developed for descriptor extraction. • Higher-layers: • Further generalization to form progressively smaller, better structured, and less remote databases for efficient browsing, retrieval, and information discovery.
Three major components in MLDB • S (a database schema): • outlines the overall database structure of the global MLDB • presents a route map for data and meta-data (i.e., schema) browsing • describes how the generalization is performed • H (a set of concept hierarchies): • provides a set of concept hierarchies that help the system generalize lower-layer information to higher layers and map queries to the appropriate concept layers for processing • D (a set of database relations): • the whole global information base at the primitive information level (i.e., layer-0) • the generalized database relations at the nonprimitive layers
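How a concept hierarchy (H) lets a lower layer be generalized into a higher one can be sketched with a child-to-parent mapping and a counting roll-up. Everything in this example is hypothetical: the hierarchy contents, the row format, and the function names are invented for illustration and are not taken from the MLDB work.

```python
from collections import Counter

# Hypothetical concept hierarchy: each concept maps to its parent concept.
TOPIC_HIERARCHY = {
    "databases": "computer science",
    "data mining": "computer science",
    "computer science": "science",
    "physics": "science",
}


def generalize(value, hierarchy, levels=1):
    """Climb the hierarchy by the given number of levels (stops at the top)."""
    for _ in range(levels):
        value = hierarchy.get(value, value)
    return value


def generalize_layer(rows, attr, hierarchy, levels=1):
    """Build a higher layer: generalize one attribute and count merged rows.

    rows: list of dicts describing lower-layer entries (assumed format).
    Returns a Counter keyed by the generalized row (as sorted item tuples).
    """
    counts = Counter()
    for row in rows:
        generalized = dict(row)
        generalized[attr] = generalize(row[attr], hierarchy, levels)
        counts[tuple(sorted(generalized.items()))] += 1
    return counts


layer_1_rows = [
    {"topic": "databases", "type": "html"},
    {"topic": "data mining", "type": "html"},
    {"topic": "physics", "type": "pdf"},
]
layer_2 = generalize_layer(layer_1_rows, "topic", TOPIC_HIERARCHY)
```

After one level of generalization the two computer science rows collapse into a single entry with a count of 2, which is how higher layers become progressively smaller.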
The General Architecture of WebLogMiner (a Global MLDB) • [Figure. Labels: Site 1, Site 2, Site 3; Concept Hierarchies; Generalized Data; Higher layers; Resource Discovery (MLDB); Knowledge Discovery (WLM); Characteristic Rules; Discriminant Rules; Association Rules]
Techniques for Web usage mining • Construct a multidimensional view on the Weblog database • Perform multidimensional OLAP analysis to find the top N users, top N accessed Web pages, most frequently accessed time periods, etc. • Perform data mining on Weblog records • Find association patterns, sequential patterns, and trends of Web accessing • May need additional information, e.g., user browsing sequences of the Web pages in the Web server buffer • Conduct studies to • Analyze system performance, improve system design by Web caching, Web page prefetching, and Web page swapping
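The multidimensional view and the top N queries can be approximated in a few lines by counting records grouped by any chosen subset of dimensions. This is only a toy stand-in for an OLAP engine; the record fields (`user`, `page`, `hour`) and the function names are assumptions made for the example.

```python
from collections import Counter


def roll_up(records, dims):
    """Count records grouped by the chosen dimensions (a tiny data cube slice)."""
    return Counter(tuple(r[d] for d in dims) for r in records)


def top_n(records, dim, n):
    """Return the n most frequent values of one dimension with their counts."""
    return [(key[0], count) for key, count in roll_up(records, [dim]).most_common(n)]


example_weblog = [
    {"user": "u1", "page": "/", "hour": 10},
    {"user": "u1", "page": "/a", "hour": 10},
    {"user": "u2", "page": "/", "hour": 11},
    {"user": "u1", "page": "/", "hour": 11},
]
top_user = top_n(example_weblog, "user", 1)
top_page = top_n(example_weblog, "page", 1)
by_user_and_hour = roll_up(example_weblog, ["user", "hour"])
```

Passing a different list of dimensions to `roll_up` gives a coarser or finer view of the same records, which is the essence of drilling down and rolling up.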
Web Usage Mining - Phases • Three distinct phases: preprocessing, pattern discovery, and pattern analysis • Preprocessing - the process of converting the raw data into the data abstraction needed to apply the data mining algorithm • Resources: server-side, client-side, proxy servers, or database. • Raw data: Web usage logs, Web page descriptions, Web site topology, user registries, and questionnaires. • Conversion: Content converting, Structure converting, Usage converting
User: The principal using a client to interactively retrieve and render resources or resource manifestations. • Page view: Visual rendering of a Web page in a specific client environment at a specific point in time • Click stream: a sequential series of page view requests • User session: a delimited set of user clicks (click stream) across one or more Web servers. • Server session (visit): a collection of user clicks to a single Web server during a user session. • Episode: a subset of related user clicks that occur within a user session.
Content Preprocessing - the process of converting text, images, scripts and other files into forms that can be used by usage mining. • Structure Preprocessing - the structure of a Website is formed by the hyperlinks between page views; structure preprocessing can be done by parsing and reformatting this information. • Usage Preprocessing - the most difficult task in the usage mining process; it applies data cleaning techniques to eliminate the impact of irrelevant items on the analysis result.
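A typical data cleaning step in usage preprocessing can be sketched as a filter over parsed log records. What counts as "irrelevant" is not specified above, so this example assumes two common choices: requests for embedded files such as images and style sheets, and requests from automated agents. The suffix list, the robot markers, the record format, and the function names are all assumptions.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed examples of embedded resources that are not page views.
IRRELEVANT_SUFFIXES = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico"}
# Assumed substrings that suggest an automated agent in the user-agent string.
ROBOT_MARKERS = ("bot", "crawler", "spider")


def is_relevant(record):
    """Keep a record only if it looks like a page view made by a person."""
    path = urlsplit(record["uri"]).path.lower()
    if posixpath.splitext(path)[1] in IRRELEVANT_SUFFIXES:
        return False
    agent = record.get("agent", "").lower()
    return not any(marker in agent for marker in ROBOT_MARKERS)


def clean(records):
    """Return only the records that pass the relevance filter."""
    return [r for r in records if is_relevant(r)]


example_log = [
    {"uri": "/index.html", "agent": "Mozilla/4.61 [en] (WinNT; I)"},
    {"uri": "/logo.gif?v=2", "agent": "Mozilla/4.61 [en] (WinNT; I)"},
    {"uri": "/index.html", "agent": "ExampleBot/1.0"},
]
cleaned_log = clean(example_log)
```

Substring matching on the agent field is crude and easy to evade, so a real cleaning step would likely combine it with behavioral checks.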