Web Data Mining Research Issues and Techniques

Research Issues in Web Data Mining Sanjay Kumar Madria Department of Computer Science Purdue University, West Lafayette, IN skm@cs.purdue.edu Sourav Bhowmick, Ng Wee Keong and Lim Ee Peng Nanyang Technological University, Singapore copy-right@sanjaymadria

WHOWEDA!www.cais.ntu.edu.sg:8000/~whoweda • A WareHouse Of WEb DAta • Web Information Coupling Model (WICM) • Web Objects • Web Schema • Web Information Coupling Algebra • Web Information Maintenance • Web Mining and Knowledge discovery copy-right@sanjaymadria

Web Objects • Node - url, title, format, size, date, text • Link - source-url, target-url, label, link-type • Web tuple • Web table • Web schema • Web database copy-right@sanjaymadria

User WWW Warehouse Concept Mart Web Querying & Analysis Component Web Information Mining System Web Information Coupling System Web Information Maintenance System Web Mart Web Mart Web Warehouse Web Mart Web Mart

User WWW Web Query & Display Warehouse Concept Mart Global Web Manipulation Global Web Coupling Global Ranking Pre processing Data Visualization Schema Tightness Web Warehouse Data Visualization Web Union Web Select Web Intersection Web Project Local Web Manipulation Local Web Coupling Schema Tightness Local Ranking Schema Search Web Join Schema Match

Web Schema • Structural ‘summary’ of web table • Information Coupling using a Query graph • Query graph ->Web schema • directed graph as ordered 4-tuple: • Set of node variables • Set of link variables • Connectivities • Predicates copy-right@sanjaymadria

copy-right@sanjaymadria

Information Square's homepage Headline article 1 Headline article n News@TCS Local news 1 (List of video files) List of links to local news News specials Local news k World news 1 Airport info List of links to world news World news t copy-right@sanjaymadria

e e x x y y target_url CONTAINS "article” g g f z label CONTAINS "Local News" target_URL CONTAINS "newshub/specials" url CONTAINS "local" h w label CONTAINS "World News" url CONTAINS "world" url contains “headlines” copy-right@sanjaymadria

Information Square's homepage Headline article 1 List of links to local news Local news 1 News specials World news 1 List of links to world news copy-right@sanjaymadria

Schema- example • Node variables: Xn = { x, y, z, w } • Link variable: Xl = { e, f, g } • Connectivities: C = { x<e>y and x<fg->z and x<fh->w } copy-right@sanjaymadria

Predicates • P={x.url=”http://www.mediacity.com.sg/i-square”, • y.url CONTAINS “headlines” • e.target_url CONTAINS "article", • f.target.url CONTAINS "newshub/specials", • g.label CONTAINS "Local News", • z.url CONTAINS "local", • h.label CONTAINS "World News", • w.url CONTAINS "world" } copy-right@sanjaymadria

Query Graph - Example • Query graph - same as schema • Informally, it is directed connected graph consists of nodes, links and keywords imposed on them. • Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at http://www.panacea.org/ • Web tableDiseases copy-right@sanjaymadria

Treatment list q g Treatment http://www.panacea.org/ Issues Symptoms list f y x z Symptoms List of Diseases e Evaluation Evaluation w p

q1 Treatment list g1 Treatment http://www.panacea.org/ Issues f1 Symptoms list x0 y1 z1 Symptoms AIDS List of Diseases e1 Evaluation Evaluation w1 p2 Elisa Test

Example 2 • Produce a list of drugs, and their uses and side effects starting from the web site at http://www.panacea.org/ • Web tableDrugs copy-right@sanjaymadria

Drug list Side effects http://www.panacea.org/ Issues r c a b d Side effects List of Diseases Use s k Uses

Side effects of Indavir Drug list http://www.panacea.org/ Issues AIDS r1 a0 b1 c1 d1 Indavir Side effects List of Diseases Use s1 k1 Uses of Indavir

WWW Data Mining • web structure mining : Web structure mining involves mining the web document’s structures and links. • web content mining : Web content mining describes the automatic search of information resources available on-line. • web usage mining : Web usage mining includes the data from server access logs, user registration or profiles, user sessions or transactions etc. copy-right@sanjaymadria

Web Structure Mining : Issues • Measuring the frequency of the local links (links in the same server) in the web tuples in a web table. • web tuples have more information about inter-related documents that exists at the same server. • measures the completeness of the web site in a sense that most of the closely related information is available at the same site(server). • For example, an airline’s home page will have more local links connecting the “routing information with air-fares and schedules”. copy-right@sanjaymadria

Measuring the frequency of web tuples in a web table containing links which are interior; links which are within the same file. • measures a web document’s ability to cross-reference other related web pages within the same document. • measures the flow of the web documents. copy-right@sanjaymadria

Measuring the frequency of web tuples in a web table that contains links that are global; links which span different web sites. • measures the visibility of the web documents and ability to relate similar or related documents across different sites. • For example, research documents related to “semi-structured data” will be available at many sites and such sites should be visible to other related sites by providing cross references by the popular phrases such as “more related links”. copy-right@sanjaymadria

Measuring the frequency of identical web tuples that appear in a web table or among the web tables. • measures the replication of web documents and may help in identifying the mirrored sites. • What is the in-degree and out-degree of each node (web document)? What is the meaning of high and low in- and out-degrees? • Locating links to popular web sites in the web tuples in a table. • Number of web tuples are returned in response to a query on some popular phrases such as “Bio-science” with respect to queries containing keywords like “earth-science”. copy-right@sanjaymadria

discover the nature of the hyperlinks in the web sites of a particular domain. • What information do they provide and how are they related conceptually. • Is it possible to extract a conceptual hierarchical information for designing web sites of a particular domain. • generalizing the flow of information in web sites representing some particular domain. copy-right@sanjaymadria

Web Bags and Web Structure Mining • Most of the search engines fail to handle the following knowledge discovery goals: • locate the most visible web sites or documents for reference. Many paths (high fan in) can reach that sites or documents. • locate the most luminous web sites or documents for reference. web sites or documents which have the most number of outgoing links. • find the most traversed path for a particular query result. To identify the set of most popular interlinked web documents that have been traversed frequently to obtain the query result. copy-right@sanjaymadria

Applications of Visibility • Association rules • e-commerce copy-right@sanjaymadria

From the results returned, find most visible pages. Assume Z1 is the most visible page with the given threshold. • This gives estimates about different restaurants selling pizzas. • Lower threshold gives you set (Z1, Z2) as visible pages, which sells both pizza and pasta. • Generalize rules such as out of 66% of restaurants which offer pizza to their customers, 33% also offers pasta. copy-right@sanjaymadria

E-commerce application • My web site’s visibility is going down!!!! copy-right@sanjaymadria

Application - Luminosity • Association rules such as X% of all the companies which makes a product “A”, Y% of them also makes a set of products “B and C”. • Exmple - certain companies (33%) if they make a product A also make products B and C. • the company C makes only the product A. • That is, 66% of companies which make a product “A” , 33% of them also make products B and C. copy-right@sanjaymadria

Web Content Mining • what does it mean to mine content from the web? • Is extracting information from a very small subset of all HTML web pages is also an instance of web data mining? • mining a subset of web pages stored in one or more web tables is more feasible option. • Similarity and difference between web content mining in web warehouse context and conventional data mining. copy-right@sanjaymadria

Selection of type of data in the WWW to do web content mining. • Cleaning of selected data to mine effectively. • Types of knowledge that can be discovered in a web warehouse context. • Discovery of types of information hidden in a web warehouse which are useful for decision making. • specify, measure and justify the interestingness of the discovered knowledge • knowledge to be discovered are as follows: generalized relation, characteristic rule, discriminate rule, classification rule, association rule, and deviation rule. copy-right@sanjaymadria

Do the data mining techniques applicable to web mining and if yes, how? For example, we are interested in generating the following types of rules: 40% of web tuples (i..e, web pages) in response to a “travel information query from Hong Kong to Macau” suggest that popular means of traveling is by ferry. • To derive some additional knowledge in a web warehouse for web content mining. • mining previously unknown knowledge in a web warehouse. • Presentation of discovered knowledge to the users to expedite complex decision making. copy-right@sanjaymadria

Web Usage Mining • discovery of user access patterns from web servers; user profile, access pattern for pages, etc. used for efficient and effective web site management and the user behavior. • In WHOWEDA, the user initiates a coupling framework to collect related information. • For example, coupling a query graph “to find the hotel information” with the query graph “to find the places of interest”. • From this query graph, we can generate some user access pattern of coupling framework like “50% of users who query “hotel” also couple their query with “places of interest”. copy-right@sanjaymadria

find coupled concepts from the coupling framework. • helps in organizing web sites. • For example, web documents that provide information on “hotels” should also have hyperlinks to web pages providing information on “places of interest”. copy-right@sanjaymadria

Warehouse Concept Mart • Knowledge discovery in web data becomes more and more complex due to the large number of data on WWW. • build the concept hierarchies involving web data to use them in knowledge discovery. • collection of concept hierarchies a Warehouse Concept Mart (WCM). • concept mart is build by extracting and generalizing terms from web documents to represent classification knowledge of a given class hierarchy. copy-right@sanjaymadria

For unclassified words, they can be clustered based on their common properties. Once the clusters are decided, the keywords can be labeled with their corresponding clusters, and common features of the terms are summarized to form the concept description. • associate a weight at each level of concept marts to evaluate the importance of a term with respect to the concept level in the concept hierarchy. copy-right@sanjaymadria

Web Concept Mart Applications • Intelligent answering of web queries • supply the threshold for a given key word in the warehouse concept mart and the words with the threshold more than the given value can be taken into consideration when answering the query. • use different levels of concepts in the warehouse concept mart or can provide approximate answers. • provide the user some knowledge in framing the global coupling query graph. • Example - DBMS and Oracle • .Web mining and Concept Mart • Mining association rules techniques to mine the association between words appearing in the concept mart at various levels and in the web tuples returned as the result of a query. • Mining knowledge at multiple levels may help WWW users to find some interesting rules that are difficult to be discovered otherwise. • A knowledge discovery process may climb up and step down to different concepts in the warehouse concept mart’s level with user’s interactions and instructions including different threshold values. copy-right@sanjaymadria

Web mining and Concept Mart • mine the association between words appearing in the concept mart at various levels and in the web tuples returned as the result of a query. • Mining knowledge at multiple levels may help WWW users to find some interesting rules that are difficult to be discovered otherwise. • A knowledge discovery process may climb up and step down to different concepts in the warehouse concept mart’s level with user’s interactions and instructions including different threshold values. • capture the flow of web sites of particular domain; helpful in location information copy-right@sanjaymadria

Conclusions • web mining issues in context of the web warehousing project called WHOWEDA (Warehouse of Web Data). • discussed web mining issues with respect to web structure, web content and web usage. • Our focus is to design tools and techniques for web mining to generate some useful knowledge from the WWW data. • We are working on formal algorithms to generate association rules and classification rules. copy-right@sanjaymadria

پایگاه پاورپوینت ایرانwww.txtzoom.comبانک اطلاعات هوشمند پاورپوینت copy-right@sanjaymadria

Web Data Mining Research Issues and Techniques

Web Data Mining Research Issues and Techniques

Presentation Transcript

Issues with Data Mining

Web/Google Data Mining

Web Mining Issues

Web Mining Research : A Survey

Advanced Topics in Data Mining: Web Mining

Issues in Data Mining Infrastructure

Current Research in Data Mining Research Group

Data Mining | Web Scraping

Issues in Data Mining Applications -Tutorial-

Data Mining Web Sites

Current Research in Data Mining Research Group

Issues in Data Mining Infrastructure

PRIVACY AND security Issues IN Data Mining

Issues in Data Mining Infrastructure

Computational and Statistical Issues in Data-Mining

Web-Mining Agents Data Mining

Web Mining Research : A Survey