Web Community Mining and Web Log Mining: Commodity Cluster Based Execution
Romeo Zitarosa
Mining di Dati Web
Overview
• Introduction
• Web Community Mining
• Web Log Mining on MIS
• Parallel Data Mining on PC Cluster
• Performance Evaluation
• Conclusion
Introduction
• Two applications of web mining are proposed:
1) Extracting web communities
2) Understanding the behaviour of mobile Internet users (usage mining)
Web Community Mining
• Def. Web community: a collection of web pages created by individuals or associations that share a common interest in a specific topic.
Proposed technique
• Starts from a set of seeds
• Based on a related page algorithm (RPA)
• Creates a community chart
Authorities and Hubs
• Authority: a page with good content on a topic, linked to by many good hub pages.
• Hub: a page with a list of hyperlinks to valuable pages on a topic, i.e. one that points to good authorities.
• Community core = authorities + hubs
Web Community Mining
• Algorithm:
1. Start from a seed set.
2. Apply RPA to each seed: build a web subgraph and extract hubs and authorities (using HITS).
3. Investigate how each seed derives other seeds as related pages.
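The hub and authority extraction in step 2 can be sketched with a plain HITS iteration. This is a minimal sketch, not the RPA variant used in the presented work: the function name `hits`, the fixed iteration count and the toy link graph are assumptions made for illustration.

```python
from math import sqrt

def hits(links, iterations=50):
    """Plain HITS. `links` maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical subgraph: three hub-like pages pointing to two content pages.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(toy_links)
top_authority = max(auth, key=auth.get)
```

In this toy graph the page linked to by all three hubs ends up with the highest authority score, and the hubs that point to both authorities score higher than the one that points to only one.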
Example
1. Suppose s derives t as a related page and vice versa: "s" and "t" are pointed to by similar sets of hubs.
2. Suppose s derives t as a related page but t does not derive s: "t" is pointed to by many different hubs, so "t" derives a different set of related pages.
Observation
In this way we define a symmetric derivation relationship (s.d.r.) for identifying communities.
Def. Community: a set of pages strongly connected by the s.d.r.
Two communities are related if a member of one community derives a member of the other.
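One way to read the definition above is sketched below: keep only the pairs where s derives t and t derives s, then group pages that are linked through such pairs. Treating "strongly connected" as plain connected components is a simplifying assumption, and the `derives` mapping and page names are hypothetical.

```python
def communities(derives):
    """`derives` maps each seed to the set of seeds it derives as related pages."""
    # Keep only symmetric pairs: s derives t AND t derives s.
    symmetric = {s: set() for s in derives}
    for s, related in derives.items():
        for t in related:
            if s in derives.get(t, set()):
                symmetric[s].add(t)
    # Assumption: a community is a connected component of the symmetric graph.
    seen, result = set(), []
    for start in symmetric:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(symmetric[page] - component)
        seen |= component
        result.append(component)
    return result

# Hypothetical derivation results for five seeds.
derives = {"s": {"t", "u"}, "t": {"s"}, "u": {"v"}, "v": {"u"}, "w": {"s"}}
found = sorted(sorted(c) for c in communities(derives))
```

Here s and t form one community, u and v another, and w stays alone because s does not derive it back. The one-way derivations (s to u, w to s) are what would make those communities related.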
Web Community Chart
• Def. A graph that consists of communities as nodes and weighted edges between nodes. The weight of an edge represents the relevance between the communities it connects.
• We need a tool to browse communities
Web Community Chart (2)
• Labels are assigned manually
• Box = list of URLs sorted by connectivity score
• Def. Connectivity score: the number of derivation relationships from a node to the other nodes of its community.
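The connectivity score and the sorted box can be sketched as follows. The function names, the tie-breaking by page name and the sample data are assumptions; only the score definition comes from the slide.

```python
def connectivity_scores(community, derives):
    """Count, for each page, the derivations that point to other community members."""
    return {
        page: len(derives.get(page, set()) & (community - {page}))
        for page in community
    }

def box(community, derives):
    """List the community's pages, highest connectivity score first."""
    scores = connectivity_scores(community, derives)
    # Ties are broken alphabetically (an assumption, to keep the order stable).
    return sorted(community, key=lambda p: (-scores[p], p))

# Hypothetical community; "x" lies outside it, so a -> x style links do not count.
community = {"a", "b", "c"}
derives = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "x"}}
scores = connectivity_scores(community, derives)
ordered = box(community, derives)
```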
Example [figure]
Mobile Info Search (MIS)
• NTT Laboratories
• Goal: provide location-aware information from the Internet by collecting, structuring, filtering and organizing it.
• www.kokono.net
kokono
There is a database-type resource between the user and the information sources (online maps, yellow pages, etc.)
MIS Functionalities
• User Location Acquisition
- GPS, PHS, postal code
• Location-Oriented Robot-Based Search (kokono)
- searches for documents close to a location
- displays documents in order of the distance between the location written in the document and the user's position
• Location-Oriented Meta Search
- backbone database accessed by CGI programs
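The distance ordering described for the robot-based search can be sketched as a sort on great-circle distance. This is only an illustration, not how kokono computes it: the haversine formula, the document records and the coordinates are all assumptions.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

def rank_by_distance(documents, user_lat, user_lon):
    """Order documents by the distance from their written location to the user."""
    return sorted(
        documents,
        key=lambda d: haversine_km(user_lat, user_lon, d["lat"], d["lon"]),
    )

# Hypothetical documents with the location extracted from their text.
documents = [
    {"url": "doc-far", "lat": 35.00, "lon": 135.00},
    {"url": "doc-near", "lat": 35.66, "lon": 139.70},
    {"url": "doc-mid", "lat": 35.44, "lon": 139.64},
]
ranked = [d["url"] for d in rank_by_distance(documents, 35.68, 139.76)]
```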
Association Rule Mining
• Support, confidence
• Hierarchy => taxonomy
• The hierarchy makes it possible to find rules not only for a specific location but also for the wider area that covers that location.
• Identify access patterns of MIS users.
• Prefetch information.
• Reduce access time.
• Spatial information gives valuable information to mobile users.
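Support, confidence and the role of the taxonomy can be illustrated with a small sketch: each access record is extended with the ancestors of its items, so that a rule about a wider area is counted as well. The taxonomy, the records and the function names are hypothetical.

```python
def extend_with_ancestors(transaction, parent):
    """Add every taxonomy ancestor of every item to the transaction."""
    items = set(transaction)
    for item in transaction:
        while item in parent:
            item = parent[item]
            items.add(item)
    return items

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical taxonomy (child -> parent) and access records.
parent = {"Shibuya": "Tokyo", "Shinjuku": "Tokyo"}
raw = [{"Shibuya", "weather"}, {"Shinjuku", "weather"},
       {"Shibuya", "map"}, {"Yokohama", "weather"}]
transactions = [extend_with_ancestors(t, parent) for t in raw]

narrow = support({"Shibuya", "weather"}, transactions)
wide = support({"Tokyo", "weather"}, transactions)
conf = confidence({"Tokyo"}, {"weather"}, transactions)
```

With these records the rule for the wider area has twice the support of the rule for the single location, which is the effect the hierarchy bullet describes.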
Sequential Rule Mining
• Sequential patterns
• Derive how different services are used together.
Example: deciding on a plan after checking the weather:
Submit_weather = Weather Forecast
submit_shop = Shop Info && shop_web = townpage
Submit_kokono = KOKONOSearch
Submit_map = MAP
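A sequential pattern is supported by a session when its steps occur in the same order, not necessarily next to each other. The sketch below counts that support; the sessions are hypothetical and reuse only the service names from the example above.

```python
def contains_in_order(session, pattern):
    """True if the steps of `pattern` appear in `session` in the same order."""
    remaining = iter(session)
    # `step in remaining` consumes the iterator up to the first match,
    # so later steps can only match later positions.
    return all(step in remaining for step in pattern)

def sequence_support(pattern, sessions):
    """Fraction of sessions that contain `pattern` as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sessions) / len(sessions)

# Hypothetical user sessions.
sessions = [
    ["Weather Forecast", "Shop Info", "KOKONOSearch", "MAP"],
    ["Weather Forecast", "MAP"],
    ["Shop Info", "Weather Forecast"],
    ["Weather Forecast", "KOKONOSearch", "Shop Info"],
]
weather_then_shop = sequence_support(["Weather Forecast", "Shop Info"], sessions)
weather_then_map = sequence_support(["Weather Forecast", "MAP"], sessions)
```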
Parallel DM and PC Cluster
• Parallel Apriori
- nodes keep all candidate itemsets
- nodes scan the dataset independently
- nodes communicate only at the end of each phase
Problem: too much memory is used!
Solution (partial): Hash Partitioned Apriori (HPA)
- candidates are partitioned using a hash function
- each node builds candidate itemsets
- a lot of disk I/O when the support is small
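The partitioning idea behind HPA can be sketched in a single process: a hash function decides which node owns each candidate, so every node holds only part of the candidate table. This is a simplified simulation, not the published algorithm: it counts all item pairs instead of pruned candidates, and `owner`, the CRC32 hash and the sample data are assumptions.

```python
from itertools import combinations
import zlib

def owner(itemset, num_nodes):
    """Node that owns a candidate itemset, chosen by a deterministic hash."""
    key = ",".join(sorted(itemset)).encode()
    return zlib.crc32(key) % num_nodes

def hpa_count_pairs(partitions, num_nodes):
    """Simulate one counting pass; counts[n] is the table held by node n."""
    counts = [dict() for _ in range(num_nodes)]
    for local_transactions in partitions:      # each node scans its own data
        for transaction in local_transactions:
            for pair in combinations(sorted(transaction), 2):
                node = owner(pair, num_nodes)  # "send" the pair to its owner
                counts[node][pair] = counts[node].get(pair, 0) + 1
    return counts

# Hypothetical data set split across two nodes.
partitions = [[{"a", "b", "c"}, {"a", "b"}], [{"a", "c"}, {"b", "c"}]]
counts = hpa_count_pairs(partitions, num_nodes=2)

# Every candidate lives on exactly one node, so the tables merge without overlap.
merged = {}
for table in counts:
    for pair, n in table.items():
        assert pair not in merged
        merged[pair] = n
```

Because no candidate is duplicated across nodes, the memory needed per node shrinks as nodes are added, at the cost of sending itemsets between nodes during the scan.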
Parallel Algorithms for Association Rule Mining
• Non-partitioned generalized (NPGM)
• Hash partitioned (HPGM)
- reduces communication
• Hierarchical HPGM (H-HPGM)
- candidates whose root is identical are allocated to the same node
• H-HPGM with Fine Grain Duplicates (H-HPGM-FGD)
- uses the remaining free space
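The allocation rule of H-HPGM can be sketched by hashing the taxonomy roots of a candidate's items rather than the items themselves, so that candidates sharing the same roots land on the same node. The taxonomy, the function names and the CRC32 hash are assumptions made for illustration.

```python
import zlib

def root_of(item, parent):
    """Follow child -> parent links up to the top of the taxonomy."""
    while item in parent:
        item = parent[item]
    return item

def h_hpgm_owner(itemset, parent, num_nodes):
    """Pick the owning node from the roots of the items, not the items themselves."""
    roots = sorted(root_of(item, parent) for item in itemset)
    return zlib.crc32(",".join(roots).encode()) % num_nodes

# Hypothetical taxonomy (child -> parent).
parent = {"Shibuya": "Tokyo", "Shinjuku": "Tokyo", "Tokyo": "Kanto",
          "ramen": "food", "sushi": "food"}

node_a = h_hpgm_owner({"Shibuya", "ramen"}, parent, num_nodes=4)
node_b = h_hpgm_owner({"Shinjuku", "sushi"}, parent, num_nodes=4)
node_c = h_hpgm_owner({"Tokyo", "food"}, parent, num_nodes=4)
```

All three candidates resolve to the same pair of roots, so a node can count an itemset together with its ancestors without asking another node.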
Performance Evaluation
Observation: execution time increases as the support becomes smaller.
Conclusion
• Real web mining applications need high-performance computing systems
• A PC cluster, with its scalable performance (and high costs), is a promising platform…