Action 2: Mine the Web
Industrial Day, Rome, 10 June 2004
Action 2 - Partners
• Dipartimento di Informatica, Università di Pisa
• KDD & HPC Labs, ISTI-CNR, Pisa
• ICAR-CNR, Cosenza
ECD - Industrial Day, Rome, 10 June 2004
Action 2 – Mine the Web
• The project: four Work Packages (Action Coordinator: Dott. Fosca Giannotti, ISTI-CNR)
  • Work Package 2.1. Web Mining (UNIPI, ISTI, ICAR)
    • WP Coordinator: Dott. Salvatore Ruggieri, Dip. Informatica
  • Work Package 2.2. Indexing and compression (UNIPI)
    • WP Coordinator: Prof. Paolo Ferragina, Dip. Informatica
  • Work Package 2.3. Managing Terabytes (ISTI, ICAR)
    • WP Coordinator: Dott. Raffaele Perego, ISTI-CNR
  • Work Package 2.4. Participatory Search Services (UNIPI)
    • WP Coordinator: Prof. Maria Simi, Dip. Informatica
Action 2 – Mine the Web
• The main goals of the ECD Project, content enhancement and delivery, are pursued here in a way complementary to Action 1
• The focus is on Delivering Enhanced Web Contents to (Communities of) Users:
  • Exploiting Web Mining to extract knowledge/models that can enhance the efficacy and efficiency of the various phases of the information search process
  • Designing, validating and providing efficient and scalable solutions for retrieving, storing, and delivering Web contents to users
Motivations
• On-line data grows rapidly:
  • 50+M new pages/day (source: IBM)
  • 100+k news items, articles/day (source: IBM)
  • Databases, digital libraries, etc.
• Internet use tracking produces additional interesting data:
  • Server logs, WSE logs, network traffic logs
• Goldman Sachs estimates (2002): “between 80 and 90 percent of information on the Internet and corporate networks is unstructured”
Motivations
• The limits of the current means of access to web contents are becoming clear
  • Low precision and quality, difficulty of matching users’ subjective relevance
    • over-abundance of low-quality web material
    • low coverage and freshness
    • much relevant information is in the hidden web
    • ranking mechanisms penalize important pages that have only recently appeared
  • Difficulties in
    • managing size, complexity, heterogeneity
    • identifying patterns and trends within huge amounts of unstructured contents
Web Mining plays an important role: it makes it possible to synthesize and extract valuable information and knowledge
Web Mining
Web Mining: exploiting Data Mining techniques on data coming from the Web
Data Mining: the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other repositories
Goal: assist users or site owners in finding something useful/interesting/relevant
• User-Centric View (Client-Side)
  • discovery of documents on a subject
  • discovery of semantically related documents or document segments
  • extraction of relevant knowledge about a subject from multiple sources
• Owner-Centric View (Server-Side)
  • increasing contact / conversion efficiency (Web marketing)
  • targeted promotion of goods, services, products, ads
  • measuring effectiveness of site content / structure
  • providing dynamic personalized services or content
Web Mining Taxonomy
Web Mining
• Web Usage Mining
• Web Content Mining
• Web Structure Mining
Sample Web server access log entries:
131.114.21.41 - - [27/May/2004:19:24:00 +0200] "GET /images/finger.jpg HTTP/1.1" 304 -
131.114.21.41 - - [27/May/2004:19:24:00 +0200] "GET /images/logokdd.jpg HTTP/1.1" 304 -
131.114.21.41 - - [27/May/2004:19:24:09 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 200 131072
131.114.21.41 - - [27/May/2004:19:24:12 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 206 196608
131.114.21.41 - - [27/May/2004:19:24:13 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 206 338224
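Entries like the ones on this slide are the raw input of Web Usage Mining. As a minimal sketch, assuming the entries follow the Common Log Format (host, identity, user, timestamp, request, status, size), they could be parsed as follows. The function name `parse_log_line` and the returned field names are illustrative choices, not part of the project.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [timestamp] "request" status bytes
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    parts = m.group("request").split()
    method, url = (parts[0], parts[1]) if len(parts) >= 2 else (None, None)
    size = m.group("size")
    return {
        "host": m.group("host"),
        "time": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": method,
        "url": url,
        "status": int(m.group("status")),
        # "-" means that no body was sent (e.g. a 304 Not Modified answer)
        "bytes": 0 if size == "-" else int(size),
    }

sample = ('131.114.21.41 - - [27/May/2004:19:24:00 +0200] '
          '"GET /images/finger.jpg HTTP/1.1" 304 -')
record = parse_log_line(sample)
```

Grouping such records by host and by time gaps is one common way to approximate user sessions.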
Web Mining Applications
• Web Usage Mining
  • discovering customer preference and behavior
  • Web personalization / collaborative filtering
  • adaptive Web sites / improving Web site organization
  • e-business intelligence, etc.
• Web Content Mining
  • information filtering / knowledge extraction
  • Web document categorization
  • discovery of ontologies on the Web, etc.
• Web Structure Mining
  • Finding "quality" or "authoritative" sites based on linkage and citations
    • IBM CLEVER project
    • Google
    • Etc.
Some related projects
• WebFountain - IBM
• WebBase - Stanford DBGroup
WebFountain (IBM)
Infrastructure for Advanced Text Analytics: finds patterns, trends and relationships in text
• Input sources (from the figure): World-Wide Web, News Forums, Weblogs, etc.; Newspapers, Magazines, etc.; Customer Electronic Text
• Application Examples:
  • Marketing
  • Intelligence
  • Research
WebFountain: an infrastructure for Advanced Text Analytics applications
WebFountain: Reputation Tracking
WebBase (Stanford DBGroup)
WebBase Challenges
• Archiving
  • “units”
  • coordination
• IP Management
  • copy access
  • link access
  • access control
• Hidden Web
• Topic-Specific Collection Building
• Scalability
  • crawling
  • archive distribution
  • index construction
  • storage
• Consistency
  • freshness
  • versions
• Dissemination
Action 2 – Mine the Web: application scenario
• So far, hardly any approach analyzes how a given group of users accesses the Web with the aim of exploiting usage information to give the users of that group enhanced access to web resources
• We think that usage data of a group of web users can yield new models and patterns that, in combination with document content and structure, enable enhanced content access and delivery
  • better search services, better categorization and document classification services, better question answering services
Action 2 – Mine the Web
• Ambitious objective: exploit the combination of Web data about USAGE, STRUCTURE and CONTENT originated/accessed by a Virtual Organization, to improve the efficacy and efficiency of the knowledge extraction process from the users' point of view
• Developing solutions that are:
  • Innovative w.r.t. the state of the art
  • Appropriate for the Web domain
Virtual Organizations
[Figure: a Virtual Community and the Internet]
Tracking Virtual Organizations
• Tracking the interaction of the virtual community with the Internet allows us to collect a variety of interesting information
• Network traffic data provide detailed information about:
  • Usage
    • Preferred sites, user sessions
  • Content
    • Accessed documents
  • Structure
    • From client sessions we can build the usage Web subgraph
    • By parsing the retrieved documents we can build the corresponding link graph
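The two graphs mentioned above can be illustrated with a small sketch. It assumes that sessions are already available as ordered lists of visited URLs and that retrieved documents are HTML strings keyed by URL. The names `traffic_graph`, `link_graph` and `LinkExtractor`, and the `example.org` data, are hypothetical and only serve the illustration.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin

def traffic_graph(sessions):
    """Usage graph: count how often page dst was visited right after page src."""
    edges = Counter()
    for pages in sessions:
        for src, dst in zip(pages, pages[1:]):
            edges[(src, dst)] += 1
    return edges

class LinkExtractor(HTMLParser):
    """Collect the absolute targets of <a href="..."> tags in one document."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def link_graph(documents):
    """Link graph: map each retrieved URL to the sorted list of URLs it links to."""
    graph = {}
    for url, html in documents.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        graph[url] = sorted(set(parser.links))
    return graph

# Hypothetical data, for illustration only
sessions = [
    ["http://example.org/", "http://example.org/a.html"],
    ["http://example.org/", "http://example.org/a.html", "http://example.org/b.html"],
]
documents = {"http://example.org/": '<a href="a.html">A</a> <a href="/b.html">B</a>'}
usage = traffic_graph(sessions)
links = link_graph(documents)
```

Comparing the two structures shows which existing links are actually followed by the community and which are not.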
Tracking Virtual Organizations: link and traffic graphs
[Figure: traffic graph and link graph of the Virtual Community]
We need an infrastructure: the Web Object Store (WOS)
• A Web Data Management System optimized to efficiently handle content, usage, and structure web data
• Manages large collections of
  • Web pages
  • Preprocessed usage data
  • Structure data
• Collected within our virtual community
Purpose: enable (possibly) innovative Web IR and Web Mining research by locally providing a small, but significant, portion of the Web built according to our user-centric view
WOS and related activities
WOS features:
• Persistent store of objects
• Web data management system for web content, structure and usage data
• Management of data at many abstraction levels
• Fast development of new applications
• Easy C++ annotation of new persistent objects
• Read and write data in tables
WOS layers:
• Clustering/Pattern/Classification Web Mining algorithms
• Efficient and scalable access methods:
  • IXE b-trees, full-text indexes
  • search in compressed data
• Efficient and scalable storage:
  • IXE persistent objects
  • compression
  • distributed architecture
• Data cleaning, preprocessing, filtering
• Population:
  • raw traffic data of our community
  • IXE Crawler
  • Participatory search
Related activities:
• Clustering e-mails
• Caching of documents and of query results
• Efficient and scalable pattern mining and clustering algorithms
• Enhanced compression methods
• Clustering/categorizing query result snippets
• Clustering XML documents
• Etc.
WOS applications
• Some innovative applications are currently being pursued within our project:
  • Characterization, on the basis of usage only or of usage + contents + structure, of new important emerging sites, or of irrelevant sites (e.g., advertising sites)
    • crucial to steer the crawler of the community web repository towards fresh, relevant documents while avoiding unimportant ones
  • Page ranking based also on usage information, for a more accurate and dynamic measurement of document relevance
  • Recommendation of similar/related documents and keywords, on the basis of combined usage/content analysis
  • Caching and clustering of web search results
WOS population: usage data (WP 2.1)
• We collected long periods of proxy-level IP traffic originating from the SERRA network (domain unipi.it)
  • The whole University of Pisa
  • Many-to-many interactions
  • Inter-site user sessions
• Massive data
  • Millions of HTTP requests per day
  • ~1 GB/day of raw data
WOS population: content data (WP 2.4)
• Methods to gather contents to populate the Web Object Store
  • IXE Crawler
  • Participatory Search System (main activity this year)
  • Hidden Web Search
WOS population: content data (WP 2.4)
• IXE crawler
[Figure: crawler loop. init with the initial URLs, then repeatedly: get next URL, get page from the Internet, extract URLs; the retrieved web pages are stored]
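The loop in the figure (init, get next URL, get page, extract URLs) can be sketched in a few lines. This is a minimal single-threaded illustration and not the IXE crawler: the `fetch` callable, the `max_pages` limit and the in-memory `FAKE_WEB` dictionary are assumptions introduced so that the example runs without network access.

```python
import re
from collections import deque
from urllib.parse import urljoin

HREF = re.compile(r'href="([^"]+)"')

def crawl(initial_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns the page HTML or None on failure."""
    frontier = deque(initial_urls)          # init
    seen = set(initial_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()            # get next url
        html = fetch(url)                   # get page
        if html is None:
            continue
        pages[url] = html                   # store the web page
        for href in HREF.findall(html):     # extract urls
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# Hypothetical three-page web used instead of real HTTP requests
FAKE_WEB = {
    "http://example.org/": '<a href="a.html">A</a><a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a>',
    "http://example.org/b.html": '<a href="c.html">C</a>',
}
pages = crawl(["http://example.org/"], FAKE_WEB.get)
```

A production crawler would replace `fetch` with asynchronous HTTP requests and add politeness rules, which are outside the scope of this sketch.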
IXE Crawler
• Parallel/distributed crawler
• High performance through:
  • asynchronous I/O (500 connections/thread)
  • asynchronous DNS resolution
  • keep-alive connections
  • multi-threading
  • URL compression
• 9 Mb/sec transfer rate (7 times the nutch.org crawler)
Participatory search: the idea
• Participatory search:
  • each participant builds an index of its local contents and sends it to a central server
  • the central server implements a community search service by collecting and merging the participants' indexes
• A model that fits community needs for dedicated search services
• A trade-off between a centralized search model (e.g., Google) and a distributed approach (e.g., Gnutella, Kazaa)
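The idea can be illustrated with a toy inverted index. The sketch below assumes whitespace tokenization and boolean AND queries. The function names and the participant and document identifiers are hypothetical, and nothing here reflects the actual index format exchanged in the project.

```python
from collections import defaultdict

def build_local_index(documents):
    """Participant side: map each term to the set of local document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def merge_indexes(participant_indexes):
    """Server side: merge local indexes, qualifying document ids by participant."""
    merged = defaultdict(set)
    for participant, index in participant_indexes.items():
        for term, doc_ids in index.items():
            merged[term].update((participant, d) for d in doc_ids)
    return merged

def search(merged, query):
    """Return the (participant, doc_id) pairs that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(merged.get(t, set()) for t in terms))
    return sorted(hits)

# Hypothetical participants and documents
pisa = build_local_index({"p1": "web mining tutorial", "p2": "suffix tree"})
cosenza = build_local_index({"c1": "clustering web sessions"})
merged = merge_indexes({"pisa": pisa, "cosenza": cosenza})
```

Here only the indexes travel to the server, so each participant keeps control over which documents are indexed and when the index is republished.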
Participatory Search
[Figure: several participant nodes, each running a Crawler and an Indexer (CI), some also a Search Engine (CIS); the flow goes from Documents to Index to Search to Search results]
Legend: C – Crawler, I – Indexer, S – Search Engine
Participatory Search: benefits
• Participants are in charge of
  • selecting what to index and to publish
  • deciding when to publish (no need for coordination with an external crawler)
• Control over index updates and freshness
• Publishing of Hidden Web content
Storage and access methods: compression (WP 2.2)
[Figure: a compressor A maps an input string s to an output c; the Booster applied to A yields the output c’]
Our technique takes a poor compressor A and turns it into a compressor Aboost with a better performance guarantee
• The better A is, the better Aboost is
• The more compressible s is, the better Aboost is
Qualitatively, we show that
• c’ is shorter than c, if s is compressible
• Time(Aboost) = Time(A), i.e. no slowdown
• A is used as a black box
Key Components: Burrows-Wheeler Transform, Suffix Tree, and a Greedy processing of them
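Of the key components listed above, the Burrows-Wheeler Transform is the easiest to show in isolation. The sketch below is a naive implementation based on sorting all rotations, meant only to show what the transform does. It is not the boosting technique itself, and the `$` end-of-string sentinel is an assumption of this example.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform: last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("the sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column, sentinel="$"):
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]

transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform tends to group equal characters together, which is what makes the transformed string easier to compress. Practical implementations build it from a suffix array or suffix tree rather than by sorting full rotations.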
Storage and access methods (WP 2.1 and 2.2)
• Repository of URLs
  • Compressed
  • Prefix and suffix search within URLs
  • Search by hostname, path, file extension, …
    select count(*) from … where url LIKE 'http://%.it/%.asp'
  • Up to two orders of magnitude faster than a sequential scan or a B-tree
  • Space occupancy << B-tree
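To make the notion of prefix and suffix search concrete, here is a small uncompressed sketch based on two sorted lists and binary search. It is a simple stand-in technique, not the compressed index described on the slide. The class name `UrlRepository` and the example URLs are hypothetical.

```python
from bisect import bisect_left

class UrlRepository:
    """Toy URL store: one sorted list of URLs and one of reversed URLs."""

    def __init__(self, urls):
        self.forward = sorted(set(urls))
        self.backward = sorted(u[::-1] for u in self.forward)

    @staticmethod
    def _range(sorted_keys, prefix):
        # All keys starting with prefix form one contiguous run in sorted order
        lo = bisect_left(sorted_keys, prefix)
        hi = bisect_left(sorted_keys, prefix + "\U0010FFFF")
        return sorted_keys[lo:hi]

    def with_prefix(self, prefix):
        return self._range(self.forward, prefix)

    def with_suffix(self, suffix):
        # A suffix of a URL is a prefix of the reversed URL
        matches = self._range(self.backward, suffix[::-1])
        return sorted(k[::-1] for k in matches)

repo = UrlRepository([
    "http://a.example.it/index.asp",
    "http://a.example.it/home.html",
    "http://b.example.it/login.asp",
    "http://c.example.org/a.asp",
])
by_host = repo.with_prefix("http://a.example.it/")
by_ext = repo.with_suffix(".asp")
```

A pattern such as the LIKE query above combines a suffix condition on the hostname with a suffix condition on the path, so it would need both lookups plus a filtering step.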
Storage and access methods: index compression (WP 2.3)
• Assigning DocIDs in a clever way can improve the compression factor of traditional variable-[bit/byte] encoding methods by increasing the number of small DGaps
• Clustering property: within each posting list there are dense zones (i.e. many small DGaps)
• Our problem consists of enhancing the clustering property of posting lists
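The effect of small DGaps can be checked with a short sketch of gap encoding followed by a variable-byte code. The byte layout used here (7 data bits per byte, high bit set on all bytes but the last of each number) is one common convention and an assumption of this example. The posting lists are made up.

```python
def to_gaps(postings):
    """Turn a sorted list of DocIDs into DGaps (differences between consecutive ids)."""
    gaps, previous = [], 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def vbyte_encode(numbers):
    """Variable-byte code: 7 bits per byte, high bit set when more bytes follow."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers

scattered = [3, 1000, 200000]   # same term, DocIDs far apart
clustered = [3, 4, 5]           # same documents after a favorable DocID reassignment
size_scattered = len(vbyte_encode(to_gaps(scattered)))
size_clustered = len(vbyte_encode(to_gaps(clustered)))
```

With these made-up lists the scattered assignment needs 6 bytes and the clustered one 3 bytes, which is the kind of gain that reassigning DocIDs aims at.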
Compression Enhancement
Content delivery (WP 2.1, 2.2 and 2.3)
• Web Caching
  • Mining of web/proxy server requests aimed at improving LRU-based document caching (WP 2.1)
• Recommendation system
  • (On-line/Off-line) mining of web sessions aimed at profiling users and recommending related pages to them (WP 2.1, 2.3)
• Transactional Clustering
  • Clustering specialized on transactional data, aimed at categorizing web pages, user sessions, snippet sequences, search engine results (WP 2.1, 2.2)
Content delivery (WP 2.3)
• SUGGEST: a recommendation system made up of two distinct modules
  • Offline: model extraction by a clustering algorithm which partitions the Usage Graph
  • Online: user classification and suggestion generation
• The WOS remarkably shortened implementation time (< 500 C++ lines)
• We used three WOS objects to produce a persistent clustering structure
[Figure labels: Citation, PageView, Session, sCluster]
Content delivery (WP 2.2)
Goal: retrieve the pages which match the user needs. This is a very difficult task, given that:
• the Web size is increasing, and so is the number of answers
• Web coverage is a problem for a single search engine
• Web pages are heterogeneous
• user needs are subjective and time-varying
• the “list of keywords” paradigm for a user query may be ambiguous
SnakeT: clusters the web snippets returned by one or more search engines into hierarchically labeled folders, which are created on the fly to capture the various meanings of the answers returned for a user query
SnakeT: An example of use
SnakeT: An example of use
Look at the DEMO
Content delivery (WP 2.1)
• Clustering of
  • E-mails (Manco)
  • XML documents (Chiara)
  • ??
Ongoing and future activities
• Work in progress
  • Pursuing our goal of exploiting USAGE, STRUCTURE and CONTENT Web data to improve the efficacy and efficiency of the user's interaction with the Web
  • Implementation of additional WOS layers
    • Compression booster, XML clustering
• Future work (medium-long term)
  • WOS, final version
  • Community-oriented ranking
  • Content (news, XML, ...) clustering
  • Cooperation with Nutch.org (Doug Cutting in Pisa next October)
  • etc.
Deployment scenarios
• Concerning the role of the WOS and of the ECD applications, three (non-exclusive) deployment scenarios can be envisaged:
  • The WOS is a research infrastructure, in the spirit of the WebBase project at Stanford University
  • The WOS is an infrastructure for web analytics services to be offered to third parties, in a spirit close to the IBM WebFountain project
  • The WOS can become a product for Web Data Management Systems aimed at developing and engineering web mining ECD applications, again in a spirit close to WebBase
Demo Session
• Three demos here
  • WOS: browsing usage data (Mirko Nanni, Vincenzo Bacarella)
  • SnakeT: Web snippets clustering (Paolo Ferragina, Antonio Gullì)
  • ANTIX: Participatory Search System (Andrea Esuli)
• Some other activities are described in the Posters
More information
• These slides, further information, documents and the full list of publications are available at:
  • http://ecd.isti.cnr.it