
Web Data Mining (Mining di dati web)


PamelaLan

Presentation Transcript


    Slide 1:Web Data Mining (Mining di dati web)

    Academic year 2006/2007

    Slide 2:The Course

    Code: nw451
    Abbreviation: MDW
    Credits: 6
    Schedule: Wednesday and Friday 16:00-18:00, room B
    Office hours: request an appointment by e-mail, c/o ISTI, Area Ricerca CNR, località San Cataldo, Pisa, entrance 19

    Slide 3:Instructors

    Raffaele Perego, raffaele.perego@isti.cnr.it, tel. 0503152993
    Claudio Lucchese, claudio.lucchese@isti.cnr.it, tel. 0503152967
    Fabrizio Silvestri, fabrizio.silvestri@isti.cnr.it, tel. 0503153011
    Diego Puppin, diego.puppin@isti.cnr.it, tel. 0503153011
    Antonio Panciatici, antonio.panciatici@isti.cnr.it, tel. 0503152967

    Slide 4:Course objectives

    The World Wide Web (WWW) has changed the way information is conceived, made accessible, and managed.
    Discovering previously unknown, non-trivial, and relevant information on the Web is increasingly important and increasingly difficult.
    Web mining has therefore become fundamental to optimizing strategic tools such as e-commerce sites, search engines, and directories.
    The course aims to provide tools and knowledge in this area.

    Slide 5:Course contents

    Introduction: Data Mining, Knowledge Discovery and the Web
    Search engines: crawling, indexing, querying
    Web Content Mining: similarity, clustering, text classification
    Web Structure Mining: social networks, ranking, etc.
    Web Usage Mining: recommender systems, etc.
    Advanced topics (?!)

    Slide 6:Course materials

    Textbook
        Mining the Web: Discovering Knowledge from Hypertext Data. S. Chakrabarti. Morgan Kaufmann, 2003.
    Recommended books
        Managing Gigabytes. I.H. Witten, A. Moffat and T.C. Bell. Morgan Kaufmann, 1999.
        Modern Information Retrieval. R. Baeza-Yates and B. Ribeiro-Neto. Addison Wesley, 1999.
    Lecture slides and articles
        Published at http://malvasia.isti.cnr.it/~raffaele/webmining

    Slide 7:Course materials

    Thanks to:
        Chakrabarti and Ramakrishnan, for the slides accompanying the textbook, downloadable at http://www.cse.iitb.ac.in/~soumen/mining-the-web/
        Fosca Giannotti and Dino Pedreschi, for the introductory slides borrowed from the TDM course
        KDNUGGETS (http://www.kdnuggets.com)
        Ferragina, Attardi, Garcia Molina, etc.
        The Internet :-)

    Slide 8:Exam

    Prerequisites (recommended)
        AA270 - TDM - Tecniche di "Data Mining" (Data Mining Techniques) - first semester
    Exam format
        Passing the exam requires the successful completion of a project (individual or group?) and an oral discussion of the course contents (seminar on a paper of the student's choice?).

    Slide 9:Introduction

    Data Mining and Knowledge Discovery
    Hypertext and a brief history of the Web
    Web Mining

    Slide 10:What is DM?

    Slide 11:What is DM?

    Slide 12:Motivations for DM

    Data explosion problem: automated data collection tools, mature database technology, and the Internet lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
    "We are drowning in information, but starving for knowledge!" (John Naisbitt)
    Data mining: extraction of interesting knowledge (rules, regularities, patterns, constraints) from large amounts of data.

    Slide 13:Motivations for DM

    Abundance of business and industry data
    Competitive focus - Knowledge Management
    Inexpensive, powerful computing engines
    Strong theoretical/mathematical foundations
        machine learning & logic
        statistics
        database management systems
    Etc.

    Slide 14:Sources of Data (e.g.)

    Business Transactions
        widespread use of bar codes => storage of millions of transactions daily (e.g., Walmart: 2000 stores => 20M transactions per day, credit card records!!)
        most important problem: effective use of the data in a reasonable time frame for competitive decision-making
        e-commerce data
    Scientific Data
        data generated through a multitude of experiments and observations
        examples: geological data, satellite imaging data, NASA earth observations, CERN HEP
        rate of data collection far exceeds the speed at which we can analyze the data
    Financial Data
        company information
        economic data (GNP, price indexes, etc.)
        stock markets

    Slide 15:Sources of Data (e.g.)

    Personal / Statistical Data
        government census
        medical histories
        customer profiles
        demographic data
        data and statistics about sports and athletes
    World Wide Web and Online Repositories
        billions of Web documents, images, video, etc.
        emails, news, messages
        link structure of the hypertext from millions of Web sites
        Web usage data (from server/proxy logs, network traffic, and user registrations)
        online databases and digital libraries

    Slide 16:Classes of DM applications

    Database analysis and decision support
        Market analysis: target marketing, customer relationship management, market basket analysis
        Risk analysis: forecasting, customer retention, quality control, competitive analysis
        Fraud detection
    Text mining
        E.g. mining opinions from email, documents

    Slide 17:Classes of DM applications

    THE WEB!!
        Searching: Google, Ask Jeeves, Yahoo, etc.
        Social network analysis
        Web advertising
        E.g. IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
        Watch for the PRIVACY pitfall!
    Many others...
        Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat.
        Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.

    Slide 18:What is KDD? A process!

    The selection and processing of data for:
        the identification of novel, accurate, and useful patterns, and
        the modeling of real-world phenomena.
    Data mining is a major component of the KDD process: automated discovery of patterns and development of predictive and explanatory models.

    Slide 19:The KDD process

    Slide 20:The KDD Process in Practice

    KDD steps can be merged or combined
        Data Selection + Data Transformation = Data Consolidation
        Data Cleaning + Data Integration = Data Preprocessing
    KDD is an iterative process
        art + engineering rather than science

    Slide 21:The virtuous cycle

    [Figure: cycle diagram. Stages: Identify Problem or Opportunity, Act on Knowledge, Measure Effect of Action. Labels: Problem, Knowledge, Results, Strategy.]

    Slide 22:The steps of the KDD process

    Learning the application domain: relevant prior knowledge and goals of the application
    Data consolidation: creating a target data set
    Selection and preprocessing
        Data cleaning (may take 60% of effort!)
        Data reduction and projection: find useful features, dimensionality/variable reduction, invariant representation
    Choosing data mining methods, e.g. classification, association, clustering
    Choosing the mining algorithm(s)
    Data mining: search for patterns of interest
    Interpretation and evaluation: analysis of results (visualization, transformation, removing redundant patterns, …)
    Use of discovered knowledge

    Slide 23:Roles in the KDD process

    Slide 24:Major Data Mining Tasks

    Classification: predicting an item class
    Clustering: finding clusters in data
    Associations: e.g. A & B & C occur frequently
    Visualization: to facilitate human discovery
    Summarization: describing a group
    Deviation Detection: finding changes
    Estimation: predicting a continuous value
    Link Analysis: finding relationships
    …

    Slide 25:Classification

    Learn a method for predicting the instance class from pre-labeled (classified) instances
    Many approaches: Statistics, Decision Trees, Neural Networks, ...
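
As a rough illustration of learning from pre-labeled instances, here is a minimal sketch of a 1-nearest-neighbour classifier. Nearest neighbour is not one of the approaches named on the slide; it is used here only because it is short. The feature vectors and the labels "spam" and "ham" are made up for the example.

```python
import math

def nearest_neighbor_predict(training, query):
    """Return the label of the training instance closest to `query`."""
    _, label = min(training, key=lambda item: math.dist(item[0], query))
    return label

# Hypothetical pre-labeled instances: (feature vector, class label).
training = [
    ((1.0, 1.0), "spam"),
    ((1.2, 0.8), "spam"),
    ((5.0, 5.0), "ham"),
    ((4.8, 5.3), "ham"),
]

# Unlabeled instances are assigned the class of their nearest labeled neighbour.
print(nearest_neighbor_predict(training, (0.9, 1.1)))  # spam
print(nearest_neighbor_predict(training, (5.1, 4.9)))  # ham
```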

    Slide 26:Clustering

    Find “natural” grouping of instances given un-labeled data
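
One common way to look for such groupings is k-means, sketched below under simplifying assumptions: the points are hypothetical 2-D data, the number of clusters is fixed at two, and the initial centroids are chosen by hand rather than at random.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(clusters[0])  # the three points near (1, 1.5)
print(clusters[1])  # the three points near (8.5, 8.5)
```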

    Slide 27:Association Rules & Frequent Itemsets

    Transactions
    Frequent itemsets:
        Milk, Bread (4)
        Bread, Cereal (3)
        Milk, Bread, Cereal (2)
        …
    Rules:
        Milk => Bread (66%)
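
The counts and the rule percentage above are support counts and rule confidence. The sketch below shows how they could be computed. The transaction table is not part of the transcript, so the seven transactions here are hypothetical, chosen only so that they reproduce the counts listed on the slide.

```python
# Hypothetical transactions, constructed to match the slide's counts.
transactions = [
    {"Milk", "Bread", "Cereal"},
    {"Milk", "Bread", "Cereal"},
    {"Milk", "Bread"},
    {"Milk", "Bread"},
    {"Milk"},
    {"Milk"},
    {"Bread", "Cereal"},
]

def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    """Fraction of transactions containing the antecedent
    that also contain the consequent."""
    return support_count(antecedent | consequent) / support_count(antecedent)

print(support_count({"Milk", "Bread"}))            # 4
print(support_count({"Bread", "Cereal"}))          # 3
print(support_count({"Milk", "Bread", "Cereal"}))  # 2
print(confidence({"Milk"}, {"Bread"}))             # 4/6, about two thirds
```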

    Slide 28:Visualization & Data Mining

    Visualizing the data to facilitate human discovery
    Presenting the discovered results in a visually "nice" way

    Slide 29:Summarization

    Describe features of the selected group
    Use natural language and graphics
    Usually in combination with deviation detection or other methods
    Example: "Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ..."

    Slide 30:Data Mining Central Quest

    Find true patterns and avoid overfitting

    Slide 31:Overfitting

    Finding seemingly significant but really random patterns due to searching too many possibilities
    Violation of Occam's razor (lex parsimoniae): the explanation of any phenomenon should make as few assumptions as possible
        "entia non sunt multiplicanda praeter necessitatem" (entities must not be multiplied beyond necessity)
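
The effect of searching too many possibilities can be seen in a small simulation. This is only a sketch with made-up sizes: the labels and every candidate "pattern" are independent coin flips, so no pattern carries real information, yet the best of many candidates looks predictive on a small training sample.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def random_bits(n):
    return [random.getrandbits(1) for _ in range(n)]

def accuracy(feature, labels):
    """Fraction of positions where the feature equals the label."""
    return sum(f == y for f, y in zip(feature, labels)) / len(labels)

# Illustrative sizes (assumptions, not from the slides).
n_train, n_test, n_candidates = 20, 500, 1000

train_labels = random_bits(n_train)
test_labels = random_bits(n_test)

# Each candidate pattern is pure noise, unrelated to the labels.
candidates = [(random_bits(n_train), random_bits(n_test))
              for _ in range(n_candidates)]

# Search all candidates and keep the one that fits the training data best.
best_train, best_test = max(candidates,
                            key=lambda c: accuracy(c[0], train_labels))

train_acc = accuracy(best_train, train_labels)
test_acc = accuracy(best_test, test_labels)
print(f"training accuracy of the best pattern: {train_acc:.2f}")
print(f"accuracy of the same pattern on fresh data: {test_acc:.2f}")
```

The training accuracy is high only because a thousand candidates were tried; on fresh data the same pattern should score close to chance (0.5).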

    Slide 32:Hypertexts and the Web

    Slide 33:World Wide Web

    Hypertext documents
        Text
        Links
    Web
        billions of documents
        authored by millions of diverse people
        edited by no one in particular
        distributed over millions of computers, connected by a variety of media

    Slide 34:History of Hypertext

    Citation
    Hyperlinking
    Branching, non-linear discourse, nested commentary
        Ramayana - one of the great epic poems of India; attributed to the sage Valmiki, it recounts the life and exploits of Lord Rama.
        Mahabharata - an epic poem that recounts the struggle between the Kauravas and Pandavas over the disputed kingdom of Bharata, the ancient name for India.
        Talmud - compilation of Jewish oral teachings, assembled in written form in the early centuries of the Christian era.
    Dictionary, encyclopedia
        self-contained networks of textual nodes joined by referential links

    Slide 35:Hypertext systems

    Memex, 1945 [Vannevar Bush, US President Roosevelt's science advisor]
        the name stands for "memory extension"
        Aim: to create and help follow hyperlinks across documents
        A photoelectrical-mechanical storage and computing device that could store vast amounts of information, in which a user could create links between related text and illustrations.
        Such a trail of links could then be stored and used for future reference.
        Bush believed that this associative method of information gathering was not only practical in its own right, but also closer to the way the mind orders information.

    Slide 36:Hypertext systems

    Hypertext, term coined by Ted Nelson in a 1965 paper to the ACM 20th national conference:
    [...] By 'hypertext' I mean nonsequential writing - text that branches and allows choice to the reader, best read at an interactive screen.

    Slide 37:Hypertext systems

    The first hypertext-based system was developed in 1967 by a team of researchers led by Dr. Andries van Dam at Brown University. The research was funded by IBM and the first hypertext implementation, Hypertext Editing System, ran on an IBM/360 mainframe. IBM later sold the system to the Houston Manned Spacecraft Center which reportedly used it for the Apollo space program documentation

    Slide 38:Hypertext systems

    Xanadu hypertext, by Ted Nelson, 1981: In the Xanadu scheme, a universal document database (docuverse), would allow addressing of any substring of any document from any other document. "This requires an even stronger addressing scheme than the Universal Resource Locators used in the World-Wide Web." [De Bra] Additionally, Xanadu would permanently keep every version of every document, thereby eliminating the possibility of a broken link. Xanadu would only maintain the current version of the document in its entirety.

    Slide 39:World-wide Web

    Initiated at CERN in 1989 by Tim Berners-Lee, now W3C director:
    "W3 was originally developed to allow information sharing within internationally dispersed teams, and the dissemination of information by support groups. Originally aimed at the High Energy Physics community, it has spread to other areas and attracted much interest in user support, resource discovery and collaborative work areas. It is currently the most advanced information system deployed on the Internet, and embraces within its data model most information in previous networked information systems."

    Slide 40:World-wide Web

    GUIs
        Berners-Lee (WorldWideWeb, 1990)
        Erwise and Viola (1992), Midas (1993)
        Mosaic (1993): a hypertext GUI for the X Window System
    HTML: markup language for rendering hypertext
    HTTP: Hypertext Transfer Protocol, for sending HTML and other data over the Internet
    CERN HTTPD: server of hypertext documents

    Slide 41:The early days of the Web

    CERN HTTP traffic grows by a factor of 1000 between 1991 and 1994 (image courtesy W3C)

    Slide 42:The early days of the Web

    The number of servers grows from a few hundred to a million between 1991 and 1997 (image courtesy Nielsen)

    Slide 43:1994: the landmark year

    Foundation of the "Mosaic Communications Corporation" (later Netscape)
    First World-Wide Web conference
    MIT and CERN agreed to set up the World-Wide Web Consortium (W3C)

    Slide 44:The Web

    A populist, participatory medium
        number of writers = (approx.) number of readers
        enables near-zero-cost dissemination of information
    Abundance and authority crisis
        liberal and informal culture of content generation and dissemination
        very little uniform civil code
        redundancy and non-standard form and content
        millions of qualifying pages for most broad queries (example: java or kayaking)
        no per se authoritative information about the reliability of a site

    Slide 45:Problems due to Uniform accessibility

    Little support for adapting to the background of specific users
    Commercial interests routinely influence the operation of Web search
        Users pay for connection costs, not for content
        Profit depends on ads, sales, etc.
        "Search Engine Optimization"!!

    Slide 46:What is Web Mining?

    Discovering interesting and useful information from Web content, structure, and usage
    Examples:
        Web search, e.g. Google, Yahoo, MSN, Ask, …
        Specialized search, e.g. Froogle (comparison shopping), job ads (Flipdog)
        eCommerce
            Recommendations, e.g. Netflix, Amazon
            Improving conversion rate: next best product to offer
        Advertising, e.g. Google Adsense
        Fraud detection: click fraud detection, …
        Improving Web site design and performance

    Reproduced from Ullman & Rajaraman with permission

    Slide 47:How does it differ from “classical” Data Mining?

    The web is not a relation
        Textual information and linkage structure
    Usage data is huge and growing rapidly
        Google's usage logs are bigger than their web crawl
        Data generated per day is comparable to the largest conventional data warehouses
    Content and structure data are rich in features and patterns
        spontaneous formation and evolution of topic-induced graph clusters
        hyperlink-induced communities
    Ability to react in real time to usage patterns
        No human in the loop

    Slide 48:How big is the Web ?

    Number of pages
        Technically, infinite
            Because of dynamically generated content
            Lots of duplication (30-40%)
        Best estimate of "unique" static HTML pages comes from search engine claims
            Google = 8 billion, Yahoo = 20 billion
            Lots of marketing hype

    Slide 49:96,854,877 web sites (Sept 2006)

    http://news.netcraft.com/archives/web_server_survey.html Total Sites Across All Domains August 1995 - September 2006

    Slide 50:The web as a graph

    Pages = nodes, hyperlinks = edges
        Ignore content
        Directed graph
    High linkage
        8-10 links/page on average
        Power-law degree distribution
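
To make the graph view concrete, the sketch below builds in-degree and out-degree counts from a list of hyperlinks and tabulates how many pages have each in-degree. The five-page edge list is a toy example invented for illustration; a power law would only become visible on a real crawl, typically by plotting this distribution on log-log axes.

```python
from collections import Counter

# Hypothetical hyperlinks as (source page, destination page) pairs.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"),
         ("d", "c"), ("e", "c"), ("e", "a")]

nodes = {page for edge in edges for page in edge}
in_degree = Counter(dst for _, dst in edges)
out_degree = Counter(src for src, _ in edges)

# Degree distribution: number of pages that have in-degree k.
in_degree_distribution = Counter(in_degree[page] for page in nodes)
average_links_per_page = len(edges) / len(nodes)

print(in_degree["c"])                     # 4: pages a, b, d and e link to c
print(sorted(in_degree_distribution.items()))
# [(0, 2), (1, 1), (2, 1), (4, 1)]
print(average_links_per_page)             # 1.4
```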

    Slide 51:Power-law degree distribution

    Source: Broder et al, 2000

    Slide 52:Power-laws abounding

    In-degrees
    Out-degrees
    Number of pages per site
    Number of visitors
    Term distribution in pages
    Query distribution in query logs
    Let's take a closer look at structure
        Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls
        Not a "small world"

    Slide 53:Bow-tie Structure

    Source: Broder et al, 2000
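
In the bow-tie picture of Broder et al., the web graph has a strongly connected core (SCC), an IN set of pages that can reach the core, and an OUT set of pages reachable from it. The sketch below splits a toy graph into these parts using forward and backward reachability from one page assumed to lie in the core. The edge list and page names are hypothetical, and everything else (tendrils, disconnected pages) is lumped into a single "OTHER" set for brevity.

```python
from collections import defaultdict, deque

def reachable(start, adjacency):
    """Breadth-first search: all nodes reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def bow_tie(edges, core_node):
    forward, backward = defaultdict(list), defaultdict(list)
    nodes = set()
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)
        nodes.update((src, dst))
    downstream = reachable(core_node, forward)   # pages the core can reach
    upstream = reachable(core_node, backward)    # pages that can reach the core
    scc = downstream & upstream
    return {
        "SCC": scc,
        "IN": upstream - scc,
        "OUT": downstream - scc,
        "OTHER": nodes - upstream - downstream,
    }

# Hypothetical toy graph: a 3-page cycle with one IN page and two OUT pages.
edges = [("in1", "a"), ("a", "b"), ("b", "c"), ("c", "a"),
         ("c", "out1"), ("out1", "out2"), ("x", "y")]
parts = bow_tie(edges, core_node="a")
print(sorted(parts["SCC"]))    # ['a', 'b', 'c']
print(sorted(parts["IN"]))     # ['in1']
print(sorted(parts["OUT"]))    # ['out1', 'out2']
print(sorted(parts["OTHER"]))  # ['x', 'y']
```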

    Slide 54:Searching the Web

    Content consumers

    Slide 55:Ads vs. search results


    Slide 56:Ads vs. search results

    Search advertising is the revenue model
        Multi-billion-dollar industry
        Advertisers pay for clicks on their ads
    Interesting problems
        How to pick the top 10 results for a search from 2,230,000 matching pages?
        What ads to show for a search?
        If I'm an advertiser, which search terms should I bid on, and how much should I bid?
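
The hard part of the first problem is the scoring function, which the slides do not specify. Assuming each matching page already has a relevance score, selecting the ten best does not require sorting every match. The sketch below uses random numbers as stand-in scores and a smaller set of 100,000 hypothetical pages to keep it quick.

```python
import heapq
import random

random.seed(42)

# Stand-in data: (page id, relevance score). Real scores would come from
# a ranking function; random values are used here purely for illustration.
scored_pages = ((f"page-{i}", random.random()) for i in range(100_000))

# nlargest keeps only a 10-element heap while streaming over the matches,
# so the full result set is never sorted or held in memory.
top10 = heapq.nlargest(10, scored_pages, key=lambda pair: pair[1])

for page, score in top10:
    print(page, round(score, 5))
```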

    Slide 57:Sidebar: What’s in a name?

    Geico sued Google, contending that it owned the trademark "Geico"
        Thus, ads for the keyword geico couldn't be sold to others
    Court ruling: search engines can sell keywords including trademarks
    No court ruling yet: whether the ad itself can use the trademarked word(s)

    Slide 58:Extracting Structured Data

    http://www.simplyhired.com

    Slide 59:Extracting structured data

    http://www.fatlens.com

    Slide 60:The Long Tail (yet another power-law)

    Source: Chris Anderson (2004)

    Slide 61:The Long Tail

    Shelf space is a scarce commodity for traditional retailers
        Also: TV networks, movie theaters, …
    The web enables near-zero-cost dissemination of information about products
    More choices necessitate better filters
        Recommendation engines (e.g., Amazon)
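
A recommendation engine can act as such a filter. The sketch below is one very simple form of collaborative filtering, not the method of any particular retailer: items owned by users with overlapping purchases are scored by the size of the overlap. The users, item names, and purchase histories are all hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories: user -> set of items.
baskets = {
    "ann": {"bestseller", "niche-jazz"},
    "bob": {"bestseller", "niche-jazz", "niche-folk"},
    "cat": {"bestseller", "niche-folk"},
    "dan": {"niche-jazz", "niche-folk"},
}

def recommend(user, k=2):
    """Suggest up to k items the user lacks, weighting each item by how
    much its owners' baskets overlap with the user's basket."""
    owned = baskets[user]
    scores = Counter()
    for other, items in baskets.items():
        if other == user:
            continue
        overlap = len(owned & items)
        if overlap == 0:
            continue  # users with nothing in common contribute nothing
        for item in items - owned:
            scores[item] += overlap
    return [item for item, _ in scores.most_common(k)]

print(recommend("ann"))  # ['niche-folk']
```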

    Slide 62:Major Web Mining topics

    Crawling the web
    Web graph analysis
    Structured data extraction
    Classification and vertical search
    Collaborative filtering
    Web advertising and optimization
    Mining web logs
    Systems issues

    Slide 63:Web search basics


    Slide 64:Search engine components

    Spider (a.k.a. crawler/robot): builds the corpus
        Collects web pages recursively
            For each known URL, fetch the page, parse it, and extract new URLs
            Repeat
        Additional pages from direct submissions & other sources
    Indexer: creates inverted indexes
        Various policies w.r.t. which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc.
    Query processor: serves query results
        Front end: query reformulation, word stemming, capitalization, optimization of Booleans, etc.
        Back end: finds matching documents and ranks them
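
The three components can be sketched end to end on a toy scale. Everything below is an illustrative assumption: the "web" is an in-memory dictionary instead of real HTTP fetches, the example.org URLs and page texts are invented, the indexing policy is plain lowercasing with no stemming, and the query processor does Boolean AND matching without ranking.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory web: URL -> (page text, outgoing links).
WEB = {
    "http://example.org/": ("Web mining course",
                            ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("Mining the web graph",
                             ["http://example.org/"]),
    "http://example.org/b": ("Search engine crawling and indexing",
                             ["http://example.org/a", "http://example.org/missing"]),
}

def crawl(seed):
    """Spider: fetch each known URL once, extract new URLs, repeat."""
    corpus, frontier, seen = {}, deque([seed]), {seed}
    while frontier:
        url = frontier.popleft()
        if url not in WEB:  # stands in for a failed fetch
            continue
        text, links = WEB[url]
        corpus[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return corpus

def build_index(corpus):
    """Indexer: map each lowercased word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in corpus.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Query processor: return URLs that contain every query word."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)

corpus = crawl("http://example.org/")
index = build_index(corpus)
print(sorted(search(index, "web mining")))
# ['http://example.org/', 'http://example.org/a']
print(sorted(search(index, "crawling")))
# ['http://example.org/b']
```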
