A bit about data mining and Information Retrieval: Motivation & Summary

Gegevensbanken (Databases) 2012: A bit about data mining and Information Retrieval
Bettina Berendt
http://people.cs.kuleuven.be/~bettina.berendt/


Where are we?
Lesson #   Who   What
1          ED    Intro, ER
2          ED    EER, (E)ER to relational schema
2          ED    Relational model
3          KV    Relational algebra & relational calculus
4,5        KV    SQL
6          KV    Connecting programs to databases
7          KV    Functional dependencies & normalisation
8          KV    PHP
10         BB    Database security
11         BB    Storage and file organisation
12         BB    External hashing
13         BB    Index structures
14         BB    Query processing
15-17      BB    Transaction processing and concurrency control
18         BB    Data mining and Information Retrieval
9          ED    XML (and more about the Web as a DB), NoSQL
New topics / outlook

Who would a bank lend money to?
Database queries:
• Who has ever failed to repay a loan?
    SELECT DISTINCT Fname, Lname
    FROM Clients, Loans
    WHERE clientID = loantakerID AND paid = 'NO'
Data Warehousing / Online Analytical Processing (OLAP):
• In which neighbourhoods did more than 20% of the clients fail to repay a loan last year?
Data Mining:
• Which people can be expected not to repay a loan? (as a function of neighbourhood, job, age, gender, ...)

Another application area
• The Web
• You use Web data mining every day

Indexing and ranking

Behaviour analysis for recommender systems

Text mining for recommender systems

Or also

Who buys the printer XYZ?
• My customer (etc.): database lookup
• "I do not know the answer, but the following 2398445 pages are relevant to your query": search engine / information retrieval / document retrieval
• This user (because of his profile, his postings, his friends and their properties, …): data mining
• Someone who has just sold or thrown away his old printer: logic
• Different methods of inference
• Different types of answers
• Describing / known data versus predicting

What follows is also …
• … a preview of several courses in the Master's programme, e.g.
  • Advanced Databases
  • Text-based Information Retrieval
  • Current Trends in Databases
  • Data Mining
• Also interesting / related (logic!), but not today's topic:
  • Modellering van complexe systemen (Modelling of Complex Systems)

Agenda
• How do data become powerful? Mining & combination
• Methods (1): Classifier learning on relations
• Methods (2): Itemset mining
• From relations to texts
• Methods (3): Classifier learning on texts
• (A bit of) KD process: Preprocessing
• What do search engines do?
• What can WE do?

Knowledge discovery (and data mining)
• "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."

Data mining techniques
• Exploratory data analysis with interactive, often visual methods
• Descriptive modelling (density estimation, cluster analysis and segmentation, dependency modelling)
• Predictive modelling (classification and regression)
  • The goal is to build a model that predicts the value of one variable from the known values of the other variables.
  • In classification the predicted value is a category;
  • in regression it is quantitative.
• Discovery of (local) patterns and rules
  • Typical examples are frequent patterns such as sets, sequences and subgraphs,
  • and rules that can be derived from them (e.g. association rules).

Especially interesting when based on combined data ... and ... and ...

Data
• relational data,
• texts,
• graphs,
• semi-structured data (e.g. Web clickstreams),
• images,
• …

Input data ...
Q: when does this person play tennis?
Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

Terminology (using a popular data example)
(Same weather table as on the previous slide.)
Rows:
• Instances (think of them as objects)
• Here: days, described by the columns
Columns:
• Features
• Here: Outlook, Temp, …
In this case, there is a feature with a special role:
• The class
• Here: Play (does X play tennis on this day?)
This is "relational DB mining". We will later see other types of data and the mining applied to them.

The goal: a decision tree for classification / prediction
• In which weather will someone play (tennis etc.)?

Constructing decision trees
• Strategy: top down, in recursive divide-and-conquer fashion
• First: select an attribute for the root node and create a branch for each possible attribute value
• Then: split the instances into subsets, one for each branch extending from the node
• Finally: repeat recursively for each branch, using only the instances that reach the branch
• Stop if all instances have the same class

Which attribute to select?

Criterion for attribute selection
• Which is the best attribute?
• Want to get the smallest tree
• Heuristic: choose the attribute that produces the "purest" nodes
• Popular purity criterion: information gain
• Information gain increases with the average purity of the subsets
• Strategy: choose the attribute that gives the greatest information gain

Computing information
• Measure information in bits
• Given a probability distribution, the information required to predict an event is the distribution's entropy
• Entropy gives the information required in bits (this can involve fractions of bits!)
• Formula for computing the entropy:
$$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n$$
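A minimal Python sketch of the entropy computation described on this slide. The function names `entropy` and `info` are illustrative choices; `info` mirrors the slides' notation `info([9,5])`, where the list holds class counts rather than probabilities.

```python
import math


def entropy(probabilities):
    """Entropy in bits of a discrete probability distribution."""
    # Outcomes with probability 0 are skipped: p * log2(p) tends to 0 as p -> 0.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def info(counts):
    """Entropy of the class distribution given by raw class counts."""
    total = sum(counts)
    return entropy(c / total for c in counts)


# 9 "Yes" days and 5 "No" days in the weather data: about 0.940 bits.
print(round(info([9, 5]), 3))
# A pure node (all instances in one class) needs no further information.
print(info([4, 0]))
```

A fair coin, `info([1, 1])`, gives exactly 1 bit, which is a quick sanity check on the unit.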

Example: attribute Outlook
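The slide's figure is not reproduced in this text. As a hedged reconstruction, the class counts per Outlook value can be read off the weather table (Sunny: 2 Yes / 3 No, Overcast: 4 Yes / 0 No, Rainy: 3 Yes / 2 No), and the worked computation would then read:

```latex
\begin{align*}
\mathrm{info}([2,3]) &= -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971 \text{ bits} \\
\mathrm{info}([4,0]) &= 0 \text{ bits} \\
\mathrm{info}([3,2]) &= -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} \approx 0.971 \text{ bits} \\
\mathrm{info}([2,3],[4,0],[3,2]) &= \tfrac{5}{14}\cdot 0.971 + \tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.971 \approx 0.693 \text{ bits}
\end{align*}
```

The last line is the expected information after splitting on Outlook: each subset's entropy weighted by the fraction of the 14 instances that falls into it.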

Computing information gain
• Information gain = information before splitting - information after splitting
• Example: gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
• Information gain for the attributes of the weather data:
  • gain(Outlook) = 0.247 bits
  • gain(Temperature) = 0.029 bits
  • gain(Humidity) = 0.152 bits
  • gain(Windy) = 0.048 bits
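The four gains can be recomputed from the weather table with a short Python sketch. The helper names (`info`, `gain`) and the way the table is encoded are illustrative choices, not part of the slides.

```python
import math
from collections import Counter, defaultdict

COLUMNS = ["Outlook", "Temp", "Humidity", "Windy", "Play"]
ROWS = [line.split() for line in """Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No""".splitlines()]


def info(labels):
    """Entropy in bits of a list of class labels."""
    counts = Counter(labels)
    return sum(-c / len(labels) * math.log2(c / len(labels))
               for c in counts.values())


def gain(rows, attr_index, class_index=-1):
    """Information before splitting minus expected information after."""
    before = info([r[class_index] for r in rows])
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[attr_index]].append(r[class_index])
    # Weight each subset's entropy by the share of instances it receives.
    after = sum(len(s) / len(rows) * info(s) for s in subsets.values())
    return before - after


for i, name in enumerate(COLUMNS[:-1]):
    print(f"gain({name}) = {gain(ROWS, i):.3f} bits")
```

Outlook has the largest gain, so it would be chosen for the root node.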

Continuing to split (within the branch Outlook = Sunny)
• gain(Temperature) = 0.571 bits
• gain(Humidity) = 0.971 bits
• gain(Windy) = 0.020 bits

Final decision tree
• Note: not all leaves need to be pure; sometimes identical instances have different classes
→ Splitting stops when the data cannot be split any further
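The top-down strategy from the earlier slides can be sketched in Python as follows. This is a simplified ID3-style sketch under stated assumptions: the tree is represented as nested dictionaries (an illustrative choice), and a leaf that cannot be split further gets the majority class.

```python
import math
from collections import Counter

COLUMNS = ["Outlook", "Temp", "Humidity", "Windy", "Play"]
ROWS = [line.split() for line in """Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No""".splitlines()]


def info(rows):
    """Entropy in bits of the class column (last column) of the rows."""
    counts = Counter(r[-1] for r in rows)
    return sum(-c / len(rows) * math.log2(c / len(rows))
               for c in counts.values())


def split(rows, i):
    """Partition the rows by their value for attribute i."""
    subsets = {}
    for r in rows:
        subsets.setdefault(r[i], []).append(r)
    return subsets


def gain(rows, i):
    after = sum(len(s) / len(rows) * info(s) for s in split(rows, i).values())
    return info(rows) - after


def build_tree(rows, attrs):
    classes = Counter(r[-1] for r in rows)
    # Stop when the node is pure or no attribute is left to split on;
    # an impure leaf is labelled with its majority class.
    if len(classes) == 1 or not attrs:
        return classes.most_common(1)[0][0]
    best = max(attrs, key=lambda i: gain(rows, i))
    rest = [i for i in attrs if i != best]
    return {COLUMNS[best]: {value: build_tree(subset, rest)
                            for value, subset in split(rows, best).items()}}


tree = build_tree(ROWS, [0, 1, 2, 3])
print(tree)
```

On the weather data this yields Outlook at the root, Humidity under Sunny, Windy under Rainy, and a pure "Yes" leaf under Overcast.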

Q: Entropy: does this have anything to do with the thermodynamic concept (a measure of the disorder of something, a quantity that can only increase, whatever happens), or is it entirely unrelated?
• A: Yes and no …
• Recommended source:
  • Stanford Encyclopedia of Philosophy
  • http://plato.stanford.edu/entries/information-entropy/
• Somewhat shorter (but I cannot judge the content):
  • http://en.wikipedia.org/wiki/Entropy_in_thermodynamics_and_information_theory

Data
• "Market basket" data: attributes with Boolean domains
• In a table → each row is a basket (also: transaction)

As a relational table

Solution approach: the Apriori principle and the pruning of the search tree (1)
Itemset lattice, from the full set down to the single items:
• 4 items: {spaghetti, tomato sauce, bread, butter}
• 3 items: {spaghetti, tomato sauce, bread}, {spaghetti, tomato sauce, butter}, {spaghetti, bread, butter}, {tomato sauce, bread, butter}
• 2 items: {spaghetti, tomato sauce}, {spaghetti, bread}, {spaghetti, butter}, {tomato sauce, bread}, {tomato sauce, butter}, {bread, butter}
• 1 item: {spaghetti}, {tomato sauce}, {bread}, {butter}

Generating large k-itemsets with Apriori
• Min. support = 40%
• Step 1: candidate 1-itemsets
  • Spaghetti: support = 3 (60%)
  • Tomato sauce: support = 3 (60%)
  • Bread: support = 4 (80%)
  • Butter: support = 1 (20%)

• Step 2: large 1-itemsets (butter is dropped: 20% < 40%)
  • Spaghetti
  • Tomato sauce
  • Bread
• Candidate 2-itemsets
  • {spaghetti, tomato sauce}: support = 2 (40%)
  • {spaghetti, bread}: support = 2 (40%)
  • {tomato sauce, bread}: support = 2 (40%)

• Step 3: large 2-itemsets
  • {spaghetti, tomato sauce}
  • {spaghetti, bread}
  • {tomato sauce, bread}
• Candidate 3-itemsets
  • {spaghetti, tomato sauce, bread}: support = 1 (20%)
• Step 4: large 3-itemsets
  • { } (none)
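The level-wise search in steps 1 to 4 can be sketched in Python. The basket table itself is not reproduced in this text, so `BASKETS` below is a hypothetical set of five baskets chosen only to reproduce the support counts on the slides; the function names are illustrative as well.

```python
from itertools import combinations

# Hypothetical baskets, chosen to match the support counts on the slides.
BASKETS = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "tomato sauce", "bread"},
    {"spaghetti", "bread"},
    {"tomato sauce", "bread"},
    {"bread", "butter"},
]
MIN_SUPPORT = 0.4


def support_count(itemset, baskets):
    """Number of baskets that contain every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b)


def apriori(baskets, min_support):
    """Return all large itemsets with their support counts."""
    items = sorted({i for b in baskets for i in b})
    candidates = [frozenset([i]) for i in items]
    large = {}
    k = 1
    while candidates:
        counted = {c: support_count(c, baskets) for c in candidates}
        level = {c: n for c, n in counted.items()
                 if n / len(baskets) >= min_support}
        large.update(level)
        k += 1
        # Join step: unions of two large (k-1)-itemsets that have k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be large.
        candidates = [c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))]
    return large


large = apriori(BASKETS, MIN_SUPPORT)
for itemset, count in sorted(large.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The only candidate 3-itemset has support 1 (20%), so the search stops with three large 1-itemsets and three large 2-itemsets.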

From itemsets to association rules
• Schema: IF subset THEN large k-itemset, with support s and confidence c
  • s = (support of the large k-itemset) / (number of tuples)
  • c = (support of the large k-itemset) / (support of the subset)
• Example: IF {spaghetti} THEN {spaghetti, tomato sauce}
  • Support: s = 2 / 5 (40%)
  • Confidence: c = 2 / 3 (≈ 67%)
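The two formulas can be checked with a few lines of Python. As before, `BASKETS` is a hypothetical basket table consistent with the support counts on the slides, and `rule_stats` is an illustrative name.

```python
# Hypothetical baskets, consistent with the support counts on the slides.
BASKETS = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "tomato sauce", "bread"},
    {"spaghetti", "bread"},
    {"tomato sauce", "bread"},
    {"bread", "butter"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b)


def rule_stats(subset, itemset, baskets):
    """Support and confidence of the rule IF subset THEN itemset."""
    n_itemset = support_count(itemset, baskets)
    s = n_itemset / len(baskets)                   # share of all tuples
    c = n_itemset / support_count(subset, baskets)  # share of matching tuples
    return s, c


s, c = rule_stats({"spaghetti"}, {"spaghetti", "tomato sauce"}, BASKETS)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

Support says how often the whole itemset occurs; confidence says how often the rule holds among the baskets that satisfy its left-hand side.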

It can be done better … (one possibility): the FP-tree
Node labels in the figure: NULL (root), Br:4, S:1, T:1, S:2, T:1, T:1
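A hedged Python sketch of how such an FP-tree could be built. The figure's structure is not recoverable from this text, so the sketch assumes a hypothetical basket table consistent with the earlier support counts, a minimum support count of 2, and items ordered by descending frequency (ties broken alphabetically). Under these assumptions the resulting node counts match the labels bread:4, spaghetti:2, spaghetti:1 and three tomato sauce:1 nodes.

```python
from collections import Counter

# Hypothetical baskets, consistent with the support counts on the slides.
BASKETS = [
    ["spaghetti", "tomato sauce"],
    ["spaghetti", "tomato sauce", "bread"],
    ["spaghetti", "bread"],
    ["tomato sauce", "bread"],
    ["bread", "butter"],
]
MIN_COUNT = 2  # 40% of 5 baskets


class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(baskets, min_count):
    freq = Counter(item for basket in baskets for item in set(basket))
    root = Node(None)  # the NULL root
    for basket in baskets:
        # Drop infrequent items, then order by descending frequency so that
        # baskets sharing frequent items also share a path prefix.
        kept = [i for i in set(basket) if freq[i] >= min_count]
        node = root
        for item in sorted(kept, key=lambda i: (-freq[i], i)):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


def dump(node, depth=0):
    """Indented 'item:count' lines, one per tree node below the root."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(dump(child, depth + 1))
    return lines


root = build_fp_tree(BASKETS, MIN_COUNT)
print("\n".join(dump(root)))
```

The tree stores the five baskets in six nodes; frequent itemsets can then be mined from the tree without rescanning the basket table for every candidate.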

Texts as relations
• IF star AND Britney THEN Celebrity
• IF star AND Dipper THEN Astronomy

Texts as itemsets ("sets of words")
• IF star AND Britney THEN Spears
• IF star AND Dipper THEN Big

Texts as bags of words

DB structures behind this: what is an index and what is it for? (3): finding (here: fully inverted files)
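A minimal Python sketch of an inverted file: every word maps to the set of documents that contain it. The three documents and the function names are hypothetical, and real systems typically also store word positions and frequencies.

```python
from collections import defaultdict

# Hypothetical mini-collection.
DOCS = {
    1: "Britney Spears is a pop star",
    2: "the Big Dipper is a star pattern",
    3: "a star is born",
}


def build_inverted_index(docs):
    """Map every term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Ids of the documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())  # intersect the posting sets
    return result


index = build_inverted_index(DOCS)
print(search_all(index, "star"))
print(search_all(index, "Britney star"))
```

Answering a query then only touches the posting sets of the query terms instead of scanning every document.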

Texts as bags of words
• Which documents are probably the most relevant for a search for
  • Britney
  • star ?
• Similarity between query and document!
• "Britney" is very characteristic of doc 1.
• "Star" is not characteristic (it occurs in every doc!).
→ Term frequency / inverse document frequency: TF.IDF weights for words
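A small Python sketch of TF.IDF weighting. It uses one common variant, weight = tf × log(N / df), and a simple score that sums the weights of the query terms; both are assumptions, as the slides do not fix a formula. The three documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical documents: "star" occurs in all of them, "britney" only in doc 1.
DOCS = {
    1: "britney spears pop star britney",
    2: "big dipper star",
    3: "movie star",
}


def tf_idf(docs):
    """Per-document term weights: term frequency * log(N / document frequency)."""
    bags = {d: Counter(text.split()) for d, text in docs.items()}
    n = len(bags)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for bag in bags.values() for term in bag)
    return {d: {t: tf * math.log(n / df[t]) for t, tf in bag.items()}
            for d, bag in bags.items()}


def score(weights, query):
    """Sum of the TF.IDF weights of the query terms, per document."""
    return {d: sum(w.get(t, 0.0) for t in query.split())
            for d, w in weights.items()}


weights = tf_idf(DOCS)
scores = score(weights, "britney star")
print(weights[1]["britney"], weights[1]["star"])
print(max(scores, key=scores.get))
```

Because "star" occurs in every document, its weight is log(3/3) = 0 everywhere, so only "britney" separates the documents and doc 1 ranks first.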

Q: Is the idea here that you convert a web page into some kind of vector that contains the most important information? How does that work, and what is in such a vector?
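One possible illustration, as a Python sketch under simple assumptions: the vector has one position per word of the vocabulary and holds the word's count in the page (a bag-of-words vector), and query and page are compared with cosine similarity. The two pages and the function names are hypothetical; real systems typically replace raw counts with weights such as TF.IDF.

```python
import math
from collections import Counter

# Hypothetical pages.
PAGES = [
    "Britney Spears is a star",
    "The Big Dipper is a star pattern",
]


def vocabulary(texts):
    """Sorted list of all distinct words; fixes the meaning of each position."""
    return sorted({w for t in texts for w in t.lower().split()})


def to_vector(text, vocab):
    """Bag-of-words vector: one word count per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocab = vocabulary(PAGES)
vectors = [to_vector(p, vocab) for p in PAGES]
query = to_vector("Britney", vocab)

print(vocab)
print(vectors[0])
print([round(cosine(query, v), 3) for v in vectors])
```

Word order is lost in such a vector; only which words occur, and how often, is kept.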
