Statistical Learning from Relational Data Daphne Koller Stanford University Joint work with many many people
Relational Data is Everywhere • The web • Webpages (& the entities they represent), hyperlinks • Social networks • People, institutions, friendship links • Biological data • Genes, proteins, interactions, regulation • Bibliometrics • Papers, authors, journals, citations • Corporate databases • Customers, products, transactions
Relational Data is Different • Data instances are not independent • Topics of linked webpages are correlated • Data instances are not identically distributed • Heterogeneous instances (papers, authors) • No IID assumption: this is a good thing
New Learning Tasks • Collective classification of related instances • Labeling an entire website of related webpages • Relational clustering • Finding coherent clusters in the genome • Link prediction & classification • Predicting when two people are likely to be friends • Pattern detection in network of related objects • Finding groups (research groups, terrorist groups)
Probabilistic Models • Uncertainty model: • space of “possible worlds” • probability distribution over this space • Worlds: often defined via a set of state variables • medical diagnosis: diseases, symptoms, findings, … • each world: an assignment of values to variables • Number of worlds is exponential in # of vars: 2^n if we have n binary variables
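A minimal sketch of this blow-up, enumerating every possible world for a toy diagnosis domain (the variable names are hypothetical, not from the talk):

```python
from itertools import product

# Three hypothetical binary state variables for a toy diagnosis domain
variables = ["disease", "symptom", "finding"]

# Each "possible world" is one assignment of values to all variables
worlds = list(product([False, True], repeat=len(variables)))

print(len(worlds))  # 2^3 = 8 worlds for n = 3 binary variables
```

With 30 binary variables the same enumeration would already produce over a billion worlds, which is why compact representations like Bayesian networks matter.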
Outline • Relational Bayesian networks* • Relational Markov networks • Collective Classification • Relational clustering * with Avi Pfeffer, Nir Friedman, Lise Getoor
Bayesian Networks • nodes = variables, edges = direct influence [Figure: student network with nodes Difficulty, Intelligence, Grade, Letter, SAT, Job; CPD P(G|D,I) over grade values A, B, C] • Graph structure encodes independence assumptions: Letter conditionally independent of Intelligence given Grade
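A sketch of what the CPD P(G|D,I) buys us: the joint over a fragment of the student network factorizes by the chain rule. All probability values below are invented for illustration:

```python
# Toy chain-rule computation for a fragment of the student network:
# P(D, I, G) = P(D) * P(I) * P(G | D, I).  All numbers are made up.
P_D = {"easy": 0.6, "hard": 0.4}                     # Difficulty prior
P_I = {"low": 0.7, "high": 0.3}                      # Intelligence prior
P_G_given_DI = {                                     # CPD P(G | D, I)
    ("easy", "low"):  {"A": 0.3,  "B": 0.4,  "C": 0.3},
    ("easy", "high"): {"A": 0.9,  "B": 0.08, "C": 0.02},
    ("hard", "low"):  {"A": 0.05, "B": 0.25, "C": 0.7},
    ("hard", "high"): {"A": 0.5,  "B": 0.3,  "C": 0.2},
}

def joint(d, i, g):
    # Factorized joint: far fewer parameters than a full table over worlds
    return P_D[d] * P_I[i] * P_G_given_DI[(d, i)][g]

print(round(joint("hard", "high", "A"), 3))  # 0.4 * 0.3 * 0.5 = 0.06
```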
Bayesian Networks: Problem • Bayesian nets use propositional representation • Real world has objects, related to each other [Figure: ground network with variables Intell_Jane, Intell_George, Diffic_CS101, Diffic_Geo101, Grade_Jane_CS101, Grade_George_CS101, Grade_George_Geo101] • These “instances” are not independent: both CS101 grades depend on Diffic_CS101
Relational Schema • Specifies types of objects in domain, attributes of each type of object & types of relations between objects [Figure: classes Student (attribute Intelligence), Professor (attribute Teaching-Ability), Course (attribute Difficulty), Registration (attributes Grade, Satisfaction); relations Teach, Take, In]
St. Nordaf University World [Figure: a concrete world. Prof. Jones and Prof. Smith (Teaching-ability) teach courses Geo101 and CS101 (Difficulty); students George and Jane (Intelligence) are registered in courses, each registration carrying a Grade and Satisfaction]
Relational Bayesian Networks • Universals: probabilistic patterns hold for all objects in class • Locality: represent direct probabilistic dependencies • Links define potential interactions [Figure: template model. Professor.Teaching-Ability, Student.Intelligence, and Course.Difficulty influence Reg.Grade (values A, B, C), which influences Reg.Satisfaction] [K. & Pfeffer; Poole; Ngo & Haddawy]
RBN Semantics • Ground model: • variables: attributes of all objects • dependencies: determined by relational links & template model [Figure: ground network over Prof. Jones and Prof. Smith (Teaching-ability), Geo101 and CS101 (Difficulty), George and Jane (Intelligence), with one Grade and Satisfaction variable per registration]
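The grounding step can be sketched as a loop over objects and links; the object names mirror the slides, and the code is purely illustrative:

```python
# Sketch: grounding an RBN template over the tiny university world above.
# One ground variable per attribute of each object; dependencies (noted in
# comments) come from the relational links plus the template model.
professors = {"Jones": "Geo101", "Smith": "CS101"}        # who teaches what
registrations = [("George", "Geo101"), ("George", "CS101"),
                 ("Jane", "CS101")]

ground_vars = []
for prof in professors:
    ground_vars.append(f"TeachingAbility({prof})")
for student in {s for s, _ in registrations}:
    ground_vars.append(f"Intelligence({student})")
for course in set(professors.values()):
    ground_vars.append(f"Difficulty({course})")
for student, course in registrations:
    # Template: Grade depends on Difficulty(course) and Intelligence(student)
    ground_vars.append(f"Grade({student},{course})")

print(len(ground_vars))  # 2 abilities + 2 intelligences + 2 difficulties + 3 grades = 9
```

Note how the number of ground variables grows with the number of objects and links, while the template model itself stays fixed.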
The Web of Influence [Figure: observed grades (a C in Geo101, an A in CS101) propagate evidence through the ground network, shifting beliefs about course difficulty (easy/hard) and student intelligence (low/high)]
Outline • Relational Bayesian networks* • Relational Markov networks† • Collective Classification • Relational clustering * with Avi Pfeffer, Nir Friedman, Lise Getoor † with Ben Taskar, Pieter Abbeel
Why Undirected Models? • Symmetric, non-causal interactions • E.g., web: categories of linked pages are correlated • Cannot introduce direct edges because of cycles • Patterns involving multiple entities • E.g., web: “triangle” patterns • Directed edges not appropriate • “Solution”: Impose arbitrary direction • Not clear how to parameterize CPD for variables involved in multiple interactions • Very difficult within a class-based parameterization [Taskar, Abbeel, K. 2001]
Markov Networks [Figure: undirected network over people James, Mary, Kyle, Noah, Laura, with a template potential applied to each clique]
Relational Markov Networks • Universals: probabilistic patterns hold for all groups of objects • Locality: represent local probabilistic dependencies • Sets of links give us possible interactions [Figure: template potential over Student1.Intelligence, Reg1.Grade, Student2.Intelligence, Reg2.Grade for students in the same Study Group of a Course with Difficulty]
RMN Semantics [Figure: ground network. Courses Geo101 and CS101 (Difficulty); students George, Jane, Jill (Intelligence) with their Grades; the template potential is instantiated over the Geo study group (George, Jane) and the CS study group (Jane, Jill)]
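A sketch of the semantics: the ground RMN's unnormalized score multiplies one copy of the template potential per study-group pair. The potential values and grades below are invented:

```python
# Sketch: an RMN template potential applied to every pair of students who
# share a study group.  Potential values and grades are invented.
study_groups = {"Geo": ["George", "Jane"], "CS": ["Jane", "Jill"]}
grades = {"George": "A", "Jane": "B", "Jill": "A"}

def pair_potential(g1, g2):
    # Template potential: study partners tend to earn similar grades
    return 2.0 if g1 == g2 else 0.5

# Unnormalized probability of this grade assignment: product over all
# instantiations of the template (one per pair in each study group)
score = 1.0
for group, members in study_groups.items():
    for i in range(len(members)):
        for j in range(i + 1, len(members)):
            score *= pair_potential(grades[members[i]], grades[members[j]])

print(score)  # two mismatched pairs: 0.5 * 0.5 = 0.25
```

Dividing by the sum of such scores over all grade assignments would give the normalized joint distribution.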
Outline • Relational Bayesian Networks • Relational Markov Networks • Collective Classification* • Discriminative training • Web page classification • Link prediction • Relational clustering * with Ben Taskar, Carlos Guestrin, Ming Fai Wong, Pieter Abbeel
Collective Classification • Probabilistic Relational Model over Student, Course, Reg • Learning: training data with features .x and labels .y* → model structure • Inference: new data with features ’.x → conclusions about labels ’.y • Example: train on one year of student intelligence, course difficulty, and grades; given only grades in the following year, predict all students’ intelligence
Student1 Reg1 Intelligence Grade Course Study Group Difficulty Reg2 Student2 Grade Intelligence Learning RMN Parameters Parameterize potentials as log-linear model Template potential
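The log-linear parameterization on this slide can be sketched as follows; the weights and binary features are invented for illustration:

```python
import math

# Sketch of the log-linear parameterization of a template potential:
# phi(y_c) = exp(w . f(y_c)), with invented weights and binary features.
w = [1.2, -0.7]                       # "learned" weights (hypothetical)

def features(grade1, grade2):
    # f1: the two grades agree; f2: they differ
    same = 1.0 if grade1 == grade2 else 0.0
    return [same, 1.0 - same]

def potential(grade1, grade2):
    f = features(grade1, grade2)
    return math.exp(sum(wi * fi for wi, fi in zip(w, f)))

print(round(potential("A", "A"), 3))  # exp(1.2): agreement is rewarded
print(round(potential("A", "B"), 3))  # exp(-0.7): disagreement is penalized
```

Because the potential is an exponential of a linear function, the log-likelihood is concave in w, which is what makes gradient-based parameter learning tractable.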
Max Likelihood Estimation • Estimation: maximize_w likelihood of training data (features .x, labels .y*) • Classification: argmax_y • But we don’t care about the joint distribution P(.x, .y)
WebKB [Craven et al.] [Figure: knowledge-base fragment. Tom Mitchell (Professor) is Advisor-of Sean Slattery (Student); both are linked to the WebKB Project via Member and Project-of relations]
Web Classification Experiments • WebKB dataset • Four CS department websites • Bag of words on each page • Links between pages • Anchor text for links • Experimental setup • Trained on three universities • Tested on fourth • Repeated for all four combinations
Standard Classification [Figure: flat model. Page Category with word features Word1 … WordN, e.g. “professor”, “department”, “extract”, “information”, “computer science”, “machine learning”, …] • Categories: faculty, course, project, student, other
Standard Classification • Add link features LinkWord1 … LinkWordN from anchor text, e.g. “working with Tom Mitchell” • Discriminatively trained naïve Markov = logistic regression • 4-fold CV: trained on 3 universities, tested on 4th [Chart: test-set error of the Logistic baseline, y-axis 0 to 0.18]
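The "naïve Markov = logistic regression" equivalence can be sketched directly: a log-linear model over independent word features reduces to a sigmoid of a weighted sum. Weights and words below are made up:

```python
import math

# Sketch: a discriminatively trained "naive Markov" model over independent
# word features is exactly logistic regression.  Weights are made up.
weights = {"professor": 1.5, "homework": -0.8, "machine": 0.3}
bias = -0.2

def p_faculty(page_words):
    # Log-odds are a weighted sum of word indicators; probability is a sigmoid
    z = bias + sum(weights.get(wd, 0.0) for wd in page_words)
    return 1.0 / (1.0 + math.exp(-z))

prob = p_faculty(["professor", "machine", "learning"])
print(prob > 0.5)  # strongly "faculty"-flavored words win
```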
Power of Context Professor? Student? Post-doc?
Collective Classification [Figure: linked model. From-Page and To-Page, each with a Category and word features Word1 … WordN; a Link between them carries a Compatibility(From,To) potential]
Collective Classification • Classify all pages collectively, maximizing the joint label probability [Chart: test-set error, Logistic vs. Links model, y-axis 0 to 0.18] [Taskar, Abbeel, K., 2002]
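As a crude stand-in for the joint inference used on this slide, iterative classification (a much simpler heuristic than belief propagation over the RMN) conveys the idea: each page starts from its word-based guess and repeatedly adopts the majority label among itself and its linked neighbors. All data below are invented:

```python
# Sketch: iterative classification as a stand-in for joint inference.
# Pages start from a local (word-based) guess, then repeatedly take the
# majority vote of their linked neighbors plus their own local guess.
links = {"p1": ["p2", "p3"], "p2": ["p1"], "p3": ["p1", "p4"], "p4": ["p3"]}
local_guess = {"p1": "faculty", "p2": "student",
               "p3": "student", "p4": "student"}

labels = dict(local_guess)
for _ in range(5):                       # a few sweeps usually suffice
    for page, nbrs in links.items():
        votes = [labels[n] for n in nbrs] + [local_guess[page]]
        labels[page] = max(set(votes), key=votes.count)

print(labels)
```

Here the link structure overrides p1's weak local guess, illustrating how neighboring labels inform each other.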
More Complex Structure [Figure: pages grouped into sections. Word features W1 … Wn, section variables S, page category C; entity groups Students, Faculty, Courses]
Collective Classification: Results • 35.4% error reduction over logistic [Chart: test-set error for Logistic, Links, Section, and Link+Section models, y-axis 0 to 0.18] [Taskar, Abbeel, K., 2002]
Max Conditional Likelihood • Estimation: maximize_w conditional likelihood of labels given features (.x, .y*) • Classification: argmax_y • But we don’t care about the conditional distribution P(.y|.x)
Max Margin Estimation • What we really want: correct class labels • Estimation: maximize ||w||=1 the margin, scaled by the # of labeling mistakes in y • Classification: argmax_y • A quadratic program with exponentially many constraints [Taskar, Guestrin, K., 2003] (see also [Collins, 2002; Hoffman 2003])
Max Margin Markov Networks • We use structure of Markov network to provide equivalent formulation of QP • Exponential only in tree width of network • Complexity = max-likelihood classification • Can solve approximately in networks where induced width is too large • Analogous to loopy belief propagation • Can use kernel-based features! • SVMs meet graphical models [Taskar, Guestrin, K., 2003]
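A much simpler cousin of the max-margin QP is the structured perceptron [Collins, 2002], cited on the previous slide: whenever the current weights predict the wrong labeling, move the weights toward the features of the correct labeling and away from the mistake. The toy features and data below are invented:

```python
# Sketch: the structured-perceptron update, a simpler relative of the
# max-margin QP.  Toy joint features f(x, y) and data are invented.
def feat(x, y):
    # Joint feature vector: the input features fire only under label 1
    return [x[0] if y == 1 else 0.0, x[1] if y == 1 else 0.0]

def predict(w, x):
    scores = {y: sum(wi * fi for wi, fi in zip(w, feat(x, y)))
              for y in (0, 1)}
    return max(scores, key=scores.get)

w = [0.0, 0.0]
data = [([1.0, 0.5], 1), ([0.2, -1.0], 0)]
for _ in range(10):
    for x, y_star in data:
        y_hat = predict(w, x)
        if y_hat != y_star:
            # Update: add features of the truth, subtract features of the guess
            w = [wi + fs - fh for wi, fs, fh in
                 zip(w, feat(x, y_star), feat(x, y_hat))]

print(all(predict(w, x) == y for x, y in data))
```

The M3N formulation replaces this mistake-driven update with a QP whose constraints enforce a margin proportional to the number of labeling mistakes, made tractable by exploiting the Markov network structure.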
WebKB Revisited • 16.1% relative error reduction compared to conditional-likelihood RMNs
Predicting Relationships • Even more interesting: relationships between objects [Figure: WebKB fragment. Tom Mitchell (Professor) and Sean Slattery (Student), with Advisor-of and Member links to the WebKB Project]
Predicting Relations • Introduce exists/type attribute for each potential link • Learn discriminative model for this attribute • Collectively predict its value in new world • 72.9% error reduction over flat model [Figure: From-Page and To-Page Categories with word features Word1 … WordN; a Relation Exists/Type variable with link-word features LinkWord1 … LinkWordN] [Taskar, Wong, Abbeel, K., 2003]
Outline • Relational Bayesian Networks • Relational Markov Networks • Collective Classification • Relational clustering • Movie data* • Biological data† * with Ben Taskar, Eran Segal † with Eran Segal, Nir Friedman, Aviv Regev, Dana Pe’er, Haidong Wang, Micha Shapira, David Botstein
Relational Clustering • Probabilistic Relational Model over Student, Course, Reg • Learning from unlabeled relational data → model structure & clustering of instances • Example: given only students’ grades, cluster similar students
Learning w. Missing Data: EM • EM Algorithm applies essentially unchanged • E-step computes expected sufficient statistics, aggregated over all objects in class • M-step uses ML (or MAP) parameter estimation • Key difference: • In general, the hidden variables are not independent • Computation of expected sufficient statistics requires inference over entire network
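The E-step/M-step loop described above can be sketched on the smallest possible example, a two-cluster mixture of biased coins; the data and initialization are invented:

```python
# Minimal EM sketch for a two-cluster mixture of biased coins: the E-step
# computes expected sufficient statistics (responsibilities), the M-step
# re-estimates parameters by maximum likelihood.  Data are invented.
data = [1, 1, 1, 0, 1, 0, 0, 0, 0, 1]      # observed binary outcomes
pi, p = [0.5, 0.5], [0.6, 0.3]             # mixing weights, heads-probs

for _ in range(50):
    # E-step: responsibility of each cluster for each observation
    resp = []
    for x in data:
        lik = [pi[k] * (p[k] if x else 1 - p[k]) for k in (0, 1)]
        z = sum(lik)
        resp.append([l / z for l in lik])
    # M-step: ML re-estimation from the expected sufficient statistics
    n_k = [sum(r[k] for r in resp) for k in (0, 1)]
    pi = [n / len(data) for n in n_k]
    p = [sum(r[k] * x for r, x in zip(resp, data)) / n_k[k] for k in (0, 1)]

print(round(sum(pi), 6))  # mixing weights stay normalized: 1.0
```

In the relational setting the E-step is the expensive part, because the hidden variables are coupled and the expected statistics require inference over the whole ground network rather than a per-datapoint computation like the one above.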
Learning w. Missing Data: EM [Dempster et al. 77] [Figure: EM on the student domain, re-estimating P(Registration.Grade | Course.Difficulty, Student.Intelligence) with grades A/B/C, difficulty easy/hard, intelligence low/high]
Movie Data Internet Movie Database http://www.imdb.com
Discovering Hidden Types • Learn model using EM [Figure: IMDB model. Actor, Director, and Movie classes, each with a hidden Type variable; Movie attributes: Rating, Genres, #Votes, Year, MPAA Rating] [Taskar, Segal, K., 2001]
Discovering Hidden Types [Taskar, Segal, K., 2001] • Directors: • Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola • Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher • Movies: • Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson • Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October • Actors: • Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger • Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman, …
Biology 101: Gene Expression [Figure: Gene 1 and Gene 2 on DNA, each with Coding and Control regions; the Swi5 transcription factor binds a control region; genes are transcribed to RNA and translated to protein] • Cells express different subsets of their genes in different tissues and under different conditions
Gene Expression Microarrays • Measure mRNA level for all genes in one condition • Hundreds of experiments • Highly noisy [Figure: genes × experiments matrix; each entry is the expression of gene i in experiment j; induced vs. repressed genes]
Standard Analysis • Cluster genes by similarity of expression profiles • Manually examine clusters to understand what’s common to genes in cluster [Figure: clustered expression matrix]
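The "cluster genes by similarity of expression profiles" step can be sketched with a tiny hand-rolled k-means; the gene names and expression values are invented:

```python
# Sketch of the standard analysis: k-means over expression profiles.
# Gene names, profiles, and the initialization are invented.
genes = {
    "geneA": [2.0, 2.1, 1.9],
    "geneB": [1.8, 2.0, 2.2],
    "geneC": [-1.0, -0.9, -1.2],
    "geneD": [-1.1, -1.0, -0.8],
}

def dist(u, v):
    # Squared Euclidean distance between two expression profiles
    return sum((a - b) ** 2 for a, b in zip(u, v))

centroids = [genes["geneA"], genes["geneC"]]   # crude initialization
for _ in range(5):
    # Assignment step: each gene joins its nearest centroid
    clusters = {0: [], 1: []}
    for name, profile in genes.items():
        k = min((0, 1), key=lambda c: dist(profile, centroids[c]))
        clusters[k].append(name)
    # Update step: centroids move to the mean profile of their cluster
    for k in (0, 1):
        members = [genes[n] for n in clusters[k]]
        centroids[k] = [sum(col) / len(col) for col in zip(*members)]

print(sorted(clusters[0]), sorted(clusters[1]))
```

The slides' point is that this pipeline treats each gene's profile as a flat feature vector; the PRM approach that follows instead models expression as a function of gene and experiment properties.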
General Approach • Expression level is a function of gene properties and experiment properties [Figure: attributes of Gene i and properties of Experiment j determine the Expression Level of Gene i in Experiment j] • Learn model that best explains the data • Observed properties: gene sequence, array condition, … • Hidden properties: gene cluster • Learning yields: assignment to hidden variables (e.g., module assignment) and expression level as function of properties
Clustering as a PRM [Figure: naïve Bayes model. Each Gene has a hidden Cluster ID g.C (values 1, 2, 3, …); the Expression Level in each experiment g.E1, g.E2, …, g.Ek has its own CPD P(Ei.L | g.C): CPD 1, CPD 2, …, CPD k]