Learning Relational Probability Trees • Jennifer Neville, David Jensen, Lisa Friedland, Michael Hay • Presented by Andrew Tjang
About the authors • David Jensen – Director of KDL, research assistant professor @ UMass • His research focuses on the statistical aspects and architecture of data mining systems, and on the evaluation of those systems for government and business applications
Authors, cont’d • Lisa Friedland – graduate student, KDL, interested in DB/data mining • Michael Hay – graduate student, KDL • Jennifer Neville – master’s student @ UMass in the Knowledge Discovery Lab (KDL investigates how to find useful patterns in large and complex databases)
Learning Relational Probability Trees (RPTs) • Classification trees are widely used in data mining and are easily interpretable • Conventional trees assume homogeneous, statistically independent data • Much research has already gone into classification trees, so in extending them to relational data their well-understood variance behavior can be safely “ignored” and the relational effects kept on the front burner
Why is relational data so complicated? • The traditional assumption of instance independence is violated • 3 characteristics of relational data • Concentrated linkage • Degree disparity • Relational autocorrelation (see the sketch below) • All 3 can produce overly complex models with excess structure • This leads to factually incorrect models and greater space/computation requirements
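To make the third characteristic concrete, here is a minimal sketch (Python; the toy labels and links are hypothetical) that measures relational autocorrelation as the correlation of a class label across pairs of linked objects:

```python
import numpy as np

# Toy relational data (hypothetical): a label per object, plus links between objects.
labels = {"m1": 1, "m2": 1, "m3": 0, "m4": 0, "m5": 1}
links = [("m1", "m2"), ("m2", "m5"), ("m3", "m4"), ("m1", "m5")]

# Relational autocorrelation: correlate an attribute's values across linked pairs.
x = np.array([labels[a] for a, b in links], dtype=float)
y = np.array([labels[b] for a, b in links], dtype=float)
print(np.corrcoef(x, y)[0, 1])  # near 1.0 => linked objects tend to share labels
```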
What did they do? • Conventional techniques/models assume instance independence • They developed the RPT and a class of algorithms that adjust for these relational characteristics – by their account, RPTs are the only statistical model that adjusts for all three • Applied the algorithms to a set of databases (IMDb, Cora, WebKB, human genome) • Then evaluated the accuracy of the resulting RPTs
Examples of relational data • Social networks, genomic data, data on interrelated people, places, things, and events extracted from text documents • The first step was to apply RPTs to WebKB • (a collection of web pages classified into categories – course, faculty/staff, student, research project) • Approx. 4,000 pages, 8,000 hyperlinks • The DB contains attributes associated with each page/hyperlink (path/domain, direction of hyperlink)
RPTs • Estimate probability distributions over attribute values • These algorithms differ from conventional ones in that they consider the effects of related objects • Movie example: aggregation is used to “proportionalize” relational data • E.g. some movies have 10 actors, some have 1000 • Aggregate as average age, either preprocessed or computed dynamically during the learning process (see the sketch below)
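A minimal sketch of that aggregation step, using hypothetical movie/actor data: variable-size sets of related objects are collapsed into fixed-size aggregate features (average, max, count):

```python
from statistics import mean

# Hypothetical toy data: each movie links to a variable number of actors.
actor_ages = {
    "Movie A": [25, 31, 44],              # 3 actors
    "Movie B": [19, 22, 35, 41, 57, 63],  # 6 actors
}

# Collapse each variable-size set into fixed-size aggregate features.
for movie, ages in actor_ages.items():
    features = {"avg_age": mean(ages), "max_age": max(ages), "n_actors": len(ages)}
    print(movie, features)
```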
The Algorithm • The RPT algorithm takes a collection of subgraphs as input (a single target object to be classified + the other objects/links in its relational neighborhood); a possible representation is sketched below • It constructs a probability estimation tree (like the previous figure) to predict the target class label given: • The attributes of the target object • The attributes of the other objects in the neighborhood • The degree attributes counting objects and links in the neighborhood
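One plausible way to represent a single input subgraph – the class name and fields here are invented for illustration, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Subgraph:
    """One training instance: a target object plus its relational neighborhood."""
    target_attrs: dict    # attributes of the object to be classified
    neighbor_attrs: list  # one attribute dict per related object
    label: str            # class label of the target object

    def degree(self) -> int:
        # Degree features simply count the objects/links in the neighborhood.
        return len(self.neighbor_attrs)

# Hypothetical example: a movie subgraph with two linked actors.
sg = Subgraph(target_attrs={"genre": "comedy"},
              neighbor_attrs=[{"age": 31, "gender": "female"},
                              {"age": 67, "gender": "male"}],
              label="high_receipts")
print(sg.degree())  # -> 2
```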
Algorithm (cont’d) • Searches over a space of binary relational features to “split the data”, e.g. MODE(actor.gender)=female • Constructs features based on the attributes of the different objects in the subgraphs and aggregations of their values • Feature scores are calculated from the class distributions after the split, using chi-square analysis (see the sketch below)
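A minimal sketch of scoring one binary split with a chi-square test; the counts are hypothetical and scipy is assumed to be available:

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table (hypothetical counts):
# rows = feature true/false, columns = class +/-
table = [[30, 10],   # feature holds:        30 positive, 10 negative
         [15, 45]]   # feature doesn't hold: 15 positive, 45 negative

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)  # a low p-value => feature and class are correlated
```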
Algorithm, part trois • The p-value from the chi-square test reveals which features are significantly correlated with the class; the highest-scoring of these is included in the tree • The algorithm recurses greedily until the class distributions no longer change significantly • A Bonferroni adjustment + Laplace correction are applied to improve the probability estimates (sketched below)
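A sketch of the two corrections in their standard forms (the paper’s exact formulation may differ): the Bonferroni adjustment scales each p-value by the number of hypotheses tested at a node, and the Laplace correction smooths leaf probability estimates away from 0 and 1:

```python
def bonferroni(p_value: float, n_tests: int) -> float:
    """Adjust a p-value for the number of candidate features tested at this node."""
    return min(1.0, p_value * n_tests)

def laplace(n_class: int, n_total: int, n_classes: int = 2) -> float:
    """Smoothed leaf probability estimate: (k + 1) / (N + C)."""
    return (n_class + 1) / (n_total + n_classes)

print(bonferroni(0.002, 40))  # 0.08 after testing 40 candidate features
print(laplace(0, 5))          # ~0.14 rather than a hard 0.0
```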
Relational features • Traditional features • Genre=comedy • (ID attribute, operator, value) • Relational features • Movie(x), Y = {y | ActedIn(x, y)} : Max(Age(Y)) > 65 • These identify a relation that links an object x to a set of other objects Y (an evaluation sketch follows)
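A sketch of evaluating that example feature on one movie’s neighborhood (the ages are hypothetical):

```python
# Evaluate Movie(x), Y = {y | ActedIn(x, y)} : Max(Age(Y)) > 65
def max_age_over_65(actor_ages):
    """True if the set of linked actors contains anyone older than 65."""
    return bool(actor_ages) and max(actor_ages) > 65

print(max_age_over_65([31, 67, 52]))  # True  -> at least one actor over 65
print(max_age_over_65([31, 44, 52]))  # False
```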
Feature Selection Biases • Concentrated linkage and autocorrelation combine to reduce the effective sample size and increase variance • Degree disparity can cause spurious elements to be selected and useful elements to be missed
Randomization Tests • Used on relational data sets, where conventional hypothesis testing is less useful • Used to account for the biases above • Computationally intensive • Pseudo-samples are generated by permuting variables • In this case, attribute values are permuted before aggregation • The permuted vector preserves the intrinsic links/relations but removes any correlation between the attributes and the class label (see the sketch below)
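A minimal sketch of a randomization test on an already-aggregated feature (the score function and data are hypothetical stand-ins): shuffle the attribute values, recompute the score many times, and compare the real score against that null distribution:

```python
import random

def feature_score(values, labels):
    # Placeholder score: absolute difference of the class-conditional means.
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

def randomization_p_value(values, labels, n_perm=1000, seed=0):
    rng = random.Random(seed)
    observed = feature_score(values, labels)
    hits = 0
    for _ in range(n_perm):
        shuffled = values[:]   # permute attribute values only;
        rng.shuffle(shuffled)  # links and class labels stay intact
        if feature_score(shuffled, labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # empirical p-value

# Hypothetical aggregated feature values and class labels.
values = [70, 68, 40, 35, 72, 30]
labels = [1, 1, 0, 0, 1, 0]
print(randomization_p_value(values, labels))  # small => correlation is real
```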
Tests • Five classification tasks: • 1) IMDb, predicting > $2 million in opening-weekend box office receipts, with class labels correlated only with degree features • 2) IMDb again, using its structure as well as its attributes • 3) Cora: is a paper on the topic “neural networks”? • 4) Genome project: is a gene located in the nucleus? • 5) WebKB: is a page a student web page?
Tests (cont’d) • To each of these tasks, a set of classification techniques was applied: • RPT with chi-square tests (CT) • RPT with randomization tests (RT) • RPT (pseudo) w/o randomization tests (C4.5) • Baseline: Relational Bayesian Classifier (RBC) • Entries marked * show statistically significant differences
Results explained • On none of the tasks except the random-label one did their algorithm perform better than the other algorithms • For the random-label task they show: • RPTs can adjust for linkage and autocorrelation • RTs perform better than the other 3 models • Biased models select random attributes that hinder performance
Conclusions (theirs and mine) • It is possible to extend conventional probability estimation tree algorithms to relational data • RPT with randomization and RPT without randomization perform about the same (though the former yields smaller trees) • Randomization adjusts for biases (but is this so important if the two perform the same?) • Non-selective models don’t have these biases, but lose interpretability • Future work: enrichments to RPT • Better ways of calculating ranking scores