

  1. Aggregate features for relational data • Claudia Perlich, Foster Provost • Pat Tressel • 16-May-2005

  2. Overview • Perlich and Provost provide... • Hierarchy of aggregation methods • Survey of existing aggregation methods • New aggregation methods • Concerned w/ supervised learning only • But much seems applicable to clustering

  3. The issues… • Most classifiers use feature vectors • Individual features have fixed arity • No links to other objects • How do we get feature vectors from relational data? • Flatten it: • Joins • Aggregation • (Are feature vectors all there are?)

  4. Joins • Why consider them? • Yield flat feature vectors • Preserve all the data • Why not use them? • They emphasize data with many references • Ok if that’s what we want • Not ok if sampling was skewed • Cascaded or transitive joins blow up

  5. Joins • They emphasize data with many references: • Lots more Joes than there were before...
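To make the replication concrete, here is a minimal pandas sketch of a 1:n join; the tables, columns, and values are hypothetical. The point it shows is that whoever has the most related rows dominates the joined table.

```python
import pandas as pd

# Hypothetical target table T and a 1:n "transactions" table.
people = pd.DataFrame({"person_id": [1, 2], "name": ["Joe", "Ann"]})
trans = pd.DataFrame({"person_id": [1, 1, 1, 2], "amount": [10, 20, 30, 5]})

joined = people.merge(trans, on="person_id")  # inner 1:n join
print(joined)
# Joe now occupies 3 of the 4 rows: the join over-represents whoever
# has the most references, which skews any learner trained on it.
```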

  6. Joins • Why not use them? • What if we don’t know the references? • Try out everything with everything else • Cross product yields all combinations • Adds fictitious relationships • Combinatorial blowup

  7. Joins • What if we don’t know the references?

  8. Aggregates • Why use them? • Yield flat feature vectors • No blowup in number of tuples • Can group tuples in all related tables • Can keep as detailed stats as desired • Not just max, mean, etc. • Parametric dists from sufficient stats • Can apply tests for grouping • Choice of aggregates can be model-based • Better generalization • Include domain knowledge in model choice

  9. Aggregates • Anything wrong with them? • Data is lost • Relational structure is lost • Influential individuals are lumped in • Doesn’t discover critical individuals • Dominates other data • Any choice of aggregates assumes a model • What if it’s wrong? • Adding new data can require calculations • But can avoid issue by keeping sufficient statistics

  10. Taxonomy of aggregates • Why is this useful? • Promote deliberate use of aggregates • Point out gaps in current use of aggregates • Find appropriate techniques for each class • Based on “complexity” due to: • Relational structure • Cardinality of the relations (1:1, 1:n, m:n) • Feature extraction • Computing the aggregates • Class prediction

  11. Taxonomy of aggregates • Formal statement of the task: • Notation (here and on following slides): • t, tuple (from “target” table T, with main features) • y, class (known per t if training) • Ψ, aggregation function • Φ, classification function • σ, select operation (where joins preserve t) • Ω, all tables; B, any other table, b in B • u, fields to be added to t from other tables • f, a field in u • More, that doesn’t fit on this slide

  12. Taxonomy of aggregates • Formal statement of the task: • Notation (here and on following slides): • Caution! Simplified from what’s in the paper! • t, tuple (from “target” table T, with main features) • y, class (known per t if training) • Ψ, aggregation function • Φ, classification function • σ, select operation (where joins preserve t) • Ω, all tables; B, any other table, b a tuple in B • u, fields to be added to t from joined tables • f, a field in u • More, that doesn’t fit on this slide

  13. Aggregation complexity • Simple • One field from one object type • Denoted by:

  14. Aggregation complexity • Multi-dimensional • Multiple fields, one object type • Denoted by:

  15. Aggregation complexity • Multi-type • Multiple object types • Denoted by:

  16. Relational “concept” complexity • Propositional • No aggregation • Single tuple, 1-1 or n-1 joins • n-1 is just a shared object • Not relational per se – already flat

  17. Relational “concept” complexity • Independent fields • Separate aggregation per field • Separate 1-n joins with T

  18. Relational “concept” complexity • Dependent fields in same table • Multi-dimensional aggregation • Separate 1-n joins with T

  19. Relational “concept” complexity • Dependent fields over multiple tables • Multi-type aggregation • Separate 1-n joins, still only with T

  20. Relational “concept” complexity • Global • Any joins or combinations of fields • Multi-type aggregation • Multi-way joins • Joins among tables other than T

  21. Current relational aggregation • First-order logic • Find clauses that directly predict the class • Φ is OR • Form binary features from tests • Logical and arithmetic tests • These go in the feature vector • Φ is any ordinary classifier
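As a rough illustration of the second option above (turning logical/arithmetic tests into binary features for an ordinary classifier), a small Python sketch; the clause names and tests are hypothetical stand-ins for clauses an ILP system might produce.

```python
import pandas as pd

cases = pd.DataFrame({
    "t_id": [1, 2, 3],
    "n_orders": [12, 0, 3],
    "country": ["US", "DE", "US"],
})

# Hypothetical clause tests, expressed as boolean functions of the case tuple t.
clauses = {
    "many_orders":   lambda t: t["n_orders"] > 10,
    "us_with_order": lambda t: (t["country"] == "US") & (t["n_orders"] > 0),
}

# Evaluate each clause per case and append the 0/1 results as features.
binary_features = pd.DataFrame({name: test(cases) for name, test in clauses.items()},
                               index=cases.index).astype(int)
print(cases.join(binary_features))   # feed these columns to any ordinary classifier
```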

  22. Current relational aggregation • The usual database aggregates • For numerical values: • mean, min, max, count, sum, etc. • For categorical values: • Most common value • Count per value
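A sketch of these usual database-style aggregates computed per target tuple, assuming the joined detail rows sit in a pandas DataFrame; the key and column names are illustrative.

```python
import pandas as pd

trans = pd.DataFrame({
    "person_id": [1, 1, 1, 2],
    "amount":    [10, 20, 30, 5],
    "category":  ["food", "rent", "food", "food"],
})

# Numeric field: the classic aggregates, one row per target tuple.
num_agg = trans.groupby("person_id")["amount"].agg(["mean", "min", "max", "count", "sum"])

# Categorical field: most common value and count per value.
mode = trans.groupby("person_id")["category"].agg(lambda s: s.mode().iat[0])
counts = trans.groupby("person_id")["category"].value_counts().unstack(fill_value=0)

print(num_agg.join(mode.rename("top_category")).join(counts))
```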

  23. Current relational aggregation • Set distance • Two tuples, each with a set of related tuples • Distance metric between related fields • Euclidean for numerical data • Edit distance for categorical • Distance between sets is distance of closest pair
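A sketch of the set-distance idea as defined here: the distance between two cases is the distance of the closest pair across their bags of related tuples (Euclidean for numeric fields). Function and variable names are illustrative.

```python
import numpy as np

def set_distance(bag_a, bag_b):
    """Closest-pair distance between two bags of numeric feature vectors."""
    a = np.asarray(bag_a, dtype=float)   # shape (na, d)
    b = np.asarray(bag_b, dtype=float)   # shape (nb, d)
    # All pairwise Euclidean distances, then take the minimum.
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min()

print(set_distance([[0, 0], [5, 5]], [[4, 4], [10, 10]]))  # ~1.41, from (5,5)-(4,4)
```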

  24. Proposed relational aggregation • Recall the point of this work: • Tuple t from table T is part of a feature vector • Want to augment w/ info from other tables • Info added to t must be consistent w/ values in t • Need to flatten the added info to yield one vector per tuple t • Use that to: • Train classifier given class y for t • Predict class y for t

  25. Proposed relational aggregation • Outline of steps: • Do query to get more info u from other tables • Partition the results based on: • Main features t • Class y • Predicates on t • Extract distributions over results for fields in u • Get distribution for each partition • For now, limit to categorical fields • Suggest extension to numerical fields • Derive features from distributions

  26. Do query to get info from other tables • Select • Based on the target table T • If training, known class y is included in T • Joins must preserve distinct values from T • Join on as much of T’s key as is present in other table • Maybe need to constrain other fields? • Not a problem for correctly normalized tables • Project • Include all of t • Append additional fields u from joined tables • Anything up to all fields from joins

  27. Extract distributions • Partition query results various ways, e.g.: • Into cases per each t • For training, include the (known) class y in t • Also (if training) split per each class • Want this for class priors • Split per some (unspecified) predicate c(t) • For each partition: • There is a bag of associated u tuples • Ignore the t part – already a flat vector • Split vertically to get bags of individual values per each field f in u • Note this breaks association between fields!

  28. Distributions for categorical fields • Let categorical field be f with values fi • Form histogram for each partition • Count instances of each value fi of f in a bag • These are sufficient statistics for: • Distribution over fi values • Probability of each bag in the partition • Start with one per each tuple t and field f • Cft, (per-) case vector • Component Cft[i], count for fi
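A sketch of the case vectors Cft (one value-count histogram per target tuple t and categorical field f), assuming the query result has one row per related u tuple with the key of t attached; the key and field names are hypothetical.

```python
import pandas as pd

# Result of the join: one row per related tuple u, keyed by the target tuple t.
joined = pd.DataFrame({
    "t_id":     [1, 1, 1, 2, 2],
    "category": ["food", "food", "rent", "rent", "travel"],  # field f
})

# Case vector Cft: counts of each value f_i within t's bag.
case_vectors = (joined.groupby("t_id")["category"]
                      .value_counts()
                      .unstack(fill_value=0))
print(case_vectors)
#        food  rent  travel
# t_id
# 1         2     1       0
# 2         0     1       1
```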

  29. Distributions for categorical fields • Distribution of histograms per predicate c(t) and field f • Treat histogram counts as random variables • Regard c(t) true partition as a collection of histogram “samples” • Regard histograms as vectors of random variables, one per field value fi • Extract moments of these histogram count distributions • mean (sort of) – reference vector • variance (sort of) – variance vector

  30. Distributions for categorical fields • Net histogram per predicate c(t), field f • c(t) partitions tuples t into two groups • Only histogram the c(t) true group • Could include ~c as a predicate if we want • Don’t re-count! • Already have histograms for each t and f – case vectors • Sum the case vectors columnwise • Call this a “reference vector”, Rfc • Proportional to average histogram over t for c(t) true (weighted by # samples per t)
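A sketch of the reference vector Rfc as the columnwise sum of the case vectors of the tuples with c(t) true. It reuses the case_vectors frame from the previous sketch, plus a hypothetical predicate mask over the target tuples.

```python
import pandas as pd

# case_vectors: rows indexed by t_id, one column per value f_i (see previous sketch).
case_vectors = pd.DataFrame(
    {"food": [2, 0], "rent": [1, 1], "travel": [0, 1]}, index=[1, 2])

# A predicate c(t) over the target tuples, e.g. "class y == 1" on the training set.
c_true = pd.Series({1: True, 2: False})

# Reference vector Rfc: columnwise sum of the case vectors with c(t) true.
R = case_vectors[c_true.reindex(case_vectors.index)].sum(axis=0)
print(R)   # food 2, rent 1, travel 0
```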

  31. Distributions for categorical fields • Variance of case histograms per predicate c(t) and field f • Define “variance vector”, Vfc • Columnwise sum of squares of case vectors / number of samples with c(t) true • Not an actual variance • Squared means not subtracted • Don’t care: • It’s indicative of the variance... • Throw in means-based features as well to give classifier full variance info
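A companion sketch for the “variance vector” as described: the columnwise sum of squared case-vector entries over the c(t)-true cases, divided by the number of such cases (a raw second moment, with squared means not subtracted). The case vectors and predicate here are a slightly larger hypothetical set.

```python
import pandas as pd

case_vectors = pd.DataFrame(
    {"food": [2, 0, 1], "rent": [1, 1, 0], "travel": [0, 1, 3]}, index=[1, 2, 3])
c_true = pd.Series({1: True, 2: False, 3: True})

selected = case_vectors[c_true.reindex(case_vectors.index)]
V = (selected ** 2).sum(axis=0) / len(selected)   # raw second moment per value f_i
print(V)   # food (4+1)/2, rent (1+0)/2, travel (0+9)/2
```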

  32. Distributions for categorical fields • What predicates might we use? • Unconditionally true, c(t) = true • Result is net distribution independent of t • Unconditional reference vector, R • Per class k, ck(t) = (t.y == k) • Class priors • Recall for training data, y is a field in t • Per-class reference vector, Rft.y=k

  33. Distributions for categorical fields • Summary of notation • c(t), a predicate based on values in a tuple t • f, a categorical field from a join with T • fi, values of f • Rfc, reference vector • histogram over fi values in bag for c(t) true • Cft, case vector • histogram over fi values for t’s bag • R, unconditional reference vector • Vfc, variance vector • Columnwise average squared case vector • X[i], ith value in some ref. vector X

  34. Distributions for numerical data • Same general idea – representative distributions per various partitions • Can use categorical techniques if we: • Bin the numerical values • Treat each bin as a categorical value
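A sketch of the suggested reduction for numerical fields: bin the values, treat each bin label as a categorical value, and reuse the same case-histogram machinery. The bin edges and column names here are arbitrary.

```python
import pandas as pd

joined = pd.DataFrame({
    "t_id":   [1, 1, 2, 2, 2],
    "amount": [12.0, 95.0, 3.0, 40.0, 41.0],   # numeric field
})

# Bin the numeric field, then build per-case histograms over the bins.
joined["amount_bin"] = pd.cut(joined["amount"], bins=[0, 25, 50, 100])
case_vectors = (joined.groupby("t_id")["amount_bin"]
                      .value_counts()
                      .unstack(fill_value=0))
print(case_vectors)
```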

  35. Feature extraction • Base features on ref. and variance vectors • Two kinds: • “Interesting” values • one value from case reference vector per t • same column in vector for all t • assorted options for choosing column • choices depend on predicate ref. vectors • Vector distances • distance between case ref. vector and predicate ref. vector • various distance metrics • More notation: acronym for each feature type

  36. Feature extraction: “interesting” values • For a given c, f, select that fi which is... • MOC: Most common overall • argmaxi R[i] • Most common in each class • For binary class y • Positive is y = 1, Negative is y = 0 • MOP: argmaxi Rft.y=1[i] • MON: argmaxi Rft.y=0[i] • Most distinctive per class • Common in one class but not in other(s) • MOD: argmaxi |Rft.y=1[i] - Rft.y=0[i]| • MOM: argmaxi |Rft.y=1[i] - Rft.y=0[i]| / |Vft.y=1[i] - Vft.y=0[i]| • Normalizes for variance (sort of)
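A sketch of the “interesting value” features under one reading of the acronyms, given the per-class reference vectors R1, R0 and the (loosely defined) variance vectors V1, V0. The MOM formula is a reconstruction from the slide and should be checked against the paper.

```python
import numpy as np

def interesting_values(R, R1, R0, V1, V0):
    """Index of the 'interesting' value f_i under each criterion.

    R: unconditional reference vector; R1/R0: per-class reference vectors;
    V1/V0: per-class 'variance' vectors (raw second moments).
    """
    R, R1, R0, V1, V0 = map(np.asarray, (R, R1, R0, V1, V0))
    eps = 1e-9                                   # avoid division by zero
    return {
        "MOC": int(np.argmax(R)),                # most common overall
        "MOP": int(np.argmax(R1)),               # most common in positive class
        "MON": int(np.argmax(R0)),               # most common in negative class
        "MOD": int(np.argmax(np.abs(R1 - R0))),  # most class-distinctive value
        "MOM": int(np.argmax(np.abs(R1 - R0) / (np.abs(V1 - V0) + eps))),
    }

print(interesting_values([5, 9, 2], [4, 1, 2], [1, 8, 0], [9, 2, 3], [1, 7, 1]))
```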

  37. Feature extraction: vector distance • Distance btw given ref. vector & each case vector • Distance metrics • ED: Edit – not defined • Sum of abs. diffs, a.k.a. Manhattan dist? • Σi |C[i] – R[i]| • EU: Euclidean • √((C – R)T(C – R)), omit √ for speed • MA: Mahalanobis • √((C – R)T Σ-1 (C – R)), omit √ for speed • Σ should be covariance...of what? • CO: Cosine, 1 – cos(angle btw vectors) • 1 – CTR / (√(CTC) √(RTR))
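A sketch of the vector-distance features between a case vector C and a reference vector R, using the standard definitions of the named metrics; the covariance Σ for Mahalanobis is left as an explicit parameter, since the deck itself asks what it should be.

```python
import numpy as np

def distance_features(C, R, cov=None):
    C, R = np.asarray(C, float), np.asarray(R, float)
    d = C - R
    feats = {
        "ED": np.abs(d).sum(),                    # Manhattan / sum of abs. diffs
        "EU": float(d @ d),                       # squared Euclidean (sqrt omitted)
        "CO": 1.0 - (C @ R) / (np.linalg.norm(C) * np.linalg.norm(R)),
    }
    if cov is not None:                           # Mahalanobis needs a covariance
        feats["MA"] = float(d @ np.linalg.solve(cov, d))  # squared, sqrt omitted
    return feats

C, R = [2, 1, 0], [1, 2, 3]
print(distance_features(C, R, cov=np.eye(3)))
```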

  38. Feature extraction: vector distance • Apply each metric w/ various ref. vectors • Acronym is metric w/ suffix for ref. vector • (No suffix): Unconditional ref. vector • P: per-class positive ref. vector, Rft.y=1 • N: per-class negative ref. vector, Rft.y=0 • D: difference between the P and N distances • Alphabet soup, e.g. EUP, MAD,...

  39. Feature extraction • Other features added for tests • Not part of their aggregation proposal • AH: “abstraction hierarchy” (?) • Pull into T all fields that are just “shared records” via n:1 references • AC: “autocorrelation” aggregation • For joins back into T, get other cases “linked to” each t • Fraction of positive cases among others
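A sketch of the AC (“autocorrelation”) feature as described: for joins back into T, take the cases each t is linked to and compute the fraction of positive cases among them. The link table and labels are hypothetical.

```python
import pandas as pd

# Hypothetical links among target cases (t "linked to" other cases via a join back into T).
links = pd.DataFrame({"t_id": [1, 1, 2, 3], "other_id": [2, 3, 3, 1]})
labels = pd.Series({1: 1, 2: 0, 3: 1}, name="y")   # known class per case

# AC feature: fraction of positive cases among the cases each t is linked to.
ac = (links.assign(other_y=links["other_id"].map(labels))
           .groupby("t_id")["other_y"].mean())
print(ac)   # t=1 -> (0+1)/2, t=2 -> 1.0, t=3 -> 1.0
```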

  40. Learning • Find linked tables • Starting from T, do breadth-first walk of schema graph • Up to some max depth • Cap number of paths followed • For each path, know T is linked to last table in path • Extract aggregate fields • Pull in all fields of last table in path • Aggregate them (using new aggregates) per t • Append aggregates to t
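A sketch of the breadth-first schema walk, with the schema represented as a hypothetical adjacency dict of foreign-key links; it enumerates the join paths out of T up to a depth and path cap.

```python
from collections import deque

def join_paths(schema, target, max_depth=2, max_paths=50):
    """Breadth-first enumeration of join paths starting at the target table."""
    paths, queue = [], deque([[target]])
    while queue and len(paths) < max_paths:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)          # T is linked to the last table in the path
        if len(path) - 1 < max_depth:
            for nxt in schema.get(path[-1], []):
                queue.append(path + [nxt])
    return paths

schema = {"T": ["Orders", "Demographics"], "Orders": ["Products"], "Products": []}
print(join_paths(schema, "T", max_depth=2))
# [['T', 'Orders'], ['T', 'Demographics'], ['T', 'Orders', 'Products']]
```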

  41. Learning • Classifier • Pick 10 subsets each w/ 10 features • Random choice, weighted by “performance” • But there’s no classifier yet...so how do features predict class? • Build a decision tree for each feature set • Have class frequencies at leaves • Features might not completely distinguish classes • Class prediction: • Select class with higher frequency • Class probability estimation: • Average frequencies over trees
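A sketch of the described classifier using random feature subsets and scikit-learn decision trees whose leaf class frequencies (predict_proba) are averaged for class probability estimation. The “performance-weighted” subset selection is omitted here and replaced with uniform random choice, and the data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_subset_trees(X, y, n_trees=10, subset_size=10, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        cols = rng.choice(X.shape[1], size=min(subset_size, X.shape[1]), replace=False)
        # Shallow tree: induction cut off short, leaves keep class frequencies.
        tree = DecisionTreeClassifier(max_depth=4).fit(X[:, cols], y)
        trees.append((cols, tree))
    return trees

def predict_proba(trees, X):
    # Average the leaf class frequencies over the trees.
    return np.mean([tree.predict_proba(X[:, cols]) for cols, tree in trees], axis=0)

# Tiny synthetic example: 200 cases, 30 features, binary class.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
trees = fit_subset_trees(X, y)
print(predict_proba(trees, X[:5]))
```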

  42. Tests • IPO data • 5 tables • Most fields in the “main” table, used as T • Other tables had key & one data field • Predicate on one field in T used as the class • Tested against: • First-order logic aggregation • Extract clauses using an ILP system • Append evaluated clauses to each t • Various ILP systems • Using just data in T • (Or T and AH features?)

  43. Tests • IPO data • 5 tables w/ small, simple schema • Majority of fields were in the “main” table, i.e. T • The only numeric fields were in main table, so no aggregation of numeric features needed • Other tables had key & one data field • Max path length 2 to reach all tables, no recursion • Predicate on one field in T used as the class • Tested against: • First-order logic aggregation • Extract clauses using an ILP system • Append evaluated clauses to each t • Various ILP systems • Using just data in T (or T and AH features?)

  44. Test results • See paper for numbers • Accuracy with aggregate features: • Up to 10% increase over only features from T • Depends on which and how many extra features used • Most predictive feature was in a separate table • Expect accuracy increase as more info available • Shows info was not destroyed by aggregation • Vector distance features did better • Suggests better generalization

  45. Interesting ideas (“I”) & benefits (“B”) • Taxonomy • I: Division into stages of aggregation • Slot in any procedure per stage • Estimate complexity per stage • B: Might get the discussion going • Aggregate features • I: Identifying a “main” table • Others get aggregated • I: Forming partitions to aggregate over • Using queries with joins to pull in other tables • Abstract partitioning based on predicate • I: Comparing case against reference histograms • I: Separate comparison method and reference

  46. Interesting ideas (“I”) & benefits (“B”) • Learning • I: Decision tree tricks • Cut DT induction off short to get class freqs • Starve DT of features to improve generalization

  47. Issues • Some worrying lapses... • Lacked standard terms for common concepts • “position i [of vector has] the number of instances of [ith value]”... -> histogram • “abstraction hierarchy” -> schema • “value order” -> enumeration • Defined (and emphasized) terms for trivial and commonly used things • Imprecise use of terms • “variance” for (something like) second moment • I’m not confident they know what Mahalanobis distance is • They say “left outer join” and show inner join symbol

  48. Issues • Some worrying lapses... • Did not connect “reference vector” and “variance vector” to underlying statistics • Should relate to bag prior and field value conditional probability, not just “weighted” • Did not acknowledge loss of correlation info from splitting up joined u tuples in their features • Assumes fields are independent • Dependency was mentioned in the taxonomy • Fig 1 schema cannot support § 2 example query • Missing a necessary foreign key reference

  49. Issues • Some worrying lapses... • Their formal statement of the task did not show aggregation as dependent on t • Needed for c(t) partitioning • Did not clearly distinguish when t did or did not contain class • No need to put it in there at all • No, the higher Gaussian moments are not all zero! • Only the odd ones are. Yeesh. • Correct reason we don’t need them is: all can be computed from mean and variance • Uuugly notation

  50. Issues • Some worrying lapses... • Did not cite other uses of histograms or distributions extracted as features • “Spike-triggered average” / covariance / etc. • Used by: all neurobiology, neurocomputation • E.g.: de Ruyter van Steveninck & Bialek • “Response-conditional ensemble” • Used by: Our own Adrienne Fairhall & colleagues • E.g.: Agüera y Arcas, Fairhall, Bialek • “Event-triggered distribution” • Used by: me ☺ • E.g.: CSE528 project
