Structural Generalized Linear Regression (SGLR) Method Overview

This overview covers the SGLR method, integrating statistical methods, SQL, clustering, and feature generation for large-scale problems. It explores motivations, experiments, and conclusions of SGLR and SQL usage, along with clustering and aggregation benefits in data analysis.


Presentation Transcript


  1. Feature Generation and Selection in SRL Alexandrin Popescul & Lyle H. Ungar Presented By Stef Schoenmackers

  2. Overview • Structural Generalized Linear Regression (SGLR) Overview • Design Motivations • Experiments • Conclusions

  3. SGLR Overview • Adds statistical methods to ILP • SQL as the logical language • Generalized Linear Regression as the statistical method • Uses clustering to generate new relations • Builds discriminative models • Targeted at large problems where generative models are infeasible • Integrates feature generation and problem modeling

  4. SGLR Loop

  5. SGLR Method • Clusters data and adds clusters as new relations • Searches the space of SQL query refinements • Features are numerical SQL aggregates • Test feature with statistical measure (e.g. AIC, BIC) • Add only significantly predictive features • Examine each feature only once • Use current set of features to guide search
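
The accept/reject step on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`fit_logistic`, `bic`, `select_features`) and the toy data are assumptions, and plain gradient ascent stands in for whatever regression fitter SGLR actually uses. Each candidate feature is examined once and kept only if it lowers BIC.

```python
import math

def fit_logistic(X, y, steps=500, lr=0.1):
    """Fit logistic regression by plain gradient ascent (stand-in fitter).
    Returns weights with the intercept first."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = yi - 1.0 / (1.0 + math.exp(-z))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

def log_likelihood(w, X, y):
    ll = 0.0
    for xi, yi in zip(X, y):
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
        p = 1.0 / (1.0 + math.exp(-z))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        ll += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return ll

def bic(X, y):
    """BIC = k * ln(n) - 2 * ln(L); lower is better."""
    w = fit_logistic(X, y)
    return len(w) * math.log(len(y)) - 2.0 * log_likelihood(w, X, y)

def select_features(candidates, y):
    """Single pass over candidates; keep a feature only if BIC improves."""
    X = [[] for _ in y]          # start from the intercept-only model
    best, selected = bic(X, y), []
    for name, column in candidates.items():
        trial = [row + [v] for row, v in zip(X, column)]
        score = bic(trial, y)
        if score < best:
            X, best = trial, score
            selected.append(name)
    return selected

# Toy data (hypothetical): one feature tracks the label, one does not.
y = [i % 2 for i in range(40)]
informative = [yi + ((2 * i) % 5 - 2) * 0.3 for i, yi in enumerate(y)]
noise = [1.0 if i % 4 < 2 else 0.0 for i in range(40)]
chosen = select_features({"noise": noise, "informative": informative}, y)
```

In SGLR the candidate columns would come from SQL aggregate queries rather than hand-built lists, and the current model would also steer which queries are generated next.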

  6. Overview • Structural Generalized Linear Regression (SGLR) Overview • Design Motivations • Experiments • Conclusions

  7. SQL Motivation • Most of the world’s data is in relational databases • Can exploit schema and meta-information • SQL is a fairly expressive language • Non-recursive first-order logic formulas • Relational DBs have been studied and optimized for decades, so they should scale better than the alternatives

  8. Clustering Motivation • Dimensionality reduction • Clusters are added as relations (new first-class concepts) • Increases expressivity of the language describing patterns in the data • Can lead to a more rapid discovery of predictive features • Done as a pre-processing step • cost(clustering) << cost(feature search)
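
As a rough sketch of "clusters are added as relations": the snippet below clusters documents and emits a new table, `DocCluster(doc, cluster_id)`, that later queries could join against. The relation name, the two-dimensional document vectors, and the deterministic initialization are all assumptions made for illustration; the slides only state that k-means is run over many-to-many relations.

```python
def kmeans(points, k, iters=20):
    """Tiny k-means; centers start at the first k points (an assumption
    chosen to keep the example deterministic)."""
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Hypothetical per-document vectors (e.g. counts of two word groups).
doc_vectors = {"d1": [5, 0], "d2": [0, 4], "d3": [4, 1], "d4": [1, 5]}
docs = list(doc_vectors)
labels = kmeans([doc_vectors[d] for d in docs], k=2)

# The new relation: one row per (doc, cluster_id), usable like any table.
doc_cluster = list(zip(docs, labels))
```

Because the clustering runs once up front, its cost is paid before the feature search starts, which is the point of the last bullet on the slide.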

  9. Aggregation Motivation • Summarizes the information in a table into scalar values usable by a statistical model • average, max, min, count, empty/exists (0/1) • Exploits the work databases have put into making aggregates efficient • Provides a richer space of features to choose from
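
A small example of turning a relation into scalar features with SQL aggregates, using Python's built-in `sqlite3`. The table names (`Docs`, `Cites`) and rows are made up for illustration; the query computes a count feature and a 0/1 exists feature per document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Docs(doc TEXT);
    CREATE TABLE Cites(doc TEXT, cited TEXT);
    INSERT INTO Docs VALUES ('d1'), ('d2'), ('d3');
    INSERT INTO Cites VALUES ('d1', 'd2'), ('d1', 'd3'), ('d2', 'd3');
""")

# One row per document: how many papers it cites, and whether it cites any.
rows = conn.execute("""
    SELECT d.doc,
           COUNT(c.cited) AS n_cited,
           CASE WHEN COUNT(c.cited) > 0 THEN 1 ELSE 0 END AS has_cites
    FROM Docs d
    LEFT JOIN Cites c ON c.doc = d.doc
    GROUP BY d.doc
    ORDER BY d.doc
""").fetchall()
conn.close()
```

Each numeric column in `rows` is a candidate feature for the regression model; the `LEFT JOIN` keeps documents with no citations so that the exists feature can be 0.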

  10. Dynamic Feature Generation • Most features do not provide useful information • In large domains, feature generation is expensive, and precomputing all possible features is far too time consuming • Solution: Use a smarter search strategy and dynamically generate features. Let the features already selected influence which features are added • Focuses only on the promising areas in the search space

  11. Feature Streams • Put features into different evaluation queues • Choose next feature from the ‘best’ stream • If feature in multiple streams, only evaluate once • Stream design can use prior knowledge/bias

  12. Refinement Graphs (in ILP) • Starts with the most general rule and ‘refines’ it to produce more specific clauses • Single variable substitution • Add predicate involving 1+ existing variables • Uses top-down breadth-first search to find the most general rule that covers only positive examples • Performs poorly in noisy domains

  13. Refinement Graphs (in SGLR) • Adds one relation to a query and expands it into all possible configurations of equality conditions of new attributes with a new or old attribute • Each refinement contains at least one equality condition between a new and an old attribute • Any attribute can be set to a constant • High-level variable typing/classes are enforced • Not all refinements are most general, but this simplifies pruning of equivalent subspaces (it accounts only for the type and number of relations joined in a query)
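
The refinement step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the schema, the query representation, and the `refine` function are hypothetical, and setting attributes to constants is left out. Each refinement adds one relation, reuses at least one existing variable of the matching type (the equality condition), and introduces fresh variables elsewhere.

```python
from itertools import product

# Hypothetical typed schema: relation name -> attribute types.
SCHEMA = {"Cites": ("doc", "doc"), "Author_of": ("doc", "person")}

def refine(query, var_types):
    """Return all one-relation refinements of `query`.

    query     : list of (relation, args) literals
    var_types : dict mapping each variable in the query to its type
    """
    refinements = []
    for rel, arg_types in SCHEMA.items():
        # Per argument: any existing variable of the right type, or None
        # meaning "introduce a fresh variable here".
        choices = [
            [v for v, t in var_types.items() if t == arg_t] + [None]
            for arg_t in arg_types
        ]
        for combo in product(*choices):
            if all(c is None for c in combo):
                continue  # must share at least one variable with the query
            types = dict(var_types)
            args = []
            for c, arg_t in zip(combo, arg_types):
                if c is None:
                    c = f"v{len(types)}"  # fresh variable name
                    types[c] = arg_t
                args.append(c)
            refinements.append((query + [(rel, tuple(args))], types))
    return refinements

# Start from the empty query over a single document variable d.
first_level = refine([], {"d": "doc"})
queries = [q for q, _ in first_level]
```

Typing prunes the space: `Author_of` can only attach to `d` through its `doc` argument, so it contributes a single refinement here, while `Cites` contributes three.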

  14. Example Refinement Graph • Query(d) • Cites(d,d1) • Author_of(d, a) • Word_count(d, w, int) • Author_of(d, a=“Smith”) • Cites(d,d1), Cites(d1,d2) • Cites(d,d1), Author_of(d1, a) • Cites(d,d1), Author_of(d1, a=“Domingos”) • (figure label: DB Tables)

  15. Overview • Structural Generalized Linear Regression (SGLR) Overview • Design Motivations • Experiments • Conclusions

  16. Experiments • Used CiteSeer data • Citation(doc1, doc2), Author(doc, person), PublishedIn(doc, venue), HasWord(doc,word) • 60k Docs, 131k Authors, 173k Citations, 6.8M Words • Two Tasks • Predict the publication venue • Predict existence of a citation

  17. Experiments • Cluster all many-to-many relations • K-means • Added 6 new relations • Use logistic regression for prediction • BFS of search space • 5k+/5k- examples for venue prediction • 2.5k+/2.5k- examples for citation prediction

  18. Results • Venue (87.2%) • Citation (93.1%)

  19. Dynamic Feature Generation • Query expressions generated Breadth-First • Baseline puts all queries into one queue • Dynamic strategy enqueues queries into separate streams • Stream 1: exists and count over table • Stream 2: other aggregates (counts of unique elements in individual columns) • Chooses next feature from stream where (featuresAdded+1)/(featuresTried+1) is max • Stop when a stream is empty
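
The stream-selection rule on this slide, (featuresAdded+1)/(featuresTried+1), can be sketched as below. The function name `run_streams`, the stream names, and the `is_useful` callback (standing in for the statistical test) are assumptions for illustration only.

```python
from collections import deque

def run_streams(streams, is_useful):
    """Repeatedly take the next feature from the stream with the highest
    (added + 1) / (tried + 1) ratio; stop when any stream runs empty."""
    queues = {name: deque(feats) for name, feats in streams.items()}
    stats = {name: {"added": 0, "tried": 0} for name in streams}
    seen, selected = set(), []
    while all(queues.values()):  # an empty deque is falsy
        name = max(
            queues,
            key=lambda s: (stats[s]["added"] + 1) / (stats[s]["tried"] + 1),
        )
        feat = queues[name].popleft()
        if feat in seen:
            continue  # a feature in several streams is evaluated only once
        seen.add(feat)
        stats[name]["tried"] += 1
        if is_useful(feat):
            stats[name]["added"] += 1
            selected.append(feat)
    return selected, stats

# Hypothetical streams and a stand-in for the significance test.
streams = {
    "exists_count": ["a", "b", "c", "d"],
    "other_aggregates": ["x", "y"],
}
useful = {"a", "x"}
selected, stats = run_streams(streams, lambda f: f in useful)
```

The +1 terms give every stream an initial ratio of 1, so an untried stream is always sampled before a stream that has started to fail; ties go to the stream listed first.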

  20. Results (figure: plots for Venue and Citation, each shown for No Clusters and Clusters)

  21. Time Results (figure: plots for Venue and Citation, each shown for No Clusters and Clusters)

  22. Domain Independent Learning • Most citation prediction features are research-area generic • Can we train a model for one area and test on another?

  23. Domain Independent Results • Used KDD-Cup 2003 data (High Energy Physics papers in arXiv)

  24. Conclusions • Cluster-based features add expressivity, and apply to any domain or SRL method • Generating queries dynamically can reduce search time and increase accuracy

  25. Questions?