250 likes | 266 Views
This overview covers the SGLR method, integrating statistical methods, SQL, clustering, and feature generation for large-scale problems. It explores motivations, experiments, and conclusions of SGLR and SQL usage, along with clustering and aggregation benefits in data analysis.
E N D
Feature Generation and Selection in SRL Alexandrin Popescul & Lyle H. Ungar Presented By Stef Schoenmackers
Overview • Structural Generalized Linear Regression (SGLR) Overview • Design Motivations • Experiments • Conclusions
SGLR Overview • Adds statistical methods to ILP • SQL as the logical language • Generalized Linear Regression as statistical method • Uses clustering to generate new relations • Builds discriminative models • Targeted at large problems where generative models impossible • Integrates feature generation and problem modeling
SGLR Method • Clusters data and adds clusters as new relations • Searches the space of SQL query refinements • Features are numerical SQL aggregates • Test feature with statistical measure (e.g. AIC, BIC) • Add only significantly predictive features • Examine each feature only once • Use current set of features to guide search
Overview • Structural Generalized Linear Regression (SGLR) Overview • Design Motivations • Experiments • Conclusions
SQL Motivation • Most of the world’s data is in relational databases • Can exploit schema and meta-information • SQL uses a fairly expressive language • Non-recursive first-order logic formulas • Relational DBs have been studied and optimized for decades, so should be more scalable than other alternatives
Clustering Motivation • Dimensionality reduction • Clusters are added as relations (new first-class concepts) • Increases expressivity of the language describing patterns in the data • Can lead to a more rapid discovery of predictive features • Done as a pre-processing step • cost(clustering) << cost(feature search)
Aggregation Motivation • Summarizes the information in a table into scalar values usable by a statistical model • average, max, min, count, average, empty/exists (0/1) • Exploits database work into making them efficient • Provides a richer space of features to choose from
Dynamic Feature Generation • Most features do not provide useful information • In large domains, feature generation is expensive, and precomputing all possible features is far too time consuming • Solution: Use a smarter search strategy and dynamically generate features. Let the features already selected influence which features are added • Focuses only on the promising areas in the search space
Feature Streams • Put features into different evaluation queues • Choose next feature from the ‘best’ stream • If feature in multiple streams, only evaluate once • Stream design can use prior knowledge/bias
Refinement Graphs (in ILP) • Start with most general rule, and ‘refines’ it to produce more specific clauses • Single variable substitution • Add predicate involving 1+ existing variables • Uses top-down breadth-first search to find the most general rule that covers only positive examples • Performs poorly in noisy domains
Refinement Graphs (in SGLR) • Adds one relation to a query and expands it into all possible configurations of equality conditions of new attributes with a new or old attribute • Contains at least one equality condition between a new and old attribute • Any attribute can be set to a constant • High-level variable typing/classes are enforced • Not all refinements are most general, but simplifies pruning of equivalent subspaces (accounts only for the type and number of relations joined in a query)
Example Refinement Graph Query(d) Cites(d,d1) Author_of(d, a) Word_count(d, w, int) Author_of(d, a=“Smith”) Cites(d,d1),Cites(d1,d2) DB Tables Cites(d,d1), Author_of(d1, a) Cites(d,d1), Author_of(d1, a=“Domingos”)
Overview • Structural Generalized Linear Regression (SGLR) Overview • Design Motivations • Experiments • Conclusions
Experiments • Used CiteSeer data • Citation(doc1, doc2), Author(doc, person), PublishedIn(doc, venue), HasWord(doc,word) • 60k Docs, 131k Authors, 173k Citations, 6.8M Words • Two Tasks • Predict the publication venue • Predict existence of a citation
Experiments • Cluster all many-to-many relations • K-means • Added 6 new relations • Use logistic regression for prediction • BFS of search space • 5k+/5k- examples for venue prediction • 2.5k+/2.5k- examples for citation prediction
Results Venue (87.2%) Citation (93.1%)
Dynamic Feature Generation • Query expressions generated Breadth-First • Baseline puts all queries into one queue • Dynamic strategy enqueues queries into separate streams • Stream 1: exists and count over table • Stream 2: other aggregates (counts of unique elements in individual columns) • Chooses next feature from stream where (featuresAdded+1)/(featuresTried+1) is max • Stop when a stream is empty
Results Venue No Clusters Clusters Citation
Time Results Venue No Clusters Clusters Citation
Domain Independent Learning • Most citation prediction features are research-area generic • Can we train a model for one area and test on another?
Domain Independent Results • Used KDD-Cup 2003 data (High Energy Physics papers in arXiv)
Conclusions • Cluster-based features add expressivity, and apply to any domain or SRL method • Generating queries dynamically can reduce search time and increase accuracy