Autocorrelation and Linkage Cause Bias in Evaluation of Relational Learners
David Jensen and Jennifer Neville
Simple Random Partitioning Example
• Divide movies into two subsets, a training set and a test set, by randomly selecting movies without replacement and assigning each to a subset.
• A movie may appear in only one subset.
• A movie may appear only once within a subset.
• For each movie, add the corresponding studio to the same subset.
• A studio may appear in both subsets.
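A minimal sketch of this partitioning scheme in Python (the movie/studio pairs and function names here are illustrative, not taken from the paper):

```python
import random

# Hypothetical data: each movie is paired with its studio.
movies = [("Movie A", "Studio X"), ("Movie B", "Studio X"),
          ("Movie C", "Studio Y"), ("Movie D", "Studio Z")]

def simple_random_partition(movies, seed=0):
    """Split movies 50/50 without replacement; studios follow their movies."""
    rng = random.Random(seed)
    shuffled = movies[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    # A studio appears in every subset that contains one of its movies,
    # so the same studio can end up in both train and test.
    train_studios = {s for _, s in train}
    test_studios = {s for _, s in test}
    return train, test, train_studios & test_studios

train, test, shared = simple_random_partition(movies)
print("Studios in both subsets:", shared)
```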
Test Bias
• Simple random partitioning causes training and test set dependency (a studio can appear in both sets).
[Diagram: one studio linked to movies in both the training set and the test set]
Data Set
• Data set drawn from the Internet Movie Database (www.imdb.com).
• Contains movies, actors, directors, producers, and studios.
• Selected movies released between 1996 and 2001.
• 1,382 movies, 40,000 objects, and 70,000 links.
• Used various features to predict opening-weekend box office receipts.
Calculating Test Bias
• Discretized movie receipts into a binary class label, with '+' indicating more than $2 million (prob(+) = 0.55).
• Added random attributes to studios.
• Built models using only the random attributes.
• Bias = random-model accuracy − default accuracy of 0.55.
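A hedged illustration of this measurement, assuming hypothetical `movies`, `labels`, and `studio_of` structures. The toy "model" here simply memorizes training labels per random attribute value, which is one way the leak through shared studios can surface; the paper's actual models are more sophisticated:

```python
import random

def measure_test_bias(movies, labels, studio_of, seed=0):
    """Illustrative sketch of the bias measurement (all names hypothetical).

    Each studio gets a random attribute carrying no real signal. A toy
    'model' memorizes the majority training label per attribute value.
    Under simple random partitioning the same studio, and hence the same
    attribute value, appears in both subsets, so the attribute acts as a
    studio identifier and leaks labels across the split.
    """
    rng = random.Random(seed)
    attr = {s: rng.random() for s in set(studio_of.values())}
    order = list(movies)
    rng.shuffle(order)
    half = len(order) // 2
    train, test = order[:half], order[half:]
    seen = {}                                   # attribute value -> training labels
    for m in train:
        seen.setdefault(attr[studio_of[m]], []).append(labels[m])
    def predict(m):
        votes = seen.get(attr[studio_of[m]])
        if not votes:
            return "+"                          # fall back to the majority class
        return max(set(votes), key=votes.count)
    accuracy = sum(predict(m) == labels[m] for m in test) / len(test)
    return accuracy - 0.55                      # bias = accuracy - default accuracy
```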
Linkage
[Diagram: linkage L = 0, each movie linked to its own studio, vs. L = 1, all movies linked to a single studio]
Autocorrelation
[Diagram: autocorrelation C' = 0, a studio's movies have mixed +/− labels, vs. C' = 1, all of a studio's movies share the same label]
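The paper defines L and C' formally; the helpers below are only loose stand-ins for intuition, not the paper's formulas (the `movies_by_studio` and `labels` structures are assumed):

```python
from itertools import combinations

def label_agreement(movies_by_studio, labels):
    """Loose stand-in for autocorrelation C': the fraction of same-studio
    movie pairs whose class labels agree (1 when every studio's movies
    share one label)."""
    agree = total = 0
    for ms in movies_by_studio.values():
        for a, b in combinations(ms, 2):
            total += 1
            agree += labels[a] == labels[b]
    return agree / total if total else 0.0

def link_concentration(movies_by_studio, n_movies):
    """Loose stand-in for linkage L: 0 when every movie has its own
    studio, 1 when a single studio is linked to all movies."""
    biggest = max(len(ms) for ms in movies_by_studio.values())
    return (biggest - 1) / (n_movies - 1) if n_movies > 1 else 0.0
```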
High Linkage Causes Dependence
Theorem: Given simple random partitioning of a relational data set S with single linkage and C' = 1, prob_ind(A, B) → 0 as L → 1; that is, the probability that the two subsets are independent vanishes as linkage approaches 1.
Solution – Subgraph Sampling
• Assign movies randomly to subsets as before.
• Commit a movie to a subset only if its studio either has not already been placed in another subset or does not exhibit high autocorrelation and linkage; otherwise discard the movie (see the sketch below).
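A simplified sketch of subgraph sampling; note that it applies the rule to every studio, whereas the paper enforces it only for objects with high autocorrelation and linkage:

```python
import random

def subgraph_partition(movies, studio_of, seed=0):
    """Sketch of subgraph sampling: a movie is kept only if its studio is
    not already committed to the other subset, so no studio ever spans
    both subsets. (Simplification: the rule is applied to every studio,
    not only to those with high autocorrelation and linkage.)"""
    rng = random.Random(seed)
    subsets = ([], [])
    studio_home = {}                 # studio -> index of its committed subset
    for m in movies:
        k = rng.randrange(2)         # tentative random assignment, as before
        s = studio_of[m]
        if studio_home.setdefault(s, k) == k:
            subsets[k].append(m)
        # otherwise the studio already lives in the other subset: discard m
    return subsets
```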
Conclusion
• Subgraph sampling, applied to objects with high linkage and autocorrelation, removes the dependence between training and test sets and thereby improves the accuracy of evaluating relational learners.
Linkage and Autocorrelation Cause Feature Selection Bias in Relational Learning
David Jensen and Jennifer Neville
Feature Selection Bias in Relational Learning
• High values of linkage (L) and autocorrelation (C') can:
  • Reduce the effective sample size.
  • Introduce additional variance, leading to feature selection bias.
Feature Selection
• A feature is a mapping between raw data and a low-level inference.
• Feature selection is the process of choosing among features (e.g., identifying the single best feature, or choosing all features that meet certain conditions).
Relational Feature Selection
• Relational features predict the value of an attribute on one type of object from the attributes of related objects.
• Relational features increase the predictive power of inference procedures.
• But they can bias the selection process and lead to incorrect estimates.
Effects of Linkage and Autocorrelation
• Linkage and autocorrelation cause relational feature selection bias through a two-step chain:
  • They reduce the effective sample size of the data set, which increases the variance of estimated feature scores.
  • The increased variance of scores for features formed from an object type increases the probability that one of those features will be selected as the best.
Decreased Effective Sample Size
• A special case: the data set exhibits single linkage with C' = 1 and L ≥ 0.
• The variance of scores estimated for relational features then depends on |Y| (the number of linked objects) rather than on |X| (the number of target objects).
• For example, if receipts has C' = 1, then relational features formed from studios depend on the number of studios rather than the number of movies.
• We gain no additional information as |X| increases.
Effective Sample Size (cont.)
• For a wider array of values of C' and L, Jensen and Neville use simulation; a toy version appears below.
• Effective sample size drops monotonically as C' and L increase.
• A decrease in effective sample size increases the variance of feature scores.
• Features with higher variance lead to a bias in favor of those features.
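A toy simulation of the single-linkage, C' = 1 special case; the setup is assumed for illustration, not reproduced from the paper. It shows the variance of a studio-level feature's score tracking the number of studios rather than the number of movies:

```python
import random
import statistics

def score_variance(n_studios, movies_per_studio, trials=2000, seed=0):
    """With single linkage and C' = 1, every movie inherits its studio's
    label, so a random studio-level feature's score is estimated from
    |Y| = n_studios independent values no matter how many movies exist."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        # one random binary (feature, label) pair per studio ...
        pairs = [(rng.randrange(2), rng.randrange(2)) for _ in range(n_studios)]
        # ... replicated across that studio's movies
        rows = [p for p in pairs for _ in range(movies_per_studio)]
        acc = sum(f == y for f, y in rows) / len(rows)   # the feature's score
        scores.append(acc)
    return statistics.variance(scores)

# Variance tracks the number of studios, not the number of movies:
print(score_variance(n_studios=20, movies_per_studio=1))
print(score_variance(n_studios=20, movies_per_studio=50))   # nearly identical
```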
How Does High Variance Cause Feature Selection Bias?
• Why do features with higher variance lead to a bias?
• Features are usually formed by a local search over the possible parameters of the feature.
• This local search is usually done prior to feature selection, so only the best feature from each feature "family" is compared.
Feature Selection Bias
• Bias increases as the variance of the score distributions increases.
• Thus, the estimated scores of features formed from objects with high C' and L are more biased.
• For example, features formed from studios have the highest variance, which allows them to exceed the scores of weakly useful features on other objects. The toy example below illustrates the effect.
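A toy illustration of this max-of-family effect (the numbers are arbitrary): every feature is equally useless, yet the family whose score estimates have higher variance produces a larger best-looking score:

```python
import random
import statistics

def best_of_family(variance, family_size=10, trials=5000, seed=1):
    """Every feature's true score is 0 (useless), but estimates are noisy.
    Taking the max over a feature family inflates the estimated score,
    and the inflation grows with the variance of the estimates."""
    rng = random.Random(seed)
    sigma = variance ** 0.5
    maxima = [max(rng.gauss(0.0, sigma) for _ in range(family_size))
              for _ in range(trials)]
    return statistics.mean(maxima)

print(best_of_family(variance=0.01))   # modest inflation
print(best_of_family(variance=0.10))   # much larger inflation
```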
Effects of Linkage and Autocorrelation
High linkage and autocorrelation → decreased effective sample size → increased variance of estimated scores → increased feature selection bias.
Estimating Score Variance
• One way to correct for high variance is to obtain accurate estimates of the variance of each feature's score.
• Approach: bootstrap resampling.
Bootstrap Resampling
• A technique for estimating characteristics of the sampling distribution of a given parameter:
  • Generate multiple samples (pseudosamples) by drawing, with replacement, from the original data.
  • Pseudosamples have the same size as the original training set.
  • Estimate the variance of a parameter by computing the parameter on each pseudosample and taking the variance of the resulting distribution of scores.
Bootstrap Resampling (cont.)
[Diagram: the original training set is resampled into several pseudosamples, and a variance is computed across them]
• The variance of a score on the original training set can be estimated from the scores computed on the pseudosamples.
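A minimal sketch of this procedure, assuming the caller supplies the feature-scoring function:

```python
import random
import statistics

def bootstrap_variance(training_set, score, n_pseudosamples=200, seed=0):
    """Estimate the variance of a feature's score: draw pseudosamples with
    replacement, each the size of the original training set, score the
    feature on each, and take the variance of the resulting scores.
    `score` is any callable mapping a sample to a number (assumed to be
    supplied by the caller, e.g. a feature-scoring function)."""
    rng = random.Random(seed)
    n = len(training_set)
    scores = [score([rng.choice(training_set) for _ in range(n)])
              for _ in range(n_pseudosamples)]
    return statistics.variance(scores)
```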
Using Resampled Estimates
• Resampling can be used to estimate the variance of scores for particular features.
• How to use these resampled estimates remains an open problem; for example, how should variance estimates of different features be compared during feature selection?
• A research topic!
Conclusion
• High linkage and autocorrelation can cause feature selection bias in relational learning algorithms.
• Research ideas:
  • How to use the variance estimates of different features to avoid feature selection bias.
  • Avoiding feature selection bias by incorporating additional information, such as prior estimates of the true score.