Learn about sampling techniques and the bootstrap method in data mining, and how they can be used to efficiently analyze large datasets. Understand the different types of sampling and how to estimate error using the bootstrap method.
Data Mining, CSCI 307, Spring 2019, Lecture 30: Sampling, Bootstrap
BACKGROUND: Sampling
• What is sampling? Obtaining a small sample s to represent the whole data set N
• Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
• Key principle: choose a representative subset of the data
• Simple random sampling may perform very poorly in the presence of skew
• Develop adaptive sampling methods, e.g., stratified sampling
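Measuring the Central Tendency
Mean (algebraic measure)
• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; population mean: $\mu = \frac{\sum x}{N}$. Note: n is the sample size and N is the population size.
• Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
• Also: trimmed mean (chop off extreme values first, then take the mean)
Median
• Middle value if there is an odd number of values, or the average of the middle two values otherwise
• Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$, where $L_1$ is the lower boundary of the median interval, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below it, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is its width
Mode
• Value that occurs most frequently in the data
• Unimodal (one mode), bimodal (two modes), trimodal (three modes); two or more: multimodal
• Empirical approximation (for unimodal data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$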
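The measures on this slide can be tried out with a minimal Python sketch. The helper names `weighted_mean` and `trimmed_mean` and the `salaries` values are illustrative assumptions, not part of the lecture; `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, trim_fraction):
    """Drop the given fraction of values at each end, then average the rest.

    trim_fraction is a hypothetical parameter; it must be below 0.5
    so that at least one value remains.
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Illustrative data with one extreme value (110) that pulls the mean upward.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", statistics.mean(salaries))
print("median:", statistics.median(salaries))
print("modes:", statistics.multimode(salaries))
print("10% trimmed mean:", trimmed_mean(salaries, 0.1))
print("weighted mean:", weighted_mean([1, 2, 3], [1, 1, 2]))
```

Comparing the plain mean with the trimmed mean and the median on this data shows how a single extreme value shifts the mean more than the other two measures.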
Symmetric versus Skewed Data
[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data. Panels: symmetric; positively skewed; negatively skewed.]
Types of Sampling
• Simple Random Sampling: an equal probability of selecting any particular item
• Sampling Without Replacement: once an object is selected, it is removed from the population
• Sampling With Replacement: a selected object is not removed from the population
• Stratified Sampling: partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data); used in conjunction with skewed data
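The sampling types listed above can be sketched with Python's standard `random` module. The function names, the `stratum_of` callback, and the choice to keep at least one item per stratum are assumptions made for this illustration, not something the slides prescribe.

```python
import random
from collections import defaultdict

def srswor(data, s, rng):
    """Simple random sample WITHOUT replacement: no item can appear twice."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample WITH replacement: an item may be drawn again."""
    return rng.choices(data, k=s)

def stratified_sample(data, stratum_of, fraction, rng):
    """Draw roughly the same fraction from each partition (stratum)."""
    strata = defaultdict(list)
    for item in data:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        # Assumption: keep at least one item so a rare stratum is never lost.
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

rng = random.Random(42)
# Skewed toy data: 90 "young" customers and only 10 "senior" customers.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]

print(srswor(customers, 10, rng))
print(srswr(customers, 10, rng))
print(stratified_sample(customers, lambda c: c[0], 0.1, rng))
```

On skewed data like this, a simple random sample of 10 can easily miss the "senior" group entirely, while the stratified sample always contains it.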
Sampling With or Without Replacement
[Figure: raw data sampled as SRSWOR (simple random sample without replacement) and as SRSWR (simple random sample with replacement).]
Sampling: Cluster or Stratified Sampling
[Figure: raw data and the resulting cluster/stratified sample.]
The Bootstrap
• Cross-validation (CV) uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set
• The bootstrap uses sampling with replacement to form the training set:
• Sample a dataset of n instances n times with replacement to form a new dataset of n instances
• Use this data as the training set
• Use the instances from the original dataset that do not occur in the new training set for testing
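The procedure above can be sketched as a single split function. The name `bootstrap_split` and the use of index-based drawing are assumptions for this illustration.

```python
import random

def bootstrap_split(data, rng):
    """Form one bootstrap training set and its matching test set.

    Draws n indices with replacement for training; every instance whose
    index was never drawn goes into the test set.
    """
    n = len(data)
    picked = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
    picked_set = set(picked)
    train = [data[i] for i in picked]              # duplicates are expected
    test = [data[i] for i in range(n) if i not in picked_set]
    return train, test

rng = random.Random(7)
data = list(range(20))
train, test = bootstrap_split(data, rng)
print("train:", sorted(train))
print("test: ", test)
```

The training set always has n entries, but some instances repeat, which is why a share of the original instances is left over for testing.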
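The 0.632 Bootstrap
• In each draw, a particular instance has a probability of 1 − 1/n of not being picked
• Its probability of ending up in the test data (i.e., of never being picked in n draws) is: $\left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$
• This means the training data will contain approximately 63.2% (i.e., 1 − 0.368) of the distinct instances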
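A quick numerical check shows how fast the probability of never being picked approaches 1/e as n grows. The function name `prob_never_picked` is an assumption for this sketch.

```python
import math

def prob_never_picked(n):
    """Probability that one given instance is missed in all n draws."""
    return (1 - 1 / n) ** n

for n in (10, 100, 1000, 10000):
    print(n, round(prob_never_picked(n), 4))

# The limit as n grows is 1/e.
print("1/e =", round(math.exp(-1), 4))
```

Even for modest n the value is already close to 0.368, so the 63.2% / 36.8% split is a reasonable approximation for most data set sizes.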
Estimating Error with the Bootstrap
• The error estimate on the test data is pessimistic: the model is trained on just ~63% of the instances (unlike the 90% training size for tenfold CV)
• To compensate, combine it with the resubstitution error, so use: $err = 0.632 \times e_{\text{test instances}} + 0.368 \times e_{\text{training instances}}$
• The resubstitution error gets less weight than the error on the test data
• Repeat the process several times with different replacement samples; average the results
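The weighting can be written as a one-line function. The function name and the example error rates (30% on test instances, 10% resubstitution error) are made up for illustration.

```python
def bootstrap_632_error(e_test, e_train):
    """Combine test error and resubstitution (training) error, 0.632 / 0.368."""
    return 0.632 * e_test + 0.368 * e_train

# Hypothetical error rates from one bootstrap round.
print(bootstrap_632_error(e_test=0.30, e_train=0.10))

# Averaging over several rounds, each a hypothetical (e_test, e_train) pair.
rounds = [(0.30, 0.10), (0.28, 0.12), (0.33, 0.09)]
average = sum(bootstrap_632_error(t, r) for t, r in rounds) / len(rounds)
print(average)
```

The combined estimate always lies between the two input error rates, closer to the test error because of its larger weight.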
Bootstrap continued
+ Perhaps the best way of estimating performance for very small datasets
− Some problems:
• Consider the (artificial) random dataset from a few slides back; true error rate: 50%
• A perfect memorizer (for the training set) will achieve 0% resubstitution error, i.e., $e_{\text{training instances}} = 0$, and ~50% error on test data
• So the bootstrap estimate for this classifier is: err = 0.632 × 50% + 0.368 × 0% = 31.6%
• The true error rate is 50%, so the estimate is misleadingly optimistic
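The memorizer problem can be reproduced with a small simulation. This is a sketch under stated assumptions: labels are random coin flips, the memorizer recalls training labels perfectly (so its resubstitution error is set to 0), and on unseen instances it can only guess. The function name and its parameters are hypothetical.

```python
import random

def memorizer_bootstrap_estimate(n=200, rounds=50, seed=1):
    """Average 0.632 bootstrap error of a perfect memorizer on random labels."""
    rng = random.Random(seed)
    # Random binary labels: no classifier can beat 50% error on unseen data.
    labels = [rng.randint(0, 1) for _ in range(n)]
    estimates = []
    while len(estimates) < rounds:
        picked = {rng.randrange(n) for _ in range(n)}  # distinct training indices
        test_idx = [i for i in range(n) if i not in picked]
        if not test_idx:
            continue  # extremely unlikely: every instance was drawn
        e_train = 0.0  # perfect recall of everything seen in training
        # On unseen instances the memorizer has nothing stored, so it guesses.
        wrong = sum(1 for i in test_idx if rng.randint(0, 1) != labels[i])
        e_test = wrong / len(test_idx)
        estimates.append(0.632 * e_test + 0.368 * e_train)
    return sum(estimates) / len(estimates)

print(memorizer_bootstrap_estimate())
```

The printed value should land near 0.316 rather than the true 0.5, which matches the optimistic bias described on the slide.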
Summary: Bootstrap
• Works well with small data sets
• Samples the given training instances uniformly with replacement, i.e., each time an instance is selected, it is equally likely to be selected again and re-added to the training set
• There are several bootstrap methods; a common one is the .632 bootstrap
• A data set with d instances is sampled d times, with replacement, resulting in a training set of d samples. The data instances that did not make it into the training set end up forming the test set.
• Repeat the sampling procedure k times; overall accuracy of the model: $Acc(M) = \frac{1}{k}\sum_{i=1}^{k}\left(0.632 \times Acc(M_i)_{\text{test set}} + 0.368 \times Acc(M_i)_{\text{train set}}\right)$
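Putting the summary together, the following is a minimal sketch of the repeated .632 bootstrap. To stay self-contained it uses a majority-class predictor as a stand-in model, which is an assumption of this example and not a method from the lecture; the function names are likewise hypothetical.

```python
import random
from collections import Counter

def accuracy(predicted_label, labels):
    """Fraction of labels matched by a constant prediction."""
    return sum(1 for y in labels if y == predicted_label) / len(labels)

def bootstrap_632_accuracy(labels, k, seed=0):
    """Average of 0.632 * test accuracy + 0.368 * train accuracy over k rounds."""
    d = len(labels)
    if d < 2:
        raise ValueError("need at least two instances to leave any for testing")
    rng = random.Random(seed)
    scores = []
    while len(scores) < k:
        picked = [rng.randrange(d) for _ in range(d)]  # d draws with replacement
        picked_set = set(picked)
        train = [labels[i] for i in picked]
        test = [labels[i] for i in range(d) if i not in picked_set]
        if not test:
            continue  # rare: all instances drawn, so redo this round
        # Stand-in "model": always predict the most common training label.
        model = Counter(train).most_common(1)[0][0]
        scores.append(0.632 * accuracy(model, test) + 0.368 * accuracy(model, train))
    return sum(scores) / k

labels = ["yes"] * 15 + ["no"] * 5
print(bootstrap_632_accuracy(labels, k=20))
```

Any real classifier could replace the majority-class stand-in, as long as it is trained on `train` and scored on both `train` and `test` in each round.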