Sampling: Knowing Whole from Its Part Instance Selection And Construction For Data Mining Ch.2 Baohua Gu, Feifang Hu, and Huan Liu 2001.5.2
Introduction • Approaches to studying the characteristics of a population • Complete enumeration or census: examines every unit of the population • Theoretical • Sampling: examines only a part of it • Practical • Shows remarkable advantages such as reduced cost, reduced time, greater scope, and greater accuracy
Basics of Sampling • Population • A set of elements • Element • A unit for which information is sought • Sampling • Selecting a subset of the population, called a sample • Sampling units • Collections of elements into which the population is divided • Cover the whole of the population without overlapping • Sampling frame • A list of the sampling units
Basics of Sampling • Sample size • The number of elements to be selected • Sampling method (sampling design) • Scheme or design for selecting elements into a sample • Estimator • Provides some statistical information about the population • Statistic • A function of the elements in a sample • Estimate • A particular value taken by an estimator • Statistical inference • Estimation procedure
Basic assumptions • The characteristics of interest of a population are usually unavailable or hard to obtain, whereas the same characteristics of a sample are much easier to obtain • A sample is always a subset selected from a population • One can obtain estimates that are “unbiased” for population quantities. • Any uncertainty in estimates obtained by sampling thus stems from the fact that only part of the population is observed • While the population characteristics remain fixed, the estimates of them depend on which sample is selected and what estimation method is used.
General Considerations • How large should a sample be? • Depends on cost, accuracy, and other factors • Probability inequality: • Practical way • Select a small preliminary sample of size m • Make observations on the units selected • Estimate e0 • Substitute e0 into the equation and solve it for n • If n > m, then sample n – m additional units • Else, do no more sampling
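The practical two-step procedure above can be sketched in code. The slide's probability inequality is not reproduced in this transcript, so the formula below is an assumption: the common normal-approximation rule n = ceil((z * s / d)^2) for estimating a mean to within a margin d, with s estimated from the preliminary sample. The function names, `margin`, and `z` are hypothetical and not from the original.

```python
import math
import statistics


def required_sample_size(pilot, margin, z=1.96):
    """Assumed rule (not from the slide): n = ceil((z * s / margin) ** 2)."""
    s = statistics.stdev(pilot)  # spread estimated from the preliminary sample
    return math.ceil((z * s / margin) ** 2)


def additional_units(pilot, margin, z=1.96):
    """Units still to be sampled after a preliminary sample of size m."""
    n = required_sample_size(pilot, margin, z)
    m = len(pilot)
    # If n > m, sample n - m more units; otherwise stop sampling.
    return n - m if n > m else 0
```

A tighter margin raises the required n, so the preliminary sample is more likely to need topping up.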
General Considerations • How good will an estimate be? • The ultimate objective of any sampling is to make inferences about a population of interest. • Two main properties • Unbiasedness • Sampling variance
General Considerations • What types of errors may be involved? • Sampling error: errors in the estimates that occur only because just part of the population is included in the sample. • Decreases in inverse proportion to the square root of the sample size • Non-sampling error: caused by a defective sampling procedure • Likely to increase as the sample size increases
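The square-root behaviour of the sampling error can be checked with a small simulation. This is a sketch on a synthetic population (the normal population, its parameters, and the function name are assumptions made for illustration): quadrupling the sample size should roughly halve the spread of the sample mean.

```python
import random
import statistics


def std_of_sample_mean(population, n, repeats, rng):
    """Empirical standard deviation of the sample mean over repeated samples."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)


pop_rng = random.Random(42)
population = [pop_rng.gauss(50, 10) for _ in range(20_000)]  # synthetic data

se_25 = std_of_sample_mean(population, 25, 2000, random.Random(1))
se_100 = std_of_sample_mean(population, 100, 2000, random.Random(2))
ratio = se_25 / se_100  # expected to be close to sqrt(100 / 25) = 2
```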
General Considerations • Any useful auxiliary information or relationship? • Additional information about the population elements that can be exploited
General Considerations • What are the pros and cons of sampling? • Reduced cost, greater speed, greater scope, or greater accuracy • Misunderstanding of sampling • Sampling can reveal the true characteristics of a population
Categories of sampling methods • General purpose vs. specific domain • Equal probability vs. varying probability • With replacement vs. without replacement • One-stage vs. multi-stage • Non-adaptive vs. adaptive
General-purpose sampling methods • Random sampling • Every population unit has the same chance of being selected in the sample. • Random sampling without replacement (popular) vs. random sampling with replacement • There is no bias in selecting the sample • Easy to operate
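Both variants of random sampling map directly onto Python's standard library. A minimal sketch, with a hypothetical population of 100 numbered units:

```python
import random


def srs_without_replacement(population, n, rng):
    """Each unit can appear at most once; every n-subset is equally likely."""
    return rng.sample(population, n)


def srs_with_replacement(population, n, rng):
    """Each draw is independent, so a unit may be selected more than once."""
    return rng.choices(population, k=n)


rng = random.Random(0)
population = list(range(1, 101))  # hypothetical units numbered 1..100
without = srs_without_replacement(population, 10, rng)
with_repl = srs_with_replacement(population, 10, rng)
```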
General-purpose sampling methods • Stratified Sampling • When a population consists of a number of approximately homogeneous groups, it is convenient and effective to select a small number of elements from each group. • We first divide the population into non-overlapping sub-populations, or strata. • Small samples are then selected from the different strata. • The total sample is formed by combining all the small samples • It is an alternative to increasing the sample size as a way to reach higher precision • It enables effective use of the available auxiliary information • It can deal with “outliers”, since it reduces the variability within the other strata • Disadvantage: it is difficult to construct ideal strata
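The divide, sample, and combine steps can be sketched as follows. Proportional allocation (the same sampling fraction in every stratum) is an assumption here, since the slide does not say how many elements to take per stratum; `stratum_of` and `fraction` are hypothetical names.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, rng):
    """Split units into strata, sample each stratum, and combine the results."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    sample = []
    for key in sorted(strata):  # fixed order keeps runs reproducible
        members = strata[key]
        k = max(1, round(fraction * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


units = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
sample = stratified_sample(units, lambda u: u[0], 0.1, random.Random(0))
```

With these inputs the sample holds 8 units from stratum "a" and 2 from stratum "b", so the small stratum is never missed by chance.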
General-purpose sampling methods • Cluster Sampling • When a population consists of a number of groups, each of which is a “miniature” of the entire population, it is possible to estimate the population characteristics correctly by selecting a small number of groups and all their elements. • Separate the population into mutually exclusive sub-populations (clusters) • A sample of these clusters is chosen • All units in the selected clusters are selected to form the sample. • To obtain estimators of low variance, the units within each cluster should be as heterogeneous as possible and the clusters should be similar to each other. • Advantage: less costly • Disadvantage: hard to ensure “representativeness”
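A minimal sketch of one-stage cluster sampling, assuming the clusters are given as a dictionary from a cluster name to its list of units (a hypothetical layout chosen for illustration):

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters at random and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


# Five hypothetical clusters of ten units each.
clusters = {f"c{i}": list(range(i * 10, i * 10 + 10)) for i in range(5)}
sample = cluster_sample(clusters, 2, random.Random(0))
```

Only the selected clusters need to be listed and visited, which is where the cost saving comes from.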
General-purpose sampling methods • Systematic Sampling • First select a random number between 1 and k • Select the unit with this serial number and every kth unit after it • Constant k: the sampling interval (N/n) • Operationally convenient • Well spread over the population • Disadvantages: • Unbiased variance estimators cannot be obtained • Can perform poorly on periodic data
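The selection rule can be sketched as below. Using the integer part of N/n as the interval and truncating to n units when N is not a multiple of n are assumptions of this sketch.

```python
import random


def systematic_sample(units, n, rng):
    """Pick a random start in 1..k, then take every k-th unit after it."""
    if not 1 <= n <= len(units):
        raise ValueError("n must be between 1 and the population size")
    k = len(units) // n        # sampling interval, N / n rounded down
    start = rng.randint(1, k)  # serial number of the first unit (1-based)
    serials = range(start, len(units) + 1, k)
    return [units[s - 1] for s in serials][:n]


units = list(range(1, 101))  # serial numbers 1..100
sample = systematic_sample(units, 10, random.Random(0))
```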
General-purpose sampling methods • Double Sampling • Also called two-phase sampling • Initially, a large first-phase sample of units is selected for obtaining auxiliary information only • A second-phase sample is then selected, in which the studied variable is observed. • The aim is to obtain better estimators by using the relationship between the auxiliary variables and the studied variable.
General-purpose sampling methods • Network sampling • Also known as multiplicity sampling • A simple random selection or stratified random selection of units is made • All units that are linked to any of the selected units are also included in the sample
General-purpose sampling methods • Inverse Sampling • Sampling is continued until some specified conditions are satisfied • We can obtain enough information about a rarely occurring attribute by setting a suitable sample size.
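One common stopping condition is "keep drawing until k units with the rare attribute have been seen". The sketch below assumes that condition and draws with replacement; the estimator (k - 1) / (n - 1) is the standard unbiased estimator of the proportion under this scheme and is not taken from the slide.

```python
import random


def inverse_sample(population, has_attribute, k, rng):
    """Draw units with replacement until k of them carry the attribute."""
    if not any(has_attribute(u) for u in population):
        raise ValueError("attribute never occurs, sampling would not stop")
    drawn, hits = [], 0
    while hits < k:
        unit = rng.choice(population)
        drawn.append(unit)
        if has_attribute(unit):
            hits += 1
    return drawn


def estimated_proportion(n_drawn, k):
    """(k - 1) / (n - 1), valid for k >= 2."""
    return (k - 1) / (n_drawn - 1)


population = [1] * 50 + [0] * 950  # hypothetical: 5% carry the rare attribute
drawn = inverse_sample(population, lambda u: u == 1, 10, random.Random(0))
p_hat = estimated_proportion(len(drawn), 10)
```

The sample size is random here, and the stopping rule guarantees that the rare attribute is observed a fixed number of times.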
Domain-specific sampling methods • Distance Sampling • Used for biological populations • Line transect & point transect • A set of randomly placed lines or points is established and distances are measured to those objects detected by traveling along the lines or surveying the points • Spatial Sampling • For geostatistics • Two dimensional populations
Domain-specific sampling methods • Capture-Recapture Sampling • Used to estimate the total number of individuals in a population • An initial sample is obtained, and the individuals in that sample are marked and then released. • A second sample is independently obtained, and it is noted how many of the individuals in that sample were marked before. • Line-Intercept Sampling • Used to estimate elusive populations (rare animals)
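The slide does not name an estimator. A minimal sketch using the classical Lincoln-Petersen estimate N = n1 * n2 / m (n1 marked in the first sample, n2 caught in the second, m of those already marked) on a simulated population looks like this; the function names and population size are hypothetical.

```python
import random


def lincoln_petersen(n_marked, n_second, n_recaptured):
    """Population size estimate: n1 * n2 / m."""
    if n_recaptured == 0:
        raise ValueError("no marked individuals recaptured")
    return n_marked * n_second / n_recaptured


def simulate(true_size, n_first, n_second, rng):
    """Mark a first sample, draw an independent second one, count recaptures."""
    individuals = range(true_size)
    marked = set(rng.sample(individuals, n_first))
    second = rng.sample(individuals, n_second)
    recaptured = sum(1 for ind in second if ind in marked)
    return lincoln_petersen(n_first, n_second, recaptured)


estimate = simulate(1000, 200, 200, random.Random(0))
```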
Domain-specific sampling methods • Composite Sampling • For ecological studies • A sample is formed by taking a number of individual samples and then physically mixing them to form a composite sample • To obtain the desired information in the original samples but at reduced cost or effort • Panel Sampling • For social surveys • Sampling is often carried out at successive intervals of time on a continuing basis, covering the same population
Domain-specific sampling methods • Monte Carlo Strategies • For Bayesian inference • To generate samples from a given probability distribution P(x). • To estimate expectations of functions under this distribution. • Importance sampling • Not for generating samples from P(x) but for estimating the expectation of a function. • Rejection sampling • Generates samples from a complicated target distribution by using a simpler proposal distribution • Metropolis Strategy • Makes use of a proposal density which depends on the current state. • Gibbs Sampling • Sampling from distributions over at least two dimensions (using conditional distributions)
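Two of these strategies are short enough to sketch. The target density P(x) = 2x on [0, 1] and the uniform proposal are assumptions chosen so that the exact answer is known (the mean of x under P is 2/3); Metropolis and Gibbs sampling are not covered here.

```python
import random


def target_density(x):
    """Illustrative target P(x) = 2x on [0, 1] (an assumption for the demo)."""
    return 2.0 * x


def rejection_sample(n, rng, bound=2.0):
    """Uniform(0, 1) proposal q(x) = 1; requires P(x) <= bound * q(x)."""
    samples = []
    while len(samples) < n:
        x = rng.random()
        u = rng.random()
        # Accept x with probability P(x) / (bound * q(x)).
        if u * bound <= target_density(x):
            samples.append(x)
    return samples


def importance_estimate(f, n, rng):
    """Estimate E_P[f(x)] from uniform draws weighted by P(x) / q(x)."""
    total = 0.0
    for _ in range(n):
        x = rng.random()
        total += f(x) * target_density(x)  # q(x) = 1 on [0, 1]
    return total / n


rng = random.Random(1)
draws = rejection_sample(20_000, rng)
mean_by_rejection = sum(draws) / len(draws)
mean_by_importance = importance_estimate(lambda x: x, 50_000, rng)
```

Rejection sampling returns actual draws from P(x), whereas importance sampling returns only the expectation, matching the distinction made in the slide.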
Domain-specific sampling methods • Shannon Sampling • For signal processing • Given a sample rate, Shannon’s theorem gives the cutoff frequency of the pre-filter h.
Non-Probability Sampling • Probability sampling • Depends on the theory of probability • Non-probability sampling • Does not rely on the theory of probability • Accidental Sampling • Also called haphazard or convenience sampling • e.g., interviewing people who happen to be on the street • No evidence that they are representative • Purposive Sampling • We have one or more specific predefined groups in mind • The remaining sampling methods are sub-categories of purposive sampling
Non-Probability Sampling • Modal Instance Sampling • A mode is a value most frequently occurring in a distribution. • Sample the most frequent cases, or the typical cases • Expert Sampling • Quota Sampling • We select people non-randomly according to some fixed quota. • Heterogeneity Sampling • We are not concerned about representativeness. • For “outliers” • Snowball Sampling • Begin by identifying someone who meets the criteria for inclusion in our study • Then ask them to recommend others
One-Stage vs. Multi-Stage • Many sampling methods can be carried out in one stage. • Multi-stage sampling • The population is divided into a number of first-stage units • The selected first-stage units are sub-divided into a number of smaller second-stage units • The process is continued until the ultimate sampling units are reached. • Difference from multi-phase sampling • In multi-phase sampling, the sampling units remain the same at each phase of sampling
Multi-Stage Sampling • Uses • When it is extremely laborious and expensive to prepare a complete sampling frame • When a multi-stage sampling plan is more convenient • Multi-stage Simple Random Sampling • Multi-stage Varying Probability Sampling • Stratified Multi-stage Sampling
Equal vs. Varying Probability • Equal • Every unit in a population has the same probability of being selected • Varying • Different units in the population have different selection probabilities. • Can be useful when units vary considerably in size. • Probability Proportional to Size (PPS) Sampling
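PPS sampling with replacement can be sketched with weighted draws. The estimator shown (the mean of y_i / p_i, usually called the Hansen-Hurwitz estimator) is the standard companion to this design but is not mentioned in the slide; the sizes and y values are made up for illustration.

```python
import random


def pps_sample(sizes, n, rng):
    """Draw n unit indices with replacement, probability proportional to size."""
    return rng.choices(range(len(sizes)), weights=sizes, k=n)


def hansen_hurwitz_total(y, sizes, sample):
    """Estimate the population total of y as the mean of y_i / p_i."""
    total_size = sum(sizes)
    return sum(y[i] / (sizes[i] / total_size) for i in sample) / len(sample)


sizes = [10, 20, 30, 40]  # hypothetical unit sizes
y = [1.0, 2.0, 3.0, 4.0]  # study variable, here exactly proportional to size
sample = pps_sample(sizes, 5, random.Random(0))
total_hat = hansen_hurwitz_total(y, sizes, sample)
```

Because y is exactly proportional to size in this toy data, every sample reproduces the true total of 10, which illustrates why PPS helps when units vary considerably in size.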
With or Without Replacement • Sampling without replacement is more popular • Sampling with replacement is equivalent to drawing elements from an infinitely large population • Very useful when the sample size is small • Used by the bootstrap
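The bootstrap is the best-known use of sampling with replacement: the observed sample is resampled, with replacement and at its original size, to approximate the variability of a statistic. A minimal sketch on synthetic data (the data and function name are assumptions):

```python
import random
import statistics


def bootstrap_se(sample, statistic, n_boot, rng):
    """Standard error of a statistic, estimated from bootstrap resamples."""
    replicates = [
        statistic(rng.choices(sample, k=len(sample)))  # resample with replacement
        for _ in range(n_boot)
    ]
    return statistics.stdev(replicates)


data_rng = random.Random(7)
sample = [data_rng.gauss(0, 1) for _ in range(50)]  # synthetic observations
se_boot = bootstrap_se(sample, statistics.fmean, 2000, random.Random(0))
se_formula = statistics.stdev(sample) / len(sample) ** 0.5  # textbook value
```

For the mean, the bootstrap value should land close to the textbook formula; the method earns its keep for statistics that have no such formula.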
Adaptive vs. Non-Adaptive • Non-adaptive sampling • The selection procedure does not depend in any way on observations made during the sampling • Adaptive Sampling • The procedure for selecting sites or units to be included in the sample may depend on values of the variable of interest observed during the sampling • To obtain more precise estimates • May introduce biases into conventional estimators. • Adaptive Cluster Sampling • Whenever the variable of interest of a selected unit satisfies a given condition, additional units in the neighborhood of that unit are added.
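The selection step of adaptive cluster sampling can be sketched on a grid. Everything specific here is an assumption made for illustration: units are grid cells, the neighborhood is the four adjacent cells, and the condition is a non-zero count. The sketch covers selection only; as the slide notes, conventional estimators applied to the result may be biased.

```python
import random


def expand(grid, initial_cells, condition):
    """Add the 4-neighbors of every sampled cell whose value meets the condition."""
    rows, cols = len(grid), len(grid[0])
    sampled = set()
    frontier = list(initial_cells)
    while frontier:
        r, c = frontier.pop()
        if (r, c) in sampled:
            continue
        sampled.add((r, c))
        if condition(grid[r][c]):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    frontier.append((nr, nc))
    return sampled


def adaptive_cluster_sample(grid, n_initial, condition, rng):
    """Simple random initial sample of cells, followed by adaptive expansion."""
    cells = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    return expand(grid, rng.sample(cells, n_initial), condition)


grid = [[0] * 5 for _ in range(5)]
for r, c in ((2, 2), (2, 3), (3, 2)):  # one small patch of occupied cells
    grid[r][c] = 4
final = adaptive_cluster_sample(grid, 3, lambda v: v > 0, random.Random(0))
```

If any initial cell falls in the patch, the whole patch and its empty border cells end up in the sample, which concentrates effort where the variable of interest is found.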
Choosing Sampling Methods • It depends!