Sampling: Knowing Whole from Its Part

Instance Selection and Construction for Data Mining, Ch. 2. Baohua Gu, Feifang Hu, and Huan Liu. 2001.5.2.

Presentation Transcript


  1. Sampling: Knowing Whole from Its Part • Instance Selection and Construction for Data Mining, Ch. 2 • Baohua Gu, Feifang Hu, and Huan Liu • 2001.5.2

  2. Introduction • Approaches to studying the characteristics of a population • Complete enumeration or census: examines every unit of the population • Theoretical • Sampling: examines only a part of it • Practical • Shows remarkable advantages such as reduced cost, reduced time, greater scope, and greater accuracy

  3. Basics of Sampling • Population • A set of elements • Element • A unit for which information is sought • Sampling • Selecting a subset, called a sample • Sampling units • Collections of elements into which the population is divided • Cover the whole population without overlapping • Sampling frame • A list of the sampling units

  4. Basics of Sampling • Sample size • The number of elements to be selected • Sampling method (sampling design) • Scheme or design for selecting elements into a sample • Estimator • Provides some statistical information about the population • Statistic • A function of the elements in a sample • Estimate • A particular value taken by an estimator • Statistical inference • The estimation procedure

  5. Basic assumptions • The characteristics of interest of a population are usually unavailable or hard to obtain, whereas the same characteristics of a sample are much easier to obtain • A sample is always a subset selected from a population • One can obtain estimates that are “unbiased” for population quantities • Any uncertainty in estimates obtained by sampling thus stems from the fact that only part of the population is observed • With the population characteristics remaining fixed, the estimates of them depend on which sample is selected and which estimation method is used

  6. General Considerations • How large should a sample be? • Depends on cost, accuracy, and other factors • Probability inequality: • Practical way • Select a small preliminary sample of size m • Make observations on the selected units • Compute an estimate e0 • Substitute e0 into the equation and solve for n • If n > m, then sample n – m additional units • Otherwise, no more sampling is needed
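The practical procedure on this slide can be sketched as follows. The slide's probability inequality did not survive in the transcript, so this sketch assumes the common normal-approximation rule n = (z · s / d)², where s is the standard deviation estimated from the preliminary sample and d is the tolerated margin of error. The function name, the synthetic population, and the parameter values are illustrative assumptions, not taken from the chapter.

```python
import math
import random
import statistics

def required_sample_size(pilot, margin, z=1.96):
    """Sample size n for which the mean's margin of error is about `margin`.

    Assumes the normal-approximation rule n = (z * s / margin)^2, where s is
    the standard deviation estimated from the preliminary (pilot) sample.
    """
    s = statistics.stdev(pilot)          # plays the role of the estimate e0
    return math.ceil((z * s / margin) ** 2)

random.seed(0)
population = [random.gauss(50, 10) for _ in range(100_000)]  # synthetic data

m = 30
pilot = random.sample(population, m)     # small preliminary sample of size m
n = required_sample_size(pilot, margin=1.0)

# If n > m, sample n - m additional units; otherwise stop.
additional = max(0, n - m)
print(f"pilot size m={m}, required n={n}, additional units={additional}")
```

A different inequality (for example one based on Chebyshev's bound) would change only the body of `required_sample_size`; the two-step structure stays the same.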

  7. General Considerations • How good will an estimate be? • The ultimate objective of any sampling is to make inferences about a population of interest. • Two main properties • Unbiasedness • Sampling variance
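Both properties can be checked empirically by drawing many samples and looking at how the estimates are distributed. The sketch below uses the sample mean as the estimator and a synthetic skewed population; both choices are assumptions made for illustration only.

```python
import random
import statistics

random.seed(1)
population = [random.expovariate(1 / 20) for _ in range(10_000)]  # synthetic
mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population)

def sample_means(n, repeats=2000):
    """Means of `repeats` simple random samples (without replacement) of size n."""
    return [statistics.fmean(random.sample(population, n)) for _ in range(repeats)]

for n in (25, 100):
    means = sample_means(n)
    bias = statistics.fmean(means) - mu      # close to 0 for an unbiased estimator
    var = statistics.pvariance(means)        # empirical sampling variance
    print(f"n={n}: bias={bias:+.3f}, variance={var:.3f}, sigma^2/n={sigma2 / n:.3f}")
```

The average of the sample means stays close to the population mean at every sample size, while the sampling variance shrinks roughly as sigma²/n.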

  8. General Considerations • What types of errors may be involved? • Sampling error: errors in the estimates that occur only because just part of the population is included in the sample. • Decreases in inverse proportion to the square root of the sample size • Non-sampling error: errors that arise from a defective sampling procedure • Likely to increase as the sample size increases

  9. General Considerations • Any useful auxiliary information or relationship? • Additional information about the population elements that can be exploited

  10. General Considerations • What are the pros and cons of sampling? • Reduced cost, greater speed, greater scope, or greater accuracy • A common misunderstanding of sampling • That sampling can reveal the true characteristics of a population; it yields only estimates of them

  11. Categories of sampling methods • General purpose vs. specific domain • Equal probability vs. varying probability • With replacement vs. without replacement • One-stage vs. multi-stage • Non-adaptive vs. adaptive

  12. General-purpose sampling methods • Random sampling • Every population unit has the same chance of being selected in the sample. • Random sampling without replacement (popular) vs. random sampling with replacement • There is no bias in selecting the sample • Easy to operate
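A minimal sketch of the two variants, using Python's standard `random` module on a synthetic list of unit labels (the population and sample size are illustrative assumptions):

```python
import random

random.seed(42)
population = list(range(1, 101))   # unit labels 1..100 (synthetic)

# Without replacement: each unit appears at most once in the sample.
srs_wor = random.sample(population, k=10)

# With replacement: the same unit may be drawn more than once.
srs_wr = random.choices(population, k=10)

print("without replacement:", sorted(srs_wor))
print("with replacement:   ", sorted(srs_wr))
```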

  13. General-purpose sampling methods • Stratified Sampling • When a population consists of a number of approximately homogeneous groups, it is convenient and effective to select a small number of elements from each group. • We first divide the population into non-overlapping sub-populations, or strata. • Small samples are then selected from the different strata. • The total sample is formed by combining all the small samples • It is an alternative to increasing the sample size as a way to reach higher precision • It enables effective use of the available auxiliary information • It can deal with “outliers” by placing them in a stratum of their own, which reduces the variability within the other strata. • Disadvantage: it is difficult to construct ideal strata
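The steps above can be sketched as follows, assuming proportional allocation (the same sampling fraction in every stratum), which is only one of several possible allocation rules. The helper name `stratified_sample` and the synthetic data are hypothetical.

```python
import random
import statistics
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, rng):
    """Draw a simple random sample of about `fraction` from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)       # non-overlapping strata
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)
# Synthetic population of (region, income) pairs; "region" defines the stratum.
units = (
    [("north", rng.gauss(30, 2)) for _ in range(600)]
    + [("south", rng.gauss(60, 2)) for _ in range(300)]
    + [("east", rng.gauss(90, 2)) for _ in range(100)]
)

sample = stratified_sample(units, lambda u: u[0], fraction=0.1, rng=rng)
# With proportional allocation the plain sample mean estimates the population mean.
print("sample size:", len(sample))
print("estimated mean:", round(statistics.fmean(v for _, v in sample), 2))
print("true mean:     ", round(statistics.fmean(v for _, v in units), 2))
```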

  14. General-purpose sampling methods • Cluster Sampling • When a population consists of a number of groups, each of which is a “miniature” of the entire population, it is possible to estimate the population characteristics correctly by selecting a small number of groups and all their elements. • Separate the population into mutually exclusive sub-populations (clusters) • A sample of these clusters is chosen • All units in the selected clusters are selected to form the sample. • To obtain estimators of low variance, the units within each cluster should be as heterogeneous as possible and the clusters should be similar to each other. • Advantage: less costly • Disadvantage: hard to ensure “representativeness”
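A minimal sketch of one-stage cluster sampling, assuming the clusters are chosen by simple random sampling. The helper name `cluster_sample` and the synthetic blocks are hypothetical.

```python
import random
import statistics

def cluster_sample(clusters, n_clusters, rng):
    """Pick `n_clusters` clusters at random and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    units = [u for name in chosen for u in clusters[name]]
    return chosen, units

rng = random.Random(3)
# Synthetic population: 20 clusters (e.g. city blocks) of 50 units each.
clusters = {
    f"block{i:02d}": [rng.gauss(100, 15) for _ in range(50)] for i in range(20)
}

chosen, units = cluster_sample(clusters, n_clusters=4, rng=rng)
print("selected clusters:", chosen)
print("units observed:", len(units))
print("estimated mean:", round(statistics.fmean(units), 2))
```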

  15. General-purpose sampling methods • Systematic Sampling • First select a random number between 1 and k • Select the unit with this serial number and every kth unit thereafter • The constant k is the sampling interval (N/n) • Operationally convenient • The sample is well spread over the population • Disadvantages: • Unbiased variance estimators cannot be obtained • Sensitive to periodic data
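The selection rule can be sketched as below. The helper name `systematic_sample` is hypothetical, and the interval is taken as the integer part of N/n, which is an assumption for the case where N is not a multiple of n.

```python
import random

def systematic_sample(units, n, rng):
    """Take every k-th unit after a random start, with k = N // n."""
    k = len(units) // n              # sampling interval
    start = rng.randint(1, k)        # random serial number in 1..k (inclusive)
    # Serial numbers are 1-based, list indices are 0-based.
    return units[start - 1::k][:n]

rng = random.Random(0)
units = list(range(1, 101))          # serial numbers 1..100 (synthetic)
sample = systematic_sample(units, n=10, rng=rng)
print(sample)
```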

  16. General-purpose sampling methods • Double Sampling • Two-phase sampling • Initially a large first-phase sample of units is selected for obtaining auxiliary information only • A second-phase sample is then selected, in which the studied variable is observed. • The aim is to obtain better estimators by using the relationship between the auxiliary variables and the studied variable.
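One common way to exploit the relationship between the two variables is a ratio estimator, sketched below. The slide does not name a specific estimator, so the ratio form, the choice to draw the second phase as a subsample of the first, and the synthetic data are all assumptions.

```python
import random
import statistics

rng = random.Random(7)
# Synthetic population: x is a cheap auxiliary variable, y is the costly
# studied variable, roughly proportional to x.
population = []
for _ in range(20_000):
    x = rng.uniform(10, 100)
    population.append((x, 2.5 * x + rng.gauss(0, 5)))

# Phase 1: large sample in which only the auxiliary variable x is recorded.
phase1 = rng.sample(population, 2000)
x_bar_1 = statistics.fmean(x for x, _ in phase1)

# Phase 2: small subsample of phase 1 in which y is observed as well.
phase2 = rng.sample(phase1, 100)
x_bar_2 = statistics.fmean(x for x, _ in phase2)
y_bar_2 = statistics.fmean(y for _, y in phase2)

# Ratio estimator: scale the phase-2 ratio y/x by the more precise phase-1 mean of x.
y_ratio = (y_bar_2 / x_bar_2) * x_bar_1
true_mean = statistics.fmean(y for _, y in population)
print(f"phase-2 mean only: {y_bar_2:.2f}")
print(f"ratio estimate:    {y_ratio:.2f}")
print(f"true mean:         {true_mean:.2f}")
```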

  17. General-purpose sampling methods • Network sampling • Multiplicity sampling • A simple random selection or stratified random selection is made • All units which are linked to any of the selected units are also included in the sample

  18. General-purpose sampling methods • Inverse Sampling • Sampling is continued until some specified conditions are satisfied • We can obtain enough information about a rarely occurring attribute by choosing a suitable stopping condition.
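A minimal sketch, assuming the stopping condition is "k units with the rare attribute have been observed". The estimator (k - 1) / (n - 1) used at the end is the standard unbiased estimator of the proportion under this scheme; it is not stated on the slide. The helper name and the `max_draws` safety cap are hypothetical.

```python
import random

def inverse_sample(draw, k, max_draws=1_000_000):
    """Keep drawing until k units with the rare attribute have been seen.

    `draw()` returns True when the sampled unit has the attribute.
    Returns the total number of draws n.
    """
    hits = n = 0
    while hits < k and n < max_draws:
        n += 1
        hits += draw()               # True counts as 1, False as 0
    return n

rng = random.Random(11)
p_true = 0.02                        # synthetic rare attribute
k = 20
n = inverse_sample(lambda: rng.random() < p_true, k)

p_hat = (k - 1) / (n - 1)            # unbiased for p under inverse sampling
print(f"draws needed: {n}, estimated proportion: {p_hat:.4f}")
```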

  19. Domain-specific sampling methods • Distance Sampling • Used for biological populations • Line transects and point transects • A set of randomly placed lines or points is established, and distances are measured to the objects detected while traveling along the lines or surveying the points • Spatial Sampling • For geostatistics • Two-dimensional populations

  20. Domain-specific sampling methods • Capture-Recapture Sampling • Used to estimate the total number of individuals in a population • An initial sample is obtained, and the individuals in that sample are marked and then released. • A second sample is independently obtained, and the number of individuals in it that were previously marked is noted. • Line-Intercept Sampling • Used to estimate elusive populations (rare animals)
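The capture-recapture idea is often turned into a number with the Lincoln-Petersen estimator, N ≈ (marked in first sample × size of second sample) / (marked individuals found in second sample). The slide does not name an estimator, so its use here is an assumption, as is the synthetic population.

```python
import random

def lincoln_petersen(n_marked, n_second, n_recaptured):
    """Estimate population size as n_marked * n_second / n_recaptured."""
    if n_recaptured == 0:
        raise ValueError("no marked individuals recaptured; estimate undefined")
    return n_marked * n_second / n_recaptured

rng = random.Random(5)
population = range(2000)                     # synthetic; true size is 2000

first = set(rng.sample(population, 200))     # capture, mark, release
second = rng.sample(population, 250)         # independent second sample
recaptured = sum(1 for animal in second if animal in first)

estimate = lincoln_petersen(len(first), len(second), recaptured)
print(f"recaptured: {recaptured}, estimated population size: {estimate:.0f}")
```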

  21. Domain-specific sampling methods • Composite Sampling • For ecological studies • A sample is formed by taking a number of individual samples and then physically mixing them to form a composite sample • The aim is to obtain the desired information in the original samples at reduced cost or effort • Panel Sampling • For social surveys • Sampling is often carried out at successive intervals of time on a continuing basis, covering the same population

  22. Domain-specific sampling methods • Monte Carlo Strategies • For Bayesian inference • To generate samples from a given probability distribution P(x). • To estimate expectations of functions under this distribution. • Importance sampling • Not for generating samples from P(x) but for estimating the expectation of a function. • Rejection sampling • Generates samples from a complicated target distribution by using a simpler proposal distribution • Metropolis Strategy • Makes use of a proposal density that depends on the current state. • Gibbs Sampling • Sampling from distributions over at least two dimensions (using conditional distributions)
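Two of these strategies can be sketched in a few lines. The target P(x) is taken to be a standard normal and the importance-sampling proposal a wider normal; both are assumptions chosen so that the correct answers are known. The function names are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(0)

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_estimate(f, n):
    """Estimate E_P[f(x)] for P = N(0, 1) using draws from Q = N(0, 2^2)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)                            # draw from Q
        weight = normal_pdf(x) / normal_pdf(x, 0.0, 2.0)   # P(x) / Q(x)
        total += weight * f(x)
    return total / n

def metropolis(log_p, x0, n, step=1.0):
    """Random-walk Metropolis: the proposal is centred on the current state."""
    x, chain = x0, []
    for _ in range(n):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, P(proposal) / P(x)).
        accept_prob = math.exp(min(0.0, log_p(proposal) - log_p(x)))
        if rng.random() < accept_prob:
            x = proposal
        chain.append(x)
    return chain

second_moment = importance_estimate(lambda x: x * x, 50_000)   # E[x^2] is 1 under P
chain = metropolis(lambda x: -0.5 * x * x, x0=0.0, n=20_000)   # log P up to a constant
print(f"importance-sampling estimate of E[x^2]: {second_moment:.3f}")
print(f"Metropolis chain mean: {statistics.fmean(chain):.3f}")
```

Importance sampling returns only a weighted estimate of the expectation, whereas the Metropolis chain yields (correlated) samples from P(x) itself.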

  23. Domain-specific sampling methods • Shannon Sampling • For signal processing • Given a sample rate, Shannon’s theorem gives the cutoff frequency of the pre-filter h (at most half the sample rate).

  24. Non-Probability Sampling • Probability sampling • Depends on the theory of probability • Non-probability sampling • Does not rely on the theory of probability • Accidental Sampling • Also called haphazard or convenience sampling • For example, interviewing people who happen to be on the street • No evidence that they are representative • Purposive Sampling • We have one or more specific predefined groups in mind • The sampling methods that follow are sub-categories of purposive sampling

  25. Non-Probability Sampling • Modal Instance Sampling • The mode is the most frequently occurring value in a distribution. • Sample the most frequent cases, or the typical cases • Expert Sampling • Quota Sampling • We select people non-randomly according to some fixed quota. • Heterogeneity Sampling • We are not concerned about representativeness. • For “outliers” • Snowball Sampling • Begin by identifying someone who meets the criteria for inclusion in our study • Then ask them to recommend others

  26. One-Stage vs. Multi-Stage • Many sampling methods can be carried out in one stage. • Multi-stage sampling • The population is divided into a number of first-stage units • The selected first-stage units are sub-divided into a number of smaller second-stage units • The process is continued until the ultimate sampling units are reached. • Difference from multi-phase sampling • In multi-phase sampling the sampling units stay the same at each phase, whereas in multi-stage sampling they differ from stage to stage

  27. Multi-Stage Sampling • Usages • When it is extremely laborious and expensive to prepare a complete sampling frame • When a multi-stage sampling plan is more convenient • Multi-stage Simple Random Sampling • Multi-stage Varying Probability Sampling • Stratified Multi-stage Sampling

  28. Equal vs. Varying Probability • Equal • Every unit in a population has the same probability of being selected • Varying • Different units in the population have different selection probabilities. • Can be useful when units vary considerably in size. • Probability Proportional to Size (PPS) Sampling
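A minimal sketch of PPS sampling, assuming draws are made with replacement so that `random.choices` with weights applies directly. The unit names and size measures are synthetic.

```python
import random
from collections import Counter

rng = random.Random(2)
# Synthetic units with a size measure (e.g. number of employees per firm).
sizes = {"A": 500, "B": 300, "C": 150, "D": 50}

# PPS with replacement: each draw picks a unit with probability size / total size.
draws = rng.choices(list(sizes), weights=list(sizes.values()), k=10_000)
share = {unit: count / len(draws) for unit, count in Counter(draws).items()}

total = sum(sizes.values())
for unit in sizes:
    print(f"{unit}: target {sizes[unit] / total:.3f}, observed {share.get(unit, 0.0):.3f}")
```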

  29. With or Without Replacement • Sampling without replacement is more popular • Sampling with replacement is equivalent to drawing elements from an infinitely large population • Very useful when the sample size is small • Bootstrap
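The bootstrap applies sampling with replacement to the sample itself. The sketch below estimates the standard error of the mean this way and compares it with the textbook formula s / sqrt(n); the helper name, the choice of statistic, and the synthetic sample are assumptions.

```python
import random
import statistics

def bootstrap_se(sample, stat, n_boot, rng):
    """Standard error of `stat`, from resamples drawn with replacement."""
    replicates = [stat(rng.choices(sample, k=len(sample))) for _ in range(n_boot)]
    return statistics.stdev(replicates)

rng = random.Random(9)
sample = [rng.gauss(10, 3) for _ in range(40)]     # small synthetic sample

se_boot = bootstrap_se(sample, statistics.fmean, n_boot=2000, rng=rng)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5
print(f"bootstrap SE: {se_boot:.3f}, formula SE: {se_formula:.3f}")
```

The same function works for statistics that have no simple standard-error formula, such as the median, by passing `statistics.median` as `stat`.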

  30. Adaptive vs. Non-Adaptive • Non-adaptive sampling • The selection procedure does not depend in any way on observations made during the sampling • Adaptive Sampling • The procedure for selecting sites or units to be included in the sample may depend on values of the variable of interest observed during the sampling • Aims to obtain more precise estimates • May introduce biases into conventional estimators. • Adaptive Cluster Sampling • Whenever the variable of interest of a selected unit satisfies a given condition, additional units in the neighborhood of that unit are added.
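Adaptive cluster sampling can be sketched on a grid of cells. The neighbourhood definition (the four adjacent cells), the condition (value at or above a threshold), and the function name are assumptions made for illustration.

```python
import random

def adaptive_cluster_sample(grid, initial, threshold):
    """Expand an initial sample of cells: whenever a sampled cell's value meets
    `threshold`, its four neighbours are added too (and checked in turn)."""
    rows, cols = len(grid), len(grid[0])
    frontier = list(initial)
    sampled = set()
    while frontier:
        r, c = frontier.pop()
        if (r, c) in sampled:
            continue
        sampled.add((r, c))
        if grid[r][c] >= threshold:              # condition of interest is met
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    frontier.append((nr, nc))
    return sampled

# Synthetic counts of a rare, spatially clustered species.
grid = [
    [0, 0, 0, 0, 0],
    [0, 5, 7, 0, 0],
    [0, 6, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 3],
]
rng = random.Random(4)
cells = [(r, c) for r in range(5) for c in range(5)]
initial = rng.sample(cells, 5)                   # non-adaptive starting sample
sampled = adaptive_cluster_sample(grid, initial, threshold=1)
print(f"initial cells: {len(initial)}, final cells: {len(sampled)}")
```

As the slide notes, the final sample is no longer a simple random sample, so conventional estimators such as the plain sample mean can be biased when applied to it.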

  31. Summary of Sampling Methods

  32. Choosing Sampling Methods • It depends!
