
STT 315

STT 315. Ashwini Maurya. This note is based on Chapter 1 of the textbook. Acknowledgement: The author is thankful to Dr. Ashok Sinha, Dr. Jennifer Kaplan and Dr. Parthanil Roy for allowing him to use/edit some of their slides.


Presentation Transcript


  1. STT 315. Ashwini Maurya. This note is based on Chapter 1 of the textbook. Acknowledgement: The author is thankful to Dr. Ashok Sinha, Dr. Jennifer Kaplan and Dr. Parthanil Roy for allowing him to use/edit some of their slides.

  2. Course Outline • Collecting Data • Surveys • Exploratory Data Analysis • Data Representations • Numerical Summaries of Data • Data Models (probability) • Inference

  3. What is Statistics? • Statistics: a word with two meanings • A subject, like mathematics or physics. • A value we compute from (sample) data. Statistics is a “mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data”. [Source: Wikipedia]

  4. Descriptive and Inferential Statistics • In descriptive statistics, numerical and graphical methods are used to find patterns in a data set, and to summarize and present the information it reveals in a convenient form. • Inferential statistics involves estimation, decisions, predictions, and/or other generalizations about a larger set of data based on sample data.

  5. Data and Its Characteristics • We need to know if the data are “good enough” to use as a basis for decisions. • Who are the data about? • What do the data represent? • When were the data collected? • Where were the data collected? • How were the data collected? • Why were the data collected?

  6. An Example • In June 2000, a homeowner in Tuscola, Illinois, wanted to determine if generic fertilizer and weed killer is as effective as the more expensive brand-name product. After the spring rains and early summer warmth, he counted the number of weeds and the density of grass blades. Identify the who, where, when, and why for the situation described. • A homeowner; Tuscola, Illinois; June 2000; compare products. • Two patches of lawn; Tuscola, Illinois; June 2000; compare products. • Two patches of lawn; Arcola, Illinois; June 2000; compare products. • A homeowner; Arcola, Illinois; June 2000; compare products. • Two patches of lawn; Tuscola, Illinois; June 2000; compare products.

  7. Some Terminology • A population is the complete set of all items that we are interested in studying. The number of items in a population is called the population size, usually denoted by N. • A sample is a subset of the population. Usually n denotes the sample size (the number of observations in a sample). • A variable is a characteristic or property of an item on which we take measurements. [Answers the question what.] • The items or individuals from whom/which the data are collected are often called cases. [Answers the question who.] • Data are the observed values of the variable. • A study of a whole population is called a census, and a study of a sample is known as a sample survey.

  8. Census vs. Survey • Every 10 years, the U.S. government takes a census of the population of the U.S. and finds the values of certain parameters, such as average family size or income. • But a census is costly. Usually, if we want to know something about the population, we survey a sample of the population, find the values of statistics such as average family size or income, and use those statistics as estimates of the parameters. So by using a survey we learn something about the population by asking only a sample.

  9. Population and Sample • Suppose we would like to estimate the fraction of East Lansing residents who are students. • In this case, the population is all East Lansing residents. • However, surveying the entire population may be costly, time-consuming and laborious. Therefore, we can do our job by selecting a sample that is “a good representative of the population”.

  10. Parameter and Statistic • Parameters are the values we calculate from the population data. Population mean, population variance, population median etc. are examples of parameters. • Statistics: a word with two meanings • A subject, like mathematics or physics. • Values we compute from sample data. Sample mean, sample variance, sample proportion etc. are examples of statistics. The singular of statistics in this second sense is “statistic”.
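The distinction between a parameter and a statistic can be illustrated with a small sketch. The population below is made-up data (family sizes for N = 1000 hypothetical households), and the seed and sample size n = 50 are arbitrary choices for illustration only.

```python
import random
import statistics

# Hypothetical population: family sizes for N = 1000 households (made-up data).
rng = random.Random(315)
population = [rng.randint(1, 7) for _ in range(1000)]

# Parameter: a value computed from the entire population.
population_mean = statistics.mean(population)

# Statistic: a value computed from a sample (here n = 50, drawn without replacement).
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The sample mean will generally differ a little from the population mean; it serves as an estimate of the parameter.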

  11. Examples of Surveys • Exit polling in elections • Public opinion polls • Nielsen television ratings • J.D. Power car ratings

  12. Sampling Schemes In this course, we shall learn about 5 different sampling schemes: • Simple Random Sampling • Stratified Sampling • Cluster Sampling • Multistage Sampling • Systematic Sampling

  13. Simple Random Sampling • A simple random sample is a subset of individuals (a sample) chosen from a population in such a way that any subset of k individuals has the same probability of being chosen for the sample as any other subset of k individuals. • The process of choosing a simple random sample is known as simple random sampling.

  14. Simple Random Sampling: An Example Suppose I would like to draw a simple random sample of size n = 4 from a class of 50 students. How would I do that? • I shall assign the numbers 01, 02, …, 50 to these 50 students, write the numbers on 50 similar-looking pieces of paper, mix them well in a basket and then pick 4 numbers from the basket without replacement. • Or, I can use a random number table to select 4 students. • Or, I can write a computer program that will do the job.
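The "computer program" option can be sketched in a few lines of Python. This is only one possible way to do it: the function name `draw_srs` and the seed are hypothetical, and `random.sample` from the standard library plays the role of drawing slips from the basket without replacement.

```python
import random

def draw_srs(population_size, sample_size, seed=None):
    """Draw a simple random sample of labels 1..population_size without replacement."""
    labels = list(range(1, population_size + 1))
    rng = random.Random(seed)  # seed is optional; fixing it makes the draw repeatable
    return rng.sample(labels, sample_size)

# Class of 50 students, sample of size n = 4 (seed chosen arbitrarily).
chosen = draw_srs(50, 4, seed=2024)
print(sorted(chosen))
```

Every subset of 4 students is equally likely to be returned, which is the defining property of a simple random sample.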

  15. Simple Random Sampling (SRS) • Simple Random Sampling (SRS) is usually done using random numbers. • Random numbers are available from • our textbook (random number tables in Appendix D), • the Internet (random number generator websites), • a TI 83/84 calculator, which can generate random numbers. • An example of a random number table: 43900 44304 30419 02647 27619 26146 57122 64194 69535 53513 01579 30823 16533 85961 51118 55649 95170 50049 58854 85557 05447 45777 71671 47104 20805 73144 16128 13733 67803 32150 65667 38559 46441 96238 46845 68467 56717 91966 86221 30014 72076 19333 04120 96643 19074 51781 80216 21469

  16. With or without replacement? • In small populations and often in large ones, such sampling is typically done "without replacement", i.e., one deliberately avoids choosing any member of the population more than once. • Although simple random sampling can also be conducted with replacement, this is less common and would normally be described more fully as simple random sampling with replacement.

  17. With or without replacement • In a with-replacement sampling scheme, an item may be selected several times. • In a without-replacement sampling scheme, no item is allowed to be selected more than once. Example: Suppose we are selecting 4 items out of 55 (identified with numbers 00, 01, …, 54). We read the following random number table two digits at a time, skipping any number above 54: 43900 44304 30419 02647 27619 26146 With-replacement sampling will produce: {43, 04, 43, 04}. Without-replacement sampling will produce: {43, 04, 30, 41}.
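The table-reading procedure in this example can be checked with a short sketch. The function name `read_table` and its parameters are hypothetical; the rule it assumes is to read the digits left to right in pairs, skip any label above 54, and (for without-replacement sampling) also skip labels already chosen.

```python
def read_table(digits, upper, size, replace):
    """Read two-digit labels left to right from a string of random digits.

    Labels above `upper` are skipped. With replace=False, labels that were
    already picked are skipped as well.
    """
    stream = digits.replace(" ", "")
    picked = []
    for i in range(0, len(stream) - 1, 2):
        label = int(stream[i:i + 2])
        if label > upper:
            continue  # not a valid item number
        if not replace and label in picked:
            continue  # already selected; not allowed without replacement
        picked.append(label)
        if len(picked) == size:
            break
    return picked

table = "43900 44304 30419 02647 27619 26146"
with_repl = read_table(table, upper=54, size=4, replace=True)
without_repl = read_table(table, upper=54, size=4, replace=False)
print([f"{x:02d}" for x in with_repl])     # ['43', '04', '43', '04']
print([f"{x:02d}" for x in without_repl])  # ['43', '04', '30', '41']
```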

  18. Stratified Sampling • Sometimes the population is first sliced into homogeneous groups called strata and simple random sampling is used within each stratum. Finally, these subsamples are combined into a sample. This sampling scheme is known as “stratified random sampling” or simply “stratified sampling”.

  19. When to use Stratified Sampling? • Suppose we would like to know how students feel about funding for the football team in a large university, and the student population consists of 40% men and 60% women. Suppose we feel that men and women would have different views on the funding. In this case, a simple random sample may, by chance, over- or under-represent one of the groups. Instead, it is better to divide the student population into two strata: male and female students. For a sample of 100, we can then choose a stratified sample consisting of 40 male students and 60 female students. This will be a better representative of the population. Moral: Whenever we have a heterogeneous population, it is better to use stratified sampling.
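A minimal sketch of the stratified scheme described above follows. The student IDs and stratum sizes (4000 men, 6000 women) are made up to match the 40%/60% split, and `stratified_sample` is a hypothetical helper, not a library function.

```python
import random

def stratified_sample(strata, allocation, seed=None):
    """Draw a simple random sample of the allocated size within each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, allocation[name])
            for name, members in strata.items()}

# Hypothetical student IDs: 4000 men and 6000 women (a 40% / 60% split).
strata = {
    "men": [f"M{i}" for i in range(4000)],
    "women": [f"W{i}" for i in range(6000)],
}
allocation = {"men": 40, "women": 60}  # proportional allocation for a sample of 100

sample = stratified_sample(strata, allocation, seed=1)
print({name: len(ids) for name, ids in sample.items()})  # {'men': 40, 'women': 60}
```

Unlike a simple random sample of 100, this design guarantees the 40/60 split in every sample.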

  20. Cluster Sampling • Splitting the population into representative clusters can make sampling more practical. Then we could simply select one or a few clusters at random and perform a census within each of them. This sampling scheme is called cluster sampling.

  21. When to use Cluster Sampling? • Suppose I am trying to find out what MSU freshmen think about the dining service on campus and I know that freshmen at MSU are all housed in 10 freshman dorms. In this case, I shall select two or three of these 10 dorms at random and contact all the residents of these selected dorms.
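The dorm example can be sketched as follows. The roster is hypothetical (10 dorms with 20 made-up resident IDs each), and choosing 3 dorms is just one of the "two or three" options mentioned in the slide.

```python
import random

rng = random.Random(7)  # arbitrary seed so the draw is repeatable

# Hypothetical roster: 10 freshman dorms, each with 20 made-up resident IDs.
dorms = {f"Dorm {d}": [f"D{d}-R{r}" for r in range(1, 21)] for d in range(1, 11)}

# Step 1: choose a few clusters (dorms) at random.
chosen_dorms = rng.sample(sorted(dorms), 3)

# Step 2: take a census within each chosen cluster (contact every resident).
respondents = [resident for dorm in chosen_dorms for resident in dorms[dorm]]

print(chosen_dorms)
print(len(respondents))  # 60
```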

  22. Stratified vs. Cluster Sampling • Strata are homogeneous but different from one another, while clusters are heterogeneous and resemble the overall population. • We perform simple random sampling in ALL strata, whereas we choose only a few clusters at random and perform a census in those clusters.

  23. Multistage Sampling • Sampling schemes that combine several methods are called multistage sampling. Most surveys conducted by professional organizations use multistage sampling. • The exact scheme depends on the nature of the population and the nature of the survey.

  24. An Example of Multistage Sampling • Suppose I am trying to find out what MSU freshmen think about the dining service on campus and I know that freshmen at MSU are all housed in 10 freshmen dorms. Suppose I am concerned about possible differences of opinions between men and women and these dorms have men and women on alternate floors. Now I can use a combination of stratified and cluster sampling as follows: I would first choose 2 freshman dorms at random (out of 10) and then select some dorm floors at random from among those that house men, and, separately, from among those that house women. I could then treat each selected floor as a cluster and interview everyone on that floor.
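One way to sketch this multistage scheme is shown below. The layout is hypothetical (10 dorms with 6 floors each, odd floors housing men and even floors housing women), and picking one men's floor and one women's floor per selected dorm is an assumption; the slide only says "some" floors.

```python
import random

rng = random.Random(11)  # arbitrary seed so the draw is repeatable

# Hypothetical layout: 10 dorms, 6 floors each; odd floors house men, even floors women.
dorms = {
    d: {floor: ("men" if floor % 2 else "women") for floor in range(1, 7)}
    for d in range(1, 11)
}

# Stage 1: choose 2 of the 10 dorms at random.
stage1 = rng.sample(sorted(dorms), 2)

# Stage 2: within each chosen dorm, pick floors separately for men and women
# (stratifying by sex). Each selected floor is then treated as a cluster.
selected_floors = []
for d in stage1:
    for group in ("men", "women"):
        floors = [f for f, g in dorms[d].items() if g == group]
        selected_floors.append((d, rng.choice(floors), group))

for dorm, floor, group in selected_floors:
    print(f"dorm {dorm}, floor {floor} ({group}): interview everyone on this floor")
```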

  25. Types of Samples • Simple Random Sample (SRS) - every possible sample of the chosen size has an equal probability of being selected. • Cluster - entire groups are randomly selected. • Stratified Random - the population is divided into homogeneous groups and a simple random sample is chosen from each group. • Multistage - used in national polling; usually starts with a random selection of states, then counties, and then houses to call. • Convenience - individuals who are conveniently available are chosen. • Systematic - individuals are picked in a predetermined order.

  26. Example To represent the population of MSU students: • Simple Random Sample (SRS) - randomly generate a subset of the PIDs of all students or put all the names in a hat, shake it up and draw some out. • Cluster - a set of large lecture classes of different disciplines. • Stratified Random - randomly generate a set of PIDs for each class: freshmen, sophomores, juniors and seniors. • Multistage - randomly choose 3 dorms on campus, then randomly choose 2 floors of each dorm and sample from each of the floors using SRS. • Convenience - our STT 200 class. • Systematic - every 5th student I meet in the food-court.
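The systematic scheme ("every 5th student") can be sketched as below. The list of students is made up, `systematic_sample` is a hypothetical helper, and starting from a random position within the first 5 is a common convention assumed here rather than something stated in the slide.

```python
import random

def systematic_sample(items, step, seed=None):
    """Pick every `step`-th item, starting from a random position in the first `step`."""
    start = random.Random(seed).randrange(step)
    return items[start::step]

# Hypothetical ordered list of 100 students, e.g. in the order they are met.
students = [f"student_{i:03d}" for i in range(1, 101)]

picked = systematic_sample(students, step=5, seed=3)
print(len(picked))  # 20
print(picked[:3])
```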

  27. Bias Bias is any systematic failure of a sample to represent its population. • If the design of a survey is such that certain experimental units have no chance of being selected, this causes selection bias. • Sometimes surveyors are unable to collect data from certain experimental units, which causes nonresponse bias. Internet polling suffers from voluntary response bias: the people who choose to respond tend to be those who are passionate about the topic, and they do not represent the general public. • Inaccuracies in recording the data values result in measurement error.

  28. Variable and data types Variables (and hence data) can be of two types: • Qualitative or categorical, • Quantitative or numerical. • A qualitative or categorical variable usually cannot be measured on a numerical scale; it simply records a quality. One may use numbers to code the values of a qualitative variable, but those numbers are arbitrary. • A quantitative or numerical variable takes naturally numerical values, for which arithmetic operations, such as averaging, make sense. Caution! There are some numerical data, such as phone numbers, order numbers, zip codes etc., which are not variables but identifiers. Though often numerical, they serve to identify or keep track of individuals/cases. Summing or averaging those numbers means nothing.
