The term "statistics" has evolved over time, originally used for societal data and now as a mathematical tool for collecting, organizing, and interpreting numerical information. This article explores the historical and modern meanings of statistics, covering its application in various fields such as economics, sociology, genetics, and more. It delves into basic statistical terms, biases, and popular culture perceptions of statistics, including famous quotes. Gain insights into the importance and challenges of statistics in both historical and contemporary contexts.
Statistics I. Tamás Dusek Széchenyi István University 2016
Historical meaning of statistics • The term "statistics" has many different shades of meaning • In the older, original sense of the word (18th century meaning), statistics was used for any descriptive information about the state of society • By the 18th century, the term "statistics" designated the systematic collection of demographic and economic data by states • Today it is also used for descriptive data which have a quantitative nature and a numerical form • In this sense statistics is a method of historical research: it is a description in numerical terms of historical events that happened in a definite period of time with definite groups of people in a definite geographical area.
Modern meaning of statistics • The previous meaning has nothing in common with its modern natural science meaning • Accordingly statistics deals with mass phenomena and it enables us to analyze systems with very large numbers of particles • In the field of natural sciences, statistics is a method of inductive research. To take an example: quantum mechanics deals with the fact that we do not know how a particle will behave in an individual instance. But we know what pattern of behavior can possibly occur and the proportion in which these patterns really occur.
Modern meaning of statistics Meaning I.: • Statistics is the mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling • Classification and interpretation of quantitative data in accordance with probability theory and the application of methods such as hypothesis testing to them • The mathematical study of the theoretical nature of such distributions and tests. Meaning II.: • quantitative data on any subject
Uses of Statistics Almost all fields of study benefit from the application of statistical methods: economics, sociology, genetics, insurance, biology, criminology, polling, retirement planning, automobile fatality rates, and many more, too numerous to mention. Statistics is objective; the interpretation of statistics is not entirely objective.
Statistics is the science of collecting, organizing, summarising, analysing, and making inferences from data. Descriptive statistics: collecting, organizing, summarising, analysing, and presenting data. Inferential statistics: making inferences, testing hypotheses, determining relationships, and making predictions.
The image of statistics in pop culture is often negative, based on misunderstandings, mistakes or jokes
Some famous antistatistician quotations • "I only believe in statistics that I doctored myself." (Churchill) • "I never believe in statistics if I didn't make them myself." (Churchill) • "There are three kinds of lies: lies, damned lies, and statistics." (origin is uncertain; attributed to Disraeli, but popularised by Mark Twain) • Statistics is a precise and logical method for stating a half truth inaccurately. • It is proven that the celebration of birthdays is healthy. Statistics show that those people who celebrate the most birthdays become the oldest.
Basic Terms Population: A collection, or set, of individuals or objects or events whose properties are to be analyzed. Two kinds of populations: finite or infinite. Sample: A subset of the population.
Variable: A characteristic about each individual element of a population or sample. Observational unit: the individual entities whose characteristics are measured Data (singular): The value of the variable associated with one element of a population or sample. This value may be a number, a word, or a symbol. Data (plural): The set of values collected for the variable from each of the elements belonging to the sample. Experiment: A planned activity whose results yield a set of data. Parameter: A numerical value summarizing all the data of an entire population. Statistic: A numerical value summarizing the sample data.
Example: A college dean is interested in learning about the average age of faculty. Identify the basic terms in this situation. The observational unit is an individual faculty member. The population is the age of all faculty members at the college. A sample is any subset of that population. For example, we might select 10 faculty members and determine their age. The variable is the "age" of each faculty member. One data value would be the age of a specific faculty member. The data would be the set of values in the sample. The experiment would be the method used to select the ages forming the sample and determining the actual age of each faculty member in the sample. The parameter of interest is the "average" age of all faculty at the college. The statistic is the "average" age for all faculty in the sample.
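The distinction between parameter and statistic in the dean's example can be sketched in Python. The ages below are invented purely for illustration (the slides give no data), and the names `population_ages` and `sample_ages` are hypothetical.

```python
import random
import statistics

# Hypothetical population: ages of all 40 faculty members (invented data).
random.seed(2016)
population_ages = [random.randint(28, 68) for _ in range(40)]

# Parameter: a numerical value summarizing the entire population.
parameter_mean_age = statistics.mean(population_ages)

# Sample: a subset of the population, here 10 faculty members chosen at random.
sample_ages = random.sample(population_ages, 10)

# Statistic: the same summary, computed from the sample data only.
statistic_mean_age = statistics.mean(sample_ages)

print(f"Parameter (mean age of all {len(population_ages)} faculty): {parameter_mean_age:.1f}")
print(f"Statistic (mean age of the {len(sample_ages)} sampled): {statistic_mean_age:.1f}")
```

The statistic will generally differ from the parameter, and it changes from sample to sample, while the parameter is fixed for a given population.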
Two kinds of variables: Qualitative, or Attribute, or Categorical, Variable: A variable that categorizes or describes an element of a population. Note: Arithmetic operations, such as addition and averaging, are not meaningful for data resulting from a qualitative variable. Quantitative, or Numerical, Variable: A variable that quantifies an element of a population. Note: Arithmetic operations such as addition and averaging, are meaningful for data resulting from a quantitative variable.
Example: Identify each of the following examples as attribute (qualitative) or numerical (quantitative) variables. 1. The residence hall for each student in a statistics class. (Attribute) 2. The amount of gasoline pumped by the next 10 customers at a MOL gasoline station. (Numerical) 3. The amount of radon in the basement of each of 25 homes in a new development. (Numerical) 4. The color of the baseball cap worn by each of 20 students. (Attribute) 5. The length of time to complete a mathematics homework assignment. (Numerical) 6. The state in which each truck is registered when stopped and inspected at a weigh station. (Attribute)
Qualitative and quantitative variables may be further subdivided Variables • Quantitative • Discrete (counting) • Continuous (measurement) • Qualitative • Ordinal • Categorical/Attribute
Nominal Variable: A qualitative variable that categorizes (or describes, or names) an element of a population. Ordinal Variable: A qualitative variable that incorporates an ordered position, or ranking. Discrete Variable: A quantitative variable that can assume a countable number of values. Intuitively, a discrete variable can assume values corresponding to isolated points along a line interval. That is, there is a gap between any two values. Continuous Variable: A quantitative variable that can assume an uncountable number of values. Intuitively, a continuous variable can assume any value along a line interval, including every possible value between any two values.
Note: 1. In many cases, discrete and continuous variables may be distinguished by determining whether the variable is related to a count or a measurement. • Discrete variables are usually associated with counting. If the variable cannot be further subdivided, it is a clue that you are probably dealing with a discrete variable. • Continuous variables are usually associated with measurements. The values of continuous variables are limited only by your ability to measure them. • Continuous variables are often recorded as discrete variables.
Example Discrete The number of eggs that hens lay; for example, 3 eggs a day. The number of cars in a parking lot. Number of the inhabitants of a town. Continuous The amounts of milk that cows produce; for example, 8.343115 liter a day. The temperature. Age of a person.
Example: Identify each of the following as examples of qualitative or numerical variables: 1. The temperature in Győr, Hungary at 12:00 pm on any given day. 2. Whether or not a 6 volt lantern battery is defective. 3. The weight of a lead pencil. 4. The length of time billed for a long distance telephone call. 5. The brand of cereal children eat for breakfast. 6. The type of book taken out of the library by an adult.
Levels of measurement 1 Nominal 1A Coding 1B Qualitative data, categorical data (gender, nationality, ethnicity, language, genre, style, biological species) 2 Ordinal – rank order 3 Interval - degree of difference; however zero is arbitrary 4 Ratio 4A continuous quantity with true zero 4B discrete quantity
Importance of the levels of measurement • Helps you decide what statistical analysis is appropriate on the values that were assigned • Helps you decide how to interpret the data from that variable Dangers to Avoid • Attaching unwarranted significance to aspects of the numbers that do not convey meaningful information • Failing to simplify data when we could easily do so • Manipulating our data in ways that destroy information • Performing meaningless statistical operations on the data
Nominal and ordinal measurement • Nominal measurement: not measurement in the everyday sense of the word; the value does not imply any ordering of the cases. For example, shirt numbers in football: even though player 17 has a higher number than player 7, you can't say from the data that he is greater than or more than the other. • Ordinal measurement: attributes can be rank-ordered, but distances between attributes do not have any meaning. For example, the distance between the winner of a sport competition and the second one, and between the second and third one.
The Hierarchy of Levels Ratio Absolute zero Interval Distance is meaningful Ordinal Attributes can be ordered Nominal Attributes are only named; weakest
Types of data • Nominal and ordinal are qualitative (categorical) levels of measurement. • Interval and ratio are quantitative levels of measurement.
VARIABLES
  QUANTITATIVE
    RATIO: pulse rate, height
    INTERVAL: 36°-38 °C
  QUALITATIVE
    ORDINAL: social class
    NOMINAL: gender, ethnicity
Example: Identify each of the following as examples of (1) nominal, (2) ordinal, (3) discrete, or (4) continuous variables: 1. The length of time until a pain reliever begins to work. 2. The number of chocolate chips in a cookie. 3. The number of colors used in a statistics textbook. 4. The brand of refrigerator in a home. 5. The overall satisfaction rating of a new car. 6. The number of files on a computer’s hard disk. 7. The pH level of the water in a swimming pool. 8. The number of staples in a stapler.
Measure and Variability • No matter what the response variable: there will always be variability in the data. • One of the primary objectives of statistics: measuring and characterizing variability. • Controlling (or reducing) variability in a manufacturing process: statistical process control.
Methods used to collect data Census: A 100% survey. Every element of the population is listed. Seldom used: difficult and time-consuming to compile, and expensive. Survey: Data are obtained by sampling some of the population of interest. The investigator does not modify the environment. Experiment: The investigator controls or modifies the environment and observes the effect on the variable under study. Administrative resources: The source of the data is an administrative activity. Other
Surveys Surveys may be administered in a variety of ways, e.g. • Personal Interview, • Telephone Interview, • Self Administered Questionnaire, and • Internet. Questionnaire design principles: • Keep the questionnaire as short as possible. • Ask short, simple, and clearly worded questions. • Start with demographic questions to help respondents get started comfortably. • Use dichotomous (yes/no) and multiple choice questions. • Use open-ended questions cautiously. • Avoid using leading questions. • Pretest a questionnaire on a small number of people. • Think about the way you intend to use the collected data when preparing the questionnaire.
Not everything that counts can be counted 5 (Quantity) Happy (Quality) Kids
Univariate descriptive statistics • After collecting data, the first task is to organize and simplify the data so that it is possible to get a general overview of the results. • This is the goal of descriptive statistical techniques. • One method for simplifying and organizing data is to present them in graphical way
Graphical presentation Graphs and statistics are often used to persuade. Advertisers and others may accidentally or intentionally present information in a misleading way. For example, art is often used to make a graph more interesting, but it can distort the relationships in the data. Questions to Ask When Looking at Data and/or Graphs: • Is the information presented correctly? • Is the graph trying to influence you? • Does the scale use a regular interval? • What impression is the graph giving you?
Pie charts and bar graphs • Both are used for categorical variables • Pie charts show the amount of data that belongs to each category as a proportional part of a circle • Bar graphs show the amount of data that belongs to each category as proportionally sized rectangular areas
• Example: The table below lists the number of automobiles sold last week by day for a local dealership. • Describe the data using a pie chart (circle graph) and a bar graph
Day         Number Sold
Monday      15
Tuesday     23
Wednesday   35
Thursday    11
Friday      12
Saturday    42
Automobiles Sold Last Week Pie chart
Automobiles Sold Last Week Frequency Bar graph
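As a rough sketch of what the two charts encode, the following Python snippet takes the automobile sales table and computes each day's share of the circle for the pie chart, then prints a plain-text bar graph. The text rendering is only an illustrative stand-in for the original figures.

```python
# Automobiles sold last week, by day (from the table in the example).
sales = {
    "Monday": 15, "Tuesday": 23, "Wednesday": 35,
    "Thursday": 11, "Friday": 12, "Saturday": 42,
}
total = sum(sales.values())

# Pie chart: each slice's central angle is proportional to its share of the total.
pie = {day: (n / total * 100, n / total * 360) for day, n in sales.items()}

for day, n in sales.items():
    share, angle = pie[day]
    bar = "#" * n  # bar graph: bar length proportional to the count
    print(f"{day:<10}{n:>3} {share:5.1f}% {angle:6.1f} deg  {bar}")
```

Both charts carry the same information; the pie chart emphasizes each day's part of the whole, while the bar graph makes it easier to compare days with one another.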
Pareto Diagram • Pareto Diagram: A bar graph with the bars arranged from the most numerous category to the least numerous category. It includes a line graph displaying the cumulative percentages and counts for the bars. • Used to identify the number and type of defects that happen within a product or service • Separates the "vital few" from the "trivial many" • The Pareto diagram is often used in quality control applications
Pareto diagram example The final daily inspection defect report for a cabinet manufacturer is given in the table below:
Defect    Number
Dent      5
Stain     12
Blemish   43
Chip      25
Scratch   40
Others    10
Daily Defect Inspection Report
1) Pareto diagram (bars: count per defect; line: cumulative percent)
Defect    Count   Percent   Cum%
Blemish   43      31.9      31.9
Scratch   40      29.6      61.5
Chip      25      18.5      80.0
Stain     12      8.9       88.9
Others    10      7.4       96.3
Dent      5       3.7       100.0
2) The production line should try to eliminate blemishes and scratches. This would cut defects by more than 50%.
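The count, percent and cumulative percent figures of the defect report can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic behind a Pareto diagram, not a plotting routine.

```python
# Defect counts from the daily inspection report.
defects = {"Dent": 5, "Stain": 12, "Blemish": 43, "Chip": 25, "Scratch": 40, "Others": 10}

# Arrange categories from the most numerous to the least numerous.
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)
total = sum(defects.values())

rows = []
running = 0
for name, count in ordered:
    running += count  # cumulative count up to and including this category
    percent = round(count / total * 100, 1)
    cum_percent = round(running / total * 100, 1)
    rows.append((name, count, percent, cum_percent))

print(f"{'Defect':<10}{'Count':>6}{'Percent':>9}{'Cum%':>7}")
for name, count, percent, cum_percent in rows:
    print(f"{name:<10}{count:>6}{percent:>9}{cum_percent:>7}")
```

The cumulative column shows directly that the first two categories, blemishes and scratches, account for 61.5% of all defects.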
Frequency distributions and histograms Frequency distributions and histograms are used to summarize large data sets Used for quantitative variables Frequency Distribution: A listing, often expressed in chart form, that pairs each value of a variable with its frequency Ungrouped Frequency Distribution: Each value of x in the distribution stands alone Grouped Frequency Distribution: Group the values into a set of classes 1. A table that summarizes data by classes, or class intervals 2. In a typical grouped frequency distribution, there are usually 5-12 classes of equal width 3. The table may contain columns for class number, class interval, tally (if constructing by hand), frequency, relative frequency, cumulative relative frequency, and class midpoint 4. In an ungrouped frequency distribution each class consists of a single value
Guidelines for constructing a frequency distribution 1. All classes should be of the same width. In the case of very uneven distribution of the data or outliers, class width can be different. 2. Classes should be set up so that they do not overlap and so that each piece of data belongs to exactly one class 3. For problems in the text, 5-12 classes are most desirable. The square root of n is a reasonable guideline for the number of classes if n is less than 150. 4. Use a system that takes advantage of a number pattern, to guarantee accuracy 5. If possible, an even class width is often advantageous
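A possible way to apply these guidelines is sketched below in Python. The 30 scores are invented for illustration, and the way the class width is chosen (range divided by the number of classes, rounded up) is one reasonable assumption rather than a rule stated in the guidelines.

```python
import math

# Hypothetical data set (invented for illustration): 30 exam scores.
data = [52, 55, 58, 61, 63, 64, 66, 67, 68, 70,
        71, 72, 72, 74, 75, 76, 77, 78, 79, 80,
        81, 82, 84, 85, 87, 88, 90, 92, 95, 99]

n = len(data)
num_classes = round(math.sqrt(n))  # square-root guideline for n < 150
low, high = min(data), max(data)

# Equal class width, rounded up so that the classes cover the whole range.
width = math.ceil((high - low + 1) / num_classes)

# Each value falls into exactly one class: low + i*width up to low + (i+1)*width.
freq = [0] * num_classes
for x in data:
    freq[(x - low) // width] += 1

for i, f in enumerate(freq):
    start = low + i * width
    print(f"{start} up to {start + width}: {f}")
```

Because the classes are half-open ("a up to b" excludes b), they do not overlap and every value belongs to exactly one class.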
Histogram Histogram: A bar graph representing a frequency distribution of a quantitative variable. A histogram is made up of the following components: 1. A title, which identifies the population of interest 2. A vertical scale, which identifies the frequencies in the various classes 3. A horizontal scale, which identifies the variable x. Values for the class boundaries or class midpoints may be labeled along the x-axis. Use whichever method of labeling the axis best presents the variable. Notes: • The relative frequency is sometimes used on the vertical scale • It is possible to create a histogram based on class midpoints
Example: A recent survey of Roman Catholic nuns summarized their ages in the table below.
Age           Frequency   Class Midpoint
----------------------------------------
20 up to 30   34          25
30 up to 40   58          35
40 up to 50   76          45
50 up to 60   187         55
60 up to 70   254         65
70 up to 80   241         75
80 up to 90   147         85
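The remaining columns of a grouped frequency distribution (class midpoint, relative frequency, cumulative relative frequency) follow from the age table by simple arithmetic. The Python sketch below computes them and prints a rough text histogram; the scale of one "#" per 10 nuns is an arbitrary choice made for display.

```python
# Grouped frequency distribution from the survey: (lower, upper, frequency).
classes = [(20, 30, 34), (30, 40, 58), (40, 50, 76), (50, 60, 187),
           (60, 70, 254), (70, 80, 241), (80, 90, 147)]

total = sum(f for _, _, f in classes)

table = []
cumulative = 0
for lower, upper, f in classes:
    cumulative += f
    midpoint = (lower + upper) / 2
    table.append((f"{lower} up to {upper}", midpoint, f, f / total, cumulative / total))

for label, midpoint, f, rel, cum_rel in table:
    bar = "#" * round(f / 10)  # text histogram, one '#' per 10 nuns (arbitrary scale)
    print(f"{label:<12} mid={midpoint:4.0f} f={f:>3} rel={rel:.3f} cum={cum_rel:.3f} {bar}")
```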