Measures of centrality
Last lecture summary • Which graphs did we meet? • scatter plot (bodový graf) • bar chart (sloupcový graf) • histogram • pie chart (koláčový graf) • How do they work, what are their advantages and/or disadvantages?
SDA women – histogram of heights 2014 n = 48 or N = 48 bin size = 3.8
Distributions • roughly symmetric, e.g., body height • negatively skewed (skewed to the left), e.g., life expectancy • positively skewed (skewed to the right), e.g., income http://turnthewheel.org/free-textbooks/street-smart-stats/
Statistics is beautiful • New stuff
Life expectancy data • Watch TED talk by Hans Rosling, Gapminder Foundation: http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html
Simpson’s paradox • UC Berkeley • Though the data are fake, the paradox is the same www.udacity.com – Introduction to statistics
Male www.udacity.com – Introduction to statistics
Female www.udacity.com – Introduction to statistics
Gender bias What do you think, is there a gender bias? Who do you think is favored? Male or female? www.udacity.com – Introduction to statistics
Gender bias male female www.udacity.com – Introduction to statistics
Statistics is ambiguous • This example illustrates how ambiguous statistics can be. • How you choose to graph your data may strongly influence what people believe to be the case. “I never believe in statistics I didn’t doctor myself.” Who said that? The quote is commonly attributed to Winston Churchill. www.udacity.com – Introduction to statistics
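The paradox can be reproduced with a small admissions table. The sketch below is only an illustration: the numbers, the two majors "A" and "B", and the helper names are assumptions made up for this example, not the data shown on the slides.

```python
# Hypothetical admissions table: (applicants, admitted) per gender and major.
# The numbers are invented for illustration; they are not the slide data.
admissions = {
    "male": {"A": (900, 450), "B": (100, 10)},
    "female": {"A": (100, 80), "B": (900, 180)},
}


def rate_per_major(gender, major):
    """Admission rate of one gender within one major."""
    applied, admitted = admissions[gender][major]
    return admitted / applied


def overall_rate(gender):
    """Admission rate of one gender with both majors pooled together."""
    applied = sum(a for a, _ in admissions[gender].values())
    admitted = sum(ad for _, ad in admissions[gender].values())
    return admitted / applied


for gender in admissions:
    per_major = {m: rate_per_major(gender, m) for m in admissions[gender]}
    print(gender, per_major, "overall:", overall_rate(gender))
```

In this made-up table women have the higher admission rate within each major, yet men have the higher rate once the majors are pooled, because most women applied to the major that admits few applicants.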
What is statistics? • Statistics – the science of collecting, organizing, summarizing, analyzing and interpreting data • Goal – use imperfect information (our data) to infer facts, make predictions, and make decisions • Descriptive statistics – describing and summarizing data with numbers or pictures • Inferential statistics – drawing conclusions or making decisions based on data
Variables • variable – a value or characteristic that can vary from individual to individual • example: favorite color, age • How are variables classified? • quantitative variable – numerical values, often with units of measurement, answering a how much/how many question, example: age, annual income, number of children • continuous (spojitá proměnná), example: height, weight • discrete (diskrétní proměnná), example: number of children • continuous variables can be discretized
Variables • categorical (qualitative) variables • categories that have no particular order • example: favorite color, gender, nationality • ordinal • they are not numerical but their values have a natural order • example: temperature low/medium/high
Variables • variable (proměnná) • quantitative (kvantitativní): continuous (spojitá), discrete (diskrétní) • categorical (kategorická) • ordinal (ordinální)
Choosing a profession • Chemistry: 50 000 – 60 000 • Geography: 40 000 – 55 000 www.udacity.com – Statistics
Choosing a profession • We made an interval estimate. • But ideally we want one number that describes the entire dataset. This allows us to quickly summarize all our data. www.udacity.com – Statistics
Choosing a profession • The value at which the frequency is highest. • The value at which the frequency is lowest. • The value in the middle. • The biggest value on the x-axis. • The mean. Geography Chemistry www.udacity.com – Statistics
Three big M’s • The value at which the frequency is highest is called the mode, i.e., the most common value is the mode. • The value in the middle of the distribution is called the median. • The mean is the average (the two terms are synonyms). Geography Chemistry www.udacity.com – Statistics
Quick quiz • What is the mode in our data? 2 5 6 5 2 6 9 8 5 2 3 5 www.udacity.com – Statistics
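As a quick check of the three M's on the quiz data, here is a minimal sketch using Python's standard `statistics` module. The variable names are chosen for this example only.

```python
import statistics

# Quiz data from the slide.
data = [2, 5, 6, 5, 2, 6, 9, 8, 5, 2, 3, 5]

mode = statistics.mode(data)      # most common value
median = statistics.median(data)  # middle of the sorted data
mean = statistics.mean(data)      # sum of the values divided by their count

print("mode:", mode)
print("median:", median)
print("mean:", mean)
```

The value 5 occurs four times, more often than any other value, so it is the mode. With twelve values the median is the average of the sixth and seventh sorted values, which are both 5.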
Mode in negatively skewed distribution www.udacity.com – Statistics
Mode in uniform distribution www.udacity.com – Statistics
Multimodal distribution www.udacity.com – Statistics
Mode in categorical data www.udacity.com – Statistics
More on the mode True or False? • 1. The mode can be used to describe any type of data we have, whether it’s numerical or categorical. • 2. All scores in the dataset affect the mode. • 3. If we take a lot of samples from the same population, the mode will be the same in each sample. • 4. There is an equation for the mode. • Ad 3: • http://onlinestatbook.com/stat_sim/sampling_dist/ • http://www.shodor.org/interactivate/activities/Histogram/ – the mode changes as you change the bin size. • Because statement 3 is false, we can’t use the mode to learn something about our population. The mode depends on how you present the data. www.udacity.com – Statistics
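The dependence of the mode on the bin size can be sketched in a few lines. The data set and the helper `modal_bin` below are hypothetical and serve only to illustrate the point; bins are assumed to start at 0.

```python
from collections import Counter


def modal_bin(values, width):
    """Return the (lower, upper) edges of the most populated bin.

    Bins are assumed to start at 0 and to have the given width.
    """
    counts = Counter(int(v // width) for v in values)
    index, _ = counts.most_common(1)[0]
    return (index * width, (index + 1) * width)


# Made-up data, not taken from the slides.
data = [0, 1, 2, 3, 4, 6, 6, 7, 7]

print(modal_bin(data, 2))  # modal bin with bin width 2
print(modal_bin(data, 5))  # modal bin with bin width 5
```

With a bin width of 2 the fullest bin covers 6 to 8, while with a bin width of 5 it covers 0 to 5, so the same data yield a different mode depending on how they are binned.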
Life expectancy data www.coursera.org – Statistics: Making Sense of Data
Minimum minimum = 47.8 Sierra Leone www.coursera.org – Statistics: Making Sense of Data
Maximum maximum = 83.4 Japan www.coursera.org – Statistics: Making Sense of Data
Life expectancy data all countries www.coursera.org – Statistics: Making Sense of Data
Life expectancy data • median = 73.2 (Egypt), rank 99 of 197 • half of the countries have a smaller value, half a larger one www.coursera.org – Statistics: Making Sense of Data
Life expectancy data • Maximum = 83.4 • Median = 73.2 • Minimum = 47.8 www.coursera.org – Statistics: Making Sense of Data
Q1 • 1st quartile = 64.7 (Sao Tomé & Príncipe), rank 50 of 197 (¼ of the way) www.coursera.org – Statistics: Making Sense of Data
Q1 • 1st quartile = 64.7 • ¼ of the countries have a smaller value, ¾ a larger one www.coursera.org – Statistics: Making Sense of Data
Q3 • 3rd quartile = 76.7 (Netherlands Antilles), rank 148 of 197 (¾ of the way) www.coursera.org – Statistics: Making Sense of Data
Q3 • 3rd quartile = 76.7 • ¾ of the countries have a smaller value, ¼ a larger one www.coursera.org – Statistics: Making Sense of Data
Life expectancy data • Maximum = 83.4 • 3rd quartile = 76.7 • Median = 73.2 • 1st quartile = 64.7 • Minimum = 47.8 www.coursera.org – Statistics: Making Sense of Data
Box Plot www.coursera.org – Statistics: Making Sense of Data
Box plot • maximum • 3rd quartile • median • 1st quartile • minimum
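Modified box plot • IQR = interquartile range (3rd quartile minus 1st quartile) • values more than 1.5 × IQR below the 1st quartile or above the 3rd quartile are drawn as outliers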
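The 1.5 × IQR rule can be sketched as follows. The data set is made up for illustration, the helper name `iqr_outliers` is hypothetical, and the quartiles use the "inclusive" interpolation of Python's `statistics.quantiles`, which is one of several common conventions.

```python
import statistics


def iqr_outliers(values):
    """Return (lower fence, upper fence, outliers) for the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Made-up data with one unusually small and one unusually large value.
data = [40, 69, 74, 76, 79, 87, 88, 90, 130]

lower, upper, outliers = iqr_outliers(data)
print("fences:", lower, upper)
print("outliers:", outliers)
```

Here the quartiles are 74 and 88, so the IQR is 14 and the fences lie at 53 and 109. The values 40 and 130 fall outside the fences and would be plotted as individual points.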
Quartiles, median – how to do it? Find min, max, median, Q1, Q3 in these data. Then, draw the box plot. 79, 68, 88, 69, 90, 74, 87, 93, 76 www.coursera.org – Statistics: Making Sense of Data
Another example • Data: 78, 93, 68, 84, 90, 74 • Min. = 68.00, 1st Qu. = 75.00, Median = 81.00, 3rd Qu. = 88.50, Max. = 93.00
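The five numbers behind a box plot can be computed with a short sketch like the one below. It assumes the "inclusive" quartile convention of Python's `statistics.quantiles`, which reproduces the summary listed for this example; other textbook conventions (for instance, taking the medians of the lower and upper halves) can give slightly different quartiles. The function name is hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max of the data as a dict."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


# Data from the "Another example" slide.
example = [78, 93, 68, 84, 90, 74]
summary = five_number_summary(example)
print(summary)

# Data from the "how to do it?" exercise slide.
exercise = [79, 68, 88, 69, 90, 74, 87, 93, 76]
exercise_summary = five_number_summary(exercise)
print(exercise_summary)
```

For the six-value example this gives 68, 75, 81, 88.5 and 93. For the nine-value exercise the same convention gives 68, 74, 79, 88 and 93.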
Percentiles • axis label: age [years] • http://www.rustovyhormon.cz/on-line-rustove-grafy
3rd M – Mean • Another measure of the center of the data: mean (average) • Data values: $x_1, x_2, \ldots, x_n$ • Mathematical notation: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • $\Sigma$ … Greek letter capital sigma • means SUM in mathematics
Robust statistic • Salaries of 25 players of the American soccer team New York Red Bulls in 2012. • median = 112 495 • mean = 518 311 • The mean is not a robust statistic. • The median is a robust statistic.
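Why the mean reacts to a single extreme value while the median barely moves can be seen in a small sketch. The salaries below are invented for illustration (in thousands) and are not the players' actual salaries.

```python
import statistics

# Made-up salaries in thousands; not the real team data.
regular_players = [40, 45, 50, 55, 60]
with_star = regular_players + [5000]  # add one very highly paid player

mean_before = statistics.mean(regular_players)
median_before = statistics.median(regular_players)
mean_after = statistics.mean(with_star)
median_after = statistics.median(with_star)

print("without star: mean", mean_before, "median", median_before)
print("with star:    mean", mean_after, "median", median_after)
```

Adding one extreme salary moves the mean from 50 to 875, while the median only shifts from 50 to 52.5. This is the same pattern as on the slide, where the mean is several times larger than the median.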