Correlation (Lírios, Vincent van Gogh, 1889)
Iris data: setosa, virginica, versicolor
Fisher’s iris data

       S.Length  S.Width  P.Length  P.Width  Species
    1       5.1      3.5       1.4      0.2  setosa
    2       4.9      3.0       1.4      0.2  setosa
    …
   49       5.3      3.7       1.5      0.2  setosa
   50       5.0      3.3       1.4      0.2  setosa
   51       7.0      3.2       4.7      1.4  versicolor
   52       6.4      3.2       4.5      1.5  versicolor
    …
   99       6.2      2.9       4.3      1.3  versicolor
  100       5.7      2.8       4.1      1.3  versicolor
  101       6.3      3.3       6.0      2.5  virginica
    …
  150       5.9      3.0       5.1      1.8  virginica
Scatter-plot matrix (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
Scatter plot (by group) and trendlines: Sepal.Width against Sepal.Length for setosa, virginica, and versicolor
Scatter plot for setosa of iris data: Sepal.Width against Sepal.Length
Three scatter plots: no apparent relationship, negative relationship, positive relationship. How to quantify the relationship?
Count pairs (three scatter plots, axes from 30 to 90).
Scale matters and needs to be considered (two scatter plots: x from 300 to 900, and x from 30 to 90).
Maximize the sum of products of each pair.
  -10, -2,  3,  5,  7,  9
    5, -7, 10, -3,  8,  5
Positively matched:
  -10, -2,  3,  5,  7,  9
   -7, -3,  5,  5,  8, 10
Negatively matched:
  -10, -2,  3,  5,  7,  9
   10,  8,  5,  5, -3, -7
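The matching idea can be checked numerically. The sketch below is only an illustration: the helper name `sum_of_products` is made up here, and the brute-force search over all orderings is just one way to confirm that the sorted pairings give the largest and smallest sums.

```python
from itertools import permutations

# The two sequences from the slide.
x = [-10, -2, 3, 5, 7, 9]
y = [5, -7, 10, -3, 8, 5]


def sum_of_products(a, b):
    """Sum of the pairwise products a[0]*b[0] + a[1]*b[1] + ... (hypothetical helper)."""
    return sum(ai * bi for ai, bi in zip(a, b))


x_sorted = sorted(x)

# Pairing as given, then both sorted ascending, then one ascending and one descending.
as_given = sum_of_products(x, y)
positively_matched = sum_of_products(x_sorted, sorted(y))
negatively_matched = sum_of_products(x_sorted, sorted(y, reverse=True))

# Brute force over all 720 orderings of y to confirm the extremes.
all_sums = [sum_of_products(x_sorted, p) for p in permutations(y)]

print(as_given, positively_matched, negatively_matched)  # 80 262 -160
print(max(all_sums), min(all_sums))                      # 262 -160
```

Sorting both sequences the same way gives the maximum sum of products, and sorting them in opposite directions gives the minimum.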
Three scatter plots, labeled -, 0, and +.
Cauchy-Schwarz inequality: the correlation lies between -1 and +1. Near -1: (very strong) negative linear relationship. Near +1: (very strong) positive linear relationship.
Sample version Population version
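For reference, the standard textbook definitions of the correlation coefficient are sketched below. The notation (r, ρ, x̄, ȳ, μ, σ) is an assumption and may differ from the symbols used on the slides.

```latex
% Sample version, for pairs (x_i, y_i), i = 1, ..., n
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

% Population version
\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y},
\qquad
\operatorname{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right]

% Cauchy-Schwarz, applied to a_i = x_i - \bar{x} and b_i = y_i - \bar{y}
\left( \sum_{i=1}^{n} a_i b_i \right)^2
  \le \left( \sum_{i=1}^{n} a_i^2 \right) \left( \sum_{i=1}^{n} b_i^2 \right)
\quad \Longrightarrow \quad -1 \le r \le 1
```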
Population covariance Exercise
Covariance is a measure of linear association between two variables. Covariance is not a measure of curved association. (Scatter plot of y against x.)
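A small made-up example shows why covariance misses curved association. The data here (y = x² on a symmetric grid) and the helper names are assumptions chosen for illustration, not the data behind the slide's plot.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance with the usual n - 1 denominator (hypothetical helper)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


# y is completely determined by x, but the relationship is curved, not linear.
x_curved = [-3, -2, -1, 0, 1, 2, 3]
y_curved = [xi ** 2 for xi in x_curved]

cov_curved = covariance(x_curved, y_curved)
print(cov_curved)  # 0.0
```

The positive and negative products cancel exactly, so the covariance is zero even though y depends on x perfectly.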
Covariance may be any real value, but correlation always lies in [-1, 1]. Covariance is affected by the scales of the variables, but correlation is not, except for the sign of the scale factor.
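The effect of rescaling can be sketched with a few made-up points. The data values and helper names below are assumptions for illustration only.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance with the n - 1 denominator (hypothetical helper)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def correlation(a, b):
    """Sample correlation: covariance divided by both standard deviations."""
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))


# Made-up illustrative data.
x = [30, 45, 50, 60, 75, 90]
y = [35, 40, 60, 55, 80, 85]

x_times_10 = [10 * xi for xi in x]  # change of scale
x_negated = [-xi for xi in x]       # change of sign

print(covariance(x, y), covariance(x_times_10, y))    # covariance grows tenfold
print(correlation(x, y), correlation(x_times_10, y))  # correlation is unchanged
print(correlation(x_negated, y))                      # same size, opposite sign
```

Multiplying x by 10 multiplies the covariance by 10 and leaves the correlation alone, while multiplying by -1 flips the sign of both.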
Two scatter plots (x from 300 to 900, and x from 30 to 90). Covariance = 189, Correlation = 0.78. Covariance = ? Correlation = ?
Two scatter plots (axes from 30 to 90; and x from -90 to -60 with y from 60 to 120). Covariance = 189, Correlation = 0.78. Covariance = ? Correlation = ?
Grouped by zip code: pooling groups that each have a positive correlation does not necessarily give a positive correlation.
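The pooling effect can be reproduced with a small constructed example. The three groups below are made up for illustration (they are not the zip-code data from the slide), and the helper names are assumptions.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance with the n - 1 denominator (hypothetical helper)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def correlation(a, b):
    """Sample correlation: covariance divided by both standard deviations."""
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))


# Within each group y rises with x, but the groups themselves are arranged
# so that a group with larger x values has smaller y values.
groups = []
for k in range(3):
    gx = [10 * k + j for j in range(4)]
    gy = [-10 * k + j for j in range(4)]
    groups.append((gx, gy))

within = [correlation(gx, gy) for gx, gy in groups]

pooled_x = [v for gx, _ in groups for v in gx]
pooled_y = [v for _, gy in groups for v in gy]
pooled = correlation(pooled_x, pooled_y)

print(within)  # each within-group correlation is positive
print(pooled)  # the pooled correlation is negative
```

Every group has a positive correlation on its own, yet the pooled data show a negative one because the between-group trend dominates.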
Correlation is a measure of linear association, not of causation. A high correlation does not mean that one variable is the cause of the other.
Correlation and causality. APT prices in Seoul: the more Starbucks, the higher the APT price! The more STBK stores, the more the APT price will increase?
STBK: number of Starbucks stores. APT price: average APT price per 1 m². The more Starbucks, the deeper the financial crisis!