EART20170 Computing, Data Analysis & Communication skills

EART20170 Computing, Data Analysis & Communication skills • Lecturer: Dr Paul Connolly (F18, Sackville Building), p.connolly@manchester.ac.uk • 1. Data analysis (statistics): 3 lectures & practicals; statistics open-book test (2 hours) • 2. Computing (Excel statistics/modelling): 2 lectures; assessed practical work • Course notes etc.: http://cloudbase.phy.umist.ac.uk/people/connolly • Recommended reading: Cheeney (1983) Statistical Methods in Geology. George Allen & Unwin

Recap – last lecture • The four measurement scales: nominal, ordinal, interval and ratio. • There are two types of errors: random errors (precision) and systematic errors (accuracy). • Basic graphs: histograms, frequency polygons, bar charts, pie charts. • Gaussian statistics describe random errors. • The central limit theorem • Central values, dispersion, symmetry • Weighted mean.

Some common problems

Use tables

Lecture 2 • Correlation between two variables • Classical linear regression • Reduced major axis regression • Propagation of errors in compound quantities.

Correlation • Many real-life quantities depend on something else, e.g. the dependence of rock permeability on porosity. • How can we quantify the strength and direction of a linear relationship between X and Y variables?

Correlation •  y = sum of all y-values •  x = sum of all x-values •  x2 = sum of all x2 values •  y2 = sum of all y2 values •  xy = sum of the x times y values • Like other numerical measures, the population correlation coefficient is (the Greek letter ``rho'‘, ) and the sample correlation coefficient is denoted by r. • Linear correlation (Pearson’s coefficient)

Correlation • Values of r: • r = +1: perfect positive correlation • r = -1: perfect negative correlation • r = 0: no correlation • [Figure: three scatter plots of y against x, one for each case]

Correlation • $r^2$ is the fraction of the variation in y that is explained by the linear relationship with x. It is often called the `goodness of fit'. • E.g. if r = 0.97 is obtained then $r^2 \approx 0.94$, so 100 × 0.94 = 94% of the total variation is explained by the linear relationship, and the remaining 6% is due to "other" causes. • [Figure: $r^2$, the fraction of explained variation (0.0 to 1.0), plotted against the correlation coefficient r (+1.0 to -1.0)]

Regression analysis • How can we fit an equation to a set of numerical data x, y such that it yields the best fit for all the data?

Classical linear regression • An approximate fit yields a straight line that passes through the set of points in the best possible manner without being required to pass exactly through any of the points.

Classical linear regression • Y = mx + c • Where $e_i$ is the deviation of the data point from the fit line, c is the intercept, m is the gradient. • Assumes that the error is present only in y. • [Figure: linear regression line on axes x and y, marking the gradient m, the intercept c and a deviation $e_i$]

How do we define a good fit? • If the sum of all deviations is a minimum? $\sum e_i$ • If the sum of all the absolute deviations is a minimum? $\sum |e_i|$ • If the maximum deviation is a minimum? $e_{max}$ • If the sum of all the squares of the deviations is a minimum? $\sum e_i^2$
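One way to see why the plain sum of deviations is a poor criterion is to evaluate all four candidates for two different lines. In the Python sketch below the data and both lines are invented for illustration, and the helper names `deviations` and `fit_criteria` are hypothetical. Both lines pass through the mean point of the data, so their signed deviations cancel to zero even though one line is clearly wrong; the sum of squares tells them apart.

```python
def deviations(xs, ys, m, c):
    """Deviation e_i of each data point from the line Y = m*x + c."""
    return [y - (m * x + c) for x, y in zip(xs, ys)]


def fit_criteria(xs, ys, m, c):
    """Evaluate the four candidate 'goodness' measures for one line."""
    e = deviations(xs, ys, m, c)
    return {
        "sum_e": sum(e),                          # signed deviations cancel
        "sum_abs_e": sum(abs(ei) for ei in e),
        "max_abs_e": max(abs(ei) for ei in e),
        "sum_e2": sum(ei * ei for ei in e),       # least-squares criterion
    }


# Invented data with mean point (3.0, 6.0).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

good = fit_criteria(xs, ys, 2.0, 0.0)    # follows the trend of the data
bad = fit_criteria(xs, ys, -2.0, 12.0)   # wrong slope, same mean point

for name, result in (("good", good), ("bad", bad)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```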

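Classical linear regression • The best way is to minimise the sum of the squares of the deviations. Formally this involves some mathematics: • At each value of $x_i$ the fitted line gives: $Y_i = m x_i + c$ • Therefore the deviations from the line are: $e_i = y_i - Y_i = y_i - (m x_i + c)$ • The sum of the squares: $S = \sum e_i^2 = \sum (y_i - m x_i - c)^2$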

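Classical linear regression • How do you find the minimum of a function? • Use calculus • Differentiate the sum of the squares with respect to m and c and set each derivative to zero • Two simultaneous equations: $\sum x_i y_i = m \sum x_i^2 + c \sum x_i$ and $\sum y_i = m \sum x_i + n c$, where n is the number of data points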

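Classical linear regression • Solving the two equations yields: $m = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$ and $c = \frac{\sum y - m \sum x}{n}$, where n is the number of data points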
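The least-squares gradient and intercept can be computed directly from the same sums used for the correlation coefficient. The Python sketch below assumes the standard closed-form solution (written out in the comments); the function name `linear_regression` and the test data are illustrative only.

```python
def linear_regression(xs, ys):
    """Least-squares fit of Y = m*x + c, assuming errors in y only.

    m = (n*Sxy - Sx*Sy) / (n*Sxx - Sx**2)
    c = (Sy - m*Sx) / n
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the gradient is undefined")
    m = (n * sum_xy - sum_x * sum_y) / denominator
    c = (sum_y - m * sum_x) / n
    return m, c


# Invented points lying exactly on y = 2x + 1.
m, c = linear_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(f"gradient m = {m}, intercept c = {c}")
```

Because the example points lie exactly on a line, the fit recovers that line; with scattered data the same function returns the line that minimises the sum of squared deviations in y.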

Classical linear regression

Classical linear regression • Classical linear regression considers errors only in the y values of the data. • How can we consider errors in both x and y values? • Use reduced major axis regression

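Reduced major axis regression • Method to quantify a linear relationship where both variables are dependent and have errors • Instead of minimising $e^2 = (Y - y)^2$ we minimise $e^2 = d_y^2 + d_x^2$. • [Figure: a data point offset from the fit line by dx horizontally and dy vertically; c is the intercept]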

Reduced major axis regression
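The formula on this slide is not preserved in the transcript. As an assumption, the Python sketch below uses the commonly quoted reduced major axis result: the gradient has magnitude s_y / s_x (the ratio of the sample standard deviations), takes the sign of the correlation, and the line passes through the mean point. The function name `rma_regression` is hypothetical.

```python
import math
import statistics


def rma_regression(xs, ys):
    """Reduced major axis fit of Y = m*x + c (assumed standard form).

    |m| = s_y / s_x, sign(m) = sign of the correlation,
    c = mean(y) - m * mean(x).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    s_x = statistics.stdev(xs)
    s_y = statistics.stdev(ys)
    if s_x == 0 or s_y == 0:
        raise ValueError("x and y must both vary")
    # The sign of this sum is the sign of the correlation coefficient r.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if cross == 0:
        raise ValueError("gradient sign is undefined when r = 0")
    m = math.copysign(s_y / s_x, cross)
    c = mean_y - m * mean_x
    return m, c


# Invented data with scatter in both variables.
m, c = rma_regression([1.0, 2.0, 3.0, 4.0, 5.0], [2.3, 3.8, 6.1, 8.2, 9.6])
print(f"RMA gradient m = {m:.3f}, intercept c = {c:.3f}")
```

Unlike the classical fit, this gradient does not change its geometric meaning if x and y are swapped, which is why it suits cases where neither variable is error-free.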

Error propagation • Every measurement of a variable has an error. • Often the error quoted is one standard deviation about the mean (mean ± standard deviation) • The sample standard deviation is usually our best estimate of the population standard deviation

Error propagation • Error propagation is a way of combining two or more random errors together to get a third. The equations assume that the errors are Gaussian in nature. • It can be used when you need to measure more than one quantity to get at your final result, for example if you wanted to predict permeability from a measured porosity and grain size. The equations introduced here let you propagate the uncertainties on your data through the calculation and come up with an uncertainty on your results. • How then do we combine variables which have errors?

Error propagation (quoted) • [Table: Relationship | Error propagation (k = constant)]
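The entries of the quoted table are not preserved in this transcript. As a stand-in, the Python sketch below collects the standard propagation rules for independent Gaussian errors; these are assumed to match the table but are not taken from it, and all function names are hypothetical. Here s_x and s_y denote one-standard-deviation errors on x and y.

```python
import math


def error_sum_or_difference(s_x, s_y):
    """z = x + y or z = x - y: absolute errors add in quadrature."""
    return math.hypot(s_x, s_y)


def error_product_or_quotient(z, x, s_x, y, s_y):
    """z = x * y or z = x / y: relative errors add in quadrature."""
    return abs(z) * math.hypot(s_x / x, s_y / y)


def error_constant_multiple(k, s_x):
    """z = k * x with k an exact constant: the error scales by |k|."""
    return abs(k) * s_x


def error_power(z, k, x, s_x):
    """z = x ** k with k an exact constant: relative error scales by |k|."""
    return abs(z) * abs(k) * s_x / abs(x)


# Invented example: area of a rectangle, 2.0 +/- 0.1 by 3.0 +/- 0.1.
area = 2.0 * 3.0
s_area = error_product_or_quotient(area, 2.0, 0.1, 3.0, 0.1)
print(f"area = {area:.1f} +/- {s_area:.1f}")
```

Note the pattern: sums and differences combine absolute errors, while products, quotients and powers combine relative errors.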

Example of propagation of error • Suppose we measure the thickness of a rock bed using a tape measure. • The tape measure is shorter than the bed thickness so we have to do it in two steps x and y. • We repeat the measurements 100 times and obtain the following mean and standard deviation values for x and y: x = 12.1 ± 0.3 cm, y = 4.2 ± 0.2 cm • The thickness of the bed should be simply: x + y = 16.3 cm • But what about the error on the total thickness?

Example of propagation of error • It is given by propagating the individual errors as follows: $\sigma_{x+y} = \sqrt{\sigma_x^2 + \sigma_y^2} = \sqrt{0.3^2 + 0.2^2} \approx 0.36$ cm, which rounds to 0.4 cm • So the final answer for the total thickness of the bed is: 16.3 ± 0.4 cm • Error propagation formulae are non-intuitive and understanding how they are derived requires some mathematical knowledge
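The quadrature result for the bed thickness can be checked numerically. The Python sketch below assumes that x and y are independent and Gaussian with the quoted means and standard deviations, draws many simulated measurement pairs, and compares the spread of the simulated totals with sqrt(0.3² + 0.2²). The function name, sample count and seed are arbitrary choices for illustration.

```python
import math
import random
import statistics


def simulate_bed_thickness(n_samples=100_000, seed=0):
    """Simulate x + y for x = 12.1 +/- 0.3 cm and y = 4.2 +/- 0.2 cm.

    Assumes independent Gaussian errors on x and y.
    Returns the mean and sample standard deviation of the totals.
    """
    rng = random.Random(seed)
    totals = [
        rng.gauss(12.1, 0.3) + rng.gauss(4.2, 0.2)
        for _ in range(n_samples)
    ]
    return statistics.mean(totals), statistics.stdev(totals)


mean_total, sd_total = simulate_bed_thickness()
analytic_sd = math.sqrt(0.3 ** 2 + 0.2 ** 2)

print(f"simulated: {mean_total:.2f} +/- {sd_total:.2f} cm")
print(f"quadrature error: {analytic_sd:.2f} cm")
```

The simulated spread should agree with the quadrature value to within sampling noise, and it is noticeably smaller than the simple sum 0.3 + 0.2 = 0.5 cm because independent random errors partly cancel.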

More complex examples • What if we have several functions of several variables? • E.g. calculating density using Archimedes' Principle: • This equation contains two functions and two variables • Error propagation is best done in parts, so first work out the value and error of the denominator: • Then the value and error of: • In a few weeks we will use a Monte Carlo method for solving more complex functions
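The density equation itself is not preserved in this transcript. The Python sketch below assumes the common Archimedes form, relative density = W_air / (W_air − W_water), which has two functions (a subtraction and a division) and two variables; both the equation and the weights used are assumptions for illustration. It follows the "in parts" recipe: denominator first, then the ratio. Treating the numerator and denominator as independent is a simplification, since W_air appears in both.

```python
import math


def density_in_parts(w_air, s_air, w_water, s_water):
    """Relative density W_air / (W_air - W_water), with error, in two parts.

    The form of the equation is an assumption, not taken from the slide.
    """
    # Part 1: denominator is a difference, so absolute errors
    # add in quadrature.
    denominator = w_air - w_water
    s_denominator = math.sqrt(s_air ** 2 + s_water ** 2)
    if w_air == 0 or denominator == 0:
        raise ValueError("weights must give a non-zero ratio")

    # Part 2: the ratio is a quotient, so relative errors
    # add in quadrature.
    density = w_air / denominator
    relative_error = math.sqrt(
        (s_air / w_air) ** 2 + (s_denominator / denominator) ** 2
    )
    return density, abs(density) * relative_error


# Invented weights: 25.0 +/- 0.1 g in air, 15.0 +/- 0.1 g in water.
density, s_density = density_in_parts(25.0, 0.1, 15.0, 0.1)
print(f"relative density = {density:.2f} +/- {s_density:.2f}")
```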

Reminder Statistics practical #2 • Those not taking BIOL20451: Roscoe 3.5 1100 – 1300 Tuesday • Those taking BIOL20451: Williamson 1.12 1400 – 1600 Tuesday

Some common problems • Weighted mean • [Table with columns f and x]

What does adding two variables really mean?
