
Extreme values

Extreme values. Seminar at MLURI, January 2008. Adam Butler Biomathematics & Statistics Scotland. 1. Motivation What is EVT? Applications Current research. Motivation. Flooding, Budapest, 2002 Graham Berry http://en.wikipedia.org/wiki/Image:Floods_in_Budapest_2002.jpg.


Presentation Transcript


  1. Extreme values Seminar at MLURI, January 2008 Adam Butler Biomathematics & Statistics Scotland

  2. 1. Motivation What is EVT? Applications Current research Motivation

  3. Flooding, Budapest, 2002 Graham Berry http://en.wikipedia.org/wiki/Image:Floods_in_Budapest_2002.jpg

  4. What is the probability that the flood defenses of Budapest will be overtopped during 2008?

  5. Northern Rock branch, London, 2007 Alex Gunningham http://en.wikipedia.org/wiki/Image:1378965141_7817eb7212_o.jpg

  6. What is the probability of today’s value of the Dow Jones index being at least 9.5% lower than yesterday’s?

  7. Log daily return (LDR) = log(value today / value yesterday) Value drops by 9.5% $\Rightarrow$ LDR = $\log(0.905) \approx -0.10$ Q. On this particular day, what is the chance of getting a log daily return of less than $-0.10$?
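The arithmetic on this slide can be checked with a minimal sketch. The function name and the index values 100.0 and 90.5 are illustrative assumptions, not figures from the presentation.

```python
import math

def log_daily_return(value_today, value_yesterday):
    """Log daily return: log(value today / value yesterday)."""
    return math.log(value_today / value_yesterday)

# Hypothetical index values: a 9.5% fall leaves today's value at
# 0.905 times yesterday's value.
ldr = log_daily_return(90.5, 100.0)
print(f"log daily return = {ldr:.4f}")
```

Natural logarithms are assumed here, which is the usual convention for log returns.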

  8. Dow Jones Data for the period 1996-2000

  9. To answer this question we clearly need to extrapolate, since –0.1 is well outside the range of the data… Extrapolation should be avoided whenever possible, but in many real-life problems it is unavoidable

  10. So how should we go about estimating this probability? We could assume that the data are normally distributed…

  11. $P(X < -0.1) \approx 10^{-20}$

  12. …but the extreme values that have been observed don’t play much of a role when we estimate the parameters (e.g. the mean and variance) Hence, our chosen model (e.g. the normal distribution) might do badly in describing their properties…

  13. Empirical: $P(X < -0.05) \approx 0.002$ Normal: $P(X < -0.05) \approx 0.000001$

  14. …and, worse still, extrapolations beyond the range of the data often differ radically between models that provide a very similar fit to the bulk of the data For example, we might decide to fit a Cauchy rather than a normal distribution…

  15. Cauchy: $P(X < -0.1) \approx 0.02$ Normal: $P(X < -0.1) \approx 10^{-20}$
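The sensitivity of tail extrapolation to the choice of model can be illustrated with the closed-form distribution functions of the two families. This is only a sketch: the parameter values below are hypothetical, chosen to give a roughly similar central spread, and are not the values fitted to the Dow Jones data in the presentation.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, via the complementary error function."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def cauchy_cdf(x, location, scale):
    """P(X <= x) for a Cauchy distribution."""
    return 0.5 + math.atan((x - location) / scale) / math.pi

# Hypothetical parameters (assumptions for illustration only).
mean, sd = 0.0, 0.01
location, scale = 0.0, 0.006

for x in (-0.02, -0.05, -0.10):
    print(f"x = {x:6.2f}  normal: {normal_cdf(x, mean, sd):.3e}  "
          f"cauchy: {cauchy_cdf(x, location, scale):.3e}")
```

Near the centre the two models give probabilities of a similar order, but far in the tail they differ by many orders of magnitude, which is the point the slide makes.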

  16. We need an alternative statistical approach that is more robust, in the sense that it does not require us to make strong and untestable assumptions about the process that is generating our data This is the motivation for EVT – Extreme Value Theory

  17. Motivation 2. What is EVT? Applications Current research Motivation

  18. General characteristics of an “EVT” problem • We are interested in a process that can be quantified, and for which we have some data • …and we want to use these data to say something about the probability that a rare or extreme event will occur • We will usually be interested in events that are beyond the range of the data, i.e. we want to extrapolate

  19. To deal with such problems, we begin from the principle that our inferences should only be based on the most extreme data that we have actually observed, i.e. we should throw away almost all of the data

  20. Extreme value theory (EVT) then provides us with some simple and robust models that can be used to describe the properties of these extreme data

  21. Q. What is the probability of getting more than 100mm of rain on any given day?

  22. We might decide to only use data for days with 25mm or more of rainfall…

  23. Histogram of data above a threshold of 25mm

  24. Threshold exceedance = Value - Threshold

  25. The GPD model • A good statistical model for threshold exceedances is the GPD (Generalised Pareto Distribution) • The cumulative distribution function is of the form $F(x) = 1 - (1 + \xi x / \sigma)^{-1/\xi}$ for $x > 0$ • There are two parameters, a scale parameter $\sigma$ and a shape parameter $\xi$, which need to be estimated
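The two steps described so far (forming threshold exceedances, then evaluating the GPD distribution function) can be sketched as follows. The function names and the rainfall values are hypothetical; the zero-shape branch uses the exponential distribution, which is the limit of the GPD as the shape parameter tends to zero.

```python
import math

def threshold_exceedances(values, threshold):
    """Amounts by which values exceed the threshold (value - threshold)."""
    return [v - threshold for v in values if v > threshold]

def gpd_cdf(x, scale, shape):
    """P(exceedance <= x) under a GPD with the given scale and shape."""
    if x <= 0.0:
        return 0.0
    if abs(shape) < 1e-12:
        # Limiting case shape -> 0: exponential distribution.
        return 1.0 - math.exp(-x / scale)
    t = 1.0 + shape * x / scale
    if t <= 0.0:
        # Only possible when shape < 0: x lies beyond the upper end point.
        return 1.0
    return 1.0 - t ** (-1.0 / shape)

# Hypothetical daily rainfall totals in mm (not data from the presentation).
rain = [3.0, 27.5, 0.0, 41.2, 25.0, 60.1]
exc = threshold_exceedances(rain, 25.0)
print(exc)
print([round(gpd_cdf(x, scale=7.7, shape=0.1), 3) for x in exc])
```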

  26. GPD model fitted to threshold exceedances • Threshold = u = 25mm; $\sigma$ and $\xi$ estimated by maximum likelihood to be 7.70 and 0.108; $P(X > 100)$ estimated to be 0.0000209 (once per 131 years)
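The extrapolation on this slide can be traced with a short sketch using the reported estimates (threshold 25mm, scale 7.70, shape 0.108). The unconditional probability is the product of the threshold exceedance rate and the GPD survival probability. The slide does not state the proportion of days exceeding 25mm, so the rate below is backed out from the reported figures rather than taken from the source, and a 365.25-day year is an assumption.

```python
def gpd_survival(x, scale, shape):
    """P(exceedance > x) under a GPD; assumes shape != 0 and x >= 0."""
    return (1.0 + shape * x / scale) ** (-1.0 / shape)

def return_period_years(daily_probability, days_per_year=365.25):
    """Average number of years between days on which the event occurs."""
    return 1.0 / (daily_probability * days_per_year)

u, sigma, xi = 25.0, 7.70, 0.108   # values reported on the slide
p_daily = 0.0000209                # P(X > 100) reported on the slide

# P(X > 100 | X > 25): survival probability of an exceedance of 75mm.
conditional = gpd_survival(100.0 - u, sigma, xi)

# P(X > 25) implied by the two reported numbers (not given in the source).
implied_rate = p_daily / conditional

years = return_period_years(p_daily)
print(f"P(X > 100 | X > 25) = {conditional:.5f}")
print(f"implied P(X > 25)   = {implied_rate:.4f}")
print(f"return period       = {years:.0f} years")
```

The return period agrees with the "once per 131 years" quoted on the slide.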

  27. But why is the GPD a good model to use? The mathematical justification is given by asymptotic theory • The theory says that, for almost any random variable X, the exceedances of a high threshold u will tend towards following the GPD model as u tends towards infinity • In practice, we use a threshold that is high but still finite: we rely on the fact that if this level is sufficiently high then the asymptotic result will still be approximately true

  28. When choosing a threshold, we need to balance • Precision: If the threshold is low then our results will tend to be more certain than if it is high • Bias: extreme value methods will only be valid when the threshold is sufficiently high We can do this in a partly subjective way using parameter stability plots

  29. Parameter stability plot for shape parameter, $\xi$
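The numbers behind a parameter stability plot can be produced by re-estimating the GPD over a range of thresholds and checking where the estimates level off. The sketch below is a simplification: it swaps in method-of-moments estimates (valid only for shape below 1/2) in place of maximum likelihood so that it needs nothing beyond the standard library, and it uses simulated exponential data, for which the true shape is 0 at every threshold. All names are hypothetical.

```python
import random
from statistics import mean, variance

def gpd_moment_estimates(exceedances):
    """Method-of-moments (scale, shape) estimates for a GPD; needs shape < 1/2."""
    m = mean(exceedances)
    v = variance(exceedances)
    ratio = m * m / v            # equals 1 - 2*shape for a GPD
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * m * (1.0 + ratio)
    return scale, shape

def stability_table(values, thresholds):
    """For each threshold: (threshold, n exceedances, shape, modified scale)."""
    rows = []
    for u in thresholds:
        exc = [v - u for v in values if v > u]
        scale, shape = gpd_moment_estimates(exc)
        # The modified scale (scale - shape * u) should also be stable
        # above a threshold at which the GPD is adequate.
        rows.append((u, len(exc), shape, scale - shape * u))
    return rows

random.seed(1)
simulated = [random.expovariate(1.0) for _ in range(20000)]  # simulated, not real data
table = stability_table(simulated, [0.5, 1.0, 1.5])
for u, n, shape, mod_scale in table:
    print(f"u = {u:.1f}  n = {n:5d}  shape = {shape:6.3f}  modified scale = {mod_scale:6.3f}")
```

Raising the threshold reduces the number of exceedances, so the estimates become noisier, which is the precision side of the trade-off described above.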

  30. The GEV model • Another approach involves analysing block maxima • For example, if we have hourly sea level data then we may choose to analyse only the largest value that occurs each year: the annual maximum value • The same method can also be used to analyse minima

  31. A good statistical model for block maxima is the GEV (Generalised Extreme Value Distribution) • The cumulative distribution function is of the form $F(x) = \exp\{-[1 + \xi (x - \mu) / \sigma]^{-1/\xi}\}$ • There are three parameters, a location parameter $\mu$, a scale parameter $\sigma$, and a shape parameter $\xi$, which need to be estimated
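A minimal sketch of the GEV distribution function follows. The function name is hypothetical, and the zero-shape branch uses the Gumbel distribution, which is the limit of the GEV as the shape parameter tends to zero.

```python
import math

def gev_cdf(x, location, scale, shape):
    """P(block maximum <= x) under a GEV distribution."""
    z = (x - location) / scale
    if abs(shape) < 1e-12:
        # Limiting case shape -> 0: Gumbel distribution.
        return math.exp(-math.exp(-z))
    t = 1.0 + shape * z
    if t <= 0.0:
        # Outside the support: below the lower end point when shape > 0,
        # above the upper end point when shape < 0.
        return 0.0 if shape > 0 else 1.0
    return math.exp(-t ** (-1.0 / shape))

# Hypothetical parameter values, for illustration only.
for shape in (-0.2, 0.0, 0.2):
    print(shape, round(gev_cdf(12.0, location=10.0, scale=2.0, shape=shape), 4))
```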

  32. The r-largest model • The GEV model uses only one value per block • An extension of this model involves using the r largest values per block, where r is greater than one • e.g. We might model the 20 highest sea levels per year
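Preparing data for the block maxima and r-largest approaches amounts to grouping observations by block and keeping the top values in each. The sketch below assumes records arrive as (block label, value) pairs; the function name and the sea level values are hypothetical.

```python
import heapq
from collections import defaultdict

def r_largest_per_block(records, r=1):
    """Map each block label to its r largest values, in descending order.

    With r=1 this gives the block maxima used by the GEV model.
    """
    blocks = defaultdict(list)
    for block, value in records:
        blocks[block].append(value)
    return {block: heapq.nlargest(r, vals) for block, vals in sorted(blocks.items())}

# Hypothetical (year, sea level in metres) records.
records = [
    (2001, 3.1), (2001, 3.9), (2001, 2.7),
    (2002, 4.2), (2002, 3.3),
    (2003, 2.9), (2003, 3.0), (2003, 3.6),
]
annual_maxima = r_largest_per_block(records, r=1)
two_largest = r_largest_per_block(records, r=2)
print(annual_maxima)
print(two_largest)
```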

  33. The shape parameter • All of the extreme value models contain a common parameter $\xi$ that determines the shape of the distribution • The extremes of a distribution with a bounded (short) upper tail have a negative shape parameter ($\xi < 0$), and the extremes of a heavy-tailed distribution have a positive shape parameter ($\xi > 0$) • The extreme values of a normal distribution, and of other light-tailed distributions whose tails decay exponentially, have $\xi = 0$

  34. GPD: impact of the shape parameter, $\xi$ • $\xi = 0$ • $\xi = 1$ • $\xi = -0.5$

  35. Covariates • The properties of extreme values may depend on time, location, or other covariates (explanatory variables) • We can easily build these covariates into our extreme value models, in a similar way that we would build them into a regression model or GLM • The key difference is that in a GLM we only build covariates into the mean, whereas in EV models we might build them into any of the three parameters
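One way to build a covariate into an extreme value model is to let the GEV location parameter depend linearly on time and then minimise the negative log-likelihood over all four parameters. The sketch below only evaluates that likelihood. The function name, the parameter ordering and the data are hypothetical, and the minimisation itself is left to whatever numerical optimiser is available.

```python
import math

def gev_trend_nll(params, times, maxima):
    """Negative log-likelihood of a GEV with location mu0 + mu1 * t.

    params = (mu0, mu1, sigma, xi). Returns inf for invalid parameters.
    """
    mu0, mu1, sigma, xi = params
    if sigma <= 0.0:
        return math.inf
    nll = 0.0
    for t, x in zip(times, maxima):
        z = (x - (mu0 + mu1 * t)) / sigma
        if abs(xi) < 1e-12:
            # Gumbel limit of the GEV density.
            nll += math.log(sigma) + z + math.exp(-z)
        else:
            s = 1.0 + xi * z
            if s <= 0.0:
                return math.inf  # observation outside the support
            nll += math.log(sigma) + (1.0 + 1.0 / xi) * math.log(s) + s ** (-1.0 / xi)
    return nll

# Hypothetical annual maxima with an upward drift (not the Venice data).
times = [0, 1, 2, 3, 4]
maxima = [101.0, 102.0, 106.0, 107.0, 109.0]
with_trend = gev_trend_nll((100.0, 2.0, 2.0, 0.0), times, maxima)
no_trend = gev_trend_nll((100.0, 0.0, 2.0, 0.0), times, maxima)
print(with_trend, no_trend)
```

The same pattern extends to the scale or shape parameters by replacing them with functions of the covariate inside the loop.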

  36. Venice sea level data – linear trend in location parameter

  37. More advanced statistical modelling • Methods to deal with clustering: e.g. declustering algorithms, estimation of the extremal index • Semiparametric modelling: allow trends to vary smoothly over time, using local likelihood or smoothing splines • Bayesian methods: allow for the incorporation of prior information, and for the construction of relatively complicated hierarchical models
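As an example of the first item, runs declustering treats exceedances as belonging to the same cluster until a fixed number of consecutive non-exceedances is seen, and keeps one maximum per cluster. The ratio of clusters to exceedances gives the runs estimate of the extremal index. This is a minimal sketch with hypothetical names and data; the run length is a tuning choice, not something specified in the presentation.

```python
def runs_decluster(values, threshold, run_length):
    """Cluster maxima, where a cluster ends after `run_length`
    consecutive values at or below the threshold."""
    cluster_maxima = []
    current_max = None   # maximum of the cluster currently open, if any
    gap = 0              # consecutive non-exceedances since the last exceedance
    for v in values:
        if v > threshold:
            current_max = v if current_max is None else max(current_max, v)
            gap = 0
        elif current_max is not None:
            gap += 1
            if gap >= run_length:
                cluster_maxima.append(current_max)
                current_max = None
                gap = 0
    if current_max is not None:
        cluster_maxima.append(current_max)
    return cluster_maxima

def extremal_index_runs(values, threshold, run_length):
    """Runs estimate of the extremal index: clusters / exceedances."""
    n_exceed = sum(1 for v in values if v > threshold)
    if n_exceed == 0:
        raise ValueError("no exceedances of the threshold")
    return len(runs_decluster(values, threshold, run_length)) / n_exceed

series = [1, 6, 7, 2, 8, 1, 1, 1, 9, 1]   # hypothetical series
clusters = runs_decluster(series, threshold=5, run_length=2)
theta = extremal_index_runs(series, threshold=5, run_length=2)
print(clusters, theta)
```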

  38. Example of semiparametric modelling: estimated trends in storm surge levels at Dover

  39. Software • Add-on packages are available for R (extRemes, ismev, evir, evd, evdbayes), Splus (EVIS, S+FinMetrics) and Matlab (EVIM, EXTREMES) • The extremes toolkit provides a user-friendly interface - www.isse.ucar.edu/extremevalues/evtk.html • Some methods are also available in Genstat • Stand-alone commercial software: Xtremes, HYFRAN

  40. Should I be using EVT? Disadvantages • Inefficient: most of the data are thrown away, so we may over-estimate uncertainty, and the approach relies on having a large sample size • Asymptotics: the theory only holds exactly for infinitely extreme events, and it is difficult to extend to the multivariate case • Data quality: sensitive to errors in extreme data Advantages • Robust: relies on weak assumptions and avoids bias • Theoretically sound: justified by asymptotic theory • Quick & relatively easy to use • Honest about the uncertainties involved in making statements about very rare events

  41. Motivation What is EVT? 3. Applications Current research Motivation

  42. Environmental sciences • EVT is widely used by scientists working in hydrology, climatology, oceanography and fire science • It is also used for operational purposes in flood risk assessment and civil engineering • Particular interest in studying the impact of climate change upon extreme events, e.g. the MICE project (www.cru.uea.ac.uk/projects/mice) and the WASA project: Waves & Storms in the NE Atlantic

  43. Thames Barrier, London Source: Roger Haworth http://en.wikipedia.org/wiki/Image:Thames_Barrier_059184.jpg

  44. Risk assessment and design • Extreme value problems in hydrology and coastal engineering are often phrased in terms of return levels • N-year return level: the level that is exceeded with probability 1/N in a particular year – definition applies to nonstationary processes too, but interpretation is harder • e.g. Thames Barrier: “…was originally designed to protect London against a flood level with a return period of 1000 years in the year 2030…” (Wikipedia)
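For a stationary process whose annual maxima follow a GEV distribution, the N-year return level is the quantile with annual exceedance probability 1/N. The sketch below inverts the GEV distribution function; the function name and the parameter values are hypothetical, not estimates for the Thames or any other site.

```python
import math

def gev_return_level(n_years, location, scale, shape):
    """Level exceeded with probability 1/n_years in any one year,
    assuming annual maxima follow a GEV distribution."""
    y = -math.log(1.0 - 1.0 / n_years)
    if abs(shape) < 1e-12:
        # Gumbel limit.
        return location - scale * math.log(y)
    return location - (scale / shape) * (1.0 - y ** (-shape))

# Hypothetical GEV parameters for annual maximum sea level (metres).
mu, sigma, xi = 10.0, 2.0, 0.1
for n in (10, 100, 1000):
    print(n, round(gev_return_level(n, mu, sigma, xi), 2))
```

With a positive shape parameter the return level keeps growing as N increases, whereas a negative shape parameter implies a finite upper limit.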

  45. Biology • Biologists are also often interested in studying the properties of extreme or rare events, but rarely use EVT • Some likely reasons – • Relatively small sample sizes (compared to e.g. hydrology) • Extreme events not so easily defined in quantitative terms • New applications are likely to arise from the increasing use of large datasets (e.g. in genetics), and from an increased focus on quantitative risk assessment

  46. Genetics • A major application of EVT is in sequence alignment, and extreme value models are used by BLAST and FASTA • Compare a sequence against a vast database of known sequences: 1. define a similarity score; 2. search for the best match within the database; 3. use EVT to evaluate the significance of this match “…a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences…” (Wikipedia)
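Step 3 can be sketched using the Karlin-Altschul form of the extreme value approximation that underlies BLAST's reported E-values: the expected number of chance matches scoring at least S is K·m·n·exp(−λS), and the probability of at least one such match follows from a Poisson approximation. The function names, and the values of K, λ, the sequence lengths and the scores below, are hypothetical and chosen only for illustration.

```python
import math

def expected_chance_hits(score, query_length, database_length, k, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return k * query_length * database_length * math.exp(-lam * score)

def match_p_value(score, query_length, database_length, k, lam):
    """Probability of at least one chance alignment scoring at least `score`."""
    e_value = expected_chance_hits(score, query_length, database_length, k, lam)
    return -math.expm1(-e_value)   # 1 - exp(-E), computed accurately for small E

# Hypothetical scoring-system constants and search sizes.
K, LAMBDA = 0.13, 0.32
QUERY_LEN, DB_LEN = 250, 1e8

for score in (40, 60, 80):
    e = expected_chance_hits(score, QUERY_LEN, DB_LEN, K, LAMBDA)
    p = match_p_value(score, QUERY_LEN, DB_LEN, K, LAMBDA)
    print(f"score = {score}  E = {e:.3g}  p = {p:.3g}")
```

Because the best score is a maximum over a very large number of candidate alignments, its null distribution is an extreme value distribution, which is why the significance falls off exponentially in the score.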
