NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA

INTRODUCTION
• Book list
• Level of course
• Aims of course
• What are multivariate data?
• What is multivariate data analysis?
• Aims of multivariate data analysis
• Why do multivariate data analysis?
• Terminology
• Types of variables
• Geometrical models and concept of similarity (dissimilarity or distance)
• Computing
• Course topics

LEVEL OF THE COURSE Approach from practical biological and geological viewpoint, not statistical theory viewpoint. Assume no background in matrix algebra, eigenanalysis, or statistical theory. Emphasis on techniques that are ecologically realistic and useful and that are computationally feasible.

“Truths which can be proved can also be known by faith. The proofs are difficult and can only be understood by the learned; but faith is necessary also to the young, and to those who, from practical preoccupations, have not the leisure to learn. For them, revelation suffices.” Bertrand Russell 1946 The History of Western Philosophy

“It cannot be too strongly emphasised that a long mathematical argument can be fully understood on first reading only when it is very elementary indeed, relative to the reader’s mathematical knowledge. If one wants only the gist of it, he may read such material once only, but otherwise he may expect to read it at least once again. Serious reading of mathematics is best done sitting bolt upright on a hard chair at a desk. Pencil and paper are indispensable.” L Savage 1972 The Foundations of Statistics. BUT: “A journey of a thousand miles begins with a single step” Lao Tsu

STATUS OF MULTIVARIATE NUMERICAL DATA ANALYSIS Basic mathematics of correlation, regression, analysis of variance, eigenanalysis, randomisation etc. is not new; it was worked out in the 1920s-1930s. The arithmetic manipulations and calculations involved were so numerous and so time consuming that it was virtually impossible to work with anything other than the smallest data-sets on a hand calculator or early computer. Development of numerical data analysis is closely linked to the development of computers. It is now possible to do in seconds what would have taken hours, days, even weeks. Increased availability of computer program packages has advantages and disadvantages.
Advantages
• fast
• painless
• simple
Disadvantages
• too fast
• too easy
• too simple
Sound interpretation and critical evaluation of results require a good understanding of the technique.

AIMS Provide an introductory understanding of the most appropriate methods for the numerical analysis of complex multivariate biological and environmental data. Recent maturation of methods. Provide an introduction to what these methods do and do not do. Provide some guidance as to when and when not to use particular methods. Provide an outline of the major assumptions, limitations, strengths, and weaknesses of different methods. Indicate to you when to seek expert advice. Encourage numerical thinking (ideas, reasons, potentialities behind the techniques). Not so concerned here with numerical arithmetic (the numerical manipulations involved).

Syllabus for Edgeworth’s 1892 Newmarch Lectures, University College London ON THE USES AND METHODS OF STATISTICS By Professor F. Y. Edgeworth, M. A., D. C. L.
I. FIRST PRINCIPLES The extent of the subject here treated is that which is denoted by two leading definitions of statistics, viz: the study of numerical statements relating to society, and the theory of means. The subject may be divided according as the element of induction is more or less prevalent. First come general directions as to the acquisition of data; e.g., that figures should be accurate, and terms unambiguous. Examples of the violation of these rules; together with other precepts and cautions. Use of relative figures (per head, per cent, &c.). Analysis of the data. References: Conférences sur la Statistique (Rozier Editeur), 1891; Pidgin, Practical Statistics, 1888; Giffen, International Statistical Comparisons, Economic Journal, June, 1892.
II. GRAPHICAL METHODS The Cartesian system of co-ordinates. Integration and interpolation. Case where several dependent variables (i.e. diseases from different causes) are referred to one independent variable (i.e. the time). The case of one variable dependent on two independent variables is properly represented by a surface; but curves of level and variously coloured planes are more convenient. Methods of expressing variation of a quantity relative to its initial, or average, value. Miscellaneous devices for exhibiting numerical relations to the eye. References: Marey, La Méthode Graphique, 1885; Favaro, Leçons de Statique Graphique (translated into French by Terrier), Ch. V. with appendix by the translator. Levasseur, La Statistique Graphique, Journal of the Statistical Society, Jubilee vol., 1885; Marshall, The Graphic Method of Statistics, Ibid; Cheysson, Les Cartogrammes à teintes graduées, Journal de la Société de Statistique de Paris, 1887; Scribner’s Statistical Atlas of the United States; Longstaff, Studies in Statistics, 1891.

III. THE DOCTRINE OF AVERAGES The general idea of a mean comprehends innumerable species, of which the most important are, the Arithmetic Mean, the Median, the Greatest Ordinate (or centre of greatest condensation) and the Geometric Mean. A cross division is between simple and weighted means. Concrete instances of these varieties. Subtle distinction between so-called objective and subjective means. Peculiar prestige attaches to the means of which the constituents are grouped according to the Probability Curve, or law of error. A priori demonstration, and empirical verification, that this form arises under certain conditions. References: Venn, Logic of Chance, Third Edition, 1888, chap. xviii. and xix.; On….Averages. Journal of the Statistical Society, 1891; Galton, Statistics by inter-comparison, Philosophical Magazine, 1875; Bertillon, Moyenne, Dictionnaire Encyclopédique des Sciences Médicales; Edgeworth, On the Choice of Means, Phil. Mag., 1887, On the empirical proof of the law of error, Ib., 1887.
IV. TYPES AND CORRELATIONS The ‘mean man’ has for stature, length of cubit, height of knee, &c, the respective means of the statures, lengths, &c., of a greater number of men. Reply to the objection that such a combination of partial means may not form a possible whole. Relation between the deviation of one organ or attribute, e.g. length of cubit, from its mean; as established by Mr. Galton, and illustrated by Mr. H. Dickson. Abridged method of ascertaining the co-efficient which expresses the correlation between three attributes, e.g. stature, length of cubit and height of knee. The formula for the most probable attribute, e.g. stature corresponding to assigned values of two other attributes, e.g. length of cubit and height of knee, may be ascertained either from three simple correlations, between stature and cubit, stature and height of knee, cubit and height of knee; or by observations special to the case of three variables. Correlation between any number of attributes. References: Quetelet, Anthropométrie; Galton, Family Likeness in Stature, Proceedings of the Royal Society, 1886; Co-relations and their measurements Ibid. 1888; Weldon, Correlated Variations, Ibid, 1892.

V. THE STATISTICAL PART OF INDUCTIVE LOGIC Passing Insurance and other direct applications of statistics, we come to the investigation of causes. The inductive method to which statistics lends itself, the Method of Agreement, is liable to the fallacy Post hoc propter hoc; of which numerous examples occur. The Method of Concomitant variations is facilitated by the use of parallel curves. The Method of Residues is exemplified when in comparing the death rates of different classes, we make allowance for their different ages; and in similar cases. References: Mill, Logic; Giffen, Essays on Finance, and Article in June No. of Economic Journal; Humphreys, Value of death rates as a test of Sanitary conditions, Journal of the Statistical Society, 1874, Class Mortality Statistics, Ibid, 1887.
VI. THE ELIMINATION OF CHANCE One case of the Method of Residues, for which there exists a technical apparatus, is where the agency allowed for consists of those “fleeting causes” called chance. The simple method of eliminating chance, described by Mill (Logic, iii, xviii, 4) and the higher method derived from the theory of error. The latter method is particularly applicable where the deviation from the average value of a ratio – e.g. that between male and female births – follows the analogy of the simpler games of chance. In other cases the higher theory affords rather regulative ideas than exact conclusions; in this respect, comparable to the use of the mathematical theory of economics. References: Westergaard, Grundzüge der Theorie der Statistik, 1891; Duesing, Das Geschlechtverhaltniss in Preussen, 1890; Edgeworth, Methods of Statistics, Journal of the Statistical Society, Jubilee vol., 1885.
[The lectures were presented on six consecutive Wednesdays at 5:00 P.M., beginning 11 May 1892, admission free.]

AIMS At its best, statistical analysis sharpens thinking about data, reveals new patterns, prompts creative thinking, and stimulates productive discussions in multi-disciplinary research groups. For many scientists, these positive possibilities of statistics are over-shadowed by negatives; abstruse assumptions, emphasis of things one can’t do, and convoluted logic based on hypothesis rejection. One colleague’s reaction to this Special Feature (on statistical analysis of ecosystem studies) was that “statistics is the scientific equivalent of a trip to the dentist.” This view is probably widespread. It leads to insufficient awareness of the fact that statistics, like ecology, is a vital, evolving discipline with ever-changing capabilities. At the end of the semester, could my students fully understand all of the statistical methods used in a typical issue of Ecology? Probably not, but they did have the foundation to consider the methods if authors clearly described their approach. Statistics can still mislead students, but students are less apt to see all statistics as lies and more apt to constructively criticise questionable methods. They can dissect any approach by applying the conceptual terms used throughout the semester. Students leave the course believing that statistics does, after all, have relevance, and that it is more accessible than they believed at the beginning of the semester.

July 18, 1998. Plot 6 (quadrats) (Rt. Bank, c 300 m S of mouth of Steepbank R., 40m inland) A typical page from a field notebook. This one records observations on the ground vegetation in Populus balsamifera woodland in the flood plain of the Athabasca River, Alberta.

TYPES OF MULTIVARIATE DATA Features in common: many objects (n) and many variables (m); can be arranged in a DATA MATRIX of SAMPLES or OBJECTS x VARIABLES

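DATA MATRIX Matrix X with n rows (objects) x m columns (variables): an n x m matrix, of order (n x m). Subscripts give the row first, then the column: $X_{21}$ is the element in row two, column one; $X_{ik}$ is the element in row i, column k.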
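A minimal sketch of the subscript convention, taking rows as objects and columns as variables (SAMPLES or OBJECTS x VARIABLES). It is written in Python for illustration; the values are invented toy data and the helper name `element` is hypothetical.

```python
# Hypothetical toy data matrix: n = 3 objects (rows) x m = 4 variables (columns).
X = [
    [12, 0, 5, 3],  # object 1
    [7, 2, 0, 9],   # object 2
    [0, 4, 8, 1],   # object 3
]

n = len(X)     # number of objects (rows)
m = len(X[0])  # number of variables (columns)


def element(matrix, i, k):
    """Return the element in row i, column k, using 1-based subscripts."""
    return matrix[i - 1][k - 1]


x21 = element(X, 2, 1)  # row two, column one
print(f"order: ({n} x {m}), X21 = {x21}")
```

The helper only shifts the 1-based subscripts used on the slide to Python's 0-based list indices.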

FEATURES OF MULTIVARIATE DATA
Complex. Show:
• Noise
• Redundancy
• Internal relationships
• Outliers
Some information in the data is only indirectly interpretable.
ENVIRONMENTAL DATA: fewer variables; +/–, ranks, quantitative; non-normal; linear inter-relationships, often high correlations, some redundancy
BIOLOGICAL DATA: many species; +/–, quantitative, often %; many zero values; skewed; non-linear responses to environment

STATISTICS AND DATA ANALYSIS • Hypothesis testing ‘confirmatory data analysis’ (CDA). • Model building explanatory empirical [statistical] Pielou (1981) Quart. Rev. Biol. “Models are often displayed with little or no effort to link them with the real world. As a result the whole body of knowledge and theory has grown top-heavy with models... Models are not useless but too much should not be expected of them. Modelling is only a part, and a subordinate part, of research.” • Hypothesis generation ‘exploratory data analysis’ (EDA). Detective work CDA & EDA - different aims, philosophies, methods “We need both exploratory and confirmatory”. J W Tukey 1980

[Flow diagram contrasting CONFIRMATORY DATA ANALYSIS with EXPLORATORY DATA ANALYSIS. Box labels: Real world ‘facts’; Hypotheses; Observations, Measurements; Data; Data analysis; Statistical testing; Patterns, ‘Information’; Hypothesis testing; Decisions; Theory]

[Diagram. Box labels: Biological data Y; Additional (e.g. environmental) data X; Underlying statistical model (e.g. linear or unimodal response); Exploratory data analysis; Description; Confirmatory data analysis; Testable ‘null hypothesis’; Rejected hypotheses]

The Popperian hypothetico-deductive method, after Underwood and others. H0 = null hypothesis, HA = alternative hypothesis.
[Flow diagram. Labels: Observation; induction; Theory/Paradigm; deduction; Scientific H0, Scientific HA; Prediction; Statistical H0, Statistical HA; Conceptual design of study, choice of format (experimental, non-experimental) and classes of data; Sampling or experimental design; Data collection; Analysis; Evaluate statistical H0, HA; Evaluate prediction; Evaluate scientific H0, HA; Evaluate theory/paradigm]

A WELL-DESIGNED MODERN ECOLOGICAL STUDY COMBINES BOTH.

Data diving with cross-validation: an investigation of broad-scale gradients in Swedish weed communities. ERIK HALLGREN, MICHAEL W. PALMER and PER MILBERG. Journal of Ecology, 1999, 87, 1037-1051.
Flow chart for the sequence of analyses. Solid lines represent the flow of data and dashed lines the flow of analysis.
[Flow chart box labels: Full data set; Remove observations with missing data; Clean data set; Random split; Exploratory data set; Confirmatory data set; Ideas for more analysis; Hypotheses; Choice of variables; Some previously removed data; Hypothesis tests; Combined data set; Analyses for display; RESULTS]
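The first steps named in this flow chart (remove observations with missing data, then randomly split the clean data set into exploratory and confirmatory sets) can be sketched as follows. This is an illustrative Python sketch only, not the authors' procedure: the 50:50 split, the fixed seed, the record layout, and the function name are all assumptions.

```python
import random


def split_exploratory_confirmatory(observations, fraction=0.5, seed=0):
    """Randomly split observations into (exploratory, confirmatory) lists.

    fraction is the share placed in the exploratory set (assumed 0.5 here);
    seed fixes the random split so it can be repeated.
    """
    rng = random.Random(seed)
    shuffled = list(observations)  # copy, so the input is left untouched
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical full data set; None marks a missing value.
values = [3.1, None, 2.4, 5.0, None, 4.2, 1.9, 3.3]
full = [{"id": i, "value": v} for i, v in enumerate(values)]

# Remove observations with missing data to obtain the clean data set.
clean = [obs for obs in full if obs["value"] is not None]

exploratory, confirmatory = split_exploratory_confirmatory(clean)
print(len(clean), len(exploratory), len(confirmatory))
```

Hypotheses suggested by the exploratory set would then be tested only on the confirmatory set, which has played no part in generating them.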

EUROPEAN FOOD (From A Survey of Europe Today, The Reader’s Digest Association Ltd.) Percentage of all households with various foods in house at time of questionnaire. Foods by countries.

Classification Dendrogram showing the results of minimum variance agglomerative cluster analysis of the 16 European countries for the 20 food variables listed in the table. Key: Countries: A Austria, B Belgium, CH Switzerland, D West Germany, E Spain, F France, GB Great Britain, I Italy, IRL Ireland, L Luxembourg, N Norway, NL Holland, P Portugal, S Sweden, SF Finland

Ordination Key: Countries: A Austria, B Belgium, CH Switzerland, D West Germany, E Spain, F France, GB Great Britain, I Italy, IRL Ireland, L Luxembourg, N Norway, NL Holland, P Portugal, S Sweden, SF Finland Correspondence analysis of percentages of households in 16 European countries having each of 20 types of food.

Minimum spanning tree fitted to the full 15-dimensional correspondence analysis solution superimposed on a rotated plot of countries from previous figure.
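For readers unfamiliar with the term, a minimum spanning tree joins all points with the set of links of smallest total length, without forming loops. The sketch below builds one with Prim's algorithm from a small, invented distance matrix. It is written in Python for illustration, and nothing here reproduces the food data or the software used for the figure.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns a list of (i, j, d) edges, where i is already in the tree,
    j is the newly attached point, and d is the distance between them.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = {0}  # start from the first point
    edges = []
    while len(in_tree) < n:
        best = None
        # Find the shortest link from any point in the tree to any point outside it.
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or dist[i][j] < best[2]):
                    best = (i, j, dist[i][j])
        edges.append(best)
        in_tree.add(best[1])
    return edges


# Hypothetical distances between four points.
dist = [
    [0, 1, 4, 6],
    [1, 0, 2, 5],
    [4, 2, 0, 3],
    [6, 5, 3, 0],
]

tree = minimum_spanning_tree(dist)
total_length = sum(d for _, _, d in tree)
print(tree, total_length)
```

When the distances are computed in the full-dimensional solution, drawing the tree over a two-dimensional plot shows which points that look close on the plot are in fact far apart in the full space.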

Percentages of people employed in nine different industry groups in Europe. (AGR = agriculture, MIN = mining, MAN = manufacturing, PS = power supplies, CON = construction, SER = service industries, FIN = finance, SPS = social and personal services, TC = transport and communications). Source: Euromonitor (1979, pp. 76-7) with the percentage employed in finance in Spain reduced from 14.7 to the more reasonable figure of 8.5

Correspondence analysis

WHY DO MULTIVARIATE DATA ANALYSIS?
1: Data simplification and data reduction - “signal from noise”
2: Detect features that might otherwise escape attention.
3: Hypothesis generation and prediction.
4: Data exploration as aid to further data collection.
5: Communication of results of complex data. Ease of display of complex data.
6: Aids communication and forces us to be explicit. “The more orthodox amongst us should at least reflect that many of the same imperfections are implicit in our own cerebrations and welcome the exposure which numbers bring to the muddle which words may obscure”. D Walker (1972)
7: Tackle problems not otherwise soluble. Hopefully better science.
8: Fun!

“General impressions are never to be trusted. Unfortunately when they are of long standing they become fixed rules of life, and assume a prescriptive right not to be questioned. Consequently those who are not accustomed to original inquiry entertain a hatred and a horror of statistics. They cannot endure the idea of submitting their sacred impressions to cold-blooded verification. But it is the triumph of scientific men to rise superior to their superstitions, to desire tests by which the value of their beliefs may be ascertained, and to feel sufficiently masters of themselves to discard contemptuously whatever may be found untrue.” Francis Galton Quoted from Quotes, Damned Quotes and... compiled by J Bibby Edinburgh: John Bibby (Books)

TERMINOLOGY Sample, object, individual “sampling unit” Variable, character, attribute Algorithms, methods, models, programs Classification, clustering, partitioning, scaling, gradient analysis [assignment, identification, discrimination] [dissection] Objective, repeatable

TYPES OF VARIABLES
1) Numeric, quantitative, continuous variables
2) Nominal and ordinal variables (qualitative multistate)
   Nominal: “disordered multistate” (e.g. red, white, blue)
   Ordinal: “ordered multistate” (e.g. dry, moist, wet)
3) Binary or dichotomous variables, +/– (e.g. male, female)
4) Conditionally present variables
5) Mixed data – see Lecture 12
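One common way of turning these variable types into numbers for analysis is sketched below in Python. The coding schemes (0/1 for binary, ranks for ordinal, one indicator per state for nominal) are assumptions for illustration, not the course's prescribed treatment, and all function names are hypothetical.

```python
def code_binary(value, present="+"):
    """Code a binary (+/-) variable as 1 (present) or 0 (absent)."""
    return 1 if value == present else 0


def code_ordinal(value, ordered_states):
    """Code an ordered multistate variable as its rank (1 = first state)."""
    return ordered_states.index(value) + 1


def code_nominal(value, states):
    """Code a disordered multistate variable as one 0/1 indicator per state."""
    return [1 if value == state else 0 for state in states]


moisture = code_ordinal("moist", ["dry", "moist", "wet"])
colour = code_nominal("white", ["red", "white", "blue"])
occurrence = code_binary("+")
print(moisture, colour, occurrence)
```

Ranks keep the order of ordinal states but impose equal spacing between them, whereas indicator variables impose no order at all on nominal states.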

GEOMETRICAL MODELS Pollen data - 2 pollen types x 15 samples Depths are in centimetres, and the units for pollen frequencies may be either in grains counted or percentages. Adam (1970)

ALTERNATE REPRESENTATIONS OF THE POLLEN DATA Palynological representation Geometrical representation In (a) the data are plotted as a standard diagram, and in (b) they are plotted using the geometric model. Units along the axes may be either pollen counts or percentages. Adam (1970)
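In the geometric model each sample becomes a point whose coordinates are its values for the two pollen types, so similar samples lie close together. A minimal Python sketch, using invented values rather than Adam's (1970) data, and using Euclidean distance as one possible choice of distance:

```python
import math

# Hypothetical samples: depth (cm) -> (pollen type A, pollen type B).
samples = {
    10: (30.0, 40.0),
    20: (33.0, 44.0),
    30: (60.0, 10.0),
}


def euclidean(p, q):
    """Straight-line distance between two points with the same number of axes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d_10_20 = euclidean(samples[10], samples[20])
d_10_30 = euclidean(samples[10], samples[30])
print(d_10_20, d_10_30)
```

The same function works unchanged for any number of pollen types: each extra type adds one axis to the space.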

Geometrical model of a vegetation space containing 52 records (stands). A: A cluster within the cloud of points (stands) occupying vegetation space. B: 3-dimensional abstract vegetation space: each dimension represents an element (e.g. proportion of a certain species) in the analysis (X Y Z axes). A, the results of a classification approach (here attempted after ordination) in which similar individuals are grouped and considered as a single cell or unit. B, the results of an ordination approach in which similar stands nevertheless retain their unique properties and thus no information is lost (X1 Y1 Z1 axes). N. B. Abstract space has no connection with real space from which the records were initially collected.

Concept of Similarity, Dissimilarity, Distance and Proximity
$s_{ij}$: how similar object i is to object j
Proximity measure: DC (dissimilarity coefficient) or SC (similarity coefficient)
Dissimilarity = Distance
Convert $s_{ij}$ to $d_{ij}$: $s_{ij} = C - d_{ij}$, where C is a constant
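The conversion between similarity and dissimilarity with a constant C can be sketched as below (Python, for illustration). The choice C = 1 assumes similarities scaled from 0 to 1; the matrix values and function names are invented.

```python
def to_dissimilarity(s, c=1.0):
    """d_ij = C - s_ij, the rearranged form of s_ij = C - d_ij."""
    return c - s


def to_similarity(d, c=1.0):
    """s_ij = C - d_ij."""
    return c - d


# Hypothetical similarity matrix for three objects (1.0 on the diagonal).
S = [
    [1.0, 0.75, 0.25],
    [0.75, 1.0, 0.5],
    [0.25, 0.5, 1.0],
]

# Apply the conversion element by element to obtain the dissimilarity matrix.
D = [[to_dissimilarity(s) for s in row] for row in S]
print(D)
```

With C = 1 an object has dissimilarity 0 to itself, and the most similar pair of objects becomes the least distant pair.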

COMPUTING In the 10 practicals, we mainly use R, a free, open-source statistical-computing environment, rather than specific commercial packages such as MINITAB or SYSTAT. R has a relatively steep learning curve but is worth it. Recommended as excellent guides: Fox (2002) An R and S-PLUS companion to applied regression (Sage), Crawley (2005) Statistics – An introduction using R (Wiley), Crawley (2007) The R Book (Wiley), Everitt (2005) An R and S-PLUS companion to multivariate analysis (Springer), and Verzani (2005) Using R for introductory statistics (Chapman & Hall/CRC). We will also use specialised software for specific methods (e.g. TWINSPAN, CANOCO and CANODRAW, C2, ZONE, etc.). Computing practicals are an integral and essential part of the course.

COURSE TOPICS

COURSE POWERPOINTS In some of the lectures, some of the slides are rather technical. They are included to make the treatment of the topic under discussion complete. They are for reference only and are marked REF.
