400 likes | 699 Views
ANALYSIS OF BIOLOGICAL DATA BIOL4062/5062. Hal Whitehead. Introduction Assignments Tentative schedule Analysis of biological data. Introduction. Instructors Purpose of class Related classes Books Computer programs. http://myweb.dal.ca/~hwhitehe/BIOL4062/handout4062.htm.
E N D
ANALYSIS OF BIOLOGICAL DATABIOL4062/5062 Hal Whitehead
Introduction • Assignments • Tentative schedule • Analysis of biological data
Introduction • Instructors • Purpose of class • Related classes • Books • Computer programs http://myweb.dal.ca/~hwhitehe/BIOL4062/handout4062.htm
Instructor: Hal Whitehead • LSC3076 (Ph 3723; email hwhitehe@dal.ca) • Best times: 8:00-9:00 a.m. • Teaching Assistant: ? • Other instructors • Dr David Lusseau
Why “Analysis of Biological Data”? • Biologists • increasingly using quantitative techniques • to analyze larger and larger data sets • need skills in data analysis • especially in broad area of ecology • BIOL4062/5062 • introduce techniques for analysis of biological data • emphasis will be on the practical use and abuse of techniques, not derivations or mathematical formulae • in assignments students explore real and realistic data sets
Related classes • Design of Biological Experiments (BIOL4061/5061) • most useful for those who work with systems that can be manipulated • Courses in Statistics • more emphasis on mathematical sides
Some books (on reserve) • Legendre, L. and P. Legendre. Numerical Ecology (2nd edition). Elsevier (1998) • Manly, B.F.J. Multivariate statistical methods: a primer (2nd edition). Chapman & Hall (1994) • Other books: • Many, do not need to be right up to date
Good, comprehensive packages, can do analyses for this class More sophisticated and powerful, harder to use Computer programs • MINITAB • SPSS • SYSTAT • SAS • MATLAB (Statistics toolbox) • S-plus • R
Support from Hal Support from ? Computer programs • MINITAB * † • SPSS * † • SYSTAT † • SAS * † • MATLAB (Statistics toolbox) • S-plus (freely available at Dal.?) • R† (freely available on the web) * on GS.DAL.CA † in Biology-Earth Sciences computer lab
Assignments • Type 1 • artificial data sets for trying different techniques • Type 2 • real data set to try a real analysis
Type 1 assignments • Five assignments, sent by email (next few days) • Each 10% final mark • Artificial but realistic data sets • Different data sets to each student, but structurally similar • More analyses expected for graduate students (BIOL5062) • Analyze using a computer statistical package
Type 1 assignments • Hand in a short write-up, explaining clearly: • what you did • what you found • what you think the results might mean biologically • Beware of: • Rubbish! • Check the results against patterns in the original data to make sure they make sense. • Over-interpreting the results • Not answering the questions posed
Type 1 assignments • Five assignments: • Multiple regression 10% • Log-linear models 10% • Principal components analysis 10% • Discriminant function analysis 10% • Cluster analysis, multidimensional scaling, network analysis 10%
Type 2 assignment • Find a biological data set, and then analyze it • The analysis should not be: • part of past, present, or future Honours, MSc or PhD thesis, or used for another class: self-plagiarism • that, or repeat that, done by someone else: plagiarism
Type 2 assignment • The analysis can • use same data as in thesis or another course, but totally different analysis • use data collected by your supervisor, or someone else, but you should ask them • use a data set that you find on the web, or somewhere else, but you should check that it is OK • be submitted for publication, but you must check that you have all necessary permissions
Type 2 assignment • Minimum sizes of data set (ask Hal for exceptions or in case of uncertainty): • For undergraduates (BIOL4062): • >50 units x >3 variables • For graduates (BIOL5062) • >50 units x >5 variables • either, two types of variables • e.g. “Dependent; Independent”; “Species; Environment” • or, link two data sets with one at least as large as the undergraduate data set • Must address at least 3 biological questions (BIOL4062), or 4 questions (BIOL5062)
Type 2 assignment (4 steps) • a) Short meeting with Hal or *** to discuss your proposed data set and proposed analysis: feedback • bring draft of 2b assignment • b) Description of data set and proposed analysis. • where it came from • its structure(s) (number of variables, units, names of variables, types of variables, ...) • proposed biological questions • proposed analytical methods • possible problems • Example on web
Type 2 assignment (4 steps) • c (i) Presentation of results to the class by graduate students • biological questions being addressed • brief description of the data set • how you analyzed it • conclusions • Example in Class • c(ii) Undergraduate students should go to graduate presentations and will be tested on general issues arising from them on last day
Type 2 assignment (4 steps) • d) Write-up of your analysis as for a scientific journal paper • Max 5 pages (4062) or 7 pages (5062) single-spaced • excluding references, tables, figures • Explain biological question, methods in sufficient detail for someone to replicate them, problems, and biological conclusions • Show graphically, or in tables, the major effects • Do not just present summaries of ordinations or significance levels of hypotheses tests • Introduction and Discussion can be shorter and less detailed than in published paper • sufficient to give a good feel for biological issue being examined and the potential biological significance of the results Example on web
Type 2 assignment • Marks • 2b Description of data set and proposed analysis 5% • 2c 15% • (i) Presentation of results by graduate students (BIOL5062) • (ii) Test on general principles from graduate student presentations (BIOL4062) • 2d Write-up of results 30%
SYSTAT demo. at end of lectures
Analysis of Biological Data • Types of biological data • History (very abbreviated!) • The process of biological data analysis • why garbage may come out • Hypothesis testing and data analysis • assumptions • other issues
Types of biological data • Morphometric • Community ecology • organism distribution and environmental variation • Genetic data for ecological and evolutionary questions • Population data for management, conservation, evolutionary questions • Behavioural, physiological, ...
Development of biological data analysis • >~1850 Displays • >~1900 ANOVA's, regression, correlation • without computers • >~1930 Non-parametric methods • >~1970 Multiple regression and multivariate analysis • matrix algebra using computers • >~1980 Robust methods: bootstraps, jackknives, permutations • need powerful computers
Garbage in => Garbage out • Good data + Errors => Garbage in => Garbage out • Check data entry • Good data + Errors in routine => Garbage out • Check results, run routines on data with known answer, • run on 2 routines • Good data + Wrong model => Garbage out • Think about, read about and discuss model
Hypothesis Experimental Design Experiment Analysis Conclusion [ANOVA, T-test] Agriculture Experimental ecology Physiology Animal behaviour Data Collection Data Analysis Hypothesis [scatter plots, box plots, most multivariate analyses] Fisheries Community ecology Paleontology Hypothesis Testing Data Analysis
Transform data or use non-linear or non-parametric methods Some assumptions • Normality • can only be properly examined on large data sets • mainly a problem on small ones • an important issue for hypothesis testing • normality desirable in data analysis • Linearity • makes hypothesis testing easier • makes data analysis easier • Independence • major problem for hypothesis testing • no problem, or advantage, in data analysis
Other issues in data analysis • Missing data • Often present in ecological data • Outliers • What do we do with apparent outliers? • Remove them? • Multiple comparisons • Major issue with hypothesis testing • Not an issue with data analysis • although: Patterns appear in random data
Next class: • Inference in ecology and evolution: • Null hypothesis statistical tests • Effect size statistics • Bayesian statistics • Information theoretic model comparisons
Performance in BIOL4062/5062 • Graduate students (BIOL5062) • some do well with rather little effort • some do well with a lot of effort • Undergraduate students (BIOL4062) • most do well with some effort • adequate statistical background • some do poorly • inadequate statistical background or effort