Introductory Biological Sequence Analysis Through Spreadsheets
Stephen J. Merrill, Sandra E. Merrill
Marquette University, Milwaukee, WI
ICTCM 2000
Teaching Mathematics to Students of Biology
• The math in these courses needs to correlate with the math needed in the discipline
• The most important “math” needed is statistics
• The molecular biology revolution presents data in a form (sequences of letters) on which calculus has little impact
The Nature of Biological Sequence Data
• The primary structures of DNA, RNA, and proteins are sequences of letters: 4 letters for DNA (ATGC) and RNA (AUGC), and 20 letters for the amino acids that make up a protein
• Secondary and tertiary structure (bending, folding, and twisting) determines function; hints of it can be seen in the primary structure
Use of Spreadsheets in this Setting
• Commonly found and used in biological labs for data acquisition, storage and organization, and data analysis
• Commonly present on student computers and in computer labs
• Unlike calculators, able to handle data sets typical of “real world” applications
• R.F. Murphy at CMU has developed a set of worksheets for sequence analysis
Meaningful Questions & Problems
1. Measuring the similarity between two strings -- “alignment” or “homology”
2. Finding instances of a pattern in a string
3. Describing the composition and properties of a string
4. Graphing the evolutionary process and construction of phylogenetic trees
Measuring the Similarity between Strings
• Given a gene, suggest the function of the protein it codes for by finding a similar sequence (possibly in another species)
• Simple homology assigns a “1” for agreement and a “0” for disagreement at each site, then sums over all sites
• Homology is that sum expressed as a percentage of the highest possible score
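The scoring rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' spreadsheet: the function name simple_homology and the example sequences are made up here, and the sketch assumes the two sequences are already aligned and of equal length.

```python
def simple_homology(seq_a, seq_b):
    """Percent of sites at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Score 1 for agreement and 0 for disagreement at each site, then sum.
    score = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    # The highest possible score is one point per site.
    return 100.0 * score / len(seq_a)


if __name__ == "__main__":
    # Hypothetical sequences: 6 of the 8 sites agree.
    print(simple_homology("ATGCATGC", "ATGGATGA"))  # 75.0
```

In a spreadsheet the same idea would typically put one letter per cell, a comparison in a third row, and a sum at the end.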
Spreadsheet #1: Simple Homology
Finding Instances of a Particular Pattern in a String
• Locating genes involves finding regions of a DNA sequence that contain patterns resembling those of known genes
• Identifying sites on DNA where a restriction enzyme can cleave it; the sizes of the resulting fragments are also of interest
• Identifying regions of RNA that correspond to particular features (e.g. loops) which may be splice sites
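As a rough sketch of the restriction-site problem above, the following Python scans one strand of a linear DNA string for a recognition pattern and reports the fragment sizes after cutting. The function names and the test sequence are hypothetical. EcoRI (recognition site GAATTC, cut after the first G) is used only as a familiar example; the slides do not name a particular enzyme.

```python
def find_sites(dna, pattern):
    """0-based start positions of every (possibly overlapping) match."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    positions = []
    start = dna.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one letter later so overlapping matches are not missed.
        start = dna.find(pattern, start + 1)
    return positions


def fragment_lengths(dna, pattern, cut_offset):
    """Fragment sizes for a linear molecule cut at every site.

    cut_offset is the number of letters into the pattern where the cut falls.
    """
    cuts = [p + cut_offset for p in find_sites(dna, pattern)]
    bounds = [0] + cuts + [len(dna)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    dna = "AAGAATTCGGGGAATTCTT"  # hypothetical sequence with two sites
    print(find_sites(dna, "GAATTC"))           # [2, 11]
    print(fragment_lengths(dna, "GAATTC", 1))  # [3, 9, 7]
```

Only one strand is scanned and the molecule is treated as linear, which is a simplifying assumption of this sketch.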
Describing the Composition and Properties of a String
• Counts and frequencies of particular letters, chosen for their properties (e.g. regions rich in G&C or A&T in DNA)
• Properties of proteins (e.g. charge or hydrophobicity) which depend on the nature and frequencies of the particular amino acids
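The letter-frequency idea above might look like the following in Python. The function names, the window length, and the sample sequences are assumptions made for illustration; a sliding window is one common way to look for G&C-rich regions, though the slides do not say how the regions are found.

```python
from collections import Counter


def composition(seq):
    """Relative frequency of each letter that occurs in the sequence."""
    counts = Counter(seq)
    return {letter: counts[letter] / len(seq) for letter in sorted(counts)}


def gc_content(dna):
    """Fraction of letters that are G or C."""
    return (dna.count("G") + dna.count("C")) / len(dna)


def windowed_gc(dna, window):
    """GC fraction in each window of the given length, sliding one site at a time."""
    if window < 1 or window > len(dna):
        raise ValueError("window must be between 1 and the sequence length")
    return [gc_content(dna[i:i + window]) for i in range(len(dna) - window + 1)]


if __name__ == "__main__":
    print(composition("AATG"))         # {'A': 0.5, 'G': 0.25, 'T': 0.25}
    print(windowed_gc("GGCCAATT", 4))  # [1.0, 0.75, 0.5, 0.25, 0.0]
```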
Spreadsheet #2: Hydropathy Plot
Spreadsheet #2 (cont.)
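A hydropathy plot is usually drawn from a sliding-window average of per-residue hydrophobicity values. The sketch below assumes the Kyte-Doolittle scale and a window of 7 residues; the slides do not say which scale or window the spreadsheet uses, and the function name and peptide are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Mean hydropathy of each window, sliding one residue at a time."""
    if window < 1 or window > len(protein):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical peptide; each printed value is one point on the plot.
    for position, value in enumerate(hydropathy_profile("MKTLLVAGILKRSE")):
        print(position, round(value, 2))
```

Plotting the values against window position gives the hydropathy plot; runs of high values suggest hydrophobic stretches.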
Graphing Evolution and Phylogenetic Trees
• The evolutionary distance between two DNA sequences is used to infer how the sequences changed over time (e.g. the evolution of HIV or the flu viruses)
• Trees are constructed to express the relationships among related sequences; distance in the tree is a monotone function of homology
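One simple distance that is a monotone (decreasing) function of homology is the fraction of sites that differ. The sketch below computes it for every pair and picks the closest pair, which is the usual first step of a distance-based tree-building procedure. The names and sequences are hypothetical, and the slides do not say which distance or tree method the authors use.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of sites that differ; equals 1 minus the homology fraction."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b) / len(seq_a)


def distance_table(seqs):
    """Distance for every unordered pair of named sequences."""
    return {
        (x, y): p_distance(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


def closest_pair(seqs):
    """The pair of names with the smallest distance (joined first in a tree)."""
    table = distance_table(seqs)
    return min(table, key=table.get)


if __name__ == "__main__":
    seqs = {"s1": "ATGCATGC", "s2": "ATGCATGA", "s3": "TTGGATCA"}  # made up
    print(distance_table(seqs))
    print(closest_pair(seqs))  # ('s1', 's2')
```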
Spreadsheet #3: Mutation & Evolution
Spreadsheet #3 (cont.)
To study the evolution of a sequence, we randomly pick a site for mutation, then change its letter.
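The mutation step just described can be sketched as follows. The assumptions here go beyond the slide: sites are chosen uniformly, the new letter is chosen uniformly among the three other DNA letters, and a fixed seed keeps runs repeatable. The function names are hypothetical.

```python
import random


def mutate_once(seq, rng, alphabet="ACGT"):
    """Pick one site at random and replace its letter with a different one."""
    site = rng.randrange(len(seq))
    choices = [letter for letter in alphabet if letter != seq[site]]
    return seq[:site] + rng.choice(choices) + seq[site + 1:]


def evolve(seq, steps, seed=0):
    """Return the sequence after 0, 1, ..., steps successive mutations."""
    rng = random.Random(seed)  # seeded so the experiment can be repeated
    history = [seq]
    for _ in range(steps):
        seq = mutate_once(seq, rng)
        history.append(seq)
    return history


if __name__ == "__main__":
    ancestor = "ATGCATGCATGC"  # made-up starting sequence
    for generation, seq in enumerate(evolve(ancestor, 5)):
        # Homology with the ancestor, as a percentage of matching sites.
        matches = sum(1 for a, b in zip(ancestor, seq) if a == b)
        print(generation, seq, round(100.0 * matches / len(ancestor), 1))
```

Because a later mutation can hit a site that has already changed, homology with the ancestor need not fall by one site per step.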
Conclusion
• Use of a spreadsheet makes possible an experimental approach to introducing the mathematics of sequence analysis
• The use of spreadsheets makes possible the use of real-world data and presents the computational tool in a meaningful context
• The importance of the topics to all educated individuals suggests that the topics be included in many liberal arts math courses