What does mathematics contribute to bioinformatics?

Winfried Just, Department of Mathematics, Ohio University

A new microscope and a new physics
In 2004, PLoS Biology published a paper by Joel E. Cohen, “Mathematics Is Biology's Next Microscope, Only Better; Biology Is Mathematics' Next Physics, Only Better.”
Really? How does this new microscope differ from the traditional ones? How do we use it? Why did mathematicians become seriously interested in biology? And how is all this related to bioinformatics?

More empirical observations
• NSF and NIH recently started to invest heavily in biomathematics.
• In 2002 the Mathematical Biosciences Institute (MBI, located at OSU) was founded; this is the first and so far only NSF institute dedicated exclusively to applications of mathematics in one other area.
• Several other new research institutes in biomathematics are supported from public or private sources.
• A number of new journals specializing in biomathematics have been launched.
• The job market for biomathematicians is currently rather favorable, both in academia and in industry, especially in the pharmaceutical industry.

What is behind this trend?
And why do we observe this trend now, instead of 30 years ago or 30 years from now? There are two main reasons:
• Contemporary biology generates huge mountains of data. Drawing biologically meaningful inferences from these data requires analysis in the framework of good mathematical models. Hence mathematics has become a necessary tool for biology.
• Currently available computer power allows us to investigate mathematical models that are detailed enough to support biologically realistic inferences. Thus mathematics has become a useful tool for biology.

Biomathematics vs. bioinformatics
Everything that has been said so far about “biomathematics” could also be said about “bioinformatics.” What is the difference between the two areas?
Biomathematics: Applications of mathematics to biology.
Bioinformatics: The design, implementation, and use of computer algorithms to draw inferences from massive sets of biomolecular data. It is an interdisciplinary field that draws on knowledge from biology, biochemistry, statistics, mathematics, and computer science.

Example of a huge data set: Genbank
The first viral genome was published in the 1980s; the first bacterial genome, H. influenzae, 1.83 × 10^6 bp, in 1995; the first genome of a multicellular organism, C. elegans, 10^8 bp, in 1998. The draft of our own genome, H. sapiens, π × 10^9 bp, was announced in June 2000.
As of February 2008, Genbank contained 85,759,586,764 bp of information.
How do we draw concrete inferences from such a huge mountain of information?

Where are the genes?
Let us look, for example, at our own genome. The information about it is written in Genbank as a sequence of π × 10^9 letters that would fill a million tightly typed pages, the equivalent of several thousand novels:
...actggtacctgtatatggacgctccatatttaatgcgcgatgcaggatctaaa...
Less than 1.5% of this sequence codes for proteins. How do we find these genes? No human can read the whole sequence. A computer can read it easily, in a few seconds. So maybe the computer will tell us where the genes are, where they start, and where they end.
But what is the computer supposed to compute?

Honest Craig’s Casino
This is a casino in Nevada where one plays 64-number roulette. In each round, a player bets chips on three of those 64 numbers. If one of the three chosen numbers comes up, Honest Craig pays a suitable premium. If not, the player loses the chips.
QUESTION: How long does it take, on average, for a winning number to come up?
ANSWER: 64/3 ≈ 21.33 rounds. (Each round is won with probability 3/64, so the mean waiting time is the reciprocal of that probability.)
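The answer above can be checked with a minimal sketch. It assumes a fair wheel and independent rounds; the function names `expected_wait` and `simulate_wait` are illustrative and do not come from the slides.

```python
import random


def expected_wait(p_win: float) -> float:
    """Mean number of rounds up to and including the first win.

    For independent rounds with win probability p_win, the waiting time
    is geometrically distributed and its mean is 1 / p_win.
    """
    return 1.0 / p_win


def simulate_wait(p_win: float, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the same mean by simulating many independent players."""
    rng = random.Random(seed)
    total_rounds = 0
    for _ in range(trials):
        rounds = 1
        # Keep playing until a round is won.
        while rng.random() >= p_win:
            rounds += 1
        total_rounds += rounds
    return total_rounds / trials


if __name__ == "__main__":
    p = 3 / 64  # three chosen numbers out of 64
    print(f"exact mean wait:     {expected_wait(p):.2f} rounds")
    print(f"simulated mean wait: {simulate_wait(p):.2f} rounds")
```

The simulated value should land close to 64/3, with the gap shrinking as `trials` grows.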

Probability of long waiting times
Let us assume that Craig is as honest as he claims. Then the probability P(k) that our player will keep losing throughout the first k rounds is P(k) = (61/64)^k. In particular, starting from k = 50 we obtain the following probabilities:
P(50) = 0.0907   P(51) = 0.0864   P(52) = 0.0824   P(53) = 0.0785   P(54) = 0.0748
P(55) = 0.0713   P(56) = 0.0680   P(57) = 0.0648   P(58) = 0.0618   P(59) = 0.0589
P(60) = 0.0561   P(61) = 0.0535   P(62) = 0.0510   P(63) = 0.0486   P(64) = 0.0463
P(65) = 0.0441   P(66) = 0.0421   P(67) = 0.0401   P(68) = 0.0382   P(69) = 0.0364
P(100) = 0.0082   P(200) = 0.000068   P(300) = 0.00000056
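The table can be reproduced with a few lines of code. This is a sketch under the slide's assumptions (three chosen numbers, 64 equally likely outcomes, independent rounds); the names `p_losing_streak` and `smallest_significant_streak` are made up for illustration.

```python
def p_losing_streak(k: int, n_bet: int = 3, n_total: int = 64) -> float:
    """Probability of losing each of the first k rounds on a fair wheel."""
    return ((n_total - n_bet) / n_total) ** k


def smallest_significant_streak(alpha: float = 0.05) -> int:
    """Smallest k for which a losing streak of length k has probability below alpha."""
    k = 0
    while p_losing_streak(k) >= alpha:
        k += 1
    return k


if __name__ == "__main__":
    for k in list(range(50, 70)) + [100, 200, 300]:
        print(f"P({k}) = {p_losing_streak(k):.3g}")
    print("first k with P(k) < 0.05:", smallest_significant_streak())
```

Under these assumptions the search stops at k = 63, the first entry in the table that drops below 0.05.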

Some statistical terminology
The assumption that Craig is as honest as he claims will be our null hypothesis. The suspicion that he is cheating after all is our alternative hypothesis.
The number of losses that precede the first winning round will be our test statistic. The p-value is the probability that the test statistic takes the observed or a more extreme value under the assumption of the null hypothesis. If the p-value falls below our agreed-upon significance level, we are justified in rejecting the null hypothesis. In science, the most commonly used significance level is 0.05.
Falsely accusing Honest Craig of cheating would be a Type I error; trusting him when he is in fact cheating would be a Type II error.
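To make the two error types concrete, here is a small sketch for the decision rule "accuse Craig if the player loses at least 63 rounds in a row." The cheating scenario is purely hypothetical: it assumes a rigged wheel on which the player wins with probability 1/64 per round instead of 3/64. The function names are illustrative as well.

```python
def prob_at_least_k_losses(k: int, p_win: float) -> float:
    """Probability that the first k rounds are all losses."""
    return (1.0 - p_win) ** k


def error_rates(threshold: int, p_win_honest: float, p_win_cheating: float):
    """Return (Type I rate, Type II rate) for the rule
    'reject the null hypothesis after at least `threshold` straight losses'."""
    # Type I: the wheel is fair, but the streak is long enough to accuse Craig.
    type_1 = prob_at_least_k_losses(threshold, p_win_honest)
    # Type II: the wheel is rigged, but the streak ends before the threshold.
    type_2 = 1.0 - prob_at_least_k_losses(threshold, p_win_cheating)
    return type_1, type_2


if __name__ == "__main__":
    HONEST = 3 / 64    # null hypothesis: fair wheel
    CHEATING = 1 / 64  # hypothetical alternative, chosen only for illustration
    t1, t2 = error_rates(63, HONEST, CHEATING)
    print(f"Type I error rate:  {t1:.4f}")
    print(f"Type II error rate: {t2:.4f}")
```

The Type I rate is fixed by the significance level, whereas the Type II rate depends entirely on how the alternative hypothesis is specified.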

Craig Venter’s Lab
In 1995 Craig Venter’s team sequenced the genome of the bacterium H. influenzae. If we want to detect the positions of its 1740 protein-coding genes in its sequence of 1,830,140 base pairs, we can reason as follows:
In bacteria almost all of the genome codes for proteins. Let us start from position n and read triplets: (n, n+1, n+2), (n+3, n+4, n+5), …
If we read in the correct reading frame, we will read a sequence of codons that ends with a STOP codon, that is, TAA, TGA, or TAG. Such a STOP codon will appear on average once in about 300 triplets.
If we read in one of the other five reading frames, we will read garbage, that is, a more or less random sequence of triplets, and one of the triplets TAA, TGA, TAG will be encountered on average once every 64/3 ≈ 21.33 triplets.
Rings a bell?
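Reading a sequence in all six frames (three offsets on each strand) can be sketched as follows. The sketch uses uniformly random bases as a stand-in for "garbage," which is a simplifying assumption, and all function names are illustrative.

```python
import random

STOP_CODONS = {"TAA", "TGA", "TAG"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]


def reading_frames(seq: str):
    """Yield (label, codons) for the six reading frames of seq."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Complete triplets only; a trailing partial codon is dropped.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            yield f"{strand}{offset}", codons


def stop_fraction(codons) -> float:
    """Fraction of triplets that are STOP codons."""
    return sum(c in STOP_CODONS for c in codons) / len(codons)


def random_dna(length: int, seed: int = 0) -> str:
    """Uniformly random bases: a crude model of a non-coding frame."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    seq = random_dna(300_000)
    for label, codons in reading_frames(seq):
        frac = stop_fraction(codons)
        print(f"frame {label}: one STOP every {1 / frac:.1f} triplets")
```

On random input every frame should show roughly one STOP per 21 triplets; in a real bacterial genome the coding frame would stand out through much longer STOP-free stretches.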

This is the same problem!
With minor modifications: now our null hypothesis will be that we read in the wrong reading frame, and the alternative hypothesis will be that we read a coding sequence in the correct reading frame.
If we don’t encounter a STOP codon while reading 63 successive triplets, we can reject the null hypothesis at significance level 0.05 and conclude that we have found a sequence that codes for a protein whose end is easy to find.
So we can design an easy gene-finding algorithm based on finding these so-called ORFs (open reading frames).
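A bare-bones version of such an ORF scan might look like the sketch below. It implements only the rule stated on this slide (a STOP-free run of at least 63 triplets that is terminated by a STOP codon); it ignores START codons and is not meant to represent any production gene finder. The name `find_orfs` and the tuple layout of the results are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 63):
    """Report STOP-free runs of at least min_codons triplets in all six frames.

    Each hit is (strand, start, end, n_codons). Positions are 0-based and refer
    to the strand that was scanned ("-" means the reverse complement); `end`
    points just past the terminating STOP codon.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            run_start = offset
            for i in range(offset, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    n_codons = (i - run_start) // 3
                    if n_codons >= min_codons:
                        hits.append((strand, run_start, i + 3, n_codons))
                    # The next candidate run begins right after this STOP.
                    run_start = i + 3
    return hits


if __name__ == "__main__":
    # Toy input: 70 alanine codons followed by a STOP codon.
    toy = "GCT" * 70 + "TAA"
    for hit in find_orfs(toy):
        print(hit)
```

The default threshold of 63 triplets corresponds to the 0.05 significance level derived from the roulette table; raising it trades spurious hits for missed short genes.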

Some caveats
• The beginning of the gene is somewhat more difficult to determine, since ATG is both the START codon and the codon for methionine, and the promoter is also part of the gene.
• The “garbage” in the other five reading frames is not completely random.
• This approach will miss all genes that code for proteins shorter than 63 amino acids (Type ? error) and will sometimes discover spurious genes (Type ? error).
• This approach is unsuitable for discovering RNA-coding genes.
However, the above problems can be solved, and there exist good gene-finding algorithms based on this idea.

Craig Venter’s lab in 2000
But now let us look at the genome of H. sapiens:
• Protein-coding regions constitute only a small fraction of our genome. All by itself, this would lead to a lot more Type I errors than in prokaryotes.
• The coding sequences, exons, are interspersed with introns.
• A given codon may be split by an intron.
• Consecutive exons don’t have to sit in the same reading frame.
• Introns look similar to random sequences.
So we are faced with a much more difficult problem. Nowadays there exist pretty good algorithms for finding genes in eukaryotes. But: no algorithm for finding genes in prokaryotes will work here.

Mathematics and mathematicians
• Mathematics is a great language for elucidating the common structure in apparently unrelated problems.
• Mathematicians have a tendency to talk about complicated theories in their jargon instead of giving simple and concrete answers.
• “Mathematical microscopes” often don’t come with a simple user’s manual. In order to use them successfully, one needs to understand to some extent how they work. The choice of the most appropriate “mathematical microscope” for a given biological problem often requires active cooperation between mathematicians and biologists.
• The key to success in this type of cooperation is finding a common language and mutual understanding of and respect for the two different intellectual approaches.
• Mathematical models form the basis for formulating hypotheses, often in the form of probabilities.
• The final interpretation of these hypotheses and their experimental verification belongs to the biologists. Thus “mathematical microscopes” will not make the more traditional ones redundant.
In points 3-6, feel free to substitute “bioinformatics” for “mathematics.”

Biomathematics vs. bioinformatics
Biomathematics: Applications of mathematics to biology.
Bioinformatics: The design, implementation, and use of computer algorithms to draw inferences from massive sets of biomolecular data. It is an interdisciplinary field that draws on knowledge from biology, biochemistry, statistics, mathematics, and computer science.
The design of all bioinformatics tools is based on mathematical models. In order to choose the most appropriate among the available tools and draw proper inferences, one needs to understand these models.
