BioInformatics at FSU – what it is, who’s doing it, and why it needs to be done now.

BioInformatics at FSU – what it is, who’s doing it, and why it needs to be done now. Steve Thompson Florida State University School of Computational Science and Information Technology (CSIT)

Introductory outline: • What is bioinformatics, genomics, sequence analysis, computational molecular biology . . . • Reverse Biochemistry & Evolution. • Database growth & cpu power. • A very brief ‘Show and Tell,’ • NCBI Resources, GCG’sSeqLab, phylogenetics. • High quality training is essential! • Graduates need to be competitive on a world biotechnology market. • The University’s role in all of this; out-reach.

My definitions: • Biocomputing and computational biology are synonymous and describe the use of computers and computational techniques to analyze any biological system, from molecules, through cells, tissues, and organisms, all the way to populations. • Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available biological databases. • Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, mechanism, interactions, evolution, and perhaps structure of biological molecules. • Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes. • Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

The reverse biochemistry analogy: • from a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ‘round. • Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and, structural insights into a gene product, without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques. • The computer and molecular databases are an essential part of this process.

The exponential growth of molecular sequence databases & cpu power. Year BasePairs Sequences 1982 680338 606 1983 2274029 2427 1984 3368765 4175 1985 5204420 5700 1986 9615371 9978 1987 15514776 14584 1988 23800000 20579 1989 34762585 28791 1990 49179285 39533 1991 71947426 55627 1992 101008486 78608 1993 157152442 143492 1994 217102462 215273 1995 384939485 555694 1996 651972984 1021211 1997 1160300687 1765847 1998 2008761784 2837897 1999 3841163011 4864570 2000 11101066288 10106023 2001 15849921438 14976310 2002 28507990166 22318883 http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html Doubling time ~ 1 year!

Database growth (cont.) • The Human Genome Project and numerous other genome projects have kept the data coming at alarming rates. As of April 2003, (50 years after the Watson-Crick double-helix!)16 Archaea, 128 Bacteria, and 10 Eukaryote complete, finished genomes; and 4 Vertebrate and 5 Plant essentially complete genome maps are publicly available for analysis; not counting all the virus and viroid genomes available. • The International Human Genome Sequencing Consortium announced the completion of a "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first assembly of the human genome. Both articles were published mid-February 2001 in the journals Science and Nature.

Some neat stuff from those papers: • We, Homo sapiens, aren’t nearly as special as we had once hoped we were. Of the 3.2 billion base pairs in our DNA — • Traditional, text-book estimates of the number of genes were often in the 100,000 range; turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 35,000! • The protein coding region of our genome is only about 1% or so, much of the remainder ‘junk’ is ‘jumping,’ ‘selfish DNA’ of which much may be involved in regulation and control. Understanding this network is a huge challenge. • 100-200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be not true by more extensive analyses, and to be due to gene loss rather than transfer.)

What are primary sequences? • (Central Dogma: DNA —> RNA —> protein) • Primary refers to one dimension — all of the ‘symbol’ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide. • The symbols are the one letter alphabetic codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural information are not included within this sequence, however, much of this type of information is available in the reference documentation sections associated with primary sequences in the databases.

What are sequence databases? • These databases are an organized way to store the tremendous amount of sequence information that accumulates from laboratories worldwide. Each database has its own specific format. Three major database organizations around the world are responsible for maintaining most of this data; they largely ‘mirror’ one another. • North America: National Center for Biotechnology Information (NCBI): GenBank & GenPept. • Also Georgetown University’s NBRF Protein Identification Resource: PIR & NRL_3D. • Europe: European Molecular Biology Laboratory (also EBI & ExPasy): EMBL & Swiss-Prot. • Asia: The DNA Data Bank of Japan (DDBJ).

Content & organization: • Most sequence database installations are examples of complex ASCII/Binary databases, but they usually are not Oracle or SQL or Object Oriented (proprietary ones often are). They often contain several very long text files containing different types of information all related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing index functions. • Software is usually required to successfully interact with these databases and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise. Nucleic acid databases are split into subdivisions based on taxonomy (historical). Protein databases are often organized into sections by level of annotation.

What are other biological databases? • Three dimensional structure databases: • the Protein Data Bank and Rutgers Nucleic Acid Database. • Still more; these can be considered ‘non-molecular’: • Reference Databases: e.g. • OMIM — Online Mendelian Inheritance in Man • PubMed/MedLine — over 11 million citations from more than 4 thousand bio/medical scientific journals. • Phylogenetic Tree Databases: e.g. the Tree of Life. • Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes). • Population studies data — which strains, where, etc. • And then databases that most biocomputing folk don’t even usually consider: • e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . .

So how do you do bioinformatics? Often on the InterNet over the World Wide Web — Site URL (Uniform Resource Locator) Content Nat’l Center Biotech' Info' http://www.ncbi.nlm.nih.gov/ databases/analysis/software PIR/NBRF http://www-nbrf.georgetown.edu/ protein sequence database IUBIO Biology Archive http://iubio.bio.indiana.edu/ database/software archive Univ. of Montreal http://megasun.bch.umontreal.ca/ database/software archive Japan's GenomeNet http://www.genome.ad.jp/ databases/analysis/software European Mol' Bio' Lab' http://www.embl-heidelberg.de/ databases/analysis/software European Bioinformatics http://www.ebi.ac.uk/ databases/analysis/software The Sanger Institute http://www.sanger.ac.uk/ databases/analysis/software Univ. of Geneva BioWeb http://www.expasy.ch/ databases/analysis/software ProteinDataBank http://www.rcsb.org/pdb/ 3D mol' structure database Molecules R Us http://molbio.info.nih.gov/cgi-bin/pdb/ 3D protein/nuc' visualization The Genome DataBase http://www.gdb.org/ The Human Genome Project Stanford Genomics http://genome-www.stanford.edu/ various genome projects Inst. for Genomic Res’rch http://www.tigr.org/ esp. microbial genome projects HIV Sequence Database http://hiv-web.lanl.gov/ HIV epidemeology seq' DB The Tree of Life http://tolweb.org/tree/phylogeny.html overview of all phylogeny Ribosomal Database Proj’ http://rdp.cme.msu.edu/html/ databases/analysis/software WIT Metabolism http://wit.mcs.anl.gov/WIT2/ metabolic reconstruction Harvard Bio' Laboratories http://golgi.harvard.edu/ nice bioinformatics links list

What other resources are available? • Desktop software solutions — public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So, • commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc., • but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters! • Therefore, UNIX server-based solutions, public domain or commercial (e.g. the Accelrys GCG Wisconsin Package[a Pharmacopeia Co.]): the SeqLab Graphical User Interface. • One commercial license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere!

University bioinformatics objectives: • The university tripartite mission — • Education, Research, and Service. • Education: out-reach programs and undergraduate and graduate courses. • Research: bioinformatics is becoming an indispensable tool in most biological research, particularly in molecular and cellular biology. • Service: those faculty and staff that know bioinformatics should be available to assist with consultation, systems administration, and hardware access.

Education: • Workshops — continue to teach GCG SeqLab tutorial series; each of the four sessions offered once per semester. • Modules —across the university curricula within existing courses, interdisciplinary by nature, implications, & ethics. • Graduate and Undergraduate Courses — presently three cross-listed biology courses; one introductory, team-taught survey, stressing practical, project-oriented approaches; one advanced algorithms lecture; one programming practicum. • Computational Molecular Biology Program — proposed; to be in association and cooperating with students’ present major department, coordinated by CSIT. Pros and cons . . . • Summer Short Course — long-range ‘dream.’ Participants from world-wide disparate disciplines learning bioinformatics techniques and theory.

GCG SeqLab workshop series: • Four different sessions — • Intro’ to SeqLab & Multiple Sequence Analysis and its supplement, • Rational Primer Design, • Database Searching & Pairwise Comparisons — Significance, • Molecular Evolutionary Phylogenetics. • http://bio.fsu.edu/~stevet/workshop.html FOR MORE INFO...

Modules in existing courses: • Cooperate with extant programs to incorporate bioinformatics into their existing curricula. • Key is to demonstrate necessity of knowledge & offer full cooperation with departments. • Potential courses exist across many different departments, and even across different colleges. Identify potential courses from the General Catalog and approach individual instructors and chairs. • So far: http://bio.fsu.edu/~stevet/modules.html

Courses at Florida State: • Four different Special Topics Biology Department BSC4933/5936 Bioinformatics sections — • First (first offered Spring 2002)— • “Introduction to Bioinformatics” Steve Thompson et al. • Covers both sequence and structural analysis. • Team-taught; lecture + optional lab; pragmatic, real-world, project-oriented approach. • Survey level — introduction to the theory + practical applications. Pluses and minuses; the problems. • Based on Washington State University’s Biochemistry 578 model. • Required by Biomedical Mathematics program.

Courses (cont.) • Second (first offered Fall 2002) — “Programming Skills for Computational Biology and Bioinformatics” David Swofford. The Java model, an object oriented framework. • Third (first offered Spring 2003) — “Advanced Bioinformatics: Computational Methods” David Swofford. The theory behind sequence analysis algorithms. • New (Fall 2003) — “Genomics and Evolution” Thomas Hansen.

Courses (cont.) • Departments other than Biology: • Mathematics — • MAP 5485 “Introduction to Mathematical Biophysics” Jack Quine. Mathematical tools in Biophysics. • an integral part of their Biomedical Mathematics Program. • Institute of Molecular Biophysics — Center Of Excellence In Biomolecular Computer Modeling & Simulation. • In all courses — • don’t ignore implications, ramifications, & ethics of bioinformatics research.

Undergraduate opportunities: • A special undergraduate fellowship — • The Howard Hughes Undergraduate Program in Mathematical and Computational Biology. • Twelve Hughes Fellows per year earn a $5000 stipend, a $1200 summer housing allowance, and a $1000 professional meeting allowance. • Supported by two new undergraduate majors programs —one each in the Mathematics and Biology Departments.

Computational Biology Program: • Presently FSU computational biology is composed of a confusing mix of undergraduate and graduate programs across at least three different Departments from the College of Arts and Sciences. • We propose a CSIT Coordinated ‘balloon’ program in association with student’s major department that would consolidate these efforts. Pros and Cons . . . • Undergraduate and/or Graduate? • Avoid duplication of effort. • Candidate department collaborations.

Summer short course: • Long-range ‘pipe dream’? • Broad spectrum — both instructors and students from many different disciplines and world-wide distribution. • See MBL Mol’ Evol’ Workshop for a model. • One or two weeks? • On-campus room & board support. • To be undertaken this course will need the potential to achieve an exceptional world-wide reputation!

Bioinformatics degree programs around the world: • Relatively rare, but more are being created all the time. Biocomputing education URL’s are documented at: • http://www.csit.fsu.edu/HHP/gradprog.html • Most are graduate course lists, many are graduate Masters or Ph.D. programs, some include undergraduate courses or programs. There is a huge need for bioinformatics education worldwide.

Conclusions: • Gunnar von Heijne in his old but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion: • “Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.” • He continues: • “. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”

Many fine texts are also starting to become available in the field. To ‘honk-my-own-horn’ a bit, check out the new — Current Protocols in Bioinformatics from John Wiley & Sons, Inc: http://www.does.org/cp/bioinfo.html. They asked me to contribute a chapter on multiple sequence analysis using GCG software. Humana Press, Inc. also asked me to contribute. I’ve got two chapters in their — Introduction to Bioinformatics: A Theoretical And Practical Approach http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0. Both volumes are now available. FOR MORE INFO... Visit my Web page: http://bio.fsu.edu/~stevet/cv.html. Contact me (stevet@bio.fsu.edu) for specific bioinformatics assistance and/or long distance collaboration.

BioInformatics at FSU – what it is, who’s doing it, and why it needs to be done now.