Introduction to bioinformatics. Based on Kim’s article “Computers are from Mars, organisms are from Venus”, Computer, p25-32, July, 2002. CS480-01 Computer Science Seminar, Fall, 2002.
Merging computer science & biology: motivation • Erwin Schrödinger envisioned life as an aperiodic crystal, suggesting that the structure of life is neither periodic nor amorphous. As a result, classical mathematical tools have not been satisfactory for biological analysis. • Combining elegant algorithms with brute-force calculation offers a reasonable approach to this aperiodic structure.
Merging computer science & biology: motivation continued • Computing seeks to create a machine that can flexibly solve diverse problems. In nature, such plastic problem solving resides uniquely in the domain of organic matter. Whether through historical evolution or individual behavior, organisms always adaptively solve the problems their environment poses. • Thus, examining how organisms solve problems can lead to new approaches to computation and algorithm development, e.g., DNA computers.
Computers are needed to analyze huge amounts of information • Until recently, the major activity in biology had been information (data) gathering. The amount of information gathered over the last five years, especially at the molecular level, has been overwhelming; e.g., the information in GenBank (http://www.ncbi.nlm.nih.gov/) has nearly doubled every 18 months, mainly due to improvements in biotechnology. Ten years ago, it took 5 days to obtain 200 base pairs of DNA sequence data. Today, with the Human Genome Project, the figure has increased to 28 million base pairs a month. • The two most successful uses of computers in biology are: • comparative sequence analysis • in silico cloning
Comparative sequence analysis • When researchers isolate a biomolecular sequence in the lab, they want to know whether it is similar to any existing sequence. • By extrapolating information from similar, already well-studied sequences, scientists can learn a great deal about the newly isolated sequence. • BLAST (Basic Local Alignment Search Tool), a tool for searching sequence databases such as GenBank, is used by most researchers when they isolate a sequence.
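To illustrate what "local alignment" means, here is a minimal sketch of the classic Smith-Waterman dynamic-programming algorithm. This is not how BLAST works internally (BLAST uses faster heuristics); the function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; a cell never drops below 0, which is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences that share the fragment ACCAGTC.
print(smith_waterman("TTTACCAGTCGGG", "CCCACCAGTCAAA"))
```

The table costs time and memory proportional to the product of the two sequence lengths, which is why database-scale searches rely on heuristic tools instead.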
In silico cloning • The process of using a computer search of existing databases to clone a gene. • For example, one may want to clone a gene by its phenotype, such as olfaction; by its structure, such as a G protein-coupled receptor; or by a pattern fragment, such as the DNA pattern “ACCAGTC”. • Computational algorithms have been designed to guide the otherwise extremely time-consuming and expensive wet-lab experiments, with great success. For example, the fruit fly’s olfactory genes, a 15-year-old problem, were isolated and identified with the help of a computer algorithm named QFC (quasiperiodic feature classifier), Bioinformatics, vol. 16, 2000, p767-775.
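The simplest of these searches, finding a pattern fragment such as “ACCAGTC”, can be sketched as follows. The record names and sequences are hypothetical stand-ins for database entries, and the helper names are assumptions; a real search would query a database such as GenBank rather than an in-memory dictionary.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_pattern(records, pattern):
    """Return (name, strand, position) for every occurrence of pattern on either strand.

    Positions are 0-based offsets into the stored (forward) sequence.
    """
    hits = []
    for name, seq in records.items():
        for strand, target in (("+", pattern), ("-", reverse_complement(pattern))):
            start = seq.find(target)
            while start != -1:
                hits.append((name, strand, start))
                start = seq.find(target, start + 1)
    return hits


# Hypothetical records; seq2 carries the pattern on the reverse strand.
records = {
    "seq1": "GGACCAGTCAA",
    "seq2": "TTGACTGGTCC",
    "seq3": "AAAAAAA",
}
print(find_pattern(records, "ACCAGTC"))
```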
Other challenging tasks • Collecting and integrating vast amounts of information from distributed, heterogeneous databases into a coherent information set. This has been a challenge because of the many problems that need to be addressed; e.g., the same object may be given many different names, or database specialists may define a gene differently. (Currently, most major databases, such as GenBank and Swiss-PROT (http://www.ebi.ac.uk/swissprot/index.html/), operate partly by human curation and partly by automated tools.)
Other challenging tasks continued • Annotating raw data collected from genome projects with all the relevant information (such as whether a stretch of DNA contains an amino acid coding sequence, transposons, or a regulatory sequence, and, if an amino acid sequence is coded, what its putative function is). As of now, genome projects generate raw data without giving them biological meaning, and biologists rely on existing information to extrapolate knowledge about novel biomolecules. • One complete annotation addressed the 3,000,000 bases that surround the Drosophila melanogaster ADH sequence (http://www.flybase.org/). • Given the rate at which researchers now generate DNA sequence information, automatically annotating the raw data presents a computational challenge, and careful human analysis is becoming increasingly difficult. The tools necessary for addressing this problem include gene prediction, gene classification, comparative genomics, and evolutionary modeling.
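As a toy illustration of the first annotation question (does a stretch of DNA contain a coding sequence?), the sketch below scans the three forward reading frames for open reading frames that run from a start codon (ATG) to a stop codon. Real gene prediction is far more involved; the function name, the `min_codons` threshold, and the test sequence are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    start codon but not the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Hypothetical fragment containing one short ORF: ATG GCT GCT TAA.
dna = "CCATGGCTGCTTAAGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

A fuller annotator would also scan the reverse strand and weigh evidence such as codon usage and similarity to known genes.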
Other challenging tasks continued • Synthesizing information (data) into general theories remains a challenging task. • For example, researchers working on the same object have made hundreds of independent observations, which appear in thousands of research articles that use scores of variations in terminology, methodology, and so forth. • To cope with the information explosion, we may need a computer system that performs automatic knowledge extraction and produces synthetic new information. Such a tool may be used in many other fields as well (http://www.cmu.edu/cald/research.html).
Other computer tasks frequently performed by biologists today • Biomolecular sequence alignment • Assembly of DNA pieces • Multivariate analysis of large-scale gene expression • Metabolic pathway analysis
Computational biology’s holy grail • Predicting molecular structure • Computing the genotype-phenotype map
Predicting molecular structure • Given a molecule’s sequence identity, predict its 3-D structure and, from the structure, infer the molecular function. • What’s the challenge? • The genetic code specifies 20 amino acids. • Proteins consist of approx. 1,000 different major structures called folds, each with tens of thousands of variations. • In proteins, the physical forces that govern the interaction of the hundreds to thousands of amino acid residues determine the structure, and we do not know the details of these interactions. (Even if we did, it would be an extremely difficult many-body problem.) • Still, some significant progress has been made, thanks in part to CASP (Critical Assessment of Structure Prediction, http://www.ncbi.nlm.nih.gov/structure/Research/casp3/index.shtml).
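The sequence side of this problem is the easy part. As a small sketch, the standard genetic code can be written as a lookup table from the 64 DNA codons to 20 amino acids plus a stop signal, and translating a coding sequence is then a simple loop. The compact string encoding (bases in TCAG order) and the names `CODON_TABLE` and `translate` are choices made for this illustration; nothing here predicts structure.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons enumerated in
# TCAG order (first base varies slowest). "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence into one-letter amino acids, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(len(CODON_TABLE))                 # number of codons
print(len(set(AMINO_ACIDS) - {"*"}))    # number of distinct amino acids
print(translate("ATGGCTGCTTAA"))        # hypothetical coding fragment
```

Going from the resulting amino acid string to a 3-D fold is where the difficulty described above begins.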
From structure to function • A protein’s structure approximately determines its molecular functions, such as catalysis, DNA binding, and cell component binding. • Some researchers believe that a relational map between structure and function should be deducible (a “third genetic code”). This idea also drives “rational drug design”. It remains beyond reach because knowledge about protein function is unclear in both theory and practice. Also, an object’s function often depends on context: for example, the function of a screw that holds a chair together is quite different from that of a screw in a car jack.
From genotype to phenotype • A genotype refers to the genetically encoded information in an individual genome; it consists of the sequence identity of a person’s entire DNA. • The phenotype refers to any measured trait of a particular individual, such as hair color, body weight, propensity for a heart attack, and so on. (Drug companies have begun to ask how the efficacy of a given drug treatment interacts with the recipient’s genotype.)
The DNA computer • Motivated by the lack of efficient algorithms for NP-complete problems (NP: nondeterministic polynomial time), Adleman suggested using DNA for computation. He used sequence-specific hybridization of DNA molecules and the polymerase chain reaction to solve the problem of finding a Hamiltonian path in a directed graph. • Additional research has shown that, by using DNA’s ability to find complementary sequence pairs, DNA can be used to encode a universal computer. • A practical DNA computer remains elusive because: • Encoding the problem and reading the output are extremely time-consuming. • The computation is inherently error-prone. • The amount of DNA required to solve a practical hard problem is very large.
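The problem Adleman attacked can be stated in a few lines of conventional code. The sketch below is a software analogy only: it enumerates every candidate ordering of the vertices and keeps those that follow existing edges, loosely mirroring the generate-and-filter character of the DNA experiment. The graph is a hypothetical example, not Adleman's, and the function name is an assumption.

```python
from itertools import permutations


def hamiltonian_paths(vertices, edges, start, end):
    """Return all paths from start to end that visit every vertex exactly once."""
    edge_set = set(edges)
    inner = [v for v in vertices if v not in (start, end)]
    paths = []
    # Generate every ordering of the intermediate vertices (factorial growth),
    # then filter out orderings that use a missing edge.
    for middle in permutations(inner):
        path = (start, *middle, end)
        if all((u, v) in edge_set for u, v in zip(path, path[1:])):
            paths.append(path)
    return paths


# Hypothetical directed graph on five vertices.
vertices = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (1, 3), (2, 1), (3, 1)]
print(hamiltonian_paths(vertices, edges, start=0, end=4))
```

The number of candidate orderings grows factorially with the number of vertices, which is the blow-up that massive molecular parallelism was hoped to absorb.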
Genetic algorithm • John Holland laid the foundation in the ’60s. • The algorithm emulates the evolutionary adaptive behavior of real organisms. • Three components: • The organism must have a property or suite of properties that governs its differential survival. • The individuals should inherit these properties. • A mechanism should generate variations of these properties via mutation. • Evolutionary computing thus involves generating a population of computer programs and, by tying their survival to their problem-solving ability, selecting those that are particularly good at solving a posed problem.
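The three components above can be sketched in a few lines. This is a minimal illustration, not Holland's formulation: individuals are bit strings rather than programs, there is no crossover, and the toy fitness function (count the 1 bits), the parameter values, and the name `evolve` are all assumptions.

```python
import random


def evolve(fitness, genome_length=20, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Evolve bit-string genomes; return (best_genome, best_fitness_per_generation)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Differential survival: only the fitter half of the population survives.
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            # Inheritance: a child starts as a copy of a surviving parent.
            parent = rng.choice(survivors)
            # Variation: each bit flips with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in parent]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness), history


# Toy problem: maximize the number of 1 bits in the genome.
best, history = evolve(fitness=sum)
print(sum(best), history[0], history[-1])
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.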
Terminology • phe·no·type • a. The observable physical or biochemical characteristics of an organism, as determined by both genetic makeup and environmental influences. • b. The expression of a specific trait, such as stature or blood type, based on genetic and environmental influences.
Terminology continued • trans·po·son • A segment of DNA that is capable of moving to a new position within the same or another chromosome, plasmid, or cell and thereby transferring genetic properties such as resistance to antibiotics. • dro·soph·i·la • Any of various small fruit flies of the genus Drosophila, especially D. melanogaster, used extensively in genetic research.
Terminology continued • phy·log·e·ny • 1. The evolutionary development and history of a species or higher taxonomic grouping of organisms. Also called phylogenesis. • 2. The evolutionary development of an organ or other part of an organism: the phylogeny of the amphibian intestinal tract. • 3. The historical development of a tribe or racial group.
Genomics • Each cell of a living organism contains chromosomes composed of a sequence of DNA base pairs. This sequence, the genome, represents a set of instructions that controls the replication and function of each organism. • Genomics: By allowing researchers to decode entire genomes, the automated DNA sequencer gave birth to genomics, the analytic and comparative study of genomes.
Other articles of interest • Genome sequence assembly: algorithms and issues. • Toward new software for computational phylogenetics. • BioSig: an imaging bioinformatics system for studying phenomics. • A random walk down the genomes: DNA evolution in Valis (Vast active living intelligent system). • Interactively exploring hierarchical clustering results.