How Bioinformatics Can Change Your Life: Basic Concepts of Bioinformatics • M. Alroy Mascrenghe MBCS, MIEEE, MIT • mark_ai@yahoo.com • A lecture given for the BCS Wolverhampton Branch at the University of Wolverhampton • http://www.geocities.com/mark_ai/
TOC • Introduction • Basic concepts in Molecular biology • Bioinformatics techniques • Areas in bioinformatics • Applications • Related Computer Technology • Conference in Glasgow • Acknowledgements • Reference M.Alroy Mascrenghe
Introduction
2000 • A major event happened that was to change the course of human history • It was a joint British and American effort • Nothing to do with Iraq! • It was a race: who would finish first? • Race test: not whether they had taken drugs but whether they could produce them! • The human genome was sequenced (the working draft was announced in June 2000)
A Situation… somewhere in the near future • A virus (not the ‘I love you’ virus) creates an epidemic • Geneticists and bioinformaticians roll up their sleeves • The genetic material of the virus is compared with the existing database of known genetic material of other viruses • As the characteristics of the other viruses are known, close matches suggest how the new virus behaves • From the genetic material, computer programs derive the proteins necessary for the survival of the virus • Once the protein (sequence and structure) is known, medicines can be designed
What is Bioinformatics? • The marriage between computer science and molecular biology • The algorithms and techniques of computer science are used to solve the problems faced by molecular biologists • ‘Information technology applied to the management and analysis of biological data’ • Storage and analysis are two of its important functions; bioinformaticians build tools for each
[Figure: Bioinformatics shown together with Biology, Chemistry, Computer Science and Statistics]
What is Bioinformatics? (continued) • This is the age of information technology • However, storing information is nothing new • Information equal in volume to the Encyclopaedia Britannica is stored in each of our cells • ‘Bioinformatics tries to determine what info is biologically important’
Basics of Molecular Biology
DNA & Genes • DNA is where the genetic information is stored • Traits such as blond hair and blue eyes are inherited through it • Gene: the basic unit of heredity • There are genes for characteristics, e.g. a gene for blond hair • Genes are abstract concepts, like longitude and latitude, in the sense that you cannot see them separately • Genes are made up of nucleotides and hold their information as a sequence of nucleotides
Nucleotide (nt) • Each nt is made up of • a sugar • a phosphate group • a base • The base is the only difference between one nt and another • There are 4 different bases • G(uanine), A(denine), T(hymine), C(ytosine) • The information lies in the order of the nucleotides: the order is the info • Genes can be many thousands of nt long • The complete set of genetic instructions is called the genome
Chromosomes • Strings of DNA make up chromosomes • Analogy • Letters: nt • Sentences: genes • Individual volumes of the Encyclopaedia Britannica: chromosomes • All volumes together: the genome
Double Helix • DNA is a double helix • Each strand carries complementary information • Each base in one strand is bonded with a particular base in the other strand • G - C • A - T • For example • AATGC one strand • TTACG other strand
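The pairing rule on this slide (G with C, A with T) can be sketched in a few lines of Python. The function names are illustrative and not from the lecture; `reverse_complement` is an extra, included because the two strands of real DNA run in opposite directions.

```python
# Base-pairing table taken from the slide: A<->T and G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base complementary strand."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("AATGC"))  # the slide's example pair
```

Applied to the slide's example, `complement("AATGC")` gives `TTACG`.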
Proteins • Proteins are very important biological molecules • Amino acids make up the proteins • There are 20 different amino acids • The function of a protein depends on the order of its amino acids
Proteins… • The information required to build proteins from amino acids (aa) is stored in DNA • DNA sequence determines amino acid sequence • Amino acid sequence determines protein structure • Protein structure determines protein function • A substance called RNA carries the information stored in the DNA, which in turn is used to make proteins • Storage: DNA • Information transfer: RNA • RNA is the messenger boy!
Central dogma • DNA → RNA: transcription (carried out by RNA polymerase) • RNA → Protein: translation (carried out by ribosomes)
Proteins….. • Since there are 20 amino acids to encode, one nt cannot correspond to one aa (only 4 possibilities), and neither can pairs of nt (4 × 4 = 16 possibilities) • So protein information is carried in triplet codes, called codons (4 × 4 × 4 = 64 possibilities) • The codons that do not correspond to an amino acid are stop codons: UAA, UAG, UGA (RNA has U instead of T) • AUG is used as the start codon as well as to code for methionine
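The codon idea can be made concrete with a small sketch. It builds the standard genetic code table and translates from the first AUG to the first stop codon. The function names are illustrative, and `transcribe` makes a simplifying assumption: the input is the coding strand of the DNA, so transcription is just a T to U substitution.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G ("*" marks a stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Simplified transcription: assumes the coding strand, so only T -> U."""
    return coding_dna.replace("T", "U")


def translate(rna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the RNA."""
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG or UGA
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate(transcribe("ATGGCCTAA")))  # AUG GCC UAA
```

The table has 64 entries, three of which are stop codons, which matches the counting argument on the slide: 16 pairs are too few for 20 amino acids, 64 triplets are enough.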
Protein Structure • Shows wide variety, as opposed to DNA, whose structure is uniform • X-ray crystallography or Nuclear Magnetic Resonance (NMR) is used to determine the structure • Structure is related to function, or rather structure determines function • Although proteins are created as a linear chain of aa, they fold into a 3-D structure • If you stretch them and let go they return to this structure: the native structure of the protein • Proteins function well only in their native structure • Even after translation is over, a protein goes through some changes to its structure
Gene Expression • Gene expression: the process of transcribing DNA into RNA and translating the RNA to make a protein • Where do the genes begin in a chromosome? • How does the RNA polymerase identify the beginning of a gene? • A single nt cannot mark the beginning of a gene, as each nt occurs far too frequently • But a particular combination of nucleotides can • Promoter sequences: the order of nt which marks the beginning of a gene
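A rough sketch of why a combination of nucleotides works as a marker where a single nt does not. The motif `TATAAT` (a well-known bacterial promoter consensus) is used purely as an illustration, and `expected_hits` assumes a random sequence in which all four bases are equally likely, which real genomes are not.

```python
def find_motif(dna: str, motif: str) -> list[int]:
    """Return every start position (0-based) where motif occurs, overlaps included."""
    positions = []
    start = dna.find(motif)
    while start != -1:
        positions.append(start)
        start = dna.find(motif, start + 1)
    return positions


def expected_hits(sequence_length: int, motif_length: int) -> float:
    """Expected chance matches in a random sequence with equiprobable bases."""
    return (sequence_length - motif_length + 1) / 4 ** motif_length


if __name__ == "__main__":
    print(find_motif("GGTATAATGCTATAATA", "TATAAT"))
    print(expected_hits(1000, 1), expected_hits(1000, 6))
```

Under that assumption a single base turns up about 250 times by chance in 1000 nt, while a specific 6-nt motif turns up less than once, so the longer pattern carries real information.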
Bioinformatics Techniques
Prediction and Pattern Recognition • The two main areas of bioinformatics are • Pattern recognition • ‘A particular sequence or structure has been seen before’, so a particular characteristic can be associated with it • Prediction • From a sequence (what we know) we can predict the structure and function (what we don’t know)
Dot plots • A simple way of evaluating similarity between two sequences • In a grid, one sequence runs along one axis and the other sequence along the other axis • Wherever the two sequences match, the grid is marked
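A dot plot as described here is small enough to sketch directly. The function names and the text rendering (`*` for a match, `.` otherwise) are illustrative choices, not part of the lecture.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[list[bool]]:
    """Grid with one row per character of seq_a and one column per character of seq_b."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a: str, seq_b: str) -> str:
    """Draw the grid as text, marking matches with '*'."""
    lines = ["  " + seq_b]
    for a, row in zip(seq_a, dot_plot(seq_a, seq_b)):
        lines.append(a + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("TTACTATA", "TAGATA"))
```

Runs of matches along a diagonal show up as diagonal lines of `*`, which is what makes similar regions easy to spot by eye.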
Alignments • An arrangement that matches up the characters of two or more sequences to show their similarity • E.g. • TTACTATA • TAGATA • There are many ways to align these two sequences • 1. • TTACTATA • TAGATA • 2. • TTACTATA • TAGATA • 3. • TTACTATA • TAGATA • So which one do we choose, and on what basis? • The solution is to assign a match score and a mismatch score and prefer the alignment with the best total
Gaps • Introduce gaps, with a penalty score for each gap • TTACTATA • T_A_GATA • In gap scoring, a single indel two characters long is preferred to two indels each one character long • However, not all gaps are bad • TTGCAATCT • CAA • How do we align? • ---CAA--- • These end gaps are not biologically significant, so they are not penalised • This is a semi-global alignment
Scoring Matrix • For DNA or protein sequence alignment we create a matrix giving a score to every pair of characters • E.g. A aligned with A scores 1 • A aligned with T scores -5 • A aligned with C scores -1
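The scoring ideas from the last few slides (match score, mismatch score, gap penalty) can be combined into one small function that scores an alignment that has already been written out. The numeric values are assumptions chosen for illustration, not the lecture's matrix. A separate, cheaper penalty for extending a gap is one common way to get the stated preference for one long indel over two short ones.

```python
def score_alignment(top: str, bottom: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -2, gap_extend: int = -1) -> int:
    """Score two already-aligned rows of equal length; '_' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    previous_gap = None  # which row held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "_" or b == "_":
            current_gap = "top" if a == "_" else "bottom"
            # Continuing an existing gap costs less than opening a new one.
            total += gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            total += match if a == b else mismatch
            previous_gap = None
    return total


if __name__ == "__main__":
    print(score_alignment("ACGTA", "A__TA"))  # one indel, two characters long
    print(score_alignment("ACGTA", "A_G_A"))  # two indels, one character each
```

Both alignments have three matching columns, but the single two-character indel scores higher than the two separate one-character indels.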
Dynamic Programming • As the query sequences get longer, and as the difference in length between the two sequences grows, more gaps have to be inserted in various places • We cannot perform an exhaustive search • Combinatorial explosion occurs: too many combinations to search through • Dynamic programming avoids this by building the best alignment from the best alignments of shorter prefixes, so each sub-problem is solved only once
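A minimal sketch of dynamic programming for global alignment, in the style usually attributed to Needleman and Wunsch. It returns only the best score, not the alignment itself. The scoring values are assumptions, and it uses one flat penalty per gap character to keep the table simple.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b under a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned entirely against gaps
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned entirely against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("TTACTATA", "TAGATA"))
```

The table has (len(a) + 1) × (len(b) + 1) cells and each is filled once from three neighbours, so the work grows with the product of the two lengths rather than with the number of possible alignments.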
Databases • Sequence information is stored in databases (db) • So that it can be manipulated easily • The db (next slide) are located at different places • They exchange information daily so that they stay up to date and in sync • Primary db: sequence data
Composite DB • As there are many db, which one should we search? Some are strong in some aspects and weak in others • A composite db is the answer: it takes several db as its base data • Searches on these db are indexed and streamlined so that the same stored sequence is not searched twice in different db
Composite DB… • OWL draws on these primary db • SWISS-PROT (top priority) • PIR • GenBank • NRL-3D
Secondary db • Store secondary structure information or the results of searches of the primary db
Database Searches • We have sequenced and identified genes, so we know what they do • The sequences are stored in databases • So if we find a new gene in the human genome we compare it with the already known genes stored in the databases • Since the databases hold a very large number of sequences, we cannot run a full sequence alignment against each and every one • So heuristics must be used to narrow the search
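One common family of search heuristics indexes short words (k-mers) of the stored sequences and only considers entries that share words with the query. The sketch below illustrates that idea under stated assumptions: the toy database, the entry names and the word length `k=3` are all hypothetical, and real search tools are far more elaborate.

```python
from collections import defaultdict


def build_kmer_index(database: dict[str, str], k: int = 3) -> dict[str, set[str]]:
    """Map every k-letter word to the names of the sequences that contain it."""
    index = defaultdict(set)
    for name, sequence in database.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index


def candidate_hits(query: str, index: dict[str, set[str]], k: int = 3) -> list[str]:
    """Rank database entries by how many distinct query words they share."""
    shared = defaultdict(int)
    for word in {query[i:i + k] for i in range(len(query) - k + 1)}:
        for name in index.get(word, ()):
            shared[name] += 1
    return sorted(shared, key=lambda name: (-shared[name], name))


if __name__ == "__main__":
    toy_db = {"seq1": "TTACTATA", "seq2": "GGGGCCCC"}  # hypothetical entries
    print(candidate_hits("TACTA", build_kmer_index(toy_db)))
```

Only the shortlisted candidates would then be passed to a full alignment, which is where the time saving comes from; the price is that a distant relative sharing no exact word with the query can be missed.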
Areas in Bioinformatics
Genomics • In a multicellular organism, each cell type expresses its genes in a different way, although every cell has the same content as far as the genetic information is concerned • E.g. all the information needed for a liver cell to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them
Genomics - Finding Genes • A gene in sequence data is a needle in a haystack • However, whereas a needle looks different from the hay, genes look no different from the rest of the sequence data • In a whole array of nt we try to find a set of nt and mark its borders as a gene • This is one of the challenges of bioinformatics • Neural networks and dynamic programming are being employed
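To make the border-marking problem concrete, here is a deliberately naive sketch that looks for open reading frames: stretches that begin with ATG and run, in steps of three, to a stop codon. This is a simple stand-in for illustration, not the neural network or dynamic programming methods the slide mentions; it scans one strand only and the `min_codons` threshold is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) slices of ATG...stop stretches in the three forward frames.

    min_codons counts the codons before the stop codon, start codon included.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop codon
                start = None
    return orfs


if __name__ == "__main__":
    sequence = "CCATGGCCTAAGG"
    for start, end in find_orfs(sequence):
        print(start, end, sequence[start:end])
```

Real gene finding is much harder than this: many such stretches occur by chance, and genes in higher organisms are interrupted by non-coding regions, which is why statistical and learning-based methods are used.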
Proteomics • The proteome is the sum total of an organism's proteins • More difficult than genomics • Building blocks: 4 (nt) in DNA vs 20 (aa) in proteins • Chemical makeup: simple vs complex • DNA can be duplicated; proteins can't • We are entering the ‘post-genome era’ • Meaning that much has been done with the genes, not that the work on them is over
Proteomics….. • The relationship between an RNA and the protein it codes for is usually far from direct • After translation, proteins do change • So the aa sequence does not tell us anything about the post-translational changes • Proteins are not active until they are combined into a larger complex or moved to a relevant location inside or outside the cell • So the aa sequence only hints at these things • Also, proteins must be handled more carefully in labs, as they tend to change when in contact with an inappropriate material
Protein Structure Prediction • One of the biggest challenges of bioinformatics, and especially of biochemistry • No algorithm exists at present that consistently predicts the structure of proteins
Structure Prediction methods • Comparative modeling • The target protein's structure is modeled by comparison with related proteins • Proteins with similar sequences are searched for known structures
Phylogenetics • The taxonomical system reflects evolutionary relationships • Phylogenetic trees show the evolutionary relationships as a picture or graph • Rooted trees, where there is a single common ancestor • Unrooted trees, which just show the relationships • Phylogenetic tree reconstruction algorithms are also an area of research
Applications
Medical Implications • Pharmacogenomics • Not all drugs work on all patients; some good drugs cause death in some patients • So by doing a gene analysis before treatment, the offending drugs can be avoided • Also, drugs which cause death in most patients can be used on the minority whose genes are well suited to them (volunteers wanted!) • Customized treatment • Gene therapy • Replace or supply the defective or missing gene • E.g. insulin, and Factor VIII for haemophilia • Bioweapons (??)
Diagnosis of Disease • Identifying the genes which cause a disease helps detect it at an early stage, e.g. Huntington's disease • Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment • Death in 10-15 years • The gene responsible for the disease has been identified • It contains excessively repeated sections of CAG • So once the genes have been analyzed the couple can be counseled
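Counting a repeated section like the CAG repeat mentioned above is a simple pattern-matching task. The sketch below finds the longest uninterrupted run of CAG in a sequence; the function name is illustrative, and no diagnostic threshold is implied, since the lecture does not give one.

```python
import re


def longest_cag_run(dna: str) -> int:
    """Number of repeats in the longest uninterrupted run of CAG (0 if none)."""
    runs = re.findall(r"(?:CAG)+", dna)
    return max((len(run) // 3 for run in runs), default=0)


if __name__ == "__main__":
    print(longest_cag_run("TTCAGCAGCAGTTCAG"))
```

A real test works on laboratory measurements rather than a clean string, but the quantity of interest is the same: how many times the triplet repeats in a row.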
Drug Design • Can take up to 15 years and $700 million • One of the goals of bioinformatics is to reduce the time and cost involved • The process • Discovery • Computational methods can improve this • Testing
Discovery: Target identification • Identifying the molecule on which the germ relies for its survival • Then we develop another molecule, i.e. a drug, which will bind to the target • So the germ will not be able to interact with the target • Proteins are the most common targets
Discovery… • For example, HIV produces HIV protease, a protein which in turn cuts up other proteins • HIV protease has an active site where it binds to other molecules • So an HIV drug will go and bind to that active site • Easier said than done!