Sequence Alignment Algorithms – Application to Bioinformatics Tool Development Dr. S. Parthasarathy Reader and Head Department of Bioinformatics Bharathidasan University Tiruchirappalli – 620 024 (E-mail: partha@cnld.bdu.ac.in) Dr.S.Parthasarathy, Bharathidasan Univ., Trichy
Plan
• Introduction to Bioinformatics
• Sequence alignment algorithms
  • Global alignment: Needleman-Wunsch algorithm
  • Local alignment: Smith-Waterman algorithm
• PredictFold: predicting a fold for a protein sequence
  • Methodology
  • Algorithm, Coding & Tool Development
  • Benchmarking
• Conclusions
Introduction
• Why do we need Bioinformatics?
• What is Bioinformatics?
• Where is Bioinformatics used?
Why?
• Biological Data Explosion
• How did the Biological Data Explosion happen?
• Sequence databases are far larger than structure databases
• Why so?
Introduction
Biological Data: Genome Projects (Latest Revolution)
• 26 June 2000: announcement of the completion of the draft of the ‘Human Genome’: ‘Genetic Code of Human Life is Cracked by Scientists’
• The human genome contains 3.2 × 10^9 bp
• Units of (genome) sequence length
  • bp (base pairs)
  • Mbp (mega base pairs) = 10^6 bp
  • Gbp (giga base pairs) = 10^9 bp
  • huge (human genome equivalent) = 3.2 Gbp
• Unit of genetic distance
  • centiMorgan (cM): arbitrary unit, named for Thomas Hunt Morgan (e.g. 1 cM = 0.01 recombination frequency)
Introduction
Biological Data: Genome Projects
[Figure; captions: 16 February 2001, 15 February 2001]
Biological Data: Recombinant DNA Technology (Old Revolution)
• 1940: Role of DNA as the genetic material was confirmed
• 1953: Discovery of the DNA structure by James Watson & Francis Crick
• 1966: Establishment of the Genetic Code
• 1967: DNA ligase was isolated (joins two strands of DNA together): Molecular Glue
• 1970: Isolation of a restriction enzyme: Molecular Scissors
• 1972: Recombinant DNA molecules were generated at Stanford University, USA
• 1973: DNA fragments were joined to the plasmid pSC101 isolated from E. coli; they could replicate when introduced into E. coli
The discoveries of 1972 & 1973 triggered the biggest scientific revolution: Genetic Engineering
Biological Data Explosion
• GenBank, National Center for Biotechnology Information (NCBI), USA
  • 44 Gbp of DNA & 40 million sequences (up to 2004)
• Protein Data Bank (PDB), Research Collaboratory for Structural Bioinformatics (RCSB), USA
  • 29,000 structures (2004)
• QUALITY of data: HIGH
  • Experimental error in modern genomic sequencing is extremely low
• QUANTITY of data: HUGE
  • With recombinant DNA technology & genomic sequencing, the size of sequence databases is increasing very rapidly
• SEQUENCE versus STRUCTURE databases
  • Sequence databases are far larger than structure databases
This leads to Bioinformatics
What?
• What is Bioinformatics?
• Define Bioinformatics
Bioinformatics - Definition
Bioinformatics is an integration of mathematical, statistical and computer methods to analyze biological data. We use computer programs to make inferences from the biological data, to make connections among them and to derive useful and interesting predictions. The marriage of biology and computer science has created a new field called ‘Bioinformatics’. - Arthur M. Lesk
Slide graphics: the recurrence F(i,j) = max { F(i-1, j-1) + s(x_i, y_j), F(i-1, j) - d, F(i, j-1) - d }, the DNA string atcggcatgcatcagtcatgcaactg, and the peptide strings PEPTIDESE QSEDITPEP.
Biology: Basic Definitions
• Cell: the building block of living organisms
• Eukaryotic cells or organisms have the nucleus separated from the cytoplasm by a nuclear membrane, and the genetic material is borne on a number of chromosomes consisting of DNA and protein
• Chromosome
  • The physical basis of heredity. Deeply staining rod-like structures present within the nuclei of eukaryotes
  • Contains DNA and protein arranged in a compact manner
  • Replicates identically during cell division
  • The same number of chromosomes is present in the cells of a particular species (e.g. Human: 22 autosomes plus X and Y)
Genome: Basic Definitions
• Genome
  • A complete set of chromosomes inherited from one parent
• Gene
  • One of the units of inherited material carried on chromosomes. Genes are arranged in a linear fashion on DNA. Each represents one character, which is recognized by its effect on the individual bearing the gene in its cells. There are many thousands of genes in each nucleus.
• DNA (Deoxyribonucleic Acid)
  • DNA is made up of FOUR bases, a t g c: adenine, thymine, guanine, cytosine
• Protein
  • Protein is made up of TWENTY different amino acids, A T G C ...: Alanine, Threonine, Glycine, Cysteine, …
Central Dogma
DNA → (transcription) → mRNA → (translation) → Protein
DNA:     CCTGAGCCAACTATTGATGAA
mRNA:    CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
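As a minimal sketch of the two steps on this slide, the Python snippet below transcribes the example DNA string into mRNA and translates it into a peptide. The function names and the partial codon table (only the seven codons that occur in the example) are illustrative assumptions. The DNA is treated as the coding strand, so transcription reduces to replacing T with U.

```python
# Partial codon table (assumption: only the codons needed for the slide's example).
CODON_TABLE = {
    "CCU": "P", "CCA": "P",   # proline
    "GAG": "E", "GAA": "E",   # glutamate
    "ACU": "T",               # threonine
    "AUU": "I",               # isoleucine
    "GAU": "D",               # aspartate
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time and look up each codon."""
    usable = len(mrna) - len(mrna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3))


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

A full translator would need the complete 64-codon genetic code and stop-codon handling, which this sketch leaves out.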
Genome Data: Human & Model Organisms
• Most mapping and sequencing technologies were developed from studies of simpler non-human organisms
• Non-human / model organisms
  • Bacterium Escherichia coli: 4.6 Mbp
  • Yeast Saccharomyces cerevisiae: 12.1 Mbp
  • Fruit fly Drosophila melanogaster: 180.0 Mbp
  • Roundworm C. elegans: 95.5 Mbp
  • Laboratory mouse Mus musculus: 3.0 Gbp
• Human: more complex genome
  • Human Homo sapiens: 3.2 Gbp
Genome Data: Human (Homo sapiens)
Genome         1
Chromosomes    23
Genes / DNAs   ~30,000
Nucleotides    3.2 × 10^9 bp
Bioinformatics in Genome Research
• Data Collection and Interpretation
• Collecting and Storing Data
  • Sequences generated by genome research will be used as the primary information source for human biology and medicine
  • The vast amount of data produced will first need to be collected, stored and distributed
• Interpretation of Data
  • Recognizing where genes begin and end
  • Searching a database for a particular DNA sequence may uncover homologous sequences in a known gene from a model organism, revealing insights into the function of the corresponding human gene
Understanding Gene Function
• Correct protein function depends on the 3-D or folded structure the protein assumes in biological environments
• Understanding protein structure will be essential in determining gene function
[Diagram labels: Gene, Protein, Function, Structure]
Where?
• Where is Bioinformatics used?
• What are the uses of Bioinformatics?
• Applications of Bioinformatics
Bioinformatics Tasks
• Sequence Analysis (Protein sequences)
  • Similarity & Homology
    • Pairwise local/global alignment
    • GCG: Seqlab & Seqweb
  • Scoring Matrices: PAM, BLOSUM
  • Database Search
    • BLAST, FASTA
  • Multiple alignment
    • ClustalW, PRINTS, BLOCKS
• Secondary Structure Prediction (from Sequence)
  • Proteins: α-helix, β-sheet, turn or coil
• Protein Folding
Bioinformatics Tasks
• Structure analysis: Experimental Determination
  • X-ray crystallography: 3-dimensional coordinates (structure)
  • Nuclear Magnetic Resonance (NMR)
  • PDB: Protein Data Bank
  • RasMol: Molecular Viewing Software
• High-throughput crystallographic structure determination
  • High-flux synchrotron radiation sources (data collection)
  • Multi-wavelength anomalous diffraction (MAD) method (data interpretation)
• Bioinformatics: Structure Prediction
  • Homology Modelling: InsightII, SwissPDBViewer, Biosuite
  • ‘ab initio’ method: Monte Carlo Simulation
• Protein Structure Classification
  • SCOP: Structural Classification Of Proteins
  • CATH: Class, Architecture, Topology, Homologous superfamily
  • FSSP: Fold classification based on Structure-Structure alignment of Proteins, obtained by DALI (Distance-matrix ALIgnment)
Bioinformatics Tasks
• Protein Engineering
  • Mutations
    • Alter a particular amino acid/base for a desired effect
  • Site-directed mutagenesis
    • Identify the potential sites where alterations can be made
  • Applications
    • Agricultural: Genetically Modified Plants, Vegetables, GM Food
    • Pharmaceutical: Molecular Modelling based Drug Design
    • Medical: Gene Therapy
• DNA Bending
  • Application to Genomes (Ref: M.G.Munteanu, K.Vlahovicek, S.Parthasarathy, I.Simon and S.Pongor, Rod Models of DNA: Sequence-dependent anisotropic elastic modelling of local phenomena, Trends in Biochemical Sciences, 23 (1998) 341-347)
Bioinformatics Tasks: Genomics & Proteomics
• Genomics is the study of the structure, content, evolution and functions of genes in genomes
• Aims of Genomics
  • To establish an integrated web-based database and research interface
  • To assemble physical, genetic and cytological maps of the genome
  • To identify and annotate the complete set of genes encoded within a genome
  • To provide the resources for comparison with other genomes
Proteomics: Proteome
• A proteome is the complete collection of proteins in a cell/tissue/organism at a particular time. Unlike genomes, which are stable over the lifetime of the organism, proteomes change rapidly as each cell responds to its changing environment and produces new proteins in different amounts.
• The genome is a more stable entity. An organism has only one genome but many proteomes.
• For an organism, there may be
  • one body-wide proteome
  • about 200 tissue proteomes
  • about a trillion (~10^12) individual cell proteomes
Proteomics: Definition
• The study of proteomes, which includes determining the 3D shapes of proteins, their roles inside cells, the molecules with which they interact, and defining which proteins are present and how much of each is present at a given time.
Proteomics: Applications
• To correlate proteins on the basis of their expression profiles.
• To observe patterns in protein synthesis; changes in these patterns can be used as an indicator of the state of the cell and its gene expression.
• To characterize bacterial pathogens and to develop novel antimicrobials.
• To identify regions of the bacterial genome that encode pathogenic determinants.
• To develop drugs and in toxicology: Structural Proteomics.
• Proteomics as a tool for plant genetics and breeding.
Systems Biology
• Systems Biology is a new perspective and an emerging field for research in the post-genomic era.
• It aims at a system-level understanding of biological systems.
• It studies whole cells/tissues/organisms not by a traditional reductionist approach but by holistic means, in a reiterative attempt to model the complete cell/tissue/organism.
• It views life as arising from an integrated and interacting network of genes, proteins and biochemical reactions.
Sequence Alignment Algorithms
• Similarity and Homology
• Sequence Comparison: Issues
• Types of alignments
• Algorithms Used
Sequence similarity and homology
• Nature is a tinkerer and not an inventor. New sequences are adapted from pre-existing sequences rather than invented de novo. There is often significant similarity between a new sequence and already known sequences, which is fortunate for computational sequence analysis.
• Similarity: a measurement of resemblance and differences, independent of the source of the resemblance.
• Homology: the sequences and the organisms in which they occur are descended from a common ancestor.
• If two related sequences are homologous, then we can transfer information about structure and/or function by homology.
3-D Structure and Homology
• 3-D structure patterns (motifs) of proteins are much more evolutionarily conserved than amino acid sequences, so this type of homology search could prove more fruitful
• Particular motifs may serve similar functions in several different proteins, information that would be valuable in genome analysis
• Only a few protein motifs can be recognised at the sequence level
• Development of more analytic capabilities to facilitate grouping protein sequences into motif families will make homology searches more useful
Sequence Comparison: Issues
• Types of alignment
  • Global: end-to-end matching (Needleman-Wunsch)
  • Local: matching of portions or subsequences (Smith-Waterman)
• Scoring system used to rank alignments
  • PAM & BLOSUM matrices
• Algorithms used to find optimal (or good) scoring alignments
  • Heuristic
  • Dynamic Programming
  • Hidden Markov Model (HMM)
• Statistical methods used to evaluate the significance of an alignment score
  • Z-score, P-value and E-value
PAM and BLOSUM Substitution Matrices
• PAM (Point Accepted Mutation)
• BLOSUM (BLOcks SUbstitution Matrix)
Sequences   PAM   BLOSUM
Close       40    90
Default     250   62
Distant     500   30
Types of Algorithms
• Heuristic: A heuristic is an algorithm that yields reasonable results even if it is not provably optimal or lacks a performance guarantee. Heuristic methods are usually very fast, but they make additional assumptions and will miss the best match for some sequence pairs.
• Dynamic Programming: An algorithm that finds optimal alignments for an additive alignment score by filling in a table of scores step by step (discussed in the following slides). This type of algorithm is guaranteed to find the optimal-scoring alignment or set of alignments.
• HMM: Based on probability theory; very versatile.
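Global Alignment: Needleman-Wunsch Algorithm
• Formula
\[ F(i,j) = \max \begin{cases} F(i-1,\,j-1) + s(x_i, y_j) & \text{(D)} \\ F(i-1,\,j) - d & \text{(H)} \\ F(i,\,j-1) - d & \text{(V)} \end{cases} \]
Here D, H and V label the diagonal, horizontal and vertical moves in the matrix F, s(x_i, y_j) is the substitution score for aligning x_i with y_j, and d is the gap penalty.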
Global Alignment: Needleman-Wunsch Algorithm
• Gap penalties
  • Linear score: f(g) = -gd
  • Affine score: f(g) = -d - (g-1)e
  • d = gap open penalty, e = gap extend penalty, g = gap length
• Trace back
  • Start from the value in the bottom-right corner and trace back to the top-left corner (i.e. the alignment always runs end to end).
• Algorithm complexity
  • It takes O(nm) time and O(nm) memory, where n and m are the lengths of the sequences.
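The recurrence, the gap penalties and the trace back described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the talk: the function names, the simple match/mismatch scoring (standing in for a PAM or BLOSUM matrix) and the default parameter values are assumptions, and the alignment uses only the linear gap penalty.

```python
def linear_gap(g, d):
    """Linear gap score f(g) = -g*d."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score f(g) = -d - (g-1)*e."""
    return -d - (g - 1) * e


def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment with a linear gap penalty d (toy scoring, assumed)."""
    n, m = len(x), len(y)

    def s(a, b):
        return match if a == b else mismatch

    # F has (n+1) x (m+1) cells: O(nm) memory; the double loop is O(nm) time.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = linear_gap(i, d)
    for j in range(1, m + 1):
        F[0][j] = linear_gap(j, d)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)

    # Trace back from the bottom-right corner to the top-left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

When several alignments share the optimal score, this trace back returns only one of them (it prefers the diagonal move); an affine penalty would need separate matrices for gap states, which the sketch does not attempt.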
Local Alignment: Smith-Waterman Algorithm
Same as the global alignment algorithm, with TWO differences:
• F(i,j) takes the value 0 (zero) if all other options have a value less than 0.
• An alignment can end anywhere in the matrix. Take the highest value of F(i,j) over the whole matrix and start the trace back from there; the trace back stops at a cell with value 0.
Local Alignment: Smith-Waterman Algorithm
• Formula
\[ F(i,j) = \max \begin{cases} F(i-1,\,j-1) + s(x_i, y_j) & \text{(D)} \\ F(i-1,\,j) - d & \text{(H)} \\ F(i,\,j-1) - d & \text{(V)} \\ 0 & \text{(if all other values are} < 0\text{)} \end{cases} \]
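A minimal Python sketch of the local alignment recurrence is shown below. As before, the function name, the match/mismatch scoring and the default parameters are assumptions made for illustration; the only changes from the global version are the extra 0 option, the trace back starting at the highest-scoring cell, and the stop at a cell holding 0.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment with a linear gap penalty d (toy scoring, assumed)."""
    n, m = len(x), len(y)

    def s(a, b):
        return match if a == b else mismatch

    # First row and column stay 0: a local alignment may start anywhere.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j

    # Trace back from the highest-scoring cell until a 0 is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


result = smith_waterman("TTTACGTAAA", "GGACGTCC")
print(result)
```

In this example the two sequences share only the segment ACGT, so the local alignment reports that segment and ignores the unrelated flanking residues, which a global alignment would be forced to align.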
Web-based server development
• Design the web page to get the data
• Use a cgi-bin program or Perl script to parse the submitted data
• Invoke the corresponding program to get the appropriate results
• Send the results either by e-mail or directly to the web page
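The parse, invoke and respond steps above might look roughly like the sketch below. It is written in Python rather than the Perl or cgi-bin script mentioned on the slide, purely for consistency with the other examples. The form field name `sequence`, the function names and the `run_tool` callback are hypothetical; a real server would also need the web page itself, a request handler and the e-mail delivery path.

```python
from urllib.parse import parse_qs

# The twenty standard amino acids in one-letter code.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_submission(form_body: str) -> str:
    """Extract a protein sequence from a URL-encoded form body.

    Assumes a hypothetical form field called 'sequence' that may hold
    either a raw sequence or a FASTA record.
    """
    fields = parse_qs(form_body)
    raw = fields.get("sequence", [""])[0]
    # Drop FASTA header lines (starting with '>') and surrounding whitespace.
    lines = [ln.strip() for ln in raw.splitlines() if not ln.startswith(">")]
    seq = "".join(lines).upper()
    if not seq:
        raise ValueError("no sequence submitted")
    bad = set(seq) - VALID_RESIDUES
    if bad:
        raise ValueError("invalid residues: " + "".join(sorted(bad)))
    return seq


def handle_request(form_body: str, run_tool) -> str:
    """Parse the submission, invoke the analysis program, return page text."""
    try:
        seq = parse_submission(form_body)
    except ValueError as exc:
        return f"Error: {exc}"
    return f"Result: {run_tool(seq)}"


# Example: a stand-in 'tool' that just reports the sequence length.
body = "sequence=%3Eprobe%0Apeptide%0A"
print(handle_request(body, len))
```

Validating the input before invoking the analysis program keeps malformed submissions from reaching it, which matters once the tool is exposed on the web.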
Application to Bioinformatics Tool Development
PredictFold: predicting a fold for a protein sequence
PredictFold: predicting a fold for a protein sequence
• To predict possible folds for a given protein sequence whose structure is not known
• To develop a fold recognition technique / tool that is sensitive in detecting folds of given protein sequences in the twilight zone (sequences sharing less than 25% identity)
• To apply the fold recognition strategy to genomic annotation
‘Twilight Zone’ sequences
Example: Cytochrome Sequences
• 256b
>256B:A CYTOCHROME $B562 (OXIDIZED) - CHAIN A
ADLEDNMETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPPKLEDKSPDSPEMKD
FRHGFDILVGQIDDALKLANEGKVKEAQAAAEQLKTTRNAYHQKYR
>256B:B CYTOCHROME $B562 (OXIDIZED) - CHAIN B
ADLEDNMETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPPKLEDKSPDSPEMKD
FRHGFDILVGQIDDALKLANEGKVKEAQAAAEQLKTTRNAYHQKYR
• 2ccy
>2CCY:A CYTOCHROME $C(PRIME) - CHAIN A
QQSKPEDLLKLRQGLMQTLKSQWVPIAGFAAGKADLPADAAQRAENMAMVAKLAPIGWAK
GTEALPNGETKPEAFGSKSAEFLEGWKALATESTKLAAAAKAGPDALKAQAAATGKVCKA
CHEEFKQD
>2CCY:B CYTOCHROME $C(PRIME) - CHAIN B
QQSKPEDLLKLRQGLMQTLKSQWVPIAGFAAGKADLPADAAQRAENMAMVAKLAPIGWAK
GTEALPNGETKPEAFGSKSAEFLEGWKALATESTKLAAAAKAGPDALKAQAAATGKVCKA
CHEEFKQD
Example: Sequence similarity
lalign output for 256b & 2ccy follows …
Example: Cytochrome Structures
[Figure: structures of 256b and 2ccy; sequence similarity 24%]
Goals
• Exploration of suitable fold recognition techniques that are sensitive in detecting similar folds despite low sequence similarity
• Identification of functional motifs in proteins at the sequence (1D) and structure (3D) level
• Development of a protocol that aids in the rapid classification and annotation of genomic data based on functional motifs
Methodology
• Reduction of the 3D structure to a 1D environment string. The environment at each residue position is a function of the local secondary structure and the extent of exposure to the solvent (based on the 3D-1D profile method developed by Eisenberg et al., 1991).
• Residue environment profiles are extracted from the available protein structures.
• A scoring matrix is generated from a library of profiles. Each matrix element is the information value of a residue in the given environment.
• A library of environment strings is created for the available protein fold structures.
• The probe sequence is queried against this library to look for the best matches.
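The scoring-matrix step can be illustrated with a small Python sketch. It assumes the information value takes the log-odds form ln(P(residue | environment) / P(residue)) used in the 3D-1D profile method; the function names, the toy counts and the `unseen` fallback score are hypothetical, and the sequence is scored against an environment string position by position without gaps, which is a simplification of a real library search.

```python
import math
from collections import Counter


def profile_scores(observations):
    """Build log-odds scores from (environment, residue) observations.

    score(env, res) = ln( P(res | env) / P(res) )   (assumed form)
    """
    pair_counts = Counter(observations)
    env_counts = Counter(env for env, _ in observations)
    res_counts = Counter(res for _, res in observations)
    total = len(observations)
    scores = {}
    for (env, res), count in pair_counts.items():
        p_res_given_env = count / env_counts[env]
        p_res = res_counts[res] / total
        scores[(env, res)] = math.log(p_res_given_env / p_res)
    return scores


def score_sequence(seq, env_string, scores, unseen=-2.0):
    """Sum the scores of each residue placed in the matching environment.

    'unseen' is a hypothetical penalty for pairs never observed.
    """
    return sum(scores.get((env, res), unseen)
               for env, res in zip(env_string, seq))


# Toy data: leucine (L) mostly buried in helices, lysine (K) mostly exposed.
toy = ([("B1A", "L")] * 3 + [("B1A", "K")] +
       [("E0C", "K")] * 3 + [("E0C", "L")])
scores = profile_scores(toy)
print(score_sequence("LK", ["B1A", "E0C"], scores))
print(score_sequence("KL", ["B1A", "E0C"], scores))
```

A positive total means the residues sit in environments where they are over-represented; the reversed sequence KL scores negatively against the same environment string.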
Workflow (diagram)
Residue Environments
[Figure labels: Helix, Strand, Coil; Buried, Partially buried, Exposed]
Residue Environments
• The residue environments are described by
  • the area (A) of the residue buried in the protein
  • the fraction (f) of side-chain area that is covered by polar atoms (O and N)
  • the local secondary structure
Residue Environments
CLASS            Area (A), Å²    Fraction (f)
BURIED 1 (B1)    A > 114         f < 0.45
BURIED 2 (B2)    A > 114         0.45 < f < 0.58
BURIED 3 (B3)    A > 114         f > 0.58
PARTIAL 1 (P1)   40 < A < 114    f < 0.67
PARTIAL 2 (P2)   40 < A < 114    f > 0.67
EXPOSED (E0)     A < 40          f > 0.67
Residue Environment classes
• We have 6 classes based on the extent of exposure to solvent
• We have 3 classes based on secondary structure: Alpha Helix (A), Beta Sheet (B) & Coil (C)
• Total: 6 x 3 = 18 environments
  • B1A, B1B, B1C, B2A, B2B, B2C, B3A, B3B, B3C, P1A, P1B, P1C, P2A, P2B, P2C, E0A, E0B, E0C
• For example
  • B1A: Buried 1, Alpha Helix
  • P2B: Partially Buried 2, Beta Sheet
  • E0C: Exposed 0, Coil
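The 18 environment labels can be generated and assigned with a short Python sketch. The thresholds come from the table of classes; the function names are illustrative. Two points are assumptions because the table does not settle them: values falling exactly on a threshold are assigned to the higher class, and any residue with A < 40 is treated as exposed regardless of f.

```python
EXPOSURE_CLASSES = ("B1", "B2", "B3", "P1", "P2", "E0")
SECONDARY = ("A", "B", "C")  # Alpha Helix, Beta Sheet, Coil

# All 6 x 3 = 18 environment labels, e.g. 'B1A', 'P2B', 'E0C'.
ENVIRONMENTS = [e + s for e in EXPOSURE_CLASSES for s in SECONDARY]


def exposure_class(area: float, f: float) -> str:
    """Map buried area A (in square angstroms) and polar fraction f to a class."""
    if area > 114:                      # buried
        if f < 0.45:
            return "B1"
        if f < 0.58:
            return "B2"
        return "B3"
    if area >= 40:                      # partially buried (boundary: assumption)
        return "P1" if f < 0.67 else "P2"
    return "E0"                         # exposed (f ignored: assumption)


def environment(area: float, f: float, secondary: str) -> str:
    """Combine the exposure class with the secondary structure letter."""
    if secondary not in SECONDARY:
        raise ValueError("secondary structure must be 'A', 'B' or 'C'")
    return exposure_class(area, f) + secondary


print(environment(130.0, 0.30, "A"))
print(environment(80.0, 0.70, "B"))
print(environment(20.0, 0.90, "C"))
```

Applying `environment` to every residue of a known structure yields the 1D environment string that the methodology describes.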