Bioinformatics. Tigor Nauli (tigor@lipi.go.id / tigor@nauli.net) Research Center for Informatics - LIPI. Topics. Definition Biological database Sequence alignment Gene prediction Phylogenetic analysis Protein structure prediction Other studies Conclusion. Definition. Bioinformatics is
Bioinformatics Tigor Nauli (tigor@lipi.go.id / tigor@nauli.net) Research Center for Informatics - LIPI
Topics • Definition • Biological database • Sequence alignment • Gene prediction • Phylogenetic analysis • Protein structure prediction • Other studies • Conclusion
Definition • Bioinformatics is • the application of computational tools and techniques to the management and analysis of biological data. • information technology (IT) in molecular biology. • a subset of the larger field of computational biology. • The term bioinformatics is used in a number of ways, depending on who is using it.
Definition • The field of bioinformatics relies heavily on work by experts in statistical methods and pattern recognition. • Bioinformatics is in silico research, that is, research carried out on a computer.
Biological database • Database • an archive of information. • a logical organization of that information. • tools to gain access to it. • Biological data • cover nucleic acid and protein sequences, macromolecular structures, and function • are generated by efficient large-scale sequencing machines. • are submitted by molecular biologists around the world.
Biological database • Archival databases of biological information • nucleic acid and protein sequences • protein expression patterns • sequence motifs (‘signature patterns’) • mutations and variants in sequences • classification or relationships of protein sequence families or protein folding patterns • bibliographic information
Biological database • Databanks for nucleotide sequences • GenBank is maintained by the National Center for Biotechnology Information (NCBI) • http://www.ncbi.nlm.nih.gov • EMBL (European Molecular Biology Laboratory) • http://www.ebi.ac.uk/embl/
Biological database • Databank for annotated protein sequences • SWISS-PROT is maintained by the Swiss Institute of Bioinformatics (SIB) in collaboration with the European Bioinformatics Institute (EBI) • http://us.expasy.org/sprot/ • Databank for sequence profiles, patterns, and motifs • PROSITE • http://us.expasy.org/prosite/ • Databank for protein structures • Protein Data Bank • http://www.rcsb.org/pdb/
Biological database • Database queries • given a sequence, or fragment of a sequence, find sequences in the database that are similar to it • given a protein structure, or fragment, find protein structures in the database that are similar to it • such searches are carried out thousands of times a day
Biological database • Database queries • given a sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures • given a protein structure, find sequences in the databank that correspond to similar structures • both kinds of query are active fields of research
Biological database • GenBank file • may contain identifying, descriptive, and genetic information in ASCII-format • for example:
LOCUS       AF134350                1734 bp    mRNA    linear   INV 03-JAN-2000
DEFINITION  Drosophila melanogaster transcription factor Toy (toy) mRNA,
            complete cds.
ACCESSION   AF134350
VERSION     AF134350.1  GI:4883931
KEYWORDS    .
SOURCE      Drosophila melanogaster (fruit fly)
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 1734)
  AUTHORS   Czerny,T., Halder,G., Kloter,U., Souabni,A., Gehring,W.J. and
            Busslinger,M.
  TITLE     twin of eyeless, a second Pax-6 gene of Drosophila, acts upstream
            of eyeless in the control of eye development
  JOURNAL   Mol. Cell 3 (3), 297-307 (1999)
  MEDLINE   99214845
   PUBMED   10198632
REFERENCE   2  (bases 1 to 1734)
  AUTHORS   Czerny,T., Halder,G., Kloter,U., Souabni,A., Gehring,W.J. and
            Busslinger,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (11-MAR-1999) Research Institute of Molecular Pathology,
            Dr. Bohr-Gasse 7, Vienna A-1030, Austria
FEATURES             Location/Qualifiers
     source          1..1734
                     /organism="Drosophila melanogaster"
                     /mol_type="mRNA"
                     /db_xref="taxon:7227"
                     /chromosome="IV"
                     /map="102E1"
                     /dev_stage="embryo"
     gene            1..1734
                     /gene="toy"
                     /note="twin of eyeless; second Pax-6"
     CDS             10..1641
                     /gene="toy"
                     /codon_start=1
                     /product="transcription factor Toy"
                     /protein_id="AAD31712.1"
                     /db_xref="GI:4883932"
                     /translation="MMLTTEHIMHGHPHSSVGQSTLFGCSTAGHSGINQLGGVYVNGR
                     PLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIKPRAIGGSKPR
                     VATTPVVQKIADYKRECPSIFAWEIRDRLLSEQVCNSDNIPSVSSINRVLRNLASQKE
                     QQAQQQNESVYEKLRMFNGQTGGWAWYPSNTTTAHLTLPPAASVVTSPANLSGQADRD
                     DVQKRELQFSVEVSHTNSHDSTSDGNSEHNSSGDEDSQMRLRLKRKLQRNRTSFSNEQ
                     IDSLEKEFERTHYPDVFARERLADKIGLPEARIQVWFSNRRAKWRREEKMRTQRRSAD
                     TVDGSGRTSTANNPSGTTASSSVATSNNSTPGIVNSAINVAERTSSALISNSLPEASN
                     GPTVLGGEANTTHTSSESPPLQPSAPRLPLNSGFNTMYSSIPQPIATMAENYNSSLGS
                     MTPSCLQQRDAYPYMFHDPLSLGSPYVSAHHRNTACNPSAAHQQPPQHGVYTNSSPMP
                     SSNTGVISAGVSVPVQISTQNVSDLTGSNYWPRLQ"
     misc_difference 1605
                     /gene="toy"
                     /note="compared to genomic sequence; aspartic acid to
                     glutamic acid change"
                     /replace="a"
ORIGIN
        1 taattaatta tgatgctaac aactgaacac ataatgcatg ggcatcccca ctcgtcagtc
       61 gggcagagta ctctatttgg gtgctccacg gcgggccata gcggaataaa tcagctgggc
      121 ggcgtatatg ttaatggccg gccactgccc gattcaacgc gtcaaaaaat tgtcgaattg
      181 gctcattccg gcgcacgtcc ttgtgatatt tcaagaatac tacaagtgtc caacggttgc
      241 gtaagcaaaa ttttgggcag atattatgaa actggatcga taaaacctcg agctataggt
      301 ggttcaaagc cacgagtagc tacaaccccg gttgtgcaaa aaattgcaga ttacaaacgg
      361 gaatgtccca gcatatttgc gtgggaaata cgagatcgac tgctatcgga acaagtttgc
      421 aatagtgata acattccaag tgtttcatct attaatcgag tcttacgtaa cctggcctca
      481 caaaaggagc agcaagctca gcaacaaaac gaatccgttt atgaaaagct tcgcatgttt
      541 aatggccaaa cgggcggatg ggcatggtat ccaagcaata caacgacggc acatttgacg
      601 ctaccaccag cagcttccgt tgtgacatct cctgcaaatt tatcaggaca ggccgatcgg
      661 gatgatgttc aaaaaagaga attacaattt tcagtagaag tttcgcatac aaactctcac
      721 gatagtacat cggatggaaa ctctgaacat aattcatccg gggacgaaga ctctcaaatg
      781 cggttgcgcc taaaaaggaa gttacagcgc aatcggacat cattttctaa tgagcaaatt
      841 gacagtcttg aaaaagaatt tgaaagaaca cattatcccg atgtttttgc gcgagaaagg
      901 cttgctgata aaattggttt gccagaggca cgtattcagg tttggttttc aaaccgacga
      961 gctaaatggc gccgagaaga aaaaatgcga actcagagac gatcggccga taccgtggac
     1021 ggcagtggtc gaaccagcac ggcaaataat ccttcaggaa cgactgcatc ttcctccgtc
     1081 gcaacgtcaa acaactcaac tccagggatt gtgaactcag caatcaacgt tgcggaacga
     1141 acatcatctg cattaattag taatagcctt cccgaggctt caaatggacc aactgttttg
     1201 ggtggtgaag ctaatactac acacaccagc tctgaaagcc caccccttca gccatcggca
     1261 ccgcggctac ccttaaattc tggattcaac accatgtact catctattcc acaaccgatt
     1321 gcaacgatgg ctgaaaatta caactcctca ttaggatcaa tgaccccgtc atgcttacaa
     1381 caacgcgatg cctatcctta catgtttcac gatccgttat cactaggatc tccctatgtg
     1441 tcagcccacc atcgaaacac agcttgcaac ccctcagctg cgcaccaaca gccccctcag
     1501 catggcgttt ataccaatag ttctccaatg ccatcatcaa acacaggtgt catttctgcg
     1561 ggcgtttcgg tgcctgtcca gatttcaacg caaaatgtat ctgacctaac gggaagcaat
     1621 tactggccac gtcttcagtg atcgtcaatc tttggctcac cattagatca tttgtcaaag
     1681 gcgactgccg ctgcaatcat tgccgcacaa gcagctgaga aaagccataa acac
//
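Because a GenBank file is plain ASCII text, its fields can be read with ordinary string handling. The sketch below is a minimal illustration, not a complete parser: it reads only the LOCUS line and the ORIGIN block from a short excerpt of the record above (first sequence line only). The function name `parse_genbank` is hypothetical; for real work a maintained parsing library would be the safer choice.

```python
def parse_genbank(text):
    """Extract locus name, declared length, and sequence from GenBank flat-file text."""
    record = {"locus": None, "length": None, "sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            record["locus"] = fields[1]
            record["length"] = int(fields[2])
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False  # "//" terminates the record
        elif in_origin:
            # Sequence lines: a position number followed by blocks of 10 bases.
            record["sequence"] += "".join(line.split()[1:])
    return record


# Excerpt of the AF134350 record shown above (first ORIGIN line only).
excerpt = """LOCUS       AF134350                1734 bp    mRNA    linear   INV 03-JAN-2000
ORIGIN
        1 taattaatta tgatgctaac aactgaacac ataatgcatg ggcatcccca ctcgtcagtc
//
"""

rec = parse_genbank(excerpt)
cds_start = 10  # from the feature table: CDS 10..1641
first_codon = rec["sequence"][cds_start - 1:cds_start + 2]
print(rec["locus"], rec["length"], first_codon)
```

The first codon of the annotated CDS comes out as the start codon `atg`, which is a quick sanity check that positions in the feature table are 1-based.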
Sequence alignment • The basic sequence analysis task is to ask whether two sequences are related • sequence similarity/homology • When we compare sequences, we assume that they have diverged by a process of mutation. • The mutational processes are substitutions, which change residues in a sequence, and insertions and deletions, which add or remove residues. • An alignment can be extended in three ways: match, mismatch, and gap.
Sequence alignment • We use a dynamic programming matrix to find the optimal alignment. • To align CATGT with ACGCTG, first we fill the matrix with scores: +2 for a match, -1 for a mismatch, and -1 for a gap
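The matrix fill can be sketched in a few lines. The slides do not say whether a global or a local alignment is intended; the version below assumes a global alignment (the Needleman-Wunsch scheme) with the scores given above, and the function name `global_align` is hypothetical.

```python
def global_align(a, b, match=2, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # match or mismatch
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = global_align("CATGT", "ACGCTG")
print(score)
print(row_a)
print(row_b)
```

Several alignments can share the optimal score; the traceback here prefers the diagonal move, so it reports only one of them.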
Sequence alignment • The maximum score will be: • The best alignment is: • The example output using BLAST program:
Gene prediction • Predicting gene locations • identify all the open reading frames (ORFs) in unannotated DNA • compare a query sequence against an entire annotated DNA database to find similar sequences • use Bayesian statistics to find the most probable base to follow a known subsequence, for example P(CCGAT) = P(CC) * P(G|CC) * P(A|CG) * P(T|GA)
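The factorization of P(CCGAT) conditions each base on the two bases before it, which is a second-order Markov chain. A minimal sketch of how such a probability could be estimated from counts follows; the training string is a made-up toy example, the function names are hypothetical, and no pseudocounts are used, so any unseen context gives a probability of zero.

```python
from collections import Counter


def train(seq):
    """Count overlapping dinucleotides and trinucleotides in a training sequence."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    triples = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    return pairs, triples


def sequence_probability(query, pairs, triples):
    """P(x1 x2) * product of P(x_i | x_{i-2} x_{i-1}), from maximum-likelihood counts."""
    p = pairs[query[:2]] / sum(pairs.values())
    for i in range(2, len(query)):
        context = query[i - 2:i]
        # How often this two-base context is followed by any base at all.
        context_total = sum(c for t, c in triples.items() if t[:2] == context)
        if context_total == 0:
            return 0.0
        p *= triples[context + query[i]] / context_total
    return p


# Toy training data (hypothetical, for illustration only).
pairs, triples = train("CCGATCCGAT")
p_ccgat = sequence_probability("CCGAT", pairs, triples)
print(p_ccgat)
```

In a gene finder, one such model would typically be trained on known coding sequence and another on non-coding sequence, and a candidate region scored under both.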
Gene prediction • implementing a hidden Markov model (HMM)
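One common use of an HMM in gene prediction is to label each base with its most likely hidden state using the Viterbi algorithm. The sketch below assumes a deliberately simple two-state model ("coding" and "noncoding"); the states and every probability in it are invented for illustration and are not taken from the slides or from any real gene finder.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for an observed sequence."""
    # Log-probabilities avoid numerical underflow on long sequences.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: V[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (V[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        V.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


# Hypothetical toy model: "coding" regions are GC-rich, "noncoding" regions AT-rich.
states = ("coding", "noncoding")
start_p = {"coding": 0.5, "noncoding": 0.5}
trans_p = {"coding": {"coding": 0.9, "noncoding": 0.1},
           "noncoding": {"coding": 0.1, "noncoding": 0.9}}
emit_p = {"coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path = viterbi("ATATGCGCGC", states, start_p, trans_p, emit_p)
print(path)
```

Real gene-finding HMMs use many more states (start and stop codons, codon positions, introns, and so on) and parameters estimated from annotated genomes.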
Phylogenetic analysis • Phylogenetic analysis • the process of developing hypotheses about the evolutionary relatedness of organisms based on their observable characteristics • Phylogenetic tree • built from a multiple sequence alignment
Phylogenetic analysis • implements the parsimony method, UPGMA, cladistics, neighbor joining, the least squares method, maximum likelihood, or clustering to determine the differences between the sequences • finds the relatedness by clustering
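Of the methods listed, UPGMA is the simplest to sketch: repeatedly merge the two closest clusters and average their distances to everything else, weighted by cluster size. The version below is a minimal illustration that returns only the tree topology as nested tuples (no branch lengths); the function name and the distance matrix are hypothetical.

```python
def upgma(names, matrix):
    """Average-linkage clustering (UPGMA); returns the topology as nested tuples."""
    # clusters: id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # dist keys are (smaller id, larger id)
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair (first one found on ties)
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average distance from the merged cluster to cluster k.
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical pairwise distances between four sequences A-D.
names = ["A", "B", "C", "D"]
matrix = [[0, 2, 6, 10],
          [2, 0, 6, 10],
          [6, 6, 0, 10],
          [10, 10, 10, 0]]
tree = upgma(names, matrix)
print(tree)
```

UPGMA assumes a constant rate of change along all lineages; neighbor joining and maximum likelihood relax that assumption at the cost of more computation.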
Phylogenetic analysis • the percentage of identity: • phylogenetic tree:
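Percentage identity is computed column by column from aligned sequences. Definitions differ in what they use as the denominator; the sketch below assumes one common convention (identical columns divided by all columns in which at least one sequence has a residue). The function names and the three-sequence alignment are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of identical columns between two aligned sequences ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences have a gap.
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        raise ValueError("no residues to compare")
    identical = sum(1 for x, y in columns if x == y)
    return 100.0 * identical / len(columns)


def identity_matrix(alignment):
    """Pairwise percentage identity for every pair in a {name: aligned sequence} dict."""
    return {(p, q): percent_identity(alignment[p], alignment[q])
            for p in alignment for q in alignment}


# Hypothetical toy alignment.
alignment = {"seq1": "ACGT-A", "seq2": "ACGTTA", "seq3": "AC-TTA"}
identities = identity_matrix(alignment)
for pair, value in identities.items():
    print(pair, round(value, 1))
```

Subtracting each value from 100 gives a simple distance that a clustering method such as UPGMA can take as input.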
Protein structure prediction • Two approaches in computational modeling of protein structure • knowledge-based modeling • employ parameters extracted from the database of existing structures to evaluate and optimize structures • predict structure from sequence
Protein structure prediction • Predict from sequence: • select the protein sequence of the target • determine the secondary structure by calculating its hydrophobicity values
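A hydrophobicity profile is usually computed as a sliding-window average over the sequence. The slides do not name a scale; the sketch below assumes the Kyte-Doolittle hydropathy scale and a window of 7 residues, and the function name is hypothetical. A profile like this is only one input to secondary structure prediction, not a prediction in itself.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the slides do not specify one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Average hydropathy over each window of consecutive residues."""
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# First 25 residues of the Toy protein from the GenBank record shown earlier.
target = "MMLTTEHIMHGHPHSSVGQSTLFGC"
profile = hydropathy_profile(target)
for start, score in enumerate(profile, start=1):
    print(start, round(score, 2))
```

Positive window averages indicate hydrophobic stretches, which tend to be buried or membrane-spanning; negative averages indicate hydrophilic, typically surface-exposed stretches.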
Protein structure prediction • align the structure with the similar sequence in databank • find the list of angles • draw the structure
Other studies • Protein structure property analysis • Biochemical simulation • Whole genome analysis • Primer design • DNA microarray analysis • Proteomics analysis
Conclusion • Bioinformatics can provide anything from the abstraction of the properties of a biological system into a mathematical or physical model to the implementation of new algorithms for data analysis.