Introduction to bioinformatics

Sylvia B. Nagl

What is bioinformatics? • an emerging interdisciplinary research area • deals with the computational management and analysis of biological information: genes, genomes, proteins, cells, ecological systems, medical information, robots, artificial intelligence...

The Core of Bioinformatics to date • Relationships between sequence, 3D structure, and protein functions • Properties and evolution of genes, genomes, proteins, metabolic pathways in cells • Use of this knowledge for prediction, modelling, and design • Example protein sequence: TDQAAFDTNIVTLTRFVMEQGRKARGTGEMTQLLNSLCTAVKAISTAVRKAGIAHLYGIAGSTNVTGDQVKKLDVLSNDLVINVLKSSFATCVLVTEEDKNAIIVEPEKRGKYVVCFDPLDGSSNIDCLVSIGTIFGIYRKNSTDEPSEKDALQPGRNLVAAGYALYGSATMLV

“The holy grail of bioinformatics” • Example DNA sequence: GCTCCTCACTGTCTGTGTTTATTCTTTTAGCTTCTTCAGATCTTTTAGTCTGAGGAAGCCTGGCATGTGCAAATGAAGTTAACCTAA... • > 500,000 genes sequenced to date • Expected number of unique protein structures: ~700-1,000

Basic concepts • Conceptual foundations of bioinformatics: evolution, protein folding, protein function • Bioinformatics builds mathematical models of these processes to infer relationships between components of complex biological systems

Information processing in cells • Nucleic acids, proteins, coding regions, regulatory sites, transcripts • One-to-many mappings! • Context-dependence!

Global approaches: Toward a new Systems Biology • Global cell state: genome; genome activation patterns (transcriptomics); protein population (proteomics) • How does the spatial and temporal organisation of living matter give rise to biological processes? • Organisation: tissue, cells, molecular complexes • Methods: imaging, EM, X-ray, NMR

Global approaches: Toward a new Systems Biology • Living cell: perturbation, dynamic response • “Virtual cell”: biological knowledge (computerised), sequence information, structural information, bioinformatics, mathematical modelling, simulation • Outcomes: basic principles, practical applications

We do not yet know whether the information in the genome is sufficient to reconstruct an entire biological system. Information on the building blocks is not enough; information on their interactions is essential. • Levels involved: external environment, internal environment, metabolic net, genetic networks, DNA, hnRNA, mRNAs, proteins

Bioinformatics in context • Mathematics/computer science • Genomics • Molecular biology • Biophysics • Molecular evolution • Ethical, legal, and social implications

Current challenges to users • Potential hurdles: methods are in flux and not fully developed; resources are scattered and heterogeneous • Remedies: Web resources, navigation guides, integration of tools and databanks • http://www.biochem.ucl.ac.uk/~nagl/bioinformatics.html

Example 1 • Sequence homology search of the genome of Plasmodium falciparum • Target identification for antimalarial drugs

The search for new antimalarial drugs • Malaria is one of the leading causes of morbidity and mortality in the tropics. • 300 to 500 million estimated clinical cases and 1.5 million to 2.7 million deaths per year. • Nearly all fatal cases are caused by Plasmodium falciparum. • The parasite's resistance to conventional antimalarial drugs such as chloroquine is growing at an alarming rate.

P. falciparum has a plastid-like organelle, called the apicoplast, acquired by endosymbiosis of an alga. • Self-replicating, maternally inherited (35 kb, circular DNA). • Comparative genome analysis: search for orthologs. • The apicoplast contains enzymes found in plant and bacterial, but not animal, metabolic pathways. • Potential target for antimalarial drugs: DOXP reductoisomerase (Jomaa et al., 1999)
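
The ortholog search on this slide rests on scoring local similarity between sequences. What follows is a minimal sketch of Smith-Waterman local alignment scoring, offered as an illustration only; it is not the procedure used by Jomaa et al., and real genome searches typically use heuristic tools such as BLAST with amino acid substitution matrices. The function name and the match, mismatch, and gap values are assumptions chosen for readability.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score of a and b."""
    # prev holds the previous row of the dynamic-programming matrix.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally (match or mismatch) ...
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # ... or open a gap in one of the two sequences.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Hypothetical sequences, invented for illustration only.
    print(local_alignment_score("GGACGTGG", "TTACGTTT"))
```

A high score between a parasite sequence and a plant or bacterial enzyme, together with the absence of a comparable hit in animal genomes, is the kind of signal that would flag a candidate drug target in this style of analysis.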

Jomaa et al. (1999) Science 285: 1573-1576

Biological databases

The challenge (Boguski, 1999) In 1995, the number of genes in the database started to exceed the number of papers on molecular biology and genetics in the literature!

Data types • Primary data: sequence (DNA, amino acid), stored in a primary database, e.g., AATGCGTATAGGC, DMPVERILEALAVE • Secondary data: secondary protein structure, e.g., alpha-helices, beta-strands, stored in a secondary db as “motifs”: regular expressions, blocks, profiles, fingerprints • Tertiary data: tertiary protein structure (domains, folding units), stored in a tertiary db as atomic co-ordinates
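
To make the idea of a motif stored as a regular expression concrete, here is a small sketch that scans a protein sequence for the N-glycosylation site pattern, written in PROSITE notation as N-{P}-[ST]-{P}. The function name and the test peptide are hypothetical and chosen only for illustration; blocks, profiles, and fingerprints need different machinery and are not covered here.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} as a regular expression:
# N, any residue except P, then S or T, then any residue except P.
# The lookahead lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence: str, pattern: re.Pattern = N_GLYCOSYLATION):
    """Return (0-based start, matched residues) for each motif occurrence."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical peptide, invented for illustration only.
    print(find_motif("MKNGSAANPTQNVTL"))
```

A regular expression either matches or does not, so a sequence that differs from the pattern at a single position is missed entirely; that limitation is one reason secondary databases also offer profiles and fingerprints.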

Primary biological databases • Nucleic acid: EMBL, GenBank, DDBJ (DNA Data Bank of Japan) • Protein: PIR, MIPS, SWISS-PROT, TrEMBL, NRL-3D

International nucleotide data banks • EMBL (Europe): EMBL, EBI; TrEMBL • GenBank (USA): NLM, NCBI; NRDB • DDBJ (Japan): NIG, CIB • Coordination: International Advisory Meeting, Collaborative Meeting

GenBank file format

Swiss-Prot

SWISS-PROT file format

Other primary protein databases • TrEMBL (translated EMBL): computer-annotated supplement to SWISS-PROT, in SWISS-PROT format; translations of all coding sequences (CDS) in EMBL; rapid access to sequence data from genome projects • SP-TrEMBL • REM-TrEMBL: immunoglobulins, T-cell receptors, short fragments, synthetic and patented sequences
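
The translation step behind a resource such as TrEMBL can be sketched as follows. This is a simplified illustration, not TrEMBL's actual pipeline: it assumes the standard genetic code, a CDS that starts in frame, and no introns. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    # Step through complete codons only; trailing bases are ignored.
    for i in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    # Hypothetical CDS, invented for illustration only.
    print(translate("ATGGCCTGA"))
```

Because the output is produced automatically from nucleotide entries, it arrives without the manual curation of SWISS-PROT, which is why TrEMBL is described as a computer-annotated supplement.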

Other primary protein databases The Protein Information Resource (PIR) • integrated system of protein sequence databases and derived related databases, e. g., alignment databases • rapid searching, comparison, and pattern matching of protein sequences • retrieval of descriptive, bibliographic, feature, and concurrent cross-reference information • aims to be comprehensive and consistently annotated

PIR: related databases NRL-3D Sequence-Structure Database • produced by PIR from sequence and annotation information extracted from three-dimensional structures in the Protein Databank (PDB) • allows keyword and similarity searches

PIR: related databases • PATCHX, integrated with PIR: a non-redundant database of protein sequences produced by MIPS, the European branch of PIR-International • The PIR Protein Sequence Database and PATCHX together provide the most complete collection of protein sequence data currently available in the public domain.

Composite protein sequence dbs and their source databases • NRDB: PIR, SP, PDB, GenPept • OWL: PIR, SP, GenBank, NRL-3D • MIPSX (PIR+PATCHX): PIR, SP, MIPSOwn, NRL-3D, MIPSH, PIRMOD, MIPSTrn, EMTrans, GBTrans, Kabat, PseqIP • SP+TrEMBL: SP, TrEMBL
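
A composite database merges several primary sources and removes redundant entries. The sketch below shows one simple way this could be done, keeping the first copy of each exact sequence as sources are read in a chosen priority order. It is an assumption-laden illustration: the function name, the entry identifiers, and the priority order are hypothetical, and real composites such as OWL or NRDB apply their own redundancy criteria, which may also treat near-identical sequences as duplicates.

```python
def build_composite(sources):
    """Merge (source_name, {entry_id: sequence}) pairs in priority order.

    An entry is kept only if its exact sequence has not already been
    contributed by a higher-priority source.
    """
    seen = set()
    composite = []
    for source_name, entries in sources:
        for entry_id, sequence in entries.items():
            key = sequence.upper()
            if key in seen:
                continue  # exact duplicate of an earlier entry
            seen.add(key)
            composite.append((source_name, entry_id, sequence))
    return composite


if __name__ == "__main__":
    # Hypothetical entries, invented for illustration only.
    sources = [
        ("SP", {"P1": "MKTAYIAK"}),
        ("PIR", {"A1": "MKTAYIAK", "A2": "MAVLLQ"}),
    ]
    for row in build_composite(sources):
        print(row)
```

The priority order matters because it decides which annotation survives when two sources hold the same sequence.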

OWL composite database • By accession number • By database code • By text • By sequence • By title • By author • By query language • By regular expression • Direct OWL access: OWL only released every 6-8 weeks • OWL Blast server

Two other useful sites • INFOBIOGEN: The Public Catalog of Databases http://www.infobiogen.fr/services/dbcat/ • KEGG: Kyoto Encyclopedia of Genes and Genomes http://www.genome.ad.jp/kegg/ KEGG is an effort to computerize current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes, and to provide links from the gene catalogs produced by genome sequencing projects.

Sequence Retrieval System (SRS) • Database browser that allows users to retrieve, link, and access entries from all interconnected resources. • Users can formulate queries across a range of different database types.

Guide to Protein Databases: http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture1/index.html http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture2/index.html With thanks to Dr Roman Laskowski.
