Bioinformatics

Bioinformatics Lecturer: Christos Makris Office: Π502 (ΠΡΟΚΑΤ) e-mail: makri@ceid.upatras.gr Teaching: Friday 12:00-14:00

Description • E_class and course description

Introduction Bioinformatics can be defined as: “the application of computational techniques and methods in an effort to understand and organize data and information that is related to biological macromolecules...”. Bioinformatics-Bioinformatics is conceptualizing biology in terms of molecules (in the sense of Physical Chemistry) and applying "informatics techniques" (derived from disciplines such as applied maths, computer science and statistics) to understandand organise the information associated with these molecules, on a large scale. In short bioinformatics is a management information system for molecular biology and has many practical applications.

Research Areas • Efficient storage and organization of biological data • Development of tools that permit the efficient analysis of biological data. • Development of tools for data analysis of experimental results.

Course contents General Introduction • Part I -- Introduction to algorithms for efficient storage and management of strings and biological data sequences . -- Algorithms for exact pattern matching (Boyer-Moore,Knuth-Morris-Pratt,Shift-Or, Multiple pattern matching). -- Introduction to suffix trees and its applications. -- Αlgorithms for approximate pattern matching and Sequence Alignment --Algorithms for searching in Sequence Data Bases (FASTA, BLAST, PROSITE)

Part II -- The theoretical base of molecular design. --Molecular models and biochemical information --Structure baseddrug design -- Open problems • Part III Clustering and categorization techniques for biological data, in order to predict the behavior of biological molecules.

Course Exam Course exams consist of: Oral examination on the topic of one project (a research assignments on 2-3 papers with obligations a written report of 20-25 pages on the content and a powerpoint presentation of 20 minutes) (50% of the total degree) and questions about the course (50% of the degree). Exam duration: 1h. Concerning the topic of the course mainly the understanding of concepts will be tested. The relative importance of the technical and mathematical derivations will be pointed out during the lectures

Basic Notion (1) Bioinformatics is management of Biology in terms of molecules (Physical Chemistry) and the application of “information techniques” (applied mathematics, computer science and statistics) for comprehending and organizing information that is related to molecules in large scale. Figure from “Bioinformatics, from Genomes to Drugs”. T. Lengauer

Cells Information and Machinery • Cells store all information to replicate itself • Human genome is around 3 billions base pair long • Almost every cell in human body contains same set of genes • But not all genes are used or expressed by those cells • Machinery: • Collect and manufacture components • Carry out replication • Kick-start its new offspring • (A cell is like a car factory)

Prokaryotes and Eukaryotes • According to the most recent evidence, there are three main branches to the tree of life. • Prokaryotes include Archaea (“ancient ones”) and bacteria. • Eukaryotes are kingdom Eukarya and includes plants, animalsalgae.

Some Terminology • Genome: an organism’s genetic material • Gene: a discrete units of hereditary information located on the chromosomes and consisting of DNA. • Genotype: The genetic makeup of an organism • Phenotype: the physical expressed traits of an organism • Nucleic acid: Biological molecules(RNA and DNA) that allow organisms to reproduce;

More Terminology • The genome is an organism’s complete set of DNA. • a bacteria contains about 600,000 DNA base pairs • human and mouse genomes have some 3 billion. • human genome has 24 distinct chromosomes. • Each chromosome contains many genes. • Gene • basic physical and functional units of heredity. • specific sequences of DNA bases that encode instructions on how to make proteins. • Proteins • Make up the cellular structure • large, complex molecules made up of smaller subunits called amino acids.

All Life depends on 3 critical molecules • DNAs • Hold information on how cell works • RNAs • Act to transfer short pieces of information to different parts of cell • Provide templates to synthesize into protein • Proteins • Form enzymes that send signals to other cells and regulate gene activity • Form body’s major components (e.g. hair, skin, etc.)

DNA: The Code of Life • The structure and the four genomic letters code for all living organisms • Adenine, Guanine, Thymine, and Cytosine which pair A-T and C-G on complimentary strands.

DNA, continued • DNA has a double helix structure which composed of • sugar molecule • phosphate group • and a base (A,C,G,T)

DNA, RNA, and the Flow of Information Replication Transcription Translation

Overview of DNA to RNA to Protein • A gene is expressed in two steps • Transcription: RNA synthesis • Translation: Protein synthesis

DNA the Genetics Makeup • Genes are inherited and are expressed • genotype (genetic makeup) • phenotype (physical expression) • On the left, is the eye’s phenotypes of green and black eye genes.

Cell Information: Instruction book of Life • DNA, RNA, and Proteins are examples of strings written in either the four-letter nucleotide of DNA and RNA (A C G T/U) • or the twenty-letter amino acid of proteins. Each amino acid is coded by 3 nucleotides called codon. (Leu, Arg, Met, etc.)

Basic Notions (2) • DNA consists of 1 double 1 strand of bases, The bases are united in a concrete sequence and store the genetic information of every organization : Α (adenine ), Τ (thymine ), C (cytosine ), G (guanine ) • EveryDNA molecule can be considered as a string of alphabet{A,C,T,G} • Double helix, the knowledge of one assures the knowledge of the other (Α-Τ, C-G)

Genomes • The term genome, refers to the DNA sequence of a living organism, • Human genome consist of 46 chromosomes. • Every cell consist of the while genome of an organism (differentiation between prokaryotes and eukariotes)

Proteins • Proteins are molecules that consist of one or more polypeptides, • A polypeptide is a polymer consisting of amino acids • The cells produce their proteins from 20 different amino acids. • A protein sequence can be considered as a seuqence of an alphabet of 20 characters,Σ= {Ala, Arg, Asp, Asn, Cys, Glu, Gln, Gly, Hsi,Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, Val}.

Genes • Genes are the basic unit of inheritance, • A gene encodes a protein since it stores information for its construction, • Human genome consists of ~40.000 genes

Molecular Biology Dogma • DNA (RNA-polymerase) -> pre-RNA (Τ->U (uracil)) • Pre-RNA (Spliceosome) -> RNA • RNA (Ribosome) -> Protein (the first step for eukariotes) -- The process deals with the part of the DNA that codes a gene -- The term gene expression refers to the RNA production from a gene.

Central Dogma of Molecular Biology • DNA molecule(https://public.ornl.gov/site/gallery/default.cfm - U.S. Department of Energy Genome Programs ) • Central dogma of Molecular Biology (http://www.accesexcellence.org/AB/GG) • Replication • Transcription • Synthesis • Genetic code (http://www.accesexcellence.org/AB/GG) • In a nutshell • (http://www.doegenomes.org - U.S. Department of Energy Genome Programs ) • Little changed imply variations • (http://www.doegenomes.org - U.S. Department of Energy Genome Programs ) • Genomics, and proteomics

Central Dogma (DNARNAprotein) The paradigm that DNA directs its transcription to RNA, which is then translated into a protein. • Transcription (DNARNA) The process which transfers genetic information from the DNA to the RNA. • Translation (RNAprotein) The process of transforming RNA to protein as specified by the genetic code.

Central Dogma of Biology The information for making proteins is stored in DNA. There is a process (transcription and translation) by which DNA is converted to protein. By understanding this process and how it is regulated we can make predictions and models of cells. Assembly Protein Sequence Analysis Sequence analysis Gene Finding

RNA • RNA is similar to DNA chemically. It is usually only a single strand. T(hyamine) is replaced by U(racil) • Some forms of RNA can form secondary structures by “pairing up” with itself. This can have change its properties dramatically. DNA and RNA can pair with each other. tRNA linear and 3D view: http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif

RNA, continued • Several types exist, classified by function • mRNA – this is what is usually being referred to when a Bioinformatician says “RNA”. This is used to carry a gene’s message out of the nucleus. • tRNA – transfers genetic information from mRNA to an amino acid sequence • rRNA – ribosomal RNA. Part of the ribosome which is involved in translation.

Splicing (Eukaryotes) • Unprocessed RNA is composed of Introns and Extrons. Introns are removed before the rest is expressed and converted to protein. • Sometimes alternate splicings can create different valid proteins. • A typical Eukaryotic gene has 4-20 introns. Locating them by analytical means is not easy.

The Central Dogma (cont’d)

Translation • The process of going from RNA to polypeptide. • Three base pairs of RNA (called a codon) correspond to one amino acid based on a fixed table. • Always starts with Methionine and ends with a stop codon

It is Sequenced, What’s Next? • Tracing Phylogeny • Finding family relationships between species by tracking similarities between species. • Gene Annotation (cooperative genomics) • Comparison of similar species. • Determining Regulatory Networks • The variables that determine how the body reacts to certain stimuli. • Proteomics • From DNA sequence to a folded protein.

The future… • Bioinformatics is still in it’s infancy • Much is still to be learned about how proteins can manipulate a sequence of base pairs in such a peculiar way that results in a fully functional organism. • How can we then use this information to benefit humanity without abusing it?

Notions • Genomics (studying genes) • Proteomics (studying proteins)

Molecular Biology targets • Sequencing and comparison of genomes of different organisms (evolution, correlation). • Gene recognition and definition of the functionality that they define (recognition of contact areas and form there gene recognition). • Understanding gene expression (every gene is acticated during the production of the specific gene expression, studying the activation procedure) • Understanding the role of genes for several diseases (gene mutation). The data that we use can be from experiments or from biological data baes)

Solving Computational Problems using Bioinformatics Tools • Combining gene information (sequencing and linking) • Comparing sequences (information retrieval algorithms using schematic similaritities). • Protein categorizaion. • Information extraction from gene expressiong. • Representing cells as transcription networks.

Bioinformatics research areas • Implementing and designing computations tool for automatic knowledge extraction from Biological Data Bases • Analysing Biological Data Sequences • Biological data categorization • Molecular modeling • Protein analysis • Drug design with Computers

Computational Tools • Managing data with Molecular Biology is increasingly demanding and the model of relational data bases does not seem satisfying. The target is to design and implement a model that permits automated knowledge discovery from data of large scale. • Recognizing common structural characteristics not only at the sequence level but also in two dimensional (2D) or three dimensional (3D) level. • Similarity tracking between 2D or 3D shapes

Sequence Analysis • Exact matching • Approximate matching • Alignment (local or global) • Searching for maximal common subsequences • Data Bases of biological macromolecules (compression • Algorithms for searching, indexing and Cross-Referencing

Categorization/Clustering Biological Data • Categorization takes place using common motifs structural or functional • We are interested in applications that integrate different data types (data integration: sequences, 3D co-ordinates, functional knowledge)

Molecular Modeling • Selection of the proper model that described sufficiently the innermolecular correlations of the studied biological system. • Computing the energy state of the system and it minimixation, • Analysis of these calculations and control of the final formulation so that the conditions and the restrictions that have been settled by are fullfilled.

Protein Analysis • Setting the three dimensional structure of a protein and its amino acid sequence • Studying the Protein Docking Problem and the DNA-Protein Docking Problem

Computer Based Drug Design • Computer systems store useful information related to: 1) three dimensional architecture of molecules, 2) phycic and chemical properties, 3) comparing one molecule with other molecules, 4) micromolecules and macromolecules complexes, 5) predictions for new molecules. • The first target of scientists that are engaged to computer based drug design is the effective representations of the structure of normal and pathological molecules that are subsesequently compared to pathogenic enzymes and active receptors, and the target of molecular design is settled. • So, if we know the protein structure and the way that the receptor ar the active region acts, we can “built” and simulate their docking saving time and space

Computer Based Drug Design • Graphic Algorithms • Geometric calculations • Arithmetic methods • Graph-theoretic methods

In silico drug based design(in silico) • Design drug compound • Optimizing drug compound • In vitro andin vivo tests • Toxic tests • Human tests • Performance tests

Design Genome sequences Gene finding Protein sequences Structure prediction Protein structure Geometry calculation Protein surface Molecular modeling Field forces Molecular docking Molecular docking

Problems • the lack of a general and unified tool for designing molecular structures that can contain the whole set of biological molecules, • the increasing computational complexity that is expressed in time and needed resources and that increases with the size of the molecule • Selection of the proper representation model (aalogous to the biological molecult) and defining the crucial parameters (e.g.: geometrical coordinates) that should be checked especially at the level of computational geometry, • Robustness to errors in the input data and rebuilding of a three dimensional model from incomplete data, • Simulataneous representation of a set of physicchemical properties (entropy, energy) and the means that this information can be comperehensible and interpetable by a researcher

Phylogenetics means to register the genetic correlations between sepcies and the history of living organisms • Lyhlogenetic is the tree that depicts the evolution between different sepcied with common ancestor.ο. • Φύλλα : είδη/βιολογικές ακολουθίες • Εσωτερικοί Κόμβοι: υποθετικοί κοινοί πρόγονοι

Bioinformatics

Bioinformatics

Presentation Transcript

Bioinformatics

Bioinformatics:

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics

Bioinformatics