Introduction to Bioinformatics

The Swiss Institute of Bioinformatics • Collaborative structure Lausanne - Geneva • Groups at ISREC, Ludwig Institute, CHUV, Unil, HUG, UniGe, and recently UniBas • Several roles: research, services, teaching • DEA (master's degree) in Bioinformatics: 1 year full time • EMBnet courses: 2 x 1 week per year, to be extended • Undergraduate courses at the Universities of Geneva, Fribourg and Lausanne

Projects at SIB • Databases • SWISS-PROT, PROSITE, EPD, World-2DPAGE, SWISS-MODEL • TrEST, TrGEN (predicted proteins), tromer (transcriptome) • Software • Melanie, Deep View, proteomics tools, ESTScan, pftools, Java applets • Services • Web servers ExPASy, EMBnet • Teaching and helpdesk • Research • Mostly sequence and expression analysis, 3D structure, and proteomics

EMBnet organisation • Founded in Europe in 1988, now spread worldwide • 29 country nodes, 9 special nodes • Role • Training, education • Software development (EMBOSS, SRS) • Computing resources (databases, websites, services) • Helpdesk and technical support • Publications

Swiss node http://www.ch.embnet.org

Other important sites • ExPASy - Expert Protein Analysis System • www.expasy.org • EBI - European Bioinformatics Institute • www.ebi.ac.uk • NCBI - National Center for Biotechnology Information • www.ncbi.nlm.nih.gov • Sanger - The Sanger Institute • www.sanger.ac.uk

Bioinformatics: definition • Every application of computer science to biology • Sequence analysis, image analysis, sample management, population modelling, … • Analysis of data coming from large-scale biological projects • Genomes, transcriptomes, proteomes, metabolomes, etc.

The new biology • Traditional biology • Small team working on a specialized topic • Well-defined experiment to answer precise questions • New « high-throughput » biology • Large international teams, with cutting-edge technology defining the project • Raw results are released to the scientific community without any underlying hypothesis

Examples of « high-throughput » • Complete genome sequencing • Large-scale sampling of the transcriptome (EST) • Simultaneous expression analysis of thousands of genes (DNA microarrays, SAGE) • Large-scale sampling of the proteome • Protein-protein interaction analysis by large-scale two-hybrid screens (yeast, worm) • Large-scale 3D structure production (yeast) • Metabolism modelling • Simulations • Biodiversity

Role of bioinformatics • Control and management of the data • Analysis of primary data, e.g. • Base calling from chromatograms • Mass spectra analysis • DNA microarray image analysis • Statistics • Database storage and access • Analysis of results in a biological context

First information: a sequence ? • Nucleotide • RNA (or cDNA) • Genomic (intron-exon) • Complete or incomplete? • mRNA with 5’ and 3’ UTR regions • Entire chromosome • Protein • Pre/Pro or functional protein? • Function prediction • Post-translational modifications? • Holy Grail: 3D structure?

Genomes in numbers • Sizes: • virus: 10^3 to 10^5 nt • bacteria: 10^5 to 10^7 nt • yeast: 1.35 x 10^7 nt • mammals: 10^8 to 10^10 nt • plants: 10^10 to 10^11 nt • Gene number: • virus: 3 to 100 • bacteria: ~ 1000 • yeast: ~ 7000 • mammals: ~ 30’000 • plants: 30’000-50’000?
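The sizes and gene numbers above can be combined into a rough gene density. The following is a minimal sketch of that arithmetic only; the function name `nt_per_gene` is hypothetical, and the result is an average that ignores how unevenly genes are distributed along a genome.

```python
def nt_per_gene(genome_size_nt: float, gene_count: int) -> float:
    """Average number of nucleotides of genome per gene (size / gene number)."""
    return genome_size_nt / gene_count


# Figures taken from the slide; the mammalian size is given as a range.
yeast = nt_per_gene(1.35e7, 7000)
mammal_low = nt_per_gene(1e8, 30_000)
mammal_high = nt_per_gene(1e10, 30_000)

print(f"yeast:   one gene per ~{yeast:,.0f} nt")
print(f"mammals: one gene per ~{mammal_low:,.0f} to ~{mammal_high:,.0f} nt")
```

With the slide's numbers, yeast comes out at roughly one gene every 2 kb, whereas the mammalian range spans two orders of magnitude because the genome size itself is quoted as a range.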

Sequencing projects • « small » genomes (<10^7 nt): bacteria, viruses • Many already sequenced (industry excluded) • More than 90 microbial genomes already in the public domain • More to come! (one new every two weeks…) • « large » genomes (10^7-10^10 nt): eukaryotes • 12 finished (S. cerevisiae, S. pombe, E. cuniculi, C. elegans, D. melanogaster, A. gambiae, D. rerio, F. rubripes, A. thaliana, O. sativa, M. musculus, H. sapiens) • Many more to come: rat, pig, cow, maize (and other plants), insects, fishes, many pathogenic parasites (Plasmodium…) • EST sequencing • Partial mRNA sequences • ~12 x 10^6 sequences in the public domain

Human genome • [Figure: schematic chromosome labelled with centromere, telomere, exons of a gene, locus control region, regulatory elements, repetitive sequences] • Size: 3 x 10^9 nt for a haploid genome • Highly repetitive sequences 25%, moderately repetitive sequences 25-30% • Size of a gene: from 900 to >2’000’000 bases (introns included) • Proportion of the genome coding for proteins: 5-7% • Number of chromosomes: 22 autosomes and 1 sex chromosome (haploid set) • Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases

How to sequence the human genome? • Consortium « international » approach: • Generate genetic maps (meiotic recombination) and pseudogenetic maps (chromosome hybrids) for indicator sequences • Generate a physical map based on large clones (BAC or PAC) • Sequence enough large clones to cover the genome • « commercial » approach (Celera): • Generate random libraries of fixed length genomic clones (2kb and 10kb) • Sequence both ends of enough clones to obtain a 10x coverage • Use computer techniques to reconstitute the chromosomal sequences, check with the public project physical map
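The « 10x coverage » target in the shotgun approach can be turned into a read count with the relation coverage = N x L / G (N reads of length L over a genome of size G). The sketch below is a back-of-the-envelope estimate, not a description of the actual Celera pipeline: the 500 nt read length is an assumption that does not come from the slides, the genome size is the 3 x 10^9 nt quoted for the human genome, and the uncovered fraction uses the idealised Poisson (Lander-Waterman) model in which reads land uniformly at random.

```python
import math


def reads_needed(genome_size: float, coverage: float, read_length: float) -> int:
    """Smallest N such that N * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size / read_length)


def expected_uncovered_fraction(coverage: float) -> float:
    """Idealised Poisson estimate of the fraction of bases hit by no read."""
    return math.exp(-coverage)


GENOME_SIZE = 3e9   # haploid human genome, nt
READ_LENGTH = 500   # ASSUMPTION: illustrative read length, not from the slides
COVERAGE = 10

n_reads = reads_needed(GENOME_SIZE, COVERAGE, READ_LENGTH)
n_clones = n_reads // 2  # both ends of each clone are sequenced

print(f"reads needed:  {n_reads:,}")
print(f"clones needed: {n_clones:,}")
print(f"expected uncovered fraction: {expected_uncovered_fraction(COVERAGE):.2e}")
```

Even at 10x coverage the idealised model leaves a small fraction of bases unsampled, and real genomes do worse because repeats and cloning biases break the uniform-sampling assumption.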

Sequencing progression

Interpretation of the human draft • Still many gaps and unordered small pieces (except for chr 6, 7, 20, 21, 22, Y) • Even a genomic sequence does not tell you where the genes are encoded. The genome is far from being « decoded » • One must combine genome and transcriptome to have a better idea

The transcriptome • The set of all functional RNAs (tRNA, rRNA, mRNA etc…) that can potentially be transcribed from the genome • The documentation of the localization (cell type) and conditions under which these RNAs are expressed • The documentation of the biological function(s) of each RNA species

Public draft transcriptome • Information about the expression specificity and the function of mRNAs • « full » cDNA sequences of known function • « full » cDNA sequences, but « anonymous » (e.g. KIAA or DKFZ collections) • EST sequences • cDNA libraries derived from many different tissues • Rapid random sequencing of the ends of all clones • ORESTES sequences • Growing set of expression data (microarrays, SAGE, etc.) • Increasing evidence for multiple alternative splicing and polyadenylation

Example: mapping of ESTs and mRNAs • [Figure: genome tracks showing mRNAs, ESTs, and computer prediction]

The proteome • Set of proteins present in a particular cell type under particular conditions • Set of proteins potentially expressed from the genome • Information about the specific expression and function of the proteins

Information on the proteome • Separation of a complex mixture of proteins • 2D PAGE (IEF + SDS PAGE) • Capillary chromatography • Individual characterisation of proteins • Tryptic peptide signature (MS) • Sequencing by chemistry or MS/MS • All post-translational modifications (PTMs)!
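The « tryptic peptide signature » relies on the fact that trypsin cuts a protein at predictable positions, so the expected peptides can be computed from the sequence. The following is a minimal sketch of an in silico digest, assuming the commonly used simplified rule (cleave after K or R, except when the next residue is P) and no missed cleavages; the function name and the example sequence are hypothetical and are not taken from any SIB tool.

```python
from typing import List


def tryptic_digest(protein: str) -> List[str]:
    """Split a protein sequence after every K or R that is not followed by P."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # C-terminal peptide that does not end in K or R
        peptides.append(protein[start:])
    return peptides


# Made-up sequence for illustration only.
print(tryptic_digest("MKWVTFRPLLKAA"))
```

In a peptide mass fingerprint, the masses of such predicted peptides are what get compared with the measured spectrum; computing those masses is left out of this sketch.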

Three-dimensional structures • Methods to determine structures • X-ray crystallography • NMR • Data format • Atom coordinates (except H) in Cartesian space • Databases • For proteins and nucleic acids (RCSB, formerly PDB) • Independent databases for sugars and small organic molecules
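To make « atom coordinates in Cartesian space » concrete, the sketch below reads x, y, z from the fixed-width ATOM records of the legacy PDB text format and measures the distance between two atoms. The two records are made up for illustration and do not come from a real database entry; the `Atom` type and `parse_atom` helper are hypothetical names, and only a few of the record's fields are extracted.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom(line: str) -> Atom:
    """Extract selected fixed-width fields from one PDB ATOM record."""
    return Atom(
        name=line[12:16].strip(),        # columns 13-16: atom name
        residue=line[17:20].strip(),     # columns 18-20: residue name
        chain=line[21],                  # column 22: chain identifier
        residue_number=int(line[22:26]), # columns 23-26: residue number
        x=float(line[30:38]),            # columns 31-38: x in angstroms
        y=float(line[38:46]),            # columns 39-46: y in angstroms
        z=float(line[46:54]),            # columns 47-54: z in angstroms
    )


# Illustrative records only (not from a real structure).
records = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
]
atoms = [parse_atom(r) for r in records if r.startswith("ATOM")]
n, ca = atoms
distance = math.dist((n.x, n.y, n.z), (ca.x, ca.y, ca.z))
print(f"{n.name}-{ca.name} distance: {distance:.2f} angstroms")
```

Slicing by column rather than splitting on whitespace is a deliberate choice here: in this format adjacent fields can run together without a separating space.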

Visualisation of the structures • Secondary structure elements • Alpha helices, beta sheets, other • Software • Various representations (atoms, bonds, secondary structure…) • Wide choice of commercial and free software (e.g., DeepView)

Sequence information, and so what ? • How to store and organise ? • Databases (next lecture) • How to access, search, compare ? • Pairwise alignments, BLAST (tomorrow) • EST clustering, Multiple Alignments (Wednesday) • Patterns, PSI-BLAST, Profiles and HMMs (Thursday) • Gene prediction (Thursday) • Your problems? • Friday
