Introduction to Bioinformatics

SIB and EMBnet Bioinformatics resources for biomedical scientists

The Swiss Institute of Bioinformatics • Founded in March 1998 • Collaborative structure Lausanne - Geneva - Basel • Groups at ISREC, Ludwig Institute, Unil, HUG, UniGe, recently UniBas and soon EPFL. • Several roles: teaching, services, research • Currently: ~ 130 employees

Projects at SIB • Databases • SWISS-PROT, PROSITE, EPD, World-2DPAGE, SWISS-MODEL • TrEST, TrGEN (predicted proteins), tromer (transcriptome) • Software • Melanie, Deep View, proteomics tools, ESTScan, pftools, Java applets • Services • Web servers ExPASy, EMBnet • Teaching and helpdesk • Research • Mostly sequence and expression analysis, 3D structure, and proteomics

Teaching • DEA (master's degree) in Bioinformatics: 1 year full time, first diploma common to UniGe and Unil. • EMBnet courses: 2 x 1 week per year in Lausanne, plus 1 week in Basel starting in 2003 (this course!). • Undergraduate courses at the Universities of Geneva, Fribourg and Lausanne • Other courses at CHUV and EPFL • Courses in other countries: Colombia, Cambodia, Peru, …

Research • New algorithms (faster alignments…) • New technology (GRID or cluster computing) • New tools (protein analysis, microarrays, confocal microscopy) • New databases (microarrays, transcriptome, proteome) • Collaborations with lab researchers!

Three levels of services • Simple web access to software and databases • Easy to use for basic, occasional research with few sequences • Potentially insecure • Command-line access with a local Unix account • More powerful (automation) and secure • Requires an understanding of Unix and frequent practice • Collaboration with SIB • Access to experts in the field (help desk) • For projects requiring extensive programming or special hardware resources • Help desk • helpdesk@mail.ch.embnet.org or http://www.expasy.org/contact.html

SIB’s important sites • Home • www.isb-sib.ch • ExPASy - Expert Protein Analysis System • www.expasy.org • Hits database and tools • hits.isb-sib.ch • EMBnet Switzerland • www.ch.embnet.org • Geneva Bioinformatics • www.genebio.ch

SIB home

Expert Protein Analysis System

Swiss node http://www.ch.embnet.org

EMBnet organisation • Started as a European network in 1988, now spread world-wide • 32 country nodes, 8 special nodes. • Role • Training, education (EMBER) • Software development (EMBOSS, SRS) • Computing resources (databases, websites, services) • Helpdesk and technical support • Publications (EMBnet.news, Briefings in Bioinformatics) • Access: www.embnet.org • Each node is at “www.xx.embnet.org”, where xx is the country code (e.g., ch for Switzerland)

EMBnet home

European Molecular Biology Open Software Suite (EMBOSS) • Free and open source (for most Unix platforms) • GCG successor (compatible with the GCG file format) • More than 150 programs (ver. 2.7.1) • Easy to install locally • but no graphical interface, requires local databases • Unix command-line only • Interfaces • Jemboss, www2gcg, w2h, wemboss… (with account) • Pise, EMBOSS-GUI, SRSWWW (no account) • Staden, Kaptain, CoLiMate, Jemboss (local) • Access: www.emboss.org

Other important sites • ExPASy - Expert Protein Analysis System • www.expasy.org • EBI - European Bioinformatics Institute • www.ebi.ac.uk • NCBI - National Center for Biotechnology Information • www.ncbi.nlm.nih.gov • Sanger - The Sanger Institute • www.sanger.ac.uk

Bioinformatics: definition • Every application of computer science to biology • Sequence analysis, image analysis, sample management, population modelling, … • Analysis of data coming from large-scale biological projects • Genomes, transcriptomes, proteomes, metabolomes, etc.

The new biology • Traditional biology • Small team working on a specialized topic • Well defined experiment to answer precise questions • New « high-throughput » biology • Large international teams using cutting edge technology defining the project • Results are given raw to the scientific community without any underlying hypothesis

Examples of « high-throughput » • Complete genome sequencing • Large-scale sampling of the transcriptome (EST) • Simultaneous expression analysis of thousands of genes (DNA microarrays, SAGE) • Large-scale sampling of the proteome • Large-scale protein-protein interaction analysis by two-hybrid screens (yeast, worm) • Large-scale 3D structure production (yeast) • Metabolism modelling • Simulations • Biodiversity

Role of bioinformatics • Control and management of the data • Analysis of primary data, e.g. • Base calling from chromatograms • Mass spectra analysis • DNA microarray image analysis • Statistics • Database storage and access • Results analysis in a biological context

First information: a sequence ? • Nucleotide • RNA (or cDNA) • Genomic (intron-exon) • Complete or incomplete? • mRNA with 5’ and 3’ UTR regions • Entire chromosome • Protein • Pre/Pro or functional protein? • Function prediction • Post-translational modifications? • Holy Grail: 3D structure?

Genomes in numbers • Sizes: • virus: 10^3 to 10^5 nt • bacteria: 10^5 to 10^7 nt • yeast: 1.35 x 10^7 nt • mammals: 10^8 to 10^10 nt • plants: 10^10 to 10^11 nt • Gene number: • virus: 3 to 100 • bacteria: ~ 1000 • yeast: ~ 7000 • mammals: ~ 30’000 • plants: 30’000-50’000?

Sequencing projects • « small » genomes (<10^7 nt): bacteria, viruses • Many already sequenced (industry excluded) • More than 100 microbial genomes already in the public domain • More to come! (one new every two weeks…) • « large » genomes (10^7 to 10^10 nt): eukaryotes • 15 finished (S. cerevisiae, S. pombe, E. cuniculi, G. theta, C. elegans, D. melanogaster, A. gambiae, P. falciparum, P. yoelii, D. rerio, F. rubripes, A. thaliana, O. sativa (2x), M. musculus, Homo sapiens) • Many more to come: rat, pig, cow, maize (and other plants), insects, fishes, many pathogenic parasites (Leishmania…) • EST sequencing • Partial mRNA sequences: ~15 x 10^6 sequences in the public domain

Human genome [Figure: chromosome schematic labelled with centromere, telomere, exons of a gene, locus control region, regulatory elements and repetitive sequences] • Size: 3 x 10^9 nt for a haploid genome • Highly repetitive sequences 25%, moderately repetitive sequences 25-30% • Size of a gene: from 900 to >2’000’000 bases (introns included) • Proportion of the genome coding for proteins: 5-7% • Number of chromosomes (haploid set): 22 autosomes and 1 sex chromosome (X or Y) • Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases
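
As a rough worked example, the Python sketch below turns the 5-7% coding proportion into absolute numbers of bases. The function name is illustrative, and the inputs are the slide's round estimates, not measurements.

```python
def coding_bases(genome_size_nt, coding_percent):
    """Approximate number of protein-coding bases, using integer arithmetic."""
    return genome_size_nt * coding_percent // 100

HAPLOID_GENOME_NT = 3_000_000_000  # 3 x 10^9 nt, from the slide

low = coding_bases(HAPLOID_GENOME_NT, 5)
high = coding_bases(HAPLOID_GENOME_NT, 7)
print(f"Protein-coding bases: {low:,} to {high:,} nt")
```

Under these estimates, only about 150 to 210 million of the 3 billion bases code for proteins.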

How to sequence the human genome? • Consortium « international » approach: • Generate genetic maps (meiotic recombination) and pseudogenetic maps (chromosome hybrids) for indicator sequences • Generate a physical map based on large clones (BAC or PAC) • Sequence enough large clones to cover the genome • « commercial » approach (Celera): • Generate random libraries of fixed length genomic clones (2kb and 10kb) • Sequence both ends of enough clones to obtain a 10x coverage • Use computer techniques to reconstitute the chromosomal sequences, check with the public project physical map
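
The "10x coverage" target can be made concrete with a back-of-the-envelope calculation: coverage is the total number of sequenced bases divided by the genome size. The sketch below is illustrative only. The 500 nt read length is an assumption that does not appear on the slide, the genome size is the round 3 x 10^9 nt figure, and the uncovered-fraction estimate assumes reads fall uniformly at random (a Poisson model), which real libraries only approximate.

```python
import math

def reads_needed(genome_size_nt, coverage, read_length_nt):
    """Reads required so that reads * read_length = coverage * genome size."""
    return math.ceil(coverage * genome_size_nt / read_length_nt)

def uncovered_fraction(coverage):
    """Expected fraction of bases hit by no read under a Poisson model."""
    return math.exp(-coverage)

GENOME_NT = 3_000_000_000   # round haploid human genome size
READ_LENGTH_NT = 500        # assumed read length, not from the slide
COVERAGE = 10               # target coverage from the slide

reads = reads_needed(GENOME_NT, COVERAGE, READ_LENGTH_NT)
clones = math.ceil(reads / 2)  # each clone is sequenced from both ends

print(f"Reads needed:  {reads:,}")
print(f"Clones needed: {clones:,}")
print(f"Expected uncovered fraction: {uncovered_fraction(COVERAGE):.2e}")
```

With these assumed values the calculation gives 60 million reads from 30 million clones. Changing the read length scales the result proportionally.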

Interpretation of the human draft • All chromosomes considered as finished • Even a genomic sequence does not tell you where the genes are encoded. The genome is far from being « decoded » • One must combine genome and transcriptome to have a better idea • Last freeze: NCBI33, April 14, 2003

The transcriptome • The set of all functional RNAs (tRNA, rRNA, mRNA etc…) that can potentially be transcribed from the genome • The documentation of the localization (cell type) and conditions under which these RNAs are expressed • The documentation of the biological function(s) of each RNA species

Public draft transcriptome • Information about the expression specificity and the function of mRNAs • « full » cDNA sequences of known function • « full » cDNA sequences, but « anonymous » (e.g. KIAA or DKFZ collections) • EST sequences • cDNA libraries derived from many different tissues • Rapid random sequencing of the ends of all clones • ORESTES sequences • Growing set of expression data (microarrays, SAGE etc.) • Increasing evidence for multiple alternative splicing and polyadenylation

Example mapping of ESTs and mRNAs [Figure labels: mRNAs, ESTs, Computer prediction]

The proteome • Set of proteins present in a particular cell type under particular conditions • Set of proteins potentially expressed from the genome • Information about the specific expression and function of the proteins

Information on the proteome • Separation of a complex mixture of proteins • 2D PAGE (IEF + SDS PAGE) • Capillary chromatography • Individual characterisation of proteins • Tryptic peptide signature (MS) • Sequencing by chemistry or MS/MS • All post-translational modifications (PTMs)!
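
The tryptic peptide signature relies on the fact that trypsin cuts a protein at predictable positions. A minimal in-silico digestion might look like the sketch below. It assumes the commonly used simplified rule (cleave after K or R, except when the next residue is P) and ignores missed cleavages. The function name and the example sequence are hypothetical.

```python
def tryptic_digest(protein):
    """Split a protein after each K or R, unless the next residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal peptide when the sequence does not end in K/R.
        peptides.append(protein[start:])
    return peptides

example = "MKRPLAKGR"  # made-up sequence, for illustration only
print(tryptic_digest(example))
```

In a real identification workflow, the masses of such predicted peptides would then be compared with the masses measured by MS. That step is left out here.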

Three-dimensional structures • Methods to determine structures • X-ray crystallography • NMR • Data format • Atom coordinates (except H) in a Cartesian space • Databases • For proteins and nucleic acids (RCSB, formerly PDB) • Independent databases for sugars and small organic molecules
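
To illustrate what "atom coordinates in a Cartesian space" means in practice, the sketch below reads x, y, z values from the fixed-width ATOM records of the PDB text format and computes the distance between two atoms. The two records use made-up coordinates chosen for easy checking, and the helper names are illustrative. A real file contains many more record types and fields than are handled here.

```python
import math

def parse_atom(line):
    """Extract atom name, residue, chain and x/y/z from a PDB ATOM record."""
    return {
        "name": line[12:16].strip(),      # columns 13-16
        "residue": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                # column 22
        "xyz": (
            float(line[30:38]),           # columns 31-38: x in angstroms
            float(line[38:46]),           # columns 39-46: y
            float(line[46:54]),           # columns 47-54: z
        ),
    }

def distance(atom_a, atom_b):
    """Euclidean distance between two parsed atoms, in angstroms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(atom_a["xyz"], atom_b["xyz"])))

# Hypothetical records with invented coordinates, for illustration only.
records = [
    "ATOM      1  N   GLY A   1       0.000   0.000   0.000",
    "ATOM      2  CA  GLY A   1       3.000   4.000   0.000",
]

atoms = [parse_atom(r) for r in records if r.startswith("ATOM")]
print(atoms[0]["name"], atoms[1]["name"], distance(atoms[0], atoms[1]))
```

Slicing by column rather than splitting on whitespace is deliberate, because adjacent fields in this format may touch when values are large.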

Visualisation of the structures • Secondary structure elements • Alpha helices, beta sheets, other • Software • Various representations (atoms, bonds, secondary…) • Wide choice of commercial and free software (e.g., DeepView)

Sequence information, and so what? • How to store and organise? • Databases (next lecture) • How to access, search, compare? • Pairwise alignments, dot plots (Tuesday) • BLAST searches in db (Tuesday) • Patterns, PSI-BLAST, Profiles and HMMs (Wednesday) • Gene prediction (Wednesday) • EST clustering (Thursday) • Multiple Alignments (Thursday) • Protein function prediction (Friday) • User problems (Friday)

Thank you