C. elegans Bioinformatics

C. elegans Bioinformatics Vuokko Aarnio 27.8.2008

Bioinformatics • Applies math, statistics and computer science to understand biological processes, usually on a molecular level • Data often from high-throughput techniques (sequencing, microarrays etc.) • Many research areas: • Sequence alignment • Sequence pattern finding • Gene expression - microarray bioinformatics • Assembly of genome from short sequences (in genome projects) • Protein structure prediction from sequence • Visualization of protein-protein interaction networks • Modeling of evolution (genetic algorithms) • C. elegans genome is quite well annotated and accessible with many bioinformatics tools

Overview of lecture • Sequence analyses • Sequence alignments • PCR primer design • Prediction of transcription factor binding sites • Wormbase • Microarray bioinformatics • Data analysis overview • Functional annotation • Data repository GEO

Sequence analysis - Identifier • Name of the sequence • Different bioinformatics databases and tools operate with different identifiers • Example: different identifiers for one gene
Wormbase locus ID: ahr-1
Ensembl Gene ID: C41G7.5
Entrez gene ID: 172788
EMBL (Genbank) ID: Z81048
RefSeq DNA ID: NM_001025865
WB Gene ID: WBGene00000096

Sequence analysis - Format • How the sequence is written • Tools require the sequence in a correct format • Most common format FASTA:
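As a hedged illustration of the format: a FASTA record is a header line starting with ">" followed by one or more lines of sequence. The record below is made up for illustration (its name and sequence are not real data), and `parse_fasta` is a hypothetical helper, not part of any tool named in this lecture.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]        # header without the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical record, for illustration only.
example = """>example_seq hypothetical record for illustration
ATGGCGTACGTTAGC
TTAGGCTAA
"""

records = parse_fasta(example)
print(records)
```

Most tools accept several such records one after another in the same input.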

Annotation • Any information on a sequence • Can be identifier, description,...

Gene Ontology • Project that provides a controlled vocabulary to describe gene product attributes in any organism • Ontologies: biological process, molecular function, cellular component • Ontology term: code and a common name e.g. GO:0007186 - GPCR protein signaling pathway • Gene Ontology annotation: characterization of gene products using the ontology terms • Based on wet lab experiments or sequence similarity with other characterized genes

BioMart tool • By EBI (European Bioinformatics Institute) • Finds annotations from databases e.g. Ensembl genome database • Good tool to e.g. convert gene identifiers and download sequences • Also finds chromosomal locations and Gene Ontology terms • Free web interface at http://www.biomart.org

Using the BioMart tool • 1. Choose database and organism • 2. Define what your input is (e.g. list of Ensembl gene IDs) • 3. Specify what you want as output (e.g. gene sequences with Entrez gene IDs) • 4. Run search and export your results
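The four steps above can also be written down as a BioMart XML query, which is the form the web interface can export. The sketch below only builds the query text and does not contact any server. The dataset name, filter name and attribute names are assumptions and must be checked against the BioMart server actually used; `build_biomart_query` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filter_name, ids, attributes):
    """Build a BioMart XML query string (nothing is sent anywhere)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",          # tab-separated output
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    # Step 1: database and organism (dataset name is an assumption).
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Step 2: what the input is (a list of gene IDs).
    ET.SubElement(ds, "Filter", {"name": filter_name, "value": ",".join(ids)})
    # Step 3: what the output should contain.
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    return ET.tostring(query, encoding="unicode")


# Step 4 (running the search) is left out; this only prints the query.
query_xml = build_biomart_query(
    dataset="celegans_gene_ensembl",          # assumed dataset name
    filter_name="ensembl_gene_id",            # assumed filter name
    ids=["C41G7.5"],                          # gene ID from the identifier example
    attributes=["ensembl_gene_id", "entrezgene_id"],  # assumed attribute names
)
print(query_xml)
```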

Sequence alignments - BLAST • Basic Local Alignment Search Tool • Sequence similarity search program • Finds matching sequences in NCBI database • The sequence can be nucleotide, protein, translated, genome,... • Free web interface at http://www.ncbi.nlm.nih.gov/blast

Two types of sequence alignments in BLAST • 1. Compare two given sequences, e.g. • Does your PCR product have the right sequence? • How closely related is your protein of interest to its homolog in another species? • 2. Compare one sequence against genome, transcriptome, proteome etc. • Does a sequence correspond to any known gene or regulatory area? • Will a PCR primer bind to one or many sites in the genome?

Example: Alignment of two sequences with BLAST • Sequences to be compared • Button that starts alignment

Example: The two sequences were almost the same • 87 / 96 matching nucleotides • 3 missing nucleotides • 6 mismatched nucleotides • Both sequences were in 5'-3' orientation
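The numbers in this example add up as 87 + 3 + 6 = 96 aligned positions, which is about 90.6 % identity. A minimal sketch of how such counts can be taken from a gapped pairwise alignment follows. The two toy aligned strings are invented for illustration, and `alignment_stats` is a hypothetical helper, not a BLAST function.

```python
def alignment_stats(aligned_a, aligned_b):
    """Count matches, mismatches and gap positions in a pairwise alignment.

    Both strings must have the same length; '-' marks a missing nucleotide.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


# Toy alignment, invented for illustration.
toy_a = "ATGCC-TAGGA"
toy_b = "ATGACGTA-GA"
toy_stats = alignment_stats(toy_a, toy_b)
print(toy_stats)

# Percent identity for the numbers given in the example: 87 of 96 positions.
slide_identity = round(100 * 87 / 96, 1)
print(slide_identity)
```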

Using Nucleotide BLAST • The sequence in FASTA format • What type of nucleotides (RNA, genomic DNA, expressed sequence tags etc.) • Organism • Run search

Example: gene cloned and sequenced - compared to genome • Corresponds to a known gene (cyp-42A1) • The sequence was backwards • Matches closely but not perfectly
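A sequence that matches "backwards" usually means that it matches the opposite strand, so its reverse complement is what appears in the reference. The sketch below shows that idea on toy strings. It assumes exact matching only (BLAST also tolerates mismatches), and `reverse_complement` and `find_orientation` are hypothetical helpers.

```python
# Translation table for complementary DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orientation(query, reference):
    """Report whether query occurs in reference as given or reverse-complemented."""
    if query in reference:
        return "forward"
    if reverse_complement(query) in reference:
        return "reverse"
    return None


# Toy reference and queries, invented for illustration.
reference = "AAATGCCGTT"
print(find_orientation("ATGCC", reference))
print(find_orientation("GGCAT", reference))
```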

Multiple sequence alignment • Compares several given sequences • Builds a hierarchical tree that shows how closely each sequence is related to the others • Several tools, e.g. ClustalW • Several tools to visualize the tree e.g. HyperTree, JalView,... • e.g. family of proteins in different species - which ones are most closely related

ClustalW multiple sequence alignment tool • Compares each given sequence to every other given sequence • Free web interface at http://ebi.ac.uk/Tools/clustalW/index.html

Example: Hierarchical tree of C. elegans CYP proteins (HyperTree)

PCR primer design with Primer3 • Finds optimized primers from sequence • Takes into account the desired melting temperature, GC content and primer length • Improvements made at the University of Tartu
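To make the three criteria concrete, the sketch below computes length, GC content and a rough melting temperature for one primer. It uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C) in °C, which is only a rule of thumb for short oligos; it is not the thermodynamic model Primer3 uses. The primer sequence is invented and `primer_properties` is a hypothetical helper.

```python
def primer_properties(primer):
    """Return (length, GC percent, Wallace-rule Tm in deg C) for a primer."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    gc_percent = 100 * gc / len(primer)
    tm_wallace = 2 * at + 4 * gc  # rough estimate, short oligos only
    return len(primer), gc_percent, tm_wallace


# Hypothetical 20-nt primer, for illustration only.
props = primer_properties("ATGCGTACGTTAGCCTAGGA")
print(props)
```

A design tool would evaluate many candidate primers this way and keep those closest to the desired values.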

In silico PCR predicts what will be amplified with given primers
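A minimal sketch of the idea, assuming exact primer matches on a single linear template: the forward primer is searched as written, the reverse primer is searched as its reverse complement, and the stretch between them is the predicted product. Real in silico PCR tools also allow mismatches and report every possible product. The template and primers are invented, and `in_silico_pcr` is a hypothetical helper.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def in_silico_pcr(template, forward, reverse):
    """Return the predicted amplicon, or None if a primer has no exact site.

    Only the first forward site and the first downstream reverse site are used.
    """
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so its site on the
    # given strand reads as the primer's reverse complement.
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Toy template and primers, invented for illustration.
template = "GGATCCATGAAACCCGGGTTTTAGCTCGAG"
amplicon = in_silico_pcr(template, forward="ATGAAA", reverse="CTAAAA")
print(amplicon)
```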

Transcription factors translation

Prediction of transcription factor binding sites • Transcription factors bind to specific short DNA sequences to induce or repress transcription • The binding sites of a transcription factor can sometimes vary by one or a few nucleotides • There are several tools to predict transcription factor binding sites, e.g. POXO

Different tools in POXO Kankainen, M. et al. Nucl. Acids Res. 2006 34:W534-W540; doi:10.1093/nar/gkl296

Finding of enriched patterns • Put in sequences upstream of your genes of interest (can be obtained from BioMart or POXO sequence retrieval) • POCO finds which patterns occur frequently • Compares to the likelihood of finding that pattern by chance, for example certain nucleotides are generally more common in the genome than others

Clustering of found patterns • Puts together patterns that have something in common • Forms longer patterns • Allows also some differences

Checking if a given pattern is enriched in the sequences • POBO counts how many times the given pattern is present in the sequences • Compares to how many times the pattern is present in the "background" • Background different in different tools • POXO creates several lists of random sequences (same number and length as in the given sequence list)
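The comparison against a background can be sketched as follows. This is not the POBO algorithm: as an assumption, the background lists here are made by shuffling each input sequence, which keeps the number, length and nucleotide composition of the sequences. The input sequences and the pattern are invented, and both function names are hypothetical.

```python
import random


def count_pattern(sequences, pattern):
    """Count all (overlapping) occurrences of pattern in a list of sequences."""
    width = len(pattern)
    total = 0
    for seq in sequences:
        for i in range(len(seq) - width + 1):
            if seq[i:i + width] == pattern:
                total += 1
    return total


def empirical_enrichment(sequences, pattern, n_background=200, seed=0):
    """Return (observed count, empirical p-value) against shuffled backgrounds."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = count_pattern(sequences, pattern)
    at_least = 0
    for _ in range(n_background):
        background = []
        for seq in sequences:
            letters = list(seq)
            rng.shuffle(letters)  # same length and composition, random order
            background.append("".join(letters))
        if count_pattern(background, pattern) >= observed:
            at_least += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (at_least + 1) / (n_background + 1)


# Toy upstream sequences, invented for illustration.
upstream = ["ACGTGCGTGAACGTGTT", "TTCACGTGACGTGAA", "GGCACGTGCCA"]
observed, p_value = empirical_enrichment(upstream, "ACGTG")
print(observed, p_value)
```

A small p-value means that few background lists contained the pattern as often as the real sequences did.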

Wormbase • Major publicly available database of information about C. elegans • Essential for worm researchers • Search e.g. info on a gene

Names

Sequence • Exons colored, introns white

Chromosomal location, anatomic expression pattern…

Information of gene product function collected from publications (RNAi, microarray), links to the publications

Information on how a mutant allele differs from the wild-type gene • In this example, a point mutation that leads to a stop codon in the middle of the protein

Link to the Caenorhabditis Genetics Center (CGC), from which you can request worm strains

Microarrays

Microarray bioinformatics • Expression levels of tens of thousands of genes in one experiment • Quantification of intensities from an image • Data analysis • Finding annotations to genes - DAVID tool • Using existing microarray data - GEO

Image quantification • TIGR Spotfinder - freely downloadable software • Input image files • Compose a grid - each spot to its own square • Segmentation method decides which part is spot and which is background • Intensity value for each spot represents amount of RNA in the original sample

Data analysis • Several commercial programs e.g. GeneSpring • Normalization • Sometimes one label (of 2-color experiment) is stronger than the other or some chips or chip areas have been hybridized more efficiently • Normalization makes these different labels, chips or chip areas comparable • Statistics • Which genes are significantly under- or over-expressed
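One simple way to make the two labels of a 2-color chip comparable is to centre the log2 ratios on their median, so that a typical gene ends up at 0. This is only one of several normalization methods and is not claimed to be what any particular program does. The intensity values are invented and `median_center_log_ratios` is a hypothetical helper.

```python
import math
import statistics


def median_center_log_ratios(red, green):
    """Return log2(red/green) per spot, shifted so the chip median is 0."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    chip_median = statistics.median(log_ratios)
    return [value - chip_median for value in log_ratios]


# Toy intensities: the red label is about twice as strong overall.
red = [200, 400, 800, 1600, 400]
green = [100, 200, 400, 200, 400]
normalized = median_center_log_ratios(red, green)
print(normalized)
```

After centring, the fourth spot stands out as over-expressed and the fifth as under-expressed, while the overall label bias is removed.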

Finding generally overrepresented functions in your gene list • DAVID annotation tool • Compares which annotations are over-represented in your gene list • Good for showing general trends in a large gene list

DAVID Functional Annotation Tool • By NIAID, National Institutes of Health • Finds significantly enriched annotations to gene products in your list: • Gene Ontology terms • Identifiers • Protein domains • Pathways • etc.
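Whether an annotation is "significantly enriched" can be illustrated with a plain hypergeometric test: how likely is it to see at least this many annotated genes in a list of this size drawn at random from the background? DAVID reports its own modified statistic, so the sketch below shows the general idea only. All numbers are invented and `hypergeom_enrichment_p` is a hypothetical helper.

```python
from math import comb


def hypergeom_enrichment_p(list_hits, list_size, background_hits, background_size):
    """P(at least list_hits annotated genes in a random list of list_size genes)."""
    total = comb(background_size, list_size)
    max_hits = min(list_size, background_hits)
    favourable = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return favourable / total


# Invented numbers: 10 of 100 listed genes carry a term that
# 200 of 20000 background genes carry (1 gene would be expected by chance).
p_enriched = hypergeom_enrichment_p(10, 100, 200, 20000)
print(p_enriched)
```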

Example: Functional annotation chart

Microarray data repository - GEO • Gene Expression Omnibus, NCBI • MIAME: the Minimum Information About a Microarray Experiment that should be provided • Submit your own microarray data to GEO • Browse, search and retrieve microarray data • http://www.ncbi.nlm.nih.gov/geo/

Summary: Bioinformatics • + More and larger data sets available • + -omics level approach • + Performs functions that would be extremely tedious to do manually • + Tools are easy to use • - Information is sometimes a prediction based on similarity with other gene products -> wet lab experiments still needed
