
C. elegans Bioinformatics

Vuokko Aarnio, 27.8.2008


  1. C. elegans Bioinformatics Vuokko Aarnio 27.8.2008

  2. Bioinformatics • Applies math, statistics and computer science to understand biological processes, usually at the molecular level • Data often come from high-throughput techniques (sequencing, microarrays etc.) • Many research areas: sequence alignment; sequence pattern finding; gene expression (microarray bioinformatics); assembly of genomes from short sequences (in genome projects); protein structure prediction from sequence; visualization of protein-protein interaction networks; modeling of evolution (genetic algorithms) • The C. elegans genome is quite well annotated and accessible with many bioinformatics tools

  3. Overview of lecture • Sequence analyses: sequence alignments; PCR primer design; prediction of transcription factor binding sites • Wormbase • Microarray bioinformatics: data analysis overview; functional annotation; data repository GEO

  4. Sequence analysis - Identifier • Name of the sequence • Different bioinformatics databases and tools operate with different identifiers • Example: different identifiers for one gene: Wormbase locus ID: ahr-1; Ensembl Gene ID: C41G7.5; Entrez gene ID: 172788; EMBL (Genbank) ID: Z81048; RefSeq DNA ID: NM_001025865; WB Gene ID: WBGene00000096
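Converting between identifier types amounts to a table lookup. The sketch below is only an illustration in Python: the table holds the single gene from the slide above, and the column names and the `convert_id` helper are made up for this example. In practice the table would be exported from a database tool.

```python
# Identifiers for one gene, copied from the slide above.
# The column names are hypothetical labels chosen for this sketch.
GENE_IDS = [
    {
        "wormbase_locus": "ahr-1",
        "ensembl_gene": "C41G7.5",
        "entrez_gene": "172788",
        "refseq_dna": "NM_001025865",
    },
]


def convert_id(value, source, target, table=GENE_IDS):
    """Look up `value` in column `source`; return the `target` ID or None."""
    for row in table:
        if row.get(source) == value:
            return row.get(target)
    return None


entrez = convert_id("ahr-1", "wormbase_locus", "entrez_gene")
```

Returning `None` for an unknown identifier makes missing mappings explicit instead of raising an error halfway through a long gene list.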

  5. Sequence analysis - Format • How the sequence is written • Tools require the sequence in a correct format • Most common format: FASTA
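The FASTA example on the slide was an image and is not part of this transcript. As a reminder, a FASTA record consists of a header line starting with `>` (identifier, then an optional description) followed by one or more lines of sequence. The Python sketch below reads such text; the record shown is a made-up placeholder, not a real sequence.

```python
def parse_fasta(text):
    """Return a dict mapping each record identifier to its full sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'; the rest is description.
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines are joined later, so line wrapping does not matter.
            records[current_id].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical record for illustration only.
example = """>example_seq hypothetical description
ATGGCGTACG
TTAGC
"""
sequences = parse_fasta(example)
```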

  6. Annotation • Any information on a sequence • Can be identifier, description,...

  7. Gene Ontology • Project that provides a controlled vocabulary to describe gene product attributes in any organism • Ontologies: biological process, molecular function, cellular component • Ontology term: code and a common name e.g. GO:0007186 - GPCR protein signaling pathway • Gene Ontology annotation: characterization of gene products using the ontology terms • Based on wet lab experiments or sequence similarity with other characterized genes

  8. BioMart tool • By EBI (European Bioinformatics Institute) • Finds annotations from databases e.g. Ensembl genome database • Good tool to e.g. convert gene identifiers and download sequences • Also finds chromosomal locations and Gene Ontology terms • Free web interface at http://www.biomart.org

  9. Using the BioMart tool • 1. Choose database and organism • 2. Define what your input is (e.g. list of Ensembl gene IDs) • 3. Specify what you want as output (e.g. gene sequences with Entrez gene IDs) • 4. Run search and export your results

  10. Sequence alignments - BLAST • Basic Local Alignment Search Tool • Sequence similarity search program • Finds matching sequences in NCBI databases • The sequence can be nucleotide, protein, translated, genome,... • Free web interface at http://www.ncbi.nlm.nih.gov/blast

  11. Two types of sequence alignments in BLAST • 1. Compare two given sequences, e.g. • Does your PCR product have the right sequence? • How closely related is your protein of interest to its homolog in another species? • 2. Compare one sequence against a genome, transcriptome, proteome etc. • Does a sequence correspond to any known gene or regulatory area? • Will a PCR primer bind to one or many sites in the genome?

  12. Example: Alignment of two sequences with BLAST (screenshot labels: sequences to be compared; button that starts alignment)

  13. Example: The two sequences were almost the same • 87 / 96 right nucleotides • 6 wrong nucleotides • 3 missing nucleotides • Both sequences were in 5'-3' orientation
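The counts on this slide (right, wrong and missing nucleotides) can be derived from any pairwise alignment once both sequences are written with gap characters. The Python sketch below assumes the alignment already exists and uses `-` for gaps; the two short sequences are invented for illustration and are not the ones from the slide.

```python
def alignment_stats(aligned_a, aligned_b, gap="-"):
    """Count matches, mismatches and gapped columns in a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            gaps += 1          # a nucleotide is missing in one sequence
        elif a == b:
            matches += 1       # identical nucleotides
        else:
            mismatches += 1    # different nucleotides
    length = len(aligned_a)
    identity = matches / length if length else 0.0
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        "length": length,
        "identity": identity,
    }


# Hypothetical aligned pair: one mismatch and one gap.
stats = alignment_stats("ACGTTGCA-T", "ACGATGCAGT")
```

For the slide's example the same calculation gives 87 matches, 6 mismatches and 3 gaps over 96 columns, which is about 91 % identity.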

  14. Using Nucleotide BLAST • Enter the sequence in FASTA format • Choose what type of nucleotides to search (RNA, genomic DNA, expressed sequence tags etc.) • Choose the organism • Run the search

  15. Example: a gene was cloned and sequenced, then compared to the genome • Corresponds to a known gene (cyp-42A1) • The sequence was backwards • Matches closely but not perfectly
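If "backwards" here means that the read matched the opposite strand of the gene, the usual remedy is to take the reverse complement before comparing. A minimal Python sketch under that assumption follows; the sequences and the `find_orientation` helper are hypothetical, and exact substring matching stands in for a real alignment.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orientation(query, reference):
    """Return '+' if query occurs as given, '-' if only its reverse
    complement occurs, and None if neither is found (exact matches only)."""
    if query in reference:
        return "+"
    if reverse_complement(query) in reference:
        return "-"
    return None


# Hypothetical reference and read.
reference = "TTATGCCGAA"
orientation = find_orientation("CGGCAT", reference)
```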

  16. Multiple sequence alignment • Compares several given sequences • Builds a hierarchical tree that shows how closely each sequence is related to the others • Several tools, e.g. ClustalW • Several tools to visualize the tree e.g. HyperTree, JalView,... • e.g. family of proteins in different species - which ones are most closely related

  17. ClustalW multiple sequence alignment tool • Compares each given sequence to each other given sequence • Free web interface at http://ebi.ac.uk/Tools/clustalW/index.html
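The all-against-all comparison can be illustrated with a pairwise distance table, which is the kind of input a tree is built from. The Python sketch below is a simplification and not ClustalW's algorithm: it assumes the sequences are already aligned to equal length and uses the fraction of differing positions as the distance. Sequence names and contents are invented.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def pairwise_distances(sequences):
    """Compare every sequence with every other one; keys are name pairs."""
    return {
        (name_1, name_2): p_distance(sequences[name_1], sequences[name_2])
        for name_1, name_2 in combinations(sorted(sequences), 2)
    }


# Hypothetical pre-aligned sequences.
aligned = {"seq_a": "ACGTACGT", "seq_b": "ACGTACGA", "seq_c": "TCGTTCGA"}
distances = pairwise_distances(aligned)
closest_pair = min(distances, key=distances.get)
```

The pair with the smallest distance would be joined first when building a hierarchical tree.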

  18. Example: Hierarchical tree of C. elegans CYP proteins (HyperTree)

  19. PCR primer design with Primer3 • Finds optimized primers from a sequence • Takes into account the desired melting temperature, GC content and primer length • Improvements made at the University of Tartu
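The three criteria named on this slide are easy to compute for a candidate primer. The Python sketch below uses the simple Wallace rule of thumb (2 °C per A/T, 4 °C per G/C) for the melting temperature; this is a stand-in chosen for brevity and not the thermodynamic model Primer3 uses. The primer sequence is hypothetical.

```python
def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Hypothetical 20-nt primer: 6 A, 4 T, 5 G, 5 C.
primer = "AGCGTTACGGATCCATGCAA"
properties = {
    "length": len(primer),
    "gc_content": gc_content(primer),
    "tm_wallace": wallace_tm(primer),
}
```

A design tool would score many candidate primers this way and keep those closest to the desired length, GC content and melting temperature.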

  20. In silico PCR predicts what will be amplified with given primers
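The idea of in silico PCR can be sketched as a string search: find where the forward primer matches the template and where the reverse primer's reverse complement matches downstream of it, then report what lies between. The Python sketch below is a simplification that only accepts exact matches on one strand of the template; real tools also allow mismatches and search both strands. All sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def in_silico_pcr(template, forward, reverse):
    """Return predicted amplicons, primers included (exact matches only)."""
    reverse_site = reverse_complement(reverse)
    products = []
    start = template.find(forward)
    while start != -1:
        # The reverse primer site must lie downstream of the forward primer.
        end = template.find(reverse_site, start + len(forward))
        while end != -1:
            products.append(template[start:end + len(reverse_site)])
            end = template.find(reverse_site, end + 1)
        start = template.find(forward, start + 1)
    return products


# Hypothetical template built from labelled parts for readability.
template = "GGG" + "ATGCA" + "TTTTTT" + "CCGGA" + "AGGG"
products = in_silico_pcr(template, forward="ATGCA", reverse="TCCGG")
```

More than one product in the result would indicate that the primer pair is not specific for a single target.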

  21. Transcription factors translation

  22. Prediction of transcription factor binding sites • Transcription factors bind to specific short DNA sequences to induce or repress transcription • The binding sites of a transcription factor can sometimes vary by one or a few nucleotides • There are several tools to predict transcription factor binding sites, e.g. POXO
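Variation at single positions of a binding site is commonly written with IUPAC ambiguity codes (for example S for C or G, N for any base). The Python sketch below converts such a motif into a regular expression and scans one strand of a sequence for it. The motif and the sequence are illustrative only and are not taken from the lecture; this is also not how POXO works internally.

```python
import re

# IUPAC nucleotide ambiguity codes as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif such as 'TGASTCA' into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_sites(sequence, motif):
    """Return (position, matched site) for every match, overlaps included."""
    # The lookahead lets overlapping sites be reported as well.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical motif with one variable position (S = C or G).
sites = find_sites("AATGACTCAGGTGAGTCATT", "TGASTCA")
```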

  23. Different tools in POXO Kankainen, M. et al. Nucl. Acids Res. 2006 34:W534-W540; doi:10.1093/nar/gkl296

  24. Finding of enriched patterns • Put in sequences upstream of your genes of interest (can be obtained from BioMart or POXO sequence retrieval) • POCO finds which patterns occur frequently • Compares the counts to the likelihood of finding each pattern by chance; for example, certain nucleotides are generally more common in the genome than others
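The comparison of observed counts with chance expectation can be sketched as follows. This Python example counts every pattern of length k in the input sequences and estimates the expected count from single-nucleotide frequencies, assuming bases are independent. That background model is a deliberate simplification and is not claimed to be POCO's method; the sequences are invented.

```python
from collections import Counter


def kmer_enrichment(sequences, k):
    """Return {pattern: (observed count, expected count)} for all k-mers."""
    # Background: frequency of each nucleotide in the input sequences.
    base_counts = Counter("".join(sequences))
    total_bases = sum(base_counts.values())
    freqs = {base: n / total_bases for base, n in base_counts.items()}

    observed = Counter()
    windows = 0
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            observed[seq[i:i + k]] += 1
            windows += 1

    results = {}
    for kmer, count in observed.items():
        probability = 1.0
        for base in kmer:
            probability *= freqs[base]  # independent-base assumption
        results[kmer] = (count, windows * probability)
    return results


# Hypothetical upstream sequences sharing the pattern GCGTG.
upstream = ["AAGCGTGAA", "TTGCGTGTT", "CAGCGTGAC"]
results = kmer_enrichment(upstream, k=5)
observed_count, expected_count = results["GCGTG"]
```

A pattern whose observed count is far above its expected count is a candidate for an enriched site.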

  25. Clustering of found patterns • Puts together patterns that have something in common • Forms longer patterns • Allows also some differences

  26. Checking if a given pattern is enriched in the sequences • POBO counts how many times the given pattern is present in the sequences • Compares to how many times the pattern is present in the "background" • The background differs between tools • POXO creates several lists of random sequences (same number and length as in the given sequence list)
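The random-background comparison described above can be sketched in Python. The example counts a pattern in the given sequences, then generates many random sequence lists with the same number and lengths, and reports how often a random list contains the pattern at least as many times. Drawing bases uniformly at random is an assumption of this sketch; the actual tool may build its background differently. Function names and sequences are hypothetical.

```python
import random


def count_pattern(sequences, pattern):
    """Count all occurrences of pattern, overlapping ones included."""
    total = 0
    for seq in sequences:
        start = seq.find(pattern)
        while start != -1:
            total += 1
            start = seq.find(pattern, start + 1)
    return total


def background_test(sequences, pattern, n_sets=200, seed=0):
    """Return (observed count, fraction of random sets with >= that count)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = count_pattern(sequences, pattern)
    at_least_as_many = 0
    for _ in range(n_sets):
        # Same number of sequences and same lengths as the input list.
        random_set = [
            "".join(rng.choice("ACGT") for _base in range(len(seq)))
            for seq in sequences
        ]
        if count_pattern(random_set, pattern) >= observed:
            at_least_as_many += 1
    return observed, at_least_as_many / n_sets


# Hypothetical input sequences.
given = ["AAGCGTGAA", "TTGCGTGTT", "CAGCGTGAC"]
observed, fraction = background_test(given, "GCGTG")
```

A small fraction suggests that the pattern occurs more often in the given sequences than expected by chance.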

  27. Wormbase • Major publicly available database of information about C. elegans • Essential for worm researchers • Search e.g. for information on a gene

  28. Names

  29. Sequence: exons colored, introns white

  30. Chromosomal location, anatomic expression pattern…

  31. Information on gene product function collected from publications (RNAi, microarray), with links to the publications

  32. Information on how a mutant allele differs from the wild-type gene • In this example, a point mutation that leads to a stop codon in the middle of the protein
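How a single nucleotide change can truncate a protein is easy to show by translating both versions of a coding sequence. The Python sketch below uses the standard genetic code and stops at the first stop codon. The wild-type and mutant sequences are short invented examples, not the allele shown on the slide.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence until the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break  # stop codon: translation ends here
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical example: G -> A at position 6 turns TGG (Trp) into TGA (stop).
wild_type = "ATGTGGCAAGGA"
mutant = "ATGTGACAAGGA"
wild_type_protein = translate(wild_type)
mutant_protein = translate(mutant)
```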

  33. Link to the Caenorhabditis Genetics Center (CGC), from where you can request worm strains

  34. Microarrays

  35. Microarray bioinformatics • Expression levels of tens of thousands of genes in one experiment • Quantification of intensities from an image • Data analysis • Finding annotations to genes - DAVID tool • Using existing microarray data - GEO

  36. Image quantification • TIGR Spotfinder - freely downloadable software • Input image files • Compose a grid - each spot to its own square • Segmentation method decides which part is spot and which is background • Intensity value for each spot represents amount of RNA in the original sample

  37. Data analysis • Several commercial programs e.g. GeneSpring • Normalization: sometimes one label (of a 2-color experiment) is stronger than the other, or some chips or chip areas have been hybridized more efficiently; normalization makes these different labels, chips or chip areas comparable • Statistics: which genes are significantly under- or over-expressed
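One simple way to correct for a label that is stronger overall is global median normalization of the log ratios. The Python sketch below assumes a 2-color experiment with one red and one green intensity per spot and shifts the log2 ratios so that their median is zero. This is just one of several normalization methods and is not claimed to be what any particular program does; the intensity values are invented.

```python
import math
import statistics


def median_normalize(red, green):
    """Return log2(red/green) ratios shifted so that their median is 0."""
    if len(red) != len(green):
        raise ValueError("need one red and one green value per spot")
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    center = statistics.median(log_ratios)
    # Subtracting the median removes an overall bias towards one label.
    return [value - center for value in log_ratios]


# Hypothetical intensities: red is about twice as strong on most spots.
red = [200.0, 400.0, 800.0, 1600.0]
green = [100.0, 200.0, 400.0, 200.0]
normalized = median_normalize(red, green)
```

After normalization, the first three spots show no change and only the last spot stands out as over-expressed in the red channel.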

  38. Finding generally overrepresented functions in your gene list • DAVID annotation tool • Compares which annotations are over-represented in your gene list • Good for showing general trends in a large gene list

  39. DAVID Functional Annotation Tool • By NIAID, National Institutes of Health • Finds significantly enriched annotations to gene products in your list: • Gene Ontology terms • Identifiers • Protein domains • Pathways • etc.
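Whether an annotation is over-represented in a gene list is typically judged with a test based on the hypergeometric distribution. The Python sketch below computes the plain one-sided hypergeometric p-value; DAVID itself reports a modified version of this kind of test, so treat the sketch as an illustration of the principle only. The gene counts are invented.

```python
from math import comb


def enrichment_p_value(background_size, annotated_in_background,
                       list_size, annotated_in_list):
    """Probability of seeing at least `annotated_in_list` annotated genes
    when `list_size` genes are drawn at random from the background."""
    total = comb(background_size, list_size)
    upper = min(list_size, annotated_in_background)
    favorable = sum(
        comb(annotated_in_background, i)
        * comb(background_size - annotated_in_background, list_size - i)
        for i in range(annotated_in_list, upper + 1)
    )
    return favorable / total


# Hypothetical numbers: 20 genes in the background, 5 carry the annotation;
# the gene list has 4 genes, 3 of which carry the annotation.
p_value = enrichment_p_value(20, 5, 4, 3)
```

With many annotation terms tested at once, the resulting p-values would also need a multiple-testing correction.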

  40. Example: Functional annotation chart

  41. Microarray data repository - GEO • Gene Expression Omnibus, NCBI • MIAME: the Minimum Information About a Microarray Experiment that should be provided • Submit your own microarray data to GEO • Browse, search and retrieve microarray data • http://www.ncbi.nlm.nih.gov/geo/

  42. Summary: Bioinformatics • + More and larger data sets available • + -omics level approach • + Performs functions that would be extremely tedious to do manually • + Tools are easy to use • - Information is sometimes a prediction based on similarity with other gene products -> wet lab experiments still needed
