C. elegans Bioinformatics Vuokko Aarnio 27.8.2008
Bioinformatics • Applies math, statistics and computer science to understand biological processes, usually on a molecular level • Data often from high-throughput techniques (sequencing, microarrays etc.) • Many research areas: • Sequence alignment • Sequence pattern finding • Gene expression - microarray bioinformatics • Assembly of genome from short sequences (in genome projects) • Protein structure prediction from sequence • Visualization of protein-protein interaction networks • Modeling of evolution (genetic algorithms) • The C. elegans genome is quite well annotated and accessible with many bioinformatics tools
Overview of lecture • Sequence analyses • Sequence alignments • PCR primer design • Prediction of transcription factor binding sites • Wormbase • Microarray bioinformatics • Data analysis overview • Functional annotation • Data repository GEO
Sequence analysis - Identifier • Name of the sequence • Different bioinformatics databases and tools operate with different identifiers • Example: different identifiers for the same gene:
  Wormbase locus ID: ahr-1
  Ensembl Gene ID: C41G7.5
  Entrez gene ID: 172788
  EMBL (Genbank) ID: Z81048
  RefSeq DNA ID: NM_001025865
  WB Gene ID: WBGene00000096
Sequence analysis - Format • How the sequence is written • Tools require the sequence in a correct format • Most common format FASTA:
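The FASTA example from this slide did not survive, so here is a minimal sketch of the format: a header line starting with ">" (identifier first, optional description after it), followed by one or more lines of sequence. The record below is made up for illustration and is not a real database entry; `parse_fasta` is a hypothetical helper name.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the identifier is the first word after ">"
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # Sequence may be wrapped over several lines; collect them all
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical record, for illustration only
example = """>example_seq1 made-up description
ATGGCGTACGTTAGC
GGCTAA
"""
fasta_records = parse_fasta(example)
print(fasta_records)
```

Most tools accept exactly this layout when a sequence is pasted into a web form.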
Annotation • Any information on a sequence • Can be identifier, description,...
Gene Ontology • Project that provides a controlled vocabulary to describe gene product attributes in any organism • Ontologies: biological process, molecular function, cellular component • Ontology term: code and a common name e.g. GO:0007186 - G protein-coupled receptor (GPCR) signaling pathway • Gene Ontology annotation: characterization of gene products using the ontology terms • Based on wet lab experiments or sequence similarity with other characterized genes
BioMart tool • By EBI (European Bioinformatics Institute) • Finds annotations from databases e.g. Ensembl genome database • Good tool to e.g. convert gene identifiers and download sequences • Also finds chromosomal locations and Gene Ontology terms • Free web interface at http://www.biomart.org
Using the BioMart tool • 1. Choose database and organism • 2. Define what your input is (e.g. list of Ensembl gene IDs) • 3. Specify what you want as output (e.g. gene sequences with Entrez gene IDs) • 4. Run search and export your results
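The four steps above can also be written down as a BioMart XML query instead of clicking through the web form. This is a rough sketch that only builds the query text; it does not contact any server. The dataset name `celegans_gene_ensembl`, the filter `ensembl_gene_id` and the attribute names are assumptions that should be checked against the names shown in the BioMart interface, and `build_biomart_query` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filter_name, ids, attributes):
    """Build a BioMart XML query string (steps 1-3); sending it is step 4."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",       # tab-separated output
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    # Step 1: database and organism
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Step 2: the input, here a comma-separated list of gene IDs
    ET.SubElement(ds, "Filter", {"name": filter_name, "value": ",".join(ids)})
    # Step 3: the desired output columns
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    return ET.tostring(query, encoding="unicode")


# Assumed names; C41G7.5 is the Ensembl gene ID given earlier for ahr-1
xml_query = build_biomart_query(
    dataset="celegans_gene_ensembl",
    filter_name="ensembl_gene_id",
    ids=["C41G7.5"],
    attributes=["ensembl_gene_id", "entrezgene_id"],
)
print(xml_query)
```

Step 4 (running the search and exporting the results) would mean submitting this text to a BioMart server, which is left out here.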
Sequence alignments - BLAST • Basic Local Alignment Search Tool • Sequence similarity search program • Finds matching sequences in NCBI database • The sequence can be nucleotide, protein, translated, genome,... • Free web interface at http://www.ncbi.nlm.nih.gov/blast
Two types of sequence alignments in BLAST • 1. Compare two given sequences, e.g. • Does your PCR product have the right sequence? • How closely related is your protein of interest to its homolog in another species? • 2. Compare one sequence against a genome, transcriptome, proteome etc. • Does a sequence correspond to any known gene or regulatory area? • Will a PCR primer bind to one or many sites in the genome?
Example: Alignment of two sequences with BLAST • [Screenshot of the BLAST web form; callouts: the sequences to be compared, the button that starts the alignment]
Example: The two sequences were almost the same • 87 / 96 nucleotides were right • 6 nucleotides were wrong • 3 nucleotides were missing • Both sequences were in 5'-3' orientation
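The numbers in this example add up as 87 matches + 6 mismatches + 3 gaps = 96 alignment columns, which gives about 90.6 % identity. A small sketch of how such counts can be taken from a pairwise alignment follows; the two aligned strings are made up for illustration and are not the sequences from the slide.

```python
def alignment_stats(aligned_a, aligned_b):
    """Count matches, mismatches and gap columns in a pairwise alignment.

    Both strings must have the same length; '-' marks a missing nucleotide.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def percent_identity(matches, alignment_length):
    return 100.0 * matches / alignment_length


# Counts reported in the example above
slide_identity = percent_identity(87, 96)
print(f"{slide_identity:.1f} %")

# Hypothetical toy alignment
toy_stats = alignment_stats("ACGT-ACGTA", "ACGTTACCTA")
print(toy_stats)
```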
Using Nucleotide BLAST • Enter the sequence in FASTA format • Choose what type of nucleotides to search (RNA, genomic DNA, expressed sequence tags etc.) • Choose the organism • Run search
Example: gene cloned and sequenced, then compared to the genome • Corresponds to a known gene (cyp-42A1) • The sequence was backwards • Matches closely but not perfectly
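If "backwards" here means that the sequencing read matched the opposite strand of the gene (an assumption, since the screenshot is not preserved), the read can be turned around by taking its reverse complement before further comparison. A minimal sketch with a made-up sequence:

```python
# Translation table for complementing DNA bases (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5'-3'."""
    return seq.translate(COMPLEMENT)[::-1]


read = "ATGGCC"  # hypothetical read, for illustration only
flipped = reverse_complement(read)
print(flipped)
```

Applying the function twice gives back the original sequence, which is a quick sanity check.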
Multiple sequence alignment • Compares several given sequences • Builds a hierarchical tree that shows how closely each sequence is related to the others • Several tools, e.g. ClustalW • Several tools to visualize the tree e.g. HyperTree, JalView,... • e.g. family of proteins in different species - which ones are most closely related
ClustalW multiple sequence alignment tool • Compares each given sequence to each other given sequence • Free web interface at http://ebi.ac.uk/Tools/clustalW/index.html
Example: Hierarchical tree of C. elegans CYP proteins (HyperTree)
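To show how a tree can be built from all-against-all comparisons, here is a rough sketch that uses simple average-linkage clustering (UPGMA) on sequences that are already aligned. This is an illustration of the idea only and not ClustalW's own procedure; the four sequences are made up, and the output gives the tree shape without branch lengths.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Average-linkage clustering; returns the tree as a Newick-style string."""
    # Each cluster label maps to the names of the sequences it contains
    clusters = {name: [name] for name in seqs}
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}

    def cluster_distance(c1, c2):
        # Average distance over all member pairs of the two clusters
        pairs = [(m1, m2) for m1 in clusters[c1] for m2 in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters into one
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda p: cluster_distance(*p))
        merged = f"({c1},{c2})"
        clusters[merged] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters)) + ";"


# Hypothetical aligned sequences, for illustration only
toy_seqs = {
    "seqA": "ACGTACGT",
    "seqB": "ACGTACGA",
    "seqC": "TTGTACCA",
    "seqD": "TTGTAGCT",
}
tree = upgma(toy_seqs)
print(tree)
```

The printed string groups seqA with seqB and seqC with seqD, and can be opened in tree viewers that read the Newick format.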
PCR primer design with Primer3 • Finds optimized primers from a sequence • Takes into account the desired melting temperature, GC content and primer length • Improvements made at the University of Tartu
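The three criteria above can be illustrated with a small check of a candidate primer. This sketch uses the simple Wallace rule (Tm = 2 x (A+T) + 4 x (G+C)) as a rough melting temperature estimate, which is an assumption made for brevity and not how Primer3 calculates Tm. The primer sequence and the threshold ranges are also made up for illustration.

```python
def gc_content(primer):
    """GC content of the primer in percent."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature (deg C) by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def passes_filters(primer, length_range=(18, 25),
                   gc_range=(40.0, 60.0), tm_range=(52, 65)):
    """Check length, GC content and Tm against illustrative thresholds."""
    return (length_range[0] <= len(primer) <= length_range[1]
            and gc_range[0] <= gc_content(primer) <= gc_range[1]
            and tm_range[0] <= wallace_tm(primer) <= tm_range[1])


candidate = "ATGGCGTACGTTAGCGGCTA"  # hypothetical 20-mer
print(len(candidate), gc_content(candidate), wallace_tm(candidate),
      passes_filters(candidate))
```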
In silico PCR predicts what will be amplified with given primers
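A minimal sketch of the idea: find where the forward primer matches the template, find where the reverse primer binds (as its reverse complement on the same strand), and report the stretch in between. It assumes exact matches and only one orientation of the primer pair, whereas real in silico PCR tools are more tolerant. Template, primers and the `max_product` limit are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def in_silico_pcr(template, forward, reverse, max_product=3000):
    """Return predicted products for exact primer matches on the template."""
    template = template.upper()
    fwd = forward.upper()
    # The reverse primer anneals to the top strand as its reverse complement
    rev_site = reverse_complement(reverse.upper())
    products = []
    start = template.find(fwd)
    while start != -1:
        end = template.find(rev_site, start + len(fwd))
        while end != -1:
            stop = end + len(rev_site)
            if stop - start <= max_product:
                products.append(template[start:stop])
            end = template.find(rev_site, end + 1)
        start = template.find(fwd, start + 1)
    return products


# Hypothetical template and primers
template = "TTTTATGGCACCCCCCCCGATTACGGGG"
products = in_silico_pcr(template, forward="ATGGCA", reverse="GTAATC")
print(products)
```

More than one product in the returned list would suggest that the primers bind at several sites.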
Transcription factors • [Figure not preserved; surviving label: translation]
Prediction of transcription factor binding sites • Transcription factors bind to specific short DNA sequences to induce or repress transcription • The binding sites of a transcription factor can sometimes vary by one or a few nucleotides • There are several tools to predict transcription factor binding sites, e.g. POXO
Different tools in POXO • Source: Kankainen, M. et al. Nucl. Acids Res. 2006 34:W534-W540; doi:10.1093/nar/gkl296
Finding of enriched patterns • Put in the sequences upstream of your genes of interest (can be obtained from BioMart or POXO sequence retrieval) • POCO finds which patterns occur unusually often • Compares each pattern's count to the likelihood of finding that pattern by chance; for example, certain nucleotides are generally more common in the genome than others
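The comparison of observed counts with chance expectation can be sketched as follows: count every short pattern (k-mer) in the upstream sequences and divide by the count expected if nucleotides occurred independently at their background frequencies. This is a simplified illustration and not POCO's actual statistics; the sequences and the AT-rich background frequencies are assumptions made for the example.

```python
from collections import Counter
from math import prod


def count_kmers(seqs, k):
    """Count all overlapping k-mers across the sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def enrichment(seqs, k, background_freqs):
    """Observed / expected count for every k-mer found in the sequences."""
    counts = count_kmers(seqs, k)
    windows = sum(counts.values())
    scores = {}
    for kmer, observed in counts.items():
        # Chance of this k-mer if each base is drawn independently
        expected = windows * prod(background_freqs[b] for b in kmer)
        scores[kmer] = observed / expected
    return scores


# Hypothetical upstream sequences and assumed background frequencies
upstream = ["AAGCGTGA", "TTGCGTGC", "CAGCGTGT"]
background = {"A": 0.32, "T": 0.32, "C": 0.18, "G": 0.18}

kmer_counts = count_kmers(upstream, 5)
scores = enrichment(upstream, 5, background)
best = max(scores, key=scores.get)
print(best, kmer_counts[best])
```

In this toy input the pattern shared by all three sequences gets the highest score, partly because G and C are rarer in the assumed background.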
Clustering of found patterns • Puts together patterns that have something in common • Forms longer patterns • Allows also some differences
Checking if a given pattern is enriched in the sequences • POBO counts how many times the given pattern is present in the sequences • Compares to how many times the pattern is present in the "background" • The background differs between tools • POXO creates several lists of random sequences (same number and length as in the given sequence list)
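The comparison against random sequence lists can be sketched like this: count the pattern in the given sequences, then count it in many random lists of the same size drawn from a background pool, and see how often the random lists reach the observed count. This is a simplified illustration, not the tool's implementation; the sequences, the background pool, the number of random lists and the fixed random seed are all assumptions.

```python
import random


def count_pattern(seqs, pattern):
    """Count (possibly overlapping) occurrences of pattern in the sequences."""
    total = 0
    for s in seqs:
        i = s.find(pattern)
        while i != -1:
            total += 1
            i = s.find(pattern, i + 1)
    return total


def background_counts(background, pattern, n_seqs, n_sets, rng):
    """Pattern counts in n_sets random lists of n_seqs background sequences."""
    return [count_pattern(rng.choices(background, k=n_seqs), pattern)
            for _ in range(n_sets)]


def empirical_p(observed, bg_counts):
    """Fraction of random lists with at least as many hits as observed."""
    return sum(c >= observed for c in bg_counts) / len(bg_counts)


# Hypothetical data: given sequences and a background pool of equal length
given = ["AAGCGTGA", "TTGCGTGC", "CAGCGTGT"]
pool = ["AAAATTTT", "TTTTAAAA", "ACACACAC",
        "GCGTGAAA", "TGTGTGTG", "CCCCAAAA"]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
observed = count_pattern(given, "GCGTG")
bg = background_counts(pool, "GCGTG", n_seqs=len(given), n_sets=200, rng=rng)
p_value = empirical_p(observed, bg)
print(observed, p_value)
```

A small fraction means the pattern is found more often in the given sequences than in comparable random lists.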
Wormbase • Major publicly available database of information about C. elegans • Essential for worm researchers • Search e.g. info on a gene
Sequence: exons and introns • In the sequence view, exons are colored and introns are white
Information on gene product function, collected from publications (RNAi, microarray), with links to the publications
Information on how a mutant allele differs from the wild-type gene • In this example, a point mutation that leads to a stop codon in the middle of the protein
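How a single nucleotide change can cut a protein short can be shown with a toy translation. The coding sequence below is made up and is not the allele from the example; the codon table is the standard genetic code.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(cds):
    """Translate a coding sequence until the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


def point_mutation(seq, position, new_base):
    """Replace the base at a 0-based position."""
    return seq[:position] + new_base + seq[position + 1:]


wild_type = "ATGTGGCAAGGTTAA"                # hypothetical coding sequence
mutant = point_mutation(wild_type, 6, "T")   # CAA (Gln) becomes TAA (stop)

print(translate(wild_type))
print(translate(mutant))
```

The mutant protein ends after two amino acids instead of four, because the changed codon is read as a stop.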
Link to the Caenorhabditis Genetics Center (CGC), from where you can request worm strains
Microarray bioinformatics • Expression levels of tens of thousands of genes in one experiment • Quantification of intensities from an image • Data analysis • Finding annotations to genes - DAVID tool • Using existing microarray data - GEO
Image quantification • TIGR Spotfinder - freely downloadable software • Input: image files • Compose a grid so that each spot falls in its own square • Segmentation method decides which part is spot and which is background • Intensity value for each spot represents the amount of RNA in the original sample
Data analysis • Several commercial programs e.g. GeneSpring • Normalization • Sometimes one label (of 2-color experiment) is stronger than the other or some chips or chip areas have been hybridized more efficiently • Normalization makes these different labels, chips or chip areas comparable • Statistics • Which genes are significantly under- or over-expressed
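As a small illustration of normalization in a 2-color experiment, the sketch below converts red/green intensities to log2 ratios and subtracts the median, so that a label that is stronger overall no longer shifts every gene. Global median centering is only one simple option (intensity-dependent methods such as lowess are also common), and the intensity values are made up.

```python
import math
from statistics import median


def log_ratios(red, green):
    """log2(red / green) for each spot."""
    return [math.log2(r / g) for r, g in zip(red, green)]


def median_center(values):
    """Shift the values so that their median becomes zero."""
    m = median(values)
    return [v - m for v in values]


# Hypothetical spot intensities: the red label is about 2x stronger overall
red = [200, 400, 800, 1600, 400]
green = [100, 200, 400, 200, 800]

raw = log_ratios(red, green)
normalized = median_center(raw)
print(raw)
print(normalized)
```

Before centering, most genes look two-fold up only because of the stronger label; afterwards only the last two spots stand out, one up and one down.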
Finding generally overrepresented functions in your gene list • DAVID annotation tool • Compares which annotations are over-represented in your gene list • Good for showing general trends in a large gene list
DAVID Functional Annotation Tool • By NIAID, National Institutes of Health • Finds significantly enriched annotations to gene products in your list: • Gene Ontology terms • Identifiers • Protein domains • Pathways • etc.
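What "significantly enriched" means can be sketched with a hypergeometric test: given how common an annotation is among all genes, how likely is it to see at least this many annotated genes in a list of this size by chance? This is a generic illustration and not necessarily the exact statistic DAVID reports; all gene counts below are made up.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, list_size, list_annotated):
    """P(at least list_annotated annotated genes in a random gene list).

    population:     number of genes in the background (e.g. on the chip)
    annotated:      background genes carrying the annotation
    list_size:      number of genes in your list
    list_annotated: genes in your list carrying the annotation
    """
    upper = min(annotated, list_size)
    total = comb(population, list_size)
    tail = sum(comb(annotated, k) * comb(population - annotated, list_size - k)
               for k in range(list_annotated, upper + 1))
    return tail / total


# Hypothetical numbers: 1 % of 20000 genes carry the term,
# but 10 of the 100 genes in the list do
p_enriched = hypergeom_enrichment_p(20000, 200, 100, 10)
print(p_enriched)
```

In practice many annotations are tested at once, so the resulting p-values are usually corrected for multiple testing.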
Microarray data repository - GEO • Gene Expression Omnibus, NCBI • MIAME: the Minimum Information About a Microarray Experiment that should be provided • Submit your own microarray data to GEO • Browse, search and retrieve microarray data • http://www.ncbi.nlm.nih.gov/geo/
Summary: Bioinformatics • + More and larger data sets available • + -omics level approach • + Performs functions that would be extremely tedious to do manually • + Tools are easy to use • - Information is sometimes only a prediction based on similarity with other gene products -> wet lab experiments still needed