
creativecommons/licenses/by-sa/2.0/

http://creativecommons.org/licenses/by-sa/2.0/. Classification Schemes and Databases. Prof:Rui Alves ralves@cmb.udl.es 973702406 Dept Ciencies Mediques Basiques, 1st Floor, Room 1.08 Website of the Course: http://web.udl.es/usuaris/pg193845/Courses/Bioinformatics_2007/


Presentation Transcript


  1. http://creativecommons.org/licenses/by-sa/2.0/

  2. Classification Schemes and Databases • Prof.: Rui Alves • ralves@cmb.udl.es • 973702406 • Dept Ciencies Mediques Basiques, 1st Floor, Room 1.08 • Website of the Course: http://web.udl.es/usuaris/pg193845/Courses/Bioinformatics_2007/ • Course: http://10.100.14.36/Student_Server/

  3. What obvious problem do large-scale sets create? • Imagine 6 000 000 000 human beings born within the last 130 years and still alive. • By and large, a majority of them have had an education. • How do you do that? Knowledge

  4. Two Problems • How to organize knowledge?

  5. Two Problems • Organize people in order for them to learn effectively Not effective

  6. Schools and books are the servers and databases for educating people • Figure labels: Database, Server, Users, New Server: You

  7. What obvious problem do large-scale approaches create in molecular biology? • Too much data! • How do we organize it? • Laundry list? • Order in the chromosome? • Similarity of sequence? • Genome 1: Gene 1, Gene 2, … • Genome 2: Gene 1, Gene 2, … • Genome n: Gene 1, Gene 2, …

  8. What obvious problem do large-scale approaches create in molecular biology? • Too much data! • We need functional classification schemes to help organize and make sense of the data • Ideally, these schemes should be computer friendly • Ontologies • Databases help in storage, manipulation, and mining

  9. Biological Classification Schemes • What is an Ontology (in the Biological sense)? A set of definitions of controlled vocabularies with hierarchical relationships to one another, that can easily be dealt with by computers

  10. What are Bio-Ontologies? Biological Ontologies (Bio-ontologies) can be defined as a complex hierarchical structure in which biological concepts are described by their meanings (definitions) and relationships to each other. There are many Bio-Ontologies available and in use by databases. The Plant Ontology, along with other ontologies such as the Gene Ontology, is included in the open source Open Biological Ontologies project at Sourceforge. http://obofoundry.org/

  11. The Gene Ontology The best-known example of a bio-ontology is the Gene Ontology (GO; http://www.geneontology.org), which describes three biological domains: cellular component (where the gene product is located), molecular function (what the gene product does) and biological process (the cellular, developmental or physiological events the gene product is involved in). GO terms are used to describe gene products. Because these descriptions are independent of species-specific nomenclature and uniformly applied, it is possible to make meaningful and efficient comparisons of genes across diverse taxa.

  12. Three “Super Categories” of GO • Molecular Function (what) • Tasks performed on the molecular level • Biological Process (why) • How it pertains to the organism • Cellular Component (where) • Its location

  13. Example • Gene Name: BRCA1 • Molecular Function: protein binding • Biological Process: DNA Replication and Chromosome Cycle • Cellular Component: nucleus

  14. Structure of GO • How to define the relationship between concepts? • Example: How to relate the terms “cell”, “nucleus”, and “membrane”? • Directed Acyclic Graph • A term may have multiple parents in the graph. • Every annotation to a term must also hold true for all of its parents.
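
As a rough illustration of the graph structure described on this slide, the sketch below stores each term with its direct parents and walks upward to collect all ancestors. The term names and edges are a toy example chosen for this sketch, not the real GO hierarchy, and `ancestors` and `propagate` are hypothetical helper names.

```python
# Toy fragment of a GO-like graph: each term maps to its direct parents.
# Term names and edges are illustrative, not copied from the real ontology.
PARENTS = {
    "cell": [],
    "membrane": ["cell"],
    "nucleus": ["cell"],
    "nuclear membrane": ["nucleus", "membrane"],  # two parents: a DAG, not a tree
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


def propagate(direct_terms, parents=PARENTS):
    """Expand direct annotations so that they also cover all parent terms."""
    expanded = set(direct_terms)
    for term in direct_terms:
        expanded |= ancestors(term, parents)
    return expanded


if __name__ == "__main__":
    print(sorted(ancestors("nuclear membrane")))
    print(sorted(propagate({"nucleus"})))
```

Storing only parent links keeps the structure small; a gene annotated to a specific term is then treated as annotated to every ancestor as well.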

  15. Directed Acyclic Graph Representation

  16. Tree Representation

  17. How is GO Annotated? • Manual • Humans sifting through primary literature • Electronic • Assign GO Terms using already existing information in databases.

  18. Evidence Codes for GO Annotation • IEA: Inferred from Electronic Annotation • ISS: Inferred from Sequence Similarity • IEP: Inferred from Expression Pattern • IMP: Inferred from Mutant Phenotype • IGI: Inferred from Genetic Interaction • IPI: Inferred from Physical Interaction • IDA: Inferred from Direct Assay • RCA: Inferred from Reviewed Computational Analysis • TAS: Traceable Author Statement • NAS: Non-traceable Author Statement • IC: Inferred by Curator • ND: No biological Data available • Detailed info available from: http://www.geneontology.org/doc/GO.evidence.html
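
Evidence codes are typically used to filter annotations before an analysis, for example by dropping purely electronic ones. The sketch below assumes annotations are stored as (gene, term, code) triples; the gene names and the `filter_by_evidence` helper are hypothetical.

```python
# The code IEA comes from the slide; treating it as the default exclusion
# set is a choice made for this sketch.
ELECTRONIC_CODES = {"IEA"}


def filter_by_evidence(annotations, excluded=ELECTRONIC_CODES):
    """Keep (gene, term, code) triples whose evidence code is not excluded."""
    return [a for a in annotations if a[2] not in excluded]


# Hypothetical annotations: genes, terms and codes are made up for illustration.
EXAMPLE = [
    ("geneA", "protein binding", "IPI"),
    ("geneA", "nucleus", "IEA"),
    ("geneB", "nucleus", "IDA"),
]

if __name__ == "__main__":
    for gene, term, code in filter_by_evidence(EXAMPLE):
        print(gene, term, code)
```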

  19. How to use GO in data analysis • Simple Queries • Find over-represented GO categories in a list of genes • Search Biological “Themes” • Binning • Obtain a broad view of the distribution of major GO terms in a list of genes. • Clustering Genes on GO terms • Group together functionally related genes based on GO terms.

  20. Finding Biological Themes • Question: Which GO term is enriched or depleted in the list of genes that shows a significant increase (or decrease) in expression? • Calculate the frequency of occurrence of a GO term in the data set. Test it against the frequency of occurrence of the term in the genome.
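
One common way to test the list frequency against the genome frequency is a one-sided Fisher exact (hypergeometric) test. The sketch below is a minimal version using only the standard library; the function name, argument names, and the counts in the usage example are assumptions for illustration, not values from the slides.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher exact (hypergeometric) p-value for over-representation.

    k: genes in the list annotated with the term
    n: size of the gene list
    K: genes in the genome annotated with the term
    N: number of annotated genes in the genome
    Returns P(X >= k) when n genes are drawn from N without replacement.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


if __name__ == "__main__":
    # Made-up counts: 3 of 5 list genes carry the term, versus 5 of 20 in the genome.
    k, n, K, N = 3, 5, 5, 20
    print("list frequency:", k / n)
    print("genome frequency:", K / N)
    print("p-value:", enrichment_p_value(k, n, K, N))
```

When many GO terms are tested at once, the p-values need a multiple-testing correction such as Bonferroni before they are interpreted.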

  21. Comparison of Statistical Methods and Options EASE Fisher Exact Bonferroni TAB/RANK ALL OntoExpress Fisher Exact, Chi Square, Binomial TAB/RANK ALL

  22. Binning • Goal: to obtain an overview of the distribution of GO terms. • Problem: the high granularity of GO annotation. • Solution: GO Slims: reduced versions of GO that contain only a subset of GO terms that are potentially relevant. • GO Slims allow us to separate a list of genes into broad categories. • Generic GO Slims available • http://www.geneontology.org/GO.slims.shtml • Customized GO Slims possible using DAG-Edit.
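
The binning idea can be sketched as follows: each fine-grained term is mapped to the slim terms found among its ancestors, and genes are then counted per slim term. The mini-ontology, the slim set, and the helper names are hypothetical, and this version maps a term to every slim ancestor, which is a simplification of what a tool such as map2slim does.

```python
from collections import Counter

# Hypothetical mini-ontology: term -> direct parents.
PARENTS = {
    "biological process": [],
    "metabolism": ["biological process"],
    "DNA metabolism": ["metabolism"],
    "DNA replication": ["DNA metabolism"],
    "cell cycle": ["biological process"],
    "chromosome cycle": ["cell cycle"],
}

# Hypothetical slim: the broad terms we want to report.
SLIM = {"metabolism", "cell cycle"}


def slim_terms(term):
    """Return the slim terms that are equal to or ancestors of `term`."""
    hits, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in SLIM:
            hits.add(current)
        stack.extend(PARENTS[current])
    return hits


def bin_genes(annotations):
    """Count genes per slim term; a gene is counted once per slim term."""
    counts = Counter()
    for terms in annotations.values():
        bins = set()
        for term in terms:
            bins |= slim_terms(term)
        counts.update(bins)
    return counts


if __name__ == "__main__":
    genes = {
        "geneA": {"DNA replication", "chromosome cycle"},
        "geneB": {"DNA metabolism"},
    }
    print(bin_genes(genes))
```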

  23. Binning

  24. Clustering • Goal: Group together functionally related genes based on GO terms. • For every pair of genes in the gene set, calculate an annotation-based distance that takes into account all GO terms common to the pair and the terms specific to each gene. Produce a distance matrix. • Run the matrix through a clustering algorithm
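
A minimal sketch of these two steps is given below. The slide does not name the distance or the clustering algorithm, so the Czekanowski-Dice style distance and the threshold-based single-linkage grouping used here are assumptions, as are all function names and the example annotations.

```python
from itertools import combinations


def dice_distance(terms_a, terms_b):
    """Terms specific to one gene, divided by the summed annotation counts."""
    total = len(terms_a) + len(terms_b)
    if total == 0:
        return 0.0
    return len(terms_a ^ terms_b) / total


def distance_matrix(annotations):
    """Return the sorted gene list and a nested dict of pairwise distances."""
    genes = sorted(annotations)
    matrix = {g: {h: 0.0 for h in genes} for g in genes}
    for g, h in combinations(genes, 2):
        d = dice_distance(annotations[g], annotations[h])
        matrix[g][h] = matrix[h][g] = d
    return genes, matrix


def single_linkage(genes, matrix, threshold):
    """Merge clusters while any cross-cluster pair is within the threshold."""
    clusters = [{g} for g in genes]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            if any(matrix[g][h] <= threshold
                   for g in clusters[a] for h in clusters[b]):
                clusters[a] |= clusters[b]
                del clusters[b]
                merged = True
                break  # indices changed, restart the scan
    return clusters


if __name__ == "__main__":
    example = {
        "g1": {"t1", "t2"},
        "g2": {"t1", "t2", "t3"},
        "g3": {"t9"},
    }
    genes, matrix = distance_matrix(example)
    print(single_linkage(genes, matrix, threshold=0.5))
```

Genes with identical annotation sets get distance 0, and genes with no shared terms get distance 1, so the threshold controls how much overlap is needed to place two genes in the same group.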

  25. GO Tools • NetAffx: Get GO Annotation • AmiGO: Browser and Simple Queries • GOTermMapper: Binning (GO Slim) • GOToolBox: • Finding over-represented GO categories • Clustering based on similar GO terms • Query for Genes with Similar Function.

  26. AmiGO • Query by Gene Name/ID or Query GO term.

  27. GOTermMapper • http://go.princeton.edu/cgi-bin/GOTermMapper • Based on the map2slim.pl Perl script. Ability to incorporate user-made GO Slim files.

  28. GOToolBox http://139.124.62.227/GOToolBox/index.php?page=home Schema of GOToolBox functions

  29. GOToolBox DataSet Creation • 1. Select Species • 2. Select Ontology System • 3. Input Gene List • 4. Select Mode (All, Level, Slim) • 5. Select Reference • 6. Select Evidence Filter

  30. GOToolBox’s GoStat Input Output

  31. GOToolBox’s GoProxy Input Output

  32. GOToolBox’s GoFamily Input Output

  33. Further Resources • www.geneontology.org • Contains presentations, tutorials, links to GO tools. STOP the PRESS and go to GO

  34. GO is not very good • EC numbers • Protein classification schemes • TF classification schemes • Transport proteins classification schemes • Etc.

  35. The EC number database

  36. The BRENDA database

  37. The TF classification database

  38. The signal transduction classification database

  39. The transport proteins classification database

  40. A general protein classification database

  41. What obvious problem do large-scale approaches create? • Too much data! • Classification schemes help organize the data • Ontologies • Databases help in storage, manipulation, and mining

  42. What is a Database? • A database is a collection of data organized in such a way that it is easy to store in a computer and to mine with appropriate software • A database is usually organized as a set of tables in which information about objects is stored • The tables are related to each other in different ways.

  43. How do we store the information? • Use database technology • Physically, a database is a storage space for content / information (data) within a server

  44. What does database technology allow? • Making information useful • Avoiding “accidental disorganisation” • Making information easily accessible and integrated with the rest of our work

  45. Metadata & Data Table Imagine you have a set of organisms for which you have genome sequences and you want to store the information. What do you do?? Organism

  46. Metadata & Data Table (cont.) Gene

  47. Relationships • Used to connect tables • Field(s) that have the same value in the related tables • Organism.Accession = Gene.OAccession • Organism.Accession • Unique (THERE MUST ALWAYS BE A UNIQUE ID) • Primary key • Gene.OAccession • Not unique • Foreign key (called a secondary key in these slides)
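
The relationship on this slide can be tried out with Python's built-in `sqlite3` module. In the sketch below, `Organism.Accession` is the primary key and `Gene.OAccession` plays the role usually called a foreign key. Only those two field names come from the slide; the other columns, the sample rows, and the helper function names are assumptions made for illustration.

```python
import sqlite3


def build_database():
    """Create an in-memory database with related Organism and Gene tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE Organism ("
        " Accession TEXT PRIMARY KEY,"   # unique ID for each organism
        " Name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE Gene ("
        " GeneID TEXT PRIMARY KEY,"      # unique ID for each gene
        " Name TEXT NOT NULL,"
        " OAccession TEXT NOT NULL REFERENCES Organism(Accession))"  # not unique
    )
    conn.executemany(
        "INSERT INTO Organism VALUES (?, ?)",
        [("ORG1", "Organism one"), ("ORG2", "Organism two")],
    )
    conn.executemany(
        "INSERT INTO Gene VALUES (?, ?, ?)",
        [("G1", "gene1", "ORG1"), ("G2", "gene2", "ORG1"), ("G3", "gene1", "ORG2")],
    )
    conn.commit()
    return conn


def genes_with_organism(conn):
    """Join the two tables on Organism.Accession = Gene.OAccession."""
    query = (
        "SELECT Organism.Name, Gene.Name"
        " FROM Organism JOIN Gene ON Organism.Accession = Gene.OAccession"
        " ORDER BY Gene.GeneID"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for organism, gene in genes_with_organism(build_database()):
        print(organism, gene)
```

Because the organism name is stored once and referenced by accession, renaming an organism means changing a single row rather than every gene record.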

  48. Relationships

  49. S(tructured) Q(uery) L(anguage) • ANSI (American National Standards Institute) standard computer language for accessing and manipulating database systems. • SQL statements are used to retrieve and update data in a database. • Includes: • Data Manipulation Language (DML) • Data Definition Language (DDL)
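
To make the DDL/DML split concrete, the sketch below runs a few statements of each kind through Python's built-in `sqlite3` module. The table layout and values are hypothetical, and SELECT is grouped with DML here, as is common, although some texts treat queries as a separate category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: statements that define or change the structure of the database.
conn.execute("CREATE TABLE Organism (Accession TEXT PRIMARY KEY, Name TEXT)")
conn.execute("ALTER TABLE Organism ADD COLUMN Taxon TEXT")

# DML: statements that add, change, read, or remove the data itself.
conn.execute("INSERT INTO Organism (Accession, Name) VALUES (?, ?)",
             ("ORG1", "unnamed"))
conn.execute("UPDATE Organism SET Name = ? WHERE Accession = ?",
             ("Organism one", "ORG1"))
name = conn.execute("SELECT Name FROM Organism WHERE Accession = ?",
                    ("ORG1",)).fetchone()[0]
conn.execute("DELETE FROM Organism WHERE Accession = ?", ("ORG1",))
remaining = conn.execute("SELECT COUNT(*) FROM Organism").fetchone()[0]

# Column names after the DDL statements above.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Organism)")]

if __name__ == "__main__":
    print(name, remaining, columns)
```

The `?` placeholders let the driver pass values separately from the SQL text, which avoids quoting problems when values come from user input.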

  50. Web Databases • Data is accessible through the Internet • Have different underlying database models • Example: biological databases • Molecular data: NCBI, Swiss-Prot, PDB, GO • Protein interaction: DIP, BIND • Organism specific: Mouse, Worm, Yeast • Literature: PubMed • Disease
