http://creativecommons.org/licenses/by-sa/2.0/. Classification Schemes and Databases. Prof:Rui Alves ralves@cmb.udl.es 973702406 Dept Ciencies Mediques Basiques, 1st Floor, Room 1.08 Website of the Course: http://web.udl.es/usuaris/pg193845/Courses/Bioinformatics_2007/
Classification Schemes and Databases • Prof: Rui Alves • ralves@cmb.udl.es • 973702406 • Dept Ciencies Mediques Basiques, 1st Floor, Room 1.08 • Website of the Course: http://web.udl.es/usuaris/pg193845/Courses/Bioinformatics_2007/ • Course: http://10.100.14.36/Student_Server/
What obvious problem do large-scale sets create? • Imagine 6 000 000 000 human beings born within the last 130 years and still alive. • By and large, a majority of them have had an education. • How do you do that? Knowledge
Two Problems • How to organize knowledge?
Two Problems • Organize people in order for them to learn effectively Not effective
Schools and books are the servers and databases for educating people [Figure labels: Database, Server, Users; New Server: You]
What obvious problem do large-scale approaches create in molecular biology? • Too much data! • How do we organize it? • Laundry list? • Order in the chromosome? • Similarity of sequence? [Figure: Genome 1: Gene 1, Gene 2, … Genome 2: Gene 1, Gene 2, … Genome n: Gene 1, Gene 2, …]
What obvious problem do large-scale approaches create in molecular biology? • Too much data! • We need functional classification schemes to help organize and make sense of the data • Ideally, these schemes should be computer-friendly • Ontologies • Databases help in storage, manipulation, and mining
Biological Classification Schemes • What is an Ontology (in the Biological sense)? A set of controlled vocabularies whose terms have definitions and hierarchical relationships to one another, and that computers can process easily
What are Bio-Ontologies? Biological Ontologies (Bio-ontologies) can be defined as complex hierarchical structures in which biological concepts are described by their meanings (definitions) and their relationships to each other. There are many Bio-Ontologies available and in use by databases. The Plant Ontology, along with other ontologies such as the Gene Ontology, is included in the open source Open Biological Ontologies project at Sourceforge. http://obofoundry.org/
The Gene Ontology The best-known example of a bio-ontology is the Gene Ontology (GO; http://www.geneontology.org), which describes three biological domains: cellular component (where the gene product is located), molecular function (what the gene product does) and biological process (the cellular, developmental or physiological events the gene product is involved in). GO terms are used to describe gene products. Because these descriptions are independent of species-specific nomenclature and uniformly applied, it is possible to make meaningful and efficient comparisons of genes across diverse taxa.
Three “Super Categories” of GO • Molecular Function (what) • Tasks performed on the molecular level • Biological Process (why) • How it pertains to the organism • Cellular Component (where) • Its location
Example • Gene Name: BRCA1 • Molecular Function: protein binding • Biological Process: DNA Replication and Chromosome Cycle • Cellular Component: nucleus
Structure of GO • How to define the relationship between concepts? • Example: How to relate the terms “cell”, “nucleus”, and “membrane”? • Directed Acyclic Graph (DAG) • A term may have multiple parents in the graph. • Whatever holds true for a selected term must also hold true for all of its parents (the “true path” rule).
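As a rough illustration of the graph structure described on this slide, the sketch below stores a toy ontology as a mapping from each term to its direct parents and walks upward to collect all ancestors. The four terms and their parent links are hypothetical and chosen only to show a term with two parents; they are not real GO identifiers or a copy of the real GO graph.

```python
# Toy GO-like graph (hypothetical): each term maps to its direct parents.
PARENTS = {
    "cell": [],
    "membrane": ["cell"],
    "nucleus": ["cell"],
    "nuclear membrane": ["nucleus", "membrane"],  # a term with two parents
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


def is_acyclic(parents=PARENTS):
    """The graph is acyclic if no term is its own ancestor."""
    return all(term not in ancestors(term, parents) for term in parents)


print(sorted(ancestors("nuclear membrane")))  # ['cell', 'membrane', 'nucleus']
print(is_acyclic())                           # True
```

Under this representation, a gene product annotated to "nuclear membrane" is implicitly annotated to every term returned by `ancestors`, which is why the graph must not contain cycles.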
How is GO Annotated? • Manual • Humans sifting through primary literature • Electronic • Assign GO Terms using already existing information in databases.
Evidence Codes for GO Annotation • IEA: Inferred from Electronic Annotation • ISS: Inferred from Sequence Similarity • IEP: Inferred from Expression Pattern • IMP: Inferred from Mutant Phenotype • IGI: Inferred from Genetic Interaction • IPI: Inferred from Physical Interaction • IDA: Inferred from Direct Assay • RCA: Inferred from Reviewed Computational Analysis • TAS: Traceable Author Statement • NAS: Non-traceable Author Statement • IC: Inferred by Curator • ND: No biological Data available • Detailed info available from: http://www.geneontology.org/doc/GO.evidence.html
How to use GO in data analysis • Simple Queries • Find over-represented GO categories in a list of genes • Search Biological “Themes” • Binning • Obtain a broad view of the distribution of major GO terms in a list of genes. • Clustering Genes on GO terms • Group together functionally related genes based on GO terms.
Finding Biological Themes • Question: Which GO terms are enriched or depleted in the list of genes that show a significant increase (or decrease) in expression? • Calculate the frequency of occurrence of a GO term in the data set. Test it against the frequency of occurrence of the same term in the genome.
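One common way to run this test is a one-sided Fisher exact test, which for over-representation reduces to the upper tail of the hypergeometric distribution. The sketch below is a minimal standard-library version; the function names and the example counts are illustrative assumptions, not output from any of the tools named in this deck.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric p-value for over-representation of a GO term.

    k: genes in the list annotated with the term
    n: number of genes in the list
    K: genes in the genome annotated with the term
    N: number of annotated genes in the genome
    """
    upper = min(n, K)
    total = comb(N, n)
    # Probability of drawing k or more annotated genes by chance.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def bonferroni(p_value, n_tests):
    """Bonferroni correction: multiply by the number of terms tested, cap at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical counts: 5 of 50 regulated genes carry a term that
# annotates 100 of 6000 genes in the genome.
p = enrichment_p_value(k=5, n=50, K=100, N=6000)
print(p, bonferroni(p, n_tests=200))
```

Depletion can be tested the same way by summing the lower tail (0 to k) instead of the upper one.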
Comparison of Statistical Methods and Options • EASE: Fisher Exact; Bonferroni; TAB/RANK; ALL • OntoExpress: Fisher Exact, Chi Square, Binomial; TAB/RANK; ALL
Binning • Goal: to obtain an overview of the distribution of GO terms. • Problem: The high granularity of GO annotation. • Solution: GO Slims, reduced versions of GO that contain only a subset of GO terms that are potentially relevant. • GO Slims allow us to separate a list of genes into broad categories. • Generic GO Slims available • http://www.geneontology.org/GO.slims.shtml • Customized GO Slims are possible using DAG-Edit.
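The binning idea can be sketched as follows: each annotated term is walked up the graph until it reaches terms that belong to the slim, and genes are then counted per slim category. The ontology, the slim, and the gene annotations below are all hypothetical toy data, and the mapping is a simplification (it keeps every slim ancestor rather than only the most specific ones).

```python
from collections import Counter

# Hypothetical toy ontology: term -> direct parents.
PARENTS = {
    "biological process": [],
    "metabolism": ["biological process"],
    "transport": ["biological process"],
    "DNA replication": ["metabolism"],
    "glycolysis": ["metabolism"],
    "ion transport": ["transport"],
}
# Hypothetical slim: the broad categories we want to report.
SLIM = {"metabolism", "transport"}

# Hypothetical gene annotations.
ANNOTATIONS = {
    "geneA": ["DNA replication"],
    "geneB": ["glycolysis", "ion transport"],
    "geneC": ["ion transport"],
}


def slim_terms(term):
    """Return slim terms that are the term itself or one of its ancestors."""
    hits, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in SLIM:
            hits.add(current)
        stack.extend(PARENTS[current])
    return hits


def bin_genes(annotations):
    """Count how many genes fall into each slim category."""
    counts = Counter()
    for terms in annotations.values():
        bins = set()
        for term in terms:
            bins |= slim_terms(term)
        counts.update(bins)  # each gene counts once per category
    return counts


print(bin_genes(ANNOTATIONS))
```

Note that a gene can land in more than one bin, so the category counts may add up to more than the number of genes.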
Clustering • Goal: Group together functionally related genes based on GO terms. • For every pair of genes in the gene set, calculate an annotation-based distance that takes into account the GO terms common to the pair and the terms specific to each gene. This produces a distance matrix. • Run the matrix through a clustering algorithm
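One distance that fits this description is the Czekanowski-Dice distance, which divides the number of terms specific to either gene by the sum of the union and intersection sizes. Whether this is the exact formula intended on the slide is an assumption. The sketch below builds the distance matrix for hypothetical annotations and then groups genes with a very simple single-linkage cut, standing in for whichever clustering algorithm is actually used.

```python
def cd_distance(terms_a, terms_b):
    """Czekanowski-Dice distance between two sets of GO terms (0 = identical)."""
    a, b = set(terms_a), set(terms_b)
    denom = len(a | b) + len(a & b)
    if denom == 0:
        return 0.0
    return len(a ^ b) / denom


def distance_matrix(annotations):
    """Return the sorted gene names and their pairwise distance matrix."""
    genes = sorted(annotations)
    matrix = [[cd_distance(annotations[g], annotations[h]) for h in genes]
              for g in genes]
    return genes, matrix


def single_linkage_groups(genes, matrix, threshold):
    """Group genes linked by chains of pairwise distances <= threshold."""
    groups = []
    unassigned = set(range(len(genes)))
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        group, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            close = [j for j in unassigned if matrix[i][j] <= threshold]
            for j in close:
                unassigned.remove(j)
                group.append(j)
                frontier.append(j)
        groups.append(sorted(genes[i] for i in group))
    return groups


# Hypothetical annotations (term names are placeholders).
ANNOTATIONS = {
    "g1": {"t1", "t2"},
    "g2": {"t2", "t3"},
    "g3": {"t9"},
}
genes, matrix = distance_matrix(ANNOTATIONS)
print(single_linkage_groups(genes, matrix, threshold=0.5))  # [['g1', 'g2'], ['g3']]
```

In practice the matrix would be passed to a full hierarchical clustering routine; the threshold cut here only shows how the distances translate into groups.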
GO Tools • NetAffx: Get GO Annotation • AmiGO: Browser and Simple Queries • GoTermMapper: Binning (GO Slim) • GOToolBox: • Finding over-represented GO categories • Clustering based on similar GO terms • Query for Genes with Similar Function.
AmiGO • Query by Gene Name/ID or Query GO term.
GoTermMapper • http://go.princeton.edu/cgi-bin/GOTermMapper • Based on the map2slim.pl Perl script. Able to incorporate user-made GO Slim files.
GOToolBox http://139.124.62.227/GOToolBox/index.php?page=home Schema of GOToolBox functions
GOToolBox DataSet Creation • 1. Select Species • 2. Select Ontology System • 3. Input Gene List • 4. Select Mode (All, Level, Slim) • 5. Select reference • 6. Select Evidence Filter
GOToolBox’s GoStat Input Output
GOToolBox’s GoProxy Input Output
GOToolBox’s GoFamily Input Output
Further Resources • www.geneontology.org • Contains presentations, tutorials, and links to GO tools. STOP the PRESS and go to GO
GO is not very good • EC numbers • Protein classification schemes • TF classification schemes • Transport proteins classification schemes • Etc.
What obvious problem do large-scale approaches create? • Too much data! • Classification schemes help organize the data • Ontologies • Databases help in storage, manipulation, and mining
What is a Database? • A database is a collection of data organized in such a way that it is easy to store in a computer and to mine with appropriate software • A database is usually organized as a set of tables, each storing information about one kind of object • The tables are related to each other in different ways.
How do we store the information? • Use database technology • Physically, a database is a storage space for content / information (data) within a server
What does database technology allow? • Making information useful • Avoiding “accidental disorganisation” • Making information easily accessible and integrated with the rest of our work
Metadata & Data Table Imagine you have a set of organisms for which you have genome sequences and you want to store the information. What do you do?? Organism
Relationships • Used to connect tables • Field(s) that have the same value in the related tables • Organism.Accession = Gene.OAccession • Organism.Accession • Unique: THERE MUST ALWAYS BE A UNIQUE ID • Primary key • Gene.OAccession • Not unique • Foreign key (refers to Organism.Accession)
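The relationship between the two tables can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the table and key names follow the slide, while the other columns, the accession values ('ORG1', 'ORG2') and the example rows are assumptions added for illustration. Gene.OAccession plays the role that SQL calls a foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Organism (
    Accession TEXT PRIMARY KEY,          -- unique ID, one row per organism
    Name      TEXT NOT NULL
);
CREATE TABLE Gene (
    GeneId     INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL,
    OAccession TEXT NOT NULL REFERENCES Organism(Accession)  -- not unique
);
INSERT INTO Organism VALUES ('ORG1', 'Saccharomyces cerevisiae');
INSERT INTO Organism VALUES ('ORG2', 'Escherichia coli');
INSERT INTO Gene (Name, OAccession) VALUES ('CDC28', 'ORG1');
INSERT INTO Gene (Name, OAccession) VALUES ('PHO85', 'ORG1');
INSERT INTO Gene (Name, OAccession) VALUES ('lacZ', 'ORG2');
""")

# Connect the tables through the shared field.
rows = conn.execute("""
    SELECT Organism.Name, Gene.Name
    FROM Organism JOIN Gene ON Organism.Accession = Gene.OAccession
    ORDER BY Organism.Name, Gene.Name
""").fetchall()
print(rows)


def try_insert(sql):
    """Report whether the database accepts or rejects a statement."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


# A repeated primary key and a gene pointing at a missing organism both fail.
duplicate_key = try_insert("INSERT INTO Organism VALUES ('ORG1', 'Duplicate')")
orphan_gene = try_insert("INSERT INTO Gene (Name, OAccession) VALUES ('x', 'ORG9')")
print(duplicate_key, orphan_gene)
```

Several genes share the same OAccession value, which is why that field cannot serve as a unique ID for the Gene table.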
S(tructured) Q(uery) L(anguage) • ANSI (American National Standards Institute) standard computer language for accessing and manipulating database systems. • SQL statements are used to retrieve and update data in a database. • Includes: • Data Manipulation Language (DML) • Data Definition Language (DDL)
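To make the DDL/DML split concrete, the sketch below runs a few statements of each kind through Python's built-in sqlite3 module. The table layout and the rows are illustrative assumptions, and SQLite's dialect differs in places from the ANSI standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: statements that define or change the structure of the database.
conn.execute(
    "CREATE TABLE Organism (Accession TEXT PRIMARY KEY, Name TEXT, GenomeSizeMb REAL)"
)

# DML: statements that insert, update, delete, or retrieve data.
conn.executemany(
    "INSERT INTO Organism VALUES (?, ?, ?)",
    [("ORG1", "yeast", 12.1), ("ORG2", "E. coli", 4.6)],
)
conn.execute(
    "UPDATE Organism SET Name = ? WHERE Accession = ?",
    ("Saccharomyces cerevisiae", "ORG1"),
)
conn.execute("DELETE FROM Organism WHERE Accession = ?", ("ORG2",))
remaining = conn.execute("SELECT Accession, Name FROM Organism").fetchall()
print(remaining)

# DDL again: add a column to the existing table.
conn.execute("ALTER TABLE Organism ADD COLUMN Taxon TEXT")
columns = [row[1] for row in conn.execute("PRAGMA table_info(Organism)")]
print(columns)
```

The `?` placeholders let the database driver handle quoting, which is generally safer than pasting values into the SQL string.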
Web Databases • Data is accessible through the Internet • Have different underlying database models • Example: biological databases • Molecular data: NCBI, Swiss-Prot, PDB, GO • Protein interaction: DIP, BIND • Organism specific: Mouse, Worm, Yeast • Literature: PubMed • Disease