MN-B-C 2 Analysis of High Dimensional (-omics) Data
Week 5: Proteomics 3
Kay Hofmann – Protein Evolution Group
http://www.genetik.uni-koeln.de/groups/Hofmann
Mapping gene/protein sets to biological groups/pathways
Pathway-centric analysis:
• Consider one single pathway at a time
• Map genes to pathway components
• Visualize experimental data in pathway diagram
Gene set centric analysis:
• Consider a group of genes with an interesting experimental finding
• Find all pathway associations
• Statistical test for pathways that are over-represented in the group
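The "find all pathway associations" step of the gene set centric route can be sketched as a simple tally. This is a minimal illustration only: the gene-to-category assignments in `ANNOTATION` are hypothetical toy values, not real annotation data, and the function name is an assumption made for this sketch.

```python
from collections import Counter

# Hypothetical toy annotation table (gene -> categories); NOT real annotation data.
ANNOTATION = {
    "Fas": {"Fas pathway", "Plasma membrane protein"},
    "FADD": {"Fas pathway"},
    "Casp8": {"Fas pathway", "Apoptosis inducers"},
    "Casp3": {"Apoptosis inducers"},
}

def pathway_associations(genes, annotation=ANNOTATION):
    """Count, for every category, how many genes of the group carry it."""
    counts = Counter()
    for gene in genes:
        # Genes without any annotation simply contribute nothing.
        counts.update(annotation.get(gene, ()))
    return counts

# "XYZ1" stands in for a gene that has no annotation at all.
hits = pathway_associations(["Fas", "FADD", "Casp8", "XYZ1"])
for category, n in hits.most_common():
    print(category, n)
```

The resulting counts are the raw input for the statistical test: each category's count in the group is compared with its frequency among all genes.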
Example of a known 'biological pathway'
[Pathway diagram with the components: Fas-L, Fas, FLIP, FADD, Casp8, Diablo, APAF1, cIAP, Casp9, Casp3]
• Classical network/pathway representation
• Implies upstream/downstream ordering
Advantages:
• Rich information
• Familiar to biologists
• Easy to interpret
Disadvantages:
• Not always known
• Difficult in multi-experiment context
• Statistical evaluation problematic
• Often not regulated as a whole
Mainly used for pathway-centric analysis
Example of pathway-centric analysis
[Pathway diagram: red/green color indicates up/down-regulation]
Example of pre-defined gene/protein category
[Diagram: the same proteins shown as an unordered set: Fas-L, FADD, FLIP, Casp8, Diablo, Fas, APAF1, Casp3, cIAP, Casp9]
• If statistics is more important than graphics:
• Use of 'categorical' data
• Examples:
  • Fas pathway
  • Apoptosis inducers
  • SNARE complex
  • p53 target
  • Chromosome 12q13.1
  • Plasma membrane protein
  • NK-cell marker
Advantages:
• Suitable for non-network data
• More amenable to statistics
• Many data sources available
Disadvantages:
• Less information
• Less intuitive
• More tedious interpretation
Mainly used for gene set centric analysis
Fisher's exact test for gene set enrichment
The group of 100 top-regulated proteins contains 20 cMyc targets. Is this significant?
There are 25 000 proteins in total, among them 200 cMyc targets.

                    cMyc target   no cMyc target    total
top-regulated            20              80           100
not top-regulated       180           24720         24900
total                   200           24800         25000

Fisher's exact test ≈ χ2 test = Hypergeometric test
http://www.langsrud.com/fisher.htm
p-Value = 1.34E-22
Enrichment = (20*24720)/(80*180) = 34.3-fold
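The calculation on this slide can be reproduced with the Python standard library alone. The sketch below assumes a one-sided test (probability of seeing at least 20 cMyc targets in the group) computed from the hypergeometric distribution; the function and variable names are illustrative and not taken from the lecture. Whether the online calculator linked on the slide reports exactly this tail is not stated there, so small differences from the quoted p-value are possible.

```python
from math import comb

def hypergeom_tail(hits, group_size, category_size, total):
    """One-sided p-value: P(at least `hits` category members in the group)."""
    max_hits = min(group_size, category_size)
    # Number of possible groups containing k category members, summed over k >= hits.
    favourable = sum(
        comb(category_size, k) * comb(total - category_size, group_size - k)
        for k in range(hits, max_hits + 1)
    )
    # Divide by the number of all possible groups of this size.
    return favourable / comb(total, group_size)

def enrichment_factor(hits, group_size, category_size, total):
    """Odds ratio of the 2x2 table, i.e. (a*d)/(b*c) as on the slide."""
    a = hits                                           # in group, in category
    b = group_size - hits                              # in group, not in category
    c = category_size - hits                           # not in group, in category
    d = total - group_size - category_size + hits      # in neither
    return (a * d) / (b * c)

# Numbers from the slide: 20 of 100 top-regulated proteins are cMyc targets,
# 200 of 25 000 proteins are cMyc targets overall.
p_value = hypergeom_tail(20, 100, 200, 25000)
enrichment = enrichment_factor(20, 100, 200, 25000)
print(f"p-value = {p_value:.3g}, enrichment = {enrichment:.1f}-fold")
```

Exact integer arithmetic via `math.comb` avoids the underflow that a naive product of floating-point probabilities would risk at p-values this small.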
Frequently used sources for pathway annotation
• Gene Ontology (GO): Comprehensive; ontologies defined by consortium, gene assignments by EBI. Three different ontologies: "biological process", "molecular function", "cellular component".
• Sequence motifs: Functional domains and other conserved sequence regions. PROSITE, Pfam, etc.
• UniProt keywords: Keywords plus words from the publication titles, from the protein name and description.
• Chromosomal localization: Derived from Ensembl, useful for tumor analysis, etc.
• Cell markers: Collected from the literature and multiple published expression projects.
• KEGG: "Kyoto Encyclopedia of Genes and Genomes", mainly metabolic pathways.
• Complex membership: From publications (largely high-throughput experiments).
• TF targets: Collected from various databases including MSigDB.
• Curated pathways: Collected from various databases including NetPath, PathWiki, Reactome.
GO is the most widely used resource
"The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The GO collaborators are developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner"
• Ontologies defined by consortium (covering all of biology in all organisms)
• Gene assignments by 'genome authorities' (human: EBI, mouse: MGD)
• Three ontologies: "biological process", "molecular function", "cellular component"
• Organized as a 'directed acyclic graph' (DAG)
[DAG diagram with example terms. Biological process: Apoptosis, Cell cycle, Response to pathogen. Molecular function: Protein Kinase, Receptor, Transcription factor. Cellular component: Cell, Organelle, Membrane, Mitochondrion, Nucleus, Ribosome, Mitochondrial Membrane, Intermembrane space, Outer Membrane, Inner Membrane (Inner Mito. Membrane)]
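Because the ontology is a DAG rather than a tree, a term can have several parents, and a protein annotated with a specific term implicitly belongs to all of its ancestors. The sketch below illustrates this with a toy fragment built from the cellular-component terms named on the slide. The parent links in `PARENTS` are assumptions made for illustration, since the flattened diagram does not preserve its edges; they are not the official GO relationships.

```python
# Toy "cellular component" fragment; parent links are ASSUMED for illustration,
# they are not taken from the official GO release.
PARENTS = {
    "Inner Membrane": ["Mitochondrial Membrane"],
    "Outer Membrane": ["Mitochondrial Membrane"],
    "Mitochondrial Membrane": ["Membrane", "Mitochondrion"],  # two parents: DAG, not tree
    "Intermembrane space": ["Mitochondrion"],
    "Mitochondrion": ["Organelle"],
    "Nucleus": ["Organelle"],
    "Ribosome": ["Organelle"],
    "Organelle": ["Cell"],
    "Membrane": ["Cell"],
    "Cell": [],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:  # a term reachable via two paths is visited once
                seen.add(parent)
                stack.append(parent)
    return seen

# A protein annotated as 'Inner Membrane' also counts for all broader terms.
implied = ancestors("Inner Membrane")
print(sorted(implied))
```

For enrichment statistics this propagation matters: counts for a broad category such as 'Membrane' should include the proteins annotated only with one of its more specific descendants.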
GO is braindead at multiple levels
II. Automatic mass-annotations
• Good coverage in broad 'boring' categories:
  • properties that can be gleaned from protein classes
  • properties that are associated with sequence domains/motifs
  • properties that can be guessed from the protein name
• Poor coverage in more specific categories
Example 1: All keratins (type I, II, cytokeratins, hair keratins, follicular keratins) have the same set of annotations: 'epidermis development', 'intermediate filament', 'keratin filament', 'structural constituent of epidermis', 'structural molecule'.
Annotators often fall for misleading names: the KCTD family is wrongly classified as 'potassium transporters' (with a whole group of associated annotations, e.g. 'plasma membrane associated') just because its members contain a domain called 'potassium channel tetramerization domain'. There are lots of similar examples.
GO is getting better: This problem from two years ago has disappeared
[Diagram: GO terms Cytokine Activity, Cytokine Receptor Binding, Prolactin Receptor Binding, Interleukin-10 Receptor Binding; proteins SOCS2, IL-10, Prolactin, GH]
• Number of false negatives greatly reduced
• Number of inconsistencies between human and mouse greatly reduced
Useful outside resources for PA
• GSEA: http://www.broadinstitute.org/gsea/index.jsp Gene set enrichment analysis. Similar concept to TreeRanker.
• DAVID: http://david.abcc.ncifcrf.gov/ Several services, including annotation enrichment.
• Cytoscape: http://www.cytoscape.org/ Network designer/editor, extensible through modules. Useful for protein interaction networks, coloring pathways by expression, etc.
• Genemania: http://genemania.org/ Useful for finding connections within gene sets. Also available as a Cytoscape module.