490 likes | 607 Views
Introduction to Short Lists. Chair of Proteomics and Bioanalytics Technical University of Munich Amin M. Gholami March 2014. Interpreting Protein Lists. My cool new screen worked and produced 1000 hits! …Now what? Large-Scale Analysis Tell me what ’ s interesting about these proteins.
E N D
Introduction to Short Lists Chair of Proteomics and Bioanalytics Technical University of Munich Amin M. Gholami March 2014
Interpreting Protein Lists • My cool new screen worked and produced 1000 hits! …Now what? • Large-Scale Analysis • Tell me what’s interesting about these proteins
Interpreting Protein Lists • My cool new screen worked and produced 1000 hits! …Now what? • Large-Scale Analysis (Omics) • Proteomics • Tell me what’s interesting about these proteins • Are they enriched in known pathways, complexes, functions Analysis tools Ranking or clustering Eureka! New heart disease Protein/gene! Prior knowledge about cellular processes
Before analysis • Normalization • Quality control (garbage in, garbage out) • Use statistics that will increase signal and reduce noise specifically for your experiment • Other analyses you may want to use to evaluate changes • Make sure your protein IDs are compatible with software
Biological Questions • Step 1: What do you want to accomplish with your list (hopefully part of experiment design! ) • Summarize biological processes or other aspects of molecular function • Perform differential analysis – what pathways are different between samples? • Find new pathways or new pathway members • Discover new gene/protein function • Correlate with a disease or phenotype (candidate prioritization)
Biological Answers • Computational analysis methods we will cover • Pathway enrichment analysis: summarize and compare • Predict function, find new pathway members, identify functional modules (new pathways)
Topics • ID convert • Gene ontology – widely used gene set database • Pathway databases – KEGG • P value and FDR • One stop tools for analysis
For your information Common Identifiers Gene EnsemblENSG00000139618 Entrez Gene 675 UnigeneHs.34012 RNA transcript GenBankBC026160.1 RefSeqNM_000059 EnsemblENST00000380152 Protein EnsemblENSP00000369497 RefSeqNP_000050.2 UniProtBRCA2_HUMAN or A1YBP1_HUMAN IPI IPI00412408.1 EMBL AF309413 PDB 1MIU Species-specific HUGO HGNC BRCA2 MGI MGI:109337 RGD 2219 ZFIN ZDB-GENE-060510-3 FlyBaseCG9097 WormBaseWBGene00002299 or ZK1067.1 SGD S000002187 or YDL029W Annotations InterProIPR015252 OMIM 600185 PfamPF09104 Gene Ontology GO:0000724 SNPs rs28897757 Experimental Platform Affymetrix208368_3p_s_at Agilent A_23_P99452 CodeLinkGE60169 IlluminaGI_4502450-S Red = Recommended
Identifier Mapping • So many IDs! • Software tools recognize only a handful • May need to map from your short list IDs to standard IDs • Four main uses • Searching for a favorite gene / protein name • Link to related resources • Identifier translation • E.g. Proteins to genes, Affy ID to Entrez Gene • Merging data from different sources • Find equivalent records
ID Challenges • Avoid errors: map IDs correctly • Gene name ambiguity – not a good ID • e.g. FLJ92943, LFS1, TRP53, p53 • Better to use the standard gene symbol: TP53 • Excel error-introduction • OCT4 is changed to October-4 • Problems reaching 100% coverage • E.g. due to version issues • Use multiple sources to increase coverage Zeeberg BR et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics BMC Bioinformatics. 2004 Jun 23;5:80
ID Mapping Services • Biodbnet • http://biodbnet.abcc.ncifcrf.gov/db/db2db.php#biodb • David ID converter • PICR (proteins only) • http://www.ebi.ac.uk/Tools/picr/ Covert IPI to Gene symbol IPI00867739 IPI00000816 IPI00908804 IPI00002460 IPI00002614 IPI00167079 IPI00741970 IPI00003566
Recommendations • For proteins and genes • (do notconsider spliceforms) • Map everything to Entrez Gene IDs using a spreadsheet • If 100% coverage desired, manually curate missing mappings • Be careful of Excel auto conversions – especially when pasting large gene lists! • Remember to format cells as ‘text’ before pasting
What Have We Learned? • Genes and their products and attributes have many identifiers (IDs) • Genomics often requires conversion of IDs from one type to another • ID mapping services are available • Use standard, commonly used IDs to reduce ID mapping challenges
Gene set • What is gene set • Databases: • GO • MSigDB • Most widely used: Gene Ontology
What is the Gene Ontology (GO)? • Set of biological phrases (terms) which are applied to genes: • protein kinase • apoptosis • membrane • Dictionary: term definitions • Ontology: A formal system for describing knowledge www.geneontology.org
Ontologies Cellular component: the parts of a cell or its extracellular environment. cells, tissues, organs, and organisms. Molecular function: the elemental activities of a gene product at the molecular level, such as binding, catalysis Biological process: ordered envents performed by a set of biological molecules, such as cellular physiological process or signal transduction
GO Structure • Terms are related within a hierarchy • is-a • part-of • Describes multiple levels of detail of gene function • Terms can have more than one parent or child
What GO Covers? • GO terms divided into three aspects: • cellular component • molecular function • biological process glucose-6-phosphate isomerase activity Cell division
Part 1/2: Terms • Where do GO terms come from? • GO terms are added by editors at EBI and gene annotation database groups • Terms added by request • Experts help with major development • 34065 terms, with definitions • 20703 biological_process • 2824 cellular_component • 9029 molecular_function • As of April 2011
Part 2/2: Annotations • Genes are linked, or associated, with GO terms by trained curators at genome databases • Known as ‘gene associations’ or GO annotations • Multiple annotations per gene • Some GO annotations created automatically (without human review)
Annotation Sources • Manual annotation • Curated by scientists • High quality • Small number (time-consuming to create) • Reviewed computational analysis • Electronic annotation • Annotation derived without human validation • Computational predictions (accuracy varies) • Lower ‘quality’ than manual codes • Key point: be aware of annotation origin
Evidence Types • Experimental Evidence Codes • EXP: Inferred from Experiment • IDA: Inferred from Direct Assay • IPI: Inferred from Physical Interaction • IMP: Inferred from Mutant Phenotype • IGI: Inferred from Genetic Interaction • IEP: Inferred from Expression Pattern • Author Statement Evidence Codes • TAS: Traceable Author Statement • NAS: Non-traceable Author Statement • Curator Statement Evidence Codes • IC: Inferred by Curator • ND: No biological Data available • Computational Analysis Evidence Codes • ISS: Inferred from Sequence or Structural Similarity • ISO: Inferred from Sequence Orthology • ISA: Inferred from Sequence Alignment • ISM: Inferred from Sequence Model • IGC: Inferred from Genomic Context • RCA: inferred from Reviewed Computational Analysis • IEA: Inferred from electronic annotation http://www.geneontology.org/GO.evidence.shtml
Species Coverage • All major eukaryotic model organism species and human • Several bacterial and parasite species through TIGR and GeneDB at Sanger • New species annotations in development • Current list: • http://www.geneontology.org/GO.downloads.annotations.shtml
GO Software Tools • GO resources are freely available to anyone without restriction • Includes the ontologies, gene associations and tools developed by GO • Other groups have used GO to create tools for many purposes • http://www.geneontology.org/GO.tools
Other GO tools • GOrilla (multiple eukaryotes) • http://cbl-gorilla.cs.technion.ac.il/ • - AgriGO (many plants, several animals) • http://bioinfo.cau.edu.cn/agriGO/ • - L2L (human, mouse and rat) • http://depts.washington.edu/l2l/ • - GOTermFinder (many species) • http://go.princeton.edu/cgi-bin/GOTermFinder • - FatiGO (muliple eukaryotes) • http://babelomics.bioinfo.cipf.es/
Use gene ontology - Example Give you a protein, for example ABL2, what is the function of this protein? Where the protein is located in a cell? Which process the protein is involved in? Give a Terms, for example, calcium-dependent cell-cell adhesion What genes/proteins involved in this process What are the parent and child terms?
Pathway • What is pathway • A set of gene • What is different with gene set (like GO)? • Databases: KEGG, reactome, biocarta, PID • KEGG: • Multiple organism • 10+ million genes • 449 pathway
example • Pathway: Cell cycle • http://www.kegg.jp/kegg/pathway.html • High light your protein in the pathway map, for example: p53 • You have a set of genes, see the pathway involved: • Using pathway mapper • http://www.genome.jp/kegg/tool/map_pathway1.html • TP53, TGF-beta, cd25a, cdk2 • K04451 K13375 K06645 K02206
What Have We Learned? • Many gene attributes in databases • Gene Ontology (GO) provides gene function annotation • GO is a classification system and dictionary for biological concepts • Annotations are contributed by many groups • More than one annotation term allowed per gene • Some genomes are annotated more than others • Annotation comes from manual and electronic sources • GO can be simplified for certain uses (GO Slim) • Many gene attributes available from Ensembl and Entrez Gene
Multiple Testing Correction • If we test many independent samples from the same population, some of them will lead to the null hypothesis rejection • However, even if the null hypothesis is TRUE, we do expect a rejection rate > 0: M*α, where M is the number of tests performed • Therefore, we need multiple test correction. • Method: • Bonferroni correction (conservative) • FDR correction (less conservative)
Why DAVID? • There are many on their website: http://david.abcc.ncifcrf.gov/home.jsp • Multiple function: ID conversion, enrichment analysis from molecule level to gene module/term level • Comprehensive: annotation data base including GO, KEGG, Reactome, other molecule module database, etc • From detail to sweeping: linking to external websites, different representation methods facilitate the interpretation of your data
David tools • Gene functional classification tool • Function annotation tool • Function annotation clustering • Function annotation chart • Function annotation table • Gene ID conversion tool • Gene name batch viewer • List names of ID, from IPI or probe ID, etc
Background Change Change background Other possible backgrounds
Why change background • Most analysis based on the enrichment analysis, which means that the analysis is not only depends on the short list itself, but also the total number of proteins we have in the corresponding condition. Therefore a certain background must be set up to performs the comparison. • Except default setting, there are several microarray platform could be set as background. Since in MA study, all genes which is possible to be detected is determined by the array. • This is not the case in our MS data, so keep the default setting is a wise idea.
Gene function classification • Group functional related genes together • fuzzy clustering techniques • on the basis of genes annotation term co-occurrence Gene cluster identified by DAVID Ranked by Enrichment Score Related Genes Term Report genes-to-terms 2-D view User Gene IDs Gene Names
Gene function classification • Enrichment score: the higher, more enriched in the listed cluster • Related genes: measured by the similarity of genes global annotation profiles, such as transcriptional factor, kinase, etc • Term report: lists the major annotation terms associated within the functional groups based on the DAVID enrichment engine
Gene function classification • genes-to-terms 2-D view: • gives the detailed relationship of genes-to-terms. • The annotation terms are ordered based on their enrichment scores associated with the group • Green represents the positive association between the gene-term; black represent an unknown relationship.
Function annotation tool • Term based annotation • Each term contains genes involved in its function • The terms could be • Analysis: • Term enrichment analysis • Terms clustering
Function annotation tool • Functional annotation chart: identify the most overrepresented biological terms associated with a given gene list List genes involved in the term Related terms Other columns Term Resource of term
Function annotation tool • count: How many of gene in your list are presented in the term? • Percent: The percentage of genes in the term over total number of genes in your list. • P-value: The probability of the more genes in your list involved in the term just by chance. • Benjamine …: the FDR correction of multiple testing.
Function annotation tool • Functional annotation table: The same analysis with function annotation chart but with a different representation. Term resources Genes Terms
Function annotation tool • Which one should we use, functional annotation chart or functional annotation table?
Function annotation tool • Functional annotation clustering • enrichment analysis on the term level • fuzzy clustering based on co-association Term Enrichment Score “Red G” Related Terms Term Resource List genes in the term Term Clusters 2-D heatmap Term
Function annotation tool • “Red G”: pull out all genes involved in the cluster • 2-D heatmap: the term-gene 2D plot where which interpreted as the same way with before. • Term: please select a term with resource from Biocarta or KEGG and click it. It will provide a nice view…
Tasks • HSP90 data from the paper: • http://www.mcponline.org/content/11/6/M111.016675 • The experiment goal is to analysis HSP90 related proteins. • Differential expression between Geldanamycin treated cell line vs. untreated cell line (Geldanamycin inhibits the function of HSP90) • 4 cancer cell lines: CAL27, COLO205, MDAMB231, K562 • For each cell line: Geldanamycin treated cell line vs. untreated cell line • T-test used to find the differentially expressed proteins
Select IPI IDs that have FDR’s lower than 0.05 from one of the cell lines, load the list to DAVID, convert to • ENSEMBL GENE ID • ENSEMBL TRANSCRIPTS ID • How much unique converted IDs you get? • REFSEQ PROTEIN ID • REFSEQ MRNA ID • REFSEQ GENOMIC ID • How much unique IDs converted you get? • Compare the number of unique IDs your get from ENSEMBL Transcripts ID and REFSEQ mRNAIDs, how do you explain the difference between them?
Annotation databases are the core part of DAVID, it defines what information we can get from the analysis. Look into them and think about when should we use which database? • Select your interesting protein IDs from one of the cell lines, load to DAVID and run analysis, • Which cellular compartment is the ID list most enriched? • Which pathways is your ID list most enriched? • Which is the most over-enriched protein domain of your ID list?
Perform the same analysis on other cell lines, do you find the same or similar genes or gene sets among different cell lines? • What conclusion you get from mining the HSP90 protein? Discuss which protein/pathway you want to examine further in the next wet lab?