LESSONS FROM THE BIOCREATIVE PROTEIN-PROTEIN INTERACTION (PPI) TASK
RegCreative Jamboree, Friday, December 1st, 2006
MARTIN KRALLINGER, 2006
PROTEIN-PROTEIN INTERACTIONS (PPI)
• Crucial to understanding the functional role of proteins
• Relevant for the organization of biological processes
• Development of high-throughput experimental technologies
• Implications of PPI for gene regulation (TFs and co-regulators)
• Interaction networks and diseases (e.g. cancer)
M. Krallinger and A. Valencia. Applications of Text Mining in Molecular Biology, from name recognition to Protein interaction maps. In Data Analysis and Visualization in Genomics and Proteomics, chapter 4, Wiley.
PPI ANNOTATION AND DATABASES
• IMEx agreement to share curation efforts
• Proteomics Standards Initiative (PSI) recommendation
• Molecular Interaction (MI) Ontology
• Large-scale experiments
• Literature curation
BIOCREATIVE PPI TASK
• Rapid literature growth and manual curation
• Automatic extraction of protein-protein interactions from text
• Variety of published strategies
• Main goals:
  (1) To determine the state of the art
  (2) To produce useful resources for training and testing
  (3) To learn which approaches are successful and practical
  (4) To monitor interesting new approaches
  (5) To provide useful tools to extract protein-protein interactions from texts
• Task design resembles the steps of the manual curation process
(Figure label: Structured record)
Second BioCreative challenge evaluation
http://biocreative.sourceforge.net/index.html
INTERACTION ARTICLE SUBTASK (IAS)
• Identify those articles which are curation relevant
• Document categorization task
• Based on PubMed abstracts
• Training set consisted of:
  (1) P: abstracts of PPI-relevant articles from MINT/IntAct
  (2) N: abstracts not relevant for PPI (exhaustive curation)
  (3) P*: abstracts of interaction-relevant articles from other databases
• Return two collections of ranked documents: P, N
• Evaluation: precision, recall, f-score and AROC
• Participating systems: supervised learning
• Balanced test set, recent publications
(Figure labels: NOT RELEVANT, RELEVANT)
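The IAS evaluation measures can be illustrated with a minimal Python sketch. It assumes binary gold labels (True = curation relevant), one binary prediction and one score per abstract; the function names are illustrative and are not taken from the official BioCreative evaluation scripts. AROC is computed here as the probability that a randomly chosen relevant abstract is scored above a randomly chosen non-relevant one, with ties counted as one half.

```python
from typing import Sequence, Tuple


def precision_recall_f(gold: Sequence[bool],
                       predicted: Sequence[bool]) -> Tuple[float, float, float]:
    """Precision, recall and balanced f-score for binary relevance labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def aroc(gold: Sequence[bool], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for g, s in zip(gold, scores) if g]
    neg = [s for g, s in zip(gold, scores) if not g]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one non-relevant item")
    # A relevant item scored above a non-relevant one counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AROC is quadratic in the number of abstracts, which is acceptable for a test set of this kind; a rank-based formulation would be the usual choice for larger collections.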
LESSON I: IAS TASK AND OREGANNO
• Determine relevance of abstract vs. full text for article selection
• Balanced training collection: positive and negative
• Avoid using journal and date as classifier features
• Define training and test set in terms of publication date, e.g.:
  • Training set: published before 2003
  • Test set: published after 2003
• Enriched training data: sentences with relevant evidence
• Define basic selection strategy:
  • Exhaustive curation of a set of journals: high recall
  • Whole PubMed mining: high precision
• Curation relevance and annotation types
• Integration of resulting applications into the annotation pipeline
• Interactive evaluation: timing and annotation efficiency
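A minimal sketch of the date-based split suggested on this slide, assuming each abstract carries a publication year. The Abstract record, its field names and the PMIDs in the usage lines are hypothetical. Abstracts published in the cutoff year itself are left out of both sets, which is only one possible reading of "before 2003" and "after 2003".

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Abstract:
    """Hypothetical record for one PubMed abstract."""
    pmid: str
    year: int
    relevant: bool


def split_by_year(records: Iterable[Abstract],
                  cutoff_year: int) -> Tuple[List[Abstract], List[Abstract]]:
    """Training set: published before the cutoff; test set: published after it."""
    records = list(records)
    train = [r for r in records if r.year < cutoff_year]
    test = [r for r in records if r.year > cutoff_year]
    return train, test


# Usage with made-up PMIDs (not real identifiers).
corpus = [
    Abstract("pmid-a", 2001, True),
    Abstract("pmid-b", 2003, False),
    Abstract("pmid-c", 2005, True),
]
train_set, test_set = split_by_year(corpus, cutoff_year=2003)
```

Splitting on date rather than at random keeps the test set closer to the real curation setting, where a system is always applied to publications newer than its training data.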
INTERACTION PAIR SUBTASK (IPS)
• Identify protein-protein interaction pairs from full text articles (HTML, PDF)
• Individual proteins identified using UniProt ID/Acc
• Restrict / define a baseline UniProt release
• Extraction of physical PPI (MI ontology)
• Training set: articles and associated PPI pairs
• System output: for each article, a ranked list of PPI pairs
• Evaluation: precision and recall of predicted pairs compared to manual annotation
• Main difficulties: gene normalization / inter-species ambiguity
• No limitation in organism source
Example:
  PMID: 11739376
  Interactor 1: P73213_SYNY3 (Ssr2857 protein)
  Interactor 2: ATCS_SYNY3 (pacS protein)
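The pair-level comparison against manual annotation can be sketched as follows. This is an illustrative Python example, not the official scoring script: it assumes that interaction pairs are unordered (A-B equals B-A), so each pair is stored as a frozenset of UniProt identifiers. The two SYNY3 identifiers come from the slide example; the third identifier is made up.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Pair = Tuple[str, str]


def normalize(pairs: Iterable[Pair]) -> Set[FrozenSet[str]]:
    """Treat each pair as unordered so that (A, B) and (B, A) match."""
    return {frozenset(pair) for pair in pairs}


def pair_precision_recall(predicted: Iterable[Pair],
                          gold: Iterable[Pair]) -> Tuple[float, float]:
    """Precision and recall of predicted pairs for a single article."""
    pred_set = normalize(predicted)
    gold_set = normalize(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


# Manual annotation for PMID 11739376, as shown on the slide.
gold_pairs = [("P73213_SYNY3", "ATCS_SYNY3")]
# System output: the correct pair in reverse order plus one wrong pair.
# "XXXX_SYNY3" is a made-up identifier used only for illustration.
predicted_pairs = [("ATCS_SYNY3", "P73213_SYNY3"), ("ATCS_SYNY3", "XXXX_SYNY3")]
scores = pair_precision_recall(predicted_pairs, gold_pairs)
```

A pair only counts as correct when both normalized identifiers match, which is why gene normalization errors and inter-species ambiguity directly lower both measures.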
LESSON II: IPS TASK AND OREGANNO
GENERAL ASPECTS
• Difficulties due to inter-organism gene name ambiguity
• Difficulty in differentiating experimentally confirmed interactions
• Importance of additional lexical resources
• Indirect expressions for interactions
• Author names of the protein interactors for training
• Protein family ambiguity
ASPECTS FOR A GENE REGULATION EXTRACTION TASK
• Define database for gene normalization
• Consider experimentally confirmed regulation
• Bio-entity types: protein vs. gene (promoter) name finding
• Provide negative and positive training examples of co-occurrences (passages) compared to manual annotation
• Define the actual evaluation metric depending on the needs
INTERACTION SENTENCE SUBTASK (ISS)
• Select the most relevant sentence expressing a protein-protein interaction from a full text article
• Useful for human interpretation and summary generation
• Provide for each interaction pair a ranked list of at most 5 evidence passages (max 3 sentences each)
• Pooling method of the predicted passages
• Evaluation: percentage of relevant sentences with respect to the total number submitted, and mean reciprocal rank of the passages compared to the manual ones
• Example: Using a biochemical approach to search for such co-regulatory factors, we identified hGCN5, TRRAP, and hMSH2/6 as BRCA1-interacting proteins.
• Additional collections also included: Prodisen collection, Veuthey collection, Brun collection, GeneRIF interaction sentences
M. Krallinger, R. Malik and A. Valencia. Text Mining and Protein Annotations: the Construction and Use of Protein Description Sentences. Genome Informatics, Vol. 17, No. 2.
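Both ISS measures can be sketched in a few lines of Python. The sketch assumes that each interaction pair comes with one ranked list of relevance judgements (True where the submitted passage was judged relevant), and that the reciprocal rank is taken at the first relevant passage; the slide does not spell out these details, so this is one common reading and the function names are illustrative.

```python
from typing import Sequence


def reciprocal_rank(relevance: Sequence[bool]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings: Sequence[Sequence[bool]]) -> float:
    """Average reciprocal rank over all interaction pairs."""
    if not rankings:
        raise ValueError("at least one ranked list is required")
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)


def fraction_relevant(rankings: Sequence[Sequence[bool]]) -> float:
    """Relevant passages as a fraction of all submitted passages."""
    submitted = sum(len(r) for r in rankings)
    if submitted == 0:
        raise ValueError("no passages were submitted")
    return sum(sum(1 for x in r if x) for r in rankings) / submitted


# Three interaction pairs with 3, 1 and 2 submitted passages respectively.
judgements = [[False, True, False], [True], [False, False]]
mrr = mean_reciprocal_rank(judgements)
share = fraction_relevant(judgements)
```

With these judgements the reciprocal ranks are 1/2, 1 and 0, so the two measures reward different things: the first reflects how early a useful passage appears, the second how much of the submitted material is useful.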
LESSON III: ISS TASK AND OREGANNO
GENERAL ASPECTS
• Difficulties due to the lack of collections of 'negative training sentences'
• Need for larger (additional) sets of training instances from full text
• Complex descriptions referring to interactions
• Protein normalization and protein family name ambiguity problems
• Multiple-sentence evidence cases (referring expressions, anaphora)
• Importance of figure legends and certain section titles
• Article format dependency (PDF vs. HTML)
ASPECTS FOR A GENE REGULATION EXTRACTION TASK
• Define the semantic types of comment fields (or structure them)
• Length restriction of training passages
• Restriction to certain format types and journals
• Define the type of passage which should be extracted: for gene regulation or for evidence type annotation
INTERACTION METHOD SUBTASK (IMS)
• Identify protein-protein interaction pairs from full text articles together with the interaction detection method
• Map to the MI Ontology (CV)
• Maximum of 5 MI terms per PPI pair
• Extraction of physical PPI (MI ontology)
• Evaluation: mean reciprocal rank compared to the manual annotation
Example submission entry:
<ENTRY>
  <PPI_SUB_TASK_ID> BC2_PPI_IMS </PPI_SUB_TASK_ID>
  <TEAM_ID> T1_BC2_PPI </TEAM_ID>
  <RUN_NR> 1 </RUN_NR>
  <PMID> 10924507 </PMID>
  <INTERACTION_PAIR>
    <INTERACTOR_1> Q08211 </INTERACTOR_1>
    <INTERACTOR_2> Q9UBU9 </INTERACTOR_2>
  </INTERACTION_PAIR>
  <INT_DET_METHOD>
    <INT_DET_METHOD_ID> MI:0004 </INT_DET_METHOD_ID>
    <RANK> 1 </RANK>
  </INT_DET_METHOD>
</ENTRY>
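As an illustration of how such a submission entry could be read, here is a minimal Python sketch using the standard library xml.etree.ElementTree module. The element names and values are copied from the slide; the parse_ims_entry function, the shape of its return value, and the assumption that an entry may repeat INT_DET_METHOD (up to the 5 allowed methods) are illustrative choices, not part of the task specification.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

ENTRY_XML = """
<ENTRY>
  <PPI_SUB_TASK_ID> BC2_PPI_IMS </PPI_SUB_TASK_ID>
  <TEAM_ID> T1_BC2_PPI </TEAM_ID>
  <RUN_NR> 1 </RUN_NR>
  <PMID> 10924507 </PMID>
  <INTERACTION_PAIR>
    <INTERACTOR_1> Q08211 </INTERACTOR_1>
    <INTERACTOR_2> Q9UBU9 </INTERACTOR_2>
  </INTERACTION_PAIR>
  <INT_DET_METHOD>
    <INT_DET_METHOD_ID> MI:0004 </INT_DET_METHOD_ID>
    <RANK> 1 </RANK>
  </INT_DET_METHOD>
</ENTRY>
"""


def _text(parent: ET.Element, path: str) -> str:
    """Return the stripped text of a required child element."""
    value = parent.findtext(path)
    if value is None:
        raise ValueError(f"missing element: {path}")
    return value.strip()


def parse_ims_entry(xml_text: str) -> Dict[str, Any]:
    """Parse one ENTRY element into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    # Collect (MI identifier, rank) tuples and order them by rank.
    methods = sorted(
        ((_text(m, "INT_DET_METHOD_ID"), int(_text(m, "RANK")))
         for m in root.findall("INT_DET_METHOD")),
        key=lambda item: item[1],
    )
    return {
        "subtask": _text(root, "PPI_SUB_TASK_ID"),
        "team": _text(root, "TEAM_ID"),
        "run": int(_text(root, "RUN_NR")),
        "pmid": _text(root, "PMID"),
        "interactors": (_text(root, "INTERACTION_PAIR/INTERACTOR_1"),
                        _text(root, "INTERACTION_PAIR/INTERACTOR_2")),
        "methods": methods,
    }


entry = parse_ims_entry(ENTRY_XML)
```

Stripping whitespace matters here because the entry pads every value with spaces, which would otherwise break exact matching of UniProt accessions and MI identifiers.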
LESSON IV: IMS AND OREGANNO
GENERAL ASPECTS
• Difficulties due to the lack of training sentences for methods
• Very complex task: both the PPI pair and the terms for methods must be found
• Community focused more on IPS than on IMS (too much task overlap)
• Difficulty in separating PPI pair identification from interaction detection method identification
• Different parts of documents refer to the method
• Information in non-textual data (e.g. figures)
ASPECTS FOR A GENE REGULATION EXTRACTION TASK
• Define the controlled vocabulary relevant for annotation (e.g. evidence types)
• Provide lexical resources for evidence types (synonyms, …)
• Extraction of controlled vocabulary terms (ontology concepts) from full text
REGCREATIVE TEXT MINING TASKS
• Different tasks that might result in an automatic, annotation-relevant summary, which could include:
  0. Detection of relevant articles (document categorization & ranking)
  1. Ranked (normalized) TF list extracted from the paper
  2. Ranked list of regulated genes extracted from the paper
  3. Ranked list of evidence types (and subtypes) extracted from the articles together with text passages
  4. Ranked list of associations between TFs and regulated genes together with evidence text
Acknowledgements
• MINT and IntAct for providing the training and test data collections
• Publishers for allowing use of the full text articles (NPG and Elsevier)
• MITRE, NCBI for collaboration in organizing the BioCreative Challenge
• CNIO for their assistance
• Thanks to Lynette Hirschman and Alfonso Valencia for their coordination.
• Thanks to the participating teams from all over the world for their effort in developing the participating systems.
Detailed results will be presented in Madrid at the BioCreative II Evaluation workshop, sponsored by the European Science Foundation, ESF (23-25th of April 2007, CNIO, Madrid) and in a special issue of Genome Biology.
http://biocreative.sourceforge.net/index.html