
Cis-Regulatory/Text Mining Interface



Presentation Transcript


  1. Cis-Regulatory/Text Mining Interface: Discussion

  2. Questions
     (1) What does ORegAnno want from text mining?
         • Curation queue
         • Document mark-up
         • Mapping to database IDs
     (2) What does text mining need from ORegAnno?
     (3) What can text mining provide?
         • What level of performance is needed?
     (4) What is the right way to proceed?
         • Data sets for BioCreAtIvE?
         • Custom tools for individual “early adopters”?

  3. Answers: (1) What Does ORegAnno Want from Text Mining?
     • Management of the curation queue
         • Ideally user-customized, so that each user annotates the documents of most immediate interest to her/him
     • Document mark-up to highlight relevant passages
         • A workflow pipeline that makes the HTML or PDF version of the document available, with the (potentially) relevant terms highlighted
         • Support for “cut and paste” transfer of relevant regions to the database comment fields
     • Mapping to IDs and ontology codes
         • Gene, transcription factor (protein), organism, cell and tissue type, evidence types

  4. Answers: (2) What Does Text Mining Need from ORegAnno?
     • A significant quantity of reliably annotated data to train text mining systems
     • Data annotated at a level useful for natural language processing (e.g., marked for evidence at the phrase, sentence, or passage level, depending on the task)
     • This requires that ORegAnno have:
         • A clear statement of the scope of the ORegAnno database and a stable set of annotation guidelines
         • Annotations with high inter-annotator agreement
         • Tracking of entries by annotator, including depth of annotation (different annotators will annotate to different levels of detail, depending on their interests)
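
One common way to quantify the inter-annotator agreement mentioned above is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The slides do not say which measure ORegAnno used, so the following is only a minimal sketch under that assumption; the function name and the "evidence"/"none" labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sentence-level labels from two annotators.
annotator_1 = ["evidence", "evidence", "none", "none"]
annotator_2 = ["evidence", "none", "none", "none"]
print(cohen_kappa(annotator_1, annotator_2))
```

Kappa handles only two annotators and categorical labels; agreement on free-text fields or among many annotators would need a different measure.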

  5. Answers: (3) What Can Text Mining Provide?
     • Curation queue management
         • Document classification approaches (e.g., from TREC Genomics or BioCreAtIvE) can be applied and evaluated, making use of new training data from pre-jamboree and jamboree annotation
         • We can experiment with “user-defined” criteria, based on restrictions for gene, transcription factor, organism, tissue, etc.
     • Document mark-up
         • Users could be provided with a list of genes/transcription factors in a paper, with hot links into the paper to find relevant passages
         • This would allow the annotator to drive the annotation process, selecting only those annotations that are correct and relevant. This in turn provides feedback, using ORegAnno annotations to validate and train the text mining
         • Such a tool should make it easy for the annotator to supply the underlying text passages as evidence for the annotation, which provides more training data
     • Mapping to unique identifiers/controlled vocabulary/ontology
         • For each entity type (gene, transcription factor, organism, tissue type...), a tool can provide a mapping to the correct identifier; where there is possible ambiguity, the tool could provide a ranked list for the annotator to choose from
         • A tool can also flag different evidence types, with suggested code(s)
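
The ranked-list idea for ambiguous mentions can be sketched as follows. This is an illustrative assumption, not a tool described in the slides: the lexicon, its placeholder identifiers, and the function `rank_candidates` are hypothetical, and plain string similarity from Python's standard `difflib` stands in for whatever ranking a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical lexicon: placeholder identifier -> known names for that entity.
LEXICON = {
    "GENE:0001": ["tumor protein p53", "TP53", "p53"],
    "GENE:0002": ["tumor protein p63", "TP63", "p63"],
    "GENE:0003": ["MYC proto-oncogene", "MYC", "c-Myc"],
}


def rank_candidates(mention, lexicon, top_n=3):
    """Return up to top_n (identifier, score) pairs, best match first."""
    needle = mention.lower()
    scored = []
    for ident, names in lexicon.items():
        # Score an identifier by its best-matching name.
        best = max(
            SequenceMatcher(None, needle, name.lower()).ratio()
            for name in names
        )
        scored.append((best, ident))
    # Highest score first; ties broken by identifier for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(ident, round(score, 2)) for score, ident in scored[:top_n]]


# The annotator would pick the correct entry from this ranked list.
for ident, score in rank_candidates("p53", LEXICON):
    print(ident, score)
```

The annotator's choice from such a list is itself a labelled example, which fits the feedback loop described in the document mark-up bullets.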

  6. Answers: (4) How to Proceed?
     • Stabilize the guidelines, redo the inter-annotator agreement experiment, and write it up
     • Prepare a gold-standard data set of expert-annotated data for training new annotators
     • Collect a sufficient amount of training data for the various tasks (queue management, document mark-up, automated mapping)
     • Develop an end-to-end pipeline (in the style of the FlySlip project) to capture whole documents in machine-readable form for mark-up

  7. Recommendations: Training Materials & Tools
     • Case studies and gold-standard annotated articles
     • On-line training
         • Perhaps with a way for new annotators to test themselves against a set of gold-standard annotations
         • This will require automated comparison of annotations for certain fields
     • Best tools links
     • Tools:
         • Copy mechanism for largely duplicated records
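
The automated comparison against gold-standard annotations could, for fields that hold identifiers, be as simple as per-field precision and recall. The sketch below assumes each annotation is a dictionary mapping a field name to a list of identifiers; that record layout, the field names, the placeholder IDs, and the function `score_against_gold` are all hypothetical.

```python
def score_against_gold(trainee, gold, fields):
    """Per-field precision and recall of a trainee annotation vs. gold."""
    report = {}
    for field in fields:
        trainee_ids = set(trainee.get(field, []))
        gold_ids = set(gold.get(field, []))
        hits = len(trainee_ids & gold_ids)
        # An empty set on either side is scored as 0.0 for that measure.
        precision = hits / len(trainee_ids) if trainee_ids else 0.0
        recall = hits / len(gold_ids) if gold_ids else 0.0
        report[field] = {"precision": precision, "recall": recall}
    return report


# Hypothetical records using placeholder identifiers.
gold = {"gene": ["GENE:0001"], "organism": ["ORG:0001"]}
trainee = {"gene": ["GENE:0001", "GENE:0002"], "organism": []}

for field, scores in score_against_gold(trainee, gold, ["gene", "organism"]).items():
    print(field, scores)
```

Exact set comparison only suits fields with controlled identifiers; free-text fields such as comments would need manual review or a looser match.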
