Cis-Regulatory/Text Mining Interface Discussion
Questions
(1) What does ORegAnno want from text mining?
  • Curation queue
  • Document mark-up
  • Mapping to database IDs
(2) What does text mining need from ORegAnno?
(3) What can text mining provide?
  • What level of performance is needed?
(4) What is the right way to proceed?
  • Data sets for BioCreAtIvE?
  • Custom tools for individual "early adopters"?
Answers: (1) What does ORegAnno want from text mining?
• Management of the curation queue
  • Ideally user-customized, so that each user annotates the documents of immediate interest to her/him
• Document mark-up to highlight relevant passages (a minimal sketch follows this slide)
  • A workflow pipeline making either the HTML or PDF version of the document available, with the (potentially) relevant terms highlighted
  • Support for "cut and paste" transfer of relevant regions to the database comment fields
• Mapping to IDs and ontology codes
  • Gene, transcription factor (protein), organism, cell and tissue type, evidence types
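As a concrete illustration of the mark-up step above, here is a minimal Python sketch that wraps candidate terms in `<mark>` tags for curator review. The term list, the page snippet, and the use of `<mark>` are illustrative assumptions, not part of any existing ORegAnno tooling.

```python
import re

def highlight_terms(html_text: str, terms: list[str]) -> str:
    """Wrap each candidate term in <mark> tags so a curator can spot it.

    Longest terms are substituted first so that, e.g., a multi-word name
    is not broken up by an earlier match on one of its parts. This naive
    regex approach will also match inside tag attributes, so a real tool
    would operate on a parsed DOM instead.
    """
    for term in sorted(terms, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        html_text = pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", html_text)
    return html_text

# Hypothetical output of an upstream entity tagger for one paper
candidates = ["HNF4", "albumin", "promoter"]
page = "<p>The HNF4 binding site in the albumin promoter was mapped.</p>"
print(highlight_terms(page, candidates))
```

The highlighted HTML could then be served alongside the annotation form, supporting the cut-and-paste transfer of evidence passages into the database.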
Answers: (2) What does text mining need from ORegAnno?
• A significant quantity of reliably annotated data to train text mining systems
  • Annotated at a level useful for natural language processing (e.g., marked for evidence at the phrase, sentence, or passage level, depending on the task)
• This requires that ORegAnno have:
  • A clear statement of the scope of the ORegAnno database and a stable set of annotation guidelines
  • Annotations with high inter-annotator agreement (see the kappa sketch below)
  • Tracking of entries by annotator, including depth of annotation (different annotators will annotate to different levels of detail, depending on their interests)
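Inter-annotator agreement is commonly measured with Cohen's kappa, which corrects raw agreement for chance. Below is a small self-contained sketch of the calculation; the two-label scheme and toy data are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa between two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance given each annotator's
    marginal label distribution.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1:          # annotators are constant and identical
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators deciding whether a sentence is evidence
a = ["evidence", "evidence", "other", "other", "evidence"]
b = ["evidence", "other",    "other", "other", "evidence"]
print(f"kappa = {cohen_kappa(a, b):.2f}")  # kappa = 0.62
```

Values above roughly 0.8 are conventionally read as strong agreement, which is the level the stabilized guidelines should aim for.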
Answers: (3) What can text mining provide?
• Curation queue management
  • Document classification approaches (from, e.g., TREC Genomics or BioCreAtIvE) can be applied and evaluated, making use of new training data from the pre-jamboree and jamboree annotation (a classifier sketch follows this slide)
  • We can experiment with user-defined criteria, based on restrictions for gene, transcription factor, organism, tissue, etc.
• Document mark-up
  • Users could be provided with a list of genes/transcription factors in a paper, with hot links into the paper to find the relevant passages
  • This would let the annotator drive the annotation process, selecting only those annotations that are correct and relevant; it also provides feedback, using ORegAnno annotations to validate and train the text mining
  • Such a tool should make it easy for the annotator to supply the underlying text passages as evidence for the annotation, providing more training data
• Mapping to unique identifiers / controlled vocabulary / ontology
  • For each entity type (gene, transcription factor, organism, tissue type, ...), a tool can provide a mapping to the correct identifier; where there is ambiguity, the tool could provide a ranked list for the annotator to choose from
  • A tool can also flag different evidence types, with suggested code(s)
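A minimal sketch of the document-classification idea for queue triage, assuming a training set of (abstract, relevance) pairs derived from prior curation decisions. The toy data, label names, and library choice (scikit-learn) are assumptions for illustration, not the setup used by any particular BioCreAtIvE or TREC system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for prior ORegAnno curation decisions
abstracts = [
    "TFBS for HNF4 identified upstream of the albumin promoter",
    "Crystal structure of a bacterial membrane transporter",
    "Reporter assays confirm a regulatory element in the Sox2 enhancer",
    "A benchmark of DNA sequencing hardware throughput",
]
labels = ["relevant", "not_relevant", "relevant", "not_relevant"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(abstracts, labels)

# Rank an incoming queue by predicted probability of relevance,
# so curators see the most promising papers first
queue = [
    "A novel enhancer drives tissue-specific expression of Pax6",
    "Improved battery chemistry for portable sequencers",
]
relevant_col = list(clf.classes_).index("relevant")
for doc, p in zip(queue, clf.predict_proba(queue)[:, relevant_col]):
    print(f"{p:.2f}  {doc}")
```

In practice the ranked scores would feed each user's personalized queue, with the user-defined gene/organism/tissue restrictions applied as an additional filter.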
Answers: (4) How to proceed?
• Stabilize the guidelines and redo the inter-annotator agreement experiment (and write it up)
• Prepare a gold-standard data set of expert-annotated data for training new annotators
• Collect a sufficient amount of training data for the various tasks (queue management, document mark-up, automated mapping)
• Develop an end-to-end pipeline (in the style of the FlySlip project) to capture whole documents in machine-readable form for mark-up (a skeleton follows this slide)
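The slides do not specify the pipeline's internals, so the following is only a skeleton of the intended flow, with every stage stubbed out and all function names hypothetical.

```python
def fetch_document(pmid: str) -> bytes:
    """Retrieve the full-text HTML/PDF for a PubMed ID."""
    raise NotImplementedError  # e.g., publisher API or a local archive

def to_plain_text(raw: bytes) -> str:
    """Convert HTML/PDF to machine-readable text, preserving offsets."""
    raise NotImplementedError

def tag_entities(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, entity_type) spans for genes, TFs, tissues, ..."""
    raise NotImplementedError

def render_markup(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Produce the highlighted view the curator annotates from."""
    raise NotImplementedError

def process(pmid: str) -> str:
    raw = fetch_document(pmid)
    text = to_plain_text(raw)
    return render_markup(text, tag_entities(text))
```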
Recommendations: Training Materials & Tools
• Case studies and gold-standard annotated articles
• On-line training
  • Perhaps with a way for new annotators to test themselves against a set of gold-standard annotations
  • This will require automated comparison of annotations for certain fields (see the sketch below)
• Links to the best tools
• Tools:
  • A copy mechanism for largely duplicated records
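A minimal sketch of the automated field-by-field comparison mentioned above, scoring a trainee's annotations against gold-standard records. The record structure and field names are assumptions for illustration, not the ORegAnno schema.

```python
# Hypothetical gold-standard records keyed by document ID
GOLD = {"pmid:12345": {"gene": "ALB", "tf": "HNF4", "organism": "human"}}

def score_annotator(submitted: dict, gold: dict) -> dict:
    """Per-field agreement rate between trainee and gold annotations."""
    fields = ("gene", "tf", "organism")
    totals = {f: 0 for f in fields}
    hits = {f: 0 for f in fields}
    for doc_id, gold_rec in gold.items():
        sub_rec = submitted.get(doc_id, {})
        for f in fields:
            totals[f] += 1
            hits[f] += int(sub_rec.get(f) == gold_rec[f])
    return {f: hits[f] / totals[f] for f in fields}

trainee = {"pmid:12345": {"gene": "ALB", "tf": "HNF1", "organism": "human"}}
print(score_annotator(trainee, GOLD))
# {'gene': 1.0, 'tf': 0.0, 'organism': 1.0}
```

Only fields with controlled values (identifiers, ontology codes) can be compared this way; free-text comment fields would still need human review.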