Extracting Disease-Gene Associations from MEDLINE abstracts
Tsujii laboratory, University of Tokyo
Outline
• NLP tools
  • Part-of-speech tagger, HPSG parser
• Machine learning based approach for extracting Disease-Gene Association
• Evaluation
  • Precision / recall / f-score
  • Effectiveness of predicate argument structures
• DGA explorer
• Annotation tool
Part-of-speech tagger
• Trained on a corpus containing newspaper articles and biology texts.
HPSG parser
• Output
  • Phrase structures (e.g. np, vp, pp)
  • Predicate-argument structures

Example: We demonstrate that E2F-1 activates the promoter.

demonstrate: ARG1: we, ARG2: [1]
[1] activates: ARG1: E2F-1, ARG2: promoter
…
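A predicate-argument structure like the one in the example can be held in a small nested data structure. The following is a minimal sketch, not the parser's actual output format: the `Predicate` class and the `flatten` helper are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Predicate:
    """One predicate with its labelled arguments (ARG1, ARG2, ...)."""
    word: str
    # An argument is either a plain word or another, embedded predicate.
    args: Dict[str, Union[str, "Predicate"]] = field(default_factory=dict)


def flatten(pred: Predicate) -> List[Tuple[str, str, str]]:
    """Return (predicate, label, argument head) triples, outermost first."""
    triples = []
    for label, arg in pred.args.items():
        head = arg.word if isinstance(arg, Predicate) else arg
        triples.append((pred.word, label, head))
    # Then descend into embedded predicates.
    for arg in pred.args.values():
        if isinstance(arg, Predicate):
            triples.extend(flatten(arg))
    return triples


# "We demonstrate that E2F-1 activates the promoter."
activates = Predicate("activates", {"ARG1": "E2F-1", "ARG2": "promoter"})
demonstrate = Predicate("demonstrate", {"ARG1": "we", "ARG2": activates})

for triple in flatten(demonstrate):
    print(triple)
```

Flattening to triples makes the structure easy to match against patterns or to turn into features later on.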
Parsing MEDLINE
• Corpus: 1,500,000 MEDLINE abstracts
• Parsing speed: 5 secs / sentence
• Server: PC cluster (100 processors)
• Time: 10 days
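As a rough consistency check on these figures, the arithmetic below works out how many sentences the cluster could parse in the stated time. It assumes perfect parallelism across all 100 processors and that the whole 10 days went to parsing; the resulting sentence counts are derived here and are not stated on the slide.

```python
SECONDS_PER_SENTENCE = 5   # from the slide
PROCESSORS = 100           # from the slide
DAYS = 10                  # from the slide
ABSTRACTS = 1_500_000      # from the slide

# Total processor time available, assuming every processor is busy throughout.
cpu_seconds = DAYS * 24 * 60 * 60 * PROCESSORS

# Derived figures (assumptions, not reported numbers).
sentences = cpu_seconds / SECONDS_PER_SENTENCE
per_abstract = sentences / ABSTRACTS

print(f"{sentences:,.0f} sentences, about {per_abstract:.1f} per abstract")
```

Under those assumptions the budget covers roughly 17 million sentences, or about 11 to 12 sentences per abstract.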
Extracting Disease-Gene Association
• Preliminary experiments
  • Patterns on predicate-argument structures

accelerates: ARG1: GENE, ARG2: DISEASE
…
demonstrates: ARG1: DISEASE, ARG2: GENE

Result: low recall and precision
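Matching hand-written patterns against predicate-argument structures might look like the sketch below. The `Pas` type, the `PATTERNS` set, and the example structures are hypothetical and only illustrate the idea; they are not the patterns used in the experiments.

```python
from typing import List, NamedTuple


class Pas(NamedTuple):
    """One predicate with the entity types of its two arguments."""
    predicate: str
    arg1: str  # e.g. "GENE", "DISEASE", or "OTHER"
    arg2: str


# Hand-written patterns in the spirit of the slide (illustrative only).
PATTERNS = {
    ("accelerates", "GENE", "DISEASE"),
    ("demonstrates", "DISEASE", "GENE"),
}


def matches(structures: List[Pas]) -> List[Pas]:
    """Return the structures that exactly match one of the patterns."""
    return [s for s in structures if tuple(s) in PATTERNS]


structures = [
    Pas("accelerates", "GENE", "DISEASE"),         # covered by a pattern
    Pas("is associated with", "DISEASE", "GENE"),  # a paraphrase no pattern covers
]
print(matches(structures))
```

Exact matching of this kind misses any phrasing the pattern list does not anticipate, which is one plausible contributor to low recall.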
Machine learning based approach
• Extracted association
• Sentence selection

The patterns on predicate-argument structures are used as features for machine learning.
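One way to turn such patterns into classifier features is to emit a named binary feature for each predicate-argument triple found in a sentence. This is a minimal sketch: the function name, the feature naming scheme, and the back-off to argument types alone are assumptions made for illustration, not details given in the slides.

```python
def pas_features(structures):
    """Map (predicate, arg1_type, arg2_type) triples to binary feature names."""
    features = set()
    for predicate, arg1, arg2 in structures:
        # Full pattern: the predicate together with both argument types.
        features.add(f"pas:{predicate}:{arg1}:{arg2}")
        # Assumed back-off: argument types alone, so unseen verbs still fire a feature.
        features.add(f"pas_types:{arg1}:{arg2}")
    return features


print(sorted(pas_features([("accelerates", "GENE", "DISEASE")])))
```

Unlike hard pattern matching, a feature that fires does not decide the outcome by itself; the learner weighs it against the other evidence in the sentence.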
Training data
• The latter is also implied by fibroblast-associated alterations in tumor cell morphology and ECM distribution in the system.
• Lung fibrosis is a fatal condition of excess extracellular matrix (ECM) deposition associated with increased transforming growth factor beta (TGF-beta) activity.
• All foals with OLWS were homozygous for the Ile118Lys EDNRB mutation, and adults that were homozygous were not found.
• Dominant radial drusen and Arg345Trp EFEMP1 mutation.
• The 5 year overall survival (OS) and event-free survival (EFS) were 94 and 90 +/- 8%, respectively, with a median follow-up of 48 months.
• These data may indicate that formation of parathyroid adenoma in young patients is related to a mechanism involving EGFR.
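Maximum entropy learning
• Log-linear model

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$

  • $f_i$: Binary-valued feature function
  • $\lambda_i$: Weight of the feature
  • $Z(x)$: Normalization over all labels $y$
• Features
  • Bag-of-words
  • Local context
  • Gene/disease name
  • Predicate-argument structures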
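A log-linear model with binary feature functions and one weight per feature can be scored as in the sketch below. The label names, feature names, and weight values are invented for illustration; in practice the weights would be estimated from the training data.

```python
import math


def probabilities(active, weights, labels=("association", "none")):
    """Return p(y | x) for a log-linear model with binary features.

    `active` is the set of feature names that fire on the input x, and
    `weights` maps (feature name, label) pairs to real-valued weights.
    A feature function is 1 when its feature fires with its label, else 0.
    """
    scores = {y: sum(weights.get((f, y), 0.0) for f in active) for y in labels}
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores.values())
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())  # normalization over all labels
    return {y: e / z for y, e in exp_scores.items()}


# Hypothetical weights, for illustration only.
weights = {
    ("pas:accelerates:GENE:DISEASE", "association"): 1.5,
    ("bow:survival", "none"): 0.8,
}
p = probabilities({"pas:accelerates:GENE:DISEASE", "bow:gene"}, weights)
print(p)
```

Features with no learned weight contribute nothing to the score, so the model can mix sparse bag-of-words, local context, and predicate-argument features freely.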