
Extracting Disease-Gene Associations from MEDLINE abstracts




Presentation Transcript


  1. Extracting Disease-Gene Associations from MEDLINE abstracts. Tsujii laboratory, University of Tokyo

  2. Outline
     • NLP tools
       • Part-of-speech tagger, HPSG parser
     • Machine learning based approach for extracting Disease-Gene Association
     • Evaluation
       • Precision / recall / f-score
       • Effectiveness of predicate-argument structures
     • DGA explorer
     • Annotation tool

  3. Part-of-speech tagger
     • Trained on a corpus containing newspaper articles and biology texts.

  4. HPSG parser
     • Output
       • Phrase structures (e.g. np, vp, pp)
       • Predicate-argument structures
     • Example: "We demonstrate that E2F-1 activates the promoter."
       demonstrate: ARG1 = we, ARG2 = [1]
       [1] activates: ARG1 = E2F-1, ARG2 = promoter
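
One way to picture the predicate-argument output is as nested records, where an argument may itself be another predicate. The sketch below is illustrative only: the `Predicate` class and `flatten` helper are hypothetical names, not the parser's actual output format.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Tuple, Union


@dataclass
class Predicate:
    """One predicate with its labelled arguments (hypothetical container)."""
    lemma: str
    args: Dict[str, Union[str, "Predicate"]] = field(default_factory=dict)


# "We demonstrate that E2F-1 activates the promoter."
activates = Predicate("activates", {"ARG1": "E2F-1", "ARG2": "promoter"})
demonstrate = Predicate("demonstrate", {"ARG1": "we", "ARG2": activates})


def flatten(pred: Predicate) -> Iterator[Tuple[str, str, str]]:
    """Yield (predicate, role, filler) triples, descending into nested predicates."""
    for role, filler in pred.args.items():
        if isinstance(filler, Predicate):
            yield (pred.lemma, role, filler.lemma)
            yield from flatten(filler)
        else:
            yield (pred.lemma, role, filler)


triples = list(flatten(demonstrate))
```

Flattening to triples makes the shared structure explicit: the ARG2 of "demonstrate" is the whole "activates" clause.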

  5. Parsing MEDLINE
     • Corpus
       • 1,500,000 MEDLINE abstracts
     • Parsing speed
       • 5 secs / sentence
     • Server
       • PC cluster (100 processors)
     • Time
       • 10 days
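
As a rough consistency check on these figures, the arithmetic below works out how many sentences the cluster could parse in the stated time. It assumes perfect parallelism across all 100 processors and no idle time, which the slide does not state, so the result is only an upper-bound estimate.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the slide.
abstracts = 1_500_000
secs_per_sentence = 5
processors = 100
days = 10

# Assumption: every processor is busy for the full 10 days.
cpu_seconds = days * SECONDS_PER_DAY * processors
sentences_parsed = cpu_seconds / secs_per_sentence
sentences_per_abstract = sentences_parsed / abstracts

print(f"{sentences_parsed:,.0f} sentences, "
      f"about {sentences_per_abstract:.1f} per abstract")
```

Under that assumption the numbers imply roughly 11.5 sentences per abstract, which is a plausible abstract length.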

  6. Extracting Disease-Gene Association
     • Preliminary experiments
       • Patterns on predicate-argument structures
         accelerates: ARG1 = GENE, ARG2 = DISEASE
         …
         demonstrates: ARG1 = DISEASE, ARG2 = GENE
       • Result: low recall and precision
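
A minimal sketch of this kind of pattern matching, assuming each parsed clause has already been reduced to a predicate plus its two arguments and that a separate step has tagged entity types. The `Pattern` type, the `match` function, and the placeholder names `geneX` and `diseaseY` are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Pattern(NamedTuple):
    predicate: str
    arg1_type: str
    arg2_type: str


# The two patterns shown on the slide.
PATTERNS = [
    Pattern("accelerates", "GENE", "DISEASE"),
    Pattern("demonstrates", "DISEASE", "GENE"),
]


def match(pred: str, arg1: str, arg2: str,
          entity_types: Dict[str, str]) -> List[Pattern]:
    """Return every pattern whose predicate and argument entity types fit."""
    t1 = entity_types.get(arg1)  # None when the argument is not a tagged entity
    t2 = entity_types.get(arg2)
    return [p for p in PATTERNS
            if (p.predicate, p.arg1_type, p.arg2_type) == (pred, t1, t2)]


# Toy input with placeholder entity names, not real data.
entity_types = {"geneX": "GENE", "diseaseY": "DISEASE"}
hits = match("accelerates", "geneX", "diseaseY", entity_types)
misses = match("inhibits", "geneX", "diseaseY", entity_types)
```

Exact matching like this accepts only predicates that were listed in advance, which is one likely reason such hand-written patterns give low recall.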

  7. Machine learning based approach
     • Extracted association
     • Sentence selection
     • The patterns on predicate-argument structures are used as features for machine learning.
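
To illustrate how a matched pattern can become a learning feature instead of a hard rule, the sketch below names one binary feature per pattern and maps the active ones onto a fixed vocabulary. The feature-name format and both helper functions are assumptions for illustration, not the authors' encoding.

```python
from typing import Iterable, List, Set, Tuple

PatternMatch = Tuple[str, str, str]  # (predicate, ARG1 type, ARG2 type)


def pattern_features(matches: Iterable[PatternMatch]) -> Set[str]:
    """Name one binary feature per matched pattern; a feature is on if its name is in the set."""
    return {f"pas:{pred}(ARG1={a1},ARG2={a2})" for pred, a1, a2 in matches}


def to_vector(active: Set[str], vocabulary: List[str]) -> List[int]:
    """Map active feature names onto a fixed, ordered vocabulary as 0/1 values."""
    return [1 if name in active else 0 for name in vocabulary]


vocabulary = [
    "pas:accelerates(ARG1=GENE,ARG2=DISEASE)",
    "pas:demonstrates(ARG1=DISEASE,ARG2=GENE)",
]
active = pattern_features([("accelerates", "GENE", "DISEASE")])
vector = to_vector(active, vocabulary)
```

With this encoding a classifier can weigh each pattern as evidence alongside other features, so a sentence no longer has to match a pattern exactly to be selected.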

  8. Training data
     • The latter is also implied by fibroblast-associated alterations in tumor cell morphology and ECM distribution in the system.
     • Lung fibrosis is a fatal condition of excess extracellular matrix (ECM) deposition associated with increased transforming growth factor beta (TGF-beta) activity.
     • All foals with OLWS were homozygous for the Ile118Lys EDNRB mutation, and adults that were homozygous were not found.
     • Dominant radial drusen and Arg345Trp EFEMP1 mutation.
     • The 5 year overall survival (OS) and event-free survival (EFS) were 94 and 90 +/- 8%, respectively, with a median follow-up of 48 months.
     • These data may indicate that formation of parathyroid adenoma in young patients is related to a mechanism involving EGFR.

  9. Maximum entropy learning
     • Log-linear model
     • Features
       • Bag-of-words
       • Local context
       • Gene/disease name
       • Predicate-argument structures
     • Model terms: binary-valued feature function; weight of the feature
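
The model equation itself did not survive in the transcript. For reference, the standard conditional form of a maximum entropy (log-linear) classifier is given below. The symbols are assumptions chosen to match the two surviving labels: $f_i$ for a binary-valued feature function, $\lambda_i$ for the weight of that feature, $x$ for the input sentence, and $y$ for its label.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big)
```

Here $Z(x)$ normalizes the scores so that the probabilities over all labels sum to one.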
