Extracting Semantic Predication from Medline Citations for Pharmacogenomics. C.B. Ahlers¹, M. Fiszman², D.D. Fushman¹, F.M. Lang¹ and T.C. Rindflesch¹. ¹National Center for Biomedical Communications, National Library of Medicine; ²University of Tennessee, USA (PSB 2007 12:209-220).
Abstract • This paper describes an NLP system (Enhanced SemRep) that identifies core assertions on pharmacogenomics in Medline citations. • The system was developed by adapting an existing system and depends on the UMLS. • Preliminary evaluation: 55% recall and 73% precision.
1. Introduction (1/3) • Core research in pharmacogenomics investigates the interaction of genes/proteins with therapeutic substances, for example in oncology treatment. • Current NLP for pharmacogenomics concentrates on co-occurrence information without specifying the exact relations. • Enhanced SemRep complements that approach by representing assertions in text as semantic predications.
1. Introduction (2/3) • Example • "These findings therefore demonstrate that dexamethasone is a potent inducer of multidrug resistance-associated protein expression in rat hepatocytes through a mechanism that seems not to involve the classical glucocorticoid receptor pathway." • Dexamethasone STIMULATES Multidrug Resistance-Associated Proteins • Dexamethasone NEG_INTERACTS_WITH Glucocorticoid receptor • Multidrug Resistance-Associated Proteins PART_OF Rats • Hepatocytes PART_OF Rats
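The predications on this slide are subject-PREDICATE-object triples, with a NEG_ prefix marking negation. A minimal Python sketch of that data structure follows; the class name, the parser, and the KNOWN_PREDICATES set are illustrative assumptions and are not part of SemRep.

```python
from typing import NamedTuple

# Assumption: only the predicates that appear on this slide.
KNOWN_PREDICATES = {"STIMULATES", "INTERACTS_WITH", "PART_OF"}

class Predication(NamedTuple):
    subject: str
    predicate: str
    object: str
    negated: bool = False

def parse_predication(line: str) -> Predication:
    """Split 'Subject PREDICATE Object' at the first known predicate token."""
    tokens = line.split()
    for i, tok in enumerate(tokens):
        negated = tok.startswith("NEG_")
        base = tok[4:] if negated else tok
        if base in KNOWN_PREDICATES:
            subject = " ".join(tokens[:i])
            obj = " ".join(tokens[i + 1:])
            return Predication(subject, base, obj, negated)
    raise ValueError(f"no known predicate in: {line!r}")

p = parse_predication("Dexamethasone NEG_INTERACTS_WITH Glucocorticoid receptor")
print(p)
```

Storing negation as a flag rather than as a separate predicate name keeps negated and non-negated assertions comparable on the same relation.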
1. Introduction (3/3) • Based on two existing systems • SemRep: extracts semantic predications from clinical text. • SemGen: developed from SemRep to identify etiologic relations between genetic phenomena and diseases. • Relations • Genes, drugs, diseases, and population groups. • At the gene level, more specific genetic phenomena (e.g. mutations, single nucleotide polymorphisms, and haplotype information) are not addressed.
2. Background • NLP for Biomedicine • The Unified Medical Language System • SemRep and SemGen
2.1 NLP for Biomedicine (1/2) • Co-occurrence of entities in text (gene-disease relations, Yen et al., 2006; drug-gene, Rindflesch et al., 2000). • Machine learning techniques (gene-disease relations, Chun et al., 2006; drug-gene, Chang et al., 2004). • Syntactic templates and shallow parsing (protein interactions, Blaschke et al., 1999)
2.1 NLP for Biomedicine (2/2) • Enhanced SemRep addresses a wide range of syntactic structures and specific semantic relations pertinent to pharmacogenomics. • Example • STIMULATES • DISRUPTS • CAUSES
2.2 UMLS • Metathesaurus (more than 10^6 concepts) • Concept: fever; • Synonyms: pyrexia, febrile, hyperthermia; • Semantic Type: ‘Finding’ • The Semantic Network defines the allowable relationships between semantic types • ‘Gene or Genome’ PART_OF ‘Cell’ • ‘Pharmacologic Substance’ INTERACTS_WITH ‘Enzyme’ • ‘Disease or Syndrome’ CO-OCCURS_WITH ‘Neoplastic Process’
2.3 SemRep and SemGen (1/2) • SemRep: a rule-based symbolic NLP system. • Example • Phenytoin induced gingival hyperplasia • [[head(noun(phenytoin)), metaconc(‘Phenytoin’:[orch,phsu])], [verb(induced)], [head(noun(‘gingival hyperplasia’)), metaconc(‘Gingival Hyperplasia’:[dsyn])]] • Semantic Network relation: ‘Pharmacologic Substance’ CAUSES ‘Disease or Syndrome’ • Argument identification: Phenytoin CAUSES Gingival Hyperplasia
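The phenytoin example can be sketched in Python as a lookup: the verb indicates a relation, and the relation is accepted only if the Semantic Network allows it between the semantic types of the two arguments. This is a simplified illustration, not SemRep's actual rule machinery; the INDICATORS table and the function name are assumptions, and only the one relation shown on the slide is listed.

```python
# UMLS semantic type abbreviations used in the example.
ABBREV = {
    "orch": "Organic Chemical",
    "phsu": "Pharmacologic Substance",
    "dsyn": "Disease or Syndrome",
}

# Assumption: a hypothetical table mapping indicator verbs to relations.
INDICATORS = {"induced": "CAUSES"}

# The single Semantic Network relation given in the example.
NETWORK = {("Pharmacologic Substance", "CAUSES", "Disease or Syndrome")}

def interpret(subject, verb, obj):
    """subject and obj are (concept name, [semantic type abbreviations])."""
    subj_name, subj_types = subject
    obj_name, obj_types = obj
    relation = INDICATORS.get(verb)
    if relation is None:
        return None
    for s in subj_types:
        for o in obj_types:
            if (ABBREV[s], relation, ABBREV[o]) in NETWORK:
                return (subj_name, relation, obj_name)
    return None

result = interpret(("Phenytoin", ["orch", "phsu"]), "induced",
                   ("Gingival Hyperplasia", ["dsyn"]))
print(result)
```

The 'orch' type does not license CAUSES here; the predication is accepted because 'phsu' does.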
2.3 SemRep and SemGen (2/2) • SemGen: identifies semantic predications on the genetic etiology of disease. • Gene and protein names are identified with ABGene. • Since the UMLS Semantic Network does not cover molecular genetics, new semantic relations were created: • Gene-disease interactions: (ASSOCIATE_WITH, PREDISPOSE, and CAUSE) • Gene-gene interactions: (INHIBIT, STIMULATE, and INTERACTS_WITH)
3. Methods (1/2) • The pharmacogenomics literature was scrutinized to identify relevant predications not identified by either SemRep or SemGen. • 1000 Medline citations containing drug and gene names were retrieved. • 400 sentences were selected, covering genetic (gene-disease), genomic (gene-gene), and pharmacogenomic (drug-gene, drug-genome) relations. Relations between genes and population groups, between diseases and population groups, and pharmacological relations (drug-disease, drug-pharmacological effect, drug-drug) were also scrutinized.
3. Methods (2/2) • After these 400 sentences were processed with SemRep, the errors were analyzed and categorized by cause. • Most errors were due to: • The Semantic Network • Argument identification with “empty” heads • Gene name identification • These findings led to extensive modifications for Enhanced SemRep. • Gene name identification was addressed by adding ABGene to the machinery.
3.1 Modification of Semantic Network for Enhanced SemRep (1/4) • Grouping semantic types: Five broader semantic groups (Substance, Anatomy, Living Being, Process, and Pathology) were defined to permit predications relevant to pharmacogenomics. • Substance: ‘Amino Acid, Peptide, or Protein’, ‘Antibiotic’, ‘Carbohydrate’, ... • Anatomy: ‘Anatomical Structure’, ‘Body Part, Organ, or Organ Component’, ‘Cell’, ‘Gene or Genome’, ‘Neoplastic Process’, ‘Tissue’…
3.1 Modification of Semantic Network for Enhanced SemRep (2/4) • Living Being: ‘Animal’, ‘Archaeon’, ‘Bacterium’, ‘Fungus’, ‘Human’, ‘Invertebrate’, ‘Mammal’, ‘Organism’, ‘Vertebrate’, ‘Virus’ • Process: ‘Acquired Abnormality’, ‘Anatomical Abnormality’, ‘Cell Function’, ‘Cell or Molecular Dysfunction’, ‘Congenital Abnormality’, ‘Laboratory Test Result’… • Pathology: ‘Acquired Abnormality’, ‘Anatomical Abnormality’, ‘Cell or Molecular Dysfunction’, ‘Congenital Abnormality’, ‘Disease or Syndrome’, ‘Injury or Poisoning’, ‘Mental or Behavioral Disorder’, …
3.1 Modification of Semantic Network for Enhanced SemRep (3/4) • Define predications: categories 1-6 • 1: Genetic Etiology • {Substance} ASSOCIATED_WITH OR PREDISPOSES OR CAUSES {Pathology} • 2: Substance Relations • {Substance} INTERACTS_WITH OR INHIBITS OR STIMULATES {Substance} • 3: Pharmacological Effects • {Substance} AFFECTS OR DISRUPTS OR AUGMENTS {Anatomy OR Process}
3.1 Modification of Semantic Network for Enhanced SemRep (4/4) • 4: Clinical Actions • {Substance} ADMINISTERED_TO {Living Being} • {Process} MANIFESTATION_OF {Process} • {Substance} TREATS {Living Being OR Pathology} • 5: Organism Characteristics • {Anatomy OR Living Being} LOCATION_OF {Substance} • {Anatomy} PART_OF {Anatomy OR Living Being} • {Process} PROCESS_OF {Living Being} • 6: Co-existence • {Substance} CO-EXISTS_WITH {Substance} • {Process} CO-EXISTS_WITH {Process}
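The six predication categories can be read as templates over the five semantic groups. The Python sketch below encodes them as data and checks whether a candidate predication fits any template. It is an illustration under stated assumptions: the group memberships are only the types named on the slides (the elided remainder is not filled in), and the function names are hypothetical.

```python
# Group membership: only the semantic types listed on the slides.
GROUPS = {
    "Substance": {"Amino Acid, Peptide, or Protein", "Antibiotic", "Carbohydrate"},
    "Anatomy": {"Anatomical Structure", "Body Part, Organ, or Organ Component",
                "Cell", "Gene or Genome", "Neoplastic Process", "Tissue"},
    "Living Being": {"Animal", "Archaeon", "Bacterium", "Fungus", "Human",
                     "Invertebrate", "Mammal", "Organism", "Vertebrate", "Virus"},
    "Process": {"Acquired Abnormality", "Anatomical Abnormality", "Cell Function",
                "Cell or Molecular Dysfunction", "Congenital Abnormality",
                "Laboratory Test Result"},
    "Pathology": {"Acquired Abnormality", "Anatomical Abnormality",
                  "Cell or Molecular Dysfunction", "Congenital Abnormality",
                  "Disease or Syndrome", "Injury or Poisoning",
                  "Mental or Behavioral Disorder"},
}

# (subject groups, predicates, object groups), one entry per template line.
TEMPLATES = [
    ({"Substance"}, {"ASSOCIATED_WITH", "PREDISPOSES", "CAUSES"}, {"Pathology"}),
    ({"Substance"}, {"INTERACTS_WITH", "INHIBITS", "STIMULATES"}, {"Substance"}),
    ({"Substance"}, {"AFFECTS", "DISRUPTS", "AUGMENTS"}, {"Anatomy", "Process"}),
    ({"Substance"}, {"ADMINISTERED_TO"}, {"Living Being"}),
    ({"Process"}, {"MANIFESTATION_OF"}, {"Process"}),
    ({"Substance"}, {"TREATS"}, {"Living Being", "Pathology"}),
    ({"Anatomy", "Living Being"}, {"LOCATION_OF"}, {"Substance"}),
    ({"Anatomy"}, {"PART_OF"}, {"Anatomy", "Living Being"}),
    ({"Process"}, {"PROCESS_OF"}, {"Living Being"}),
    ({"Substance"}, {"CO-EXISTS_WITH"}, {"Substance"}),
    ({"Process"}, {"CO-EXISTS_WITH"}, {"Process"}),
]

def groups_of(semantic_type):
    """A type may belong to several groups, e.g. 'Acquired Abnormality'."""
    return {g for g, types in GROUPS.items() if semantic_type in types}

def is_allowed(subj_type, predicate, obj_type):
    subj_groups, obj_groups = groups_of(subj_type), groups_of(obj_type)
    return any(
        predicate in preds and bool(subj_groups & subj) and bool(obj_groups & obj)
        for subj, preds, obj in TEMPLATES
    )

print(is_allowed("Antibiotic", "TREATS", "Disease or Syndrome"))
print(is_allowed("Human", "TREATS", "Antibiotic"))
```

Keeping the templates as data rather than as code makes it easy to see how widening a group changes which predications are permitted.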
3.2 Empty Heads • Example: • We saw differential activation of CYP2C9 variants by dapsone (a drug). • The UMLS maps “variant” to the semantic type ‘Qualitative Concept’. • We want “CYP2C9 variant” to be a member of the Substance group. • Several categories of terms are enumerated as semantically empty heads, e.g. allele, mutation, variant, levels, expression… • Words from these lists that have been labeled as heads are hidden, and the word to their left is relabeled as the head.
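The relabeling step can be sketched in a few lines of Python. This is a simplified illustration, not the Enhanced SemRep implementation: the noun phrase is assumed to be a plain list of words with a known head position, the list holds only the five terms named on the slide, and the plural handling is a crude assumption.

```python
# Only the terms named on the slide; the full lists are not reproduced here.
EMPTY_HEADS = {"allele", "mutation", "variant", "levels", "expression"}

def is_empty_head(word):
    w = word.lower()
    # Assumption: strip a trailing "s" so that "variants" matches "variant".
    return w in EMPTY_HEADS or (w.endswith("s") and w[:-1] in EMPTY_HEADS)

def relabel_head(tokens, head_index):
    """Hide an empty head and relabel the word to its left as the head.

    Returns (visible_tokens, new_head_index); the input is unchanged if the
    head is not an empty head or has no word to its left.
    """
    if head_index > 0 and is_empty_head(tokens[head_index]):
        visible = tokens[:head_index] + tokens[head_index + 1:]
        return visible, head_index - 1
    return list(tokens), head_index

print(relabel_head(["CYP2C9", "variants"], 1))
```

After relabeling, the head is CYP2C9, so the phrase can take the gene's semantic type instead of ‘Qualitative Concept’.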
3.3 Evaluation • The test set consisted of 300 sentences randomly selected from the 36,577 sentences containing drug and gene co-occurrences available on the Web site bionlp.stanford.edu/genedrug. • These sentences were annotated by three physicians (CBA, DD-F, MF). • They did not mark up all assertions in the sentences, only those representing a predication defined in Enhanced SemRep. • A total of 850 predications were assigned by the annotators.
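Precision and recall against such annotations can be computed by comparing the set of system predications with the set of annotated ones. The Python sketch below assumes exact matching of (subject, predicate, object) triples, which may differ from the matching criteria the authors used; the system output shown is toy data for illustration, not a result from the paper.

```python
def precision_recall(system, gold):
    """Exact-match precision and recall over predication triples."""
    system, gold = set(system), set(gold)
    true_positives = len(system & gold)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy gold standard: the four predications from the dexamethasone example.
gold = {
    ("Dexamethasone", "STIMULATES", "Multidrug Resistance-Associated Proteins"),
    ("Dexamethasone", "NEG_INTERACTS_WITH", "Glucocorticoid receptor"),
    ("Multidrug Resistance-Associated Proteins", "PART_OF", "Rats"),
    ("Hepatocytes", "PART_OF", "Rats"),
}
# Hypothetical system output: two correct predications and one spurious one.
system = {
    ("Dexamethasone", "STIMULATES", "Multidrug Resistance-Associated Proteins"),
    ("Hepatocytes", "PART_OF", "Rats"),
    ("Dexamethasone", "INTERACTS_WITH", "Glucocorticoid receptor"),
}

precision, recall = precision_recall(system, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```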
5.1 Discussion: Error Analysis (1/2) • Word sense ambiguity (28%) • "Ticlopidine inhibition of phenytoin metabolism mediated by potent inhibition of CYP2C19." • "Inhibition" was wrongly mapped to ‘Psychological Inhibition’. • Resulting false predication: CYP2C19 AFFECTS Psychological Inhibition.
5.1 Discussion: Error Analysis (2/2) • Processing of coordinate structures (35%) • "The cytotoxic activities of mercaptopurine and fluorouracil are regulated by thiopurine methyltransferase (TPMT) and dihydropyrimidine dehydrogenase (DPD), respectively." • Fluorouracil INTERACTS_WITH DPD gene. (○) • Mercaptopurine INTERACTS_WITH thiopurine methyltransferase. (X)
5.2 Process Medline Citations on CYP2D6 (1/3) • 2849 Medline citations contain variant forms of CYP2D6. • 5219 predications containing CYP2D6 as an argument were analyzed according to two predication categories (Genetic Etiology and Substance Relations). • The results were compared with the relations listed for this gene on the PharmGKB (Pharmacogenetics Knowledge Base) Web site.
5.2 Process Medline Citations on CYP2D6 (2/3) • Genetic Etiology • 267 total predications represented CYP2D6 as an etiologic agent for a disease. • Most frequent diseases: Parkinson’s disease (35), carcinoma of the lung (21), tardive dyskinesia (15), Alzheimer’s disease (9), bladder carcinoma (8). • 169 true positives (TP) and 4 false positives (FP); two were found not to contain the disease name in the referenced citation. • Only carcinoma of the lung occurs in PharmGKB.
5.2 Process Medline Citations on CYP2D6 (3/3) • Substance Relations • 1128 total predications involve CYP2D6 and a drug. • 69 drugs occurred 3 or more times in those predications; 41 of them were in PharmGKB and 28 were not. • 68 were true positives. • Inhibit CYP2D6: quinidine (45), paroxetine (34), fluoxetine (27), fluvoxamine (8), sertraline (8). • Quinidine and sertraline are not in PharmGKB. • Interact_with CYP2D6: bufuralol (27), antipsychotic agents (25), dextromethorphan (21), venlafaxine (19), debrisoquin (18). • Bufuralol is not in PharmGKB. • SemRep failed to capture: cocaine, levomepromazine, maprotiline, trazodone, and yohimbine.
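The tally on this slide (drugs occurring 3 or more times, split by whether they are listed in PharmGKB) can be sketched in Python as follows. The function names are hypothetical, the reference set is a hand-typed stand-in for a PharmGKB lookup that assumes the three inhibitors not flagged on the slide are listed, and "hypothetical_drug" is invented solely to show the frequency threshold.

```python
from collections import Counter

def frequent_drugs(predications, min_count=3):
    """Count drug subjects and keep those seen at least min_count times."""
    counts = Counter(drug for drug, _predicate, _gene in predications)
    return {drug: n for drug, n in counts.items() if n >= min_count}

def split_by_reference(drugs, reference):
    """Return (drugs in the reference list, drugs not in it), each sorted."""
    known = sorted(d for d in drugs if d in reference)
    novel = sorted(d for d in drugs if d not in reference)
    return known, novel

# Inhibitor counts taken from the slide.
slide_counts = {"quinidine": 45, "paroxetine": 34, "fluoxetine": 27,
                "fluvoxamine": 8, "sertraline": 8}
predications = [(drug, "INHIBITS", "CYP2D6")
                for drug, n in slide_counts.items() for _ in range(n)]
# Invented entry below the threshold, for illustration only.
predications += [("hypothetical_drug", "INHIBITS", "CYP2D6")] * 2

# Stand-in for a PharmGKB lookup (assumption, see lead-in).
reference = {"paroxetine", "fluoxetine", "fluvoxamine"}

known, novel = split_by_reference(frequent_drugs(predications), reference)
print("in reference:", known)
print("not in reference:", novel)
```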
6. Conclusion • This paper applies an existing NLP system to the pharmacogenomics domain. • The major changes in developing Enhanced SemRep from SemRep involved modifying the semantic space stipulated by the UMLS Semantic Network. • The outputs are semantic predications that represent assertions from Medline citations expressing a range of specific relations in pharmacogenomics. • This information can support advanced information management applications for pharmacogenomics research and clinical care. • In the future, the authors intend to adapt the summarization and visualization techniques developed for clinical text.