TM and NLP for Biology Research Issues in HPSG Parsing

TM and NLP for Biology: Research Issues in HPSG Parsing • Junichi TSUJII • Department of Computer Science, School of Information Science and Technology, University of Tokyo, JAPAN • School of Computer Science, National Centre for Text Mining, University of Manchester, UK

Increase in Medline [Figure: yearly increments (axis up to 600,000) and accumulated total (axis up to 14,000,000) of MEDLINE articles, by year, 1964 to 2002] • Articles added to MEDLINE alone: more than 0.5 million per year, more than 1.3 thousand per day • Medline access: 0.163 M accesses/month in 1997, 82.027 M accesses/month in 2006 (500 times more) • G-protein coupled receptor [D.L.Banville 2006]: 9 papers before 1988, 256 papers in 1992, 14,000 papers in 2005

NaCTeM (www.nactem.ac.uk) • First such centre in the world • Funding: JISC, BBSRC, EPSRC • Consortium investment • Chair in TM (Prof. J. Tsujii, Univ. Tokyo) • Location: Manchester Interdisciplinary Biocentre (MIB), www.mib.ac.uk, funded by the Wellcome Trust • Initial focus: biomedical academic community • Extend services to industry • Extend focus to other domains (social sciences)

Consortium • Universities of Manchester, Liverpool • Service activity run by MIMAS (National Centre for Dataset Services), within MC (Manchester Computing) • Self-funded partners • San Diego Supercomputing Center • University of California, Berkeley • University of Geneva • University of Tokyo • Strong industrial & academic support • IBM, AZ, EBI, Wellcome Trust, Sanger Institute, Unilever, NowGEN, MerseyBio, …

NLP and TM: Linking text with knowledge • Natural Language Processing: language as a complex system linking surface strings of characters with their meanings; text and words as structured objects • Text Mining: text as a bag of words; words as surface strings • NLP-based TM

From surface diversities and ambiguities to conceptual invariants • Language Domain: linguistic expressions • Knowledge Domain: concepts and relationships among them, motivated independently of language • Non-trivial mappings between the two domains: terminology, parsing, paraphrasing

Example

[A] protein activates [B] (Pathway extraction) • Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene. • Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription. • Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to ….. • Retrieval using Regional Algebra: [sentence] > ([arg1_activate] > [protein]) • Non-trivial mapping: the same relation is expressed with different structures • Language Domain / Knowledge Domain (motivated independently of language)
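The regional-algebra query above can be read as nested containment: keep the sentences that contain an arg1-of-activate region that itself contains a protein region. The following is a minimal sketch of that reading, assuming regions are plain character-offset pairs; the function name `containing` and all offsets are hypothetical and are not taken from MEDIE or any retrieval engine.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) character offsets, end exclusive


def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Return the regions of `outer` that contain at least one region of `inner`.

    This plays the role of the `>` operator in the query (an assumption
    about its semantics: "A > B" selects the A regions containing some B).
    """
    return [o for o in outer
            if any(o[0] <= i[0] and i[1] <= o[1] for i in inner)]


# Hypothetical annotations over a toy two-sentence text.
sentences = [(0, 40), (41, 90)]
arg1_activate = [(5, 25), (60, 70)]   # spans filling arg1 of "activate"
protein = [(10, 18), (75, 80)]        # spans tagged as protein names

# [sentence] > ([arg1_activate] > [protein])
hits = containing(sentences, containing(arg1_activate, protein))
print(hits)
```

Only the first sentence is returned: its arg1 span contains a protein span, while in the second sentence the protein lies outside the arg1 span.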

Predicate-argument structure: Parser based on Probabilistic HPSG (Enju) [Figure: parse tree with S, VP, NP and ADVP nodes and predicate-argument links arg1, arg2, arg3 for the sentence "p53 has been shown to directly activate the Bcl-2 protein"]

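Semantic Retrieval System Using Deep Syntax: MEDIE • Output of the probabilistic HPSG parser with predicate-argument structures (Enju) • Constructions handled: passive; passive and infinitival clause [Figure: parse tree (s, vp, np, pp, dt nodes) with arg1, arg2 and mod links for the sentence "The/DT protein/NN is/VBZ activated/VBN by/IN it/PRP"]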

Demos • MEDIE • Info-PubMed


Performance of Semantic Parser

Scalability of TM Tools • Target corpus: MEDLINE • Suppose, for example, that parsing one sentence takes one second: the whole corpus would take 70 million seconds, that is, about 2 years

TM and GRID • Solution • The entire MEDLINE was parsed on distributed PC clusters with 340 CPUs in total • Parallel processing was managed by the grid platform GXP [Taura2004] • Experiments • The entire MEDLINE was parsed in 8 days • Output • Syntactic parse trees and predicate-argument structures in XML format • The compressed output was 42.5 GB; uncompressed, 260 GB.
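The back-of-the-envelope figures on these two slides can be checked with a few lines of arithmetic. This sketch assumes the "70 million seconds" corresponds to 70 million sentences at the slide's hypothetical one second per sentence, and it assumes perfect load balancing with no scheduling overhead; neither assumption is a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sequential_years(num_sentences: int, sec_per_sentence: float = 1.0) -> float:
    """Wall-clock years on a single CPU."""
    return num_sentences * sec_per_sentence / (SECONDS_PER_DAY * 365)


def ideal_parallel_days(num_sentences: int, num_cpus: int,
                        sec_per_sentence: float = 1.0) -> float:
    """Wall-clock days on `num_cpus` CPUs, assuming perfect speed-up."""
    return num_sentences * sec_per_sentence / num_cpus / SECONDS_PER_DAY


years = sequential_years(70_000_000)          # single machine
days = ideal_parallel_days(70_000_000, 340)   # 340 CPUs, idealised
print(f"{years:.1f} years sequentially, {days:.1f} days on 340 CPUs (ideal)")
```

The sequential estimate comes out slightly above two years, matching the slide. The idealised parallel estimate is a lower bound of a few days, the same order of magnitude as the 8 days reported for the real run.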

Efficient Parsing for HPSG

Background: HPSG • Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994] • Lexicalized, constraint-based grammar • A few rule schemata: general constraints on linguistic constructions • Constraints embedded in the lexicon: word-specific constraints • Constraints between phrase structures and semantic structures

Parsing by HPSG I like it

Parsing by HPSG: Assignment of Lexical Entries • I: HEAD noun, SUBJ < >, COMPS < > • like: HEAD verb, SUBJ <NP>, COMPS <NP> • it: HEAD noun, SUBJ < >, COMPS < >

Application of Rule Schema: Head-Complement [Figure: "like" (HEAD verb, SUBJ <1>, COMPS <2>) combines with its complement "it" (tagged 2: HEAD noun, SUBJ < >, COMPS < >), yielding a phrase with HEAD verb, SUBJ <1>, COMPS < >; "I" (HEAD noun, SUBJ < >, COMPS < >) is not yet attached]

Application of Rule Schema: Subject-Head [Figure: "I" (tagged 1: HEAD noun, SUBJ < >, COMPS < >) combines with the phrase "like it" (HEAD verb, SUBJ <1>, COMPS < >), yielding a sentence with HEAD verb, SUBJ < >, COMPS < >]
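The three steps above (lexical entry assignment, Head-Complement, Subject-Head) can be sketched in a few lines. This is a deliberately simplified toy: real HPSG parsers such as Enju unify typed feature structures with structure sharing, whereas the sketch below only compares category labels. The `Sign` class, the `category` helper and the lexicon are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sign:
    phon: str
    head: str                     # "noun" or "verb"
    subj: Tuple[str, ...] = ()    # categories still required as subject
    comps: Tuple[str, ...] = ()   # categories still required as complements


def category(sign: Sign) -> Optional[str]:
    """In this toy model a saturated noun sign counts as an NP."""
    if sign.head == "noun" and not sign.subj and not sign.comps:
        return "NP"
    return None


def head_complement(head: Sign, comp: Sign) -> Optional[Sign]:
    """Head-Complement schema: discharge the first COMPS requirement."""
    if head.comps and category(comp) == head.comps[0]:
        return Sign(f"{head.phon} {comp.phon}", head.head,
                    head.subj, head.comps[1:])
    return None


def subject_head(subj: Sign, head: Sign) -> Optional[Sign]:
    """Subject-Head schema: discharge SUBJ once COMPS is empty."""
    if not head.comps and head.subj and category(subj) == head.subj[0]:
        return Sign(f"{subj.phon} {head.phon}", head.head,
                    head.subj[1:], head.comps)
    return None


# Lexical entries as given on the slides.
LEXICON = {
    "I": Sign("I", "noun"),
    "it": Sign("it", "noun"),
    "like": Sign("like", "verb", subj=("NP",), comps=("NP",)),
}

vp = head_complement(LEXICON["like"], LEXICON["it"])  # "like it"
s = subject_head(LEXICON["I"], vp)                    # "I like it"
print(s)
```

The result is a verb-headed sign with empty SUBJ and COMPS, that is, a saturated sentence. Applying Subject-Head before Head-Complement fails in this sketch because the verb still has an undischarged complement.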

Inefficiency of HPSG Parsing • Complex DAGs: typed feature structures • Abstract machine for unification (LiLFeS) • Unification: expensive operation (⇔ CFG approximation: CFG filtering) • Assignment of lexical entries • High reduction of search space / supertagging

Filtering with CFG (1/5) • Two-phase parsing • Approximate the HPSG with a CFG while keeping important constraints. • The resulting CFG may over-generate, but it can still be used for filtering. • Rewriting in a CFG is far less expensive than applying rule schemata, principles and so on. [Figure: the HPSG feature structures are compiled into a CFG; input sentences pass through the built-in CFG parser and then LiLFeS unification parsing, which outputs complete parse trees]
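The two-phase idea can be illustrated with a toy filter: a cheap CKY recognizer over atomic symbols discards lexical assignments that cannot form a sentence, so that only the survivors need to be passed to the expensive unification-based parser. In the sketch below the CFG is written by hand and the symbol names (`V_trans`, `V_intrans`) are hypothetical. In the approach described on the slide the CFG is compiled from the HPSG itself, which this sketch does not attempt.

```python
from itertools import product

# Hand-written stand-in for a compiled CFG approximation (an assumption):
# each atomic symbol abbreviates a feature structure.
BINARY_RULES = {
    ("V_trans", "NP"): "VP",   # approximates the Head-Complement schema
    ("NP", "VP"): "S",         # approximates the Subject-Head schema
}


def cfg_accepts(symbols, goal="S"):
    """CKY recognition over atomic symbols (no unification involved)."""
    n = len(symbols)
    chart = {(i, i + 1): {sym} for i, sym in enumerate(symbols)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            cell = set()
            for j in range(i + 1, k):
                for left, right in product(chart[(i, j)], chart[(j, k)]):
                    parent = BINARY_RULES.get((left, right))
                    if parent:
                        cell.add(parent)
            chart[(i, k)] = cell
    return goal in chart[(0, n)]


def filter_assignments(candidates):
    """Phase 1: keep only assignments the CFG can recognise as a sentence."""
    return [c for c in candidates if cfg_accepts(c)]


# Candidate lexical assignments for "I like it".
candidates = [
    ["NP", "V_trans", "NP"],
    ["NP", "V_intrans", "NP"],
    ["NP", "NP", "NP"],
]
survivors = filter_assignments(candidates)
print(survivors)  # phase 2 (unification parsing) would run on these only
```

Because the CFG may over-generate, a survivor is not guaranteed to have a full HPSG parse; the filter only guarantees that the discarded assignments would have failed.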

System Overview • Input sentence: I like it • Components shown: Supertagger, CFG Filtering, Deterministic Shift/Reduce Parser [Figure: candidate sequences of lexical entries for "I like it", listed from high probability (P: High) downwards; each sequence assigns HEAD noun, SUBJ < >, COMPS < > entries to "I" and "it" and a HEAD verb, SUBJ <NP>, COMPS <NP> entry to "like"]

Experiment Results 6 times faster 20 times faster than the initial model

Domain/Text Type Adaptation

Adaptation with Reference Distribution Lexical Assignment Syntactic Preference Feature function Feature weight Original model
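The formula these labels annotate did not survive in the transcript. A common way to adapt a log-linear model with a reference distribution, assumed here, is to rescale the original model's probability by new weighted features, p(t|s) ∝ p_original(t|s) · exp(Σ_j λ_j f_j(t, s)), and train only the new weights λ_j on the target domain. The sketch below shows the scoring step only; the candidate names, probabilities and the feature `bio_noun_compound` are invented for illustration.

```python
import math


def adapt_with_reference(candidates, weights):
    """Rescore candidates as p_original * exp(sum_j weight_j * feature_j).

    candidates: list of (label, p_original, {feature_name: value}).
    weights:    {feature_name: weight} learned on the target domain.
    Returns a normalised distribution over the candidate labels.
    """
    scores = []
    for label, p_orig, feats in candidates:
        s = sum(weights.get(name, 0.0) * value for name, value in feats.items())
        scores.append((label, p_orig * math.exp(s)))
    z = sum(score for _, score in scores)
    return {label: score / z for label, score in scores}


# Two hypothetical parses of one sentence, scored by the original model.
candidates = [
    ("parse_a", 0.7, {"bio_noun_compound": 0.0}),
    ("parse_b", 0.3, {"bio_noun_compound": 1.0}),
]
weights = {"bio_noun_compound": 1.5}

probs = adapt_with_reference(candidates, weights)
print(probs)
```

With all weights at zero the adapted model reduces to the original one, which is why the original model can serve as a starting point that only needs domain-specific corrections.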

[Figure: F-score (axis 83 to 90) against the number of sentences in the GENIA training set (0 to 8000). Series: Baseline (PTB); Simple Retraining (GENIA); Retraining (GENIA+PTB); Structure with RefDist; Lexical with RefDist; Lexical/Structure with RefDist]

[Figure: F-score (axis 83 to 90) against training time in seconds (0 to 30000). Series: Retraining (GENIA); Structure with RefDist; Lexicon with RefDist; Lex/Str with RefDist]

Tool 1: POS Tagger • General-purpose POS taggers, trained on WSJ • Brill's tagger, TnT tagger, MXPOST, etc. • 97% • General-purpose POS taggers do not work well for MEDLINE abstracts • Example: The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …
