TM and NLP for Biology: Research Issues in HPSG Parsing. Junichi TSUJII. Department of Computer Science, School of Information Science and Technology, University of Tokyo, Japan. School of Computer Science, National Centre for Text Mining, University of Manchester, UK.
Increase in Medline
[Figure: yearly increments (axis 0 to 600,000) and accumulation (axis 0 to 14,000,000) of MEDLINE articles by year, 1964 to 2002.]
• Articles added (MEDLINE alone): more than 0.5 million per year, more than 1.3 thousand per day
• Medline access: 0.163 M accesses/month in 1997; 82.027 M accesses/month in 2006 (500 times more)
• G-protein coupled receptor [D.L.Banville 2006]: 9 papers before 1988; 256 papers in 1992; 14,000 papers in 2005
NaCTeM (www.nactem.ac.uk)
• First such centre in the world
• Funding: JISC, BBSRC, EPSRC
• Consortium investment
• Chair in TM (Prof. J. Tsujii, Univ. Tokyo)
• Location: Manchester Interdisciplinary Biocentre (MIB), www.mib.ac.uk, funded by the Wellcome Trust
• Initial focus: biomedical academic community
• Extend services to industry
• Extend focus to other domains (social sciences)
Consortium
• Universities of Manchester, Liverpool
• Service activity run by MIMAS (National Centre for Dataset Services), within MC (Manchester Computing)
• Self-funded partners: San Diego Supercomputing Center; University of California, Berkeley; University of Geneva; University of Tokyo
• Strong industrial & academic support: IBM, AZ, EBI, Wellcome Trust, Sanger Institute, Unilever, NowGEN, MerseyBio, …
NLP and TM: Linking text with knowledge
• Natural Language Processing: language as a complex system linking surface strings of characters with their meanings; text and words as structured objects
• Text Mining: text as a bag of words; words as surface strings
• NLP-based TM
From surface diversities and ambiguities to conceptual invariants
[Figure: non-trivial mappings (terminology, parsing, paraphrasing) between the language domain (linguistic expressions) and the knowledge domain (concepts and relationships among them, motivated independently of language).]
[A] protein activates [B] (Pathway extraction)
• Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
• Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
• Full-length Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
Retrieval using Regional Algebra: [sentence] > ([arg1_activate] > [protein])
Non-trivial mapping: the same relation appears with different structures in the language domain, while the knowledge domain is motivated independently of language.
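The containment query on this slide can be illustrated with a minimal Python sketch. It assumes that `>` means "contains", that each annotation is a character-offset region, and it uses made-up offsets; the helper names `contains` and `containing` are hypothetical and do not come from MEDIE or any region-algebra library.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) character offsets, end exclusive


def contains(outer: Region, inner: Region) -> bool:
    """True when `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def containing(outers: List[Region], inners: List[Region]) -> List[Region]:
    """Assumed meaning of '>': keep outer regions that contain some inner region."""
    return [o for o in outers if any(contains(o, i) for i in inners)]


# Hypothetical annotations over a two-sentence text
sentences = [(0, 40), (41, 90)]
arg1_activate = [(0, 12), (60, 70)]   # spans filling arg1 of "activate"
proteins = [(5, 12), (75, 80)]        # spans tagged as proteins

# [sentence] > ([arg1_activate] > [protein])
hits = containing(sentences, containing(arg1_activate, proteins))
print(hits)
```

Only the first sentence is returned here: its arg1 span contains a protein, while the protein in the second sentence lies outside the arg1 span.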
Predicate-argument structure: Parser based on Probabilistic HPSG (Enju)
[Figure: parse tree (S, VP, NP, ADVP nodes) with predicate-argument links arg1, arg2 and arg3 over the sentence "p53 has been shown to directly activate the Bcl-2 protein".]
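Passive; Passive and Infinitival Clause
Predicate/argument structure: output of the probabilistic HPSG parser (Enju)
Semantic Retrieval System Using Deep Syntax: MEDIE
[Figure: parse tree (s, vp, np, pp, dt nodes) with arg1, arg2 and mod links over the tagged sentence The/DT protein/NN is/VBZ activated/VBN by/IN it/PRP.]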
Demos • MEDIE • Info-PubMed
Scalability of TM Tools
• Target corpus: MEDLINE
• Suppose, for example, that parsing one sentence takes one second. Parsing the whole corpus then takes 70 million seconds, that is, about 2 years.
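The estimate on this slide is plain arithmetic, shown here as a short Python check. The only inputs are the slide's own figures (one second per sentence, 70 million seconds in total); the constant names are illustrative.

```python
SECONDS_PER_SENTENCE = 1.0        # the slide's supposition
TOTAL_SECONDS = 70_000_000        # the slide's total

# Implied corpus size under the one-second-per-sentence supposition
sentences = TOTAL_SECONDS / SECONDS_PER_SENTENCE

# Convert to years on a single CPU (365-day years)
years = TOTAL_SECONDS / (365 * 24 * 3600)

print(f"{sentences:,.0f} sentences, {years:.2f} years on one CPU")
```

The result is a little over 2.2 years, which matches the slide's "about 2 years".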
TM and GRID
• Solution
  • The entire MEDLINE was parsed by distributed PC clusters consisting of 340 CPUs
  • Parallel processing was managed by the grid platform GXP [Taura2004]
• Experiments
  • The entire MEDLINE was parsed in 8 days
• Output
  • Syntactic parse trees and predicate-argument structures in XML format
  • The compressed and uncompressed output sizes were 42.5 GB and 260 GB
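As a rough sanity check on these figures, the sketch below compares the reported 8 days with the ideal time for perfectly parallel work on 340 CPUs. It assumes the 70-million-second sequential estimate from the previous slide, which was itself only a supposition, so the comparison is indicative at best.

```python
SEQUENTIAL_SECONDS = 70_000_000   # supposition from the scalability slide
CPUS = 340
REPORTED_DAYS = 8
SECONDS_PER_DAY = 24 * 3600

# Ideal wall-clock time if the work split evenly with no overhead
ideal_days = SEQUENTIAL_SECONDS / CPUS / SECONDS_PER_DAY

# Ratio of uncompressed to compressed output size (260 GB vs 42.5 GB)
compression_ratio = 260 / 42.5

print(f"ideal: {ideal_days:.2f} days, reported: {REPORTED_DAYS} days")
print(f"compression ratio: {compression_ratio:.2f}")
```

The ideal figure is roughly 2.4 days. The gap to the reported 8 days would be consistent with an average parsing time above one second per sentence, with scheduling overhead, or both; the slides do not say which.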
Background: HPSG
• Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994]
• Lexicalized and constraint-based grammar
• A few rule schemata: general constraints on linguistic constructions
• Constraints embedded in the lexicon: word-specific constraints
• Constraints between phrase structures and semantic structures
Parsing by HPSG
Input sentence: I like it
Parsing by HPSG: Assignment of Lexical Entries
• I: [HEAD noun, SUBJ < >, COMPS < >]
• like: [HEAD verb, SUBJ <NP>, COMPS <NP>]
• it: [HEAD noun, SUBJ < >, COMPS < >]
Application of Rule Schema: Head-Complement
• Head daughter "like": [HEAD verb, SUBJ <1>, COMPS <2>]
• Complement daughter "it": 2 [HEAD noun, SUBJ < >, COMPS < >]
• Mother "like it": [HEAD verb, SUBJ <1>, COMPS < >]
Application of Rule Schema: Subject-Head
• Subject daughter "I": 1 [HEAD noun, SUBJ < >, COMPS < >]
• Head daughter "like it": [HEAD verb, SUBJ <1>, COMPS < >]
• Mother "I like it": [HEAD verb, SUBJ < >, COMPS < >]
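The two schema applications for "I like it" can be sketched in Python. This is a deliberately simplified illustration: it replaces typed feature structure unification with an atomic category check, and the `Sign` class and function names are hypothetical, not part of Enju or LiLFeS.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sign:
    phon: str
    head: str
    subj: Tuple[str, ...]    # categories still required as subject
    comps: Tuple[str, ...]   # categories still required as complements


def category(sign: Sign) -> Optional[str]:
    """Toy stand-in for unification: a saturated noun sign counts as NP."""
    if sign.head == "noun" and not sign.subj and not sign.comps:
        return "NP"
    return None


def head_complement(head: Sign, comp: Sign) -> Optional[Sign]:
    """Head-Complement schema: the head consumes its first COMPS element."""
    if head.comps and category(comp) == head.comps[0]:
        return Sign(f"{head.phon} {comp.phon}", head.head, head.subj, head.comps[1:])
    return None


def subject_head(subj: Sign, head: Sign) -> Optional[Sign]:
    """Subject-Head schema: a COMPS-saturated head consumes its SUBJ element."""
    if not head.comps and head.subj and category(subj) == head.subj[0]:
        return Sign(f"{subj.phon} {head.phon}", head.head, head.subj[1:], ())
    return None


# Lexical entries as assigned on the slide
w_i = Sign("I", "noun", (), ())
w_like = Sign("like", "verb", ("NP",), ("NP",))
w_it = Sign("it", "noun", (), ())

vp = head_complement(w_like, w_it)   # "like it": COMPS now empty
s = subject_head(w_i, vp)            # "I like it": SUBJ and COMPS both empty
print(vp)
print(s)
```

A real HPSG parser shares structure between the daughters and the mother through unification, which is what the boxed tags 1 and 2 on the slides express; the category check above only mimics the outcome for this one sentence.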
Inefficiency of HPSG Parsing
• Complex DAGs: typed feature structures
• Abstract machine for unification (LiLFeS)
• Unification: expensive operation (⇔ CFG approximation: CFG filtering)
• Assignment of lexical entries
• High reduction of search space / supertagging
Filtering with CFG (1/5)
• Two-phase parsing
• Approximate the HPSG with a CFG while keeping important constraints.
• The resulting CFG might over-generate, but it can still be used for filtering.
• Rewriting with a CFG is far less expensive than applying rule schemata, principles and so on.
[Figure: the HPSG (feature structures) is compiled into a CFG. Input sentences go through the built-in CFG parser and then through LiLFeS unification parsing, which outputs complete parse trees.]
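The two-phase idea can be sketched with a small CKY recognizer standing in for the cheap first phase. The grammar below is a made-up, deliberately over-generating CFG (note the `NP NP -> NP` rule); it is not the CFG compiled from the actual HPSG, and the expensive unification phase is not implemented.

```python
from itertools import product

# Hypothetical CFG approximation: lexicon and binary rules
LEXICON = {"I": {"NP"}, "like": {"V"}, "it": {"NP"}}
RULES = {
    ("V", "NP"): {"VP"},
    ("NP", "VP"): {"S"},
    ("NP", "NP"): {"NP"},   # over-generation is tolerated in a filter
}


def cky(words):
    """Phase 1: chart[(i, j)] holds the CFG categories that span words[i:j]."""
    n = len(words)
    chart = {(i, i + 1): set(LEXICON.get(w, ())) for i, w in enumerate(words)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = set()
            for k in range(i + 1, j):
                for left, right in product(chart[(i, k)], chart[(k, j)]):
                    cell |= RULES.get((left, right), set())
            chart[(i, j)] = cell
    return chart


def passes_filter(words, goal="S"):
    """Only input that the CFG accepts is worth sending to unification."""
    if not words:
        return False
    return goal in cky(words)[(0, len(words))]


print(passes_filter("I like it".split()))
print(passes_filter("like I it".split()))
```

In a fuller version the second phase would attempt unification only for the spans and categories that survive in the chart, which is where the saving over unification-only parsing would come from.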
System Overview
[Figure: processing of the input sentence "I like it" by three components: Supertagger, CFG Filtering, and Deterministic Shift/Reduce Parser. The figure shows several candidate sequences of lexical entries for the sentence, ordered by probability P (high first). Each sequence assigns I: [HEAD noun, SUBJ < >, COMPS < >], like: [HEAD verb, SUBJ <NP>, COMPS <NP>], it: [HEAD noun, SUBJ < >, COMPS < >].]
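One plausible reading of this overview is that candidate lexical-entry sequences are tried in descending probability until one passes the filter, and only that sequence is handed to the parser. The sketch below illustrates that reading only; the slide does not state the control flow, and the tag names, probabilities and `accept` function are all invented for the example.

```python
import math
from itertools import product


def ranked_sequences(candidates):
    """Enumerate tag sequences with their probabilities, most probable first.

    `candidates` holds one list of (tag, probability) pairs per word.
    Brute-force enumeration is fine for a toy example only.
    """
    seqs = [
        (math.prod(p for _, p in combo), tuple(t for t, _ in combo))
        for combo in product(*candidates)
    ]
    return sorted(seqs, reverse=True)


def first_accepted(candidates, accept):
    """Return (attempts, probability, sequence) for the first accepted sequence."""
    for attempts, (prob, seq) in enumerate(ranked_sequences(candidates), start=1):
        if accept(seq):
            return attempts, prob, seq
    return None


# Hypothetical supertagger output for "I like it"
candidates = [
    [("NP", 1.0)],
    [("PREP", 0.6), ("V_TRANS", 0.4)],
    [("NP", 1.0)],
]


def accept(seq):
    """Stand-in for the CFG filter: only a transitive-verb reading parses."""
    return seq == ("NP", "V_TRANS", "NP")


result = first_accepted(candidates, accept)
print(result)
```

Here the most probable sequence is rejected by the filter and the second one is accepted, so the parser would run once instead of exploring every lexical assignment.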
Experiment Results
• 6 times faster
• 20 times faster than the initial model
Adaptation with Reference Distribution
[Equation not recoverable from the extracted text. Its annotated parts are: lexical assignment, syntactic preference, feature function, feature weight, original model.]
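For orientation, a log-linear model with a reference distribution is commonly written in the form below. This is a generic textbook form offered as a plausible reading of the slide's labels, not a reconstruction of the slide's own equation; the symbols $p_0$, $f_j$, $\lambda_j$ and $Z_s$ are notation chosen here.

```latex
p(T \mid s) \;=\; \frac{1}{Z_s}\; p_0(T \mid s)\;
  \exp\!\Big(\sum_j \lambda_j f_j(T, s)\Big),
\qquad
Z_s \;=\; \sum_{T'} p_0(T' \mid s)\,
  \exp\!\Big(\sum_j \lambda_j f_j(T', s)\Big)
```

Under this reading, $p_0$ is the original model, each $f_j$ is a feature function with feature weight $\lambda_j$, and the labels "lexical assignment" and "syntactic preference" would name two groups of features. Only the weights $\lambda_j$ are re-estimated on the new domain, while $p_0$ stays fixed.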
[Figure: F-score (axis 83 to 90) against the number of sentences in the GENIA training set (axis 0 to 8,000). Curves: Baseline (PTB); Simple Retraining (GENIA); Retraining (GENIA+PTB); Structure with RefDist; Lexical with RefDist; Lexical/Structure with RefDist.]
[Figure: F-score (axis 83 to 90) against training time in seconds (axis 0 to 30,000). Curves: Retraining (GENIA); Structure with RefDist; Lexicon with RefDist; Lex/Str with RefDist.]
Tool 1: POS Tagger
• General-purpose POS taggers, trained on WSJ
• Brill's tagger, TnT tagger, MXPOST, etc.
• 97% accuracy
• General-purpose POS taggers do not work well for MEDLINE abstracts
Example (token/tag): The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …