A Robust Shallow Parser for Swedish Ola Knutsson, Johnny Bigert, Viggo Kann Royal Institute of Technology, Sweden
Introduction
What is robustness?
• Robust against noisy, ill-formed and partial natural language data
Shallow parsing
• Many NLP applications do not need full parsing
• Shallow parsing is: a parsing approach; pre-processing for full parsing; a collection of techniques
• Abney: finite-state cascades (1991)
• Currently, much attention on machine learning (ML)
• Well suited to modularization
Chunking and phrase identification
Common modules in a shallow parser:
• Tokenizer
• PoS tagger
• Chunker
• Phrase identifier
• Grammatical function identifier
Chunking: [NP Den mycket gamla mannen][VC gillade][NP mat]
Phrase identification: [NP Den [AP mycket gamla] mannen][VC gillade][NP mat]
(English: "The very old man liked food")
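The difference between the two levels can be made concrete with a small sketch that reads the bracket notation above into a nested structure and then flattens it back to chunk level. The function names (`parse_brackets`, `to_chunks`) are illustrative assumptions and are not part of GTA.

```python
def parse_brackets(text):
    """Parse '[NP a [AP b c] d][VC e]' into nested (label, children) tuples."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    stack = [("ROOT", [])]
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "[":
            i += 1                      # the token after '[' is the phrase label
            stack.append((tokens[i], []))
        elif tok == "]":
            node = stack.pop()          # close the innermost open phrase
            stack[-1][1].append(node)
        else:
            stack[-1][1].append(tok)    # plain word
        i += 1
    return stack[0][1]


def words(node):
    """Collect all words below a (label, children) node, left to right."""
    _, children = node
    out = []
    for child in children:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out


def to_chunks(tree):
    """Drop inner phrases, keeping only top-level chunks.

    Assumes every top-level element is a bracketed phrase."""
    return [(node[0], words(node)) for node in tree]


tree = parse_brackets("[NP Den [AP mycket gamla] mannen][VC gillade][NP mat]")
chunks = to_chunks(tree)
```

Here `tree` keeps the inner AP (phrase identification), while `chunks` contains only the flat NP, VC and NP segments (chunking).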
Parsers for Swedish
• Full parsers: UCP (Sågvall Hein) and SLE (Gambäck)
• Shallow parsers (phrase structure): Cass-Swe (Kokkinakis) and Megyesi's parser, which uses machine learning
• Dependency parsers: CG (Birn) and FDG (Voutilainen)
Granska Text Analyzer (GTA)
• Hand-crafted rules
• Context-free backbone
• Partly object-oriented notation
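Major Phrase Categories
• NP: Han såg den lilla mannen på bänken ("He saw the little man on the bench")
• VC: Han har spelat kort hela natten ("He has played cards all night")
• PP: Han såg spår i sanden ("He saw tracks in the sand")
• AP: Han ogillade små vita lögner ("He disliked little white lies")
• ADVP: Han vill inte gå på bio ("He does not want to go to the movies")
• INFP: Han tycker om att spela ("He likes to play")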
Clause Boundary Identification
• Based on Ejerhed’s algorithm
• Context-sensitive rules
• Uses only PoS information
Different kinds of rules
GTA contains 260 rules:
• 200 identify phrase structure
• 20 identify clause boundaries
• 40 are selection rules (disambiguation)
Example rule, [NP den lilla bilen]
NPmin@ {
  X(wordcl=dt | wordcl=hd | wordcl=rg),
  X2(wordcl=ab | wordcl=rg)?,
  Y(wordcl=jj | wordcl=ro | wordcl=pc)*,
  Z(wordcl=nn)
  -->
  action(help, wordcl:=Z.wordcl, pnf:=undef, gender:=Z.gender, num:=Z.num, spec:=Z.spec, case:=Z.case)
}
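The left-hand side of the NPmin rule reads as a pattern over word classes: one determiner-like word, an optional adverb-like word, any number of adjective-like modifiers, then a noun. A minimal Python sketch of that pattern follows. It is an illustration only, not GTA's rule engine; the names `match_npmin` and `find_npmin` are assumptions, and the `action(...)` part, which copies features from the head noun Z, is not modelled.

```python
# Word-class sets taken from the rule's X, X2, Y and Z elements.
DET = {"dt", "hd", "rg"}    # X
PRE = {"ab", "rg"}          # X2, optional
MOD = {"jj", "ro", "pc"}    # Y, zero or more
HEAD = {"nn"}               # Z


def match_npmin(tags, start):
    """Return the exclusive end index if the pattern matches at start, else None.

    Greedy matching is enough here because PRE shares no class with MOD or HEAD."""
    j = start
    if j >= len(tags) or tags[j] not in DET:
        return None
    j += 1
    if j < len(tags) and tags[j] in PRE:
        j += 1
    while j < len(tags) and tags[j] in MOD:
        j += 1
    if j < len(tags) and tags[j] in HEAD:
        return j + 1
    return None


def find_npmin(tagged):
    """Scan (word, wordclass) pairs left to right for non-overlapping matches."""
    tags = [wordclass for _, wordclass in tagged]
    spans, i = [], 0
    while i < len(tags):
        end = match_npmin(tags, i)
        if end is None:
            i += 1
        else:
            spans.append((i, end))
            i = end
    return spans


spans = find_npmin([("den", "dt"), ("lilla", "jj"), ("bilen", "nn")])
```

For the slide's example, `spans` holds a single span covering all three tokens of "den lilla bilen".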
Clause boundary rule
cl@ {
  V(sed!=sen & text!="som" & wordcl!=sn),
  X((wordcl=pn & pnf=sub) | (wordcl=pm & case=nom) | (wordcl=nn & case=nom & V.case!=gen) | wordcl=ab),
  ---endleftcontext---,
  Y(wordcl=kn),
  ---beginrightcontext---,
  Y2(((wordcl=pn & pnf=sub) | (wordcl=pm & case=nom) | (wordcl=nn & case=nom) | wordcl=ab) & wordcl=X.wordcl),
  Z(wordcl=vb & (vbf=prs | vbf=prt | vbf=imp))
  -->
  action(help, wordcl:=Y.wordcl)
}
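The Tetris Algorithm
[Figure: phrase candidates for "Fänrik Ax gav boken till general Claes Olsson". Candidates shown: PP "till general", PP "till general Claes", NP "general Claes Olsson". Bottom row: [NP Fänrik Ax] [VC gav] [NP boken] [PP till general Claes Olsson]]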
The IOB format (Ramshaw and Marcus, 1995)
A phrase/clause tag contains two parts:
• Phrase/clause type, e.g. NP, PP
• One of two position tags: I = inside a phrase/clause, B = beginning of a phrase/clause
A word that does not belong to any phrase is tagged O = outside
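As a small illustration of the format, the sketch below turns a flat chunk sequence into one IOB tag per word, writing the tags in the concatenated style used on these slides (NPB, NPI, O). The function name `chunks_to_iob` and the use of `None` for words outside any phrase are assumptions made for this example.

```python
def chunks_to_iob(chunks):
    """Convert [(label, [words])] into [(word, tag)] with B/I/O position tags.

    A label of None marks words that belong to no phrase."""
    rows = []
    for label, chunk_words in chunks:
        for k, word in enumerate(chunk_words):
            if label is None:
                tag = "O"                                # outside any phrase
            else:
                tag = label + ("B" if k == 0 else "I")   # first word begins the phrase
            rows.append((word, tag))
    return rows


iob = chunks_to_iob([
    ("NP", ["Den", "mycket", "gamla", "mannen"]),
    ("VC", ["gillade"]),
    ("NP", ["mat"]),
    (None, ["."]),
])
```

The first word of every chunk receives B, so two adjacent phrases of the same type remain distinguishable.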
Disagreement error
De            dt.utr/neu.plu.def                 NPB       CLB
gamla         jj.pos.utr/neu.plu.ind/def.nom     APB|NPI   CLI
äppelträdet   nn.neu.sin.def.nom                 NPI       CLI
kan           vb.prs.akt.mod                     VCB       CLI
bli           vb.inf.akt.kop                     VCI       CLI
som           kn                                 O         CLI
nya           jj.pos.utr/neu.plu.ind/def.nom     APB       CLI
.             mad                                O         CLI
Partial input
Arrangör              nn.utr.sin.ind.nom   NPB       CLB
var                   vb.prt.akt.kop       VCB       CLI
Järfälla              pm.gen               NPB|NPB   CLI
naturskyddsförening   nn.utr.sin.ind.nom   NPB|NPI   CLI
där                   ab                   ADVPB     CLI
är                    vb.prs.akt.kop       VCB       CLI
medlem                nn.utr.sin.ind.nom   NPB       CLI
.                     mad                  O         CLI
Noisy data
Inte          ab                       APB             CLB
så            ab                       ADVPB|APB|API   CLI
tjck          jj.pos.utr.sin.ind.nom   APB|API|API     CLI
som           ha                       O               CLB
det           pn.neu.sin.def.sub/obj   NPB             CLI
ofta          ab.pos                   ADVPB           CLI
står          vb.prs.akt               VCB             CLI
i             pp                       PPB             CLI
lärobökerna   nn.utr.plu.def.nom       NPB|PPI         CLI
;             mid                      O               CLI
Word order violation
Ympkvisten   nn.utr.sin.def.nom       NPB         CLB
inte         ab                       ADVPB       CLI
ska          vb.prs.akt.mod           VCB         CLI
vara         vb.inf.akt.kop           VCI         CLI
sådär        ab                       ADVPB|APB   CLI
lång         jj.pos.utr.sin.ind.nom   APB         CLI
,            mid                      O           CLI
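Each token in the four examples above carries four fields: the word, its PoS tag, one or more phrase tags joined by "|", and a clause tag. The sketch below splits such a token record into its parts. It assumes the fields are whitespace-separated and that nested phrase tags are listed innermost first, which is an inference from the examples (for instance APB|NPI on "gamla") rather than a documented property of GTA. The name `parse_row` is hypothetical.

```python
def split_tag(tag):
    """Split 'NPB' into ('NP', 'B'); the bare tag 'O' has no phrase type."""
    if tag == "O":
        return (None, "O")
    return (tag[:-1], tag[-1])


def parse_row(row):
    """Parse 'word pos phrase-tags clause-tag' into a dictionary."""
    word, pos, phrase_field, clause_field = row.split()
    return {
        "word": word,
        "pos": pos,
        # Nested phrase tags, assumed to run from innermost to outermost.
        "phrases": [split_tag(t) for t in phrase_field.split("|")],
        "clause": split_tag(clause_field),
    }


row = parse_row("gamla jj.pos.utr/neu.plu.ind/def.nom APB|NPI CLI")
```

For this token, `row["phrases"]` says that "gamla" begins an AP and sits inside an NP, and `row["clause"]` says it is inside the current clause.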
Evaluation
• Manually corrected output from GTA
• Untuned GTA used in the evaluation
• 15,000 words from SUC (Stockholm-Umeå Corpus)
• 5 genres
F-score for clause boundary identification
• F-score for a baseline identifier was 69.0%
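For reference, an F-score over clause boundaries can be computed from the sets of gold and predicted boundary positions, as sketched below. The slides do not say which F variant was used, so the balanced F1 (harmonic mean of precision and recall) is assumed here. The function name and the boundary positions in the example are made up for illustration.

```python
def f_score(gold, predicted):
    """Balanced F-score (F1) between two collections of boundary positions."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical token indices at which a clause begins (CLB).
score = f_score(gold={0, 5, 9}, predicted={0, 5, 7, 12})
```

In this made-up case two of the four predicted boundaries are correct and two of the three gold boundaries are found, so precision is 1/2 and recall is 2/3.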
Applications with GTA
We are using GTA in:
• Grammar checking, statistical and rule-based
• Clustering of medical texts
• CALL systems
What do you want to do with GTA?
More information www.nada.kth.se/theory/projects/xcheck Contact: Ola Knutsson knutsson@nada.kth.se