Unsupervised acquisition of verb subcategorization frames from shallow-parsed corpora. Alessandro Lenci (Università di Pisa, Italy), Barbara McGillivray (ILC-CNR / Università di Pisa, Italy), Simonetta Montemagni (ILC-CNR, Italy), Vito Pirrelli (ILC-CNR, Italy).
Outline 1. Subcategorization acquisition 2. MDL verb clustering
1. Subcategorization acquisition: summary. Previous work; our acquisition process; evaluation of results.
Previous work (1): Brent, 1991; Ushioda et al., 1993; Briscoe & Carroll, 1997; Korhonen, 2002. These approaches presuppose a battery of predefined frames, but there are languages for which no such SCF repertoires are available.
Previous work (2). Alternative: treat acquisition as an “SCF discovery” process in corpora (Basili et al., 1997; Zeman & Sarkar, 2000; Alonso et al., 2007; Bourigault & Frérot, 2005). We present a variation of this “discovery approach” to SC acquisition for Italian verbs.
Our SC extraction method simply requires a “chunked” corpus and a limited number of search heuristics that do not rely on any previous knowledge about SCFs. It is suited to languages other than English. It adopts a looser notion of SCF, including typical verb modifiers and strongly selected arguments.
The acquisition process (step 0): experimental setting. Corpus: the chunked PAROLE Corpus, an Italian general corpus of 3 million word tokens, chunked with CHUG-IT. Verbs: 47 communication verbs.
The acquisition process (step 1): extraction of verb local contexts (SLCs) from chunked texts. Ex.: [N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] [I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] [N_C il massimo storico] ‘the yen closed down in Tokyo at 120 after reaching the maximum ever in the course of the session’
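To make the input to step 1 concrete, here is a minimal sketch that reads the bracketed chunk notation used in the example and returns the chunks to the left and right of each finite-verb chunk. The function names `parse_chunks` and `local_contexts` are hypothetical, and treating `FV_C` as the finite-verb label is an assumption based on the example only; this is not the authors' extraction code.

```python
import re

# One bracketed chunk: a label such as N_C or FV_C, then the chunk text.
CHUNK_RE = re.compile(r"\[(\w+)\s+([^\]]+)\]")


def parse_chunks(line):
    """Return a list of (label, text) pairs from a bracketed chunk string."""
    return [(m.group(1), m.group(2).strip()) for m in CHUNK_RE.finditer(line)]


def local_contexts(chunks, verb_label="FV_C"):
    """Yield (verb_text, left_chunks, right_chunks) for each finite-verb chunk.

    Assumption: FV_C marks the finite-verb chunk, as in the slide example.
    """
    for i, (label, text) in enumerate(chunks):
        if label == verb_label:
            yield text, chunks[:i], chunks[i + 1:]


SENTENCE = (
    "[N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] "
    "[I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] "
    "[N_C il massimo storico]"
)

chunks = parse_chunks(SENTENCE)
contexts = list(local_contexts(chunks))
```

At this stage the right-hand context still contains every chunk after the verb, including material that belongs to the embedded infinitival clause.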
The acquisition process (step 2): context carving. Linguistically motivated criteria select only those chunks that are in the dependency scope of the verb v, so that noise is minimized. Ex.: [N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] [I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] [N_C il massimo storico] ‘the yen closed down in Tokyo at 120 after reaching the maximum ever in the course of the session’
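The slide does not list the carving criteria, so the following is only an illustrative stand-in: it keeps the chunks to the right of the verb up to the next verbal chunk, on the assumption that a new verbal chunk (`FV_C` or `I_C`) opens a separate dependency domain. The name `carve` and the label set `VERBAL_LABELS` are hypothetical.

```python
# Assumption: these labels mark verbal chunks that start a new dependency domain.
VERBAL_LABELS = {"FV_C", "I_C"}


def carve(chunks, verb_index):
    """Keep chunks after the verb until the next verbal chunk is reached."""
    carved = []
    for label, text in chunks[verb_index + 1:]:
        if label in VERBAL_LABELS:
            break  # everything from here on is assumed to depend on another verb
        carved.append((label, text))
    return carved


example = [
    ("N_C", "lo yen"), ("FV_C", "ha chiuso"), ("P_C", "a Tokio"),
    ("P_C", "a 120"), ("I_C", "dopo aver toccato"), ("P_C", "nel corso"),
    ("P_C", "della seduta"), ("N_C", "il massimo storico"),
]

carved = carve(example, verb_index=1)
```

Under this assumed rule, the chunks following "dopo aver toccato" are discarded as noise for the verb "ha chiuso".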
The acquisition process (step 3): induction of potential subcategorization frames (PSFs). Assumption: all contextual chunks occurring immediately after the verb are very likely governed by it; these are the potentially subcategorized slots (PSSs). A frequency filter is applied to the PSSs. An SLC is eligible as a PSF if its contextual chunks belong to the list of selected PSSs. A frequency filter is then applied to the PSFs.
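The two filters in step 3 can be sketched as follows. The slot encoding (chunk label plus preposition, e.g. `P_C:a`), the absolute-count thresholds, the function name `induce_frames`, and the toy contexts are all assumptions made for illustration; the slides give neither the thresholds nor the slot representation.

```python
from collections import Counter


def induce_frames(contexts, min_slot_count=2, min_frame_count=2):
    """Return {frame: count} for the frames that survive both frequency filters.

    contexts: carved contexts of one verb, each a tuple of slot labels.
    """
    # PSS candidates: the slot found immediately after the verb.
    slot_counts = Counter(ctx[0] for ctx in contexts if ctx)
    selected = {slot for slot, n in slot_counts.items() if n >= min_slot_count}

    # A context is eligible as a PSF only if all of its slots are selected PSSs.
    frame_counts = Counter(
        ctx for ctx in contexts if all(slot in selected for slot in ctx)
    )

    # Second filter: drop frames that are too rare.
    return {f: n for f, n in frame_counts.items() if n >= min_frame_count}


# Toy contexts for a single verb (invented for illustration only).
toy_contexts = [
    ("N_C",), ("N_C",), ("N_C",),
    ("N_C", "P_C:a"), ("N_C", "P_C:a"),
    ("P_C:a",), ("P_C:a",),
    ("N_C", "P_C:su"),
    ("ADV_C",),
]

frames = induce_frames(toy_contexts)
```

In the toy data, `P_C:su` and `ADV_C` never reach the slot threshold in post-verbal position, so the contexts containing them are rejected as frames.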
The acquisition process (step 3). Verb accettare ‘accept’
Evaluation of results - Italian. Evaluation of our SCF induction method on two outputs: the extracted carved contexts, used as baseline (step 2), and the induced subcat frames (step 4). Measures: type precision, type recall, F-measure.
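For reference, type precision, type recall and F-measure compare the set of frame types induced for a verb with the set listed in a gold standard. The sketch below uses the standard definitions; the function name `type_scores`, the frame labels and the toy sets are hypothetical.

```python
def type_scores(induced, gold):
    """Return (precision, recall, f_measure) over sets of frame types."""
    induced, gold = set(induced), set(gold)
    true_positives = len(induced & gold)
    precision = true_positives / len(induced) if induced else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy frame sets for one verb (invented for illustration only).
induced_frames = {"N_C", "N_C+P_C:a", "P_C:a", "P_C:su"}
gold_frames = {"N_C", "N_C+P_C:a", "CHE_C"}

precision, recall, f_measure = type_scores(induced_frames, gold_frames)
```

Because the scores are computed over types, a frame counts once per verb no matter how often it occurs in the corpus.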
Evaluation - Italian (2). Carried out against three gold standards: IGS1, a general-purpose computational lexicon (SIMPLE-PAROLE-CLIPS lexicon); IGS2, an Italian dictionary (Sabatini-Coletti 2006); IGS3, the merger of IGS1 and IGS2. Manual evaluation.
Evaluation - English. Four gold standards: EGS1, a general-purpose computational lexicon (Valex5 Lexicon); EGS2, Longman Dictionary (2006); EGS3, a biomedical English lexicon (SPECIALIST Lexicon); EGS4, the merger of EGS1, EGS2 and EGS3.
2. Verb clustering: summary. The MDL Principle; verb clustering using MDL.
Why verb clustering? Syntax-semantics lexical interface: starting from the extracted SCFs, we aim at inducing clusters of verbs that share similar semantic properties. Each verb is represented as a vector whose dimensions report its statistical distribution over the automatically extracted SCFs. The verb vectors are then clustered using the Minimum Description Length (MDL) Principle.
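One simple way to build such a vector is to normalize a verb's frame counts into relative frequencies over a fixed frame inventory. Whether the authors use relative frequencies or another statistic is not stated, so this is an assumption; the name `frame_distribution` and the frame labels are hypothetical.

```python
def frame_distribution(frame_counts, frame_inventory):
    """Map a verb's {frame: count} dict to relative frequencies over the inventory."""
    total = sum(frame_counts.get(f, 0) for f in frame_inventory)
    if total == 0:
        return [0.0 for _ in frame_inventory]
    return [frame_counts.get(f, 0) / total for f in frame_inventory]


# Toy counts for one verb (invented for illustration only).
inventory = ["N_C", "P_C:a", "CHE_C"]
vector = frame_distribution({"N_C": 3, "P_C:a": 1}, inventory)
```

Using one shared inventory for all verbs keeps the dimensions aligned, so the vectors of different verbs are directly comparable.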
The MDL Principle: from information theory (Rissanen 1989). Model description length: code length in bits for the encoding of the model itself; it measures the complexity of the model. Data description length: code length in bits for the encoding of the given data observed through the model; it measures the fit of the model to the data. MDL: “any regularity in the data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally”
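In the usual two-part formulation, the two code lengths are added and the model with the smallest sum is preferred. The notation below (M for a model, D for the data, P_M for the probability the model assigns to an observation) is introduced here for illustration and does not come from the slides; the second formula assumes a probabilistic model with ideal code lengths.

```latex
\hat{M} = \arg\min_{M} \bigl[ L(M) + L(D \mid M) \bigr],
\qquad
L(D \mid M) = -\sum_{x \in D} \log_2 P_M(x)
```

A more complex model lowers the data term but raises the model term, so the minimum trades fit against complexity.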
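Verb clustering using MDL. Baseline model [formula not recovered]: each verb belongs to one class. Compare [formula not recovered] with any model [formula not recovered]. Choose [formula not recovered] such that [formula not recovered]. Cluster [formula not recovered] together into the class [formula not recovered].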
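Since the formulas on this slide are lost, the following is only a generic sketch of greedy MDL clustering, not a reconstruction of the authors' procedure. It assumes that the baseline puts each verb in its own class, that the model cost is (k/2) log2 N bits for k free parameters and N observations, that the data cost is the negative log-likelihood of the frame counts under each class's pooled distribution, and that the pair of classes whose merger most reduces the total is merged until no merger helps. All names and the toy counts are hypothetical.

```python
import math
from itertools import combinations


def data_length(pooled_counts):
    """Bits needed to encode frame observations under the class's own distribution."""
    total = sum(pooled_counts)
    return -sum(n * math.log2(n / total) for n in pooled_counts if n > 0)


def description_length(clusters, counts, n_obs):
    """Total code length: model bits plus data bits (assumed encoding, see lead-in)."""
    n_frames = len(next(iter(counts.values())))
    # Each class has (n_frames - 1) free parameters, costing 0.5 * log2(N) bits each.
    model_bits = len(clusters) * (n_frames - 1) / 2 * math.log2(n_obs)
    data_bits = 0.0
    for cluster in clusters:
        pooled = [sum(counts[v][i] for v in cluster) for i in range(n_frames)]
        data_bits += data_length(pooled)
    return model_bits + data_bits


def mdl_cluster(counts):
    """Greedily merge classes while the total description length decreases."""
    n_obs = sum(sum(c) for c in counts.values())
    clusters = [frozenset([v]) for v in sorted(counts)]  # baseline: singletons
    best = description_length(clusters, counts, n_obs)
    while len(clusters) > 1:
        candidates = []
        for a, b in combinations(clusters, 2):
            merged = [c for c in clusters if c not in (a, b)] + [a | b]
            candidates.append((description_length(merged, counts, n_obs), merged))
        new_best, new_clusters = min(candidates, key=lambda t: t[0])
        if new_best >= best:
            break  # no merger compresses the data any further
        best, clusters = new_best, new_clusters
    return clusters


# Toy frame counts over three frames (invented for illustration only).
toy_counts = {
    "dire": [40, 10, 0],
    "dichiarare": [38, 12, 0],
    "parlare": [2, 5, 43],
    "discutere": [3, 4, 43],
}

result = mdl_cluster(toy_counts)
```

In the toy data, merging verbs with near-identical frame distributions saves model bits at almost no cost in data bits, while merging dissimilar verbs inflates the data term and is rejected.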
MDL clustering: 47 Italian communication verbs, 23 clustering steps. Verbs: PROMETTERE, RISPONDERE, PARLARE, PROTESTARE, CHIEDERE, DIRE, ASSERIRE, MINACCIARE, COMANDARE, INSEGNARE, AMMONIRE, DICHIARARE, CONFESSARE, CHIARIRE, PROIBIRE, SUGGERIRE, COMUNICARE, ACCETTARE, PROPORRE, MOSTRARE, COMMENTARE, CHIAMARE, PREGARE, DISCUTERE, RIVELARE, RICHIAMARE, RIMPROVERARE, LEGGERE, SPIEGARE, REPLICARE, DESCRIVERE, RICHIEDERE, DENUNCIARE, OFFRIRE, RIMPIANGERE, ORDINARE
Conclusions. A preliminary qualitative analysis of the induced verb clusters shows encouraging results. We expect to evaluate the coherence of the obtained lexico-semantic clusters and the coverage of the subcategorization behaviour of the clustered verbs.
MDL clustering: the verb classes are assigned a new cluster-based frame distribution.