SENSEVAL2

SENSEVAL2
Scott Cotton and Martha Palmer
ISLE Meeting, Dec 11, 2000
University of Pennsylvania

SENSEVAL
• SENSEVAL/SIGLEX98 (Brighton, Sep. 1998)
• Workshop on Word Sense Disambiguation
• Hector, corpus-based sense inventory
• 34 words: nouns, verbs, adjectives, mixed
• Inter-annotator agreement over 90%
• English (18 participating systems)
• Also Italian (2) and French (5)

SIGLEX99: All-Words Experiment
• WSJ 5K-word corpus
• Running text
• WordNet 1.6
• 2100 words sense-tagged twice (10 days)
• 89% inter-annotator agreement
• 700 verb tokens: 81% agreement (disagreement in 90/350 verb tokens)
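The agreement figures above can be read as observed agreement: the share of tokens on which two annotators chose the same sense. A minimal sketch, assuming that reading and assuming each annotator's tags arrive as parallel lists of sense labels. The function name and the toy labels are hypothetical and are not taken from the SENSEVAL tools or data.

```python
def observed_agreement(tags_a, tags_b):
    """Fraction of tokens on which two annotators chose the same sense."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no tokens to compare")
    matches = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return matches / len(tags_a)


# Toy data, made up for illustration: five tokens, one disagreement.
annotator_1 = ["bank%1", "run%2", "see%1", "line%3", "take%4"]
annotator_2 = ["bank%1", "run%2", "see%1", "line%1", "take%4"]
print(observed_agreement(annotator_1, annotator_2))
```

Observed agreement does not correct for chance, so a chance-corrected measure may be preferable when a word has few senses.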

SENSEVAL2
• Toulouse, France, July 5-6 (ACL’01)
• Samples: mid-December
• Training data: April
• Testing data: May
• 13 languages
• Lexical sample and all words
• Standardized data and formats, central server
• Closer tie to applications

13 Languages
• Swedish - lexical sample
  • Dimitrios Kokkinakis <Dimitrios.Kokkinakis@svenska.gu.se>
• Chinese - lexical sample
  • Chu-Ren Huang <churen@sinica.edu.tw>
  • Keh-jiann Chen <kchen@iis.sinica.edu.tw>
• Danish - lexical sample
  • Bolette Pedersen <bolette@cst.ku.dk>
• Estonian - all words (in principle)
  • Haldur Oim <hoim@psych.ut.ee>

13 Languages, cont.
• Japanese - lexical sample
  • Sadao Kurohashi <kuro@i.kyoto-u.ac.jp>
• Bangla - lexical sample
  • Niladri Sekhar Dash <niladri@isical.ac.in>
• Italian - lexical sample
  • Nicoletta Calzolari <glottolo@ilc.pi.cnr.it>
• English - lexical sample and all words
  • Adam Kilgarriff <Adam.Kilgarriff@itri.brighton.ac.uk>
  • Martha Palmer <mpalmer@linc.cis.upenn.edu>

13 Languages, cont.
• Basque - lexical sample
  • Eneko Agirre <eneko@si.ehu.es>
• Spanish - lexical sample
  • Mariona Taulé <mtaule@pcb.ub.es>
  • German Rigau <g.rigau@lsi.upc.es>
• Korean -
  • Key-Sun Choi <kschoi@cs.kaist.ac.kr>
• Czech -
  • Ondrej Cikhart <ondrej.cikhart@schemantix.com>
• Dutch -
  • Antal van den Bosch <Antal.vdnBosch@kub.nl>

Lexical Sample DTD
<!ELEMENT corpus (lexset+)>
<!ATTLIST corpus lang CDATA #REQUIRED>
<!ELEMENT lexset (instance+)>
<!ATTLIST lexset item CDATA #REQUIRED>
<!ELEMENT instance (answer*,context)>
<!ELEMENT answer EMPTY>
<!ATTLIST answer senseid CDATA #REQUIRED
                 weight CDATA #REQUIRED>
<!ELEMENT context (#PCDATA | itemloc)*>
<!ELEMENT itemloc (#PCDATA)>

<!DOCTYPE corpus SYSTEM "lexical-sample.dtd">
<corpus lang="en">
<lexset item="banana">
<instance>
<answer senseid="0" weight="0.3"/>
<context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine. </context>
</instance>
</lexset>
</corpus>
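To illustrate how a participating system might consume the lexical sample format, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name `read_lexical_sample` and the dictionary layout are hypothetical, not part of any SENSEVAL tooling. The sample string repeats the banana example with the DOCTYPE line left out, so no DTD file is needed to run it.

```python
import xml.etree.ElementTree as ET

# The banana example from the slide, without the DOCTYPE declaration.
SAMPLE = """<corpus lang="en">
  <lexset item="banana">
    <instance>
      <answer senseid="0" weight="0.3"/>
      <context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine. </context>
    </instance>
  </lexset>
</corpus>"""


def read_lexical_sample(xml_text):
    """Return one dict per instance: item, target word, context, answers."""
    corpus = ET.fromstring(xml_text)
    rows = []
    for lexset in corpus.iter("lexset"):
        for instance in lexset.iter("instance"):
            context = instance.find("context")
            target = context.find("itemloc")
            # Each answer pairs a sense id with a weight (both required by the DTD).
            answers = [(a.get("senseid"), float(a.get("weight")))
                       for a in instance.findall("answer")]
            rows.append({
                "item": lexset.get("item"),
                "target": target.text if target is not None else None,
                # Flatten the mixed content and normalize whitespace.
                "context": " ".join("".join(context.itertext()).split()),
                "answers": answers,
            })
    return rows


rows = read_lexical_sample(SAMPLE)
print(rows[0]["item"], rows[0]["target"], rows[0]["answers"])
```

Keeping the answers as a list reflects the `answer*` content model: an instance may carry zero, one, or several weighted senses.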

XML version?
<!ELEMENT corpus (descr?,rtext+)>
<!ATTLIST corpus lang CDATA #REQUIRED>
<!ELEMENT descr (#PCDATA)>
<!ELEMENT rtext (descr?, text, answer*)>
<!ELEMENT text (#PCDATA | tloc)*>
<!ELEMENT tloc (#PCDATA)>
<!ATTLIST tloc id ID #REQUIRED>
<!ELEMENT answer (lexentry,loc+,sense+)>
<!ELEMENT lexentry (#PCDATA)>
<!ELEMENT loc EMPTY>
<!ATTLIST loc ids IDREFS #REQUIRED>
<!ELEMENT sense EMPTY>
<!ATTLIST sense senseid CDATA #REQUIRED
                weight CDATA #IMPLIED>

<!DOCTYPE corpus SYSTEM "all-words.dtd">
<corpus lang="en">
<rtext>
<descr>Taken from the man page for intro in section 3 of a FreeBSD 4.0 system.</descr>
<text>

Words in text are tagged:
This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an <tloc id="w2">overview</tloc> of the C <tloc id="w3">library</tloc> <tloc id="w4">functions</tloc>, their <tloc id="w5">error</tloc> <tloc id="w6">returns</tloc> and other <tloc id="w7">common</tloc> <tloc id="w8">definitions</tloc> and <tloc id="w9">concepts</tloc>. Most of these <tloc id="w10">functions</tloc> <tloc id="w11">are</tloc> <tloc id="w12">available</tloc> from the C <tloc id="w13">library</tloc>, libc. Other <tloc id="w14">libraries</tloc>,

Then, for each tag:
</text>
<answer>
<lexentry>section</lexentry>
<loc ids="w0"/>
<sense senseid="1"/>
</answer>
</rtext>
</corpus>
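The all-words format keeps answers separate from the text and links them through token ids, so a reader has to resolve each `loc` back to its `tloc` words. A minimal sketch of that lookup with Python's standard `xml.etree.ElementTree` follows. The function name `read_all_words` and the result layout are hypothetical. The sample is a shortened version of the example, with the DOCTYPE line left out and the sense given in the `senseid` attribute that the DTD declares.

```python
import xml.etree.ElementTree as ET

# Shortened all-words example; no DOCTYPE, so no DTD file is needed.
SAMPLE = """<corpus lang="en">
  <rtext>
    <text>This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an <tloc id="w2">overview</tloc>.</text>
    <answer>
      <lexentry>section</lexentry>
      <loc ids="w0"/>
      <sense senseid="1"/>
    </answer>
  </rtext>
</corpus>"""


def read_all_words(xml_text):
    """Resolve each answer's token ids to the tagged words in the text."""
    corpus = ET.fromstring(xml_text)
    results = []
    for rtext in corpus.iter("rtext"):
        # Map every token id to its surface word.
        words = {t.get("id"): t.text for t in rtext.iter("tloc")}
        for answer in rtext.findall("answer"):
            # "ids" is an IDREFS attribute: a space-separated list of token ids.
            ids = [i for loc in answer.findall("loc")
                   for i in loc.get("ids").split()]
            results.append({
                "lexentry": answer.findtext("lexentry"),
                "tokens": [words[i] for i in ids],
                "senses": [s.get("senseid") for s in answer.findall("sense")],
            })
    return results


answers = read_all_words(SAMPLE)
print(answers)
```

Because `ids` may list several tokens, the same structure can describe a multiword expression with a single answer.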
