
SENSEVAL2

SENSEVAL2. Scott Cotton and Martha Palmer ISLE Meeting Dec 11, 2000 University of Pennsylvania. SENSEVAL. SENSEVAL/SIGLEX98: (Brighton, Sep,98) Workshop on Word Sense Disambiguation Hector, corpus-based sense inventory 34 words, nouns, verbs, adjectives, mixed


Presentation Transcript


  1. SENSEVAL2 Scott Cotton and Martha Palmer ISLE Meeting Dec 11, 2000 University of Pennsylvania

  2. SENSEVAL • SENSEVAL/SIGLEX98 (Brighton, Sep. 1998) • Workshop on Word Sense Disambiguation • Hector, corpus-based sense inventory • 34 words: nouns, verbs, adjectives, mixed • Inter-annotator agreement over 90% • English (18 participating systems) • Also Italian (2) and French (5)

  3. Siglex99: All words Experiment • WSJ 5K word corpus • running text • WordNet 1.6 • 2100 words sense tagged twice (10 days) • 89% inter-annotator agreement • 700 verb tokens – 81% agreement (disagreement in 90/350 verb tokens)
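The agreement percentages on slides 2 and 3 can be illustrated with a minimal sketch, assuming they refer to simple observed agreement (the fraction of tokens that both annotators tag with the same sense). The slides do not say how agreement was computed, and the function name and sense labels below are made up for illustration.

```python
def observed_agreement(tags_a, tags_b):
    """Fraction of tokens on which two annotators chose the same sense tag."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no tokens to compare")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Hypothetical sense tags for four tokens (not real SENSEVAL data).
annotator_a = ["bank%1", "run%2", "run%1", "line%3"]
annotator_b = ["bank%1", "run%2", "run%4", "line%3"]

agreement = observed_agreement(annotator_a, annotator_b)
print(f"{agreement:.0%}")  # 3 of 4 tokens match
```

Observed agreement does not correct for chance; a chance-corrected statistic would need the sense distribution as well.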

  4. SENSEVAL2 • Toulouse, France, July 5-6 (ACL’01) • Samples, mid-Dec • Training data, April • Testing data, May • 13 Languages • Lexical sample and all words • Standardized data and formats, central server • Closer tie to applications

  5. 13 Languages • Swedish - lexical sample: Dimitrios Kokkinakis <Dimitrios.Kokkinakis@svenska.gu.se> • Chinese - lexical sample: Chu-Ren Huang <churen@sinica.edu.tw>, Keh-jiann Chen <kchen@iis.sinica.edu.tw> • Danish - lexical sample: Bolette Pedersen <bolette@cst.ku.dk> • Estonian - all words (in principle): Haldur Oim <hoim@psych.ut.ee>

  6. 13 Languages, cont. • Japanese - lexical sample: Sadao Kurohashi <kuro@i.kyoto-u.ac.jp> • Bangla - lexical sample: Niladri Sekhar Dash <niladri@isical.ac.in> • Italian - lexical sample: Nicoletta Calzolari <glottolo@ilc.pi.cnr.it> • English - lexical sample and all words: Adam Kilgarriff <Adam.Kilgarriff@itri.brighton.ac.uk>, Martha Palmer <mpalmer@linc.cis.upenn.edu>

  7. 13 Languages, cont. • Basque - lexical sample: Eneko Agirre <eneko@si.ehu.es> • Spanish - lexical sample: Mariona Taulé <mtaule@pcb.ub.es>, German Rigau <g.rigau@lsi.upc.es> • Korean: Key-Sun Choi <kschoi@cs.kaist.ac.kr> • Czech: Ondrej Cikhart <ondrej.cikhart@schemantix.com> • Dutch: Antal van den Bosch <Antal.vdnBosch@kub.nl>

  8. Lexical Sample DTD
     <!ELEMENT corpus (lexset+)>
     <!ATTLIST corpus lang CDATA #REQUIRED>
     <!ELEMENT lexset (instance+)>
     <!ATTLIST lexset item CDATA #REQUIRED>
     <!ELEMENT instance (answer*,context)>
     <!ELEMENT answer EMPTY>
     <!ATTLIST answer senseid CDATA #REQUIRED
                      weight  CDATA #REQUIRED>
     <!ELEMENT context (#PCDATA | itemloc)+>
     <!ELEMENT itemloc (#PCDATA)>

  9. <!DOCTYPE corpus SYSTEM "lexical-sample.dtd">
     <corpus lang="en">
       <lexset item="banana">
         <instance>
           <answer senseid="0" weight="0.3"/>
           <context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine.</context>
         </instance>
       </lexset>
     </corpus>
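A minimal sketch of how a lexical-sample file like the one on slide 9 might be read, using Python's standard `xml.etree.ElementTree`. The function name `read_lexical_sample` and the dictionary layout are assumptions made for illustration, not part of the SENSEVAL2 tooling; the DOCTYPE line is left out of the embedded sample because this parser does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# The banana example from slide 9, without the DOCTYPE declaration.
SAMPLE = """<corpus lang="en">
  <lexset item="banana">
    <instance>
      <answer senseid="0" weight="0.3"/>
      <context>The monkeys ravenously devoured the <itemloc>bananas</itemloc> after the famine.</context>
    </instance>
  </lexset>
</corpus>"""


def read_lexical_sample(xml_text):
    """Yield one dict per <instance>: lexical item, target token, context, answers."""
    root = ET.fromstring(xml_text)
    for lexset in root.iter("lexset"):
        for instance in lexset.iter("instance"):
            context = instance.find("context")
            target = context.find("itemloc")
            yield {
                "item": lexset.get("item"),
                "target": target.text if target is not None else None,
                # itertext() joins the text around and inside <itemloc>.
                "context": "".join(context.itertext()).strip(),
                "answers": [
                    (a.get("senseid"), float(a.get("weight")))
                    for a in instance.findall("answer")
                ],
            }


records = list(read_lexical_sample(SAMPLE))
print(records[0]["item"], records[0]["target"], records[0]["answers"])
```

Keeping `senseid` as a string avoids assuming that sense identifiers are always numeric.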

  10. XML version?
      <!ELEMENT corpus (descr?,rtext+)>
      <!ATTLIST corpus lang CDATA #REQUIRED>
      <!ELEMENT descr (#PCDATA)>
      <!ELEMENT rtext (descr?, (tloc | #PCDATA)+, answer*)>
      <!ELEMENT tloc (#PCDATA)>
      <!ATTLIST tloc id ID #REQUIRED>
      <!ELEMENT answer (lexentry,loc+,sense+)>
      <!ELEMENT lexentry (#PCDATA)>
      <!ELEMENT loc EMPTY>
      <!ATTLIST loc ids IDREFS #REQUIRED>
      <!ELEMENT sense EMPTY>
      <!ATTLIST sense senseid CDATA #REQUIRED
                      weight  CDATA #IMPLIED>

  11. <!DOCTYPE corpus SYSTEM "all-words.dtd">
      <corpus lang="en">
        <rtext>
          <descr>taken from the man page for intro in section 3 of a FreeBSD 4.0 system.</descr>
          <text>

  12. Words in text are tagged:
      This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an <tloc id="w2">overview</tloc>
      of the C <tloc id="w3">library</tloc> <tloc id="w4">functions</tloc>, their
      <tloc id="w5">error</tloc> <tloc id="w6">returns</tloc> and other <tloc id="w7">common</tloc>
      <tloc id="w8">definitions</tloc> and <tloc id="w9">concepts</tloc>. Most of these
      <tloc id="w10">functions</tloc> <tloc id="w11">are</tloc> <tloc id="w12">available</tloc> from
      the C <tloc id="w13">library</tloc>, libc. Other <tloc id="w14">libraries</tloc>,

  13. Then, for each tag:
          </text>
          <answer>
            <lexentry>section</lexentry>
            <loc ids="w0"/>
            <sense senseid="1"/>
          </answer>
        </rtext>
      </corpus>
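The all-words example across slides 11 to 13 links each `<answer>` back to tagged tokens through the `ids` attribute of `<loc>`. A minimal sketch of resolving those references with Python's standard `xml.etree.ElementTree` follows. The function name `read_all_words` and the result layout are hypothetical, the embedded sample is a shortened version of the slide text, and the sense attribute is written as `senseid` to match the DTD on slide 10.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example on slides 11-13, without the DOCTYPE line.
SAMPLE = """<corpus lang="en">
  <rtext>
    <text>This <tloc id="w0">section</tloc> <tloc id="w1">provides</tloc> an
    <tloc id="w2">overview</tloc> of the C <tloc id="w3">library</tloc>
    <tloc id="w4">functions</tloc>.</text>
    <answer>
      <lexentry>section</lexentry>
      <loc ids="w0"/>
      <sense senseid="1"/>
    </answer>
  </rtext>
</corpus>"""


def read_all_words(xml_text):
    """Return one dict per <answer>, with its <loc> ids resolved to token text."""
    root = ET.fromstring(xml_text)
    results = []
    for rtext in root.iter("rtext"):
        # Map each token id to its surface form, wherever <tloc> is nested.
        tokens = {t.get("id"): t.text for t in rtext.iter("tloc")}
        for answer in rtext.findall("answer"):
            # ids is an IDREFS attribute: whitespace-separated token ids.
            ids = [i for loc in answer.findall("loc") for i in loc.get("ids").split()]
            results.append({
                "lexentry": answer.findtext("lexentry"),
                "tokens": [tokens[i] for i in ids],
                # weight is #IMPLIED in the DTD, so it may be None here.
                "senses": [(s.get("senseid"), s.get("weight"))
                           for s in answer.findall("sense")],
            })
    return results


results = read_all_words(SAMPLE)
print(results)
```

Because `ids` can hold several token ids, one answer can cover a multi-word expression without any change to this lookup.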
