
A resource and tool for Super Sense Tagging of Italian Texts

LREC 2010, Malta, 19-21/05/2010



  1. A resource and tool for Super Sense Tagging of Italian Texts
     LREC 2010, Malta, 19-21/05/2010

  2. Summary
     • Why Super Sense tagging
     • Preliminary results
     • Improving an existing resource
     • Building a new resource
     • A new tagger for the task
     • Discussion on the results
     • Future work

  3. Semantic tagging
     • Named Entity Recognition (NER)
       • Simple ontologies: person, organization, location …
       • Limited semantic/syntactic coverage
       • High accuracy
     • Word Sense Disambiguation
       • Identifying WordNet senses
       • Tens of thousands of specific “word senses”
       • All open-class words covered, domain-independent
       • Inadequate performance

  4. Super Senses
     • Introduced by Ciaramita and Altun (2006)
     • WordNet super senses
       • Noun and verb synsets mapped to 41 general semantic classes (lexicographic categories)
       • 26 noun categories; 15 verb categories
     Example: “Clara Harris[person], one of the guests[person] in the box[artifact], stood up[motion] and demanded[communication] water[substance]”

  5. Super Sense Tagging
     • For English (Ciaramita and Altun, 2006)
       • Trained on SemCor (Senseval-3)
       • Discriminative HMM, trained with an averaged perceptron algorithm
       • Average F-Score on 41 categories: 77.18
     • For Italian (Picca, Gliozzo, Ciaramita, 2008)
       • Trained on MultiSemCor (Bentivoglio et al.)
       • Average F-Score on 41 categories: 62.90

  6. Improving MultiSemCor
     • Problems
       • Smaller size (64% of the English corpus)
       • Incomplete alignment (sense in English, no sense in Italian)
       • PoS coarseness
       • Word-by-word translation
     • Strategy
       • Retagging, adding morphology
     • Results
       • Average F-Score: 64.95 (same algorithm; 45 categories)

  7. Further work
     • Our requirements
       • Integration of an SST tagger in the TANL pipeline
       • A useful model for annotating realistic Italian texts
     • Two directions for improvement
       • A brand new resource for SST
       • A new algorithm for SST, based on Maximum Entropy

  8. Building the new resource
     [Diagram: ISST Corpus <Lemma, sense> → IWN → ILI → WordNet → Supersense]
     • ISST - Italian Syntactic-Semantic Treebank
       • 305,547 tokens
       • 81,236 content words annotated at the lexico-semantic level, including IWN (ItalWordNet) senses
     • ILI* mapping from IWN to WN senses
     * Inter-Lingual Index

  9. From Italian senses to English super senses
     [Diagram: ISST Corpus tokens with <Lemma, Sense> pairs: L’ <lo, 0>, atmosfera <atmosfera, 4>, di <di, 0>, festa <festa, 1>. ItalWordNet synset → ILI entries: n#24931 → n#08770969, n#12484 → #04692559, n#16564 → – (no ILI entry). WordNet: hyperonym, supersense.]
     Starting from sense Si:
     • If Si is in the ILI, return the corresponding Se
     • If not, look for the first hyperonym in the ILI and return the corresponding Se
     • In both cases, return the super sense of Se in WN
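The three steps on this slide can be sketched as a small lookup routine. This is only an illustration under stated assumptions, not the authors' code: the dictionaries `ili`, `hypernym` and `supersense`, the function name `map_to_supersense`, and the identifiers `it-1`, `en-1`, etc. are all hypothetical, and each Italian synset is assumed to have at most one hyperonym.

```python
from typing import Dict, Optional


def map_to_supersense(
    sense: str,
    ili: Dict[str, str],
    hypernym: Dict[str, str],
    supersense: Dict[str, str],
) -> Optional[str]:
    """Map an Italian sense to a WordNet super sense.

    ili:        Italian synset -> English synset (hypothetical toy ILI)
    hypernym:   Italian synset -> its Italian hyperonym (single parent assumed)
    supersense: English synset -> WordNet super sense
    """
    current: Optional[str] = sense
    seen = set()  # guards against cycles in the toy hierarchy
    while current is not None and current not in seen:
        seen.add(current)
        if current in ili:
            # Found Si (or its first hyperonym) in the ILI: return the
            # super sense of the corresponding English sense Se.
            return supersense.get(ili[current])
        current = hypernym.get(current)  # climb one level and retry
    return None  # no sense on the hyperonym chain is in the ILI


# Toy data, invented for illustration only.
ILI = {"it-1": "en-1", "it-2": "en-2"}
HYPERNYM = {"it-3": "it-2"}
SUPERSENSE = {"en-1": "noun.state", "en-2": "noun.event"}

direct = map_to_supersense("it-1", ILI, HYPERNYM, SUPERSENSE)
via_hypernym = map_to_supersense("it-3", ILI, HYPERNYM, SUPERSENSE)
unmapped = map_to_supersense("it-9", ILI, HYPERNYM, SUPERSENSE)
```

Here `it-3` has no ILI entry of its own, so it inherits the super sense reached through its hyperonym `it-2`.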

  10. ISST-SST after mapping

  11. Revision
     • Mapping of adverbs in adv.all (~ 10,000)
     • Listing of possible super senses
       • Alternatives for nouns: 2-6
       • Alternatives for verbs: 3-10
     • An ad-hoc tool for revision
     • Difficulties
       • Aspectual verbs: “continuare a …” (to continue to), “stare per …” (to be about to)
       • Support verbs: “prestare attenzione a …” (to pay attention to), “dar una mano …” (to lend a hand)

  12. ISST-SST after revision

  13. Super Sense Tagger
     • Adapting a generic chunker, part of the Tanl pipeline
     • Maximum Entropy classifier
       • Effective for chunking since it does not assume independence of features
     • Dynamic programming to select the most probable sequences of tags
     • The tagger is flexible and customizable for different tasks
       • Specialization of class FeatureExtractor
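The dynamic-programming step can be illustrated with a Viterbi-style search over per-token scores. The slides do not describe the Tanl implementation, so this is a minimal sketch under assumptions: the BIO-style tag names, the `toy_log_prob` scorer (which stands in for a trained Maximum Entropy classifier) and its probabilities are all invented for the example.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence

# (position, previous tag or None, candidate tag) -> log score
LogProb = Callable[[int, Optional[str], str], float]


def viterbi(n_tokens: int, tags: Sequence[str], log_prob: LogProb) -> List[str]:
    """Return the highest-scoring tag sequence for n_tokens tokens."""
    if n_tokens == 0:
        return []
    best: List[Dict[str, float]] = [{t: log_prob(0, None, t) for t in tags}]
    back: List[Dict[str, str]] = []
    for i in range(1, n_tokens):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for t in tags:
            # Best predecessor for tag t at position i.
            prev = max(tags, key=lambda p: best[-1][p] + log_prob(i, p, t))
            scores[t] = best[-1][prev] + log_prob(i, prev, t)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


TAGS = ["B-noun.person", "I-noun.person", "O"]

# Invented per-token probabilities standing in for classifier output.
LOCAL = [
    {"B-noun.person": 0.4, "I-noun.person": 0.1, "O": 0.5},
    {"B-noun.person": 0.1, "I-noun.person": 0.6, "O": 0.3},
]


def toy_log_prob(i: int, prev: Optional[str], tag: str) -> float:
    # Assumed constraint: an I- tag cannot start a sentence or follow O.
    if tag.startswith("I-") and (prev is None or prev == "O"):
        return -math.inf
    return math.log(LOCAL[i][tag])


best_path = viterbi(2, TAGS, toy_log_prob)
```

Choosing the best tag per token independently would give `O` followed by `I-noun.person`, which violates the assumed constraint; the sequence-level search instead prefers `B-noun.person`, `I-noun.person`.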

  14. Features
     • No external resources, no first sense heuristics
     • Local features
       • Token attribute features
           POSTAG   -2 -1 0 1 2
           CPOSTAG  -1 0
       • Form features
           FORM ^\p{Lu}     -1 +1
           FORM ^\p{Lu}*$   0
     • Global features
       • Whether a word in the document was previously annotated with a given tag
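One way to read the feature templates above is as (attribute, offsets) pairs applied around the current token. The sketch below is an assumption-laden illustration, not the Tanl `FeatureExtractor`: the function `extract_features`, the feature-string format, the `previous_tags` dictionary and the POS tags in the example sentence are all hypothetical. Because Python's `re` module does not support `\p{Lu}`, the uppercase test uses `unicodedata` instead.

```python
import unicodedata
from typing import Dict, List, Optional, Set

Token = Dict[str, str]  # keys assumed here: FORM, POSTAG, CPOSTAG

# Attribute templates from the slide: attribute name -> relative offsets.
ATTRIBUTE_TEMPLATES = {"POSTAG": (-2, -1, 0, 1, 2), "CPOSTAG": (-1, 0)}


def is_upper_letter(ch: str) -> bool:
    """True if ch belongs to Unicode category Lu (what \\p{Lu} matches)."""
    return unicodedata.category(ch) == "Lu"


def extract_features(
    sentence: List[Token],
    i: int,
    previous_tags: Optional[Dict[str, Set[str]]] = None,
) -> List[str]:
    feats: List[str] = []
    # Token attribute features; offsets outside the sentence are skipped.
    for attr, offsets in ATTRIBUTE_TEMPLATES.items():
        for off in offsets:
            j = i + off
            if 0 <= j < len(sentence):
                feats.append(f"{attr}[{off}]={sentence[j][attr]}")
    # FORM ^\p{Lu} at -1 and +1: neighbour starts with an uppercase letter
    # (forms are assumed to be non-empty).
    for off in (-1, 1):
        j = i + off
        if 0 <= j < len(sentence) and is_upper_letter(sentence[j]["FORM"][0]):
            feats.append(f"CAP[{off}]")
    # FORM ^\p{Lu}*$ at 0: the current form is entirely uppercase letters.
    if all(is_upper_letter(ch) for ch in sentence[i]["FORM"]):
        feats.append("UPPER[0]")
    # Global feature: tags given to the same word earlier in the document.
    for tag in sorted((previous_tags or {}).get(sentence[i]["FORM"], ())):
        feats.append(f"SEEN_AS={tag}")
    return feats


# Example sentence; the POS tags are placeholders, not a real tagset.
SENTENCE = [
    {"FORM": "Edison", "POSTAG": "SP", "CPOSTAG": "S"},
    {"FORM": "inventò", "POSTAG": "V", "CPOSTAG": "V"},
    {"FORM": "la", "POSTAG": "RD", "CPOSTAG": "R"},
    {"FORM": "lampadina", "POSTAG": "S", "CPOSTAG": "S"},
]

features = extract_features(SENTENCE, 1, {"inventò": {"verb.creation"}})
```

For the token “inventò”, the output includes `POSTAG[-1]=SP`, `CAP[-1]` (because “Edison” is capitalized) and `SEEN_AS=verb.creation`, but no `POSTAG[-2]` feature, since that offset falls before the start of the sentence.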

  15. Detailed results

  16. Results for Italian
     • Improvement for Italian due to:
       • the new corpus
       • the different algorithm and the tuning of features

  17. Analysis of improvement
     • Improvement due to new corpus
       • MultiSemCor vs ISST-SST, ME tagger
       • about +4.5 on the F1 score
     • Improvement due to new algorithm and features
       • Ciaramita-Altun tagger vs ME tagger, on ISST-SST
       • about +10 on the F1 score

  18. Conclusions
     • Significant improvement in accuracy for SS tagging
     • The tagger has been used to annotate the Italian Wikipedia
     • Examples of queries made possible on the semantic index
       • Who feels emotions? (the subject of a verb.emotion)
       • What did Edison invent/create/discover …? (Edison as the subject of a verb.creation)
     • Completion of the ISST-SST resource can further improve accuracy
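The kind of query mentioned above can be sketched over a toy in-memory structure. The slides do not describe the actual semantic index, so everything here is hypothetical: the `Token` class, the `subj`/`obj`/`det` dependency labels, the function `subjects_of` and the example annotations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    form: str
    head: Optional[int]        # index of the syntactic head, None for the root
    deprel: str                # dependency label (labels here are assumed)
    supersense: Optional[str]  # super sense tag, None for untagged tokens


def subjects_of(
    sentence: List[Token], verb_supersense: str, subj_label: str = "subj"
) -> List[str]:
    """Return the forms of tokens that are subjects of a verb
    tagged with the given super sense."""
    return [
        tok.form
        for tok in sentence
        if tok.deprel == subj_label
        and tok.head is not None
        and sentence[tok.head].supersense == verb_supersense
    ]


# "Edison inventò la lampadina" with invented annotations.
SENTENCE = [
    Token("Edison", 1, "subj", "noun.person"),
    Token("inventò", None, "root", "verb.creation"),
    Token("la", 3, "det", None),
    Token("lampadina", 1, "obj", "noun.artifact"),
]

creators = subjects_of(SENTENCE, "verb.creation")
feelers = subjects_of(SENTENCE, "verb.emotion")
```

A query such as “Who feels emotions?” would then amount to collecting `subjects_of(sentence, "verb.emotion")` over every sentence in the indexed corpus.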
