180 likes | 315 Views
A resource and tool for Super Sense Tagging of Italian Texts. LREC 2010, Malta – 19-21/05/2010. Summary. Why Super Sense tagging Preliminary results Improving an existing resource Building a new resource A new tagger for the task Discussion on the results Future work. Semantic tagging.
E N D
A resource and tool for Super Sense Tagging of Italian Texts LREC 2010, Malta – 19-21/05/2010
Summary Why Super Sense tagging Preliminary results Improving an existing resource Building a new resource A new tagger for the task Discussion on the results Future work
Semantic tagging • Named Entity Recognition (NER) • Simple ontologies: person, organization, location … • Limited semantic/syntactic coverage • High accuracy • Word Sense Disambiguation • Identifying WordNet senses • tens of thousands of specific “word senses” • all open class words covered, domain-independent • inadeguate performance
Super Senses • Super Senses • Introduced by Ciaramita and Altun (2006) • WordNet super senses • Noun and verb synsets mapped to 41 general semantic classes (lexicographic categories) • 26 noun categories; 15 verb categories Example: “Clara Harrisperson, one of the guestspersonin the boxartifact, stood upmotionand demandedcommunicationwatersubstance”
Super Sense Tagging • For English (Ciaramita and Altun, 2006) • training on SemCor (Senseval-3) • discriminative HMM, trained with an average perceptron algorithm • average F-Score on 41 categories: 77.18 • For Italian (Picca, Gliozzo, Ciaramita, 2008) • trained on MultiSemCor (Bentivoglio et al.) • average F-Score on 41 categories: 62,90
Improving MultiSemCor • Problems • Smaller size (64% of English corpus) • Incomplete alignment (sense in Eng., no sense in Ita.) • PoS coarseness • Word by word translation • Stategy • Retagging, adding morphology • Results • average F-Score: 64,95 (same algorithm; 45 categories)
Further work • Our requirements • Integration of a SST tagger in the TANL pipeline • Useful model for annotating realistic Italian texts • Two directions for improvement • A brand new resource for SST • A new algorithm for SST, based on Maximum Entropy
ILI IWN WordNet Supersense <Lemma, sense> ISST Corpus Building the new resource • ISST - Italian Syntactic-Semantic Treebank • 305,547 tokens • 81,236 content words annotated at the lexico-semantic level, including IWN senses • ILI* mapping from IWN to WN senses * Inter Linguistic Index
ItalWordNet Synset ILI n#24931n #08770969 Synset ILI n#12484 #04692559 WordNet Hyperonym Supersense Synset ILI n#16564 – ISST Corpus Token <Lemma, Sense> L’ <lo, 0> atmosfera <atmosfera, 4> di <di, 0> festa <festa, 1> From Italian senses to English super senses Starting from sense Si: • If Si is in ILI, return che corresponding Se • If not, look for the first hyperonym in the ILI and return the corresponding Se • In both cases return the super sense of Se in WN
Revision • Mapping of adverbs in adv.all (~ 10,000) • Listing of possible super senses • Alternative for nouns: 2-6 • Alternative for verbs: 3-10 • An ad-hoc tool for revision • Difficulties • Aspectual verbs: “continuare a …”, “stare per …” • Support verbs: “prestare attenzione a …”, “ dar una mano …”
Super Sense Tagger • Adapting a generic chunker, part of the Tanl pipeline • Maximum Entropy classifier • Effective for chunking since it does not assume independence of features • Dynamic programming to select sequences of tags with higher probability • The tagger is flexibile and customizable for different tasks • specialization of class FeatureExtractor
Features • No external resources, no first sense heuristics • Local features • Token Attribute features POSTAG -2 -1 0 1 2 CPOSTAG -1 0 • Form features FORM ^\p{Lu} -1 +1 FORM ^\p{Lu}*$ 0 • Global features • Whether a word in the document was previously annotated with a given tag
Results for Italian • Improvement for Italian due to: • new corpus • the different algorithm and the tuning of features
Analysis of improvement • Improvement due to new corpus • MultiSemCor vs ISST-SST, ME tagger • about +4.5 on the F1 score • Improvement due to new algorithm and features • Ciaramita-Altun tagger vs ME tagger, on ISST-SST • about +10 on the F1 score
Conclusions • Significant improvement in accuracy for SS tagging • The tagger has been used to annotate the Italian Wikipedia • Examples of queries made possible on the semantic index • Who proves emotions? (the subj of a verb.emotion) • What did Edison invent/create/discover …? (Edison as the subject of a verb.creation) • Completion of the ISST-ISST resource can further improve accuracy