A resource and tool for Super Sense Tagging of Italian Texts

LREC 2010, Malta – 19-21/05/2010

Summary
• Why Super Sense tagging
• Preliminary results
• Improving an existing resource
• Building a new resource
• A new tagger for the task
• Discussion on the results
• Future work

Semantic tagging
• Named Entity Recognition (NER)
  • Simple ontologies: person, organization, location …
  • Limited semantic/syntactic coverage
  • High accuracy
• Word Sense Disambiguation
  • Identifying WordNet senses
  • Tens of thousands of specific “word senses”
  • All open-class words covered, domain-independent
  • Inadequate performance

Super Senses
• Introduced by Ciaramita and Altun (2006)
• WordNet super senses
  • Noun and verb synsets mapped to 41 general semantic classes (lexicographic categories)
  • 26 noun categories; 15 verb categories
Example: “Clara Harris[person], one of the guests[person] in the box[artifact], stood up[motion] and demanded[communication] water[substance]”

Super Sense Tagging
• For English (Ciaramita and Altun, 2006)
  • Training on SemCor (Senseval-3)
  • Discriminative HMM, trained with an average perceptron algorithm
  • Average F-Score on 41 categories: 77.18
• For Italian (Picca, Gliozzo, Ciaramita, 2008)
  • Trained on MultiSemCor (Bentivoglio et al.)
  • Average F-Score on 41 categories: 62.90

Improving MultiSemCor
• Problems
  • Smaller size (64% of English corpus)
  • Incomplete alignment (sense in Eng., no sense in Ita.)
  • PoS coarseness
  • Word by word translation
• Strategy
  • Retagging, adding morphology
• Results
  • Average F-Score: 64.95 (same algorithm; 45 categories)

Further work
• Our requirements
  • Integration of an SST tagger in the TANL pipeline
  • Useful model for annotating realistic Italian texts
• Two directions for improvement
  • A brand new resource for SST
  • A new algorithm for SST, based on Maximum Entropy

Building the new resource
[Diagram labels: ISST Corpus, <Lemma, sense>, IWN, ILI, WordNet, Supersense]
• ISST - Italian Syntactic-Semantic Treebank
  • 305,547 tokens
  • 81,236 content words annotated at the lexico-semantic level, including IWN (ItalWordNet) senses
• ILI* mapping from IWN to WN senses
* ILI: Inter-Lingual Index

From Italian senses to English super senses
[Diagram: ISST corpus tokens as <Lemma, Sense> pairs (L’ <lo, 0>, atmosfera <atmosfera, 4>, di <di, 0>, festa <festa, 1>), linked through ItalWordNet synsets and their ILI entries to WordNet synsets and super senses, with a hypernym step where a synset has no ILI entry. Synset / ILI identifiers shown on the slide: n#24931 / #08770969; n#12484 / #04692559; n#16564 / –]
Starting from sense Si:
• If Si is in the ILI, return the corresponding Se
• If not, look for the first hypernym that is in the ILI and return the corresponding Se
• In both cases return the super sense of Se in WN
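The mapping procedure on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the three dictionaries, the identifiers in them, and the function name `supersense_for` are hypothetical toy data, and the sketch assumes each synset has at most one hypernym, which real wordnets do not guarantee.

```python
from typing import Dict, Optional

# Hypothetical toy tables; the identifiers are made up for illustration.
ILI: Dict[str, str] = {"iwn:n#2": "wn:n#200"}            # IWN synset -> WN synset
HYPERNYM: Dict[str, str] = {"iwn:n#1": "iwn:n#2"}        # IWN synset -> IWN hypernym
SUPERSENSE: Dict[str, str] = {"wn:n#200": "noun.state"}  # WN synset -> super sense


def supersense_for(synset: str) -> Optional[str]:
    """Return the WN super sense reached from an IWN synset, or None.

    If the synset has an ILI link, use it; otherwise climb the hypernym
    chain until a linked synset is found.
    """
    seen = set()  # guards against cycles in the hypernym table
    current: Optional[str] = synset
    while current is not None and current not in seen:
        seen.add(current)
        if current in ILI:
            return SUPERSENSE.get(ILI[current])
        current = HYPERNYM.get(current)
    return None
```

Here `supersense_for("iwn:n#1")` falls back to the hypernym `iwn:n#2`, which does have an ILI link, while a synset with no linked ancestor yields `None` and would be left for manual revision.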

ISST-SST after mapping

Revision
• Mapping of adverbs in adv.all (~ 10,000)
• Listing of possible super senses
  • Alternatives for nouns: 2-6
  • Alternatives for verbs: 3-10
• An ad-hoc tool for revision
• Difficulties
  • Aspectual verbs: “continuare a …”, “stare per …”
  • Support verbs: “prestare attenzione a …”, “dar una mano …”

ISST-SST after revision

Super Sense Tagger
• Adapting a generic chunker, part of the Tanl pipeline
• Maximum Entropy classifier
  • Effective for chunking since it does not assume independence of features
• Dynamic programming to select sequences of tags with higher probability
• The tagger is flexible and customizable for different tasks
  • Specialization of class FeatureExtractor
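To make the dynamic programming step concrete, here is a minimal Viterbi-style sketch in Python. It is not the Tanl implementation: it assumes the classifier can be called as `prob(i, previous_tag)` and returns a probability for each tag at position `i`, and the names `best_sequence`, `toy_model`, `START`, and `FLOOR` are hypothetical.

```python
import math
from typing import Callable, Dict, List

START = "<s>"   # hypothetical pseudo-tag that precedes the first token
FLOOR = 1e-12   # avoids log(0) for tags the model gives no probability


def _log(dist: Dict[str, float], tag: str) -> float:
    return math.log(max(dist.get(tag, 0.0), FLOOR))


def best_sequence(n: int, tags: List[str],
                  prob: Callable[[int, str], Dict[str, float]]) -> List[str]:
    """Return the most probable tag sequence for n tokens."""
    if n == 0:
        return []
    # score[t]: best log-probability of a sequence ending in tag t
    score = {t: _log(prob(0, START), t) for t in tags}
    back: List[Dict[str, str]] = []  # back-pointers for positions 1..n-1
    for i in range(1, n):
        dists = {p: prob(i, p) for p in tags}
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for t in tags:
            def total(p: str, t: str = t) -> float:
                return score[p] + _log(dists[p], t)
            best_prev = max(tags, key=total)
            new_score[t] = total(best_prev)
            pointers[t] = best_prev
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: score[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def toy_model(i: int, prev: str) -> Dict[str, float]:
    """Made-up probabilities for a two-token, two-tag example."""
    if i == 0:
        return {"O": 0.6, "B": 0.4}
    return {"O": 0.1, "B": 0.9} if prev == "B" else {"O": 0.5, "B": 0.5}
```

With `toy_model`, a greedy tagger would pick "O" for the first token (0.6 against 0.4), but `best_sequence(2, ["O", "B"], toy_model)` selects "B", "B", whose joint probability 0.36 is higher than the 0.30 of either sequence starting with "O".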

Features
• No external resources, no first sense heuristics
• Local features
  • Token attribute features (attribute, token offsets)
      POSTAG    -2 -1 0 1 2
      CPOSTAG   -1 0
  • Form features (pattern, token offsets)
      FORM ^\p{Lu}      -1 +1
      FORM ^\p{Lu}*$    0
• Global features
  • Whether a word in the document was previously annotated with a given tag
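The local feature templates can be read as "take this attribute at these offsets from the current token". The Python sketch below illustrates that reading only; it is not the Tanl FeatureExtractor class. The function names, the feature-string format, and the `<pad>` value are assumptions, and since Python's `re` module does not support `\p{Lu}`, the two form patterns are approximated with `str.isupper()`.

```python
from typing import Dict, List

Token = Dict[str, str]  # expected keys: "FORM", "POSTAG", "CPOSTAG"


def attr(sentence: List[Token], i: int, name: str) -> str:
    """Attribute of the token at position i, or a padding value off the edges."""
    return sentence[i][name] if 0 <= i < len(sentence) else "<pad>"


def local_features(sentence: List[Token], i: int) -> List[str]:
    feats: List[str] = []
    # Token attribute features: POSTAG at -2..2, CPOSTAG at -1 and 0.
    for off in (-2, -1, 0, 1, 2):
        feats.append(f"POSTAG[{off}]={attr(sentence, i + off, 'POSTAG')}")
    for off in (-1, 0):
        feats.append(f"CPOSTAG[{off}]={attr(sentence, i + off, 'CPOSTAG')}")
    # Form feature ^\p{Lu}: neighbouring token starts with an uppercase letter.
    for off in (-1, 1):
        if attr(sentence, i + off, "FORM")[:1].isupper():
            feats.append(f"FORM_CAP[{off}]")
    # Form feature ^\p{Lu}*$: current token is all uppercase (approximation).
    if attr(sentence, i, "FORM").isupper():
        feats.append("FORM_ALLCAPS[0]")
    return feats


# Hypothetical example sentence; the tag values are illustrative only.
EXAMPLE: List[Token] = [
    {"FORM": "Clara", "POSTAG": "SP", "CPOSTAG": "S"},
    {"FORM": "Harris", "POSTAG": "SP", "CPOSTAG": "S"},
    {"FORM": "demanded", "POSTAG": "VI", "CPOSTAG": "V"},
]
```

For the token "Harris" (position 1), the result includes `POSTAG[-1]=SP` and `FORM_CAP[-1]`, because the previous token "Clara" is capitalized, but no `FORM_CAP[1]`, because "demanded" is not. The global feature on the slide would need document-level state and is left out of this sketch.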

Detailed results

Results for Italian
• Improvement for Italian due to:
  • the new corpus
  • the different algorithm and the tuning of features

Analysis of improvement
• Improvement due to new corpus
  • MultiSemCor vs ISST-SST, ME tagger
  • about +4.5 on the F1 score
• Improvement due to new algorithm and features
  • Ciaramita-Altun tagger vs ME tagger, on ISST-SST
  • about +10 on the F1 score

Conclusions
• Significant improvement in accuracy for SS tagging
• The tagger has been used to annotate the Italian Wikipedia
• Examples of queries made possible on the semantic index
  • Who feels emotions? (the subject of a verb.emotion)
  • What did Edison invent/create/discover …? (Edison as the subject of a verb.creation)
• Completion of the ISST-SST resource can further improve accuracy
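As a rough illustration of the second example query, the sketch below filters one annotated sentence for objects of a verb.creation whose subject is "Edison". The slides do not describe the semantic index itself, so the `Tok` record, its fields, the dependency labels "subj" and "obj", and the function `objects_of` are all hypothetical.

```python
from typing import List, NamedTuple


class Tok(NamedTuple):
    id: int          # 1-based position in the sentence
    form: str
    supersense: str  # e.g. "verb.creation", or "" if untagged
    head: int        # id of the syntactic head, 0 for the root
    deprel: str      # dependency label, assumed here to be "subj", "obj", ...


def objects_of(sentence: List[Tok], subject: str, verb_class: str) -> List[str]:
    """Forms of the objects of verbs of class verb_class that have the given subject."""
    by_id = {t.id: t for t in sentence}
    verbs = {
        t.head
        for t in sentence
        if t.form == subject
        and t.deprel == "subj"
        and t.head in by_id
        and by_id[t.head].supersense == verb_class
    }
    return [t.form for t in sentence if t.deprel == "obj" and t.head in verbs]


# Hypothetical annotated sentence: "Edison invented the phonograph".
SENTENCE = [
    Tok(1, "Edison", "noun.person", 2, "subj"),
    Tok(2, "invented", "verb.creation", 0, "root"),
    Tok(3, "the", "", 4, "det"),
    Tok(4, "phonograph", "noun.artifact", 2, "obj"),
]
```

Because the match is on the super sense rather than on a specific verb, the same call would also catch "created" or "built" as long as the tagger labels them verb.creation.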
