Estimating Supersenses with Conditional Random Fields
Frank Reichartz, Gerhard Paaß
Knowledge Discovery, Fraunhofer IAIS, St. Augustin, Germany
Agenda
→ Introduction
• Models for Supersenses
• Conditional Random Fields
• Lumped Observations
• Summary
Use Case Contentus
• Digitize a multimedia collection of the German National Library
• Music of the former GDR
• Digitize
• Quality control
• Metadata collection
• Semantic indexing
• Semantic search engine
Target
• Provide content: text, score sheets, video, images, speech
• Generate metadata: composers, premiere, director, artists, …
• Extract entities: dates, places, relations, composers, pieces of music, …
• Assign meanings to words and phrases: use an ontology
WordNet as Ontology
• WordNet is a fine-grained word sense hierarchy
• The same word may have different senses: bank = financial institution, bank = river boundary
• Defines senses (synsets) for verbs, common & proper nouns, adjectives, adverbs
Target: assign each word to a synset
• Easy semantic indexing & retrieval
Fine-Grained Word Senses
• Example: senses of the noun "blow"
• Very subtle differences between senses
Hierarchy of Hypernyms
• Supersense level
• Fewer distinctions
• Retains the main differences
Target: assign verbs / nouns to a supersense
List of Supersenses
• 26 noun supersenses, 15 verb supersenses (table of supersense labels omitted)
Supersenses Discriminate Between Many Synsets
Noun blow: 7 synsets, 5 supersenses
Verb blow: 22 synsets, 9 supersenses
Sufficient for coarse disambiguation
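A quick way to reproduce this kind of count with NLTK's WordNet interface: synset.lexname() returns the supersense (lexicographer file) label. This is a minimal sketch; exact counts depend on the WordNet version, the slide reports 7 synsets / 5 supersenses for the noun.

```python
from nltk.corpus import wordnet as wn   # requires the 'wordnet' corpus to be downloaded

# All noun synsets of "blow" and the set of supersenses they belong to
noun_synsets = wn.synsets("blow", pos=wn.NOUN)
noun_supersenses = {s.lexname() for s in noun_synsets}   # e.g. 'noun.act', 'noun.event', ...
print(len(noun_synsets), "noun synsets,", len(noun_supersenses), "noun supersenses")

# Same for the verb
verb_synsets = wn.synsets("blow", pos=wn.VERB)
verb_supersenses = {s.lexname() for s in verb_synsets}
print(len(verb_synsets), "verb synsets,", len(verb_supersenses), "verb supersenses")
```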
Agenda
• Introduction
→ Models for Supersenses
• Conditional Random Fields
• Lumped Observations
• Summary
Training Data: SemCor Dataset

Token        Lemma        POS    Synset     Supersense
A            A            DT
compromise   compromise   NN     1190419    noun.communication
will         will         MD
leave        leave        VB     2610151    verb.stative
both         both         DT
sides        side         NN     8294366    noun.group
without      without      IN
the          the          DT
glow         glow         NN     13864852   noun.state
of           of           IN
triumph      triumph      NN     7425691    noun.feeling
,            ,            PUNC
but          but          CC
it           it           PRP
will         will         MD
save         save         VB     2526596    verb.social
Berlin       location     NNP    26074      noun.location
.            .            PUNC

Input: token, lemma, POS. Output: synset, supersense.
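A minimal sketch of how one such sentence could be held in code. The tuple layout and the None placeholder for unannotated tokens are illustrative choices, not the SemCor file format.

```python
# One SemCor sentence as (token, lemma, pos, supersense) tuples;
# tokens without a WordNet sense carry None as their label.
sentence = [
    ("A", "A", "DT", None),
    ("compromise", "compromise", "NN", "noun.communication"),
    ("will", "will", "MD", None),
    ("leave", "leave", "VB", "verb.stative"),
    ("both", "both", "DT", None),
    ("sides", "side", "NN", "noun.group"),
    # ... remaining tokens of the sentence
]
tokens = [t[0] for t in sentence]   # input word sequence X
labels = [t[3] for t in sentence]   # output supersense sequence Y
```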
Prior Work: Classifiers
• Bag-of-words is not sufficient: encode relative positions
• Use classifiers: MaxEnt, SVM, Naive Bayes, kNN
• Proc. SemEval 2007, Coarse-Grained English All-Words Task
Prior Work: Sequence Modelling
Ciaramita & Altun 06: perceptron-trained HMM
• Maximize predictive performance on the training set
• Ignore ambiguity: use only the most frequent sense
Deschacht & Moens 06: Conditional Random Field
• Exploit the hierarchy to model many classes
• Applied to fine-grained word senses: good results
Agenda
• Introduction
• Models for Supersenses
→ Conditional Random Fields
• Lumped Observations
• Summary
Definition of Conditional Random Fields
• observed words / features X1,…,Xn
• states Y1,…,Yn
• each state Yt may be influenced by many of the X1,…,Xn → features
Definition: Let G = (V,E) be a graph such that Y = (Yt), t ∈ V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field if, conditioned on X, the random variables Yt obey the Markov property with respect to the graph.
• Hammersley-Clifford theorem: the conditional probability can be written as a product of potential functions
• The variable sets of the potentials are the cliques of the graph
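A compact statement of the two properties above, in the notation of the definition (N(t) for the graph neighbours of t, ψ_C for the clique potentials and Z(X) for the normalizer are assumed notation, not introduced on the slide):

```latex
% Markov property: conditioned on X, each Y_t depends only on its graph neighbours
p(Y_t \mid X,\, Y_s,\, s \neq t) \;=\; p(Y_t \mid X,\, Y_s,\, s \in N(t))

% Hammersley-Clifford: the conditional distribution factorizes over the cliques C of G
p(Y \mid X) \;=\; \frac{1}{Z(X)} \prod_{C} \psi_C(Y_C, X),
\qquad Z(X) \;=\; \sum_{Y'} \prod_{C} \psi_C(Y'_C, X)
```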
Simplification: Sequential Chain
• observed words / features X1,…,Xn
• states Y1,…,Yn
• features may involve two consecutive states and all observed words
Examples
• a feature has value 1 if Yt-1 = "other" and Yt = "location" and Xt has POS tag "proper name" and (Xt-2, Xt-1) = "arrived in"; otherwise its value is 0
• estimated POS tags, noun phrase tags, weekdays, amounts, etc.
• prefixes, suffixes; matching regular expressions for capitalization, etc.
• information from lexica, lists of proper names
[Lafferty, McCallum, Pereira 01]
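The example feature above can be written as a binary indicator function over two consecutive states and the observation sequence. A minimal sketch, assuming a list of (word, POS tag) pairs for the sentence and the Penn-style tag NNP for proper nouns; the function name and representation are illustrative, not from the slides:

```python
def feature_loc_arrived_in(y_prev, y_t, x, t):
    """Binary transition feature from the slide: fires (value 1) iff the
    previous state is "other", the current state is "location", the current
    token is tagged as a proper noun, and the two preceding tokens are
    "arrived in". x is a list of (word, pos_tag) pairs, t the current index."""
    if y_prev != "other" or y_t != "location":
        return 0
    if x[t][1] != "NNP":        # proper-noun POS tag (Penn tagset assumed)
        return 0
    if t < 2:
        return 0
    return 1 if (x[t - 2][0], x[t - 1][0]) == ("arrived", "in") else 0
```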
Derivative of Likelihood
Estimate the optimal parameter vector λ for the training set (Xd, Yd), d = 1,…,D.
Gradient = observed feature value − expected feature value
How can we calculate the expected feature values?
• need, for every document d and position t, the probability p(Yd,t = i | Xd)
• need, for every d and consecutive positions t, t+1, the probability p(Yd,t = i, Yd,t+1 = j | Xd)
Use the forward-backward algorithm, as for the hidden Markov model.
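Written out for a linear-chain CRF with features f_k and weights λ_k, the gradient is the difference between observed and expected feature counts; this is the standard form, reconstructed here rather than copied from the slide. The pairwise marginals in the second term are exactly the quantities supplied by forward-backward.

```latex
\frac{\partial \mathcal{L}}{\partial \lambda_k}
 \;=\; \sum_{d=1}^{D} \sum_{t} f_k\bigl(y_{d,t-1},\, y_{d,t},\, x_d,\, t\bigr)
 \;-\; \sum_{d=1}^{D} \sum_{t} \sum_{i,j}
     p\bigl(Y_{d,t-1}=i,\, Y_{d,t}=j \mid x_d\bigr)\, f_k\bigl(i,\, j,\, x_d,\, t\bigr)
```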
Optimization Procedure: Gradient Ascent
Gradient = observed feature value − expected feature value + regularization
• Regularization term: use small weight values, if possible → smaller generalization error; Bayesian prior distribution: Gaussian, Laplace, etc.
• Use a gradient-based optimizer, e.g. conjugate gradient or BFGS → approximate quadratic optimization
• Use stochastic gradient
[Sutton, McCallum 06]
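For the Gaussian prior mentioned above, the regularized objective takes the usual penalized log-likelihood form; this is a reconstruction of the standard formula (σ² is the prior variance), not copied from the slide:

```latex
\mathcal{L}_{reg}(\lambda)
 \;=\; \sum_{d=1}^{D} \log p_{\lambda}\bigl(y_d \mid x_d\bigr)
 \;-\; \frac{\lVert \lambda \rVert^2}{2\sigma^2}
```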
Features
• Lemmas of verbs at previous positions
• Part-of-speech tags at lags −2|−1|0|1|2
• Coarse POS tags at lags −2|−1|0|1|2
• Three-letter prefixes at lags −1|0|1
• Three-letter suffixes at lags −1|0|1
• INITCAP at lags −1|0|1
• ALLDIGITS at lags −1|0|1
• ALLCAPS at lags −1|0|1
• MIXEDCAPS at lags −1|0|1
• CONTAINSDASH at lags −1|0|1
• Class ID of an unsupervised LDA topic model with 50 classes
SemCor training set: ~20k sentences, 5-fold cross-validation
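A sketch of per-token feature extraction in the spirit of this list. The feature names and the dict representation are illustrative, not the authors' actual feature set or implementation:

```python
def token_features(tokens, pos_tags, t):
    """Extract lag-based POS, prefix/suffix, and word-shape features for
    position t. tokens and pos_tags are parallel lists for one sentence."""
    feats = {}
    for lag in (-2, -1, 0, 1, 2):
        i = t + lag
        if 0 <= i < len(tokens):
            feats[f"pos[{lag}]"] = pos_tags[i]
    for lag in (-1, 0, 1):
        i = t + lag
        if 0 <= i < len(tokens):
            w = tokens[i]
            feats[f"prefix3[{lag}]"] = w[:3]
            feats[f"suffix3[{lag}]"] = w[-3:]
            feats[f"initcap[{lag}]"] = w[:1].isupper()
            feats[f"alldigits[{lag}]"] = w.isdigit()
            feats[f"allcaps[{lag}]"] = w.isupper()
            feats[f"containsdash[{lag}]"] = "-" in w
    return feats
```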
Results for Nouns
• Different F1 values per supersense: event 67.0%, Tops 98.2%
• Micro-average: 83.5%
• Macro-average: 77.9%
• Different frequencies of training examples: motive 133, artifact 8894
Comparison to Prior Result
Micro-average:
• Ciaramita & Altun 06: 77.18% (s = 0.45)
• CRF: 83.5% (s = 0.11)
~28% reduction of error
Agenda
• Introduction
• Models for Supersenses
• Conditional Random Fields
→ Lumped Observations
• Summary
Need for More Data
• WordNet covers more than 100,000 synsets
• Few examples per supersense: higher training error
• Many examples are required to train each synset; SemCor: ~20k sentences
• Manual labelling is costly
Exploit restrictions in WordNet
• Each word has only a subset of possible supersenses
• blow: n.act, n.event, n.phenomenon, n.artifact, n.act
• Unlabeled data: assign the possible supersenses to each word
Specialized CRF required
Conditional Random Field with Lumped Supersenses
• observed words / features X1,…,Xn
• states Y1,…,Yn with Yt ∈ {h1,…,hk}
• observations: subsets of supersenses Yt ⊆ {h1,…,hk}
• an observation (X1,…,Xn; Y1,…,Yn) is consistent with a large number of state sequences
• adapt the likelihood computation
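One way to write the adapted likelihood, reconstructed under the assumption that the lumped observation provides a set S_t of admissible supersenses at each position (the slide itself does not spell out the formula): the probability of the observation is the marginal over all state sequences consistent with the sets, which a forward pass restricted to the admissible states can compute.

```latex
p\bigl(Y_1 \in S_1, \ldots, Y_n \in S_n \mid X\bigr)
 \;=\; \sum_{y_1 \in S_1} \cdots \sum_{y_n \in S_n} p\bigl(y_1, \ldots, y_n \mid X\bigr)
```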
Training Data: SemCor Dataset with Possible Supersenses

Token        Lemma        POS    Possible Supersenses
A            A            DT
compromise   compromise   NN     n.communication, n.act
will         will         MD
leave        leave        VB     v.stative, v.motion, v.cogn., v.change, v.social, v.possession
both         both         DT
sides        side         NN     n.group, n.location, n.body, n.artifact, n.cogn., n.food, n.communication, n.object, n.event
without      without      IN
the          the          DT
glow         glow         NN     n.state, n.attribute, n.phenomenon, n.feeling
of           of           IN
triumph      triumph      NN     n.feeling, n.event
,            ,            PUNC
but          but          CC
it           it           PRP
will         will         MD
save         save         VB     v.social, v.possession, v.change
Berlin       location     NNP    n.location, n.person, n.artifact
.            .            PUNC
Results for Lumped Supersenses: SemCor
Work in progress
• Simulate lumped supersenses
• Determine the possible supersenses for SemCor
• Use different fractions of annotated / possible supersenses
• Supersenses estimated without annotations: only 0.8% reduction of F-value
Agenda
• Theseus Overview
• Use Case Contentus
• Core Technology Cluster
• Supersense Tagging
→ Summary & Conclusions
Summary
• Sequence models are able to extract supersenses
• New features like topic models help
• We can use non-annotated texts by exploiting restrictions in the ontology
• Chance to improve the classifiers considerably
• May enhance higher-order IE and information retrieval
To do
• Apply to lower levels of the hierarchy
• Detect new senses / supersenses of words in WordNet