Thesaurus Design (from analysed corpora)

Pablo Gamallo, Alexandre Agustini, G.P. Lopes
{gamallo,aagustini}@di.fct.unl.pt
GLINt (Grupo de Língua Natural), FCT, Universidade Nova de Lisboa

Thesaurus design: Linguistic goals
fine <-> sanction
president <-> secretary
small <-> big
ministry <-> minister
bank <-> organisation

Thesaurus design: Properties
• Distributional Hypothesis: words sharing similar contexts are semantically related
• Types of context: simple co-occurrence (bigrams); co-occurrence within a window (n-grams); syntactic structures
• Domain-specific corpus

Thesaurus design: Steps
• Extraction of syntactic contexts from the corpus
• Similarity measure between words (based on their syntactic contexts)
• For each word, identify its most similar words

Extraction of syntactic contexts
• Tagging (PoS tags)
• Chunking (parsing into basic chunks)
• Attachment heuristics
• Identification of binary dependencies
• Extraction of syntactic contexts

Tagging and chunking
Sentence: Clinton sent a clear message to the authorities of Portugal
Tagger: Clinton_N sent_V a_ART clear_ADJ message_N to_PREP the_ART authorities_N of_PREP Portugal_N
Chunking: NP(Clinton) VP(send) NP(message, clear) PP(to, NP(authority)) PP(of, NP(portugal))

Attachment Heuristics and Syntactic Dependencies
• Attachment of basic chunks:
<NP(Clinton), VP(send)>
<VP(send), NP(message, clear)>
<NP(message, clear), PP(to, NP(authority))>
<NP(authority), PP(of, NP(portugal))>
• Binary dependencies:
<SUBJ, send, Clinton>
<DOBJ, send, message>
<TO, message, authority>
<OF, authority, portugal>
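The step from attached chunk pairs to binary dependencies can be sketched in Python. The tuple encoding of chunks and the three rules below (NP before VP gives SUBJ, VP before NP gives DOBJ, anything before a PP gives a relation named after the preposition) are assumptions read off this one example, not the authors' actual heuristics.

```python
def to_dependency(left, right):
    """Turn one attached chunk pair into a (REL, head, dependent) triple.

    Chunks are hypothetical tuples: ("NP", head), ("VP", head) or
    ("PP", preposition, head). Heads are assumed to be lemmas already.
    """
    left_head, right_head = left[-1], right[-1]
    if right[0] == "PP":
        # The preposition names the relation; the left chunk is the head.
        return (right[1].upper(), left_head, right_head)
    if left[0] == "NP" and right[0] == "VP":
        return ("SUBJ", right_head, left_head)
    if left[0] == "VP" and right[0] == "NP":
        return ("DOBJ", left_head, right_head)
    raise ValueError(f"no rule for {left[0]} + {right[0]}")


attachments = [
    (("NP", "Clinton"), ("VP", "send")),
    (("VP", "send"), ("NP", "message")),
    (("NP", "message"), ("PP", "to", "authority")),
    (("NP", "authority"), ("PP", "of", "portugal")),
]

dependencies = [to_dependency(left, right) for left, right in attachments]
for dep in dependencies:
    print(dep)
```

A real system would need many more rules and some way to resolve ambiguous PP attachment; the sketch only covers the four pairs of the example sentence.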

Syntactic Contexts
<DOBJ, send, message> : <DOBJ, send, (*)> and <DOBJ, (*), message>
<TO, message, authority> : <TO, message, (*)> and <TO, (*), authority>
<OF, authority, portugal> : <OF, authority, (*)> and <OF, (*), portugal>
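As a minimal sketch, each binary dependency can be split into its two contexts and counted per word. The function name `contexts_of` and the tuple representation are illustrative assumptions; the word that fills the (*) slot is the one the context is attributed to.

```python
from collections import Counter


def contexts_of(dependency):
    """Split a (REL, head, dependent) triple into (word, context) pairs."""
    rel, head, dependent = dependency
    return [
        (dependent, (rel, head, "*")),  # the dependent fills <REL, head, (*)>
        (head, (rel, "*", dependent)),  # the head fills <REL, (*), dependent>
    ]


dependencies = [
    ("DOBJ", "send", "message"),
    ("TO", "message", "authority"),
    ("OF", "authority", "portugal"),
]

# Frequency of every (word, context) pair in this tiny sample.
counts = Counter()
for dep in dependencies:
    for word, context in contexts_of(dep):
        counts[(word, context)] += 1

for (word, context), freq in counts.items():
    print(word, context, freq)
```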

Similarity Measure: Binary Jaccard coefficient
The similarity between two words is the ratio between the number of contexts common to both words and the total number of their contexts.
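A minimal Python sketch of this coefficient follows. It assumes that "the total number of their contexts" means the size of the union of the two context sets (the usual Jaccard definition); the sample sets are the subject contexts of Pedro and Maria from the micro-corpus slides.

```python
def binary_jaccard(contexts_a, contexts_b):
    """Shared contexts divided by all distinct contexts of the two words."""
    a, b = set(contexts_a), set(contexts_b)
    union = a | b
    if not union:
        return 0.0  # two words with no contexts at all
    return len(a & b) / len(union)


pedro = {("SUBJ", "read", "*"), ("SUBJ", "love", "*"), ("SUBJ", "eat", "*")}
maria = pedro | {("IOBJ-DE", "love", "*")}

print(binary_jaccard(pedro, maria))  # 0.75
```

Because every context counts the same here, a frequent and an incidental context contribute equally, which is the motivation for a weighted variant.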

Similarity Measure: Weighted Jaccard coefficient
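The worked example on the following slides suggests the definitions below. They are a reconstruction from those numbers, not a formula stated in the source: the symbols f(w, c) (frequency of word w in context c), C_w (set of distinct contexts of w) and W_c (set of words occurring in c) are introduced here for illustration, and base-10 logarithms are assumed because the slides give log(2) = 0.3.

```latex
\begin{align*}
LW(w, c) &= \log_{10} f(w, c) \\
GW(c) &= \frac{\log_{10} \sum_{w' \in W_c} \frac{f(w', c)}{|C_{w'}|}}{\log_{10} |W_c|} \\
W(w, c) &= GW(c) + LW(w, c) \\
WJ(w_1, w_2) &= \frac{\min\left( \sum_{c \in C_{w_1} \cap C_{w_2}} W(w_1, c),\ \sum_{c \in C_{w_1} \cap C_{w_2}} W(w_2, c) \right)}
                     {\max\left( \sum_{c \in C_{w_1}} W(w_1, c),\ \sum_{c \in C_{w_2}} W(w_2, c) \right)}
\end{align*}
```

Under this reading, GW(c) is undefined when only one word occurs in c, since the denominator becomes log(1) = 0; the example never hits that case.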

MicroCorpus
Pedro is reading a book and Maria is reading a book, Pedro is reading a novel and Maria read a novel yesterday, Pedro is reading a lot of things, but Pedro loves Maria, Maria loves books, in fact Maria loves a lot of things. Maria is eating an apple and Pedro is eating an apple too, Pedro ate eggs yesterday, Pedro eats a lot of things, Maria is eating eggs, Maria loves eggs a lot.

Thesaurus relations between nouns
Pedro <-> Maria
book <-> novel
apple <-> egg
thing <-> book, egg, apple, novel
(book <-> egg)?
(Maria <-> thing)??
(Pedro <-> egg)???

Extracting syntactic contexts of nouns
• Pedro: (<SUBJ, read, (*)>, 3) (<SUBJ, love, (*)>, 1) (<SUBJ, eat, (*)>, 3)
• Maria: (<SUBJ, read, (*)>, 2) (<SUBJ, love, (*)>, 3) (<SUBJ, eat, (*)>, 2) (<IOBJ-DE, love, (*)>, 1)
• novel: (<DOBJ, read, (*)>, 2)
• book: (<DOBJ, read, (*)>, 3) (<IOBJ-DE, love, (*)>, 1)
• thing: (<DOBJ, read, (*)>, 1) (<DOBJ, eat, (*)>, 1) (<IOBJ-DE, love, (*)>, 1)
• apple: (<DOBJ, eat, (*)>, 2)
• egg: (<DOBJ, eat, (*)>, 2) (<IOBJ-DE, love, (*)>, 1)

Computing the weight of a context for each word (1):
Pedro: (<SUBJ, read, (*)>, 3)
GW(<SUBJ, read, (*)>) = log(3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
LW(Pedro, <SUBJ, read, (*)>) = log(3) = 0.47
W(Pedro, <SUBJ, read, (*)>) = 1.03
Pedro: (<SUBJ, love, (*)>, 1)
GW(<SUBJ, love, (*)>) = log(1/3 + 3/4) / log(2) = 0.034 / 0.3 = 0.11
LW(Pedro, <SUBJ, love, (*)>) = log(1) = 0
W(Pedro, <SUBJ, love, (*)>) = 0.11
Pedro: (<SUBJ, eat, (*)>, 3)
GW(<SUBJ, eat, (*)>) = log(3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
LW(Pedro, <SUBJ, eat, (*)>) = log(3) = 0.47
W(Pedro, <SUBJ, eat, (*)>) = 1.03

Computing the weight of a context for each word (2):
Maria: (<SUBJ, read, (*)>, 2)
GW(<SUBJ, read, (*)>) = log(3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
LW(Maria, <SUBJ, read, (*)>) = log(2) = 0.3
W(Maria, <SUBJ, read, (*)>) = 0.86
Maria: (<SUBJ, love, (*)>, 3)
GW(<SUBJ, love, (*)>) = log(1/3 + 3/4) / log(2) = 0.034 / 0.3 = 0.11
LW(Maria, <SUBJ, love, (*)>) = log(3) = 0.47
W(Maria, <SUBJ, love, (*)>) = 0.58
Maria: (<SUBJ, eat, (*)>, 2)
GW(<SUBJ, eat, (*)>) = log(3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
LW(Maria, <SUBJ, eat, (*)>) = log(2) = 0.3
W(Maria, <SUBJ, eat, (*)>) = 0.86
Maria: (<IOBJ-DE, love, (*)>, 1)
GW(<IOBJ-DE, love, (*)>) = log(1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
LW(Maria, <IOBJ-DE, love, (*)>) = log(1) = 0
W(Maria, <IOBJ-DE, love, (*)>) = 0.31

Computing the weight of a context for each word (3):
novel: (<DOBJ, read, (*)>, 2)
GW(<DOBJ, read, (*)>) = log(2/1 + 3/2 + 1/3) / log(3) = 0.54 / 0.47 = 1.15
LW(novel, <DOBJ, read, (*)>) = log(2) = 0.3
W(novel, <DOBJ, read, (*)>) = 1.45
book: (<DOBJ, read, (*)>, 3)
GW(<DOBJ, read, (*)>) = log(2/1 + 3/2 + 1/3) / log(3) = 0.54 / 0.47 = 1.15
LW(book, <DOBJ, read, (*)>) = log(3) = 0.47
W(book, <DOBJ, read, (*)>) = 1.62
book: (<IOBJ-DE, love, (*)>, 1)
GW(<IOBJ-DE, love, (*)>) = log(1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
LW(book, <IOBJ-DE, love, (*)>) = log(1) = 0
W(book, <IOBJ-DE, love, (*)>) = 0.31

Computing the weight of a context for each word (4):
thing: (<DOBJ, read, (*)>, 1)
GW(<DOBJ, read, (*)>) = log(2/1 + 3/2 + 1/3) / log(3) = 0.54 / 0.47 = 1.15
LW(thing, <DOBJ, read, (*)>) = log(1) = 0
W(thing, <DOBJ, read, (*)>) = 1.15
thing: (<DOBJ, eat, (*)>, 1)
GW(<DOBJ, eat, (*)>) = log(1/3 + 2/1 + 2/2) / log(3) = 0.52 / 0.47 = 1.1
LW(thing, <DOBJ, eat, (*)>) = log(1) = 0
W(thing, <DOBJ, eat, (*)>) = 1.1
thing: (<IOBJ-DE, love, (*)>, 1)
GW(<IOBJ-DE, love, (*)>) = log(1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
LW(thing, <IOBJ-DE, love, (*)>) = log(1) = 0
W(thing, <IOBJ-DE, love, (*)>) = 0.31

Computing the weight of a context for each word (5):
apple: (<DOBJ, eat, (*)>, 2)
GW(<DOBJ, eat, (*)>) = log(1/3 + 2/1 + 2/2) / log(3) = 0.52 / 0.47 = 1.1
LW(apple, <DOBJ, eat, (*)>) = log(2) = 0.3
W(apple, <DOBJ, eat, (*)>) = 1.4
egg: (<DOBJ, eat, (*)>, 2)
GW(<DOBJ, eat, (*)>) = log(1/3 + 2/1 + 2/2) / log(3) = 0.52 / 0.47 = 1.1
LW(egg, <DOBJ, eat, (*)>) = log(2) = 0.3
W(egg, <DOBJ, eat, (*)>) = 1.4
egg: (<IOBJ-DE, love, (*)>, 1)
GW(<IOBJ-DE, love, (*)>) = log(1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
LW(egg, <IOBJ-DE, love, (*)>) = log(1) = 0
W(egg, <IOBJ-DE, love, (*)>) = 0.31
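The weight computations on these five slides can be reproduced with a short Python sketch. It assumes the reading suggested by the arithmetic: GW divides the log of the summed ratios (frequency of the context for a word over that word's number of distinct contexts) by the log of the number of words sharing the context, LW is the log of the frequency, and W is their sum, all in base 10. The dictionary layout and function names are illustrative. Exact logarithms give slightly different figures from the rounded ones on the slides (for example about 0.58 rather than 0.56 for the subject-of-read context).

```python
import math

# word -> {context: frequency}, copied from the context-extraction slide.
counts = {
    "Pedro": {"SUBJ read": 3, "SUBJ love": 1, "SUBJ eat": 3},
    "Maria": {"SUBJ read": 2, "SUBJ love": 3, "SUBJ eat": 2, "IOBJ-DE love": 1},
    "novel": {"DOBJ read": 2},
    "book": {"DOBJ read": 3, "IOBJ-DE love": 1},
    "thing": {"DOBJ read": 1, "DOBJ eat": 1, "IOBJ-DE love": 1},
    "apple": {"DOBJ eat": 2},
    "egg": {"DOBJ eat": 2, "IOBJ-DE love": 1},
}


def global_weight(context):
    """GW: log of summed freq / context-count ratios over log of word count."""
    words = [w for w in counts if context in counts[w]]
    if len(words) < 2:
        return 0.0  # assumption: avoid dividing by log(1) = 0
    total = sum(counts[w][context] / len(counts[w]) for w in words)
    return math.log10(total) / math.log10(len(words))


def local_weight(word, context):
    """LW: log of the frequency of the word in the context."""
    return math.log10(counts[word][context])


def weight(word, context):
    """W: global weight plus local weight, as the slides add them."""
    return global_weight(context) + local_weight(word, context)


for word, contexts in counts.items():
    for context in contexts:
        print(f"W({word}, {context}) = {weight(word, context):.2f}")
```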

Similarity between words (1)
WJ(Pedro, Maria) = 2.17 / 2.61 = 0.83
min( (1.03+0.11+1.03), (0.86+0.58+0.86) ) = 2.17
max( (1.03+0.11+1.03), (0.86+0.58+0.86+0.31) ) = 2.61
WJ(book, novel) = 1.45 / 1.93 = 0.75
min( (1.45), (1.62) ) = 1.45
max( (1.45), (1.62+0.31) ) = 1.93
WJ(book, thing) = 1.58 / 2.69 = 0.58
min( (1.62+0.33), (1.27+0.31) ) = 1.58
max( (1.62+0.33), (1.27+0.31+1.1) ) = 2.69

Similarity between words (2)
WJ(apple, egg) = 1.4 / 1.71 = 0.81
min( (1.4), (1.4) ) = 1.4
max( (1.4), (1.4+0.31) ) = 1.71
WJ(apple, thing) = 1.1 / 2.68 = 0.41
min( (1.4), (1.1) ) = 1.1
max( (1.4), (1.27+0.31+1.1) ) = 2.68
WJ(novel, thing) = 1.1 / 2.68 = 0.41
min( (1.45), (1.1) ) = 1.1
max( (1.45), (1.27+0.31+1.1) ) = 2.68
WJ(egg, thing) = 1.41 / 2.68 = 0.51
min( (1.4+0.25), (1.1+0.31) ) = 1.41
max( (1.4+0.25), (1.27+0.31+1.1) ) = 2.68

Similarity between words (3)
WJ(Maria, thing) = 0.31 / 2.68 = 0.09
min( (0.31), (0.31) ) = 0.31
max( (0.86+0.58+0.86+0.31), (1.27+0.31+1.1) ) = 2.68
WJ(Maria, egg) = 0.31 / 2.61 = 0.11
min( (0.31), (0.31) ) = 0.31
max( (0.86+0.58+0.86+0.31), (1.4+0.31) ) = 2.61
WJ(book, egg) = 0.31 / 1.93 = 0.16
min( (0.31), (0.31) ) = 0.31
max( (1.62+0.31), (1.4+0.31) ) = 1.93
WJ(Pedro, thing) = 0 / 2.62 = 0
WJ(novel, egg) = 0 / 1.65 = 0
WJ(book, apple) = 0 / 1.87 = 0
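The min/max pattern used on these similarity slides can be sketched in Python as follows. This follows the slides' arithmetic (the smaller of the two weight sums over the shared contexts, divided by the larger of the two total weight sums), which differs from the textbook weighted Jaccard that sums per-context minima and maxima. The function name and the dictionary layout are illustrative; the weights are the rounded values given on the weight slides.

```python
def weighted_jaccard(weights_a, weights_b):
    """Slide-style WJ: min of shared-context sums over max of total sums."""
    common = weights_a.keys() & weights_b.keys()
    if not common:
        return 0.0  # no shared context, similarity is zero
    shared = min(
        sum(weights_a[c] for c in common),
        sum(weights_b[c] for c in common),
    )
    total = max(sum(weights_a.values()), sum(weights_b.values()))
    return shared / total


pedro = {"SUBJ read": 1.03, "SUBJ love": 0.11, "SUBJ eat": 1.03}
maria = {"SUBJ read": 0.86, "SUBJ love": 0.58, "SUBJ eat": 0.86,
         "IOBJ-DE love": 0.31}
apple = {"DOBJ eat": 1.4}
egg = {"DOBJ eat": 1.4, "IOBJ-DE love": 0.31}

print(round(weighted_jaccard(pedro, maria), 2))  # 0.83
print(weighted_jaccard(apple, egg))
print(weighted_jaccard(pedro, apple))
```

For apple and egg the exact quotient is 1.4 / 1.71, which the slides report truncated to 0.81.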

Similarity between words (Sorting)
(0.83) Pedro <-> Maria
(0.81) apple <-> egg
(0.75) book <-> novel
(0.58) thing <-> book
(0.51) thing <-> egg
(0.41) thing <-> apple, novel
(0.16) book <-> egg
(0.11) Maria <-> egg
(0.09) Maria <-> thing
(0.0) Pedro <-> egg
(0.0) novel <-> egg
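The last step, picking the most similar words for each word, can be sketched from the sorted scores above. The cut-off `k` and the decision to drop zero-score pairs are assumptions made for illustration; the deck does not state how many neighbours are kept.

```python
from collections import defaultdict

# Non-zero pairwise scores as listed on the sorting slide.
scores = {
    ("Pedro", "Maria"): 0.83,
    ("apple", "egg"): 0.81,
    ("book", "novel"): 0.75,
    ("thing", "book"): 0.58,
    ("thing", "egg"): 0.51,
    ("thing", "apple"): 0.41,
    ("thing", "novel"): 0.41,
    ("book", "egg"): 0.16,
    ("Maria", "egg"): 0.11,
    ("Maria", "thing"): 0.09,
}


def most_similar(scores, k=3):
    """Return, for each word, its k highest-scoring neighbours."""
    neighbours = defaultdict(list)
    for (a, b), score in scores.items():
        if score > 0:
            # Similarity is symmetric, so record the pair in both directions.
            neighbours[a].append((score, b))
            neighbours[b].append((score, a))
    return {
        word: [n for _, n in sorted(pairs, key=lambda p: -p[0])[:k]]
        for word, pairs in neighbours.items()
    }


thesaurus = most_similar(scores)
for word, similar in thesaurus.items():
    print(f"{word} | {similar}")
```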

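Lists of similar words
Corpus “Procuradoria Geral da República” (P.G.R.)
• juiz | {dirigente, presidente, subinspector, governador, árbitros}
  (judge | {director, president, sub-inspector, governor, arbitrators})
• diploma | {decreto, lei, artigo, convenção, regulamento}
  (legal instrument | {decree, law, article, convention, regulation})
• decreto | {diploma, lei, artigo, nº, código}
  (decree | {legal instrument, law, article, no., code})
• regulamento | {estatuto, código, sistema, decreto, norma}
  (regulation | {statute, code, system, decree, norm})
• regra | {norma, princípio, regime, legislação, plano}
  (rule | {norm, principle, regime, legislation, plan})
• renda | {caução, indemnização, reintegração, multa, quota}
  (rent | {security deposit, compensation, reinstatement, fine, quota})
• conceito | {noção, estatuto, regime, temática, montante}
  (concept | {notion, statute, regime, subject matter, amount})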
