
Thesaurus Design (from analysed corpora)


Presentation Transcript


  1. Thesaurus Design (from analysed corpora) Pablo Gamallo, Alexandre Agustini, G.P. Lopes {gamallo,aagustini}@di.fct.unl.pt GLINt (Grupo de Lingua Natural) FCT, Universidade Nova de Lisboa

  2. Thesaurus design: Linguistic goals. Examples: fine ≈ sanction, president ≈ secretary, small ≈ big, ministry ≈ minister, bank ≈ organisation

  3. Thesaurus design: Properties • Distributional Hypothesis: words sharing similar contexts are semantically related • Types of context: simple co-occurrence (bigrams); co-occurrence within a window (n-grams); syntactic structures • Domain-specific corpus
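To make the three context types concrete, here is a small illustrative sketch in Python (not from the slides); the token list and the syntactic triples are hard-coded for the example sentence used later on slide 6.

```python
# Illustrative sketch (not from the slides) of the three context types,
# using the example sentence from slide 6. The syntactic triples are
# hard-coded here; in the real system they come from the parsing pipeline.
tokens = ["Clinton", "sent", "a", "clear", "message", "to",
          "the", "authorities", "of", "Portugal"]

# 1) Simple co-occurrence: adjacent word pairs (bigrams).
bigrams = list(zip(tokens, tokens[1:]))

# 2) Co-occurrence within a window: pairs at most `window` positions apart.
window = 3
window_pairs = [(tokens[i], tokens[j])
                for i in range(len(tokens))
                for j in range(i + 1, min(i + window + 1, len(tokens)))]

# 3) Syntactic structures: (relation, head, dependent) triples produced by
#    the chunking and attachment steps of the following slides.
syntactic = [("SUBJ", "send", "Clinton"),
             ("DOBJ", "send", "message"),
             ("TO", "message", "authority"),
             ("OF", "authority", "portugal")]

print(bigrams[:3])
print(window_pairs[:3])
print(syntactic)
```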

  4. Thesaurus design: Steps • Extraction of syntactic contexts from the corpus • Similarity measure between words (based on their syntactic contexts) • For each word, identify its most similar words
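As a rough end-to-end sketch of steps 2 and 3 (step 1 is assumed done), the following Python fragment ranks words by the plain binary Jaccard similarity of slide 9 over sets of contexts; the data uses a few of the noun/context pairs that appear later on slide 13, with contexts abbreviated as "RELATION:verb", and the helper most_similar is illustrative, not the authors' code.

```python
# Minimal sketch of steps 2 and 3, assuming step 1 already produced the set
# of syntactic contexts of each word.
contexts = {
    "Pedro": {"SUBJ:read", "SUBJ:love", "SUBJ:eat"},
    "Maria": {"SUBJ:read", "SUBJ:love", "SUBJ:eat", "IOBJ-DE:love"},
    "novel": {"DOBJ:read"},
    "book":  {"DOBJ:read", "IOBJ-DE:love"},
}

def jaccard(a, b):
    """Step 2: binary Jaccard, i.e. shared contexts / total contexts (slide 9)."""
    return len(a & b) / len(a | b)

def most_similar(word, k=3):
    """Step 3: rank the other words by their similarity to `word`."""
    scores = [(jaccard(contexts[word], ctx), other)
              for other, ctx in contexts.items() if other != word]
    return sorted(scores, reverse=True)[:k]

print(most_similar("Pedro"))   # Maria first, then the unrelated nouns
```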

  5. Extraction of syntactic contexts • Tagging (PoS tags) • Chunking (parsing into basic chunks) • Attachment heuristics • Identification of binary dependencies • Extraction of syntactic contexts

  6. Tagging and chunking
  Sentence: Clinton sent a clear message to the authorities of Portugal
  Tagger: Clinton_N sent_V a_ART clear_ADJ message_N to_PREP the_ART authorities_N of_PREP Portugal_N
  Chunking: NP(Clinton) VP(send) NP(message, clear) PP(to, NP(authority)) PP(of, NP(portugal))

  7. Attachment Heuristics and Syntactic Dependencies
  Attachment of basic chunks:
  • <NP(Clinton), VP(send)>
  • <VP(send), NP(message, clear)>
  • <NP(message, clear), PP(to, NP(authority))>
  • <NP(authority), PP(of, NP(portugal))>
  Binary dependencies:
  • <SUBJ, send, Clinton>
  • <DOBJ, send, message>
  • <TO, message, authority>
  • <OF, authority, portugal>
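A simplified sketch of how an attached chunk pair might be rewritten as a binary dependency; the three rules below are illustrative guesses, not the authors' actual heuristics.

```python
# Simplified sketch (not the authors' actual heuristics) of rewriting an
# attached chunk pair as a binary dependency triple.
def to_dependency(left, right):
    """left/right are ('NP'|'VP'|'PP', head, ...) tuples of attached chunks."""
    if left[0] == "NP" and right[0] == "VP":    # NP attached to VP -> SUBJ
        return ("SUBJ", right[1], left[1])
    if left[0] == "VP" and right[0] == "NP":    # VP attached to NP -> DOBJ
        return ("DOBJ", left[1], right[1])
    if right[0] == "PP":                        # PP attachment -> relation named by the preposition
        return (right[1].upper(), left[1], right[2])
    raise ValueError("no rule for this chunk pair")

pairs = [(("NP", "Clinton"), ("VP", "send")),
         (("VP", "send"), ("NP", "message")),
         (("NP", "message"), ("PP", "to", "authority")),
         (("NP", "authority"), ("PP", "of", "portugal"))]

for left, right in pairs:
    print(to_dependency(left, right))
# ('SUBJ', 'send', 'Clinton'), ('DOBJ', 'send', 'message'),
# ('TO', 'message', 'authority'), ('OF', 'authority', 'portugal')
```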

  8. Syntactic Contexts
  <DOBJ, send, message> : <DOBJ, send, (*)>  <DOBJ, (*), message>
  <TO, message, authority> : <TO, message, (*)>  <TO, (*), authority>
  <OF, authority, portugal> : <OF, authority, (*)>  <OF, (*), portugal>
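The mapping from a dependency to its two syntactic contexts is mechanical; a minimal sketch:

```python
# Each binary dependency yields two syntactic contexts, one per argument
# position, by replacing one of the two related words with the placeholder (*).
def contexts_of(dependency):
    rel, head, dep = dependency
    return [(rel, head, "(*)"),   # the context in which the dependent occurs
            (rel, "(*)", dep)]    # the context in which the head occurs

print(contexts_of(("DOBJ", "send", "message")))
# [('DOBJ', 'send', '(*)'), ('DOBJ', '(*)', 'message')]
```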

  9. Similarity Measure: Binary Jaccard coefficient. The similarity between two words is the ratio between the number of contexts common to both words and the total number of their contexts.
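In symbols, writing C(w) for the set of syntactic contexts of word w, this is the standard binary Jaccard coefficient:

\[
J(w_1, w_2) = \frac{|C(w_1) \cap C(w_2)|}{|C(w_1) \cup C(w_2)|}
\]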

  10. Similarity Measure: Weighted Jaccard coefficient
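The formula itself is not preserved in this transcript. A reconstruction that matches the worked computations on slides 19-21 (each numerator sums a word's weights over the contexts shared by the two words, each denominator over all of that word's contexts) is:

\[
WJ(w_1, w_2) =
\frac{\min\Bigl(\sum_{c \in C_{12}} W(w_1, c),\ \sum_{c \in C_{12}} W(w_2, c)\Bigr)}
     {\max\Bigl(\sum_{c \in C(w_1)} W(w_1, c),\ \sum_{c \in C(w_2)} W(w_2, c)\Bigr)},
\qquad C_{12} = C(w_1) \cap C(w_2),
\]

where W(w, c) is the weight of context c for word w, computed on slides 14-18.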

  11. MicroCorpus Pedro is reading a book and Maria is reading a book, Pedro is reading a novel and Maria read a novel yesterday, Pedro is reading a lot of things, but Pedro loves Maria, Maria loves books, in fact Maria loves a lot of things. Maria is eating an apple and Pedro is eating an apple too, Pedro ate eggs yesterday, Pedro eats a lot of things, Maria is eating eggs, Maria loves eggs a lot.

  12. Thesaurus relations between nouns
  Pedro ≈ Maria
  book ≈ novel
  apple ≈ egg
  thing ≈ book, egg, apple, novel
  (book ≈ egg)? (Maria ≈ thing)?? (Pedro ≈ egg)???

  13. Extracting syntactic contexts of nouns
  • Pedro: (<SUBJ, read, (*)>, 3) (<SUBJ, love, (*)>, 1) (<SUBJ, eat, (*)>, 3)
  • Maria: (<SUBJ, read, (*)>, 2) (<SUBJ, love, (*)>, 3) (<SUBJ, eat, (*)>, 2) (<IOBJ-DE, love, (*)>, 1)
  • novel: (<DOBJ, read, (*)>, 2)
  • book: (<DOBJ, read, (*)>, 3) (<IOBJ-DE, love, (*)>, 1)
  • thing: (<DOBJ, read, (*)>, 1) (<DOBJ, eat, (*)>, 1) (<IOBJ-DE, love, (*)>, 1)
  • apple: (<DOBJ, eat, (*)>, 2)
  • egg: (<DOBJ, eat, (*)>, 2) (<IOBJ-DE, love, (*)>, 1)

  14. Computing the weight of a context for each word (1):
  Pedro: (<SUBJ, read, (*)>, 3)
    GW(<SUBJ, read, (*)>) = log(3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
    LW(Pedro, <SUBJ, read, (*)>) = log(3) = 0.47
    W(Pedro, <SUBJ, read, (*)>) = 1.03
  Pedro: (<SUBJ, love, (*)>, 1)
    GW(<SUBJ, love, (*)>) = log(1/3 + 3/4) / log(2) = 0.034 / 0.3 = 0.11
    LW(Pedro, <SUBJ, love, (*)>) = log(1) = 0
    W(Pedro, <SUBJ, love, (*)>) = 0.11
  Pedro: (<SUBJ, eat, (*)>, 3)
    GW(<SUBJ, eat, (*)>) = log(3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
    LW(Pedro, <SUBJ, eat, (*)>) = log(3) = 0.47
    W(Pedro, <SUBJ, eat, (*)>) = 1.03
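The weighting scheme behind these figures is not spelled out in the transcript; the following formulation, with base-10 logarithms, reproduces them up to small differences in the rounded intermediate values. The notations f, ctx and words are introduced here, not on the slides:

\[
LW(w, c) = \log f(w, c), \qquad
GW(c) = \frac{\log \sum_{w'} \dfrac{f(w', c)}{ctx(w')}}{\log\, words(c)}, \qquad
W(w, c) = GW(c) + LW(w, c),
\]

where f(w, c) is the number of occurrences of word w in context c, ctx(w') is the number of distinct contexts of word w', words(c) is the number of distinct words observed in context c, and the sum runs over those words. For example, GW(<SUBJ, read, (*)>) = log(3/3 + 2/4) / log(2) because that context is seen with two words, Pedro (3 occurrences, 3 contexts) and Maria (2 occurrences, 4 contexts).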

  15. Computing the weight of a context for each word (2):
  Maria: (<SUBJ, read, (*)>, 2)
    GW(<SUBJ, read, (*)>) = log(3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
    LW(Maria, <SUBJ, read, (*)>) = log(2) = 0.3
    W(Maria, <SUBJ, read, (*)>) = 0.86
  Maria: (<SUBJ, love, (*)>, 3)
    GW(<SUBJ, love, (*)>) = log(1/3 + 3/4) / log(2) = 0.034 / 0.3 = 0.11
    LW(Maria, <SUBJ, love, (*)>) = log(3) = 0.47
    W(Maria, <SUBJ, love, (*)>) = 0.58
  Maria: (<SUBJ, eat, (*)>, 2)
    GW(<SUBJ, eat, (*)>) = log(3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
    LW(Maria, <SUBJ, eat, (*)>) = log(2) = 0.3
    W(Maria, <SUBJ, eat, (*)>) = 0.86
  Maria: (<IOBJ-DE, love, (*)>, 1)
    GW(<IOBJ-DE, love, (*)>) = log(1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
    LW(Maria, <IOBJ-DE, love, (*)>) = log(1) = 0
    W(Maria, <IOBJ-DE, love, (*)>) = 0.31

  16. Computing the weight of a context for each word (3):
  novel: (<DOBJ, read, (*)>, 2)
    GW(<DOBJ, read, (*)>) = log(2/1 + 3/2 + 1/3) / log(3) = 0.54 / 0.47 = 1.15
    LW(novel, <DOBJ, read, (*)>) = log(2) = 0.3
    W(novel, <DOBJ, read, (*)>) = 1.45
  book: (<DOBJ, read, (*)>, 3)
    GW(<DOBJ, read, (*)>) = log(2/1 + 3/2 + 1/3) / log(3) = 0.54 / 0.47 = 1.15
    LW(book, <DOBJ, read, (*)>) = log(3) = 0.47
    W(book, <DOBJ, read, (*)>) = 1.62
  book: (<IOBJ-DE, love, (*)>, 1)
    GW(<IOBJ-DE, love, (*)>) = log(1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
    LW(book, <IOBJ-DE, love, (*)>) = log(1) = 0
    W(book, <IOBJ-DE, love, (*)>) = 0.31

  17. Computing the weight of a context for each word (4):
  thing: (<DOBJ, read, (*)>, 1)
    GW(<DOBJ, read, (*)>) = log(2/1 + 3/2 + 1/3) / log(3) = 0.54 / 0.47 = 1.15
    LW(thing, <DOBJ, read, (*)>) = log(1) = 0
    W(thing, <DOBJ, read, (*)>) = 1.15
  thing: (<DOBJ, eat, (*)>, 1)
    GW(<DOBJ, eat, (*)>) = log(1/3 + 2/1 + 2/2) / log(3) = 0.52 / 0.47 = 1.1
    LW(thing, <DOBJ, eat, (*)>) = log(1) = 0
    W(thing, <DOBJ, eat, (*)>) = 1.1
  thing: (<IOBJ-DE, love, (*)>, 1)
    GW(<IOBJ-DE, love, (*)>) = log(1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
    LW(thing, <IOBJ-DE, love, (*)>) = log(1) = 0
    W(thing, <IOBJ-DE, love, (*)>) = 0.31

  18. Computing the weight of a context for each word (5):
  apple: (<DOBJ, eat, (*)>, 2)
    GW(<DOBJ, eat, (*)>) = log(1/3 + 2/1 + 2/2) / log(3) = 0.52 / 0.47 = 1.1
    LW(apple, <DOBJ, eat, (*)>) = log(2) = 0.3
    W(apple, <DOBJ, eat, (*)>) = 1.4
  egg: (<DOBJ, eat, (*)>, 2)
    GW(<DOBJ, eat, (*)>) = log(1/3 + 2/1 + 2/2) / log(3) = 0.52 / 0.47 = 1.1
    LW(egg, <DOBJ, eat, (*)>) = log(2) = 0.3
    W(egg, <DOBJ, eat, (*)>) = 1.4
  egg: (<IOBJ-DE, love, (*)>, 1)
    GW(<IOBJ-DE, love, (*)>) = log(1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
    LW(egg, <IOBJ-DE, love, (*)>) = log(1) = 0
    W(egg, <IOBJ-DE, love, (*)>) = 0.31
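As a cross-check, here is a short Python script that recomputes these weights from the counts of slide 13, assuming the base-10 formulation reconstructed after slide 14; because the slides add pre-rounded intermediate values, the results can differ slightly in the second decimal.

```python
import math

# (word, context) -> frequency, from slide 13; contexts abbreviated "REL:verb".
freq = {
    ("Pedro", "SUBJ:read"): 3, ("Pedro", "SUBJ:love"): 1, ("Pedro", "SUBJ:eat"): 3,
    ("Maria", "SUBJ:read"): 2, ("Maria", "SUBJ:love"): 3, ("Maria", "SUBJ:eat"): 2,
    ("Maria", "IOBJ-DE:love"): 1,
    ("novel", "DOBJ:read"): 2,
    ("book", "DOBJ:read"): 3, ("book", "IOBJ-DE:love"): 1,
    ("thing", "DOBJ:read"): 1, ("thing", "DOBJ:eat"): 1, ("thing", "IOBJ-DE:love"): 1,
    ("apple", "DOBJ:eat"): 2,
    ("egg", "DOBJ:eat"): 2, ("egg", "IOBJ-DE:love"): 1,
}

n_ctx = {}                       # number of distinct contexts of each word
words_of = {}                    # words observed in each context
for (w, c) in freq:
    n_ctx[w] = n_ctx.get(w, 0) + 1
    words_of.setdefault(c, []).append(w)

def GW(c):
    """Global weight of a context (reconstructed formulation, base-10 logs)."""
    return math.log10(sum(freq[(w, c)] / n_ctx[w] for w in words_of[c])) \
           / math.log10(len(words_of[c]))

def W(w, c):
    """Weight of context c for word w: global weight plus log of the local count."""
    return GW(c) + math.log10(freq[(w, c)])

print(round(W("Pedro", "SUBJ:read"), 2))   # ~1.06 (slide 14 gets 1.03 from pre-rounded logs)
print(round(W("apple", "DOBJ:eat"), 2))    # 1.4, as on slide 18
```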

  19. Similarity between words (1)
  WJ(Pedro, Maria) = 2.17 / 2.61 = 0.83
    min( (1.03+0.11+1.03), (0.86+0.58+0.86) ) = 2.17
    max( (1.03+0.11+1.03), (0.86+0.58+0.86+0.31) ) = 2.61
  WJ(book, novel) = 1.45 / 1.93 = 0.75
    min( (1.45), (1.62) ) = 1.45
    max( (1.45), (1.62+0.31) ) = 1.93
  WJ(book, thing) = 1.58 / 2.69 = 0.58
    min( (1.62+0.33), (1.27+0.31) ) = 1.58
    max( (1.62+0.33), (1.27+0.31+1.1) ) = 2.69

  20. Similarity between words (2)
  WJ(apple, egg) = 1.4 / 1.71 = 0.81
    min( (1.4), (1.4) ) = 1.4
    max( (1.4), (1.4+0.31) ) = 1.71
  WJ(apple, thing) = 1.1 / 2.68 = 0.41
    min( (1.4), (1.1) ) = 1.1
    max( (1.4), (1.27+0.31+1.1) ) = 2.68
  WJ(novel, thing) = 1.1 / 2.68 = 0.41
    min( (1.45), (1.1) ) = 1.1
    max( (1.45), (1.27+0.31+1.1) ) = 2.68
  WJ(egg, thing) = 1.41 / 2.68 = 0.51
    min( (1.4+0.25), (1.1+0.31) ) = 1.41
    max( (1.4+0.25), (1.27+0.31+1.1) ) = 2.68

  21. Similarity between words (3)
  WJ(Maria, thing) = 0.31 / 2.68 = 0.09
    min( (0.31), (0.31) ) = 0.31
    max( (0.86+0.58+0.86+0.31), (1.27+0.31+1.1) ) = 2.68
  WJ(Maria, egg) = 0.31 / 2.61 = 0.11
    min( (0.31), (0.31) ) = 0.31
    max( (0.86+0.58+0.86+0.31), (1.4+0.31) ) = 2.61
  WJ(book, egg) = 0.31 / 1.93 = 0.16
    min( (0.31), (0.31) ) = 0.31
    max( (1.62+0.31), (1.4+0.31) ) = 1.93
  WJ(Pedro, thing) = 0 / 2.62 = 0
  WJ(novel, egg) = 0 / 1.65 = 0
  WJ(book, apple) = 0 / 1.87 = 0
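A short Python cross-check of these similarities, using the min-of-sums / max-of-sums form reconstructed after slide 10 and the rounded weights of slides 14-18 (the context labels are abbreviations introduced here):

```python
# Cross-check of the similarities above, using the rounded weights of
# slides 14-18 and the min-of-sums / max-of-sums weighted Jaccard.
weights = {
    "Pedro": {"SUBJ:read": 1.03, "SUBJ:love": 0.11, "SUBJ:eat": 1.03},
    "Maria": {"SUBJ:read": 0.86, "SUBJ:love": 0.58, "SUBJ:eat": 0.86, "IOBJ-DE:love": 0.31},
    "novel": {"DOBJ:read": 1.45},
    "book":  {"DOBJ:read": 1.62, "IOBJ-DE:love": 0.31},
    "apple": {"DOBJ:eat": 1.4},
    "egg":   {"DOBJ:eat": 1.4, "IOBJ-DE:love": 0.31},
}

def wjaccard(w1, w2):
    a, b = weights[w1], weights[w2]
    shared = a.keys() & b.keys()          # contexts common to both words
    if not shared:
        return 0.0
    numerator = min(sum(a[c] for c in shared), sum(b[c] for c in shared))
    denominator = max(sum(a.values()), sum(b.values()))
    return numerator / denominator

print(round(wjaccard("Pedro", "Maria"), 2))   # 0.83, as on slide 19
print(round(wjaccard("book", "novel"), 2))    # 0.75
print(round(wjaccard("novel", "egg"), 2))     # 0.0, as on slide 21
```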

  22. Similarity between words (Sorting)
  (0.83) Pedro ≈ Maria
  (0.81) apple ≈ egg
  (0.75) book ≈ novel
  (0.58) thing ≈ book
  (0.51) thing ≈ egg
  (0.41) thing ≈ apple, novel
  (0.16) book ≈ egg
  (0.11) Maria ≈ egg
  (0.09) Maria ≈ thing
  (0.0) Pedro ≈ egg
  (0.0) novel ≈ egg

  23. Lists of similar words. Corpus: "Procuradoria Geral da República" (P.G.R.)
  • juíz | {dirigente, presidente, subinspector, governador, árbitros}
  • diploma | {decreto, lei, artigo, convenção, regulamento}
  • decreto | {diploma, lei, artigo, nº, código}
  • regulamento | {estatuto, código, sistema, decreto, norma}
  • regra | {norma, princípio, regime, legislação, plano}
  • renda | {caução, indemnização, reintegração, multa, quota}
  • conceito | {noção, estatuto, regime, temática, montante}
