Integrating Semantic Dictionaries for English, French and Bulgarian into the NooJ System for the Purposes of Information Retrieval. Svetla Koeva, Max Silberztein. 8th INTEX / NooJ Workshop, 30 May 2005.
Main research goals
• To provide a methodology for implementing natural-language semantic relations in the NooJ system:
  • to create specialized semantic dictionaries for English, French and Bulgarian based on WordNet semantic relations;
  • to provide a complete formalization of the inflection of the simple and compound words included in the WordNet structure.
History
• The integration of semantic relations into the INTEX system was first proposed at the sixth INTEX workshop.
• The idea was later developed further in the joint RILA research project "Information retrieval based on semantic relations":
  • LASELDI, Université de Franche-Comté;
  • Department of Computational Linguistics, IBL, Bulgarian Academy of Sciences.
Language resources
• Bulgarian grammatical dictionary (BGD) – over 83 000 lemmas and 1 100 000 word forms;
• English WordNet 2.0 – 115 424 synonym sets;
• Bulgarian WordNet (BalkaNet project) – 22 867 synonym sets;
• French WordNet (EuroWordNet project) – 33 512 synonym sets;
• English dictionary – over 30 000 lemmas (not inflected);
• French dictionary – extracted with INTEX.
Implementation tasks
• To transform the format of the BGD into the NooJ standard;
• To create semantic dictionaries for Bulgarian and English;
• To associate lemmas from the Bulgarian semantic dictionaries with the corresponding inflection types;
• To add missing lemmas and inflection types to the BGD, if any;
• To create extensive dictionaries and corresponding inflection types for compounds.
BGD – Information structure design
• Category information – 6 classes: Noun, Verb, Adjective, Pronoun, Numeral, Others (Adverb, Preposition, Conjunction, Particle, Interjection);
• Paradigmatic information – Personal, Transitive, Perfective, Common, …;
• Grammatical information – Inflection, Conjugation, Sound alternations, ….
BGD – Grammatical subclasses
• Nouns – 22 subclasses with respect to their Type (Common, Proper, Singularia tantum, Pluralia tantum) and Gender;
• Verbs – 32 subclasses with respect to Transitivity, Perfectiveness, and Personality;
• Adjectives – 2 subclasses;
• Pronouns – 26 subclasses with respect to their Type and Possessor;
• Numerals – 6 subclasses.
BGD – Grammatical types
• Noun – Number, Definiteness, Counting form, Case, Optional forms – 266 types;
• Verb – Person, Number, Tense, Mood, Voice, Participles, Gender, Definiteness – 257 types;
• Adjective – Gender, Number, Definiteness – 30 types;
• Pronoun – Gender, Person, Number, Definiteness, Case, Clitic, Possessing – 28 types;
• Numeral – Gender, Number, Definiteness, Approximate form, Male form – 20 types.
BGD – Dictionary format
Lemma entries:
а,ЧА,0
абсол`ютен, ПРИ, 7
`август, С+М, 10
авиокомп`ания, С+Ж, 1
австр`ийски, ПРИ, 3
автоб`ус, С+М, 11
автомат`ичен, ПРИ, 7
адрес`ирам, Г+Н+Т, 4
агит`ирам, Г+Н+Т, 4
Inflection type ПРИ, 7:
sm0, Ok, ''
smh, Ok, '2RCия'
sml, Ok, '2RCият'
sf0, Ok, '2RCа'
sfd, Ok, '2RCата'
sn0, Ok, '2RCо'
snd, Ok, '2RCото'
p0, Ok, '2RCи'
pd, Ok, '2RCите'
NooJ dictionary
BGD entry → NooJ entry
абсол`ютен, ПРИ, 7 → абсолютен,A+FLX=A-7
`август, С+М, 10 → август,N+M+FLX=N_M-10
авиокомп`ания, С+Ж, 1 → авиокомпания,N+F+FLX=N_F-1
австр`ийски, ПРИ, 3 → австрийски,A+FLX=A-3
автоб`ус, С+М, 11 → автобус,N+M+FLX=N_M-11
автомат`ичен, ПРИ, 7 → автоматичен,A+FLX=A-7
адрес`ирам, Г+Н+Т, 4 → адресирам,V+IT+FLX=V_IT-4
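The conversion shown on this slide can be sketched in Python. This is only an illustration under stated assumptions, not the authors' conversion tool: the function name `bgd_to_nooj` and the `CATEGORY_MAP` table are hypothetical, and the table contains only the four category codes that appear in the examples above.

```python
# Hypothetical mapping from BGD category codes to NooJ category codes.
# It covers only the codes visible in the slide examples (assumption).
CATEGORY_MAP = {
    "ПРИ": "A",       # adjective
    "С+М": "N+M",     # masculine noun
    "С+Ж": "N+F",     # feminine noun
    "Г+Н+Т": "V+IT",  # verb subclass, as rendered on the slide
}


def bgd_to_nooj(entry: str) -> str:
    """Turn 'lemma, category, type' into 'lemma,CAT+FLX=CAT-type'."""
    lemma, category, infl_type = [field.strip() for field in entry.split(",")]
    lemma = lemma.replace("`", "")  # drop the stress mark used in the BGD
    nooj_cat = CATEGORY_MAP[category]
    # The paradigm name replaces '+' with '_' and appends the type number.
    paradigm = nooj_cat.replace("+", "_") + "-" + infl_type
    return f"{lemma},{nooj_cat}+FLX={paradigm}"


if __name__ == "__main__":
    for line in ["абсол`ютен, ПРИ, 7", "`август, С+М, 10"]:
        print(bgd_to_nooj(line))
```

A category code missing from the table raises `KeyError`, which makes gaps in the mapping visible instead of silently producing a wrong entry.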
NooJ formal descriptions
BGD rule → NooJ description
sm0, Ok, '' → A-7 = <E>/sm0 +
smh, Ok, '2RCия' → <L2><S><R>ия<S1>/smh +
sml, Ok, '2RCият' → <L2><S><R>ият<S1>/sml +
sf0, Ok, '2RCа' → <L2><S><R>а<S1>/sf0 +
sfd, Ok, '2RCата' → <L2><S><R>ата<S1>/sfd +
sn0, Ok, '2RCо' → <L2><S><R>о<S1>/sn0 +
snd, Ok, '2RCото' → <L2><S><R>ото<S1>/snd +
p0, Ok, '2RCи' → <L2><S><R>и<S1>/p0 +
pd, Ok, '2RCите' → <L2><S><R>ите<S1>/pd;
Selected relations
• Synonymy (reflexive, symmetric, and transitive relation of equivalence);
• Hypernymy (inverse, asymmetric, and transitive relation between synonym sets);
• Meronymy (inverse, asymmetric, and transitive relation between synonym sets): Part meronymy; Member meronymy; Portion meronymy.
Selected relations
• Similar to (symmetric relation between similar adjectival synsets);
• Verb group (symmetric relation between semantically related verb synsets);
• Also see (symmetric relation between synsets, verbs or adjectives, that are close in meaning);
• Category domain (asymmetric extralinguistic relation between synsets denoting a concept and the sphere of knowledge it belongs to).
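Because hypernymy is described above as transitive, every hypernym of a hypernym is also a hypernym of the starting synset. A minimal Python sketch of that closure follows; the function name `all_hypernyms` and the toy data in the usage example are hypothetical and are not taken from any of the wordnets listed earlier.

```python
def all_hypernyms(synset, direct_hypernyms):
    """Collect direct and inherited hypernyms of `synset`.

    `direct_hypernyms` maps a synset label to a list of its direct
    hypernyms (toy representation, assumed for this sketch).
    """
    found = []
    stack = list(direct_hypernyms.get(synset, ()))
    while stack:
        current = stack.pop()
        if current not in found:  # guards against repeats and cycles
            found.append(current)
            stack.extend(direct_hypernyms.get(current, ()))
    return found


if __name__ == "__main__":
    toy = {"sedan": ["car"], "car": ["vehicle"]}  # hypothetical toy data
    print(all_hypernyms("sedan", toy))
```

The same traversal would apply to meronymy, which the slide also lists as transitive; the symmetric relations need no closure of this kind.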
DELAF semantic dictionaries
• These dictionaries consist of pairs of literals defined for the corresponding semantic relation:
  car,automobile.N
  auto,automobile.N
• All possible combinations between literals in the given synsets are listed:
  car,automobile.N
  cars,automobile.N
  auto,automobile.N
  autos,automobile.N
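The enumeration of literal pairs described on this slide can be sketched as follows. This is an assumption-laden illustration: the function name `delaf_pairs` is hypothetical, the inflected forms are supplied by hand, and the sketch pairs every form of one literal with every other literal of the synset, which is one reading of "all possible combinations".

```python
def delaf_pairs(forms_by_literal, pos):
    """List 'form,literal.POS' lines for all literal pairs of one synset.

    `forms_by_literal` maps each literal (lemma) to its inflected forms;
    in practice these would come from an inflectional dictionary.
    """
    lines = []
    for target in forms_by_literal:
        for source, forms in forms_by_literal.items():
            if source == target:
                continue  # a literal is not paired with itself
            for form in forms:
                lines.append(f"{form},{target}.{pos}")
    return lines


if __name__ == "__main__":
    synset = {
        "car": ["car", "cars"],
        "auto": ["auto", "autos"],
        "automobile": ["automobile"],  # plural left out of this toy input
    }
    for line in delaf_pairs(synset, "N"):
        print(line)
```

The number of lines grows with the product of synset size and forms per literal, which is one reason a compact paradigm-based encoding is attractive.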
NooJ Semantic dictionaries
Synonymy relation: 'a plant consisting of buildings with facilities for manufacturing'
фабрика,N+FLX=ENG20-03196165-n
предприятие,N+FLX=ENG20-03196165-n
factory,N+FLX=ENG20-03196165-n
mill,N+FLX=ENG20-03196165-n
manufacturing plant,N+FLX=ENG20-03196165-n
manufactory,N+FLX=ENG20-03196165-n
NooJ Semantic dictionaries
Hypernymy relation: 'the organized action of making of goods and services for sale'
производство,N+FLX=ENG20-00859333-n
промишленост,N+FLX=ENG20-00859333-n
индустрия,N+FLX=ENG20-00859333-n
production,N+FLX=ENG20-00859333-n
industry,N+FLX=ENG20-00859333-n
manufacture,N+FLX=ENG20-00859333-n
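In the entries above, each literal is paired with a `FLX` value that names a synset identifier. A minimal Python sketch of producing such lines is given below; the helper name `semantic_entries` is hypothetical and the sketch only reproduces the textual layout shown on the slides.

```python
def semantic_entries(literals, pos, synset_id):
    """Format one dictionary line per literal: 'literal,POS+FLX=synset_id'."""
    return [f"{literal},{pos}+FLX={synset_id}" for literal in literals]


if __name__ == "__main__":
    literals = ["factory", "mill", "manufacturing plant", "manufactory"]
    for line in semantic_entries(literals, "N", "ENG20-03196165-n"):
        print(line)
```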
Inflecting wordnet
<SYNSET>
  <ID>...</ID>
  <POS>...</POS>
  <SYNONYM>
    <LITERAL>otstranqwam (to remove)
      <SENSE>…</SENSE>
      <LNOTEGR>ГНТ12</LNOTEGR>
    </LITERAL>
  </SYNONYM>
  <ILR>...<TYPE>...</TYPE></ILR>
  <DEF>remove something concrete, as by lifting, pushing, taking off, etc. or remove something abstract</DEF>
  <BCS>...</BCS>
</SYNSET>
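Reading the inflection note (`LNOTEGR`) attached to each literal can be sketched with Python's standard `xml.etree.ElementTree` module. The sample below is an abbreviated copy of the synset on the slide, with the elided values left as `...`; the function name `literals_with_inflection` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample modelled on the slide; elided values stay elided.
SYNSET_XML = """<SYNSET>
  <ID>...</ID>
  <POS>...</POS>
  <SYNONYM>
    <LITERAL>otstranqwam<SENSE>...</SENSE><LNOTEGR>ГНТ12</LNOTEGR></LITERAL>
  </SYNONYM>
  <DEF>remove something concrete, as by lifting, pushing, taking off, etc. or remove something abstract</DEF>
</SYNSET>"""


def literals_with_inflection(xml_text):
    """Return (literal, inflection note) pairs found in one synset."""
    root = ET.fromstring(xml_text)
    pairs = []
    for literal in root.iter("LITERAL"):
        text = (literal.text or "").strip()  # text before the first child
        note = literal.findtext("LNOTEGR")   # None if the note is missing
        pairs.append((text, note))
    return pairs


if __name__ == "__main__":
    print(literals_with_inflection(SYNSET_XML))
```

Literals whose note comes back as `None` are candidates for the "lemmas not included in the BGD" case.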
NooJ Semantic descriptions
'the organized action of making of goods and services for sale'
Bulgarian:
ENG20-00859333-n = <E>/Hs0 + то/Hsd + <L1>а<S1>/Hp0 + <L1>ата<S1>/Hpd +
  <L9>мишленост<S9>/Ss0 + <L9>мишлеността<S9>/Ssd + <L9>мишлености<S9>/Sp0 + <L9>мишленостите<S9>/Spd +
  <B12>индустрия/Ss0 + <B12>индустрията/Ssd + <B12>индустрии/Sp0 + <B12>индустриите/Spd;
English:
ENG20-00859333-n = <E>/Hs + <B10>industry/Ss + <B10>industries/Sp0 + <B10>manufacture/Ss + <B10>manufactures/Sp;
After the nice solutions
• Lemmas which are not included in the BGD:
  • Classification of lemmas into existing inflection types;
  • Formal description of new inflection types;
  • Literals in Latin;
  • Validating WordNet.
• Semantic ambiguity: literals with two inflectional descriptions in the BGD;
• Compound words:
  • Formal description of inflection types;
  • Classification of compounds.
NooJ Compound semantic descriptions
ENG20-04182583-n = <E>/Ss0 + <P>та/Ssd + <B>и<P><B>(и/p0 +ите/pd) +
  <B7>завод<P><B2>ен/Ss0 + <B7>завод<P><B2>ния/Ssh + <B7>завод<P><B2>ният/Ssl +
  <B7>заводи<P><B2>ни/Sа0 + <B7>заводи<P><B2>ните/Sа0 +
  <B7>рафинерия/Ss0 + <B7>рафинерия<P>та/Ssd + <B7>рафинерии<P><B>и/Sp0 + <B7>рафинерии<P><B>ите/Spd;
Applications of the Semantic Dictionaries
• Information retrieval by means of semantic equivalence, with synonymy dictionaries;
• Information retrieval by means of semantic specification, with hypernymy and meronymy dictionaries;
• Information retrieval by means of similarity;
• Information retrieval by means of thematic domain affiliation;
• Validation of the WordNet structure for completeness and consistency.
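Retrieval by semantic equivalence can be illustrated with a small query-expansion sketch in Python. It is a plain dictionary lookup over entries in the `literal,POS+FLX=synset` layout shown earlier and does not use NooJ itself; the names `build_index` and `expand` are hypothetical.

```python
from collections import defaultdict

# Entries copied from the synonymy example earlier in the slides.
ENTRIES = [
    "фабрика,N+FLX=ENG20-03196165-n",
    "factory,N+FLX=ENG20-03196165-n",
    "mill,N+FLX=ENG20-03196165-n",
    "manufacturing plant,N+FLX=ENG20-03196165-n",
    "manufactory,N+FLX=ENG20-03196165-n",
]


def build_index(entries):
    """Index entries both ways: literal -> synsets and synset -> literals."""
    synsets_of = defaultdict(set)
    literals_of = defaultdict(set)
    for entry in entries:
        literal, info = entry.split(",", 1)
        synset_id = info.split("FLX=", 1)[1]
        synsets_of[literal].add(synset_id)
        literals_of[synset_id].add(literal)
    return synsets_of, literals_of


def expand(term, synsets_of, literals_of):
    """Return the term plus every literal that shares a synset with it."""
    expanded = {term}
    for synset_id in synsets_of.get(term, ()):
        expanded |= literals_of[synset_id]
    return sorted(expanded)


if __name__ == "__main__":
    synsets_of, literals_of = build_index(ENTRIES)
    print(expand("mill", synsets_of, literals_of))
```

Because the Bulgarian and English literals share one synset identifier, the same lookup also returns cross-language equivalents.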
Future directions
• Extension and enhancement of the semantic dictionaries by means of:
  • Extension of the dictionaries' coverage;
  • Addition of other semantic relations;
  • Inclusion of additional information in the entries.
• Integration of multilingual semantic extraction with NooJ using the Inter-Lingual-Index relation.