Information Query Formulation in a Slavonic Language and its Automatic Processing
Experience from Polish and Czech in Comparison to Western European Languages
Petr Strossa
University of Economics, Prague
Department of Information & Knowledge Engineering
General Issue
• 86 question/answer types and the basic idea of their recognition in texts [D. Laurent et al., SYNAPSE, Toulouse]
TEL-ME-MOR/M-CAST Seminar, 2006
Technology
• Priberam’s lexicon data structure
• SintaGest software tool
[Priberam Informática, Lisbon]
Question-Answer Pattern (Example)
Question(WEIGHT)
  : Root("jaký")? Dist(0,5) WeightNoun = 20   // Jaká je hmotnost Země? (What is the mass of the Earth?)
  : Wrd(jak) WeightAdj = 20                   // Jak těžký může být slon? (How heavy can an elephant be?)
  : Wrd(kolik) WeightUnit = 20                // Kolik kg má dospělý kapr? (How many kg does an adult carp weigh?)
  : Wrd(kolik) Root("vážit") = 20             // Kolik váží kapr? (How much does a carp weigh?)
Answer
  : WeightNoun Definition With Pivot Dist(0,5) {Number6 WeightUnit} = 20   // Váha kapra může dosáhnout až 5 kg. (The weight of a carp can reach up to 5 kg.)
  : Pivot Dist(0,5) Cat(V) Dist(0,5) {Number6 WeightUnit} = 20             // Roční kapr může dosáhnout 5 kg tělesné váhy. (A one-year-old carp can reach 5 kg of body weight.)
;
Answer(WEIGHT)
  : Number6 WeightUnit = 20
;
Definitions of Constants Used in the Previous Example
Const WeightNoun = AnyRoot(hmotnost, hmota, "tíha", "váha", "zatížení");
Const WeightAdj = AnyRoot("těžký", "lehký");
Const WeightUnit1 = AnyRoot(mikrogram, miligram, centigram, decigram, gram, dekagram, hektogram, kilogram, kilo, cent, megagram, miligram, tuna, "karát", pond, kilopond, megapond, libra);
Const WeightUnit2 = AnyWrd(mg, cg, dg, g, dag, deka, Dg, dkg, hg, kg, q, Mg, t, p, kp, Mp, lb, "lb.", lbs, "lbs.", cwt, "cwt.");
Const WeightUnit = AnyConst(WeightUnit1, WeightUnit2);
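To make the rule format more concrete, the following is a minimal Python sketch of how a rule such as `Wrd(kolik) WeightUnit = 20` might be evaluated. It is not Priberam's implementation: the names `tokenize`, `wrd`, `any_wrd` and `match_rule` are hypothetical, the reading of `Dist(0,5)` as "at most 5 intervening tokens" is an assumption, and only surface-form matching (`Wrd`/`AnyWrd`) is covered, because `Root`/`AnyRoot` would need a morphological lexicon.

```python
import re

# Surface forms taken from the WeightUnit2 constant (dotted variants such as
# "lb." are left out because the naive tokenizer below drops punctuation).
WEIGHT_UNIT_WORDS = {"mg", "cg", "dg", "g", "dag", "deka", "Dg", "dkg", "hg",
                     "kg", "q", "Mg", "t", "p", "kp", "Mp", "lb", "lbs", "cwt"}


def tokenize(text):
    """Split text into word tokens, keeping case (Mg and mg differ)."""
    return re.findall(r"\w+", text)


def wrd(word):
    """Hypothetical analogue of Wrd(...): case-insensitive surface match."""
    return lambda token: token.lower() == word


def any_wrd(words):
    """Hypothetical analogue of AnyWrd(...): case-sensitive membership test."""
    return lambda token: token in words


def match_rule(tokens, predicates, weight, max_gap=0):
    """Return `weight` if the predicates match in order, with at most
    `max_gap` tokens between consecutive matches; otherwise return 0."""
    def search(start, remaining, first):
        if not remaining:
            return True
        # The first predicate may match anywhere; later ones only within the gap.
        end = len(tokens) if first else min(len(tokens), start + max_gap + 1)
        for i in range(start, end):
            if remaining[0](tokens[i]) and search(i + 1, remaining[1:], False):
                return True
        return False

    return weight if search(0, predicates, True) else 0


# Assumed encoding of the rule ": Wrd(kolik) WeightUnit = 20".
rule = [wrd("kolik"), any_wrd(WEIGHT_UNIT_WORDS)]

print(match_rule(tokenize("Kolik kg má dospělý kapr?"), rule, 20))
print(match_rule(tokenize("Kolik váží kapr?"), rule, 20))
```

The first question scores 20 because "kg" directly follows "Kolik". The second scores 0 under this rule and would have to be caught by the `Root("vážit")` rule instead.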
General Observation
• The concepts and tools designed for processing Western European languages can be adapted to Slavonic languages such as Polish and Czech.
• Some basic differences between the two language families must be kept in mind during such an adaptation!
The Abundance of Morphology
• Nouns: 4 (!) genders, 2 numbers, 7 cases
• Adjectives, e.g. světlý (bright):
  • 3 degrees: světlý ↔ světlejší, nejsvětlejší
  • 4 genders: světlý ↔ světlá, světlé
  • 2 numbers: světlý ↔ světlí
  • 7 cases: světlý ↔ světlého, světlému, ...
The Abundance of Morphology (2)
• Adjectives, continued:
  • In theory, every adjective may have 3 × 4 × 2 × 7 = 168 forms altogether!
  • In practice, some of these forms always coincide (without exception).
  • Even so, a general scheme for describing a morphological pattern cannot work with fewer than 57 forms (= 3 degrees × 19 possibly differing gender/number/case endings).
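The arithmetic behind these two figures can be checked with a short Python sketch. The category labels are standard Czech grammar terms. The value 19 is simply the figure quoted on the slide and is not derived here.

```python
from itertools import product

DEGREES = ["positive", "comparative", "superlative"]
GENDERS = ["masculine animate", "masculine inanimate", "feminine", "neuter"]
NUMBERS = ["singular", "plural"]
CASES = ["nominative", "genitive", "dative", "accusative",
         "vocative", "locative", "instrumental"]

# Every theoretical (degree, gender, number, case) slot of an adjective.
all_slots = list(product(DEGREES, GENDERS, NUMBERS, CASES))
print(len(all_slots))  # 168

# Number of possibly differing gender/number/case endings per degree,
# as quoted on the slide (not computed from data).
DISTINCT_ENDINGS_PER_DEGREE = 19
minimum_pattern_size = len(DEGREES) * DISTINCT_ENDINGS_PER_DEGREE
print(minimum_pattern_size)  # 57
```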
The Abundance of Morphology (3): Illustration of the 19-Ending System
[Slide illustration not reproduced.]
The Abundance of Morphology (4)
• Adjectives, continued:
  • In fact, not every adjective has all of these forms.
  • Some adjectives cannot be graded for purely morphological reasons: domácí (home, home-made)
  • Other adjectives are usually not graded for semantic reasons: jednofázový (one-phase)
Morphological Pattern (Ex. 1)
[Slide illustration not reproduced.]
Morphological Pattern (Ex. 2)
[Slide illustration not reproduced.]
Morphology of Nouns: Some Statistics
[Slide illustration not reproduced.]
Morphology of Nouns: Some Statistics (2)
• We need about 300 noun patterns altogether.
• We have about 90 noun patterns that describe the declension of at least 10 different nouns.
• We have about 80 noun patterns that describe only 1 noun each.
• About one half of the noun patterns describe the declension of 1–3 nouns each.
Inherent Homonymy of Forms
• A typical situation for our type of morphology: světlé (bright)
  • nominative/accusative/vocative singular neuter
  • genitive/dative/locative singular feminine
  • nom./acc./voc. plural fem.
  • acc. pl. masculine animate
  • nom./acc./voc. pl. masculine inanimate
• i.e. 13 possible grammatical interpretations altogether!
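The count of 13 can be reproduced by listing the readings explicitly. The sketch below is only an illustration in Python: the tuple layout and the helper names `expand` and `narrow` are assumptions, not part of any tool described in the talk.

```python
def expand(cases, number, gender):
    """Build one (case, number, gender) reading per listed case."""
    return [(case, number, gender) for case in cases]


# All readings of the form "světlé", exactly as enumerated on the slide.
SVETLE_READINGS = (
    expand(["nominative", "accusative", "vocative"], "singular", "neuter")
    + expand(["genitive", "dative", "locative"], "singular", "feminine")
    + expand(["nominative", "accusative", "vocative"], "plural", "feminine")
    + expand(["accusative"], "plural", "masculine animate")
    + expand(["nominative", "accusative", "vocative"], "plural", "masculine inanimate")
)


def narrow(readings, **known):
    """Keep only readings consistent with known features
    (keyword names: case, number, gender)."""
    fields = ("case", "number", "gender")
    return [r for r in readings
            if all(r[fields.index(name)] == value for name, value in known.items())]


print(len(SVETLE_READINGS))                                             # 13
print(len(narrow(SVETLE_READINGS, gender="feminine")))                  # 6
print(len(narrow(SVETLE_READINGS, gender="feminine", number="plural"))) # 3
```

Even when agreement with a neighbouring noun fixes gender and number, three case readings remain, which is why the form cannot be resolved by looking at the word alone.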
Inherent Homonymy of Forms (2)
• An only slightly less typical situation: Ženu holí stroj.
  • I am setting a machine in motion with a stick.
  • OR: I am setting a machine of sticks in motion. (*)
  • The woman is shaved by a machine.
  • Dress the woman with a stick.
  • OR: Dress the woman of sticks. (*)
Inherent Homonymy of Forms (3)
• All of the above once again, this time in a question: Jaký je plat Petra Hanka?
  • What is the salary of XY?
  • X ∈ {Petr, Peter, Petar}
  • Y ∈ {Hank, Hanek, Hanke, Hanko}
  • The only thing we know for sure: X ≠ Petra (though such a name exists); Y ≠ Hanka (though such a name exists)!
Inherent Homonymy of Forms (4)
• Jaký je plat Petra Hanka?
  • What is the salary of XY?
  • The only thing we know for sure: X ≠ Petra (though such a name exists); Y ≠ Hanka (though such a name exists)!
• Compare: Jaký plat Hanka dává svým zaměstnancům?
  • What salary does Hanka give to her/his employees?
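One way to picture the name problem is as a reverse lookup from an inflected form to every base form that could have produced it. The Python sketch below assumes a tiny hand-written lexicon whose entries mirror the candidate sets given above. A real system would derive such pairs from declension patterns, and the names `GENITIVE_OF`, `build_reverse_index` and `name_candidates` are hypothetical.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: nominative -> genitive singular.
# The masculine entries follow the candidate sets listed on the slide.
GENITIVE_OF = {
    "Petr": "Petra", "Peter": "Petra", "Petar": "Petra",
    "Petra": "Petry",
    "Hank": "Hanka", "Hanek": "Hanka", "Hanke": "Hanka", "Hanko": "Hanka",
    "Hanka": "Hanky",
}


def build_reverse_index(genitive_of):
    """Map each genitive form to the set of nominatives that yield it."""
    index = defaultdict(set)
    for nominative, genitive in genitive_of.items():
        index[genitive].add(nominative)
    return index


NOMINATIVES_OF_GENITIVE = build_reverse_index(GENITIVE_OF)


def name_candidates(first_genitive, last_genitive):
    """List every full name whose genitive is the given two-word form."""
    firsts = NOMINATIVES_OF_GENITIVE.get(first_genitive, set())
    lasts = NOMINATIVES_OF_GENITIVE.get(last_genitive, set())
    return sorted(f"{x} {y}" for x in firsts for y in lasts)


candidates = name_candidates("Petra", "Hanka")
print(len(candidates))               # 12
print("Petra Hanka" in candidates)   # False
```

With three candidates for X and four for Y, the query "plat Petra Hanka" expands to 12 possible people, and "Petra Hanka" itself is not among them because its genitive would be "Petry Hanky".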
Inherent Homonymy of Forms (Conclusion)
• Because word order in our languages is free, disambiguation based on a limited context is generally quite unreliable.
• Truly safe disambiguation is possible only after a complete syntactic analysis of the sentence (which should keep all possible meanings of all words until the end).
• (But we do not perform a complete syntactic analysis of sentences in M-CAST.)
Free Word Order Again
• How far is it to Brno?
  • Jak daleko je do Brna? (+++)
  • Jak je daleko do Brna? (+++)
  • Jak je do Brna daleko? (++)
  • Do Brna je jak daleko? (++)
  • Do Brna jak je daleko? (+)
  • Do Brna je daleko jak? (+)
  • Daleko je do Brna jak? (+)
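The effect of free word order on pattern matching can be sketched in Python by comparing an order-sensitive test ("jak" followed by "daleko" within a few tokens) with an order-insensitive one over the seven variants above. The function names and the gap limit of 5 are assumptions chosen for illustration. This is not how M-CAST or SintaGest implements `Dist`.

```python
import re

VARIANTS = [
    "Jak daleko je do Brna?",
    "Jak je daleko do Brna?",
    "Jak je do Brna daleko?",
    "Do Brna je jak daleko?",
    "Do Brna jak je daleko?",
    "Do Brna je daleko jak?",
    "Daleko je do Brna jak?",
]


def tokenize(text):
    """Lower-cased word tokens (naive, punctuation dropped)."""
    return re.findall(r"\w+", text.lower())


def ordered_match(tokens, first, second, max_gap=5):
    """True if `first` occurs before `second` with at most `max_gap`
    tokens between them."""
    for i, token in enumerate(tokens):
        if token == first and second in tokens[i + 1:i + 2 + max_gap]:
            return True
    return False


def unordered_match(tokens, a, b, max_gap=5):
    """True if the two words occur within the gap in either order."""
    return ordered_match(tokens, a, b, max_gap) or ordered_match(tokens, b, a, max_gap)


ordered_hits = sum(ordered_match(tokenize(v), "jak", "daleko") for v in VARIANTS)
unordered_hits = sum(unordered_match(tokenize(v), "jak", "daleko") for v in VARIANTS)
print(ordered_hits, unordered_hits)  # 5 7
```

The order-sensitive test misses the two variants in which "daleko" precedes "jak", so a pattern ported from a fixed-word-order language would need either additional rules or an order-insensitive distance operator.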