Frank Van Eynde
Centre for Computational Linguistics

Computational Lexicography

OUTLINE
1. The token/type distinction
2. Lexicographic practice
3. Computational lexica
4. Lexical databases
5. Lexical knowledge acquisition
6. The use of lexica in text-to-speech
1. Tokens vs. types

(1) The girl gave the flowers to the athlete.
- 3 tokens of "the": properties are context-specific
- 1 type <THE>: properties are generalizations over the various uses
Heraclitus vs. Plato

(2) The sooner they come, the better it is.
<THE, article> vs. <A, article>    NL de, het
<THE, adverb> vs. <FAR, adverb>    NL hoe
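The token/type count for example (1) can be reproduced with a few lines of Python. This is only a minimal sketch: the tokenizer is a deliberately crude assumption (lowercased alphabetic strings), and it identifies a type with a word form, so it cannot separate <THE, article> from <THE, adverb>.

```python
from collections import Counter
import re


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into lowercase word tokens (crude, for illustration only)."""
    return re.findall(r"[a-z']+", sentence.lower())


sentence = "The girl gave the flowers to the athlete."
tokens = tokenize(sentence)      # every occurrence counts as a token
type_counts = Counter(tokens)    # one key per distinct word form (type)

print(len(tokens))               # number of tokens in the sentence
print(len(type_counts))          # number of types in the sentence
print(type_counts["the"])        # number of tokens of the type <THE>
```

For sentence (1) this gives 8 tokens, 6 types, and 3 tokens of <THE>.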
1. Tokens vs. types

(3) I do not think that the dog of that girl is really that dangerous.
<THAT, compl> vs. <IF, compl>      FR que
<THAT, det> vs. <THIS, det>        FR ce/cette
<THAT, adverb> vs. <SO, adverb>    FR si
(4) Je ne pense pas que le chien de cette fille est vraiment si dangereux.
1. Tokens vs. types

The abstraction problem: given a word W, how many types <W, POS> do we have to distinguish?

(5) It is not far from here.
(6) We didn't go far.
(7) He's living in the Far West.
(8) Paris is far more expensive than Dublin.
<FAR, adj> vs. <NEAR, adj>      NL ver
<FAR, adv> vs. <LITTLE, adv>    NL veel
1. Tokens vs. types

(9) De bal van de finale wordt verkocht op het bal van de FIFA.
    (EN: The ball of the final is being sold at the FIFA ball.)
<BAL, noun [non-neuter]>    IT palla
<BAL, noun [neuter]>        IT ballo
(10) La palla del finale sarà venduta al ballo della FIFA.
1. Tokens vs. types

(11) That girl has been very lucky.
(12) That girl has a lot of hair.
<W, POS, VAL>
<HAVE, verb [aux], _VP[PSP]>    IT avere/essere
<HAVE, verb [main], _NP>        IT avere
1. Tokens vs. types

(13) The pen is in my pocket.
(14) The pig is in the pen.
<W, POS, VAL, SENSE>
<PEN, noun, writing implement>    NL pen
<PEN, noun, fenced enclosure>     NL hok
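The abstraction problem can be made concrete by storing types as tuples and counting how many of them a word has at different levels of granularity. The sketch below is only an illustration: the class `LexType`, the list `LEXICON` and the function `count_types` are hypothetical names, and the entries are limited to types that appear in the examples above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexType:
    """A type <W, POS, VAL, SENSE>; VAL and SENSE are optional refinements."""
    word: str
    pos: str
    val: Optional[str] = None
    sense: Optional[str] = None


# Hypothetical mini-lexicon built from the types shown in the examples.
LEXICON = [
    LexType("THAT", "compl"),
    LexType("THAT", "det"),
    LexType("THAT", "adverb"),
    LexType("HAVE", "verb[aux]", "_VP[PSP]"),
    LexType("HAVE", "verb[main]", "_NP"),
    LexType("PEN", "noun", None, "writing implement"),
    LexType("PEN", "noun", None, "fenced enclosure"),
]


def count_types(word: str, fields: tuple[str, ...]) -> int:
    """Count the types of `word` when only the given fields are distinguished."""
    distinct = {
        tuple(getattr(entry, field) for field in fields)
        for entry in LEXICON
        if entry.word == word
    }
    return len(distinct)


print(count_types("THAT", ("word", "pos")))           # POS alone separates the uses of THAT
print(count_types("PEN", ("word", "pos")))            # POS alone does not separate the two PENs
print(count_types("PEN", ("word", "pos", "sense")))   # adding SENSE does
```

With these entries, PEN counts as one type at the <W, POS> level and as two once SENSE is included, which is the point of examples (13) and (14).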
2. Lexicographic practice

The entries of pen and peg in the Oxford Advanced Learner's Dictionary of Current English.
<ORTHⁿ, PHON, POS, m, (VAL,) SENSE>
(n distinguishes homonyms of the same ORTH, m numbers the senses within an entry)
Homonymy vs. polysemy
Problem: for any given ORTH, how many n and how many m does one have to distinguish?
2. Lexicographic practice

The entries of pen and peg in the Collins Cobuild Dictionary of the English Language.
<ORTH, PHON, m, SENSE>
There is no one-to-one correspondence between the senses in the two dictionaries.
3. Computational Lexica

Dictionaries are made for people who already understand (much of) the language.
Computational lexica are made for machines that do not understand (anything of) the language.
Consequence: an NLP system can only make sense of information which is presented in the notation (or format) which it employs for processing the language.
3. Computational Lexica

<two hundred fifty-six, 256>
<two hundred fifty-six, CCLVI>
POS tagger
The entry for ik in Van Dale
The entry for ik in the lexicon of the Spoken Dutch Corpus
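One reading of the two pairs above is that the same number word can be linked to different notations, and that a system can only use the one it actually processes. The sketch below generates the word form and the Roman numeral from an integer. The function names `to_words` and `to_roman` are assumptions made for illustration, and the word form follows the hyphenated style used in the example.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
         (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"),
         (4, "IV"), (1, "I")]


def to_words(n: int) -> str:
    """Spell out an integer between 0 and 999 in English."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {to_words(rest)}"


def to_roman(n: int) -> str:
    """Write an integer between 1 and 3999 as a Roman numeral."""
    if not 1 <= n <= 3999:
        raise ValueError("only 1..3999 is supported in this sketch")
    parts = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)   # how often this symbol fits, and what is left
        parts.append(symbol * count)
    return "".join(parts)


print((to_words(256), 256))            # ('two hundred fifty-six', 256)
print((to_words(256), to_roman(256)))  # ('two hundred fifty-six', 'CCLVI')
```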
4. Lexical databases

Computational lexica are often task-specific and application-dependent.
The need for reusability, maintainability, extensibility
Creation of a lexical database which is sufficiently general and abstract to be reusable, maintainable and easily extensible
Two aspects of abstractness: theory-neutral and level-independent
4. Lexical databases

Lexical knowledge representation languages
- DATR (Gazdar and Evans)
- Typed feature structures (HPSG)
The number of lexical entries for any given natural language is enormous.
The information to be captured in each lexical entry is detailed and complex.
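Lexical knowledge representation languages deal with the size and complexity of the lexicon by stating general information once and letting entries inherit it. The Python sketch below illustrates that idea of default inheritance only. It is not DATR syntax and not an HPSG type hierarchy, and the class `Node`, the feature names and the two entries are hypothetical.

```python
class Node:
    """A lexical node that inherits feature values from its parent by default."""

    def __init__(self, name, parent=None, **features):
        self.name = name
        self.parent = parent
        self.features = features

    def get(self, feature):
        """Return the most specific value of `feature` along the inheritance path."""
        node = self
        while node is not None:
            if feature in node.features:
                return node.features[feature]
            node = node.parent
        raise KeyError(f"{self.name} has no value for {feature!r}")


def plural(node):
    """Build the plural form from the (possibly inherited) suffix."""
    return node.get("orth") + node.get("plural_suffix")


# General information is stated once, on the NOUN node.
NOUN = Node("NOUN", pos="noun", plural_suffix="s")
# A regular entry only adds what is specific to it.
PEN = Node("PEN", NOUN, orth="pen")
# An irregular entry overrides the inherited default.
OX = Node("OX", NOUN, orth="ox", plural_suffix="en")

print(plural(PEN))    # pens
print(plural(OX))     # oxen
print(OX.get("pos"))  # noun
```

The regular entry stays small, and the exception is stated only where it applies, which is what keeps a large lexicon maintainable.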
4. Lexical databases

WordNet
- English nouns, verbs, adjectives and adverbs
- Inspired by psycholinguistic and computational theories of human lexical memory
- Organized into synonym sets, each representing one underlying concept
- Example: call
Extension to other languages: EuroWordNet
Application to Dutch: Cornetto
Other initiatives: FrameNet and VerbNet
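The organization into synonym sets can be sketched with a small data structure in which one word belongs to several sets, one per concept. The synsets below are hand-made for illustration and are not copied from WordNet. The identifiers, glosses and the helper functions `build_index` and `synonyms` are all assumptions.

```python
from collections import defaultdict

# Hypothetical synonym sets for three concepts that "call" can express.
SYNSETS = [
    {"id": "call.phone", "lemmas": ["call", "telephone", "phone", "ring"],
     "gloss": "contact someone by telephone"},
    {"id": "call.name", "lemmas": ["call", "name"],
     "gloss": "give a name to someone or something"},
    {"id": "call.shout", "lemmas": ["call", "shout", "cry", "yell"],
     "gloss": "say something in a loud voice"},
]


def build_index(synsets):
    """Map every lemma to the identifiers of the synsets it belongs to."""
    index = defaultdict(list)
    for synset in synsets:
        for lemma in synset["lemmas"]:
            index[lemma].append(synset["id"])
    return dict(index)


def synonyms(word, synsets):
    """Collect all lemmas that share at least one synset with `word`."""
    found = set()
    for synset in synsets:
        if word in synset["lemmas"]:
            found.update(synset["lemmas"])
    found.discard(word)
    return found


INDEX = build_index(SYNSETS)
print(INDEX["call"])                        # one synset identifier per concept
print(sorted(synonyms("phone", SYNSETS)))   # lemmas sharing a synset with "phone"
```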
5. Lexical knowledge acquisition

- from scratch
- from a machine-readable dictionary
- from an agency for the distribution of resources (TST, ELRA and LDC)
- inductive: from a partial lexicon and a corpus
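The inductive route can be illustrated with a very simple heuristic: learn which part of speech goes with which word ending in the partial lexicon, then propose entries for unknown corpus words with the same ending. This is one possible technique chosen for illustration, not the method described in the slides. The lexicon, the corpus sentence and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical partial lexicon: word form -> part of speech.
PARTIAL_LEXICON = {
    "walking": "verb", "talking": "verb",
    "quickly": "adv", "slowly": "adv",
    "happiness": "noun", "sadness": "noun",
}


def suffix_model(lexicon, length=2):
    """Count which POS values occur with each word-final suffix."""
    model = defaultdict(Counter)
    for word, pos in lexicon.items():
        model[word[-length:]][pos] += 1
    return model


def acquire(corpus_tokens, lexicon, length=2):
    """Propose a POS for corpus words that are missing from the lexicon."""
    model = suffix_model(lexicon, length)
    proposals = {}
    for token in corpus_tokens:
        word = token.lower()
        if word in lexicon or word in proposals:
            continue
        counts = model.get(word[-length:])   # None if the suffix was never seen
        if counts:
            proposals[word] = counts.most_common(1)[0][0]
    return proposals


corpus = "She was singing loudly despite her tiredness".split()
print(acquire(corpus, PARTIAL_LEXICON))
```

Words whose ending was never seen in the partial lexicon (here "she", "was", "despite", "her") receive no proposal. A realistic system would need more evidence than a two-letter suffix and would have the proposals checked before adding them.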
6. Lexica in text-to-speech

written text
  -> text normalisation
expanded graphemic representation
  -> tagging & syntactic analysis
graphemic representation with prosody
  -> grapheme-to-phoneme
sequence of phonemes, incl. lexical stress
  -> speech synthesis
fluent speech
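Two of the stages above rely directly on lexical resources, and they can be sketched as follows: normalisation expands digits with a number table, and grapheme-to-phoneme conversion looks words up in a pronunciation lexicon. Both tables are tiny hypothetical examples with approximate transcriptions. The tagging, prosody and synthesis stages are left out, and the function names are assumptions.

```python
import re

# Hypothetical resources; a real system would use full-size lexica.
NUMBER_WORDS = {"2": "two", "3": "three"}
PRON_LEXICON = {
    "the": "ðə", "two": "ˈtuː", "pigs": "ˈpɪɡz",
    "sleep": "ˈsliːp", "in": "ɪn", "pen": "ˈpɛn",
}


def normalise(text):
    """Text normalisation: lowercase, split into words, expand known digits."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return [NUMBER_WORDS.get(token, token) for token in tokens]


def to_phonemes(words):
    """Grapheme-to-phoneme by lexicon lookup; unknown words are flagged, not guessed."""
    return [PRON_LEXICON.get(word, f"<unknown:{word}>") for word in words]


words = normalise("The 2 pigs sleep in the pen.")
print(words)
print(to_phonemes(words))
```

Flagging unknown words instead of guessing makes the gaps in the lexicon visible. A full system would fall back on letter-to-sound rules at that point.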