Introduction to Computational Linguistics. Dipti Misra Sharma IIIT, Hyderabad < dipti@iiit.ac.in > IASNLP 05-07-2012. Outline. Background What is Computational Linguistics (CL)? What do the Computational Linguists do? What are the issues in processing natural languages?
Introduction to Computational Linguistics Dipti Misra Sharma IIIT, Hyderabad <dipti@iiit.ac.in> IASNLP 05-07-2012
Outline • Background • What is Computational Linguistics (CL)? • What do the Computational Linguists do? • What are the issues in processing natural languages? • What can we do with CL? • Approaches in CL?
Background • Language is a means of communication Therefore, one can say • It encodes what is communicated <information> We apply the processes of • Analysis (decoding) for understanding • Synthesis (encoding) for expression (speaking)
What do we communicate ? • Information (SPAIN delivered a football masterclass at Euro 2012) • Intention <purpose> • Emphasis/focus (Euro 2012 won by Spain/ Spain bags Euro 2012) • Introduces variation
How do we communicate ? We use linguistic elements such as • Words (country, park, the, is, Bandipur, of, as, and, considered, National, a, spot, beautiful, tourist, life, in, best, wild, sanctuaries, the, one) • Arrangement of the words (Sentences) Words are related to each-other to provide the composite meaning (Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country)
How do we communicate ? • Arrangement of sentences (Discourse) Sentences or parts of sentences are related to each other to provide a cohesive meaning *(Considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 km. Bandipur National park is a beautiful tourist spot.) (Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 km) • Languages differ in the way they organise information in these entities • All of these interact in the organisation of information
What is Computational Linguistics? • Computational linguistics is the scientific study of language from a computational perspective.
What does it mean? Scientific • Provides explanation for a linguistic or psycholinguistic phenomenon Computational • Develops computational models/techniques for linguistic phenomena Human language is the subject of study
In other words Computational linguistics is the application of • linguistic theories and • computational techniques to problems of natural language processing. http://www.ba.umist.ac.uk/public/departments/registrars/academicoffice/uga/lang.htm
What do the Computational Linguists do? • Linguistic research • Develop language models for processing natural languages • Develop language resources for NLP research/applications • Understand and develop models for analysis and generation of natural languages by the computers
So, a computational linguist needs to understand • How language works • What information is available in the language • How languages encode information • How this knowledge/information can be represented for computational processing
Information in Language (1/4) • Languages encode information cuuhe maarate haiN kutte rats kill dogs • Hindi sentence is ambiguous • Possible interpretations Dogs kill rats Rats kill dogs However, English sentence is not ambiguous
Information in Language (2/4) Ambiguity in Hindi is resolved if, cuuhe maarate haiN kuttoN ko rats kill dogs acc • English encodes information in positions • Hindi in morphemes Languages encode information differently
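The contrast between position and morphemes can be sketched in a few lines. This is only a toy illustration: the noun list, the function names and the assumption of a simple two-noun clause are all hypothetical, not part of the original slides.

```python
# Toy noun list taken from the slide examples; 'ko' is the accusative marker.
NOUNS = {"cuuhe", "kutte", "kuttoN"}
ACC_MARKER = "ko"

def hindi_roles(tokens):
    """Return (agent, patient) if a case marker identifies the patient, else None."""
    patient = None
    for word, following in zip(tokens, tokens[1:]):
        if word in NOUNS and following == ACC_MARKER:
            patient = word
    if patient is None:
        return None  # no marker: the roles stay ambiguous
    agent = next(t for t in tokens if t in NOUNS and t != patient)
    return agent, patient

def english_roles(tokens):
    """In a simple English N-V-N clause, position alone decides the roles."""
    return tokens[0], tokens[-1]

print(hindi_roles("cuuhe maarate haiN kuttoN ko".split()))  # ('cuuhe', 'kuttoN')
print(hindi_roles("cuuhe maarate haiN kutte".split()))      # None
print(english_roles("rats kill dogs".split()))              # ('rats', 'dogs')
```

The Hindi function ignores word order entirely and relies on the marker, while the English function relies on order alone.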
Information in Language (3/4) Another example, This chair has been sat on • The chair has been used for sitting • X sat on this chair, and it is known • The sentence does not mention X Languages encode information partially
Information in Language (4/4) English pronouns he, she, it Hindi pronoun vaha He is going to Delhi ==> vaha dillii jaa rahaa hai She is going to Delhi ==> vaha dillii jaa rahii hai It broke ==> vaha TuuTa ?? • Information does not always map fully from one language into another • Conceptual worlds may be different
Differences ? Words
English: boys <n,pl> | Hindi: laDake/laDakoN <n,sg/pl,case>
English: He/she/it | Hindi: vaha | Telugu: atanu/aame/adi
English: is/am/are | Hindi: hai/huuN/haiN/ho
English: is going | Hindi: jaa rahaa hai/rahii hai/rahe haiN
Indian Languages • Relatively flexible word order 1. a) baccaa phala khaataa hai ‘child’ ‘fruit’ ‘eat+hab’ ‘pres’ The child eats fruits b) phala baccaa khaataa hai c) phala khaataa hai baccaa d) baccaa khaataa hai phala
Some structural differences English Declarative : Ravi is coming today Interrogative : Is Ravi coming today ? A change in the position of ‘is’ changes the meaning Hindi Declarative : ravi aaj aa rahaa hai Interrogative : kyaa ravi aaj aa rahaa hai ? The word ‘kyaa’ encodes the question information Alternatively, more natural spoken forms in Hindi: ravi aaj aa rahaa hai ? (with appropriate intonation) OR ravi aaj aa rahaa hai kyaa ?
Post nominal modification • 'ing' clauses I know [the man playing guitar] Hindi, on the other hand maiN [giTaar bajaa rahe vyakti ko] jaanataa huuN
Clauses having 'un-' negative constructions English: Unless you reach there the job will not be done Hindi: jab tak tum vahaaN nahiiN pahuNcate, kaam nahiiN hogaa
Languages Differ • Different languages have different mechanisms/devices to encode information • Some devices are common across certain languages and some are different • There are alternative ways of expressing the same meaning within the same language • Languages show preferences for one device over the others • English exploits ‘position’ for encoding information • Hindi uses ‘words’ more effectively Thus, differences in grammatical structures
Ambiguity in Natural Language (1/2) Look at the word 'plot' in the following examples (a) The plot having rocks and boulders is not good. (b) The plot having twists and turns is interesting. 'plot' in (a) means 'a piece of land' and in (b) 'an outline of the events in a story'
Ambiguity in Natural Language (2/2) • Lexical level • Sentence level • Structural differences between SL and TL in a Machine Translation system.
Lexical ambiguity • Lexical ambiguity can be both for • Content words – nouns, verbs etc • Function words – prepositions, TAMs etc • Content words' ambiguity is of two types • Homonymy • Polysemy
Homonymy A word has two or more unrelated senses Example : I was walking on the bank (river-bank) I deposited the money in the bank (money-bank)
Polysemy A word having two or more related senses Example : English word 'issue', noun 1. The issue is under discussion (muddaa) 2. The latest issue of the journal is out (aNka) 3. He buys stamps on the day of the issue (vimocan) 4. The couple has no issue even after five years of marriage (saNtaan)
Information Flow and Ambiguity 1. He scratched a figure on the rock (engrave) 2. She scratched the figure on the rock (scrape) • Other words in the context make a difference • Change of 'a' (in 1) to 'the' (in 2) changes the meaning of 'scratched'
Function words can also pose problems (1/4) Function words can also be ambiguous For example – English preposition 'in' (a) I met him in the garden maiN usase bagiice meiN milaa (b) I met him in the morning maiN usase subaha0 milaa 'Ambiguity' here refers to the 'appropriate correspondence' in the target language.
Function words can also pose problems (2/4) 1. He bought a shirt with tiny collars. usane chote kaular vaalii kamiiz khariidii ‘he tiny collars with shirt bought’ ‘with’ gets translated as ‘vaalii’ in Hindi 2. He washed a shirt with soap. usane saabun se kamiiz dhoii ‘he soap with shirt washed’ ‘with’ gets translated as ‘se’ .
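The two 'with' examples can be read as a selection rule keyed on the role of the noun that follows the preposition. The sketch below is illustrative only: the role labels, the small instrument lexicon and the function name are assumptions, and a real MT system would need far richer context to make this choice.

```python
# Hypothetical mapping from the role of an English 'with'-phrase to the
# Hindi marker used in the slide examples. The role labels are assumptions.
WITH_TO_HINDI = {
    "attribute": "vaalii",  # a shirt with tiny collars -> chote kaular vaalii kamiiz
    "instrument": "se",     # washed a shirt with soap  -> saabun se
}

# Toy lexicon of nouns that can act as an instrument (assumed, not exhaustive).
INSTRUMENT_NOUNS = {"soap", "water", "brush"}

def translate_with(object_noun):
    """Pick the Hindi equivalent of 'with' from the noun that follows it."""
    role = "instrument" if object_noun in INSTRUMENT_NOUNS else "attribute"
    return WITH_TO_HINDI[role]

print(translate_with("collars"))  # vaalii
print(translate_with("soap"))     # se
```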
Function words can also pose problems (3/4) TAM Markers mark tense, aspect and modality • Consist of inflections and/or auxiliary verbs in Hindi • An important source of information • Narrow down the meaning of a verb (eg. lied, lay)
Function words can also pose problems (4/4) English Simple Past vs Habitual 1a. He stayed in the guest house during his visit to our University in Jan (rahaa) 1b. He stayed in the guest house whenever he visited us (rahataa thaa) 2a. He went to the school just now (gayaa) 2b. He went to the school every day (jaataa thaa)
Sentence level ambiguity
I met the girl in the store
+ Possible readings
a) I met the girl who works in the store
b) I met the girl while I was in the store
Time flies like an arrow.
+ Possible parses:
a) Time flies like an arrow (N V Prep Det N)
b) Time flies like an arrow (N N V Det N)
c) Time flies like an arrow (V N Prep Det N) (flies are like an arrow)
d) Time flies like an arrow (V N Prep Det N) (manner of timing)
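One way to see where the competing parses of "Time flies like an arrow" come from is to enumerate every tag sequence that an ambiguous lexicon allows. The lexicon entries and the one-verb filter below are assumptions made for illustration; they are not a parser.

```python
from itertools import product

# Toy lexicon: each word lists the categories it can take.
# The entries are assumptions chosen to reproduce the tag sequences on the slide.
LEXICON = {
    "time": ["N", "V"],
    "flies": ["N", "V"],
    "like": ["V", "Prep"],
    "an": ["Det"],
    "arrow": ["N"],
}

def tag_sequences(sentence):
    """Return every combination of categories licensed by the lexicon."""
    words = sentence.lower().split()
    return [tuple(tags) for tags in product(*(LEXICON[w] for w in words))]

def has_exactly_one_verb(tags):
    """Crude filter: a simple clause is assumed to contain one verb."""
    return tags.count("V") == 1

candidates = tag_sequences("Time flies like an arrow")
plausible = [tags for tags in candidates if has_exactly_one_verb(tags)]

print(len(candidates))  # 8
for tags in plausible:
    print(" ".join(tags))
```

Readings (c) and (d) share the sequence V N Prep Det N and differ only in what the prepositional phrase attaches to, so tag sequences alone cannot separate them; that distinction needs syntactic analysis.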
Thus, Languages encode information differently Languages code information only partially Tension between BREVITY and PRECISION Brevity wins leading to inherent ambiguity at different levels
Human beings use • World knowledge • Context (both linguistic and extra-linguistic) • Cultural knowledge and • Language conventions to resolve ambiguities • Can all this knowledge be provided to the machine ? • Computational Linguistics aims for this.
How to provide this knowledge ? (1/2) • Analyse language at various levels (word, phrase, sentence etc) • Build Tools for analysing the natural language at various levels in a text • POS tagger (category marking) • Morphological analysers (analysis of a word) • Morphological generators (word generators) • Chunkers (shallow parsers) • Parsers (syntactic analysis) • Filters (markers for special expressions) • Sense Disambiguation Algorithms • Etc The tools need linguistic knowledge
How to provide this knowledge ? (2/2) Build language resources • Machine Readable Lexicon • Rules for various levels of linguistic analysis • Computational Grammars • Mapping rules for the concerned language pair for an MT system • Sense Disambiguation Rules • Annotated corpora • Etc
POS Tagger What is a POS? Take the following English sentence My old friend Ram recently bought a book on Indian snakes for his cousin from London from the new bookshop . • Each word in the above sentence belongs to a word class (also called a Part Of Speech (POS)) • The class to which a word may belong is based on its morphological and syntactic behavior • Morphological Kind of affixes a word takes, for example, boy, boys; girl, girls; book, books (noun class) • Syntactic How it is distributed in a sentence He chairs the next session (verb) The chairs are new (noun)
Why is POS relevant in CL/NLP ? (1/2) • Word class information of a given word in a sentence helps to predict its neighbour • WSD He runs a mile every day (verb) Their team made 250 runs (noun) Time flies like an arrow (n v prep det n) • Helps in further processing – chunking, morph pruning, sentence parsing • IR A POS tagger automatically marks the POS of all the words in a text
POS tagged sentence
My: possessive pronoun
old: adjective
friend: noun
Ram: proper noun
recently: adverb
bought: verb
a: determiner
book: noun
on: preposition
Indian: adjective
snakes: noun
for: preposition
his: possessive pronoun
cousin: noun
from: preposition
London: proper noun
,: punctuation
from: preposition
the: determiner
new: adjective
bookshop: noun
in: preposition
town: noun
POS Tagging Approaches • Rule Based • Statistical • Transformation Based
Rule Based POS Tagging • Two-stage architecture (Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971) • Stage 1: assign candidate POS tags by referring to the dictionary, e.g. the dictionary entry for the English word 'that' lists Conj, Adv, Pronoun • Stage 2: disambiguate, using manually crafted rules
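The two stages can be sketched as follows, reusing the 'chairs' examples from the earlier POS slide. The dictionary entries, the two context rules and the default tag for unknown words are all illustrative assumptions, not the rules of any of the cited systems.

```python
# Stage 1 resource: a toy dictionary listing every tag a word may take.
DICTIONARY = {
    "he": ["Pronoun"],
    "the": ["Det"],
    "chairs": ["Noun", "Verb"],  # ambiguous entry
    "next": ["Adj"],
    "session": ["Noun"],
    "are": ["Verb"],
    "new": ["Adj"],
}

def stage1_lookup(words):
    """Assign all candidate tags from the dictionary (unknown words default to Noun)."""
    return [list(DICTIONARY.get(w.lower(), ["Noun"])) for w in words]

def stage2_disambiguate(candidates):
    """Resolve ambiguous words with hand-written rules on the previous tag."""
    tags = []
    for options in candidates:
        previous = tags[-1] if tags else None
        if len(options) > 1:
            if previous == "Det" and "Noun" in options:
                options = ["Noun"]   # rule: determiner is followed by a noun
            elif previous == "Pronoun" and "Verb" in options:
                options = ["Verb"]   # rule: subject pronoun is followed by a verb
        tags.append(options[0])      # fall back to the first listed tag
    return tags

def tag(sentence):
    words = sentence.split()
    return list(zip(words, stage2_disambiguate(stage1_lookup(words))))

print(tag("He chairs the next session"))
print(tag("The chairs are new"))
```

With these rules 'chairs' comes out as a verb in the first sentence and as a noun in the second.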
Statistical • Taggers use probabilities for tagging • The tagger picks the most likely tag for a given word in a context • HMM-based algorithms are the most commonly used for the POS tagging task • Requires a manually tagged corpus
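A minimal sketch of HMM decoding with the Viterbi algorithm is given below. All probabilities are made-up numbers for illustration; in a real tagger they would be estimated from the manually tagged corpus, and the computation would normally use log probabilities and smoothing.

```python
TAGS = ["Det", "N", "V", "Adj"]

# Made-up model parameters (assumptions, not estimated from any corpus).
START = {"Det": 0.6, "N": 0.2, "V": 0.1, "Adj": 0.1}
TRANS = {
    "Det": {"Det": 0.01, "N": 0.70, "V": 0.04, "Adj": 0.25},
    "N":   {"Det": 0.10, "N": 0.20, "V": 0.60, "Adj": 0.10},
    "V":   {"Det": 0.40, "N": 0.20, "V": 0.10, "Adj": 0.30},
    "Adj": {"Det": 0.10, "N": 0.70, "V": 0.10, "Adj": 0.10},
}
EMIT = {
    "Det": {"the": 1.0},
    "N":   {"chairs": 0.6, "session": 0.4},
    "V":   {"chairs": 0.3, "are": 0.7},
    "Adj": {"new": 1.0},
}

def viterbi(words, tags=TAGS, start=START, trans=TRANS, emit=EMIT):
    """Return the most probable tag sequence for the given words."""
    # best[i][t]: probability of the best path that ends in tag t at word i
    best = [{t: start[t] * emit[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] * trans[p][t])
            best[i][t] = best[i - 1][prev] * trans[prev][t] * emit[t].get(words[i], 0.0)
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

print(viterbi(["the", "chairs", "are", "new"]))  # ['Det', 'N', 'V', 'Adj']
```

Here 'chairs' can be emitted by both N and V, and the transition from Det makes the noun reading win.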
Annotating Corpus for POS • Annotated corpora are useful for developing statistical POS taggers • Tagging scheme • Set of POS Tags • Guidelines for the annotators • The tagged corpora should be • High quality (in terms of tagging accuracy) • Consistent
POS Tags for English • English • Penn Tree Bank – 45 tags • C5 - Lancaster – 61 tags – used in CLAWS Basic tagset used for BNC http://view.byu.edu/bnc_tags.htm - C7 – 147 tags – Leech http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
Penn Treebank Tags My/PP$ old/JJ friend/NN Ram/NNP recently/RB bought/VBD a/DT book/NN on/IN Indian/JJ snakes/NNS for/IN his/PP$ cousin/NN from/IN London/NNP ,/, from/IN the/DT new/JJ bookshop/NN in/IN town/NN
POS Tags for Indian Languages Objective: To arrive at a standard POS and Chunk tagging scheme for all Indian languages Assumption: Commonality in Indian Languages
Issues in Tag Set Design (1/2) • Linguistic knowledge coarse vs fine • Syntactic function vs lexical category (for POS tags) • New tags vs tags close to existing English tags • Should be comprehensive/complete
Issues in Tag Set Design (2/2) • Simple • Less effort in manual tagging • Number of tags • Common for all Indian languages
Linguistic Knowledge :Fine vs Coarse (1/2) Example
Only noun (NN): laDakA, laDake, laDakoM, laDakI, laDakiyAM, laDakiyoM
OR Noun with gender, number, case information
(NNM) laDakA, laDake, laDakoM
(NNMS) laDakA, laDake
(NNMP) laDake, laDakoM
(NNMSD) laDakA
(NNMSO) laDake
(NNMPD) laDake
(NNMPO) laDakoM
The decision has implications for the size of corpora and machine learning
Linguistic Knowledge :Fine vs Coarse (2/2) • Alternatives • Coarse - NN (advantages/disadvantages) • Fine - NNMSD (advantages/disadvantages) • Hierarchical Example: NN_m_sg_d Hierarchical tag set provides the possibility for underspecification
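Underspecification in a hierarchical tag set can be sketched as prefix matching over the tag's fields. The field order (category, gender, number, case) is inferred from the example NN_m_sg_d, and the function names are hypothetical.

```python
def parse_tag(tag):
    """Split a hierarchical tag such as 'NN_m_sg_d' into its fields."""
    return tuple(tag.split("_"))

def subsumes(general, specific):
    """True if the underspecified tag 'general' covers the tag 'specific'."""
    g, s = parse_tag(general), parse_tag(specific)
    return len(g) <= len(s) and s[:len(g)] == g

def coarsen(tag, depth):
    """Keep only the first 'depth' fields, e.g. to train on a coarser tag set."""
    return "_".join(parse_tag(tag)[:depth])

print(subsumes("NN", "NN_m_sg_d"))    # True
print(subsumes("NN_f", "NN_m_sg_d"))  # False
print(coarsen("NN_m_sg_d", 2))        # NN_m
```

An annotator or tagger that is unsure about case can emit NN_m_sg, and the result stays compatible with both the coarse NN scheme and the fully specified one.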