Special Topics in Computer Science: Advanced Topics in Information Retrieval
Lecture 10: Natural Language Processing and IR. Syntax and structural disambiguation
Alexander Gelbukh, www.Gelbukh.com
Previous Chapter: Conclusions
• Tagging, word sense disambiguation, and anaphora resolution are cases of disambiguation of meaning
• Useful in translation, information retrieval, and text understanding
• Dictionary-based methods
  • good but expensive
• Statistical methods
  • cheap and sometimes imperfect... but not always (if very large corpora are available)
Previous Chapter: Research topics
• Too many to list
• New methods
• Lexical resources (dictionaries)
• = Computational linguistics
Contents
• Language levels
• Syntax
• Dependency approach
• Constituency-based approach
• Head-driven approach
• Grammars and parsing
• Ambiguity and disambiguation
Language levels
• Letters are built up into words
• Words into sentences
• Sentences into <...> text
• Each level has its own representation
• This allows for modular processing
• A module describes one level or transforms from one level to another
General scheme of text processing
• Linguistic processor uses linguistic knowledge
• Applied system uses other types of knowledge (e.g., Artificial Intelligence)
Language levels
• Morphological: words
• Syntactic: sentences
• Semantic: meaning
• Pragmatic: intention
• ...?
Example of text
“Science is important for our country. The Government pays it much attention.”
Textual representation
Text is a sequence of letters.
S c i e n c e i s i m p o r t a n t f o r o u r c o u n t r y . T h e G o v e r n m e n t p a y s i t m u c h a t t e n t i o n .
Morphological analysis
Morphological representation
A sequence of words.
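The step from the textual to the morphological representation can be sketched in a few lines. This is only a minimal illustration, not a real morphological analyzer: the function name `to_words` and the regular expression are assumptions made for this example, and a real analyzer would also assign a lemma and grammatical features to each word.

```python
import re

def to_words(text):
    """Split a string of letters into word and punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Science is important for our country. The Government pays it much attention."
words = to_words(text)
print(words)
```

The result is the "sequence of words" that the syntactic level takes as its input.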
Syntactic parsing
Syntactic representation
A sequence of syntactic trees.
Syntactic representation
• What happened?
• With whom did it happen?
• ... their details
Semantic analysis
Next lecture...
Syntax
• The structure describing the relationships between words in a sentence
• Describes the relationships implied by grammatical characteristics
  • not by meaning
• Often allows for simple paraphrasing
  • John reads the book
  • The book is read by John
Early approach: Dependency syntax
• Tree
  • Nodes: words
  • Arcs: "is modified by"
• To modify means to add details, clarify, choose one of many... that is, to make more specific
• Arcs are typed
  • Types are: subject, object, attribute, ...
[Figure: dependency tree with arcs labeled Recipient, Subject, Object, Attribute]
... Dependency syntax
• General situation: pay
• More specifically: the one where
  • who pays is government
  • what is paid is attention
  • to whom it is paid is it
• More specifically: attention that is much
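The dependency tree described above can be written down as a small data structure. This is a sketch under assumptions: the class `Node` and the function `arcs` are names invented for this illustration, the article "The" is left out for brevity, and the arc types follow the labels used on the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    word: str
    # Each dependent is stored as (arc type, child node).
    dependents: list = field(default_factory=list)

    def add(self, arc_type, child):
        self.dependents.append((arc_type, child))
        return child

def arcs(node):
    """Yield (head word, arc type, dependent word) for every arc in the tree."""
    for arc_type, child in node.dependents:
        yield (node.word, arc_type, child.word)
        yield from arcs(child)

# "The Government pays it much attention."
pay = Node("pays")
pay.add("subject", Node("Government"))
pay.add("recipient", Node("it"))
attention = pay.add("object", Node("attention"))
attention.add("attribute", Node("much"))

for head, arc_type, dependent in arcs(pay):
    print(f"{head} --{arc_type}--> {dependent}")
```

Note that nothing in this structure records word order, which is one reason the approach suits languages with free word order.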
Advantages and disadvantages of dependency syntax
Advantages
• Solid linguistic base
• Rather direct translation into semantics
• Easily applicable to languages with free word order
  • Korean? Russian, Latin
  • This is why it has a solid linguistic base: it works well for classical languages!
Disadvantages
• No nice mathematical base
• No simple algorithms
Most popular approach: Constituency (Phrase Structure grammars)
• Tree
  • Nodes: nested segments of the phrase
    • Cannot intersect, only nested
    • Usually are labeled with part-of-speech names
  • Arcs: nesting
    • In the classical approach, arcs are not labeled
[[Our Government] [pays [much attention] [to it]]]
Constituency
[[Our Government] [pays [much attention] [to it]]]
Our Government pays much attention to it
Constituency
[[Our/R Government/N]NP [pays/V [much/A attention/N]NP [to/P it/R]PP]VP]S
R: pronoun      NP: noun phrase
N: noun         VP: verb phrase
V: verb         PP: prepositional phrase
A: adjective    S: sentence
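Constituency: graphical representation
[[Our Government]NP [pays [much attention]NP [to it]PP]VP]S
[Figure: the same structure drawn as a tree, with S at the root, the phrase nodes NP, VP, and PP below it, and the tags R N V A N P R over the words "Our Government pays much attention to it"]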
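The bracket notation and the tree are two views of the same structure, and one can be read into the other mechanically. The following is a minimal sketch: the function names `parse_brackets` and `leaves` and the tuple representation `(label, children)` are assumptions made for this illustration.

```python
import re

def parse_brackets(text):
    """Parse '[ ... ]LABEL' notation into nested (label, children) tuples."""
    # Tokens: '[', or ']' followed by an optional label, or a plain word.
    tokens = re.findall(r"\[|\][A-Z]*|[^\[\]\s]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1                                  # skip '['
        children = []
        while not tokens[pos].startswith("]"):
            if tokens[pos] == "[":
                children.append(parse_node())     # nested constituent
            else:
                children.append(tokens[pos])      # a word
                pos += 1
        label = tokens[pos][1:]                   # text after ']' is the phrase label
        pos += 1
        return (label, children)

    return parse_node()

def leaves(tree):
    """Yield the words of the tree from left to right."""
    _, children = tree
    for child in children:
        if isinstance(child, tuple):
            yield from leaves(child)
        else:
            yield child

tree = parse_brackets("[[Our Government]NP [pays [much attention]NP [to it]PP]VP]S")
print(tree[0], "covers:", " ".join(leaves(tree)))
```

Because constituents can only be nested and never intersect, a single left-to-right pass with recursion is enough.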
Phrase structure grammar
• Enumerates possible configurations at nodes
• Usually recursive
S → NP VP
NP → A NP
NP → R NP
PP → P NP
NP → N
VP → VP NP PP
VP → V
[Figure: constituency tree of "Our Government pays much attention to it", as on the previous slide]
Context-independence hypothesis
• A configuration is possible or not, regardless of where it is used
  • Wherever you find VP NP PP, it can be VP
  • Wherever you find NP VP, it can be S
• If you can put together an S that covers the whole sentence, it is a grammatically correct description
• With this, given a suitable grammar, you can
  • List all sentences of a language
  • List only correct sentences of that language
  • List all and only correct structures
• Correctness is judged by a native speaker’s intuition
Generative idea
• Find a grammar to list all and only correct sentences (with their structures) of a language
• This is a complete description of that language!
• How can it be useful in analysis?
  • Reverse the grammar
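To make "listing sentences with a grammar" concrete, here is a small sketch that enumerates what a grammar generates. The toy grammar, the lexicon, and the function name `generate` are all assumptions invented for this example, and the `depth` limit is only there because a recursive rule such as NP → A NP would otherwise generate infinitely many sentences.

```python
from itertools import product

# Toy grammar and lexicon, invented for this illustration.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["N"], ["A", "NP"]],
    "VP": [["V", "NP"]],
}
LEXICON = {
    "N": ["science", "attention"],
    "A": ["much"],
    "V": ["needs", "pays"],
}

def generate(symbol, depth):
    """Yield every word list derivable from `symbol` with rules nested at most `depth` deep."""
    if symbol in LEXICON:
        for word in LEXICON[symbol]:
            yield [word]
        return
    if depth == 0:
        return
    for rhs in GRAMMAR[symbol]:
        # Expand every symbol of the right-hand side, then combine the alternatives.
        parts = [list(generate(s, depth - 1)) for s in rhs]
        for combo in product(*parts):
            yield [w for part in combo for w in part]

sentences = sorted({" ".join(s) for s in generate("S", 3)})
for s in sentences:
    print(s)
```

Analysis runs the same rules in the opposite direction: instead of expanding S into words, it reduces the words back to S.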
Parsing
• Given a grammar and a sentence
• Find all possible structures
  • That describe this sentence with this grammar
• Many methods. Not discussed today. A lot of research. Very fast algorithms
• Complexity: cubic in the number of words in the sentence (there are asymptotically better methods, with an exponent of about 2.8)
• Problem: combinatorics of variants
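One well-known cubic method is the CYK chart algorithm, sketched below as a recognizer only (it says whether an S exists, without building the trees). The lecture does not say which method it has in mind, so this is an illustration, not the author's algorithm. The grammar is written in the spirit of the phrase structure rules shown earlier, with assumptions labeled in the comments: unary rules are folded into the lexicon, NP → R is added so that "it" can be a noun phrase, and the helper category `NP_PP` splits the three-part rule VP → VP NP PP into two binary rules.

```python
from itertools import product

# Lexicon: word -> categories it can have on its own.
# Unary rules such as NP -> N and VP -> V are folded in here (an assumption).
LEXICON = {
    "our": {"R"}, "government": {"N", "NP"}, "pays": {"V", "VP"},
    "much": {"A"}, "attention": {"N", "NP"}, "to": {"P"},
    "it": {"R", "NP"},                      # NP -> R is an added assumption
}
# Binary rules: (left child, right child) -> possible parents.
RULES = {
    ("NP", "VP"): {"S"},
    ("A", "NP"): {"NP"},
    ("R", "NP"): {"NP"},
    ("P", "NP"): {"PP"},
    ("VP", "NP_PP"): {"VP"},                # VP -> VP NP PP, split in two
    ("NP", "PP"): {"NP_PP"},                # helper category, hypothetical name
}

def cyk(words):
    n = len(words)
    # chart[i][j] holds the categories that can span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(w.lower(), ()))
    # Three nested loops over span length, start, and split point: cubic in n.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= RULES.get((left, right), set())
    return chart

def is_sentence(text):
    words = text.split()
    return "S" in cyk(words)[0][len(words)]

print(is_sentence("Our Government pays much attention to it"))
print(is_sentence("pays Government our"))
```

The chart reuses every partial result regardless of where it will be needed, which is exactly what the context-independence hypothesis permits.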
Advantages and disadvantages of the constituency approach
Advantages
• Nice mathematics, very well understood
• Efficient analysis algorithms, very well-elaborated
• Good for languages with fixed word order
  • English. Chinese?
Disadvantages
• Difficult translation into semantics
• Bad when it comes to freer word order
  • Even in English! Worse in other languages
Head-driven approaches
• Combine some advantages of dependency-based and constituency-based approaches
• Syntax is still fixed-order. But word dependency information is added
  • Easier translation into semantics
  • More linguistically-based
• How?
  • In each constituent, the main word (head) is marked
  • It modifies the head of the larger constituent
[[Our Government] [pays [much attention] [to it]]]
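The way head marks turn a constituency tree into word dependencies can be sketched as follows. The representation `(head_index, children)`, the function names, and above all the choice of which word is the head of each constituent are assumptions made for this illustration; the head marks of the original slide are not recoverable from the text.

```python
# A constituent is (head_index, children); a child is a word or another constituent.
# Which child is the head is an assumption made for this illustration.
TREE = (1, [                        # S: its head is the verb phrase
    (1, ["Our", "Government"]),     # NP: head is "Government"
    (0, ["pays",                    # VP: head is "pays"
         (1, ["much", "attention"]),
         (0, ["to", "it"])]),       # PP: head is "to"
])

def head_word(node):
    """Follow head marks down to the word that heads this constituent."""
    if isinstance(node, str):
        return node
    head_index, children = node
    return head_word(children[head_index])

def dependencies(node):
    """Yield (head word, dependent word) pairs implied by the head marks."""
    if isinstance(node, str):
        return
    head_index, children = node
    head = head_word(node)
    for i, child in enumerate(children):
        if i != head_index:
            # The head of a non-head child modifies the head of the larger constituent.
            yield (head, head_word(child))
        yield from dependencies(child)

for head, dependent in dependencies(TREE):
    print(f"{dependent} modifies {head}")
```

The output is an untyped dependency tree; arc types such as subject or object would need additional information in the grammar.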
Syntactic ambiguity
• I see a cat with a telescope
  • I see [a cat] [with a telescope]
    • I use a telescope to see a cat
  • I see [a cat [with a telescope]]
    • I see a cat that has a telescope
• Nearly any preposition causes ambiguity
• Dozens, thousands, millions of variants for a sentence!
  • Because their numbers multiply
  • I see a cat with a telescope in a garden at the shore of a river
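How quickly the variants multiply can be checked by counting parse trees instead of listing them. The sketch below uses a chart that stores counts; the toy grammar, the lexicon, and the function name `count_parses` are assumptions made for this example, and the numbers hold for this grammar only.

```python
from collections import defaultdict

# Toy lexicon and grammar, invented for this illustration.
LEXICON = {"i": "NP", "see": "V", "a": "D", "the": "D",
           "cat": "N", "telescope": "N", "garden": "N", "shore": "N", "river": "N",
           "with": "P", "in": "P", "at": "P", "of": "P"}
# (parent, left child, right child)
RULES = [("S", "NP", "VP"), ("VP", "V", "NP"), ("VP", "VP", "PP"),
         ("NP", "D", "N"), ("NP", "NP", "PP"), ("PP", "P", "NP")]

def count_parses(sentence):
    """Return the number of distinct S trees over the whole sentence."""
    words = sentence.lower().split()
    n = len(words)
    # counts[(i, j)][X] = number of distinct trees with root X over words[i:j]
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        counts[(i, i + 1)][LEXICON[w]] = 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for parent, left, right in RULES:
                    # Trees multiply: every left subtree combines with every right one.
                    counts[(i, j)][parent] += counts[(i, k)][left] * counts[(k, j)][right]
    return counts[(0, n)]["S"]

print(count_parses("I see a cat with a telescope"))
print(count_parses("I see a cat with a telescope in a garden"))
print(count_parses("I see a cat with a telescope in a garden at the shore of a river"))
```

With this grammar the three sentences have 2, 5, and 42 structures: each added prepositional phrase can attach to the verb or to any noun phrase still open to its left.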
Ambiguity resolution
• Syntactic means are not enough
  • Is telescope more related to see or to cat?
• Statistical methods: is it used with see or cat?
• Dictionary-based methods: does it share more meaning with see or cat?
  • Path length in a dictionary of semantic relationships
• Ideally, context should be analyzed, and reasoning applied:
  • I see a cat with a telescope. It keeps the telescope in its left paw.
  • Currently there are no good methods for this.
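The statistical idea ("is it used with see or cat?") can be sketched as a comparison of co-occurrence counts. Everything concrete here is hypothetical: the counts are invented, the word "tail" is added only to show the opposite decision, and the function name `attach` is an assumption. In practice the counts would come from a large corpus.

```python
# Hypothetical co-occurrence counts for (head word, preposition, noun);
# in practice these would be collected from a large corpus.
COUNTS = {
    ("see", "with", "telescope"): 40,
    ("cat", "with", "telescope"): 1,
    ("see", "with", "tail"): 2,
    ("cat", "with", "tail"): 30,
}

def attach(verb, noun, prep, prep_noun, counts=COUNTS):
    """Return the word (verb or noun) that the prepositional phrase more likely modifies."""
    verb_score = counts.get((verb, prep, prep_noun), 0)
    noun_score = counts.get((noun, prep, prep_noun), 0)
    # Ties go to the verb here; this is an arbitrary choice.
    return verb if verb_score >= noun_score else noun

print(attach("see", "cat", "with", "telescope"))
print(attach("see", "cat", "with", "tail"))
```

Such a decision ignores context entirely, so it would still pick the wrong reading for a cat that really does hold a telescope.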
Shallow parsing
• Due to the HUGE problems in resolving ambiguity
  • Do not resolve it!
  • Do what you can do well
I see [a cat] [with a telescope] [in a garden] [at the shore] [of a river]
• Better than nothing
• Can be done well
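A shallow parser of this kind can be very simple, because it never decides what a chunk attaches to. The sketch below assumes the words are already tagged; the tag set (R, V, D, A, N, P), the chunk pattern, and the function name `chunk` are assumptions made for this example.

```python
def chunk(tagged):
    """Group (word, tag) pairs into flat chunks: optional P, optional D, any A, then N."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in {"P", "D", "A"}:
            current.append(word)                       # may start or continue a chunk
        elif tag == "N":
            current.append(word)
            chunks.append("[" + " ".join(current) + "]")   # a noun closes the chunk
            current = []
        else:
            chunks.extend(current)                     # flush words that formed no chunk
            current = []
            chunks.append(word)                        # verbs, pronouns, ... stay outside
    chunks.extend(current)
    return chunks

words = "I see a cat with a telescope in a garden at the shore of a river".split()
tags = "R V D N P D N P D N P D N P D N".split()
print(" ".join(chunk(zip(words, tags))))
```

The chunks are found in a single pass, and the question of which chunk modifies which is simply left open.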
Evaluation
• PARSEVAL international contests
• A practical parser usually gives only one variant
  • Implies disambiguation!
• Manually built corpora (treebanks)
• Compare what the program did with what humans did
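Comparing "what the program did with what humans did" is commonly done by counting matching brackets. The following is a simplified sketch of bracket precision and recall, not the full PARSEVAL procedure: the function name `bracket_scores`, the triple format `(label, start, end)`, and the two analyses are assumptions made for this example, and brackets around single words are left out.

```python
def bracket_scores(predicted, gold):
    """Labeled bracket precision, recall and F1 between two sets of (label, start, end)."""
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

# "I see a cat with a telescope", word positions 0..7.
# Hypothetical treebank analysis: the telescope is the instrument of seeing.
gold = {("S", 0, 7), ("VP", 1, 7), ("VP", 1, 4),
        ("NP", 2, 4), ("PP", 4, 7), ("NP", 5, 7)}
# Hypothetical parser output: the cat has the telescope.
predicted = {("S", 0, 7), ("VP", 1, 7), ("NP", 2, 7),
             ("NP", 2, 4), ("PP", 4, 7), ("NP", 5, 7)}

precision, recall, f1 = bracket_scores(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Here five of the six brackets match, so a single wrong attachment still leaves a high score; this is worth keeping in mind when reading such figures.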
One of the uses in IR: Lexical ambiguity resolution
• Syntactic analysis helps in POS disambiguation:
  • Oil is used well in Mexico.
  • Oil well is used in Mexico.
  • Well = ?
• But does not help in WSD:
  • I deposited my money in an international bank.
  • I live on a beautiful bank of the Han River.
Research topics
• Faster algorithms
  • E.g. parallel
• Handling linguistic phenomena not handled by current approaches
• Ambiguity resolution!
  • Statistical methods
  • A lot can be done
Conclusions
• Syntactic structure is one of the intermediate representations of a text for its processing
• Helps text understanding
  • Thus reasoning, question answering, ...
• Directly helps POS tagging
  • Resolves lexical ambiguity of part of speech
  • But not WSD-type ambiguities
• A big science in itself, with 50 (2000?) years of history
Thank you!
Till June 8? 6 pm
Semantics