Syntax and Processing it (slide set 1) • John Barnden • School of Computer Science, University of Birmingham • Natural Language Processing 1 • 2014/15 Semester 2
Our Concerns • The general nature of syntax (grammatical structure of sentences) • Its significance • Context-free grammars (productions such as S → NP VP, etc.) • Syntactic structure of a sentence, e.g. as a tree rooted in S with NP and VP daughters • Parsing: using a grammar to compute a syntax (syntactic structure) for a given sentence.
Overview • The motivation for the idea of syntax • Review of some types of phrase (syntactic categories) • (Review of) notion of context-free grammar • How separate/separable is syntax from other aspects of language? • Methods for syntactic analysis: • Active chart parsing (REQUIRED, covered in lectures) • Dependency Parsing (in outline; REQUIRED, but not covered in lectures) • Definite Clause Grammars [as an extension to Prolog] [OPTIONAL, not in lecs] • Recursive transition networks [OPTIONAL, not in lecs]
Motivation • Sentences can apparently be correctly or incorrectly structured in a way that is largely independent of the meaning or the plausibility of the meaning. • The following seem correctly structured: • The monkey ate the banana. • The banana ate the monkey. • The car hit the building. • The building hit the car. • And moreover they seem to have the same structure. • The following seem incorrectly structured: • Banana the ate monkey the. • The the car building hit. • And we notice that incorrectness without having to consider meaning.
Motivation, contd • At the same time, what particular structure we see in a sentence affects meaning in important ways. • Classic example: The girl saw the woman with the telescope in the park. • Some possible rough structures and interpretations: • [The girl] saw [the woman] [with the telescope] [in the park] • The seeing used the telescope and happened in the park • [The girl] saw [the woman] [with [the telescope in the park]] • The seeing used the telescope that was in the park • The girl [saw [the woman with the telescope] [in the park]] • The seeing happened in the park; the woman had a telescope • The girl [saw [the woman [with [the telescope in the park]]]] • The woman was with the telescope that was in the park. • Any others?
Motivation, contd • We also see that phrases of certain sorts can be moved around in certain ways. • E.g., we can do this to • The girl saw the woman with the telescope in the park. • to get • The woman in the park with the telescope saw the girl. • But we cannot move a phrase just anywhere and we cannot move just any phrase around: e.g. cannot change the original to get • The with the telescope girl saw the woman in the park. • The girl saw the telescope the woman with in the park.
Motivation, contd • Recursive structuring: phrases can be seen to have other phrases as constituents: • We can see • The woman with the telescope in the park eating sandwiches • as • [The woman [with [the telescope [in the park]]] [eating sandwiches]] • There is no in-principle limit to the depth of structuring or to its breadth. • Hence there is no in-principle limit to sentence length.
Motivation, contd • Some sorts of constituent of a sentence generally require the presence (in suitable positions) of other constituents of particular types. E.g. • A preposition needs to be followed by a specific sort of phrase • A verb may need to have surrounding constituents playing the role of subject, direct object, etc. • Any phrase of an appropriate type, no matter how internally complex the phrase is, can play a given role.
Motivation, contd: a Caveat • English grammar is very much based on contiguity (adjacency) of lexical forms and their order in the sentence. • This is partly because it has very little in the way of agreement between words (e.g. a plural subject requiring a plural form of the verb) and very little in the way of case endings (e.g. nominative versus accusative forms of a pronoun) or verb inflections. • But languages with more pervasive use of agreement, case endings, verb inflections, etc. can have widely separated words that form units. E.g., an adjective and a noun can be associated with each other by agreement rather than contiguity. • English arguably has examples of this: notably verb-particle phrases such as “take up” meaning “adopt”. The verb and the particle are often separated, as in “He took that marvellous plan of Julie’s up”. But they arguably form a syntactic unit. • We can even imagine a language where order and adjacency are completely irrelevant.
a Caveat, contd. • So, in general, talk of “phrases” and “constituents” needs to be taken as encompassing non-contiguous sets of lexical forms, even though in the following they will always be contiguous sequences.
Motivation: Summary • Language has constituent structure: the expressions of the language can be structured into constituents of particular types, where a constituent of a particular type can often have, or may need to have, sub-constituents of particular types. • Language has syntactic compositionality: there are composition rules that, whenever they are given constituents of particular types, produce constituents (or whole sentences) of particular types, where the resulting types are dependent only on the rule and the constituents given to it. • The compositionality is linked to the constituent structure: to at least a first approximation, the resulting constituent contains the original constituents as parts. • E.g. Putting a determiner and a noun together to get a (simple sort of) “noun phrase”; putting a preposition together with a noun phrase to get a “prepositional phrase”; putting a noun phrase and a prepositional phrase together to get a noun phrase. • So, the constituent structure is systematic by virtue of the compositionality. • Through the syntactic compositionality, language is (in principle) productive: the composition rules can be applied successively to any extent, leading to infinitely many possible constituents and sentences. • And constituents/sentences can be of any structural depth and length, in principle.
Performance versus Competence • The ability freely to apply a grammar when speaking or understanding is often taken to be a sort of competence. • But often practical considerations place limits on what one can do: e.g. what amount of complexity one can tolerate in a sentence. • Notice the in principle bits on previous slide. • And one can make mistakes. • These perturbations of competence are labelled as being matters of performance. • But some performance matters can be built into grammars, so it’s ultimately rather arbitrary which effects are relegated to “performance.” • Performance also encompasses graded effects like relative perceived naturalness of structures in specific cases. E.g.: although “He gave the crying boy a nice red pencil” and “He gave a nice red pencil to the crying boy” are both equally OK, consider: • He gave the crying boy who had spots all over his face and a hole in his shirt a nice red pencil • He gave a nice red pencil to the crying boy who had spots all over his face and a hole in his shirt.
Syntactic Categories • Syntactic Categories: the categories of constituent: e.g. Sentence (S), Noun Phrase (NP), Verb Phrase (VP), Prepositional Phrase (PP), etc., • and I include all the lexical categories (Noun, Adjective, etc.). • Sentence (S): phrases in S are usually the largest lexical-form sequences that are subject to syntactic characterization. • But people have proposed grammar-like structuring of discourse more broadly. • A special case of this is patterns of “turn taking” in conversation. • Of course, can trivially go beyond sentences, as e.g. defining a paragraph to be a sequence of sentences. • A sentence can in principle contain a smaller sentence as a constituent, although a grammar often classifies the smaller sentence as a “clause” instead. • A clause is a sentence-like constituent; often can’t stand as a sentence in its own right – e.g. Relative clause “that is in the park” following “the telescope”. • S needs to cover declarative (= statement-making), interrogative, and imperative sentences.
Syntactic Categories: S contd • To cope with language in practice, we have to allow S to be broader than normally thought of, or allow a discourse to include complete utterances that are non-sentences. E.g., consider the following complete utterances: • Yes. / Hallo. / Hmm? / Oh, damn. / How about it. • Brilliant. / Or tomorrow, for that matter. / A beer, please. / You silly idiot. • [Could be considered to be cases of ellipsis.] • Abbreviated language of various other sorts, such as: • the expression in square brackets just above • newspaper headlines • road signs • telegrams
Syntactic Categories: General Problems • Ellipsis more generally needs to be taken care of: either within the definition of S and other syntactic categories, or as a special side-issue that can distort constituents. E.g. • John wants to have snail and seaweed ice-cream on his cake, but Mary just seaweed. • Some idioms or other fixed phrases, and some special constructions, have strange syntax: • By and large, ... / By the by, ... • The more I walk the more relaxed I get. [Can also be just part of a larger unit.] • Punctuation: typically not handled in a grammar, partly because too inconsistent and too style-dependent. • Direct speech/thought is often not quoted, leading to special structuring, e.g.: • No! I’m not up for that, he said/thought.
Syntactic Categories: Noun Phrase (NP) • An NP is a constituent that can act like a noun. • Examples [with illustrative context in square brackets]: • any noun, any proper name • any non-adjectival pronoun (I, him, these [when pronoun], which [when pronoun], ...) • the horse / that horse / my horse / John’s horse • three huge red cars • one hundred and three huge, very red cars • the horse with a long tail and a glossy mane (includes a PP) • the horse that had a long tail and a glossy mane (includes a relative clause) • the man called Horse [lived with the Algonquins] • the baking (gerund) • baking a cake [is a tricky business] (gerund) • which horse [do you want to ride?] • for John to shout at a policeman [is unusual] (for-complement used as noun) • to shout at a policeman [is to run a huge risk] (to-complement used as noun) • that John shouted at a policeman [amazes me] (that-complement used as noun) • the Riviera firefighter arson scandal (noun-noun compound)
Syntactic Categories: Prepositional Phrase (PP) • a PP is a preposition followed by an NP • on the table • with metal legs • The preposition may be more than one word (e.g.: it can cost up to five pounds). • The NP can itself include a PP (e.g.: on the table with the metal legs). • In that example, “with the metal legs” serves an adjectival function: • it’s a form of adjectival phrase • He broke the table on Saturday: the PP serves an adverbial function: • it’s a form of adverbial phrase • Should “in back of”, “by means of”, “with respect to” etc. be counted as multi-word prepositions? Or should “back”, “means” and “respect” be treated as separate nouns?
Syntactic Categories: Verb Phrase (VP) • a VP is a verb possibly preceded by auxiliary verbs and possibly followed by (non-subject) complements and adjuncts • washed • washed the dishes • wash the dishes (could be a command) • do wash the dishes • don’t / do not wash the dishes • [we] wash the dishes • was putting the cloth on the table at 5pm • “the cloth” and “on the table” are complements of the verb: necessary parameter values • “at 5pm” is an adjunct: an additional, optional qualification • believed that John was laying the table • “that John was laying the table” is a that-complement, containing a clause/sentence • wanted to lay the table / Mary wanted John to lay the table: • here we have a to-complement, containing a VP or a clause (a sentence-like phrase)
Syntactic Categories: Relative Clauses • A relative clause is a clause with an adjectival function, hence a form of adjectival clause. • A restrictive relative clause: a clause that has an adjectival function within a noun phrase, further restricting what is being talked about: • The goat that/which had long ears [ate my shoelaces] (pedantically which is held to be incorrect) • The goat what/wot ’ad long ears [ate my shoelaces] (incorrect!) • The man that/who had long ears [ate my lunch]. (pedantically who is held to be incorrect) • The man with whom I had lunch [had long ears]. • The man I had lunch with [had long ears] • The man that/who I had lunch with [had long ears] • A non-restrictive relative clause: a clause that is in apposition with (= accompanies) a noun phrase, and adds extra information to something already specified by the noun-phrase: • The goat, which had long ears, [ate my shoelaces] • The man, with whom I had lunch by the way, [had long ears] • The man, who I did have lunch with by the way, [had long ears] • Restrictive case: no commas. Non-restrictive case: commas.
Syntactic Categories: Relative Clauses, etc. • The goat really did eat my shoelaces. [Needed for exam ] • [Mary knows] who laid the table: a who-complement, with the same internal syntax as a relative clause. • Apposition is a broader phenomenon than non-restrictive relative clauses: • Theodosia Kirkbride, a most refined landlady, never mentioned the rent to me. • Theodosia Kirkbride – landlady, mud wrestler and quantum physicist – ate my lunch. • Theodosia Kirkbride, the landlady, never mentioned the rent to me. • The landlady, Theodosia Kirkbride, never mentioned the rent to me. • The vile boy, singing an annoying little X Factor tune, kicked my dustbin over.
What Now? • Particular syntactic structures as syntax trees. • NB: Somewhat restrictive notationally. Not the only method used, but is a standard tool. • Context-free grammar for capturing rules of syntactic structure (that is generally expressed as trees). • Active Chart Parsing – a major, instructive traditional parsing mechanism, based on CF grammar • Dependency Parsing: a major current, statistical learning based technique, with a different view of syntactic structure [slides to be linked from module webpages; see also the textbook] • Optional reading (slides to be linked from module webpages): • Definite Clause Grammars: based on traditional CF grammar in Prolog. • Recursive Transition Networks: based on traditional CF grammar, in network form
a Syntax Tree • Legend: XY: non-terminal syntactic categories; Xyz: terminal syntactic categories; abc: wordforms. • The tree for “The dog sat on a big mat with dignity”, with both PPs attached to the VP, shown as a labelled bracketing: • [S [NP [Det The] [Noun dog]] [VP [Verb sat] [PP [Prep on] [NP [Det a] [Adj big] [Noun mat]]] [PP [Prep with] [NP [Noun dignity]]]]]
Ambiguous Syntax: Another Parse • The same sentence with the second PP attached inside the NP (“a big mat with dignity”): • [S [NP [Det The] [Noun dog]] [VP [Verb sat] [PP [Prep on] [NP [Det a] [Adj big] [Noun mat] [PP [Prep with] [NP [Noun dignity]]]]]]]
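The two parses can be made concrete in code. Below is a minimal sketch (mine, not from the slides) representing each parse as nested Python tuples. The tree shapes differ, but both cover the same word string — which is exactly what makes the sentence syntactically ambiguous.

```python
# Each tree node is (category, children...); leaves are (category, word).

tree1 = ("S",
         ("NP", ("Det", "The"), ("Noun", "dog")),
         ("VP", ("Verb", "sat"),
          ("PP", ("Prep", "on"),
           ("NP", ("Det", "a"), ("Adj", "big"), ("Noun", "mat"))),
          ("PP", ("Prep", "with"), ("NP", ("Noun", "dignity")))))

# Second parse: "with dignity" attached inside the NP "a big mat ...".
tree2 = ("S",
         ("NP", ("Det", "The"), ("Noun", "dog")),
         ("VP", ("Verb", "sat"),
          ("PP", ("Prep", "on"),
           ("NP", ("Det", "a"), ("Adj", "big"), ("Noun", "mat"),
            ("PP", ("Prep", "with"), ("NP", ("Noun", "dignity")))))))

def bracket(node):
    """Render a tree as a labelled bracketing, e.g. [S [NP ...] [VP ...]]."""
    cat, *rest = node
    if len(rest) == 1 and isinstance(rest[0], str):     # leaf: (category, word)
        return f"[{cat} {rest[0]}]"
    return "[" + cat + " " + " ".join(bracket(c) for c in rest) + "]"

def yield_of(node):
    """The sentence (leaf words, left to right) covered by a tree."""
    cat, *rest = node
    if len(rest) == 1 and isinstance(rest[0], str):
        return [rest[0]]
    return [w for c in rest for w in yield_of(c)]

print(bracket(tree1))
print(bracket(tree2))
# Different structures, same word string — the definition of ambiguity here:
assert bracket(tree1) != bracket(tree2)
assert yield_of(tree1) == yield_of(tree2)
```

The nested-tuple encoding is one of many possible representations; a parser such as the chart parser covered later would produce both trees from the one input.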
Diagnostic Syntax Exercise(optional homework, not assessed) • For each of the following, annotate each word with a POS (can just be simple ones like Noun, Verb, etc.) and provide a syntax tree. If you can’t do the latter, at least annotate some phrases as to what sort of thing they are: “noun phrase”, “verb phrase”, “prepositional phrase”, “relative clause” or whatever. • Dogs hate cats. • The dog bit three cats. • The dog bit the cat with the long tail. • The girl saw the woman with the telescope in the park. • The man who came to dinner was my uncle. • Shouting at a policeman is not a sensible thing to do. • For John to shout at a policeman is unusual. • The more I walk the thinner I get. • By and large, the British are confused about their identity. • Mary believes that Cadillacs are a sort of fruit. • Mary wants to eat a Cadillac, boiled then fried.
Context-Free Grammars (CFGs) • A CF grammar consists of productions (a form of rule). A production is of form: • <non-terminal symbol> → <seq of one or more symbols (terminal or non-terminal)> • Non-CF grammars: more symbols allowed on LHS. • Non-terminal symbols: non-terminal syntactic categories (NP, etc.) and special categories • (e.g., AdjSeq, PPseq – though these could be regarded as new syntactic categories). • Terminal symbols: lexical categories (Noun, etc.) and sometimes specific lexical forms that play a special syntactic function (e.g.: to {preceding infinitive verb}, by {in passive voice}). • Examples of productions: • S → NP VP • S → NP VP PP • S → VP • NP → Noun • NP → Det AdjSeq Noun • AdjSeq → Adj • AdjSeq → AdjSeq Adj
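To make the production format concrete, here is a minimal sketch (mine, not the slides' method — the serious parsing technique, active chart parsing, comes later) that stores productions like the examples above in a Python dict and checks by naive top-down search whether a sequence of lexical categories can be derived from S. One deliberate change: AdjSeq is written right-recursively (Adj AdjSeq), because the slide's left-recursive version (AdjSeq Adj) would send this naive recognizer into an infinite loop. The VP productions are invented for illustration.

```python
# Hypothetical toy grammar based on the example productions above.
GRAMMAR = {
    "S":      [["NP", "VP"], ["NP", "VP", "PP"], ["VP"]],
    "NP":     [["Noun"], ["Det", "AdjSeq", "Noun"]],
    "AdjSeq": [["Adj"], ["Adj", "AdjSeq"]],   # right-recursive variant
    "VP":     [["Verb"], ["Verb", "NP"]],     # assumed, not from the slide
    "PP":     [["Prep", "NP"]],
}

def derives(symbols, cats):
    """Can this symbol sequence derive exactly this list of lexical categories?"""
    if not symbols:
        return not cats                        # success iff input is also used up
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:                    # terminal (lexical) category
        return bool(cats) and cats[0] == head and derives(rest, cats[1:])
    # Non-terminal: try each production for it in turn.
    return any(derives(body + rest, cats) for body in GRAMMAR[head])

# "The big dog bit cats" -> Det Adj Noun Verb Noun
print(derives(["S"], ["Det", "Adj", "Noun", "Verb", "Noun"]))   # True
print(derives(["S"], ["Det", "Noun"]))                          # False
```

This brute-force search is exponential in the worst case; the chart parser introduced later avoids the wasted re-derivations.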
Context-Free Grammars (CFGs), contd. • Some other refinements: • Kleene star for zero or more occurrences: S → NP VP PP* • Related + symbol for one or more occurrences: NNcompound → Noun+ • Square brackets for optional elements: NP → [Det] Adj* Noun • Upright bar for alternatives: MOD → PP | RELCLAUSE • Such refinements reduce the number of productions needed, but are not formally necessary (they don’t increase the coverage of the grammar). • But can affect precise syntactic structure: PP* causes possibly more branches at parent node, • whereas PPseq leads to binary branching. • NB: Traditionally, “N”, “V” are used instead of our “Noun”, “Verb”.
Tree view of productions • A production such as VP → Verb NP can equivalently be drawn as a tree fragment: a VP parent node with Verb and NP as its daughters.
Grammatical Categories • A grammatical category is a dimension along which (some) lexical or syntactic constituents can vary in limited, systematic ways, such as (in English): • Number singular or plural: lexically, nouns, verbs, determiners, numerals • Person first, second and third: lexically, only for verbs, nouns and some pronouns • Tense present, past (various forms), future: lexically, only for verbs • Gender M, F, N [neither/neuter]: lexically, only some pronouns and some nouns • Syntactic constituents can sometimes inherit grammatical category values from their components, e.g. (without showing all possible GC values): • the big dog: 3rd person M/F/N singular NP // the big dogs: 3rd person M/F/N plural NP • we in the carpet trade: 1st person M/F plural NP // you silly idiot: 2nd person M/F singular NP • eloped with the gym teacher: past-tense VP // will go: future-tense VP • the woman with the long hair: female NP // the radio with the red knobs: neuter NP • A lexical or syntactic constituent can be ambiguous as to a GC value: • e.g. sheep: singular/plural; manage: singular/plural 1st/2nd person
((Aside)) • Syntactic constituents can sometimes have grammatical category values arising from the combination of components, not inherited from (though affected by those of) individual components: • Will have gone: future-tense & perfect-aspect VP {will have: future; gone: past participle}
Grammatical Categories, contd • Why worry about grammatical categories? Because there needs to be agreement of certain sorts. We can’t have the following (the * indicates incorrectness): • Within NPs as to number: • * a flowers // * this flowers // * many flower // * three big flower • Between VPs and their NP subjects as to number & person, and case when NP is pronoun: • * he eagerly eat [when not subjunctive] • * the women in the theatre eats • * all the women was • * I is // * us are // * them went • Between VPs and their pronoun direct objects as to case: • * the guy robbed we • Between NPs and pronouns referring to them, as to number and gender: • * the woman took out their gun and shot his dog [when it’s her gun and her dog]
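Agreement checks of this kind are often implemented as feature unification: each constituent carries a small feature dictionary, and combining two constituents fails if any shared feature clashes. A minimal sketch — the feature names and lexical entries here are illustrative assumptions, not a real lexicon:

```python
def unify(f, g):
    """Merge two feature dicts; return None on a clash (e.g. sg vs pl)."""
    out = dict(f)
    for key, value in g.items():
        if key in out and out[key] != value:
            return None          # agreement violation
        out[key] = value
    return out

# Hypothetical lexical entries:
det_a   = {"num": "sg"}                  # determiner "a"
flower  = {"num": "sg", "person": 3}     # noun "flower"
flowers = {"num": "pl", "person": 3}     # noun "flowers"

assert unify(det_a, flower)  is not None   # "a flower"   : fine
assert unify(det_a, flowers) is None       # "* a flowers": number clash
```

The same mechanism extends to subject-verb and pronoun agreement: values that a constituent leaves unspecified (like gender on "flower" here) simply unify with anything, which also models GC ambiguity such as "sheep" being singular or plural.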
Centre Embedding • CFGs can deal with unrestricted centre embedding. Matching indices below pair each subject with its verb (the innermost pair is left unmarked): • The book(1) that the professor recommended is good(1). • The book(1) that the professor(2) that the students like recommended(2) is good(1). • The book(1) that the professor(2) that the students(3) that dogs hate like(3) recommended(2) is good(1). • English grammar supposedly allows this to unrestricted depth. • But this is disputed .......
First, some variant examples • Without the “that”s (same indexing as on the previous slide): • The book is good. • The book(1) the professor recommended is good(1). • The book(1) the professor(2) the students like recommended(2) is good(1). • The book(1) the professor(2) the students(3) dogs hate like(3) recommended(2) is good(1). • Formally similar but much more difficult yet (even if you put back the thats): • Dogs fight. • Dogs dogs fight fight. • Dogs dogs dogs fight fight fight. • Dogs dogs dogs dogs fight fight fight fight.
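The "Dogs … fight …" pattern makes the recursive shape easy to generate mechanically: each subject wraps around the clause inside it, giving n occurrences of "dogs" followed by n of "fight". A tiny sketch (mine) that produces the examples at any depth:

```python
def embedded(n):
    """Centre embedding a la 'Dogs dogs fight fight': n 'dogs', then n 'fight'.

    The recursion mirrors the constituent structure: the outer subject and
    its verb wrap around the whole inner clause.
    """
    if n == 0:
        return []
    return ["dogs"] + embedded(n - 1) + ["fight"]

print(" ".join(embedded(1)))   # dogs fight
print(" ".join(embedded(3)))   # dogs dogs dogs fight fight fight
```

Note that a CFG rule generating this (roughly S → dogs S fight) takes two lines, while our comprehension gives out at depth two or three — the gap that the performance/competence dispute on the next slide is about.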
The Dispute about such examples • We feel increasing difficulty with increasing depth of centre embedding, especially but not exclusively when semantic cues don’t serve to keep the levels distinct. • It’s traditional to say that this difficulty is “just” a matter of performance: i.e. just a matter of practical restrictions on what we can handle by way of grammar, and doesn’t affect our competence: i.e. the knowledge of grammar that we actually have. • And certainly if we think hard and long enough we can make sense of any amount of embedding, it might be argued. • But another view is that such competence is illusory in the sense that it’s not part of our natural, unreflective use of language: we can only go beyond the limitations by special, conscious effort. So the grammar involved in natural communication does not have things like unrestricted centre embedding. • Many connectionists (neural net theorists), and others, have views like this.
Finite State Automata/Networks • FSAs cannot achieve unrestricted centre embedding. They are therefore traditionally said to be inadequate for capturing grammars of natural languages. • A variant form of network called Recursive Transition Networks (see separate optional slides) cures the problem.
Finite State Automata/Networks (Optional) • Notice the basic structure that the centre-embedded examples I gave have is • Aⁿ B Cⁿ, where A, B, C are syntactic categories of some sort, and both occurrences of n are indeed the same value. (See the numerals annotating the examples: they show the n layers.) • It can be proved that FSAs cannot recognize such structures for arbitrary n while excluding structures of form Aⁿ B Cᵐ where m is different from n. • Intuitively, the basic problem with FSAs is that they can’t remember a count that could get indefinitely large, in this case n. • Since the automaton can only be in finitely many different states, it’s impossible for it to reliably know, once something of form Aⁿ B has been consumed, what n was in all cases. So there’s no way it could possibly consume just the right number of Cs in all cases.
FSAs, contd. (Optional) • Exercise: satisfy yourself that if there is some limit on n, then an FSA can work. • Simple exercise: be sure you know how an FSA could recognize things of form Aⁿ B Cᵐ where m may be different from or equal to n. • In both those tasks, you can take A, B and C to be single lexical forms a, b and c, for simplicity. • The point of those exercises is that they show that the problem is: not being able to remember an indefinitely large count.
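The contrast in the two exercises can be seen directly in code. In this sketch (mine, using the single lexical forms a, b, c as suggested), the aⁿ b cᵐ language is checked with a regular expression alone — exactly the power of an FSA — while the aⁿ b cⁿ check additionally compares two counts, which is precisely what a finite-state device cannot do for unbounded n.

```python
import re

def a_n_b_c_m(s):
    """FSA-equivalent check: any number of a's, one b, any number of c's."""
    return re.fullmatch(r"a*bc*", s) is not None

def a_n_b_c_n(s):
    """Beyond FSA power: the number of a's must equal the number of c's.

    The regular expression only finds the two runs; the equality test on
    their lengths is the unbounded count an FSA cannot remember.
    """
    m = re.fullmatch(r"(a*)b(c*)", s)
    return m is not None and len(m.group(1)) == len(m.group(2))

assert a_n_b_c_m("aabccc") and a_n_b_c_m("b")
assert a_n_b_c_n("aabcc") and not a_n_b_c_n("aabccc")
```

With a limit on n (the first exercise), the equality test could be unrolled into finitely many states, so an FSA suffices; without a limit it cannot, which is the standard argument for needing at least context-free power (e.g. Recursive Transition Networks) for centre embedding.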