A Brief History of the Penn Treebank


Presentation Transcript


  1. A Brief History of the Penn Treebank Mitch Marcus

  2. For those who know nothing about the Treebank...

  3. The Penn Treebank: A Syntactically Annotated Corpus • Wall Street Journal: 1.3 million words • Brown Corpus: 1 million words • Switchboard: 1 million words • All tagged with part-of-speech & syntactic structure • Developed ’88-’94 [Figure: parse tree of "Analysts have been expecting a GM-Jaguar pact that would give the US car maker an eventual 30% stake in the British company"]
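
A quick way to see the kind of annotation this slide describes is the small sample that ships with NLTK. The sketch below is illustrative only (it is not part of the talk) and assumes Python with NLTK installed and the 'treebank' data downloaded; NLTK distributes roughly a 10% sample of the WSJ portion of the Treebank.

    # Illustrative only: inspect the POS tags and syntactic bracketings in
    # the WSJ sample distributed with NLTK.
    # Assumes: pip install nltk; nltk.download('treebank')
    from nltk.corpus import treebank

    print(len(treebank.fileids()), "files in the sample")
    print(treebank.tagged_words()[:8])     # part-of-speech tagged tokens
    print(treebank.parsed_sents()[0])      # first bracketed (parsed) sentence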

  4. 1995: A breakthrough in parsing. 10⁶ words of Treebank annotation + machine learning = robust parsers (Magerman ’95). [Figure: training diagram in which training sentences and their answer trees feed a training program that produces models for the parser; parse tree of "The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya and North Korea"] • 1990 best hand-built parsers: ~40-60% accuracy (guess) • 1995+ statistical parsers: >90% accuracy • (both on short sentences)

  5. Measuring Progress with Parseval: ’95-’05
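
Parseval scores a parser by comparing its labeled brackets against the treebank's. The sketch below is a hedged illustration of the core precision/recall/F1 computation (the standard EVALB tool adds further details such as crossing-bracket counts); the span representation and the example sets are invented for illustration, not data from the talk.

    # Illustrative sketch of Parseval-style labeled bracket scoring.
    # A constituent is represented as a (label, start, end) span.
    def parseval(gold, pred):
        gold, pred = set(gold), set(pred)
        correct = len(gold & pred)          # brackets found in both trees
        precision = correct / len(pred)
        recall = correct / len(gold)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    gold = {("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 5)}
    pred = {("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("PP", 2, 5)}
    print(parseval(gold, pred))             # (0.75, 0.75, 0.75)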

  6. Current Google Scholar cites: 3222 • Using measures of semantic relatedness for word sense disambiguation • Instance-based natural language generation • Re-structuring, re-labeling, and re-aligning for syntax-based machine translation • Predicting syntax: Processing dative constructions in American and Australian varieties of English • Efficient third-order dependency parsers • Modeling reading development: Cumulative, incremental learning in a computational model of word naming • Improving syntactic coordination resolution using language modeling • Minimized models and grammar-informed initialization for supertagging with highly ambiguous lexicons • Posterior regularization for structured latent variable models • Unsupervised Acquisition of Lexical Knowledge From N-grams: Final Report of the 2009 JHU CLSP Workshop

  7. A VERY PARTIAL HISTORY (1 of 2) BROWN CORPUS (1964) • 1 million words balanced between a wide range of text genres • American English • Annotated with parts of speech (completed in 1979). IBM/LANCASTER "TREE BANK" (1986-) • Now over 2 million words • AP Newswire and Canadian parliamentary debates (English only) • Part of speech AND skeletal parsing PENN TREE BANK Phase 1 (1989-1992) • 3 million words parsed; 4.5 million words tagged.

  8. A VERY PARTIAL HISTORY (2 of 2) ACL/DCI CD-ROM PRESSED - 1991 • 600 MB of text data available at nominal cost BRITISH NATIONAL CORPUS (1991-) • Goal: 100 million words tagged, some parsed U.S. LINGUISTIC DATA CONSORTIUM (4/92-) • $5+ million to collect annotated and unannotated speech and text.

  9. CONTEXT

  10. Treebank Born here

  11. Fred Jelinek – ACL Talk “We were not satisfied with the crude n-gram language model we were using and were “sure” that an appropriate grammatical approach would be better. Because we wanted to stick to our data-centric philosophy, we thought that what was needed as training material was a large collection of parses of English sentences. We found out that researchers at the University of Lancaster had hand-constructed a “treebank” under the guidance of Professors Geoff Leech and Geoff Sampson (Garside, Leech, and Sampson 1987). Because we wanted more of this annotation, we commissioned Lancaster in 1987 to create a treebank for us. Our view was that what we needed above all was quantity, possibly at some expense of quality: We wanted to extract the grammatical language model statistically and so a large amount of data was required. Another belief of ours was that the parse annotation should be carried out by intelligent native speakers of English, not by linguistic experts, and that any inaccuracies would naturally cancel each other. Indeed, the work was done by Lancaster housewives led by a high school drop-out.”

  12. Geoffrey Leech – University of Lancaster • Lancaster-Oslo-Bergen corpus 1978 • IBM-Lancaster treebank 1987-1992 • “Skeleton parsing” 1988 • Corpus never made available

  13. On the other hand... “Each assigned parse was checked against this grammar. A discrepancy could either signal an error in the assigned parse, which the treebanker would then correct, or a discrepancy in the grammar...” – Roger Garside, Anthony McEnery. Statistically-Driven Computer Grammars of English: The IBM/Lancaster Approach, Ezra Black, Roger Garside, and Geoffrey Leech (editors)

  14. Jelinek ACL Lifetime Achievement Talk “Meanwhile, Geoff Leech set out to assemble the British National Corpus and we thought that the United States should have something like that as well. So in 1987 I arranged to visit Jack Schwartz, who was the boss of Charles Wayne at DARPA, and I explained to him what was needed and why. He immediately entrusted Charles with the creation of the appropriate organization. One of the problems was where the eventual corpus should reside. Deep-pocketed IBM would be unsuitable: Possessors of desirable corpora would charge immoderate sums for the acquisition of rights. I thought that only a university would do. So I inquired of Aravind Joshi and Mitch Marcus (and perhaps even Mark Liberman) at the 1988 Conference on Applied Natural Language Processing in Austin whether the required site could be the University of Pennsylvania. My colleagues were interested, and Charles Wayne invited appropriate people to a meeting at the Lake Mohonk Mountain House to discuss the matter. That is how the Linguistic Data Consortium was born.” (Fred conflates two different stories – Mitch Marcus)

  15. My (and others’) recollections DARPA Mohonk Workshop May 1988 – Allen Sears asks me if I’d take on a syntactically annotated corpus Right before workshop – Fred Jelinek calls me out of the blue & asks if I’d be interested in annotating a corpus [Charles Wayne at Fred’s memorial service: Fred visited him and asked him to diagram sentences...] Aravind: At an earlier session at Mohonk, Fred talked to Aravind about my interest... (LDC created a year or two later...)

  16. Treebank Proposal

  17. Draft of original DARPA proposal 7/88 This paper proposes the development of a national "tree bank" of both written and spoken sentences in context to be done at the University of Pennsylvania. Both written and spoken materials will be annotated with part of speech and skeletal syntactic parsings; spoken material will additionally include at least self-correction markers and major intonational boundaries. This data base will be developed over a ten year period, and will consist of over 100 million words, predominantly of written text. The total cost of this project is likely to be about $10 million. This proposal should be viewed as extending a relatively small pilot project in Lancaster, England, funded by IBM, that will have annotated one million words of text from exactly one source. To begin this project, we request that DARPA commit funds for an initial two-year pilot program at the University of Pennsylvania, which will develop tools and techniques for building this tree bank as well as immediately beginning the task of acquiring and annotating written and spoken language.

  18. Draft of original DARPA proposal 7/88 On Tuesday, May 10, 1988, a group of fifteen researchers from various research institutes, industrial research labs and government agencies met with several of us at the University of Pennsylvania and concluded that it was both feasible and desirable to construct two very large parallel data bases of judiciously annotated written and (transcribed) spoken American English.... The cornerstone of both data bases should be a linguistic annotation, primarily of syntactic structure, ... consistent with the principle that the resulting corpus should (a) ... accurately reflect the distribution of grammatical phenomena in English, (b) accurately reflect deeply felt intuitions of native speakers as to the structure of English, and (c) be theory neutral between a wide range of approaches to grammar. In what follows, we will concentrate on describing an initial two-year pilot phase which will result in a robust methodology to jointly maximize (a) the consistency of the annotation scheme, and (b) the richness of the annotation. It is essential that the research strategy pursued must yield immediate output, and we intend to do so.

  19. Draft of original DARPA proposal 7/88 The immediate goal will be to develop a protocol that elicits consistent information quickly. We will begin by asking annotators to do a fairly shallow analysis, and ascertain that this is being done quickly and consistently. Only then will we attempt deeper annotation, repeating this cycle until either consistency cannot be maintained or the analysis takes too long.

  20. THE TREEBANK I PROJECT

  21. MOTIVATIONS • EMPIRICAL LINGUISTIC STUDIES • "GLASS BOX" EVALUATION OF NLP SYSTEMS • AUTOMATIC EXTRACTION OF LINGUISTIC STRUCTURE - Various approaches possible: statistical modeling, symbolic learning, connectionist • STOCHASTIC NATURAL LANGUAGE PROCESSING IN: spoken language systems, message understanding systems

  22. GOALS OF THE PENN TREEBANK PROJECT - 1 • To annotate a corpus of approx. 5 million words of text with part of speech and "skeletal" syntactic information • To develop, test and automate approaches and techniques for annotating very large linguistic corpora • To develop, test and automate techniques for analyzing annotated corpora • To show the plausibility of ultimately annotating 100 million words

  23. GOALS OF THE PENN TREEBANK - 2 THE CORPUS SHOULD BE LARGE ENOUGH TO CAPTURE AND ACCURATELY REFLECT THE DISTRIBUTION OF ALL MAJOR GRAMMATICAL PHENOMENA. THE ANNOTATION OF THE CORPUS SHOULD • Reflect deeply felt intuitions of native speakers • Allow annotators to say all and only what they are sure of • Remain neutral between a wide range of linguistic theories -- The annotation must encode the common, shared pretheoretic descriptive grammatical account of working syntacticians.

  24. The Challenge of Treebank I • IBM Group under Jelinek wants quantity, quantity, quantity • “There’s no data like more data” – Bob Mercer • IBM Question: Can the Penn Group, or any group in U.S. NLP tradition, produce useful data quickly without elaborate theoretical ponderings? • Our strategy: • Do the best we can quickly. • Get data into the hands of users • Hope that it’s useful enough to support revolutionary change • Hope that the shortcomings become clear to everyone

  25. SIMPLIFYING A TAG SET RATIONALE BEHIND BRITISH AND EUROPEAN EFFORTS: • To provide "distinct codings for all classes of words having distinct grammatical behaviour" - Garside et al. 1987 <Examples left out> WHY A SMALL TAG SET? • Recoverability - Many category distinctions are recoverable from context given either: -- the identity of the lexical item -- the particulars of syntactic context • Consistency - Some tags are used inconsistently by taggers, for reasons we believe are fundamental.

  26. PART OF SPEECH TAGGING SYSTEMS TAG SETS VARY ENORMOUSLY IN COMPLEXITY • Brown Corpus: 87 simple tags, 179 simple and complex tags • LOB Corpus: 132/136 tags • Lancaster Tag Set: 166 tags • Lund Tag Set for London-Lund Corpus: 197 tags • Penn Treebank: 47 tags. BUT: Brown uses complex tags, e.g. can’t/MD*, he’s/PPS+BEZ, vs. Penn (Lancaster): ca/MD n’t/RB, he/PPS ’s/BEZ
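
As a concrete illustration of the Penn conventions mentioned above (this example is not from the talk): NLTK's default tokenizer and tagger follow the Penn Treebank scheme, so contractions are split and tagged with the 47-tag set. The sketch assumes the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded.

    # Illustrative only: Penn-style tokenization and tagging via NLTK.
    # Assumes: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
    import nltk

    tokens = nltk.word_tokenize("He said he can't come.")
    print(nltk.pos_tag(tokens))
    # Expected (roughly): [('He', 'PRP'), ('said', 'VBD'), ('he', 'PRP'),
    #                      ('ca', 'MD'), ("n't", 'RB'), ('come', 'VB'), ('.', '.')]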

  27. [figure]

  28. SYNTACTIC RECOVERABILITY Tutorial from 1992 PREPOSITIONS VS. SUBORDINATING CONJUNCTIONS "Since the last meeting, things have changed plenty" "Since we first heard about stochastic methods, things have changed plenty." WE TAG BOTH AS IN. SUBJECT VS. OBJECT PRONOUNS RECOVERABLE FROM POSITION IN TREE TO AS PREPOSITION VS. TO AS AUXILIARY CAN BE RECOVERED BY POSITION IN PARSE TREE BIG MISTAKE - THE PARSER NEEDS THIS INFORMATION.

  29. [figure]

  30. SKELETAL BRACKETING - Automated stage • Input: Output of POS tagging phase • Automatic mapping of Penn Treebank POS tags to POS tags of Don Hindle's FIDDITCH • Automatic assignment of syntactic bracketing by FIDDITCH - FIDDITCH provides a single analysis for any given sentence - If uncertain, FIDDITCH is able to provide only a partial structure - FIDDITCH "chunks" are fairly accurate • Output: Initial bracketing

  31. [figure]

  32. SOME FORMS OF LINGUISTIC STRUCTURE • PARTS OF SPEECH: Brown Corpus • NOUN PHRASES UP TO HEAD: Easy • CLAUSE BOUNDARIES: A few robust parsers for "unconstrained text", with errors. By hand, not too hard. • PREPOSITIONAL PHRASES: Not too hard • GAP LOCATIONS: ?? • PP ATTACHMENT: 10% undecidable by humans on real texts! • ADJUNCTS VS ARGUMENTS: Subtle issues... • PREDICATE/ARGUMENT STRUCTURE: ?? • COREFERENCE OF NPS AND GAPS: ?? • COREFERENCE OF ALL NPS: ?? • SCOPE OF CONJUNCTIONS: ?? Q: WHICH OF THESE CAN BE DONE EITHER AUTOMATICALLY OR SEMI-AUTOMATICALLY?

  33. THE TREEBANK II PROJECT

  34. Treebank II CL article ‘94 THE PENN TREEBANK: A REVISED CORPUS DESIGN FOR EXTRACTING PREDICATE ARGUMENT STRUCTURE Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Mark Ferguson, Karen Katz, Britta Schasberger, Ann Bies Department of Computer and Information Science University of Pennsylvania Philadelphia, PA, USA

  35. Treebank II: Enabling Predicate/Argument Extraction Talk from 1994 Desired: A level of representation which resembles LFG f-structure in information content, allowing easy automatic determination of: • The main predicate • The logical subject • The logical object • Easily distinguished semantic arguments and semantic adjuncts To do this, the annotation scheme must: • Incorporate a consistent treatment of related grammatical phenomena. • Provide co-indexed null elements in "underlying" syntactic positions for wh-movement, passives, etc. • Provide a notation to allow the recovery of the structure of discontinuous constituents. • Provide a labeling scheme for semantic arguments and adjuncts, where these are clear.

  36. CONSISTENT GRAMMATICAL ANALYSES Talk from 1994 Problem: Predicates and predications are inconsistently annotated in current scheme. Solution: Every S maps into a single predication. The predicate is either • The head of the lowest VP • Immediately under copular BE. • Tagged -PRD (PReDicate). (S (NP-SBJ I) (VP consider (S (NP-SBJ Kris) (NP-PRD a fool)))) (SQ Was (NP-SBJ he) (ADVP-TMP ever) (ADJP-PRD successful) ?) • Underlying structure: consider(I, fool(Kris)) • Notes: • -SBJ marks the SuBJect. • -TMP marks time (TeMPoral) phrases.
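
The bracketed examples above can be read back mechanically. The following is a minimal sketch (not Treebank tooling from the talk), assuming Python with NLTK, that pulls the -SBJ and -PRD constituents out of the first example.

    # Illustrative only: parse the bracketing from the slide and list the
    # constituents carrying the -SBJ and -PRD functional tags.
    from nltk import Tree

    s = Tree.fromstring(
        "(S (NP-SBJ I) (VP consider (S (NP-SBJ Kris) (NP-PRD a fool))))")

    for st in s.subtrees():
        if st.label().endswith("-SBJ"):
            print("subject:  ", " ".join(st.leaves()))
        elif st.label().endswith("-PRD"):
            print("predicate:", " ".join(st.leaves()))
    # subject:   I
    # subject:   Kris
    # predicate: a fool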

  37. NULL ELEMENTS Talk from 1994 • Two null elements: - *T* - marks WH-movement and topicalization. - * - marks everything else. • Null elements are co-indexed using integer tags. • Predicate argument structure is recovered by replacing the null element with the lexical material it is co-indexed with. (SBARQ (WHNP-1 What) (SQ is (NP-SBJ Tim) (VP eating (NP *T*-1))) ?) • Predicate Argument Structure: eat(Tim, what) • Notes: • SBARQ marks WH-questions. • SQ marks auxiliary inversion. • WHNP, WHADVP, WHPP, ... mark fronted WH-moved elements. • WHxx always leaves a co-indexed trace.
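
The recovery step described on this slide (replace a co-indexed null element with its filler) can be sketched in a few lines. This is an illustrative toy, not the project's actual extraction code, and it handles only the *T*-n traces shown in the example.

    # Illustrative only: resolve the *T*-1 trace in the slide's example back
    # to its co-indexed WH filler, recovering eat(Tim, what).
    from nltk import Tree

    q = Tree.fromstring(
        "(SBARQ (WHNP-1 What) (SQ is (NP-SBJ Tim) (VP eating (NP *T*-1))) ?)")

    # Fillers are constituents whose label ends in an integer index, e.g. WHNP-1.
    fillers = {}
    for st in q.subtrees():
        parts = st.label().rsplit("-", 1)
        if len(parts) == 2 and parts[1].isdigit():
            fillers[parts[1]] = " ".join(st.leaves())

    def resolve(tree):
        """Return the leaves of `tree`, substituting each *T*-n trace
        with the words of its co-indexed filler."""
        out = []
        for leaf in tree.leaves():
            if leaf.startswith("*T*-"):
                out.append(fillers.get(leaf.rsplit("-", 1)[1], leaf))
            else:
                out.append(leaf)
        return out

    vp = q[1][2]          # the VP "eating *T*-1"
    print(resolve(vp))    # ['eating', 'What']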

  38. THE FUNCTIONAL TAG SET Talk from 1994 [table of functional tags]

  39. [figure]

  40. CONCLUSIONS Talk from 1994 • Annotation now underway. • Annotators’ workstation minimizes effort per annotation. • Expected productivity: 1500 words per hour per annotator • First goal: Reannotate • the 1 million word Brown corpus • 1 million words of the Wall Street Journal corpus
