
Do we still need corpora (now that we have the Web)?

Silvia Bernardini, University of Bologna, Italy (silvia.bernardini@unibo.it). Postgraduate Conference in Corpus Linguistics, 22 May 2008.



  1. Do we still need corpora (now that we have the Web)? Silvia Bernardini University of Bologna, Italy silvia.bernardini@unibo.it Postgraduate Conference in Corpus linguistics 22 May 2008

  2. The corpus • A collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis. (Francis 1992(1982):17) • A collection of naturally-occurring language text, chosen to characterize a state or variety of a language. (Sinclair 1991:171) • A closed set of texts in machine-readable form established for general or specific purposes by previously defined criteria. (Engwall 1992:167) • Finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration. (McEnery and Wilson 1996:23) • A collection of (1) machine-readable (2) authentic texts […] which is (3) sampled to be (4) representative of a particular language or language variety. (McEnery et al. 2006:5)

  3. The Web • A mine of language data of unprecedented richness (Lüdeling et al 2007) • A fabulous linguists’ playground (Kilgarriff and Grefenstette 2003) • [a] cheerful anarchy (Sinclair 2004) • A helluva lot of text, stored on computers… (Leech 1992:106)

  4. Is the Web a corpus? Yes! The definition of corpus should be broad. We define a corpus simply as “a collection of texts”. If that seems too broad, the one qualification we allow relates to the domains and contexts in which the word is used […]: A corpus is a collection of texts when considered as an object of language or literary study. The answer to the question “Is the web a corpus?” is yes. Kilgarriff and Grefenstette (2003:334)

  5. Is the Web a corpus? No! The cheerful anarchy of the Web thus places a burden of care on a user, and slows down the process of corpus building. The organisation and discipline has to be put in by the corpus builder. […] users of a corpus assume that there is a consistency of selection, processing and management of the texts in the corpus. Corpora should be designed and constructed exclusively on external criteria. (Sinclair 2005)

  6. This talk • The Web and the corpus • Disambiguating the WaC acronym • Where the Web wins out • Where the corpus holds its ground • Web as Corpus initiatives @ Forlì • The BootCaT way • The WaCky! way • Open issues and ways forward

  7. Web as Corpus? • (The Web corpus “proper”) • The Web as a corpus surrogate • The Web as a corpus supermarket • The mega-corpus (or mini-Web)

  8. The Web as a corpus surrogate • Googleology… • e.g.: Keller and Lapata (2003) • Predicate-argument bigrams • adj-noun, noun-noun, verb-noun • not attested in the BNC “Web counts correlate reliably with [human plausibility] judgments, for all three types of predicate-argument bigrams tested, both seen and unseen. For the seen bigrams, […] the Web frequencies correlate better with judged plausibility than corpus frequencies” (ibid: 481). • … is bad science “Working with commercial search engines makes us develop workarounds. We become experts in the syntax and constraints of Google, Yahoo!, Altavista, and so on. We become ‘googleologists’” (Kilgarriff 2007:147)

  9. Google… • Unreplicable • Véronis (2005): 5 billion "the" have disappeared overnight • Kilgarriff (2007:148): “queries are sent to different computers, at different points in the update cycle, and with different data in their caches” • Uncontrollable • Asterisk treated as placeholder for 1 word or more than 1 word • Punctuation and capitalisation disregarded (even in phrases) • Search hits are per page • Ranking criteria and result sorting (popularity, geographic relevance, …) • Linguistically naïve • No morphosyntactic annotation • 36 queries to extract fulfill + obligation (Keller and Lapata 2003) • Impossible to extract fulfill + NOUN • Unsophisticated query language • No sub-string matching • No span options

  10. SE post-processors? • e.g. WebCorp, KWiCFinder • Wildcards and tamecards • Concordance output • Collocation • Not a solution, really • Slow • Same limits as SE

  11. The Web as a corpus supermarket • Selecting and downloading texts • General or specialized • Can be automatised (infra) • e.g. (general): • Leeds Internet corpora (Sharoff 2006) • English, Chinese, Finnish, French, German, Italian, Japanese • Lemmatised and pos-tagged • Indexed with the CWB and searchable online (CQP) • Fletcher’s WaC (Fletcher 2007) • ~500M words of English (AU, CA, GB, IE, NZ, US) • will be pos-tagged

  12. Pros • “Traditional” corpus => • Replicable results • Control over corpus contents • In principle • Control over search methods • Linguistically sophisticated searches supported

  13. BUT… • Compromise between Web and corpus => • Relying on SE (Google, LiveSearch) • Size • Up-to-dateness • Understanding of corpus contents/structure • Variety of corpus contents • Noise

  14. The mega-corpus/miniweb • Baroni (2007): Effort spent by the NLP community in developing Google-skills would be better spent building our own Google-sized corpora • None available so far, but: • WebCorp (Renouf et al. 2007) • The WaCky! effort (infra) • Ultimate objective: build a linguist’s search engine for the Web

  15. Where the Web wins out • Up-to-dateness • Size • Convenience • Cost • Ease of collection • Under-resourced languages • Web-specific genres • Reference purposes

  16. Where the corpus holds its ground • Selection on external criteria • Cf.: a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair 2005) • Register/genre control • Representativeness and documentation • Pre- or non-Web genres

  17. e.g.: McEnery et al 2007 • Collocation information for learners’ dictionaries • “Help”: Full or bare infinitive? • Varieties of English, language change, syntactic environment • Acquisition of grammatical morphemes • Learner language • Swearing in modern British English • writing vs. speaking • sociolinguistic variables • Conversation vs. formal speech in AmEng • Aspect marking in English-Chinese translation • Parallel corpora • Cf. Resnik and Smith (2003)

  18. Two approaches to the Web as corpus • The BootCaT way • Select initial seeds (terms) • Query SE for random seed combinations • Retrieve pages and format as text (corpus) • Extract new seeds via corpus comparison • Iterate • Designed for translation students • Also used for reference corpus building • Leeds Internet Corpora
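The "extract new seeds via corpus comparison" step can be illustrated with a minimal sketch. It assumes a smoothed log-ratio of relative frequencies as the comparison measure; the slide does not say which statistic BootCaT uses, and the function name `extract_seeds` and its parameters are hypothetical.

```python
from collections import Counter
import math


def extract_seeds(target_tokens, reference_tokens, top_n=10, min_count=2):
    """Rank words of the target corpus by how much more frequent they are
    there than in a reference corpus (assumed measure: smoothed log-ratio)."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    n_target = sum(target.values())
    n_reference = sum(reference.values())

    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # ignore words too rare to be reliable seeds
        # Add-one smoothing keeps the ratio finite for words that never
        # occur in the reference corpus.
        p_target = (count + 1) / (n_target + 1)
        p_reference = (reference[word] + 1) / (n_reference + 1)
        scores[word] = math.log(p_target / p_reference)

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]
```

The top-ranked words would then be fed back as seeds for the next round of queries, which is what makes the procedure iterative.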

  19. BootCaT pros… • Implemented in Perl as a set of simple command-line scripts • Freely available (http://sslmit.unibo.it/~baroni/bootcat.html) • Documented • Integrated into the Sketch Engine pipeline • Community effort • WebBootCaT • JBootCaT

  20. An example: wine tasting. Automatic query generation • Seeds: acetic acid acidity aftertaste aged alcohol appley aroma ascescence astringent … • Generated queries: • wine rich unfiltered attractive • wine stylish "malolactic fermentation" sour • wine meager harsh spritzy • wine dumb tobacco direct • wine watery grapey tears • wine hazy breed nouveau • wine spicy flat body • wine vinous spritzy unfined • wine fleshy cigarbox easy • wine puckery sharp nutty …
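Automatic query generation of this kind can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the BootCaT scripts themselves: the function name `generate_queries`, the fixed anchor word, and the tuple size of three are inferred from the example queries on the slide.

```python
import random


def generate_queries(seeds, n_queries, tuple_size=3, anchor=None, rng=None):
    """Build search-engine queries from random combinations of seed terms."""
    rng = rng or random.Random()
    seen = set()      # combinations already used, regardless of term order
    queries = []
    attempts = 0
    max_attempts = n_queries * 100  # guard against too few possible tuples

    while len(queries) < n_queries and attempts < max_attempts:
        attempts += 1
        combo = rng.sample(seeds, tuple_size)  # distinct seeds, random order
        key = frozenset(combo)
        if key in seen:
            continue
        seen.add(key)
        # Multi-word seeds are quoted so the engine treats them as phrases.
        terms = [f'"{t}"' if " " in t else t for t in combo]
        if anchor:
            terms.insert(0, anchor)
        queries.append(" ".join(terms))
    return queries
```

For instance, `generate_queries(["rich", "sour", "harsh", "malolactic fermentation"], 3, anchor="wine")` returns three queries of the form `wine <seed> <seed> <seed>`; which seeds appear depends on the random generator.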

  21. [Table: “vanilla” collocates (span=1R) in the BootCaT wine tasting corpus (English, 1.5M words) and in the BNC]

  22. …and BootCaT cons • Relies on SE=> same limits (cf. supra) • …and Google no longer gives out API keys • Not really an option for very large corpus building projects

  23. A more ambitious alternative: the WaCky! way • Aim: produce very large (~2bn words) web-derived corpora for several languages • Collaborative effort, using existing open tools, making developed tools publicly available • http://wacky.sslmit.unibo.it/ • WaCky corpora currently available: • deWaC, itWaC, ukWaC, frWaC

  24. The Wacky pipeline • Submit random word combinations to Google and obtain list of URLs (seeding) • Crawling (Heritrix) • Code removal and boilerplate stripping • Language filtering • Near-duplicate detection • Tokenization, POS-tagging and lemmatisation • Indexing and querying
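As an illustration of the near-duplicate detection step, here is a minimal sketch based on word n-gram "shingles" and Jaccard overlap. This is one common technique, assumed here for illustration; the slide does not specify the method or thresholds used in the WaCky pipeline, and all names and default values below are hypothetical.

```python
def shingles(text, n=5):
    """Return the set of word n-grams (shingles) of a text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, n=5, threshold=0.5):
    """Keep a document only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        current = shingles(doc, n)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of an earlier document
        kept.append(doc)
        kept_shingles.append(current)
    return kept
```

The pairwise comparison is quadratic in the number of documents, so a crawl of millions of pages would need hashing or sampling of shingles rather than this direct comparison.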

  25. An example: constructing ukWaC • Seeding: mid-frequency content words (BNC); words from spoken text (BNC); vocabulary list for foreign learners • Crawl limited to the .uk domain and HTML • Processing • Only files between 5 and 200 KB kept • Perfect duplicates discarded • Code, boilerplate, files with unconnected text and pornographic pages removed • Near-duplicates removed
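The first two processing steps (size filter and perfect-duplicate removal) can be sketched as follows. Several details are assumptions: 1 KB is taken as 1,024 bytes, the size bounds are inclusive, and the first copy of a duplicated file is kept, which the slide does not specify. The function name `filter_documents` is hypothetical.

```python
import hashlib

MIN_BYTES = 5 * 1024      # assumed lower bound: 5 KB
MAX_BYTES = 200 * 1024    # assumed upper bound: 200 KB


def filter_documents(pages):
    """Filter (url, raw_bytes) pairs by size and drop exact duplicates."""
    seen_digests = set()
    kept = []
    for url, raw in pages:
        if not (MIN_BYTES <= len(raw) <= MAX_BYTES):
            continue  # too small or too large
        digest = hashlib.sha1(raw).hexdigest()
        if digest in seen_digests:
            continue  # byte-identical to a page already kept
        seen_digests.add(digest)
        kept.append((url, raw))
    return kept
```

Hashing the raw bytes only catches byte-identical files; pages that differ in a timestamp or a banner fall to the separate near-duplicate step.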

  26. UkWaC: Details and size • 2,000 seed word pairs • 6,528 seed URLs • 351 GB raw crawl size • 19 GB after document filtering • 5.69 M of documents after filtering • 12 GB after near-duplicate cleaning • 2.69 M of documents after near-duplicate cleaning • 30 GB size with annotation • 1,914,150,197 tokens • 3,798,106 types • Further info and availability: http://wacky.sslmit.unibo.it/
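The figures above imply how aggressive each cleaning stage is. The short calculation below derives the retention rates from the slide's numbers; it treats the rounded values (for example 5.69 M documents) as exact, so the results are approximations.

```python
# Figures taken from the slide; rounded values are treated as exact.
raw_gb, filtered_gb, deduped_gb = 351, 19, 12
docs_filtered, docs_deduped = 5_690_000, 2_690_000
tokens, types = 1_914_150_197, 3_798_106

kept_after_filtering = filtered_gb / raw_gb      # share of raw crawl kept
kept_after_dedup_gb = deduped_gb / filtered_gb   # share of filtered data kept
kept_after_dedup_docs = docs_deduped / docs_filtered
tokens_per_doc = tokens / docs_deduped           # mean document length
tokens_per_type = tokens / types

print(f"kept after document filtering: {kept_after_filtering:.1%}")
print(f"kept after near-duplicate cleaning (GB): {kept_after_dedup_gb:.1%}")
print(f"kept after near-duplicate cleaning (docs): {kept_after_dedup_docs:.1%}")
print(f"mean tokens per document: {tokens_per_doc:.0f}")
print(f"tokens per type: {tokens_per_type:.0f}")
```

Document filtering discards the bulk of the raw crawl, and near-duplicate cleaning then removes roughly half of the remaining documents.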

  27. A wacky example: results for wacky+NOUN (>2), Baroni et al. submitted • BNC: 3 ideas, 2 roles, 2 photo, 2 items, 2 humour, 2 characters • UkWaC: 71 world, 44 ideas, 43 wigglers, 42 wiggler, 28 characters, 27 sense, 22 comedy, 21 stuff, 21 races, 20 things, 19 idea, 15 humour, 13 games, 12 race, 11 backy, 10 baccy, 10 fun, 10 game, 10 inventions, 10 names, 10 uses
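A query such as wacky+NOUN needs part-of-speech annotation, which is what a tagged corpus offers and a search engine does not. The sketch below counts the nouns that directly follow a given word in a list of (word, tag) pairs. The Penn-style tags `NN` and `NNS` and the function name are assumptions; the slide does not show the tagset or the query syntax actually used.

```python
from collections import Counter


def word_plus_noun_counts(tagged_tokens, word, noun_tags=("NN", "NNS"), min_count=1):
    """Count nouns immediately following `word` in a POS-tagged token list."""
    counts = Counter()
    # Walk over adjacent pairs: (current token, next token).
    for (current, _), (following, following_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if current.lower() == word and following_tag in noun_tags:
            counts[following.lower()] += 1
    # Most frequent first, keeping only nouns at or above the threshold.
    return [(noun, c) for noun, c in counts.most_common() if c >= min_count]
```

With a frequency threshold passed as `min_count`, the output has the same shape as the lists on the slide: nouns paired with their counts, in descending order.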

  28. WaC: What the future holds • Has WaC replaced “traditional” corpora? • Not really… • Challenges • Cleaning techniques • Web-tuned annotation tools • Indexing and querying systems • (Automatic) text classification

  29. Approaches to Web text classification • Biber and Kurjan (2007) • Search engine categories not well defined for purposes of linguistic analysis • Google directory • Multidimensional analysis • text type approach • Register approach • future work

  30. Approaches to Web text classification • Sharoff (forthcoming) • Genre typology based on EAGLES recommendations • “Communicative intentions” • Discussion, information, instruction, propaganda, recreation, regulations, reporting • SVMs to automatically categorise texts in Web corpus • Classifiers trained on manually-classified texts • BNC + subset of Web corpus

  31. WaC challenges • Representativeness Without representativeness, whatever is found to be true of a corpus, is simply true of that corpus – and cannot be extended to anything else (Leech 2007:135)

  32. WaC challenges • Documentation Compilers make the best corpus they can in the circumstances, and their proper stance is to be detailed and honest about the contents. From their description of the corpus, the research community can judge how far to trust their results, and future users of the same corpus can estimate its reliability for their purposes. (Sinclair 2005)

  33. Thank you Silvia Bernardini University of Bologna, Italy silvia.bernardini@unibo.it Postgraduate Conference in Corpus linguistics 22 May 2008
