
Web Corpora

Adam Kilgarriff. Web Corpora. You can’t help noticing: replaceable or replacable? http://googlefight.com. Very, very large: 2006 estimates for the duplicate-free, linguistic, Google-indexed web. German: 44 billion words. Italian: 25 billion words. English: 1,000 billion to 10,000 billion words.


Presentation Transcript


  1. Kilgarriff: Web corpora Adam Kilgarriff Web Corpora

  2. Kilgarriff: Web corpora You can’t help noticing • Replaceable or replacable? • http://googlefight.com

  3. Kilgarriff: Web corpora • Very, very large • 2006 estimates for duplicate-free, linguistic, Google-indexed web • German: 44 billion words • Italian: 25 billion words • English: 1,000 billion to 10,000 billion words • Most languages • Most language types • Up-to-date • Free • Instant access

  4. Kilgarriff: Web corpora Overview • Is the web a corpus? • Representativeness • What is out there? • Web1T • Googleology • Web corpus types • Targeted sites: Oxford English Corpus • General: WaC family • WebBootCaT

  5. Kilgarriff: Web corpora Is the web a corpus? • Sinclair • in “Developing linguistic corpora, a guide to good practice. Corpus and Text – Basic Principles”: “…not a corpus because • dimensions unknown, constantly changing • not designed from a linguistic perspective” • But • We can find out dimensions • Many corpora are not designed • “as much chatroom dialogue as I can get” • Def: a corpus is a collection of texts • when viewed as an object of language research

  6. Kilgarriff: Web corpora Is the web a corpus? Yes

  7. Kilgarriff: Web corpora but it’s not representative

  8. Kilgarriff: Web corpora Theory A random sample of a population is representative of it. Observations on sample support inferences about population (within confidence bounds)
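As an illustration of "inferences within confidence bounds", here is a minimal sketch that estimates a population proportion from a random sample using the normal approximation. The function name and the sample figures are hypothetical and do not come from the slides.

```python
import math

def proportion_confidence_interval(hits, sample_size, z=1.96):
    """Normal-approximation interval for a population proportion.

    z=1.96 corresponds to a 95% confidence level.
    """
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical sample: 120 of 1,000 randomly sampled documents
# belong to some text type of interest.
low, high = proportion_confidence_interval(120, 1000)
print(f"{low:.3f} to {high:.3f}")
```

The interval is only meaningful if the sample really is random with respect to a defined population, which is exactly the condition the following slides call into question.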

  9. Kilgarriff: Web corpora Theory A random sample of a population is … • What is the population? • production and reception • speech and text • copying

  10. Kilgarriff: Web corpora Theory • Population not defined • Representative sample not possible

  11. Kilgarriff: Web corpora sublanguage • Language = core + sublanguages • Options for corpus construction • none • some • all • None • impoverished view of language • Some: BNC • cake recipes and gastro-uterine disease • not car repair manuals or astronomy or … • All: until recently, not viable

  12. Kilgarriff: Web corpora Representativeness • The web is not representative • but nor is anything else • Text type variation • under-researched, lacking in theory • Atkins, Clear and Ostler 1993 on design brief for BNC; Biber 1988; Kilgarriff 2001 • Text type is an issue across NLP • Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there

  13. Kilgarriff: Web corpora What is out there? • What text types are there on the web? • some are new: chatroom • proportions • is it overwhelmed by porn? How much? • Hard question

  14. Kilgarriff: Web corpora • The web • a social, cultural, political phenomenon • new, little understood • a legitimate object of science • mostly language • we are well placed • a lot of people will be interested • Let’s • study the web • source of language data • apply our tools for web use (dictionaries, MT) • use the web as infrastructure

  15. Kilgarriff: Web corpora Using Search Engines No setup costs Start querying today Methods • Hit counts • ‘snippets’ • Metasearch engines, WebCorp • Find pages and download

  16. Kilgarriff: Web corpora Googleology • Google hit counts for language modelling • Example: (Keller & Lapata 2003) • 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista • Very interesting work • Great interest in query syntax

  17. Kilgarriff: Web corpora The Trouble with Google • not enough instances • max 1000 • not enough queries • max 1000 per day with API • not enough context • 10-word snippet around search term • sort order • search term in titles and headings • untrustworthy hit counts • limited search options • linguistically dumb, eg not lemmatised • aime/aimer/aimes/aimons/aimez/aiment …

  18. Kilgarriff: Web corpora • Appeal • Zero-cost entry, just start googling • Reality • High-quality work: high-cost methodology

  19. Kilgarriff: Web corpora Also: • No replicability • Methods, stats not published • At mercy of commercial corporation • Googleology is bad science

  20. Kilgarriff: Web corpora Better: web-sourced corpora • Gather pages • Google hits • Select and gather whole sites • General crawl • Filter • De-duplicate • Linguistic processing • Load into corpus tool

  21. Kilgarriff: Web corpora Oxford English Corpus • Whole domains chosen and harvested • control over text type • 2.3 billion words

  22. Kilgarriff: Web corpora Oxford English Corpus

  23. Kilgarriff: Web corpora WaC family • 1.5 B words each • Baroni and colleagues • Seeds: • mid-frequency words from ‘core vocab’ lists and corpora • Google on seed words, then crawl

  24. Kilgarriff: Web corpora TenTen Family • Processing chain • Spiderling, a linguistic crawler • A billion words a day • jusText for “cleaning”: removing non-text • Onion – remove duplicates (paragraph level) • All major world languages • 2-20 billion words • Lexical Computing • All available in Sketch Engine
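To make the idea of paragraph-level duplicate removal concrete, the sketch below drops any paragraph whose normalised text has already been seen anywhere in the collection. It is an assumption-laden simplification: it catches exact repeats only (after lowercasing and whitespace normalisation) and is not a description of how Onion itself works. All function names are hypothetical.

```python
import hashlib

def normalise(paragraph):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(paragraph.lower().split())

def deduplicate_paragraphs(documents):
    """Keep only the first occurrence of each paragraph across all documents.

    `documents` is a list of documents, each a list of paragraph strings.
    """
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for para in doc:
            key = hashlib.sha1(normalise(para).encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(para)
        result.append(kept)
    return result

# Hypothetical pages sharing a boilerplate paragraph.
pages = [
    ["Hello world.", "Cookie notice"],
    ["cookie  NOTICE", "New text."],
]
print(deduplicate_paragraphs(pages))
```

Storing hashes rather than full paragraphs keeps memory use modest on large crawls; near-duplicates would need a fuzzier comparison than this sketch provides.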

  25. Kilgarriff: Web corpora Small, specialised corpora • Terminologists • Translators needing target-language domain-specific vocab • Specialist dictionaries • Don’t exist • Expensive/inaccessible • Out of date

  26. Kilgarriff: Web corpora BootCaT (Bootstrapping Corpora and Terms) • Put in seed terms • Google/Yahoo search • Retrieve Google/Yahoo hits • Remove duplicates, boilerplate • Small instant corpora • Baroni and Bernardini, LREC 2004 • Web version • WebBootCaT • At Sketch Engine site
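The first step above, turning seed terms into search queries, might look like the following sketch. The slide does not say how seeds are combined, so the use of random tuples of seed terms is an assumption here, and the function name, tuple size, and seed words are all hypothetical.

```python
import itertools
import random

def seed_queries(seeds, tuple_size=3, n_queries=5, seed=0):
    """Build search queries from random combinations of seed terms.

    A fixed random seed keeps the selection reproducible.
    """
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng = random.Random(seed)
    chosen = rng.sample(combos, min(n_queries, len(combos)))
    return [" ".join(combo) for combo in chosen]

# Hypothetical seed terms for a small domain-specific corpus.
terms = ["corpus", "lemma", "collocation", "concordance", "tagger", "token"]
for query in seed_queries(terms):
    print(query)
```

Each query would then be sent to a search engine and the hit pages downloaded, cleaned, and de-duplicated, as the slide lists.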

  27. Kilgarriff: Web corpora But did I make a good corpus?

  28. Kilgarriff: Web corpora Bad Science • Ben Goldacre

  29. Kilgarriff: Web corpora Bad Science • Ben Goldacre • Biases in samples • A quarter of the people who tested positive had just been on holiday in Mexico • But the research team didn’t notice

  30. Kilgarriff: Web corpora Bad linguistics • Our corpus study shows X • But what was in the corpus?

  31. Kilgarriff: Web corpora Bad linguistics • Our corpus study shows X • But what was in the corpus? • Moral: • Get to know your corpus

  32. Kilgarriff: Web corpora How? • Read it? • Too big to read • Not designed to be read

  33. Kilgarriff: Web corpora How? • Compare it with other(s) • Keyword lists

  34. Kilgarriff: Web corpora UKWaC vs. enTenTen12

  35. Kilgarriff: Web corpora enTenTen vs. UKWaC • enTenTen keywords: accord actually amendment among bad because behavior believe bill blog ca center citizen color defense determine do dollar earth effort election even evil fact faculty favor favorite federal foreign forth guess guy he her him himself his honor human kid kill kind know labor law let liberal like man maybe me military movie my nation never nor not nothing official oh organization percent political post president pretty professor program realize recognize say shall she sin soul speak state suppose tell terrorist that thing think thou thy toward true truth unto upon violation vote voter war what while why woman yes • UKWaC keywords: accommodation achieve advice aim area assessment available band behaviour building centre charity click client club colour consultation contact council delivery detail develop development disabled email enable enquiry ensure event excellent facility favourite full further garden guidance guide holiday improve information insurance join link local main manage management match mm nd offer opportunity organisation organise page partnership please pm poker pp programme project pub pupil quality range rd realise recognise road route scheme sector service shop site skill specialist st staff stage suitable telephone th top tour training transport uk undertake venue village visit visitor website welcome whilst wide workshop www

  36. Kilgarriff: Web corpora enTenTen vs. UKWaC (keyword lists repeated from slide 35)

  37. Kilgarriff: Web corpora enTenTen vs. UKWaC (keyword lists repeated from slide 35)

  38. Kilgarriff: Web corpora enTenTen vs. UKWaC (keyword lists repeated from slide 35)

  39. Kilgarriff: Web corpora enTenTen vs. UKWaC • Core verbs • be determine do guess know let say shall suppose tell think • Pronouns • he her him his me my she • Biber: more informal

  40. Kilgarriff: Web corpora Judgements • Not all or nothing • Both have (lots of) AmE and BrE • Observing patterns • Not right or wrong • Where does ‘believe’ belong? • Bible or core verbs? • No right answer, could be both • The better you know the data, the better you understand why words are there

  41. Kilgarriff: Web corpora The maths “this word is twice as common here as there” • Simplest approach • Normalise frequencies • Per thousand, or per million • Take ratio • For examples • Assume two 1m-word corpora • Normalisation not needed • fc = focus corpus • rc = reference corpus
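A minimal sketch of the simplest approach, for corpora of different sizes: normalise each count to a per-million frequency, then take the ratio. The function names and the counts are hypothetical.

```python
def per_million(count, corpus_size):
    """Normalised frequency: occurrences per million words."""
    return count * 1_000_000 / corpus_size

def frequency_ratio(fc_count, fc_size, rc_count, rc_size):
    """Ratio of normalised frequencies, focus corpus over reference corpus."""
    return per_million(fc_count, fc_size) / per_million(rc_count, rc_size)

# Hypothetical counts: 300 hits in a 2m-word focus corpus (150 per million)
# and 250 hits in a 5m-word reference corpus (50 per million).
ratio = frequency_ratio(300, 2_000_000, 250, 5_000_000)
print(ratio)  # 3.0: three times as common in the focus corpus
```

As written, the ratio is undefined when the word does not occur in the reference corpus at all.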

  42. Kilgarriff: Web corpora Problem 1: You can’t divide by zero • Standard solution: add one • Problem solved

  43. Kilgarriff: Web corpora Problem 2: High ratios more common, less interesting for rarer words • ratio is not enough: frequency matters too Also • some researchers: grammar, grammar words • some researchers: lexis, content words No right answer Slider?

  44. Kilgarriff: Web corpora Solution • Don’t just add 1, add n: • n=1 • n=100

  45. Kilgarriff: Web corpora • n=1000
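The effect of the parameter n can be sketched as follows, assuming n is added to both the focus-corpus and the reference-corpus frequency before taking the ratio. The three words and their per-million frequencies are invented for illustration; they are not the figures shown on the slides.

```python
def keyword_score(fc_per_million, rc_per_million, n=1.0):
    """Add-n ratio: n avoids division by zero and damps scores for rare words."""
    return (fc_per_million + n) / (rc_per_million + n)

def rank_keywords(frequencies, n):
    """Sort words by add-n score, highest first."""
    return sorted(
        frequencies,
        key=lambda word: keyword_score(*frequencies[word], n=n),
        reverse=True,
    )

# Hypothetical (focus, reference) frequencies per million words.
frequencies = {
    "rare": (10, 0),
    "mid": (400, 100),
    "common": (30000, 15000),
}

for n in (1, 100, 1000):
    print(n, rank_keywords(frequencies, n))
# 1    ['rare', 'mid', 'common']
# 100  ['mid', 'common', 'rare']
# 1000 ['common', 'mid', 'rare']
```

A small n pushes rare words to the top of the list, while a large n favours higher-frequency words, so n behaves like the slider mentioned earlier.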

  46. Kilgarriff: Web corpora Summary

  47. Kilgarriff: Web corpora But what about • Mutual information • Log-likelihood • Chi-square • Fisher’s test • … • Don’t they use cleverer maths?

  48. Kilgarriff: Web corpora Yes but • Clever maths is for hypothesis testing • Can you defeat null hypothesis? • Language is not random, so • … you always can • Null hypothesis never true • Hypothesis-testing not informative • Clever maths irrelevant • Kilgarriff 2006, CLLT
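The claim that the null hypothesis can always be defeated given enough data can be illustrated with a plain chi-square test on a 2x2 table (word vs. all other words, focus vs. reference corpus). This is a sketch with invented counts: the same 5% relative difference in frequency is nowhere near significant in two 1m-word corpora but comfortably significant in two 1,000m-word corpora.

```python
def chi_square_2x2(fc_count, fc_size, rc_count, rc_size):
    """Pearson chi-square statistic for a word's counts in two corpora."""
    observed = [
        [fc_count, fc_size - fc_count],
        [rc_count, rc_size - rc_count],
    ]
    total = fc_size + rc_size
    row_totals = [fc_size, rc_size]
    col_totals = [fc_count + rc_count, total - fc_count - rc_count]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

CRITICAL_95 = 3.841  # chi-square critical value, 1 degree of freedom, p = 0.05

# Hypothetical counts: 105 vs 100 per million words at two corpus sizes.
small = chi_square_2x2(105, 1_000_000, 100, 1_000_000)
large = chi_square_2x2(105_000, 1_000_000_000, 100_000, 1_000_000_000)
print(small > CRITICAL_95, large > CRITICAL_95)  # False True
```

The difference between the two corpora is the same in both cases; only the amount of data has changed, which is why a significance test on its own says little about which words are interesting.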

  49. Kilgarriff: Web corpora Varying the parameter • BAWE • British Academic Written English • Nesi and Thompson 2008 • Student essays • Arts/Humanities, Social Sciences, Life Sciences, Physical Sciences • fc: ArtsHum, rc: SocSci • With n=10 and n=1000

  50. Kilgarriff: Web corpora
