How Useful Is the Web as a Linguistic Corpus?
William H. Fletcher, United States Naval Academy
2002 North American Symposium on Corpus Linguistics
American Association of Applied Corpus Linguistics
Indianapolis, IN, 1-3 November 2002
Making the Web More Useful as a Corpus
• Objective of this ongoing study: to develop and evaluate linguistic methods and PC tools to identify domain-relevant and linguistically representative documents more efficiently
• Long-range goal: to establish the Web both as a "corpus of first resort" and as a supplementary corpus for language professionals and learners
Advantages of Web
• Virtually comprehensive coverage of major languages and language varieties, content domains and written text types
• Ready availability and low cost throughout the developed world
• Freshness and topicality: emerging usage and current issues well documented
• Easy to compile an ad-hoc corpus to answer a specific question or meet a specialized information need
• User familiarity with the Web and independent motivation to become more proficient in using it
Disadvantages of Web
• Generally unknown provenance and authorship, reliability and authoritativeness of texts, both for content and linguistic form
• Predominance of certain text types among coherent texts, especially legal, journalistic, commercial and academic prose
• Overall lower standards of form and content verification than printed sources
• Systematically accessible only through commercial search engines, which support only very rough search criteria
• Counts of a given linguistic feature give only a general numeric indication, not statistical proof
“Noise” Filter for HRDs
• Highly Repetitive Documents
• Discussion groups where replies incorporate the original post
• Internal links
• Boilerplate
• Search-engine spam
• Strategy: identify documents with frequently repeated n-grams; 8-grams, 12-grams and 25-grams are a useful range (see the sketch below)
• Either eliminate the document or eliminate the redundant text
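A minimal sketch of the repeated-n-gram strategy, assuming documents have already been stripped of HTML and tokenized by whitespace. The n-gram length and the repetition threshold below are illustrative choices, not values taken from the study, and the function name is hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-grams (as tuples) from a token list."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def is_highly_repetitive(text, n=8, min_repeats=3):
    """Flag a document as an HRD if any single n-gram occurs min_repeats or more times."""
    tokens = text.lower().split()
    counts = Counter(ngrams(tokens, n))
    if not counts:
        return False
    return counts.most_common(1)[0][1] >= min_repeats
```

A flagged document could then either be dropped outright or trimmed by removing the spans that contain the repeated n-grams, as the slide suggests.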
“Noise” Filter for VIDs
• Virtually Identical Documents
• Mirrored documents with slight differences
• News stories
• Rank and absolute frequency of 3- to 5-grams alert to VIDs (see the sketch below)
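One way to operationalize the 3- to 5-gram signal is to compare the highest-ranking n-grams of two documents; heavy overlap suggests virtually identical content. This is a sketch under that assumption, with an illustrative n-gram length, list size and overlap threshold of my own choosing.

```python
from collections import Counter

def top_ngrams(text, n=4, k=50):
    """Return the k most frequent n-grams of a document as a set."""
    tokens = text.lower().split()
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {g for g, _ in grams.most_common(k)}

def likely_vid(text_a, text_b, n=4, k=50, threshold=0.8):
    """Flag two documents as virtually identical if their top n-gram sets overlap heavily."""
    a, b = top_ngrams(text_a, n, k), top_ngrams(text_b, n, k)
    if not a or not b:
        return False
    overlap = len(a & b) / min(len(a), len(b))
    return overlap >= threshold
```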
“Noise” Filter for IDs
• (Fully) Identical Documents
• Mirrored documents
• Multiple URLs for the same document
• Server-generated error messages
• A message-digest algorithm such as MD5 (Message Digest 5, 16-byte code) or SHA-1 (Secure Hash Algorithm, 20-byte code) reduces normalized text of any length to a short fixed-length code with high probability of uniqueness
• Digest codes from thousands of documents can be stored in a binary tree for efficient comparison and elimination of redundant documents (see the sketch below)
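A minimal sketch of digest-based deduplication. It normalizes whitespace and case before hashing, uses MD5 from Python's standard hashlib, and keeps the digests in a built-in set rather than the binary tree mentioned above; the normalization rules and function names are my own illustrative choices.

```python
import hashlib
import re

def text_fingerprint(text):
    """Normalize whitespace and case, then return the 16-byte MD5 digest."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).digest()

seen = set()  # plays the role of the binary tree of stored digests

def is_duplicate(text):
    """Return True if an identical (normalized) document has been seen before."""
    fp = text_fingerprint(text)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```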
Unproven “Noise” Filters?
• Microsoft Word spelling checker to recognize and normalize ill-formed documents automatically
• Some success; deserves further attention
• Problem: a large number of items (personal, commercial and place names, technological terms) are not in the default lexicon, so it rejects too many good documents
• Patterns of 1- and 2-grams to recognize PFDs (Primarily Fragmentary Documents); a sketch follows this list
• Some high-frequency types (articles, copula) are rare in fragments, others (common prepositions) frequent
• Content words and special terms (see above) relatively prominent
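A rough sketch of the 1-gram heuristic for spotting PFDs: if articles and copula forms are unusually scarce relative to document length, the page is likely a link list or other fragment collection. The word list and threshold here are illustrative stand-ins, not values from the study.

```python
# Hypothetical set of high-frequency function words that are rare in fragments
FUNCTION_WORDS = {"the", "a", "an", "is", "are", "was", "were", "be", "been"}

def looks_fragmentary(text, min_ratio=0.05):
    """Flag a document as primarily fragmentary if function words are scarce."""
    tokens = text.lower().split()
    if not tokens:
        return True
    ratio = sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)
    return ratio < min_ratio
```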
Size as A Priori Filter
• Webpages under 3 kB or over 150 kB have a lower “signal-to-noise” ratio
• In these extreme ranges documents consist of coherent text less frequently or to a lesser degree
• Shorter files tend to have a much lower ratio of text file size to HTML file size (49% vs. 64% overall)
• Rule of thumb: download and process only pages larger than 5 kB and smaller than 200 kB (size before stripping HTML tags); see the sketch below
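The rule of thumb reduces to a simple size check on the raw HTML before any downloading or tag stripping; the function name is hypothetical.

```python
def passes_size_filter(html_bytes, min_kb=5, max_kb=200):
    """Keep only pages whose raw HTML falls inside the 5-200 kB band."""
    size_kb = len(html_bytes) / 1024
    return min_kb <= size_kb <= max_kb
```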
My Web Corpus 1
• Compiled one afternoon in October 2001 via KWiCFinder searches on the 20 most frequent words in English
• Preliminary studies of 100 and 5859 webpages respectively revealed a strong bias towards commercial sites due to "paid positioning" on AltaVista; sites ranked highest for this reason were excluded from this study
• Initially consisted of 11,201 online documents (OLDs)
• Various "noise filters" were applied to make the results more useful
• 7294 documents survived automatic elimination of IDs and VIDs
• 256 HRDs were eliminated
• Remaining documents were viewed individually and classified as
• Primarily useful text
• "Noisy" text
• Primarily non-text (link lists, fragments, headers/footers predominated, ...)
My Web Corpus 2
• 4949 unique documents passed all automatic tests and human classification
• 5.25 million tokens in 35 MB of files
• Longer coherent texts from government, academic, legal, religious (Christian, Jewish, Muslim, Hindu), journalistic and commercial sources, plus many “hobbyist” pages on a wide range of topics
• Compared to the BNC as a standard reference corpus (see appendix with annotated comparison of n-gram frequencies); a sketch of such a comparison follows
• Generally quite comparable, but important differences:
• UK vs. US bias in institutions, place names, spelling
• BNC: bias toward third person, past tense, narrative style
• WC: bias toward first (especially we) and second person, present tense, interactive style
• Words referring to Internet concepts and information missing or rare in the BNC, highly prominent in the WC (and in contemporary English)
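A sketch of the kind of n-gram frequency comparison behind these observations, assuming both corpora are available as plain token lists (the BNC token list is assumed here, not supplied); per-million frequencies are compared for selected items, and all names are illustrative.

```python
from collections import Counter

def relative_freqs(tokens, n=1):
    """Per-million relative frequencies of n-grams in a token list."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    if total == 0:
        return {}
    return {g: c * 1_000_000 / total for g, c in grams.items()}

def compare_items(web_tokens, ref_tokens, items, n=1):
    """Return (web-corpus, reference-corpus) per-million frequencies for selected n-grams."""
    web, ref = relative_freqs(web_tokens, n), relative_freqs(ref_tokens, n)
    return {item: (web.get(item, 0.0), ref.get(item, 0.0)) for item in items}

# e.g. compare_items(wc_tokens, bnc_tokens, [("we",), ("internet",)]) would
# surface the first-person and Internet-vocabulary biases noted above.
```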