
Definition of a corpus

mkucharski

Presentation Transcript


  1. Definition of a corpus • Research on written or spoken texts can now be carried out with corpus linguistics. The notion of a corpus as the basis for a form of empirical linguistics differs from the examination of single texts in several fundamental ways. • In principle, any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text). But in modern linguistics the term "corpus" usually carries more specific connotations than this simple definition.

  2. Size • The term "corpus" also implies a body of text of finite size, for example, 1,000,000 words. • This is not universally so. For example, at Birmingham University, John Sinclair's COBUILD team have been engaged in the construction and analysis of a monitor corpus. This "collection of texts", as Sinclair's team prefer to call it, is an open-ended entity: texts are constantly being added to it, so it keeps growing. Monitor corpora are of interest to lexicographers, who can trawl a stream of new texts looking for the occurrence of new words or for changing meanings of old words.

  3. Helsinki Corpus • Text identifier • Name of text • Author's name • Sub-period • Date of original • Date of manuscript • Contemporaneity of original and manuscript • Dialect • Verse or prose • Text type • Relationship to foreign original • Language of foreign original • Relationship to spoken language • Sex of author • Age of author • Author's social status • Audience description • Participant relationship • Interactive/non-interactive • Formal/informal • Prototypical text category • Sample

  4. Corpora in language teaching • Resources and practices in the teaching of languages and linguistics tend to reflect the division between the empirical and rationalist approaches. Many textbooks contain only invented examples, and their descriptions are based upon intuition or second-hand accounts. Other books, however, are explicitly empirical and use examples and descriptions from corpora or other sources of real-life language data. Corpus examples are important in language learning as they expose students to the kinds of sentences that they will encounter when using the language in real-life situations.

  5. Frequency counts • This is the most straightforward approach to working with quantitative data. Items are classified according to a particular scheme and an arithmetical count is made of the number of items (or tokens) within the text which belong to each classification (or type) in the scheme. For instance, we might set up a classification scheme to look at the frequency of the four major parts of speech: noun, verb, adjective and adverb. These four classes would constitute our types. Another example involves the simple one-to-one mapping of form onto classification. In other words, we count the number of times each word appears in the corpus, resulting in a list which might look something like:
     abandon: 5
     abandoned: 3
     abandons: 2
     ability: 5
     able: 28
     about: 128
     etc.
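
The word-form count described above can be sketched in a few lines of Python. This is a minimal illustration rather than a standard tool: the helpers tokenize and frequency_list and the sample sentence are invented here, and the tokenizer simply lowercases the text and keeps runs of letters and apostrophes.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive scheme)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(text):
    """Count how many tokens belong to each word-form type."""
    return Counter(tokenize(text))


# Invented sample text, used only to exercise the functions.
sample = "She was able to abandon the plan. They abandoned it too, and he abandons his."
counts = frequency_list(sample)

# Print an alphabetical frequency list in the same "word: count" layout.
for word in sorted(counts):
    print(f"{word}: {counts[word]}")
```

Because each distinct word form is its own type here, abandon, abandoned and abandons are counted separately, as in the list above. Counting parts of speech instead would need a tagged corpus.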

  6. Proportions • Frequency counts are useful, but they have certain disadvantages when one wishes to compare one data set with another, for example a corpus of spoken language with a corpus of written language. Frequency counts simply give the number of occurrences of each type; they do not indicate the prevalence of a type as a proportion of the total number of tokens in the text. This is not a problem when the two corpora being compared are of the same size, but when they are of different sizes raw frequency counts are of little use.

  7. Proportions (cont.) • The following example compares two such corpora, looking at the frequency of the word boot:
     Type of corpus    Number of words    Number of instances of boot
     English Spoken    50,000             50
     English Written   500,000            500
     • A brief look at the table seems to show that boot is more frequent in written than in spoken English. However, if we calculate the frequency of occurrence of boot as a percentage of the total number of tokens in the corpus (the total size of the corpus) we get:
     spoken English: 50/50,000 × 100 = 0.1%
     written English: 500/500,000 × 100 = 0.1%
     • The two percentages are identical, so boot is in fact equally frequent in both corpora relative to their size.
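
The normalisation above is simple arithmetic, and the sketch below repeats it in Python with the figures from the table. The function name relative_frequency is an assumption introduced for illustration.

```python
def relative_frequency(count, corpus_size):
    """Return the number of occurrences as a percentage of all tokens in the corpus."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count / corpus_size * 100


# (instances of "boot", total number of words), as given in the table.
corpora = {
    "English Spoken": (50, 50_000),
    "English Written": (500, 500_000),
}

for name, (boot_count, size) in corpora.items():
    print(f"{name}: {relative_frequency(boot_count, size):.1f}%")
```

Both corpora come out at 0.1%, which is the point of the slide: the tenfold difference in raw counts only reflects the tenfold difference in corpus size.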

  8. Collocations • Collocation is an important idea in many areas of linguistics. Kjellmer (1991) has argued that our mental lexicon is made up not only of single words, but also of larger phraseological units, both fixed and more variable. Information about collocations is important for dictionary writing, natural language processing and language teaching. However, it is not easy to determine which co-occurrences are significant collocations, especially if one is not a native speaker of a language or language variety.

  9. Collocations 2 • Given a text corpus it is possible to determine empirically which pairs of words have a substantial amount of "glue" between them, by comparing the probability that two words occur together as a joint event (i.e. because they belong together) with the probability that their co-occurrence is simply the result of chance. • For example, the words riding and boots may occur as a joint event by reason of their belonging to the same multiword unit (riding boots), while the words formula and borrowed may simply co-occur because of a one-off juxtaposition and have no special relationship.
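
One common way to make this comparison concrete is pointwise mutual information, which takes the base-2 logarithm of the observed joint probability divided by the probability expected by chance. The slide does not name a formula, so the choice of measure is an assumption, as are the function name pmi_scores, the restriction to adjacent word pairs, and the invented toy text.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information for every adjacent word pair in tokens."""
    n_tokens = len(tokens)
    n_pairs = n_tokens - 1
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), joint in pair_counts.items():
        # Observed probability of the two words occurring together.
        p_joint = joint / n_pairs
        # Probability of the pair if the two words were independent.
        p_chance = (word_counts[w1] / n_tokens) * (word_counts[w2] / n_tokens)
        scores[(w1, w2)] = math.log2(p_joint / p_chance)
    return scores


# Invented toy text; a real study would use a large corpus.
toy = "she wore riding boots he wore riding boots the boots were old she likes riding".split()
scores = pmi_scores(toy)
print(round(scores[("riding", "boots")], 2))
```

A score above zero means the pair occurs together more often than chance predicts. On small samples the measure overrates pairs of rare words, so in practice a minimum frequency threshold is usually applied before ranking.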

  10. Collocations 3 • We can group similar collocates of words together to help to identify different senses of the word. For example, bank might collocate with words such as river, indicating the landscape sense of the word, and with words like investment indicating the financial use of the word.

  11. Collocations 4 • We can discriminate the differences in usage between words which are similar. For example, Church et al. (1991) looked at collocations of strong and powerful in a corpus of press reports. Although these two words have similar meanings, their mutual information scores for associations with other words revealed interesting differences. Strong collocated with northerly, showings, believer, currents, supporter and odor, while powerful collocated with words such as tool, minority, neighbour, symbol, figure, weapon and post. Such information about the delicate differences in collocation between the two words has a potentially important role, for example in helping students who learn English as a foreign language.

  12. the thing that started, at least to the naked eye
      the surface that showed itself to the naked eye
      smooth and featureless as glass to the naked eye
      almost too small to see with the naked eye
      colonies, often quite visible to the naked eye
      devoid of plants at least to the naked eye
      small, they are not always visible to the naked eye
      these marine plants, which the naked eye
      stars the size of Earth, invisible to the naked eye
      The tape was defective, even to the naked eye
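
Lines like these, each ending in the same phrase with its preceding context, are what a concordance program produces. The sketch below shows one way such lines could be extracted. The function name kwic and its width parameter are assumptions made for illustration, and the input reuses two of the lines from this slide.

```python
def kwic(tokens, phrase, width=6):
    """Return concordance lines: up to `width` tokens of left context plus the phrase."""
    phrase = [w.lower() for w in phrase]
    n = len(phrase)
    lines = []
    for i in range(len(tokens) - n + 1):
        # Compare case-insensitively so "The naked eye" would also match.
        if [t.lower() for t in tokens[i:i + n]] == phrase:
            left = tokens[max(0, i - width):i]
            lines.append(" ".join(left + tokens[i:i + n]))
    return lines


text = ("almost too small to see with the naked eye "
        "The tape was defective, even to the naked eye")
tokens = text.split()

for line in kwic(tokens, ["the", "naked", "eye"], width=4):
    print(line)
```

Sorting the returned lines by the word immediately to the left of the phrase would bring recurring patterns such as "visible to" and "invisible to" together, which is how concordances are often read.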
