Corpus 3

Presentation Transcript


  1. Corpus 3 Corpus-based Description

  2. Aspects of corpus-based studies • lexis, morphology, syntax and discourse. • fig. 3.1 A classification of corpus-based research on English

  3. lexical description • The most obvious use of corpora for lexical description is in lexicography. • Not only to identify the set of different words and show when new types enter the language, but to identify the various senses or uses of particular types and their relative frequencies. • e.g. London-Lund Corpus: polysemous word good • Table 3.1 • Identify neologisms

  4. Pre-Electronic Lexical Description for Pedagogical Purposes • Thorndike (1921): word frequency on the basis of a 4.5-million-word corpus of literary works and books read by younger children. • The principle of vocabulary control in the design and editing of reading materials owes much to Thorndike's pioneering work. • Michael West: General Service List of English Words (1953)

  5. Pre-Electronic Lexical Description for Pedagogical Purposes • Description of the most frequent 2,000 words in the written English of the time, supplemented by information on the frequency of the meanings or uses of these words, based on the work of Lorge. • Fig. 3.2 • The Thorndike-Lorge corpus was biased towards more literary and formal styles of writing, and did not include speech at all.

  6. Computer-based studies of lexicon • With a computerized corpus and appropriate software, both significant and more trivial but interesting facts about the lexicon of a language can be uncovered. • Table 3.2 The rank ordering of the 50 most frequent words in various corpora shows remarkable consistency and systematic differences.

  7. Computer-based studies of lexicon Consistency: all the words except said are function words.

  8. Word Occurrence The fact that 40% of the words in a corpus of over five million words occur only once shows that a corpus of even that size is not a sound basis for lexicographical studies of low-frequency words.
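
The once-only proportion described above can be computed in a few lines. The following is a minimal sketch, assuming the corpus is already tokenized into a list of word forms; the function name `hapax_share` and the toy sentence are illustrative only and are not taken from the slides or from any real corpus.

```python
from collections import Counter


def hapax_share(tokens):
    """Return the share of distinct word types that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return hapaxes / len(counts)


# Toy text (hypothetical, not corpus data): 8 types, 6 of them occur once.
toy = "the cat sat on the mat and the dog sat too".split()
print(hapax_share(toy))
```

In a real study the tokenization and case-folding choices would affect the figure, so the share is only comparable across corpora prepared in the same way.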

  9. Word Occurrence Sharman found that there was an almost linear relationship between vocabulary size and corpus size. A new word appeared in the text approximately every 30 words on average. The more narrowly focused the corpus, the more content words find their way into the higher frequency levels.
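
The relationship between corpus size and vocabulary size can be inspected by recording how many distinct words have been seen as the text is read. This is a rough sketch under the assumption of a pre-tokenized text; `vocabulary_growth` and its `step` parameter are hypothetical names, and it does not reproduce Sharman's actual procedure.

```python
def vocabulary_growth(tokens, step=1):
    """Return (tokens_read, types_seen) pairs sampled every `step` tokens."""
    seen = set()
    curve = []
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        if position % step == 0:
            curve.append((position, len(seen)))
    return curve


# Toy sequence (hypothetical): letters stand in for word forms.
toy = "a b a c b d a e".split()
for tokens_read, types_seen in vocabulary_growth(toy, step=2):
    print(tokens_read, types_seen)
```

Plotting the pairs for a large corpus would show whether growth looks roughly linear over the range examined.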

  10. Word Classes • Table 3.5 (written English): Relative proportions of major word classes in the Brown and LOB corpora • As shown in Table 3.6 (spoken English), fewer nouns and a considerable proportion of discourse items characteristic of spoken English are noteworthy.

  11. Word Classes • Table 3.7 shows that some sequences such as adjective + noun or noun + noun are very frequent indeed. • Johansson and Hofland: occurrence of the 40 most frequent sequences of word-class tags at the beginnings and ends of sentences. Findings: the ends of sentences may be more predictable grammatically than the beginnings.
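
Counts of word-class sequences like those in Table 3.7 can be obtained from a tagged corpus by counting adjacent tag pairs within each sentence. The sketch below assumes sentences are lists of (word, tag) pairs; the tag labels and the function name `tag_bigram_counts` are illustrative assumptions, not the tagset used for Brown or LOB.

```python
from collections import Counter


def tag_bigram_counts(tagged_sentences):
    """Count adjacent pairs of word-class tags within each sentence."""
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        # zip pairs each tag with its successor; pairs never cross sentences.
        counts.update(zip(tags, tags[1:]))
    return counts


# Toy tagged sentences (hypothetical tags and data).
toy = [
    [("the", "DET"), ("old", "ADJ"), ("house", "NOUN")],
    [("a", "DET"), ("stone", "NOUN"), ("wall", "NOUN")],
    [("big", "ADJ"), ("dogs", "NOUN"), ("bark", "VERB")],
]
for pair, freq in tag_bigram_counts(toy).most_common(3):
    print(pair, freq)
```

Restricting the count to the first or last pair of each sentence would give a comparable view of sentence beginnings and ends.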

  12. Register studies • Table 3.8 • There are certain characteristics of the vocabulary of scientific English. Certain relational words are disproportionately more frequent in scientific English. Comparative adjectives and adverbs are similarly disproportionately frequent, whereas locative adverbs of space or time are disproportionately less frequent in scientific texts than in general written American English. • Some items occur in one variety but are highly unlikely to occur in the other.

  13. Semantic information • Longman Dictionary of Contemporary English • noun entries 23,800 • 67% one sense 15946 • 20% two senses 4760 • 6.5% three senses 1547 • 2.5% four senses 595

  14. Semantic information • Verb 7921 • 55% one sense 4357 • 23.8% 2 senses 1885 • 10% three senses 792 • 4.4% four senses 348

  15. Collocation • Some words tend to occur in the company of other words in certain contexts, e.g. pouring rain, statistically significant, intrinsic value, strong tendency • Lexicalized unit: set phrase, idiomatic usage, cliché

  16. Collocation • Interest in recurring word combinations: • Wong-Fillmore (1976): The strategy of acquiring formulaic speech is central to the learning of language.

  17. Collocation • Peters (1980): unanalyzed sequences of words had a significant role among the units of language acquisition and proposed ways for identifying such unanalyzed sequences. • Nattinger and DeCarrico (1992): since first language learners can be seen to use varying, apparently unanalyzed, prefabricated chunks of speech, second language teaching might similarly be concentrated around the establishment of what they call lexical phrases.

  18. Collocation • Different characteristics of the sequence: • Allow for no alteration: it's as easy as falling off a log. • Allow certain changes (at the moment/at certain moments) • Relatively free within a framework (too ... to, n ... of)

  19. Collocation • Problems in the definition of collocation: • How often does a combination have to recur to be habitual? • Who decides what sounds natural? • Does a combination have to be well-formed or canonical to be a collocation? • Do collocations have to be syntactic or are they primarily semantic? • Do collocations have to consist of adjacent words or can they be discontinuous?

  20. Collocation • Can a sequence which occurs only once in a particular corpus but which is intuitively recognized by native speakers as a sequence they have heard before be listed as a collocation? • How big does a corpus have to be in order to establish that a collocation does exist? • Are there degrees of collocationality based on the flexibility of the bonding between words?

  21. Collocation • Can we lemmatize collocations so that similar or inflectionally related sequences are counted as a single collocation type? • Can degrees of collocationality be established on the basis of the number of tokens of a type in a particular corpus?

  22. Collocation • Sinclair (1991) suggested that a span of up to four words on each side of a word is the environment in which collocation is most likely to occur, although computer software makes it possible to explore much larger spans, up to the size of a whole text.
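
The span described above can be made concrete by collecting every word that falls within a fixed number of positions on either side of a node word. The following is a minimal sketch, assuming a tokenized text; `collocates`, `node`, and `span` are illustrative names, and raw counts like these would normally be followed by a significance measure that is not shown here.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count words occurring within `span` positions either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` words before
        right = tokens[i + 1:i + 1 + span]     # up to `span` words after
        counts.update(left + right)
    return counts


# Toy text (hypothetical, not corpus data).
toy = "the pouring rain fell and the heavy rain stopped".split()
print(collocates(toy, "rain", span=1))
print(collocates(toy, "rain", span=4).most_common(3))
```

Widening `span` quickly pulls in frequent function words such as "the", which is one reason a narrow window is a practical default.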

  23. Tense and aspect of verbs • Table 3.14 Rank order of the most frequent simple and complex finite verb forms • Table 3.15 Relative frequencies of use of finite verb forms • Table 3.16 Perfect and progressive verb forms in the Brown Corpus • Table 3.17 Finite and non-finite verb forms • Table 3.18 past participle

  24. Modals • Table 3.19 frequency of nine modals • Table 3.20 use of modals • Table 3.21 use of modals in verb-phrase structures

  25. Voice • Table 3.22 active and passive predications • Table 3.23 use of passives in different registers • Table 3.24 verb-phrase structure of agentive passives

  26. Verb and particle use • Subjunctive • Prepositions • Conjunctions

  27. Grammatical studies • Corpus-based grammatical studies revealed considerable genre differences in the use of syntactic patterns and in sentence length. • Syntactic constructions are not in free variation. • Grammatical study is more of a challenge than lexical study because the tagging and parsing needed for automatic analysis of texts, and the software to support them, have not been widely available or user-friendly.

  28. Sentence length • Sentence length is related to genre. The mean number of words per sentence in informative categories is much greater than in imaginative prose. • There is much closer consistency in the number of predications per sentence regardless of genre. • Table 3.41 Sentence length and predications
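
The mean number of words per sentence mentioned above is a simple average over sentence lengths. This sketch assumes sentences have already been split and that whitespace separates words; `mean_sentence_length` and the example sentences are hypothetical and not drawn from the corpora discussed.

```python
def mean_sentence_length(sentences):
    """Return the mean number of whitespace-separated words per sentence."""
    if not sentences:
        return 0.0
    total_words = sum(len(sentence.split()) for sentence in sentences)
    return total_words / len(sentences)


# Toy sentences (hypothetical): lengths 7 and 2.
toy = [
    "the committee approved the revised budget proposal",
    "it rained",
]
print(mean_sentence_length(toy))
```

Computing the same average separately for each genre category would give the kind of comparison the slide summarizes.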

  29. Syntactic processes • Clause patterning • Table 3.42 Distribution of recurrent verb-complement patterns • SVC (adj.) 45% • SVO 20.9%

  30. Syntactic processes • About half of the clauses are matrix clauses and half are embedded. Of the matrix clauses, 97.8% are finite, 1.5% are nonfinite, and 0.7% are elliptical. • The vast majority of all informational subject clauses are extraposed (it is necessary that), reflecting a principle of end-focus from a functional sentence perspective or preferences in sentence organization for processing purposes.

  31. Syntactic processes • In informative prose the verb which precedes a finite that-clause is more likely to be a communication verb such as say or state, whereas in spoken conversation affective or cooperative verbs such as think, feel, and hope tend to predominate.

  32. Noun modification • 98% of postmodifying clauses had one or other of the simpler clause patterns SVO (37%), SVO (38%), SVC (38%), suggesting that embedding tends to favor less complex sentence patterns. • 70% of noun phrases function as subjects or prepositional complements, and noun phrases with postmodifying clauses tend to be disfavoured in subject functions.

  33. Noun modification • Postmodification is less frequent in nonfinal positions of sentences. This is because the subject or topic is familiar enough not to need identification or elaboration through postmodification, or because brief subjects are easier to process.

  34. Causation • The marking of causation can be lexicalized (because, cause), expressed through syntactic structure (because of), or left to implicature. • The choice of how to express causation is seldom free, but is influenced by various semantic, pragmatic, stylistic, cognitive and textual variables.

  35. Pragmatics • Table 3.5 Distribution of discourse items • Comparisons of spoken and written English • Table 3.58 pretty • Table 3.59 really just right
