Computer Corpora and What They Can Tell Us about How People Use Language. Introduction to Information Science (情報科学入門), 11 July 2011
“Corpus”? • Latin “corpus” = body • Latin “corpora” = bodies • English “corpus” = collection of texts • English “corpora” = collections of texts • Japanese “コーパス” (kōpasu) = a collection of documents and similar material
What is a computer corpus? A corpus is a collection of texts stored on a computer. The texts may be books, magazines, letters, Internet pages, e-mails, or parts of these, or transcriptions of speeches, phone calls, or radio programs. A corpus is often stored as a single file in plain text format.
How big is a computer corpus? • It can be very big or very small. • The biggest (e.g. the British National Corpus and the Corpus of Contemporary American English) have 100 million words or more. • A small corpus might have only a few hundred words.
Benefits of computer corpora • In what way do you think computer corpora might be useful? • Any ideas?
What are computer corpora for? • We can use corpora to study language. • What are the most common words? • What words are used together? • What words of a particular type are used together (e.g., under + NOUN)? • If we compare two corpora (e.g. e-mail and textbooks), is a word more common in one? • How do people use words in sentences?
Computer corpora and dictionaries • All major English dictionaries are now based on computer corpora. • How common is a word? • How many different meanings does it have? • What are some examples of its use? • Is it used in a good or bad sense? • What grammatical patterns is it used with? • What other words is it used with?
Word frequencies • What do you think are the most common words in English? • Make a list of about five words.
The most common English words (Oxford English Corpus) • The • Be • To • Of • And • A • In • That • Have • I
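A frequency list of this kind can be built from any text by counting its words. The following is only a minimal Python sketch: the sample sentence and the function name `word_frequencies` are invented for illustration, and the simple tokenizer does not group inflected forms under one headword (the "Be" and "Have" entries above suggest that the Oxford list does).

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercase word tokens (naive tokenizer, no grouping of word forms)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Toy sample text, invented for illustration only.
sample = "The cat sat on the mat. The dog sat by the door and the cat slept."
freq = word_frequencies(sample)
top_word = freq.most_common(1)[0]  # (word, count) pair for the most frequent word
print(top_word)
```

Even in this tiny sample the most frequent word is "the", which matches the top of the list above.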
Concordances • One of the most common ways to study computer corpora is to use a concordance. • A concordance finds all the instances of a word or phrase in a corpus. • It presents a list of the instances, often with the search word in the middle of the screen.
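The idea of a concordance can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how COCA or any other concordancer is implemented: the function name `concordance`, the context width of 30 characters, and the sample sentences are all invented here.

```python
import re

def concordance(text, keyword, width=30):
    """Return one line per hit, with the keyword aligned in the same column."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]   # context before the hit
        right = text[m.end():m.end() + width]               # context after the hit
        lines.append(f"{left:>{width}} {m.group(0)} {right:<{width}}")
    return lines

# Toy sample text, invented for illustration only.
sample = ("Don't forget to lock the door. I'll never forget what she said. "
          "You shouldn't forget that he helped you.")
kwic = concordance(sample, "forget")
for line in kwic:
    print(line)
```

Padding the left context to a fixed width is what lines the search word up in the middle of the display, so that patterns before and after it are easy to see.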
What does the concordance for “forget” tell us? • In the words before “forget”, there are • many examples of negative words: • not, won’t, don’t, couldn’t, shouldn’t, never, nobody • many contractions: • won’t, don’t, you’ll, couldn’t, shouldn’t, you’d • several examples of “to”
What does the concordance for “forget” tell us? (2) • In the words after “forget”, there are • several examples of “to” • several examples of “-ing” • several examples of “what” and “that” • several examples of “the” • several examples of “he”, “she”, “you”, “it”, and “we” • Notice also that “forget” usually comes in the middle of a sentence, not at the beginning or end.
Open a concordance on your PC • Go to http://corpus.byu.edu/coca/. • This site allows you to access the Corpus of Contemporary American English (COCA). • The largest free corpus in the world: • 425 million words, 5 types of text • Spoken • Fiction • Magazine • Newspaper • Academic
Display • At the top left, you will see under Display: • List: Shows a list of words in the right column • Chart: Shows two charts in the right column • Types of text (spoken, fiction, magazine, etc.) • Time (1990-1994, 1995-1999, etc.) • KWIC (Key Word in Context): Shows nouns, verbs, etc. around the search string • Compare: Shows results for two words
Search String • Under Search String, you will see: • Word: Type a word (e.g. “head”). • Collocates: Type a word used near “head”. • The two boxes next to Collocates show • Maximum number of words before “head” • Maximum number of words after “head” • POS (Part Of Speech): Select a part of speech (e.g., noun, verb, etc.) used near “head”. • Random: This chooses a random search string. • Search: Click this to begin your search. • Reset: Clear the left column.
Sections • Show: Check this box to show charts for • Type of text (Spoken, Magazine, etc.) • Time • 1: Choose a type of text for the search string • Ignore (= all types) • Spoken • Magazine • Newspaper • Academic • 2: If you are comparing two search strings, choose the type of text for the second string.
Search syntax • To find two words together: • To find “good luck”, type “good luck” in Word(s). • To find the neighboring word: • To find what word comes after “dog”, type “dog *”. • To find what word comes before “dog”, type “* dog”. • To find two words with 0 to 4 words between them: • Word(s): dog • Collocates: bark • Set the two boxes next to Collocates to 0 (words before) and 5 (words after). • This finds “dog bark”, “dog will bark”, “dog will often bark”, “dog will not always bark”, “dog will in no situation bark”.
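The collocate window can be pictured as a search over word positions. The sketch below is a simplified illustration, not COCA's actual algorithm: the function name `find_with_collocate` and the sample text are invented, and the window here ignores sentence boundaries.

```python
import re

def find_with_collocate(text, word, collocate, before=0, after=5):
    """Return the word spans where `collocate` falls inside the window around `word`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        start = max(0, i - before)               # first position inside the window
        end = min(len(tokens), i + after + 1)    # one past the last position
        for j in range(start, end):
            if j != i and tokens[j] == collocate:
                lo, hi = min(i, j), max(i, j)
                hits.append(" ".join(tokens[lo:hi + 1]))
    return hits

# Toy sample text, invented for illustration only.
sample = ("A dog will often bark at night. The dog will in no situation bark. "
          "That dog will almost certainly never ever bark.")
hits = find_with_collocate(sample, "dog", "bark", before=0, after=5)
print(hits)
```

With a window of 0 words before and 5 after, "bark" is found when it is at most the fifth word after "dog"; in the third sample sentence it is the sixth word, so that sentence is not matched.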
Search syntax (2) • To find different forms of a word: • Word(s): [blow] away • This finds “blow away”, “blows away”, “blew away”, “blowing away”, “blown away”. • To find all the words that begin the same way: • Word(s): comp* • This finds “compare”, “compute”, “computer”, “compiler”, “comply”, etc. • To find any one of a set of words: • Word(s): cut|cuts|cutting • This finds “cut”, “cuts”, “cutting”.
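These three query types correspond closely to regular expressions. The sketch below translates a simplified version of the syntax into Python's `re` module. It is an illustration only: `query_to_regex` and the tiny `LEMMA_FORMS` table are hypothetical names invented here, and a real corpus tool would rely on a full list of word forms rather than a hand-written one.

```python
import re

# Hypothetical mini-lexicon of inflected forms; a real tool would use a full lemma list.
LEMMA_FORMS = {"blow": ["blow", "blows", "blew", "blowing", "blown"]}

def query_to_regex(query):
    """Translate a simplified query into a compiled regular expression."""
    parts = []
    for term in query.split():
        if term == "*":
            parts.append(r"\w+")                           # any single word
        elif term.startswith("[") and term.endswith("]"):
            forms = LEMMA_FORMS[term[1:-1]]                # all forms of one word
            parts.append("(?:" + "|".join(forms) + ")")
        elif "|" in term:
            options = [re.escape(t) for t in term.split("|")]
            parts.append("(?:" + "|".join(options) + ")")  # any one of a set
        else:
            # a * inside a term stands for any run of word characters
            parts.append(re.escape(term).replace(r"\*", r"\w*"))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

# Toy sample text, invented for illustration only.
sample = "The wind blew away the leaves. A computer compiled the data. She cuts paper."
blow_hits = query_to_regex("[blow] away").findall(sample)
comp_hits = query_to_regex("comp*").findall(sample)
cut_hits = query_to_regex("cut|cuts|cutting").findall(sample)
print(blow_hits, comp_hits, cut_hits)
```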
Try the COCA concordance • In the top right corner, type • Your e-mail address. • Your password. • In the Word(s) box, type “played”. • Click on “Search”. • In the top right column, click on “PLAYED”. • What topics are most of the examples about?
Findings for “played” • Acted • Played a role, played a key role, played in the movie • Sports • Played football, played 158 games, played his last game • Other games • Played cards • Music • The Paris orchestra played, bands played, pianist played • Played for fun (Japanese 遊んだ) • Played among easels
Word frequency • At the top right, under TOT, you see “52589”. • The corpus contains 52589 examples of played. • Under Display, select CHART. • Click the Search button. • The right column shows the frequency of played in different types of text. • In which type is it most common? Why? • You can also see the frequency for 5-year periods. • In which period was it most common?
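When sections of a corpus differ in size, raw counts are hard to compare, so frequencies are usually expressed per million words. The sketch below shows that calculation in Python. The function name `per_million` and the two tiny "sections" are invented for illustration and say nothing about the real COCA figures.

```python
import re

def per_million(text, word):
    """Frequency of `word` per million tokens, so sections of different size compare fairly."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count(word) / len(tokens) * 1_000_000

# Tiny invented "sections"; real sections hold many millions of words.
sections = {
    "spoken": "he played well and she played too",
    "academic": "the role played by enzymes in digestion is well known today",
}
rates = {name: per_million(text, "played") for name, text in sections.items()}
for name, rate in rates.items():
    print(name, round(rate))
```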
Try a two-word search • Click the Reset button. • In the Word(s) box, type “* friend of *”. • Click on “Search”. • Notice the words before and after “friend of”. What did you find?
Findings for “friend of” • Before • “a” • “good” • “close” • “old” • After • “mine” • “the” • “his” • “hers” • “ours” • “theirs”
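A search such as “* friend of *” amounts to counting the word on each side of the phrase. A minimal Python sketch of that count follows. The function name `neighbours` and the sample sentences are invented for illustration, so the counts it prints are not corpus results.

```python
import re
from collections import Counter

def neighbours(text, phrase):
    """Count the word immediately before and immediately after each hit of `phrase`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    before, after = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            if i > 0:
                before[tokens[i - 1]] += 1
            if i + n < len(tokens):
                after[tokens[i + n]] += 1
    return before, after

# Toy sample text, invented for illustration only.
sample = ("She is a good friend of mine. He was an old friend of the family. "
          "A close friend of mine called, and a good friend of his came too.")
before, after = neighbours(sample, "friend of")
print(before.most_common(), after.most_common())
```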
Two words with an optional gap • Click the Reset button. • In the Word(s) box, type “a”. • Click on Collocates. • In the Collocates box, type “teacher”. • Set the two boxes next to Collocates to 0 and 5. • Click “Search”. • In the top right column, click on “TEACHER”. • Notice the words between “a” and “teacher”. What did you find?
Findings for “a . . . teacher” • “a former teacher” • “a retired teacher” • “a 10-year teacher” • “a young teacher” • “a physical education teacher” • “a full-time coach and teacher” • “a social studies teacher” • “a job as an English teacher”, etc.
Word + Part of Speech (POS) • You can also search for a word with a POS. • E.g., made me + VERB • Click on the POS button in the left column. The list includes:
noun.ALL: all common nouns
verb.ALL: all verbs
adj.ALL: all adjectives
adv.ALL: all adverbs
neg.ALL: all instances of “not”, “n’t”
art.ALL: all articles (“a”, “an”, “the”)
det.ALL: all determiners (“this”, “these”, etc.)
pron.ALL: all pronouns
poss.ALL: all possessive pronouns (“my”, “your”, etc.)
prep.ALL: all prepositions
conj.ALL: all conjunctions
noun.ALL+: all common and proper nouns
noun.SG: singular noun
noun.PL: plural noun
noun.CMN: common noun
noun.+PROP: proper nouns
verb.BASE: base form of verb (“know”, “think”, etc.)
verb.INF: infinitive form of verb (“be”, “have”, etc.)
verb.MODAL: modal form of verb (“may”, “might”, etc.)
verb.3SG: 3rd person singular verb (“has”, “goes”, etc.)
verb.ED: past tense verb (“went”, “played”, etc.)
verb.ING: “ing” form of verb (“going”, “playing”, etc.)
PUNC: all punctuation marks (. , ; : ! ? - etc.)
Search for a word + POS • In the left column, click “Reset”. • In the Word(s) box, type “made me”. • Click “POS”. • In the POS box, choose verb.ALL (all verbs). • Click “Search”. • Notice the words after “me”. What did you find?
Findings for “made me” • All the words after “me” were bare infinitives. • The most common verb was “want” (299). • There were many “thinking” verbs, e.g., “realize”, “see”, “believe”, “think”, “understand”. • There were also some “action” verbs, e.g., “do”, “look”, “take”, “get”.
Inflected forms • Click “Reset” • In the Word(s) box, type “I wish I [be]”. • Click “Search”. • Notice the word after “I wish I”. What did you find?
Findings for “I wish I” • “I wish I was” (202 cases) • “I wish I were” (197 cases) • Traditional grammar treats “I wish I were” (the subjunctive) as the correct form. • Native English speakers do not always use English “correctly”.
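A comparison like this one is a count of which form follows a fixed phrase. The Python sketch below shows one way to do that count. The function name `count_variants` and the sample sentences are invented for illustration, so its output has no connection to the COCA figures above.

```python
import re
from collections import Counter

def count_variants(text, prefix, variants):
    """Count which of `variants` directly follows `prefix` in the text."""
    options = "|".join(re.escape(v) for v in variants)
    pattern = re.compile(
        r"\b" + re.escape(prefix) + r"\s+(" + options + r")\b", re.IGNORECASE
    )
    return Counter(m.group(1).lower() for m in pattern.finditer(text))

# Toy sample text, invented for illustration only.
sample = "I wish I were taller. I wish I was there. Sometimes I wish I was a bird."
counts = count_variants(sample, "I wish I", ["was", "were"])
print(counts)
```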
Pre-lecture quiz What answers did you get? • happy ______ • What a _______ • I haven’t a _______ • as good as ______ • ______ the winter • I’ve _____ arrived • Don’t be a ______ • a ______ breakfast • He didn’t take any _______ • She ______ her head
Answers to the pre-lecture quiz • happy [to, with, and, about, birthday] • What a [great, lot, good, wonderful, difference] • I haven’t a [clue, thing, single, choice] • as good as [the, it, they, any, a, I, you] • [in, during, for, of, through] the winter • I’ve [just, now, finally, always, already] arrived • Don’t be a [fool, stranger, hero, jerk, baby] • a [big, good, hearty, late, quick] breakfast • He didn’t take any [questions, of, shit, precautions] • She [shook, shakes, turned, tilted] her head
Summary • We can learn a lot about language from computer corpora. • In particular, concordances can show us how people really use language in practice. • Concordances are useful for students of English • To check how vocabulary is used. • To check grammatical constructions.
Some other online concordances • Michigan Corpus of Academic Spoken English (MICASE) • http://quod.lib.umich.edu/m/micase/ • Web Concordancer (English) • http://www.edict.com.hk/concordance/WWWConcappE.htm • Corpus Concordance English • http://www.lextutor.ca/concordancers/concord_e.html
Post-lecture quiz • Please complete the quiz paper I gave you today. • Submit it to me by tomorrow evening. • If you don’t submit it, you will not get any points for attending this lecture. That’s it, folks!