Intro to corpus linguistics Tyler Schnoebelen http://www.stanford.edu/~tylers http://corpuslinguistics.WordPress.com
What’s a corpus? • A collection of texts • Some folks will distinguish between “corpus”, “data archive”, and “collection”. Ignore them. • What’s a text? • Really, we’re talking about anything language-y • So in principle, we don’t make a distinction between “every episode of Friends”, “the Book of Genesis in 25 unrelated languages”, and “Usenet forums, 2005-2010”. (There are practical distinctions to be made, of course.) • What’s a collection? • Even the most internally diverse corpus has something holding it together. • This is actually pretty important to reflect upon • Whatever brought the texts together is going to have direct bearing on whether your research question and the corpus are going to get along
What’s a corpus good for? • Exploration and hypothesis generation • A friend says something striking • The car needs washing • John smokes a lot anymore • Ohmigod, totes • The woman brought a ham sandwich tripped • Who else says this? When do they say it? • More data will give you a better feel for what’s going on • Hypothesis testing • The key thing about corpora is that they give you counts • Counts let you compare
What can you count? • Variations • Talkin’ vs. talking • She picked up the paper clip / she picked the paper clip up • Presence/absence • I heard (that) you went to Egypt • In this case, you can’t just count when that is there, you need to count when it isn’t. So you need something syntactically parsed (the you went to Egypt in this example has to be distinguished from You went to Egypt, didn’t you?) • Some corpora have phonetic/phonological marking—for example, there are child language corpora that mark prominent stress. • The presence of speech errors tells you something about what’s going on in the brain and socially—whether you’re looking at false starts or uh/um’s for children acquiring language, non-native speakers, adults talking to family members…or undergraduates on speed dates • Co-occurrence • Which emoticon(s) does lmao go with the most? • Do the verbs that occur most with he differ from those that occur most with she? • You can take a word like butter and make it an adjective by adding –y: buttery. You can then talk about this quality as a noun by adding –ness: butteriness. How many morphemes can you string together? Are there rules about the order they get added? • Conservative commentators say Obama uses I/me/my more than any other president in recent memory. Are they right? (And if they are, what relationship does the first person singular have to personality/stance?)
What can you count? • Your hypothesis may require you to do annotation • Let’s say you want to look at gender and swear words • You’ll need to find a corpus that has gender annotated already (or makes it easy for you to do) • The corpus needs to have enough swear words to be worth your while • You need to decide what counts as a swear word—fuck and shit are obvious, but what about darn and damn? What about dick? Both when talking about a penis and when talking about a jerk? Or just one of those? If you are doing automatic annotation, how do you make sure not to get guys named Dick or people talking about private detectives?
Words • Whether you’re interested in the use of freedom by politicians over time or the acquisition of intransitive verbs by children, you’re likely to be counting words. • The definition of a “word” is tricky. • In Internet corpora we probably want to treat !!! as a word, for example. Is mother-in-law one word or three? • Lemmas—do you want to lump book and books together? Went and going? • Part-of-speech tagged—do you need to distinguish between fish_verb and fish_noun? Mean_verb and mean_adj?
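The choices on this slide (is !!! a word, is mother-in-law one word, do book and books get lumped together) can be made explicit in a small counting script. What follows is a minimal sketch, not a recommended tokenizer: the token pattern and the tiny lemma table are assumptions made up for illustration, and a real project would swap in a proper lemmatizer.

```python
import re
from collections import Counter

# Assumed token pattern: alphabetic words (keeping internal hyphens and
# apostrophes, so mother-in-law stays one token), runs of ! or ?, and digits.
TOKEN_RE = re.compile(r"[a-z]+(?:[-'’][a-z]+)*|[!?]+|[0-9]+")

# Hypothetical hand-made lemma table, only for this example.
LEMMAS = {"books": "book", "went": "go", "going": "go"}

def tokenize(text):
    """Lowercase the text and split it into tokens using TOKEN_RE."""
    return TOKEN_RE.findall(text.lower())

def count_words(text, lemmatize=False):
    """Count tokens; optionally map each token to its lemma first."""
    tokens = tokenize(text)
    if lemmatize:
        tokens = [LEMMAS.get(token, token) for token in tokens]
    return Counter(tokens)

sample = "My mother-in-law went to buy books!!! She is going to love this book."
surface_counts = count_words(sample)
lemma_counts = count_words(sample, lemmatize=True)
```

Comparing surface_counts with lemma_counts shows how much the lemma decision changes the numbers before any analysis starts.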
The corpus you choose matters • Size: An effect has to be really strong for you to discern it in a small corpus • Sampling: What does the corpus claim to represent? Does it? • Stance: Is the corpus “naturalistic” (‘Hey, you can call a friend across the world for free if you let me tape you’ vs. ‘Here, read this word list’)? • Word lists are not inherently good or bad. Consider having a million people read from a word list—the data is likely to be a lot easier to deal with than finding a million people saying pin in conversation. • What is the speaker/author’s relationship to the fact that they are being recorded in some way? Does it limit who participates and/or what they say in a way that affects your research question? • There is no perfect corpus, so just think through what you’re getting and what you’re giving up by choosing a particular corpus.
Making your own corpus What gap does your corpus fill? What makes it better than existing corpora (easier to use, bigger, different population, different style, etc)? How did you come across your sample? Can you claim representativity? WHAT are you representative of? How are you annotating it? Btw, do you need human subjects approval? (How are you storing it and making it available to others are good questions, too.)
Whitney Corpus Let’s say you’ve been thinking about Whitney a lot. And you wonder what sort of words her lyrics conveyed. You’d go out and get lyrics for her songs—all her songs? If not all, which ones? Let’s say the hits. You know how to go find those lyrics. Do you have any reason to grab metadata like year? (Might as well if it doesn’t slow you down too much.) Here’s where a script to count words is going to be useful. But aha! Do refrains get counted? Once? As many times as they are sung? If you store the data in a way that marks the refrains, then you can see how much difference this choice makes.
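One way to store lyrics so that the refrain decision stays open is to mark each section with its kind and how many times it is sung. The sketch below assumes that layout; the song, its lyrics, and its metadata are invented placeholders (not real song text), and the function name is hypothetical.

```python
from collections import Counter

# Placeholder song invented for illustration. Each section is
# (kind, text, times_sung), so refrains are marked explicitly.
song = {
    "title": "Example Song",
    "year": 1990,
    "sections": [
        ("verse", "i walk alone tonight", 1),
        ("refrain", "love me love me now", 3),
        ("verse", "i walk back home", 1),
    ],
}

def count_song_words(song, refrains="every"):
    """Count words in a song.

    refrains="every" counts a refrain each time it is sung;
    refrains="once" counts it a single time.
    """
    counts = Counter()
    for kind, text, times_sung in song["sections"]:
        if kind == "refrain" and refrains == "once":
            weight = 1
        else:
            weight = times_sung
        for word in text.split():
            counts[word] += weight
    return counts

every = count_song_words(song, refrains="every")
once = count_song_words(song, refrains="once")
```

Running both settings over the same corpus and comparing the top words is a quick check on how much the refrain choice matters.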
Ask specific questions because you’ll only get specific answers • Imagine you found/conducted interviews with members of Stanford’s Black Student Union • You can ask and answer all sorts of interesting questions • For example, how these students are using AAVE (African American Vernacular English) over the course of the interviews • But it’d be weird for you to make giant global claims about AAVE given the inherent bias of your sample
Names • Stephanie Shih has been doing fascinating work on the phonology of first and last name pairs from Facebook data (slides! based on data from here!) • For example, SUsan SMITH, which is “rhythmically well-formed”, is far better than SuZANNE SMITH • We’d expect some effects to be gendered (different sounds are associated with masculinity/femininity) • This may be more exaggerated when you have people who have no constraints on their name choices—that is, performers
Porn names • Stephanie and I will look at how well the first-last name pairs correspond to rules for Facebook names (“avoid stress clashes”) • But I just got the porn name corpus together this week so, no results so far. • Fwiw, the most popular gay male first names (just by counts) are: • Steve, Chris, Tony, Scott, Mark, Mike, Brad, Eric, Jeff • The most popular last names are: • Thomas, Scott, Taylor, Stone, Hunter, Michaels, Adams, and Williams • But note that what we really want to know is whether any of those are over/under represented compared to the overall population (say, “American males born since 1960”)
Words over time http://books.google.com/ngrams; also play around with Hebrew, Russian, Spanish, Chinese, German, and French: http://books.google.com/ngrams/datasets
“Looking at” My partner hates when I say I’m looking at something He’d rather me investigate, research, fiddle with, tinker with—ANYTHING but look at. He says it’s pretentious and academic. How can I prove him wrong?
A few resources • The Corpus of Contemporary American English has a section for “academic” but it’s written, so maybe not the greatest • The Michigan Corpus of Academic Spoken English, on the other hand, is academics (labeled for discipline, gender, etc), so we might try that. • What are we comparing to? Spoken portions of COCA (basically, talk shows)? • Who else looks at things? Politicians? We could go to Congressional transcripts, perhaps. (Lillian Lee has this and much more: http://www.cs.cornell.edu/home/llee/data/)
Quick note on stats • Let’s look at words that occur with emoticons on Twitter (a total of 21,891,914 words) • There are 5,175 tweets that have ugh • There are 101,463 tweets that have ;P • So the null hypothesis—that there’s nothing special bringing these two things together or keeping them apart—says we should expect: • (ugh/total)*(;P/total)*total = 24 tweets with both ugh and ;P • In fact, there are 14 tokens of ugh used with ;P • Is there something special or not?
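The expected count on this slide is simple arithmetic, so it is easy to reproduce. The sketch below uses the slide's numbers; the variable names are made up, and the Poisson tail probability at the end is an added assumption (one common way to ask how surprising 14 is when about 24 are expected), not something the slide itself computes.

```python
import math

total = 21_891_914        # total from the slide
ugh = 5_175               # tweets containing "ugh"
wink_tongue = 101_463     # tweets containing ";P"
observed_both = 14        # tweets containing both

# Expected co-occurrences if the two items are independent.
expected_both = (ugh / total) * (wink_tongue / total) * total

# Observed-to-expected ratio: below 1 means they co-occur less than chance predicts.
ratio = observed_both / expected_both

def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson variable with mean lam (assumed model)."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

# Probability of seeing 14 or fewer co-occurrences by chance under that model.
p_at_most_observed = poisson_cdf(observed_both, expected_both)
```

A small p_at_most_observed would suggest that ugh and ;P avoid each other more than chance alone predicts; whether the Poisson model suits tweet data is a separate question.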
You’re just getting started The point here is to use corpora to get out of your head For this course, don’t worry about using stats Although stats will turn out to be pretty useful in a lot of different parts of your life, so think about acquiring at least the basics
Conversations (sociolinguistic interviews) • Folks at OSU interviewed 40 long-time residents of Columbus, Ohio trying to get a balanced sample (20 old, 20 young, 20 male, 20 female, all Caucasian). • Two interviewers—one male, one female, each seeing equal proportions of age/gender subjects. • Target interview length was 60 minutes. • Focused on the subjects’ life and the area they lived/grew up around
Little Little is used at least twice by all but one male speaker (the average number of little tokens is 10.2 per speaker). Women use little more (220 times versus 188) and they use it significantly more than expected (p=0.0276 by chi-squared test).
                    Observed   Expected   O/E
Female-to-female          93         99   0.936
Female-to-male           127         98   1.290
Male-to-male              89        101   0.879
Male-to-female            99        109   0.908
Token counts from the Buckeye corpus (the difference between observed and expected is significant: p=0.0112 by chi-squared test).
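A chi-squared comparison of observed and expected counts like the ones in this table can be sketched with the standard library alone. The function names are hypothetical, and the degrees of freedom (number of categories minus one) are an assumption about how the test was run. Because the expected counts on the slide are rounded, the result should land near the reported p-value without matching it exactly.

```python
import math

def chi_squared_stat(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_squared_p(stat, df):
    """Upper-tail probability for a chi-squared statistic with df degrees of freedom."""
    half = stat / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over j < df/2 of (x/2)^j / j!
        return math.exp(-half) * sum(
            half**j / math.factorial(j) for j in range(df // 2)
        )
    # Odd df: complementary error function plus a finite series.
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * stat / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        tail += term
        term *= stat / (2 * j + 1)
    return tail

# Buckeye counts for "little" by speaker/listener gender, from the table.
observed = [93, 127, 89, 99]
expected = [99, 98, 101, 109]

stat = chi_squared_stat(observed, expected)
p_value = chi_squared_p(stat, df=len(observed) - 1)  # assumed df = 3
```

The same two functions can be pointed at the other observed/expected tables in these slides by swapping in their counts.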
Not about self-minimization for women, though • yknow they're just trying to spread out a little bit and eventually it'll be something a little more nationwide • yknow it's little more problem solving less running around and shooting things • and they have honors classes which are separate and they're supposed to be with like better faculty and um smaller class size and a little more challenging • oxleys is probably the closest place and those little hot dog stands, those really don't count • they had this little apartment
Conversations (strangers) Switchboard: 2,400 conversations Fisher: 11,699 conversations Both: between strangers, talking about an assigned topic for about ten minutes Both: Balanced for dialect regions in the US Fisher: Also allows in non-native English speakers (they are marked as such) Switchboard: A lot more annotation has been done on it (syntactic parsing, for example, but you name it, linguists have added it)
          Observed   Expected   O/E
Female       8,076      7,998   1.010
Male         5,843      5,921   0.987
Counts for little by gender show no real difference (p=0.1803 by chi-squared test) in Fisher. But things change rather dramatically when we look at who is talking to whom:
                    Observed   Expected   O/E
Female-to-female       6,047      5,844   1.035
Female-to-male         2,012      2,139   0.941
Male-to-male           4,003      3,782   1.058
Male-to-female         1,820      2,117   0.860
Mixed-gender conversations in the Fisher corpus use less little (p=6.566 × 10^-15 by chi-squared test)—all speakers.
Conversations (friends/families) There are lots of these sorts of corpora in a bunch of different major world languages—look for “CALLFRIEND” and “CALLHOME” The English CALLHOME gives people free long distance in exchange for letting their conversations be recorded. They last 30 minutes and the last 10 minutes are transcribed. 120 telephone conversations
In keeping with the general trend, women do seem to use little more than men.
           Observed   Expected   O/E
Females         312        291   1.073
Males            56         77   0.725
Differences in little in CALLHOME (p=0.00664 by chi-squared test). Focusing on conversations that only have two participants (i.e., excluding the multiparty calls), we can see that the strongest pressures seem to be happening among men.
                    Observed   Expected   O/E
Female to female         234        215   1.088
Female to male            40         38   1.064
Male to male              25         36   0.685
Male to female            26         36   0.727
Gender interaction differences in little (p=0.0441 by chi-squared test).
Topic                          Actual   Expected   O/E
Hobbies                           543        264   2.059
Computer games                    263        143   1.834
Outdoor activities                446        246   1.813
Health and fitness                522        289   1.809
Current events                    362        231   1.565
Food                              442        297   1.486
Airport security                  292        203   1.439
Friends                           326        229   1.424
Family values                     168        119   1.416
Pets                              745        528   1.412
…
Arms inspections in Iraq          159        260   0.611
Affirmative action                104        172   0.604
Personal habits                   139        237   0.586
Life partners                     227        437   0.520
Comedy                            214        414   0.517
Issues in the Middle East          92        191   0.481
Hypothetical: Perjury              51        113   0.450
Minimum wage                      204        476   0.428
Hypothetical: Time travel         125        315   0.397
Hypothetical: Own biz              88        242   0.364
Talking about terrorism Let’s look more closely at the Terrorism topic: “Do you think most people would remain calm, or panic during a terrorist attack? How do you think each of you would react?” Overall people weren’t using little with this topic…except women talking to women: • F2F: 125 observed vs. 67 expected • Everyone else: 46 observed vs. 84 expected
A lot of F2F emotional regulation • no i think i'd be i'd i'd have to probably tell myself to be a little bit calm because we do have a two year old • yeah i mean before a- you know all this happened i was a little bit scared • um i know that i would be probably a little terrified [laughter] • and i i think i would just like to be a little more aware • [laughter] but i i would probably be a little bit more philosophical about it the same i would with a tornado warning or anything else i would think well is this it is my time up • it just makes you a little more secure
The women speaking to men about terrorism • The women speaking to men talk about being a little… • more prepared • more frightened • more cautious • more progressive • half-asleep • In other words, they are also using little to express affective states, but they are doing it much less often overall.
Health and Fitness In general, there’s a lot of use of little between people talking about Health and Fitness—except for men talking to men: • M2M: 24 observed, 72 expected • Everyone else: 498 observed, 217 expected
First: men talking to women • yeah so I'm about two hundred and ten pounds a little bit over I'm six foot four • well like I said I'm a little bit medium build so I got a bit of a belly there • so I've been I know I'm really really fortunate although coming home to visit my family and my mom and stuff has been feeding me a lot so I think that now I've got a little tummy so I've got to start doing something about it • I had a little bit of a pudge when I was like nine ten years old too then puberty kicked in and I was like rail thin after that [laughter] • I get a little bit lazy and start to get a little bit flabby around the middle then I start doing pushups
Now: men talking to men • The examples of men talking to women have a lot of “body” talk. Only two uses of little for the men-to-men have bodies involved: • so every time she would yell fat then I would s- stay on the bike for a little bit longer • like if I work at it I can put on a little bit of muscle mass • And notice that they are not at all about flab or pudge but the opposite
ICSI The data come from 75 meetings collected at the International Computer Science Institute at Berkeley between 2000 and 2002. These are generally the regular weekly meetings of various teams—each meeting has between 3 and 10 participants (average of 6). The meetings range from 17 to 103 minutes, but are usually just shy of an hour each, for a total of 72 hours of data.
In ICSI It isn’t gender that matters—there’s no statistically significant difference between men and women in the rates of little use. But there IS by education.
Education    Speakers (mean age)   Observed   Expected   O/E
Undergrad     6 (30 yo)                  59         34   1.734
Grad         14 (29 yo)                 234        223   1.049
Postdoc       1 (not given)              51         75   0.676
Ph.D.        11 (37 yo)                 152        228   0.667
Professor     4 (52 yo)                 278        213   1.302
Natively-born American speakers in the ICSI corpus.
Education/age and little • But the folks with undergraduate educations are using little in talking about themselves: • Sometimes the German accents can get a little bit daunting. • I was getting a little frustrated. • So I've like learned a little bit. • I was just gonna say maybe fifteen minutes later would help me a little bit. • That’s not how the professors are using it, though: • So one thing you could do is build a little system that, said, whenever you got a question like that I've got… • A lot of discriminatory power and then just have a little section in your belief-net that said, "pppt!" • Add an- a little thi- eh a thing for them to initial. • Actually it's a little tricky. • The little note I sent said that.
Education/age vs. gender • Btw, just as education/age matters in ICSI and gender doesn’t • Education/age don’t really matter in Fisher or Buckeye, though gender does • Caveat—education/age do matter when it comes to non-native English speakers in Fisher • Why the differences? • Our best answers come from the nature of the corpora!
More on stats • Stanford offers lots of courses that will help you do quantitative analysis of data—our linguistics department is big into this • Also consider: • Harald Baayen’s book on quantitative analyses of language data • Stefan Gries’ book on corpus linguistics; also take a look at the slides here, especially the intro slides: • http://www.linguistics.ucsb.edu/faculty/stgries/teaching/ucsb_ling201.html
Some steps • Write the first draft of your research question—maybe just as simple as filling in the blank: “I’m curious about ____” • Identify potential corpora. Start by thinking of what an ideal data source would be and use that to figure out what sort of characteristics matter to you. • Find examples in your potential corpora. Explore what’s going on. Brainstorm. Share with friends. • Come up with a hypothesis. Make it specific enough so that you can test it: “If I’m right, then I should find lots of X and much less/no Y”. I should be able to disprove your hypothesis, which means it needs to be specific. • Add any annotations you need. • Do the counting. • Were you right? What are the things that could be giving your hypothesis an unfair advantage or could be messing your hypothesis up? Are there subsets of the data where your hypothesis is/is not true? (Iterate as necessary.)
What’s corpus linguistics? Really it’s just linguistics That is, you cannot do linguistics without language data The reason people distinguish “corpus linguistics” is usually to say: if you are just using the intuitions in your head, none of the rest of us can double-check your work
Corpus links • Audio: • Switchboard is a standard among linguists and has been richly annotated. But Fisher provides much more data—so if you don’t need syntactic parsing, this is the way to go. • The most studied dialect in the world is African American Vernacular English, but most of the sociolinguistic corpora are inaccessible. You might try the SLAAP project, however. • The Santa Barbara Corpus is a rich source of discourse and does have ethnicity/age/gender/occupation/social class differences marked. Also take a look at TalkBank. • CALLHOME offers friends and family chatting long-distance. • CHILDES has parent/child interactions (with audio and video). See the blog post here for getting started. • Internet language • You can use the Twitter API or various services like HootSuite that make it easy. • For Usenet archives (2005-2011): http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html (follow the link to Amazon) • See also: • Our corpus linguistics blog: http://corpuslinguistics.blogspot.com • The Language Log is full of examples: http://languagelog.com • The LDC list of corpora: http://www.ldc.upenn.edu/Catalog/ • Our non-LDC corpora: http://linguistics.stanford.edu/department-resources/corpora/inventory/ • Other corpora around the web: http://www.uow.edu.au/~dlee/CBLLinks.htm as well as http://www.athel.com/corpus.html • Geoffrey Sampson and Diana McCarthy. 2004. Corpus Linguistics: Readings in a Widening Discipline. • Relevant journals (and other resources): http://home.uchicago.edu/~sclancy/corpling/ • James Pennebaker does fascinating work with linguistics from the psychology side. Look him up. (He also has a site that will tell you the personality of people from their Twitter accounts: http://analyzewords.com/)
What do I need to do? • If you need access to the corpora we house, follow steps here (basically, you send me an email): • http://linguistics.stanford.edu/department-resources/corpora/get-access/ • Here’s a VERY rough outline of what happens next. • Then you want to log in to the “AFS” space where everything is stored. • Use SecureCRT for access • Use SecureFX for file transfer • Get both from http://itservices.stanford.edu/service/ess • You open SecureCRT, sign in with your account info to cardinal.stanford.edu. • Go to your corpus: • cd /afs/data/linguistic-data/{wherever} • You need to find the stuff you’re interested in. • One useful tool is grep (or egrep or fgrep). There are lots of tutorials all over the web. • grep -wi "little" *.txt > ~/littlefromwhatevercorpus.txt • That’s going to find every line with a whole-word match for “little”, ignoring case, in any txt file in the directory you’re in. It will store all those matches in your home directory (not the current directory) in a file called littlefromwhatevercorpus.txt • It’s probably easiest to manipulate the files on your computer, so open up SecureFX and use it to get the files
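If you would rather do the grep step from a script (for example, on files you have already copied to your own computer), the same whole-word, case-insensitive search can be approximated in Python. This is a rough sketch: the function name is made up, and Python's \b word boundary is close to but not identical to what grep -w treats as a word edge.

```python
import re
from pathlib import Path

def grep_whole_word(word, directory, pattern="*.txt"):
    """Return (filename, line) pairs for lines that contain `word`
    as a whole word, ignoring case, in files matching `pattern`."""
    word_re = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    matches = []
    for path in sorted(Path(directory).glob(pattern)):
        # errors="replace" keeps odd bytes in old corpus files from crashing the scan.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if word_re.search(line):
                    matches.append((path.name, line.rstrip("\n")))
    return matches
```

Calling grep_whole_word("little", ".") in a directory of transcripts returns the matching lines along with the file each one came from, which is handy when you later need speaker or conversation metadata.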
Creating your own corpus • If you’re creating your own corpus, you want an easy way to get word counts and to find words-in-context. • If you have any programming abilities, this stuff is easy. If not, consider learning Python. The “Natural Language Toolkit” module makes language research programming really easy! • (Free) off-the-shelf “concordancers” include: • TextStat: http://neon.niederlandistik.fu-berlin.de/en/textstat/ • WordAndPhrase.Info (by the guy who did COCA): http://www.wordandphrase.info/analyzeText.asp • You’ll also hear about WordSmith and MonoConc, but they cost money so you’ll probably skip ‘em. • Other issues for developing corpora: http://www.ahds.ac.uk/creating/guides/linguistic-corpora/index.htm
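To give a feel for what a concordancer does, here is a minimal keyword-in-context sketch in plain Python. It is an illustration rather than a replacement for the tools above: the function name, the simple token pattern, and the context width are all assumptions, and the sample text reuses two of the little examples from earlier slides.

```python
import re

def concordance(text, keyword, width=3):
    """Return keyword-in-context rows as (left words, keyword, right words)."""
    # Assumed tokenization: runs of word characters, apostrophes, and hyphens.
    tokens = re.findall(r"[\w'’-]+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows

sample = "they had this little apartment and those little hot dog stands"
rows = concordance(sample, "little", width=2)
for left, key, right in rows:
    print(f"{left:>20}  {key}  {right}")
```

Lining the keyword up in a column like this makes it easy to scan what tends to come right before and right after it.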
Last minute thoughts • I’ve said very little about multilingual corpora (UN Parallel Texts, for example) • Or historical corpora • The Archer Corpus is a multi-genre corpus of British and American English covering 1650-1999: http://www.llc.manchester.ac.uk/research/projects/archer/. • The same place you go to get COCA will give you COHA, the Corpus of Historical American English • And the Google Books NGram is a great resource for this, too • Or non-English corpora (see previous slide and our corpus blog for more) • All of these exist and are worthy of your attention!