Introduction to Corpora@Stanford Florian Jaeger, tiflo@stanford.edu For the Methods class, December 3rd, 2003
Some basic questions • Where are our corpora? Where is the software? • Is there a list of all the stuff we have? • How can I access the software? • Where do I start? What information is available where? • Are there tutorials for the available software? • What kind of corpus work is supported at Stanford? • Corpora are only for those computational folks … ;-) • And the most important question:
Why bother at all … • Because our (ad hoc) intuitions are often wrong – linguistic methodology is … • well, let’s not go there. • While corpora have a lot of drawbacks (no negative evidence, genre-specific, etc.), they offer a lot of opportunities. • To illustrate my point, a little case study …
Hagit Borer: “Some notes on the Syntax and Semantics of Quantity” Talk for the Sem. Workshop, 10/31/2002 • Claim: “The interpretation of bare plurals does not, actually, consist of any subset of (well-defined) singulars.” • 0.5 apples/apple • 1.0 apples/apple • 1.5 apples/apple • zero apples/apple
Hagit Borer: “Some notes on the Syntax and Semantics of Quantity” Talk for the Sem. Workshop, 10/31/2002 • Hagit Borer’s judgments: • 0.5 apples/*apple • 1.0 apples/*apple • 1.5 apples/*apple • zero apples/*apple
Hagit Borer: “Some notes on the Syntax and Semantics of Quantity” Talk for the Sem. Workshop, 10/31/2002 • Google’s count: • 0.5 apples (120)/*apple (179) • 1.0 apples (42)/*apple (23,600) • 1.5 apples (59)/*apple (362) • zero apples (194)/*apple (124) • This also makes some of the problems clear, so let’s take pears
Hagit Borer: “Some notes on the Syntax and Semantics of Quantity” Talk for the Sem. Workshop, 10/31/2002 • Google’s count: • 0.1 pears (32)/*pear (118) • 0.5 pears (37)/*pear (50) • 0.7 pears (9)/*pear (14) • 1.0 pears (14)/*pear (24,000) • 1 pears (14)/?pear (7,480) • One pears (1,130)/?pear (3,060) • 1.5 pears (28)/*pear (316) • zero pears (3)/*pear (0) • Conclusion: • It is amazing how many programs or computer products use fruit names. • The original judgments seem questionable. • BUT: can we trust Google?
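The pear counts are easier to compare as proportions. The following is a minimal sketch that uses only the hit counts quoted on the slide; the names `PEAR_COUNTS` and `plural_share` are illustrative, and raw web hit counts carry all the caveats just raised.

```python
# Google hit counts copied from the slide:
# quantifier -> (hits with "pears", hits with "pear")
PEAR_COUNTS = {
    "0.1": (32, 118),
    "0.5": (37, 50),
    "0.7": (9, 14),
    "1.0": (14, 24000),
    "1": (14, 7480),
    "one": (1130, 3060),
    "1.5": (28, 316),
    "zero": (3, 0),
}


def plural_share(plural_hits, singular_hits):
    """Return the fraction of hits using the plural, or None if there are no hits."""
    total = plural_hits + singular_hits
    if total == 0:
        return None
    return plural_hits / total


if __name__ == "__main__":
    for quantifier, (plural, singular) in PEAR_COUNTS.items():
        share = plural_share(plural, singular)
        print(f"{quantifier:>5}: pears={plural:>5} pear={singular:>6} plural share={share:.2f}")
```

In these counts the singular outnumbers the plural for every quantifier except "zero", which is the opposite of the starred judgments.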
Looking for a corpus • There are several sites on the web that can help you to find out if what you are looking for exists: • Databases like David Lee’s site (see also our Top 10 list) • The LDC database • Our list of corpora (next page) • Email lists, see our site under ‘Support’ • Local: corpora@csli.stanford.edu • Global: MAJORDOMO@UIB.NO
Types of corpora • Different languages • Different media (speech, video, text) • Different levels of annotation • No annotation • Transcribed speech or video • Sociological annotation (gender of speaker, average age of audience, dialect of speaker, etc.) • Discourse and textual information (publication date, number of discourse participants, discussion panel vs. novel, etc.) • Linguistic annotation (phonemes, prosody, syntax, morpho-syntax, lexemes, phonological segments & syllables, etc.)
Looking for a specific corpus • List of available corpora • If the corpus is on AFS • If the corpus is on the Corpus Computer • If the corpus is on CD • If the corpus is on the WWW • If the corpus has special license conditions • If we don’t have the corpus
Tools & software • General • Where to start: • Local online tutorials (see also external references and manuals) • The corpus TA • corpora@csli.stanford.edu • Little helpers
A brief look at some tools • BNC Web • Problem: Superiority “who the hell …” • Problem: Distribution of “… is like …” – age dependent? • General information • Age (easy export to e.g. Excel) • Crosstabs • TGrep2 and TGrep • Tutorial • Examples: • tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. NP)' • tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. PP-DTV)' • tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ NP)' • tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ PP-DTV)'
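To make the first TGrep2 pattern concrete, here is a rough Python analogue of `VP < (NP $. NP)`. It is only a sketch of what the pattern asks for, not how TGrep2 works internally; the toy tree, the tuple encoding, and all function names are assumptions made for illustration.

```python
# A node is (label, child, child, ...); a word is a plain string.
TREE = ("S",
        ("NP", "Kim"),
        ("VP",
         ("VBD", "gave"),
         ("NP", "Sandy"),
         ("NP", ("DT", "a"), ("NN", "book"))))


def label(node):
    """Return the category label of a node, or the word itself for a leaf."""
    return node[0] if isinstance(node, tuple) else node


def children(node):
    """Return the list of children (empty for a word)."""
    return list(node[1:]) if isinstance(node, tuple) else []


def subtrees(node):
    """Yield the node and everything below it, depth first."""
    yield node
    for child in children(node):
        yield from subtrees(child)


def has_adjacent_np_pair(vp):
    """True if some NP child is immediately followed by an NP sister (NP $. NP)."""
    kids = children(vp)
    return any(label(a) == "NP" and label(b) == "NP"
               for a, b in zip(kids, kids[1:]))


def double_object_vps(tree):
    """Collect every VP that has an NP child directly followed by an NP sister."""
    return [n for n in subtrees(tree)
            if label(n) == "VP" and has_adjacent_np_pair(n)]


if __name__ == "__main__":
    for vp in double_object_vps(TREE):
        print(vp)
```

Swapping the second `"NP"` test for `"PP-DTV"` would give the analogue of the second example pattern.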
Note: Tgrep is right-headed • The following pattern matches an S that has a child A and another child C, where the A has a child B: • S < (A < B) < C • However, this pattern means that S has a child A and that A has children B and C: • S < ((A < B) < C) • It is equivalent to this: • S < (A < B < C)
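The difference between the two groupings can be checked on two toy trees. This is a hedged sketch in plain Python, not TGrep2 itself; the tuple encoding and the helper names (`has_child`, `pattern_flat`, `pattern_nested`) are assumptions for illustration.

```python
# A node is (label, child, child, ...); ("B",) is a node without children.
TREE_C_UNDER_S = ("S", ("A", ("B",)), ("C",))
TREE_C_UNDER_A = ("S", ("A", ("B",), ("C",)))


def has_child(node, wanted, condition=lambda child: True):
    """True if node has a child labelled `wanted` that also satisfies `condition`."""
    return any(child[0] == wanted and condition(child) for child in node[1:])


def pattern_flat(node):
    """S < (A < B) < C : both links hang off S, so C must be a child of S."""
    return (node[0] == "S"
            and has_child(node, "A", lambda a: has_child(a, "B"))
            and has_child(node, "C"))


def pattern_nested(node):
    """S < (A < B < C) : both inner links hang off A, so C must be a child of A."""
    return (node[0] == "S"
            and has_child(node, "A",
                          lambda a: has_child(a, "B") and has_child(a, "C")))


if __name__ == "__main__":
    for tree in (TREE_C_UNDER_S, TREE_C_UNDER_A):
        print(tree, pattern_flat(tree), pattern_nested(tree))
```

Each pattern accepts exactly one of the two trees, which is why the placement of the parentheses matters.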
Some more TGrep2 syntax • A < B A is the parent of (immediately dominates) B. • A > B A is the child of B. • A <N B B is the Nth child of A (the first child is <1). • A >N B A is the Nth child of B (the first child is >1). • A <, B Synonymous with A <1 B. • A >, B Synonymous with A >1 B. • A <-N B B is the Nth-to-last child of A (the last child is <-1). • A >-N B A is the Nth-to-last child of B (the last child is >-1). • A <- B B is the last child of A (synonymous with A <-1 B). • A >- B A is the last child of B (synonymous with A >-1 B). • A <` B B is the last child of A (also synonymous with A <-1 B). • A >` B A is the last child of B (also synonymous with A >-1 B). • A <: B B is the only child of A • A >: B A is the only child of B • A << B A dominates B (A is an ancestor of B).
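A few of the child-position links above, restated as Python predicates over a toy tree. This is an illustrative sketch only: the tuple encoding and the function names are assumptions, and nodes are compared by identity, so the toy tree should not contain two identical subtrees.

```python
# A node is (label, child, child, ...); a word is a plain string.
VP = ("VP",
      ("VBD", "gave"),
      ("NP", "Sandy"),
      ("NP", ("DT", "a"), ("NN", "book")))


def children(node):
    """Return the list of children (empty for a word)."""
    return list(node[1:]) if isinstance(node, tuple) else []


def is_nth_child(parent, child, n):
    """A <N B for n > 0 (counting from the left), A <-N B for n < 0 (from the right)."""
    if n == 0:
        raise ValueError("positions start at 1 (or -1 from the right)")
    kids = children(parent)
    index = n - 1 if n > 0 else n
    return -len(kids) <= index < len(kids) and kids[index] is child


def is_only_child(parent, child):
    """A <: B : child is the single child of parent."""
    kids = children(parent)
    return len(kids) == 1 and kids[0] is child


def dominates(ancestor, node):
    """A << B : node appears somewhere below ancestor."""
    return any(kid is node or dominates(kid, node) for kid in children(ancestor))


if __name__ == "__main__":
    verb, last_np = VP[1], VP[3]
    print(is_nth_child(VP, verb, 1))      # VP <1 VBD (same as VP <, VBD)
    print(is_nth_child(VP, last_np, -1))  # VP <-1 NP (same as VP <- NP)
    print(is_only_child(verb, verb[1]))   # VBD <: gave
    print(dominates(VP, last_np[2]))      # VP << NN
```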
Some more TGrep2 syntax • A >> B A is dominated by B (A is a descendant of B). • A <<, B B is a left-most descendant of A. • A >>, B A is a left-most descendant of B. • A <<` B B is a right-most descendant of A. • A >>` B A is a right-most descendant of B. • A <<: B There is a single path of descent from A and B is on it. • A >>: B There is a single path of descent from B and A is on it. • A . B A immediately precedes B. • A , B A immediately follows B. • A .. B A precedes B. • A ,, B A follows B. • A $ B A is a sister of B (and A ≠ B). • A $. B A is a sister of and immediately precedes B. • A $, B A is a sister of and immediately follows B. • A $.. B A is a sister of and precedes B. • A $,, B A is a sister of and follows B. • A = B The node matched by A is also matched by B.
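The precedence links (`.` and `..`) can be pictured in terms of word positions. The sketch below assumes the usual terminal-based reading, in which A immediately precedes B when the last word under A sits directly before the first word under B; the toy tree, the tuple encoding, and the function names are illustrative assumptions.

```python
# A node is (label, child, child, ...); a word is a plain string.
TREE = ("S",
        ("NP", "Kim"),
        ("VP",
         ("VBD", "gave"),
         ("NP", "Sandy"),
         ("NP", ("DT", "a"), ("NN", "book"))))


def leaf_count(node):
    """Number of words under a node (a word counts as one)."""
    if not isinstance(node, tuple):
        return 1
    return sum(leaf_count(child) for child in node[1:])


def find_span(root, target, start=0):
    """Return (first, last) word positions covered by target inside root, or None."""
    if root is target:
        return (start, start + leaf_count(root) - 1)
    if isinstance(root, tuple):
        position = start
        for child in root[1:]:
            found = find_span(child, target, position)
            if found is not None:
                return found
            position += leaf_count(child)
    return None


def precedes(root, a, b):
    """A .. B : every word of a comes before every word of b."""
    span_a, span_b = find_span(root, a), find_span(root, b)
    return span_a is not None and span_b is not None and span_a[1] < span_b[0]


def immediately_precedes(root, a, b):
    """A . B : the last word of a is directly followed by the first word of b."""
    span_a, span_b = find_span(root, a), find_span(root, b)
    return span_a is not None and span_b is not None and span_a[1] + 1 == span_b[0]


if __name__ == "__main__":
    vp = TREE[2]
    verb, first_np, second_np = vp[1], vp[2], vp[3]
    print(immediately_precedes(TREE, verb, first_np))   # VBD . NP
    print(immediately_precedes(TREE, verb, second_np))  # not adjacent
    print(precedes(TREE, verb, second_np))              # VBD .. NP
```

Note that `.` does not require the two nodes to be sisters: the subject NP immediately precedes the VBD here although they have different parents, which is what distinguishes `.` from `$.`.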
The alternative with windows • TigerSearch 2.1; screen shots: • Grammar search • Collocation search
The end my friends • Want to help? • The website can always use additions (short blurbs about software, your opinion about the user-friendliness of a certain web interface, etc.) • Tschuessi!