INFO 340 Information Retrieval Words in a Language – Cardinality and Enumeration
What do we know about words in a language? • Consider some language and all of the words in that language • We can count them • We can identify the probability of occurrence of any particular word • We can categorize them according to some scheme • We can order them according to some scheme
Body of Words Reference Point • Reference point for these words – • A ‘corpus’ • A particular body of text • Usually documents of text • What is a document? • Wiki says (at least on Feb 5, 2009): • “A document (noun) is a bounded physical representation of body of information designed with the capacity (and usually intent) to communicate.”
Corpus • documents within a corpus • text (words) within documents
Enumerating a Corpus • We can count: • the total # of words within a corpus • the total # of documents within a corpus • the number of occurrences of any particular word within a document • the number of occurrences of any particular word in the entire corpus • how often certain words are in proximity to each other, e.g., word X is Y words from word Z
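The counts listed above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the two-document corpus, the `tokenize` rule, and the `within_distance` helper are all hypothetical and chosen only to keep the example small.

```python
from collections import Counter
import re

# Hypothetical toy corpus (not from the slides): document id -> text.
corpus = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
}

def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

tokens_by_doc = {doc_id: tokenize(text) for doc_id, text in corpus.items()}

# Total number of documents and of words in the corpus.
total_documents = len(tokens_by_doc)
total_words = sum(len(tokens) for tokens in tokens_by_doc.values())

# Occurrences of each word within each document, and in the entire corpus.
counts_by_doc = {doc_id: Counter(tokens) for doc_id, tokens in tokens_by_doc.items()}
corpus_counts = Counter()
for counts in counts_by_doc.values():
    corpus_counts.update(counts)

def within_distance(tokens, word_x, word_z, max_gap):
    """Count position pairs where word_x is at most max_gap words from word_z."""
    pos_x = [i for i, t in enumerate(tokens) if t == word_x]
    pos_z = [i for i, t in enumerate(tokens) if t == word_z]
    return sum(1 for i in pos_x for j in pos_z if i != j and abs(i - j) <= max_gap)
```

For example, `corpus_counts["the"]` gives the corpus-wide count of "the", and `within_distance(tokens_by_doc["doc1"], "the", "cat", 1)` counts how often "the" sits directly next to "cat" in the first document.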
British National Corpus • From http://www.natcorp.ox.ac.uk/corpus/index.xml (non-link underlines added) • What is the BNC? • The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written. The latest edition is the BNC XML Edition, released in 2007. • The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text. The spoken part (10%) consists of orthographic transcriptions of unscripted informal conversations (recorded by volunteers selected from different age, region and social classes in a demographically balanced way) and spoken language collected in different contexts, ranging from formal business or government meetings to radio shows and phone-ins. • The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header. • Work on building the corpus began in 1991, and was completed in 1994. No new texts have been added after the completion of the project but the corpus was slightly revised prior to the release of the second edition BNC World (2001) and the third edition BNC XML Edition (2007). 
Since the completion of the project, two sub-corpora with material from the BNC have been released separately: the BNC Sampler (a general collection of one million written words, one million spoken) and the BNC Baby (four one-million word samples from four different genres). • Full technical documentation covering all aspects of the BNC including its design, markup, and contents are provided by the Reference Guide for the British National Corpus (XML Edition). For earlier versions of the Reference Guide and other documentation, see the BNC Archive page.
British National Corpus • From http://www.natcorp.ox.ac.uk/corpus/index.xml • What sort of corpus is the BNC? • Monolingual: It deals with modern British English, not other languages used in Britain. However non-British English and foreign language words do occur in the corpus. • Synchronic: It covers British English of the late twentieth century, rather than the historical development which produced it. • General: It includes many different styles and varieties, and is not limited to any particular subject field, genre or register. In particular, it contains examples of both spoken and written language. • Sample: For written sources, samples of 45,000 words are taken from various parts of single-author texts. Shorter texts up to a maximum of 45,000 words, or multi-author texts such as magazines and newspapers, are included in full. Sampling allows for a wider coverage of texts within the 100 million limit, and avoids over-representing idiosyncratic texts.
Top 50 most frequently occurring • From Kilgarriff’s lemmatized* lists on BNC (http://www.kilgarriff.co.uk/BNClists/lemma.num)

Rank  Frequency  Lemma
1     6187267    the
2     4239632    be
3     3093444    of
4     2687863    and
5     2186369    a
6     1924315    in
7     1620850    to
8     1375636    have
9     1090186    it
10    1039323    to
11    887877     for
12    884599     i
13    760399     that
14    695498     you
15    681255     he
16    680739     on
17    675027     with
18    559596     do
19    534162     at
20    517171     by
21    465486     not
22    461945     this
23    459622     but
24    434532     from
25    433441     they
26    426896     his
27    384313     that
28    380257     she
29    373808     or
30    372031     which
31    364164     as
32    358039     we
33    343063     an
34    333518     say
35    297281     will
36    272345     would
37    266116     can
38    261089     if
39    260919     their
40    249540     go
41    249466     what
42    239460     there
43    230737     all
44    220940     get
45    218258     her
46    217268     make
47    205432     who
48    201968     as
49    201819     out
50    195426     up

* Lemmatised list: There is a lemmatised frequency list for the 6,318 words with more than 800 occurrences in the whole 100M-word BNC. The definition of a 'word' approximates to a headword in an EFL dictionary such as Longman's Dictionary of Contemporary English: so, eg, nominal and verbal "help" are listed separately, and the count for verbal "help" is the sum of counts for verbal 'help', 'helps', 'helping', 'helped'.
Bottom 50 (least often occurring) • From Kilgarriff’s lemmatized* lists on BNC (http://www.kilgarriff.co.uk/BNClists/lemma.num)

Rank  Frequency  Lemma
6268  811        bail
6269  810        unwanted
6270  810        tight
6271  810        plausible
6272  810        midfield
6273  810        alert
6274  809        feminine
6275  809        drainage
6276  809        cruelty
6277  809        abnormal
6278  808        relate
6279  808        poison
6280  807        symmetry
6281  807        stake
6282  807        rotten
6283  807        prone
6284  807        marsh
6285  807        litigation
6286  807        curl
6287  806        urine
6288  806        latin
6289  806        hover
6290  806        greeting
6291  806        chase
6292  805        spouse
6293  805        produce
6294  805        forge
6295  804        salon
6296  804        handicapped
6297  803        sway
6298  803        homosexual
6299  803        handicap
6300  803        colon
6301  802        upstairs
6302  802        stimulation
6303  802        spray
6304  802        original
6305  802        lay
6306  802        garlic
6307  801        suitcase
6308  801        skipper
6309  801        moan
6310  801        manpower
6311  801        manifest
6312  801        incredibly
6313  801        historically
6314  801        decision-making
6315  800        wildly
6316  800        reformer
6317  800        quantum
6318  800        considering

* See the lemmatised-list note under the top-50 list.
Word Search on BNC • Results of your search • Your query was • dude • Here is a random selection of 50 solutions from the 68 found... • ABS 82 The show is hip and happening, dude: the audience looks as if it has just walked in off the King's Road, the post-modernish set is ultra-cool, the show's titles are dazzling, the best I've seen on British television. • AL3 1668 Paul Mansfield on a dude ranch finds great steaks and catfish. • AL3 1677 The most popular dude ranch states with British holidaymakers are Arizona, Colorado and Wyoming. • AL3 1692 Days on a dude ranch -- particularly after a night like that one -- tend to be restful. • AL3 1705 U.K. specialists in dude ranch holidays are Ranch America, 250 Imperial Drive, Rayners Lane, Harrow, Middx HA2 7HJ. • AL3 1707 Their 1992 programme features the Dixie Dude Ranch. • ASV 744 ‘Killer board, dude’, said Callahan. • ASV 1290 a punk Mafia dude • ASV 1338 When I mentioned his name in Hawaii an informant who asked to remain anonymous said: ‘Johnny Boy Gomes is the meanest, heaviest dude on the whole of the North Shore.’ • C87 421 Dude Power -- Makes Rufus invulnerable to aliens. • C9K 626 I'm not a great rhythm player, I don't look at myself as a great chops dude and arranging is definitely not my forte. • C9M 908 It was made by this guy who built guitars, some hippy dude, and it was the oddest-shaped guitar. • CD6 349 Amongst the news, views, product reviews and UK surf gossip, arrives Speng -- The Cool Ruler, a cartoon surfing dude who travels the world's beach breaks in the company of Dog Gorgon. • CEK 2502 11 (10) CALIFORNIA MAN: Wayne's World style comedy about a rock dude. • CGB 2016 Nutshell it for us, dude: ‘There's these two loser kids from the Valley. • CH5 4935 He'd come a long way for a dude from Texas, and it had all been so very easy for the man with the Gary Cooper smile. • …
Word Search on BNC • Results of your search • Your query was • bloody • Here is a random selection of 50 solutions from the 6818 found... • A32 105 The film appears to query the notion of heroism in its battle scenes, muddy, bloody, in marked contrast to the sunny patriotism of Laurence Olivier's wartime version. • A74 824 Bloody, bloody, bloody. • A7H 1640 ‘I got a bloody nose over that,’ he says. • A95 255 The mass demonstrations of the first year are rare these days and the army and Shin Bet security service have been chalking up success after bloody success in hunting down the masked youths who throw petrol bombs and kill collaborators. • AC2 548 At one such meeting a heckler had got a great round of cheers from the assembled throng when he had told Clasper to get off his bloody soap-box and do a day's work for a bloody change. • AC2 2312 You bloody shits….’ • B24 1996 S. H. Patrolling up Prescot Road during the war, if I saw a light on, I used to shout, ‘Put that bloody light out,’ and if nobody put the light out, we used to let fly with a brick. • B24 2614 You'd look at the sergeant and if he O. K. d it, you'd have one but if he didn't, you bloody wouldn't.’ • CA0 1427 You can't do without it, you bloody old tart, can you? • CAF 1251 In fact, Pilsudski came to power in a bloody putsch and presided over gross human-rights violations, the brutal crushing of strikes and virtual civil war with the national minorities. • CJF 2396 ‘Bloody murder, is it then? • CL2 2295 She was, in short, too bloody much, and not only that, she was totally ignoring me. • F9C 1708 Plumbers, builders, estate agents, the government, the council, bloody thieves… • FEE 2572 " You flaming women, you're so stuffed with bloody honesty it's a wonder you don't choke on it. • FP0 593 The short polar day died in bloody shadows. • FR5 366 It's so bloody easy. • FRS 26 ‘Do you know how many firms of bloody architects I've traipsed round to in the past two months?
• G1M 2346 It's bloody Piper.
Corpora Morphing Over Time From: http://courses.ischool.berkeley.edu/i202/f07/lectures/202-20071203.pdf
Enumeration leads to predictive capabilities • Given a sample of text made up from words within a particular corpus that has been enumerated, I can predict the probability that certain words will show up.
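As a rough sketch of what such a prediction looks like, the probability of a word can be estimated as its count divided by the size of the corpus. The lemma counts below are copied from the BNC top-50 list in these slides; the corpus size of exactly 100,000,000 is an assumption based on the "100 million word" description, and the function names are hypothetical.

```python
# Assumed corpus size: the BNC is described as a 100 million word collection;
# the exact token total is not given on the slides.
BNC_TOTAL_WORDS = 100_000_000

# Lemma counts copied from the BNC top-50 list in the slides.
lemma_counts = {"the": 6187267, "be": 4239632, "of": 3093444, "say": 333518}

def estimated_probability(lemma, counts=lemma_counts, total=BNC_TOTAL_WORDS):
    """Relative frequency: occurrences of the lemma divided by corpus size."""
    return counts[lemma] / total

def expected_occurrences(lemma, sample_size):
    """Expected number of times the lemma appears in a sample of sample_size words."""
    return estimated_probability(lemma) * sample_size
```

Under these assumptions, `expected_occurrences("the", 1000)` estimates how many times "the" should turn up in a 1,000-word sample drawn from text that resembles the corpus.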
Zipf’s Law • Named after George Zipf (1902 – 1950), a Harvard linguist • Empirical – created from observation • The probability of a word’s occurrence is inversely proportional to its rank
Zipf’s Law • More formally, Zipf’s law is the observation that the frequency of occurrence of some event ($P$), as a function of its rank ($i$) when the rank is determined by that frequency of occurrence, is a power-law function $P_i \sim 1/i^a$ with the exponent $a$ close to unity (1). (http://www.nslij-genetics.org/wli/zipf/)
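One way to read the power law is to write it with an explicit proportionality constant. The constant $C$ is introduced here only for illustration and is not named on the slide:

```latex
P_i = \frac{C}{i^{a}}, \qquad a \approx 1
\quad\Longrightarrow\quad
\frac{P_1}{P_i} = i^{a} \approx i
```

In the BNC lemma list shown earlier, "the" (rank 1, 6,187,267 occurrences) is about 1.46 times as frequent as "be" (rank 2, 4,239,632) and about 5.95 times as frequent as the rank-10 lemma (1,039,323), rather than exactly 2 and 10 times, so the relationship holds only approximately for real data.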
Zipf’s Law • Identify your corpus • Count the total number of words • Count the number of occurrences of each distinct word • Rank each word by frequency of occurrence • i.e. the most frequently used word has the highest rank (#1) and the least frequently used word has the lowest rank (equal to the number of distinct words in the corpus) • Plot frequency against rank
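The counting and ranking steps above can be sketched as follows. The `rank_frequency` function, its tokenizing rule, and the sample sentence are hypothetical; breaking ties alphabetically is an arbitrary choice made only so the ordering is deterministic.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first (rank 1)."""
    # Simplifying assumption: a word is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending count; ties are broken alphabetically (arbitrary choice).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]

# Hypothetical sample text standing in for a corpus.
sample = "the cat and the dog and the bird"
table = rank_frequency(sample)
```

Each tuple in `table` is one point for the plot: rank on the horizontal axis, count on the vertical axis.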
Zipf’s Law • Effect of taking the log of both axes • [Figure panels: linear scale (both axes); log scale (both axes)]
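Taking logs of both sides of a power law P_i = C / i^a gives log P_i = log C - a log i, which is a straight line with slope -a. The sketch below, under that reading, fits a least-squares slope to log10(frequency) against log10(rank). The `loglog_slope` function is a hypothetical helper; the BNC counts are the rank 1 to 10 values from the lemma list in these slides.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log10(frequency) vs. log10(rank), ranks starting at 1."""
    xs = [math.log10(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log10(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Idealised Zipf data: frequency exactly proportional to 1/rank.
ideal = [1000 / rank for rank in range(1, 11)]

# Counts for ranks 1-10 copied from the BNC lemma list in the slides.
bnc_top10 = [6187267, 4239632, 3093444, 2687863, 2186369,
             1924315, 1620850, 1375636, 1090186, 1039323]

ideal_slope = loglog_slope(ideal)
bnc_slope = loglog_slope(bnc_top10)
```

The idealised data yields a slope of -1 up to rounding error; the slope for the ten BNC counts is negative but shallower, a reminder that ten points are a very small sample and that the law is empirical.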
Zipf’s Law in Different Languages From Multilingual Statistical Text Analysis, Zipf’s Law and Hungarian Speech Generation http://www.nslij-genetics.org/wli/zipf/nemeth02.pdf
Class Exercise • Divide into groups of 5 – 6 • Each group will have one music category. Take 15 minutes to come up with 50 words that would fall predominantly into the following music genres. (Keep it clean). -- country -- classic rock -- heavy metal -- hip-hop/rap -- top 40/pop -- KEXP
Class Exercise Part II • In the same groups, name 20 words that you think would appear with particular frequency (higher than ‘predicted’ by the entire corpus of all newspapers) in the following newspapers: • The Wall Street Journal • Seattle Times • The National Enquirer • The New York Times • The Seattle PI • The Ballard Journal
Class Exercises Part III • Back to music – • Your corpus is the entire set of articles, web pages, books, interviews, etc., written by critics and reviewers of the music industry • Within each of your groups: • Quietly, identify a musical artist • Identify 10 words that you think appeared with exceptional frequency in this corpus • When done, you’ll write your 10 words on the board and the class will try to figure out who the artist is
Homework Choose 20 words and, using the BNC or Kilgarriff’s lists (http://www.kilgarriff.co.uk/BNClists/lemma.num), plot their frequency against rank on log-log paper. You can get log-log paper at the book store or print your own here: (http://incompetech.com/graphpaper/logarithmic/) Do your 20 words look Zipfian?
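If you want to check your hand plot, a small script can look up your chosen words and print their log-log coordinates. This is only a sketch: it assumes each line of the list starts with rank, frequency, and lemma separated by whitespace, as shown on the slides (any further fields are ignored), and the function names and sample lines are hypothetical.

```python
import math

def parse_lemma_lines(lines):
    """Map lemma -> list of (rank, frequency); a lemma may be listed more than once."""
    table = {}
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines (assumed format: rank frequency lemma ...).
        if len(fields) < 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        rank, frequency, lemma = int(fields[0]), int(fields[1]), fields[2]
        table.setdefault(lemma, []).append((rank, frequency))
    return table

def loglog_points(table, chosen_words):
    """Return (word, log10(rank), log10(frequency)) for each chosen word found."""
    points = []
    for word in chosen_words:
        for rank, frequency in table.get(word, []):
            points.append((word, math.log10(rank), math.log10(frequency)))
    return points

# Three lines in the format shown on the slides.
sample_lines = ["1 6187267 the", "34 333518 say", "6306 802 garlic"]
table = parse_lemma_lines(sample_lines)
points = loglog_points(table, ["the", "garlic"])
```

To use it on the full list, read the downloaded file's lines into `parse_lemma_lines` and pass your 20 words to `loglog_points`; roughly Zipfian words should fall near a straight, downward-sloping line.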
Homework (continued) • Download & install Lucene on your iSchool lab computer account (Windows side) -- • http://www.apache.org/dyn/closer.cgi/lucene/java/ • Download lucene-2.4.0.zip • Install per README.txt instructions • Set CLASSPATH for lucene-core-2.4.0.jar & lucene-demos-2.4.0.jar • Test the demo IndexFiles. (Read demo.html in docs) • Download & install Luke (lukeall.jar) @ • http://www.getopt.org/luke/