WordSmith Tools. Stella E. O. Tagnin - USP. Corpus Linguistics, Translation and Terminology. New Technologies in Translation - CAPES, Universitat Rovira i Virgili-Universidade de São Paulo, Tarragona, July 8-11, 2008. How to use WordSmith Tools to investigate a corpus.
First: download the demo version of WordSmith Tools 5.0 from Mike Scott’s site: http://www.lexically.net/wordsmith/version5/index.html
Name: Tarragona and USP
Other Details:
Registration: SA00.3461.2978.3904.6880.9VVB
When "Updating from Demo", please paste these details in EXACTLY as you see them here.
Please see "readme.txt" for any further details.
-- Mike Scott
WordSmith Tools. WordList: • S = General Statistics • F = Frequency • A = Alphabetical. KeyWords: • Study Corpus vs Reference Corpus. Concord: • KWIC = Key Word In Context • Collocates • Clusters
WordList • S = General Statistics: overview of the corpus and of each text • F = Frequency: the most frequent words may point to the topic • A = Alphabetical: makes lemmatizing easier
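The three WordList views can be approximated outside the tool. The sketch below is only an illustration in Python, not how WordSmith Tools itself works: the tokenizer (lowercased runs of the letters a-z) and the names `tokenize` and `word_list` are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplifying assumption: a word is a run of a-z letters,
    # optionally with apostrophes inside it; everything is lowercased.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def word_list(text):
    tokens = tokenize(text)
    freq = Counter(tokens)
    # S = general statistics
    stats = {
        "tokens": len(tokens),
        "types": len(freq),
        "type_token_ratio": len(freq) / len(tokens) if tokens else 0.0,
    }
    # F = frequency order (ties broken alphabetically)
    by_frequency = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    # A = alphabetical order
    alphabetical = sorted(freq.items())
    return stats, by_frequency, alphabetical

sample = "The team won the cup. The teams played, and the players cheered."
stats, by_frequency, alphabetical = word_list(sample)
print(stats)
print(by_frequency[:3])
print(alphabetical)
```

In the alphabetical view, related forms such as "team" and "teams" end up next to each other, which is why that view helps with lemmatizing.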
WordList - Statistics: identifying peculiarities of • the corpus (Overall) • each text
Frequency WordList • Hint as to topic • Survey of most recurrent words in text/corpus
Alphabetical WordList • Spotting words • Lemmatizing word forms
KeyWords • Identifying prevailing vocabulary • Study Corpus vs Reference Corpus
Keywords
 N  WORD        FREQ.  WCUPING.LST %  FREQ.  REFENG2.LST %  KEYNESS  P
 1  CUP         1.024           0,77      1                 3.291,6  0,000000
 2  WORLD       1.197           0,90    301           0,06  2.496,9  0,000000
 3  TEAM          575           0,43     48                 1.538,4  0,000000
 4  GAME          486           0,36     22                 1.396,6  0,000000
 5  HIS           714           0,53    257           0,05  1.296,2  0,000000
 6  GERMANY       435           0,33     14                 1.284,9  0,000000
 7  SOCCER        374           0,28      0                 1.206,4  0,000000
 8  HE            778           0,58    429           0,08  1.130,6  0,000000
 9  ITALY         332           0,25      5                 1.021,0  0,000000
10  SAID          670           0,50    343           0,06  1.017,6  0,000000
11  WAS           892           0,67    716           0,13    987,2  0,000000
12  PLAYERS       337           0,25     15                   969,6  0,000000
13  GOAL          352           0,26     51                   851,9  0,000000
14  BALL          260           0,19      2                   815,9  0,000000
15  IN          3.214           2,40  7.019           1,31    761,0  0,000000
16  COACH         229           0,17      0                   738,5  0,000000
17  TOURNAMENT    205           0,15      0                   661,0  0,000000
18  SPORTS        234           0,18     13                   658,5  0,000000
19  PLAY          264           0,20     37                   643,5  0,000000
20  FRANCE        208           0,16      6                   618,7  0,000000
21  FANS          193           0,14      1                   610,2  0,000000
22  MATCH         265           0,20     49                   604,4  0,000000
23  MINUTE        206           0,15      9                   593,5  0,000000
24  BRAZIL        209           0,16     19                   551,6  0,000000
25  WIN           193           0,14     15                   521,2  0,000000
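A keyness score compares a word's frequency in the study corpus with its frequency in the reference corpus, relative to the size of each. The sketch below assumes the log-likelihood statistic, which is one of the measures WordSmith Tools offers. The function name `log_likelihood` and the counts in the usage lines are hypothetical, and the corpus sizes behind the table above are not given on the slide.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Signed log-likelihood keyness for one word.

    freq_*: occurrences of the word in each corpus
    size_*: total running words in each corpus (must be > 0)
    """
    total = freq_study + freq_ref
    # Frequencies expected if the word were equally common in both corpora
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    ll *= 2
    # A negative sign marks a word that is relatively rarer in the study corpus
    if freq_study / size_study < freq_ref / size_ref:
        ll = -ll
    return ll

# Hypothetical counts, for illustration only
print(log_likelihood(50, 1000, 5, 1000))   # positive keyword
print(log_likelihood(5, 1000, 50, 1000))   # negative keyword
```

A word with the same relative frequency in both corpora scores 0; the further the score is from 0, the more "key" the word is.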
Comparing 2 WordLists • Positive keywords (vocabulary unusually frequent in the study corpus) • Negative keywords (vocabulary unusually infrequent or absent in the study corpus)
... and vocabulary that does NOT occur • Negative keywords
Compiling a Glossary Selecting Terms • Keywords – term candidates (terminology) • Concord - context • Collocates • Clusters – multiword combinations, not necessarily terms or phrases
Concord • KWIC = Key Word In Context • Collocates • Clusters
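A KWIC display lines up every occurrence of the search word with a fixed amount of context on each side. The following is a rough sketch of the idea, not Concord's implementation: the helper names `kwic` and `format_kwic` and the character-based context window are assumptions.

```python
import re

def kwic(text, node, width=30):
    """Return (left context, match, right context) for each hit of `node`."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()].strip()
        right = flat[m.end():m.end() + width].strip()
        hits.append((left, m.group(0), right))
    return hits

def format_kwic(hits, width=30):
    # Right-align the left context so the search word forms a column
    return [f"{left:>{width}}  {node}  {right}" for left, node, right in hits]

sample = "Brazil won the World Cup. The cup was lifted by the captain."
hits = kwic(sample, "cup", width=12)
for line in format_kwic(hits, width=12):
    print(line)
```

Because the window is counted in characters, context may be cut in the middle of a word, as it often is in concordance displays.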
Identifying patterns Context • Concordance lines • Lexical patterns – collocations • Grammatical patterns – colligations
Collocates • Position of most frequent co-occurring words
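The collocate view counts which words occur at each position to the left (L1, L2, ...) and right (R1, R2, ...) of the search word. A minimal sketch of that count, assuming a simple a-z tokenizer and a hypothetical function name `collocates`:

```python
import re
from collections import Counter, defaultdict

def collocates(tokens, node, span=2):
    """Count words at each position from L<span> to R<span> around `node`."""
    table = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            # Skip the node itself and positions outside the text
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            position = f"L{-offset}" if offset < 0 else f"R{offset}"
            table[tokens[j]][position] += 1
    return table

text = "The World Cup final and the World Cup winners"
tokens = re.findall(r"[a-z]+", text.lower())
table = collocates(tokens, "cup", span=2)
for word, positions in table.items():
    print(word, dict(positions))
```

Here "world" is counted twice at L1, which is the kind of positional pattern the Collocates tab makes visible.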
Compiling a Glossary Selection of Terms • Keywords • Clusters
3-word clusters
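A 3-word cluster is simply a sequence of three consecutive words that recurs in the corpus. The sketch below counts such sequences. It is an illustration under simplifying assumptions (a-z tokenizer, sentence boundaries ignored), and the name `clusters` and the `min_freq` parameter are hypothetical.

```python
import re
from collections import Counter

def clusters(text, size=3, min_freq=2):
    """Return (cluster, frequency) pairs for recurring word sequences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    grams = Counter(
        " ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)
    )
    # Keep only sequences that occur at least `min_freq` times
    return [(gram, n) for gram, n in grams.most_common() if n >= min_freq]

sample = "The World Cup final. Fans watched the World Cup final in Berlin."
result = clusters(sample, size=3, min_freq=2)
print(result)
```

Recurring clusters are candidates for multiword terms, but each one still has to be checked in context.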
WordList • With more than one word per entry (clusters) • Settings > Tab List > WordList Tab > Clusters > Activated
How to ignore undesired text By tagging: • Title • Subtitle • Figure • Date • URL • etc.
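The effect of ignoring tagged sections can be imitated by removing them before counting. In the sketch below, the tag names and the `<tag>...</tag>` markup are assumptions chosen for illustration, as is the function name `strip_tagged`. In WordSmith Tools the corpus files stay untouched and the sections to skip are chosen in the settings.

```python
import re

# Hypothetical tag names; use whatever tags the corpus was marked up with
IGNORED_TAGS = ("title", "subtitle", "figure", "date", "url")

def strip_tagged(text, tags=IGNORED_TAGS):
    """Remove every <tag>...</tag> section for the given tag names."""
    for tag in tags:
        text = re.sub(
            rf"<{tag}>.*?</{tag}>", " ", text,
            flags=re.DOTALL | re.IGNORECASE,
        )
    # Collapse the whitespace left behind by the removed sections
    return " ".join(text.split())

raw = "<title>World Cup 2006</title> Italy beat France. <url>http://example.org</url>"
cleaned = strip_tagged(raw)
print(cleaned)
```

Only the running text then contributes to word lists, keywords and concordances.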
Adjusting Settings Controller • Settings • Adjust Settings • Only part of file