How to Tag a Corpus Using Stanford Tagger
Accuracy
• All tokens: 97.32%
• Unknown words: 90.79%
What You Need
JRE: http://www.java.com/en/download/ie_manual.jsp?locale=en
To make sure that Windows can find the Java compiler and interpreter:
• Select Start -> Computer -> System Properties -> Advanced system settings -> Environment Variables -> System variables -> PATH.
• [In Vista, select Start -> My Computer -> Properties -> Advanced -> Environment Variables -> System variables -> PATH.]
• [In Windows XP, select Start -> Control Panel -> System -> Advanced -> Environment Variables -> System variables -> PATH.]
• Prepend C:\Program Files\Java\jdk1.6.0_27\bin; to the PATH variable (adjust the folder name to match the Java version installed on your computer).
• Click OK three times.
Checking the Java (JRE) installation on your computer
• Click Start.
• Type cmd and press Enter. This opens the Command Prompt window.
• Type java -version and press Enter.
• You should see a message such as: java version "1.7.0" (or possibly an older version).
If you do not get this message, either Java was not installed correctly or Windows cannot find it on the PATH. Ask for help.
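The same check can be scripted. The following is a minimal Python sketch, not part of the original slides. It assumes Python 3 is installed, and the function name java_version_text is made up for illustration. It looks for java on the PATH and returns whatever java -version reports.

```python
import shutil
import subprocess


def java_version_text():
    """Return the text reported by `java -version`, or None if java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its report to stderr, so capture both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    text = java_version_text()
    if text is None:
        print("java was not found on PATH; check the PATH variable.")
    else:
        print(text)
```

If the function returns None, the PATH variable is the first thing to re-check.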
Install the Stanford POS Tagger
Basic English Stanford Tagger Version 3.1.3: http://nlp.stanford.edu/software/stanford-postagger-2012-07-09.tgz
Installing Basic English Stanford Tagger Version 3.1.3
• Click on the link above and download the archive (a .tgz file).
• Extract the archive to Documents using archive manager software, such as WinRAR, 7-Zip, or WinZip.
• You might want to rename the extracted folder to stanTagger. I do this because the original name is long: stanford-postagger-2012-07-09
Create a Corpus Folder
• In the stanTagger folder, create two folders to hold your files.
• I name them myCorpus and myTaggedCorpus.
• Now put some text files (or your corpus) in myCorpus.
• Make sure there are no spaces in your file names. For example, use writtenArgument.txt instead of written Argument.txt.
• Move the stanTagger folder to C: so that you can find it easily.
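The folder setup and the file-name check can also be done with a short script. This is a hedged Python sketch, not part of the original slides. The function names prepare_corpus_folders and names_with_spaces are hypothetical; the folder names myCorpus and myTaggedCorpus come from the slide above.

```python
from pathlib import Path


def prepare_corpus_folders(root):
    """Create myCorpus and myTaggedCorpus inside root and return both paths."""
    root = Path(root)
    corpus = root / "myCorpus"
    tagged = root / "myTaggedCorpus"
    corpus.mkdir(parents=True, exist_ok=True)
    tagged.mkdir(parents=True, exist_ok=True)
    return corpus, tagged


def names_with_spaces(corpus):
    """Return the names of .txt files in corpus that contain a space."""
    return sorted(p.name for p in Path(corpus).glob("*.txt") if " " in p.name)
```

For example, calling prepare_corpus_folders(r"C:\stanTagger") and then names_with_spaces on the returned myCorpus path lists the files that should be renamed before tagging.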
Tagging Files
• Start the Command Prompt window as described above.
• Go to C:\ by typing the command cd.. twice (or cd \ once).
• Go into stanTagger by typing cd stanTagger
Tagging files
• To be able to use the Stanford Tagger on every file automatically, we need to do some programming.
• We can do this with Perl or other programming languages, such as Java, PHP, Python, and so on.
• However, I found programming the Command Prompt to be the simplest and will share the code I prepared.
Tagging files
• Code to be used in Command Prompt:
• FOR %a IN (C:\stanTagger\myCorpus\*.txt) DO stanford-postagger models\left3words-wsj-0-18.tagger myCorpus\%~nxa >myTaggedCorpus\%~nxa
• You can simply copy the above code and paste it in the Command Prompt.
New Code!
• FOR %a IN (C:\stanTagger\myCorpus\*.txt) DO stanford-postagger models\wsj-0-18-left3words.tagger myCorpus\%~nxa >myTaggedCorpus\%~nxa
Newest Code!
• FOR %a IN (C:\stanTagger\myCorpus\*.txt) DO stanford-postagger models\english-left3words-distsim.tagger myCorpus\%~nxa >myTaggedCorpus\%~nxa
Each file may take about 2-3 seconds. When the run finishes, the myTaggedCorpus folder contains the tagged files.
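Since the slides mention that the same loop could be written in Python, here is a minimal sketch of that alternative. It is not the author's code. It assumes Windows, Python 3, and that the stanford-postagger command used above resolves to the stanford-postagger.bat script in the tagger folder. The function names plan_jobs and tag_corpus are hypothetical.

```python
import subprocess
from pathlib import Path


def plan_jobs(tagger_dir, model_name):
    """Build one (command, output_path) pair per .txt file in myCorpus."""
    tagger_dir = Path(tagger_dir)
    model = Path("models") / model_name
    jobs = []
    for src in sorted((tagger_dir / "myCorpus").glob("*.txt")):
        # Mirrors: stanford-postagger models\<model> myCorpus\<file>
        # Assumption: the script is stanford-postagger.bat, run through cmd /c.
        command = ["cmd", "/c", "stanford-postagger.bat",
                   str(model), str(Path("myCorpus") / src.name)]
        jobs.append((command, tagger_dir / "myTaggedCorpus" / src.name))
    return jobs


def tag_corpus(tagger_dir, model_name):
    """Run the tagger on every planned job, saving its output under the same file name."""
    for command, out_path in plan_jobs(tagger_dir, model_name):
        # The tagged text arrives on standard output, like the > redirect above.
        with open(out_path, "wb") as out:
            subprocess.run(command, cwd=tagger_dir, stdout=out, check=True)
```

A call such as tag_corpus(r"C:\stanTagger", "english-left3words-distsim.tagger") would then do the work of the one-line FOR command. Building the command list separately from running it makes it possible to print the planned jobs and inspect them before any tagging starts.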