CIS392 Text Processing, Retrieval, and Mining, Spring 03. Instructor: Dr. Y. F. Brook Wu. BOW toolkit: http://www.cs.cmu.edu/~mccallum/bow. Log in to AFS. On campus: go to a computer lab in GITC 2305. At home: make sure your internet connection is established.
CIS392 Text Processing, Retrieval, and Mining
Spring 03
Instructor: Dr. Y. F. Brook Wu
BOW toolkit: http://www.cs.cmu.edu/~mccallum/bow
Assign#1
Log in to AFS
• On campus: go to a computer lab in GITC 2305.
• At home: make sure your internet connection is established.
• These steps assume you have Windows at home. Click Start, then Run.
• Type “telnet afs1.njit.edu” (without the quotes). The first screen shows some useful information.
• Enter your user name and password.
• If your account doesn’t work: call the help desk at 973.596.2900; they can reset your password for you.
Useful UNIX commands
• Note: All filenames and commands on a UNIX system are case sensitive.
• General syntax: command [option] argument
• Options modify the way a command works; they are optional.
• Arguments are usually files; sometimes they are optional too.
• Ex: rm -r directory_name
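The general syntax can be tried out safely in a scratch directory. The following is a minimal sketch; the names work, demo_dir, and notes.txt are made up for illustration and are not part of the assignment.

```shell
# Work in a throwaway directory so nothing important is touched.
work=$(mktemp -d)
mkdir "$work/demo_dir"
touch "$work/demo_dir/notes.txt"

# Command only: no option, no argument.
pwd

# Command + argument: list the contents of the scratch directory.
ls "$work"

# Command + option + argument: -l asks ls for the long listing format.
ls -l "$work/demo_dir"

# Command + option + argument: -r lets rm remove a directory and its contents.
rm -r "$work/demo_dir"
```

After the last command, demo_dir and the file inside it are gone, while the scratch directory itself remains.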
Note
• MS PowerPoint may turn two consecutive hyphens (“--”) into a single long dash, so the BOW and UNIX commands in these slides can look different from what you must type. Please refer to the BOW help file and the UNIX documentation for the actual usage.
Useful UNIX commands
• man (manual), ex: man ls (shows the manual for the ls command)
• cd (change directory)
• ls (list files and attributes)
• dir (list files)
• mkdir (create a directory)
• rm (delete a file)
• rm -fr directory_name (delete the whole directory and the files inside it)
Useful UNIX commands
• rmdir (remove an empty directory)
• cp (copy)
• pwd (print the current working directory)
• pico (a text editor)
• more filename (read a plain text file one screen at a time; press the space bar to continue and “q” to quit)
• quota (show disk space usage and limits)
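A short practice session ties several of these commands together. This is only a sketch: the directory and file names (practice, empty_dir, a.txt, b.txt) are hypothetical, and everything happens inside a temporary directory.

```shell
work=$(mktemp -d)        # scratch directory for practice
cd "$work"
pwd                      # print the current working directory

mkdir practice           # create a directory
cd practice              # go inside it
echo "hello" > a.txt     # create a small file to work with
cp a.txt b.txt           # copy a.txt to b.txt
ls                       # list files: a.txt and b.txt
rm a.txt                 # delete one file

mkdir empty_dir          # rmdir only works on empty directories
rmdir empty_dir

cd ..                    # go back up one level
rm -fr practice          # delete the directory and everything in it
```

Commands such as man, pico, and more are interactive, so they are left out of this sketch; try them directly at the prompt.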
More useful UNIX commands
• http://www.njit.edu/CSD/Docs/unixcmds.html
• http://www.njit.edu/Directory/Admin/CSD/Academic_Computing/Manuals/UNIX/UNIX.html
How to create your home page on the AFS system
• Help info: http://www-ec.njit.edu/ec_info/newuser/web/web.html
• Execute this command at the UNIX prompt: /usr/ec/bin/home.page.setup
• Your URL: http://www-ec.njit.edu/~yourusername
Overview of Retrieval Experiment
• Create a sub-directory for CIS392 assignments under ~your_user_name/public_html.
• Create 3 sub-directories under that directory, one for each of the 3 automatic indexing activities.
• Perform the 3 automatic indexing activities, each with different options.
Overview of Retrieval Experiment (cont.)
• Perform 3 retrievals for each of the 3 automatic indexing activities.
• Analyze how the different indexing options affect retrieval.
• Make an HTML page to present your results.
Creating sub-directories
• Change directory to public_html by typing: cd public_html
• mkdir cis392 (creates a directory for your CIS392 retrieval assignments)
• cd cis392 (go inside the cis392 directory)
Creating three sub-directories
• mkdir model1 (stores results from the default settings: no stemming, stop words removed)
• mkdir model2 (stores results from these settings: no stemming, stop words INCLUDED)
• mkdir model3 (stores results from these settings: stemming, stop words removed)
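The directory setup can also be done in one go. The sketch below assumes a variable BASE that stands in for your public_html directory; it defaults to a temporary directory so the commands can be tried without touching your real web space.

```shell
# BASE is a stand-in for ~/public_html (an assumption for this sketch).
# On AFS you would run:  BASE="$HOME/public_html"  before these commands.
BASE="${BASE:-$(mktemp -d)}"

cd "$BASE"
mkdir -p cis392          # -p: no error if the directory already exists
cd cis392

# One sub-directory per indexing model.
for model in model1 model2 model3; do
    mkdir -p "$model"
done

ls                       # should list model1, model2, and model3
```

The loop is equivalent to typing the three mkdir commands one after another.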
URL of your retrieval experiment
• http://www-ec.njit.edu/~yourusername/cis392/cis392re.html
• See a sample page created by Prof. Wu: http://www-ec.njit.edu/~wu/cis392/cis392re.html
Getting Access to BOW and Test Collection
• There are three directories under ~wu/IR_Tools:
• bow (the BOW system); to execute BOW, change directory to ~wu/IR_Tools/bow/bin
• som (the self-organizing map program; do NOT use it now!)
• tc (the test collection, Library and Information Science Abstracts); the text is under ~wu/IR_Tools/tc/lisa/text/group0 to group5
Test Collection: LISA
• The sample queries are stored in ~wu/IR_Tools/tc/lisa/LISA.QUE
• The relevant documents corresponding to the queries are stored in ~wu/IR_Tools/tc/lisa/LISA.REL (“-1” marks the end of each entry.)
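Once a retrieval has been run, the relevance judgments tell you how many of the retrieved documents are actually relevant. The sketch below assumes you have already copied the document numbers by hand into two plain files with one number per line; retrieved.txt, relevant.txt, and the numbers in them are hypothetical and do not reflect the real LISA.REL or arrow output formats.

```shell
work=$(mktemp -d)

# Hypothetical document numbers, one per line (NOT the real file formats).
printf '%s\n' 12 47 103 256 > "$work/retrieved.txt"
printf '%s\n' 47 256 900    > "$work/relevant.txt"

# Count retrieved lines that exactly match a line in the relevant list:
#   -f FILE  read the patterns from FILE
#   -F       treat patterns as fixed strings, not regular expressions
#   -x       match whole lines only (so 47 does not match 147)
#   -c       print the number of matching lines
hits=$(grep -c -x -F -f "$work/relevant.txt" "$work/retrieved.txt")

echo "relevant documents among the retrieved ones: $hits"
```

Repeating the count for each of the three models gives a simple basis for comparing the indexing options.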
Operating Arrow of BOW
• Read the information on BOW’s web site (again, the URL is listed in the “Resources” section of the class syllabus).
• Read Arrow’s help file (available on the syllabus page; you should print a copy of the help file).
Automatic Indexing
• To begin the retrieval tasks, you first need to index the whole document collection.
• Specify the lexing options (stop word removal and/or stemming) at this time.
• arrow -d ~yourusername/public_html/cis392 --index ~wu/IR_Tools/tc/lisa/text/*
• The * sign is a wildcard that represents all files and directories under ~wu/IR_Tools/tc/lisa/text
Automatic Indexing
• The -d parameter specifies where the statistics resulting from indexing are stored. (You have to specify this directory both when you index and when you retrieve documents.)
• The path after --index specifies the location of the text collection.
• The default lexing settings of the above task are: NO stemming performed, and stop words REMOVED.
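The three indexing runs differ only in their lexing options and in the directory given to -d. The sketch below is a dry run: it prints the commands instead of executing them. It rests on several assumptions that you should check against the Arrow help file: that each model is indexed into its own sub-directory, and that the option names --no-stoplist and --use-stemming are the ones arrow accepts. The function name index_command and the user name are placeholders.

```shell
# Placeholder user name; replace with your own AFS login.
user="yourusername"
# Quoted so that neither ~ nor * is expanded while we only print the command.
text="~wu/IR_Tools/tc/lisa/text/*"

# Print (do not run) the indexing command for one model.
index_command() {
    model=$1
    case "$model" in
        model1) opts="" ;;                # defaults: no stemming, stop words removed
        model2) opts="--no-stoplist" ;;   # ASSUMED option: keep stop words
        model3) opts="--use-stemming" ;;  # ASSUMED option: turn stemming on
        *) echo "unknown model: $model" >&2; return 1 ;;
    esac
    # ${opts:+$opts } adds the option and a space only when opts is not empty.
    echo "arrow -d ~$user/public_html/cis392/$model ${opts:+$opts }--index $text"
}

for m in model1 model2 model3; do
    index_command "$m"
done
```

Once the option names are confirmed, the printed lines can be typed at the prompt from ~wu/IR_Tools/bow/bin.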
Query assigned for retrieval
• Please refer to the retrieval experiment section of the online syllabus to see which query you get for the experiment (http://web.njit.edu/~wu/teaching/sp03/CIS392/CIS392-Sp03.htm).
Retrieval
• First specify where the indexing statistics are stored, and then the query to be performed.
• arrow -d ~yourusername/public_html/cis392/model1 --num-hits-to-show=25 --query > ~yourusername/public_html/cis392/model1/retrieved_docs
• The greater-than sign (>) redirects the output to the file named after it; the path specifies the output filename and where it is stored.
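Output redirection works the same way for any command, so it can be practiced without arrow. In this sketch, echo stands in for the retrieval command and the output path is a hypothetical file in a temporary directory.

```shell
out_dir=$(mktemp -d)              # stand-in for ~/public_html/cis392/model1
out="$out_dir/retrieved_docs"

# Without >, the text appears on the screen.
echo "document 1"

# With >, nothing appears on the screen; the text goes into the file.
# If the file already exists, > overwrites it.
echo "document 1" > "$out"

# Display the file to confirm what was saved.
cat "$out"
```

Because > overwrites, rerunning a retrieval with the same output path replaces the earlier results.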
Presenting your RE
• Create a page named cis392re.html under your ~/public_html/cis392 directory.
• This page should contain several pieces of information; see: http://web.njit.edu/~wu/cis392/cis392re.html
Presenting your RE
• You can create this HTML page with the pico editor in UNIX (if you know basic HTML tags), Microsoft Word (save the file in HTML format), or Netscape Composer.
• If you use an HTML editor, you might need FTP software. http://www.zdnet.com/downloads/stories/info/0,10615,30994,00.html
• Before the due date: please check all items on your HTML page and make sure all of them are displayed properly.
• After the due date: do not make changes. I can check when the files were last updated.
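If you prefer to stay at the UNIX prompt, a starting page can be written with a here-document instead of an editor. This is only a skeleton under stated assumptions: the variable page_dir stands in for ~/public_html/cis392, and the page contents and link targets are placeholders. The sample page linked above defines what your page must actually contain.

```shell
# page_dir is a stand-in for ~/public_html/cis392 (an assumption for this sketch).
page_dir="${page_dir:-$(mktemp -d)}"

# Everything between <<'EOF' and EOF is written to the file unchanged.
cat > "$page_dir/cis392re.html" <<'EOF'
<html>
<head><title>CIS392 Retrieval Experiment</title></head>
<body>
<h1>CIS392 Retrieval Experiment</h1>
<ul>
  <li><a href="model1/retrieved_docs">Model 1: no stemming, stop words removed</a></li>
  <li><a href="model2/retrieved_docs">Model 2: no stemming, stop words included</a></li>
  <li><a href="model3/retrieved_docs">Model 3: stemming, stop words removed</a></li>
</ul>
</body>
</html>
EOF

ls -l "$page_dir/cis392re.html"   # the timestamp shows when the file was last updated
```

You can then open the file in pico to add your analysis.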