Data Structure Project 2 Calculating Word Frequency in a Document
TA’s website & reminder • http://mpc.cs.nctu.edu.tw/forum/ • 11/6 (Thu): quiz this Thursday; 5. Threaded Binary Tree is not covered • 11/15 (Sat) 10:10~12:00: midterm exam!
About Project One… • About the extra-line problem • >> version • ifstream input(argv[1]); • while (!input.eof() && input.peek() > 0) { • input >> buf; • cout << buf; • input >> buf; • input.get(); /* consume the '\n' character */ • cout << " " << buf << endl; • }
About Project One… • getline version • ifstream input(argv[1]); • while (!input.eof()) { • input.getline(buf, 500); • if (input.gcount() > 0) /* check whether anything was actually read */ • cout << buf << endl; • } • Another one • ifstream input(argv[1]); • while (input.getline(buf, 500)) { • cout << buf << endl; • }
About Project One… • About the ^@ problem • If ^@ shows up during the demo, you wrote '\0' (i.e. 0) to the file. • From now on, a program that produces this during the demo will not pass; it will be counted as an error. • How to fix? • The most common cause is writing a buffer/string to the file without getting its length right. • int i; FILE* fw; const char *a = "123"; • fw = fopen(argv[1], "w"); • /* this does not output ^@ */ • for(i=0; i<3; i++) fprintf(fw, "%c", a[i]); • /* this does output ^@ */ • for(i=0; i<4; i++) fprintf(fw, "%c", a[i]); • fclose(fw);
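One way to avoid the off-by-one in the loop bound is to let `strlen` decide how many characters to write. The following is a minimal sketch, not the TA's reference solution; the helper name `write_without_terminator` is hypothetical.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper: writes only the visible characters of `text` to `path`.
// strlen() excludes the terminating '\0', so no ^@ ends up in the file.
// Returns the number of bytes written, or -1 if the file cannot be opened.
long write_without_terminator(const char* path, const char* text) {
    std::FILE* fw = std::fopen(path, "w");
    if (fw == nullptr) return -1;
    std::size_t len = std::strlen(text);              // length without '\0'
    std::size_t written = std::fwrite(text, 1, len, fw);
    std::fclose(fw);
    return static_cast<long>(written);
}
```

Calling `write_without_terminator(argv[1], "123")` writes three bytes; the fourth byte of the array (the `'\0'`) is never touched.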
About Project One • For a make-up demo of Project 1, first upload your code to ftp://mpc.cs.nctu.edu.tw and create a directory named after your student ID. • First demo scores: http://www.cs.nctu.edu.tw/~hhyou/ds.php
Project Two • Input: a text file and a stop words list • Using argc and argv • ./a.out stopword textfile • Output: pairs of each word and its number of occurrences • To stdout (the screen)
Project Two • Text file (without stop words) • Hello, I’m Billy, not bi|ly or 6illy or b. • Output • Hello,: 1 • I’m: 1 • Billy,: 1 • not: 1 • bi|ly: 1 • or: 2 • 6illy: 1 • b.: 1
Project Two • Text file (same) • Stop words list • and • not • or • Output • Hello,: 1 • I’m: 1 • Billy,: 1 • bi|ly: 1 • 6illy: 1 • b.: 1
Project Two • Text file • a b c d e f g h i j a b c d e • Stop words list • a b c d • Output • e:2 ; f:1 ; g:1 ; h:1 ; i:1 ; j:1
Project Two • Input • Text file • Words are separated by ' ', '\t', or '\n'. • Case sensitive. • Do and do are different words. • There are at most 2000 characters per line. • There will be no Chinese input. • A text file may contain more than one line. • There might be consecutive '\t', ' ', or '\n'. • Program execution time is limited.
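The separator rules above can be sketched with a string stream, whose `>>` operator skips runs of whitespace and therefore copes with consecutive spaces and tabs. This is an illustration only; `split_words` is a hypothetical name, and note that `>>` also treats other whitespace such as '\r' as a separator, which goes slightly beyond the three characters listed.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one input line into words.
// operator>> skips any run of whitespace, so consecutive ' ' or '\t'
// never produce empty words. Comparison stays case sensitive because
// the words are stored unchanged.
std::vector<std::string> split_words(const std::string& line) {
    std::vector<std::string> words;
    std::istringstream in(line);
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}
```

Feeding it one line at a time (for example from `std::getline`) keeps the program within the one-line-at-a-time reading rule.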
Project Two • Input • Stop words list • One word per line • No spaces or '\t' within a line • No more than 2000 characters per line • Correct • Haha • Hehe • kerker • Incorrect • 囧oo • A b
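Since the stop words list holds one word per line, it can be loaded line by line into a set. The sketch below assumes a hash set (`std::unordered_set`) is acceptable for the assignment; `load_stop_words` is a hypothetical name, and stripping a trailing '\r' is an extra precaution for files saved with Windows line endings, not something the slides require.

```cpp
#include <istream>
#include <sstream>   // only needed when feeding test input from a string
#include <string>
#include <unordered_set>

// Hypothetical helper: reads one stop word per line from `in`.
// Empty lines are ignored; a trailing '\r' is dropped (assumption:
// the file might use CRLF line endings).
std::unordered_set<std::string> load_stop_words(std::istream& in) {
    std::unordered_set<std::string> stop_words;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (!line.empty()) stop_words.insert(line);
    }
    return stop_words;
}
```

In the real program the stream would be an `std::ifstream` opened on the stop words file named on the command line; lookups such as `stop_words.count(word)` then take constant time on average.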
Project Two • Word occurrence • String + ' ' + number + '\n' • A 3 • B 5 • The order of the strings does not matter. • B 5 • A 3
Project Two • You can use any data structure to store the pair (word, occurrence), such as an array. (Watch out for the large case.) • One array for your strings, another for the occurrences • Your data structure must be fast in insertion and selection (search).
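As one possible structure that is fast for both insertion and search, a hash map from word to count works well. This is a sketch under the assumption that standard library containers are allowed; the names `WordCounts` and `count_words` are hypothetical.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical alias: maps each word to its number of occurrences.
using WordCounts = std::unordered_map<std::string, unsigned long>;

// Adds every word in `words` to `counts`.
// operator[] inserts a missing key with count 0, so ++ handles both the
// "first time seen" and "seen before" cases in average O(1) time.
void count_words(const std::vector<std::string>& words, WordCounts& counts) {
    for (const std::string& w : words) {
        ++counts[w];
    }
}
```

Compared with two parallel arrays and a linear scan, each update here costs constant time on average, which matters for the large case with many distinct words.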
Project Two • We’ll use a program to judge your homework • Please take care with the I/O format • You may not read the whole file at once • You may read at most one line at a time • We’ll release some test data. • Due: 11/21 • Your bonus will depend on the efficiency of your program
Project Two • Large case • A lot of different words (more than 1000000) • A lot of words in a text file • 30% • One of them will be released • 10% per test case • We will release 2 normal test cases and 1 large test case for testing.
Project Two • A simple algorithm • Assume STOPWORD has N words and TEXTFILE has M words. • We build SW_LIST to store the stop words and TXT_LIST to store the text file words.
Project Two (Brute Force) • Read in STOPWORD, store it as SW_LIST: O(N) • foreach ( word read from TEXTFILE ): O(M) iterations • { • if ( the word is in SW_LIST ): O(N) • then continue to read another word. • else ( the word is not in SW_LIST ) • then • if ( the word is in TXT_LIST ): O(M) • then add 1 to the count of the word • else ( the word is not in TXT_LIST ) • then insert the word into TXT_LIST • }
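The brute-force loop body can be written out literally with two plain vectors and linear scans. This is only an illustrative sketch of the pseudocode; `Entry`, `in_stop_list`, and `brute_force_add` are hypothetical names. With N stop words and M text words, each call costs up to O(N + M), so processing the whole text is O(M·(N + M)) in the worst case.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical entry type: (word, occurrence count).
using Entry = std::pair<std::string, unsigned long>;

// Linear scan of SW_LIST: O(N) per lookup.
bool in_stop_list(const std::vector<std::string>& sw_list,
                  const std::string& word) {
    return std::find(sw_list.begin(), sw_list.end(), word) != sw_list.end();
}

// One iteration of the foreach loop for a single word from TEXTFILE.
void brute_force_add(const std::vector<std::string>& sw_list,
                     std::vector<Entry>& txt_list,
                     const std::string& word) {
    if (in_stop_list(sw_list, word)) return;   // stop word: skip it
    for (Entry& e : txt_list) {                // linear scan of TXT_LIST: O(M)
        if (e.first == word) {
            ++e.second;                        // seen before: add 1
            return;
        }
    }
    txt_list.emplace_back(word, 1UL);          // first occurrence
}
```

Replacing either vector with a hash-based or tree-based container removes the corresponding linear scan, which is where the efficiency bonus would come from.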
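Project Two • Faster programs for this assignment will earn a bonus. • Everyone's program will be run on a certain mystery workstation to see whose is fast and whose is slow. • If you have concerns about the fairness of the bonus, please raise them before class on 11/6 (Thu).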
Project Two – How to hand in • First, create a folder named after your student ID at ftp://mpc.cs.nctu.edu.tw. • Upload C/C++ source files that compile and run to ftp://mpc.cs.nctu.edu.tw
Q & A • Any questions ?