Data Structure Project 2


minya


  1. Data Structure Project 2: Calculating Word Frequency in a Document

  2. TA’s website & reminder
     • http://mpc.cs.nctu.edu.tw/forum/
     • 11/6 (Thu): quiz this Thursday; 5. Threaded Binary Tree will not be on it
     • 11/15 (Sat) 10:10~12:00: midterm exam!

  3. About Project One…
     • About the extra-line problem
     • >> version
         ifstream input(argv[1]);
         while (!input.eof() && input.peek() > 0) {
             input >> buf;
             cout << buf;
             input >> buf;
             input.get(); /* consume the '\n' character */
             cout << " " << buf << endl;
         }

  4. About Project One…
     • getline version
         ifstream input(argv[1]);
         while (!input.eof()) {
             input.getline(buf, 500);
             if (input.gcount() > 0) /* check whether anything was actually read */
                 cout << buf << endl;
         }
     • Another one
         ifstream input(argv[1]);
         while (input.getline(buf, 500)) {
             cout << buf << endl;
         }

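  5. About Project One…
     • About the ^@ problem
     • If ^@ shows up during the demo, your program wrote '\0' (that is, 0) to the file.
     • From now on, a program that produces this during the demo will not pass; it will be counted as an error.
     • How to fix?
     • The most common cause is writing to the file without computing the buffer/string length correctly.
         int i; FILE* fw; const char *a = "123";
         fw = fopen(argv[1], "w");
         /* this does not output ^@ */
         for (i = 0; i < 3; i++) fprintf(fw, "%c", a[i]);
         /* this does output ^@: a[3] is the terminating '\0' */
         for (i = 0; i < 4; i++) fprintf(fw, "%c", a[i]);
         fclose(fw);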
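
One way to avoid hard-coding the loop bound is to let `strlen` decide how many characters to write, since it stops before the terminating '\0'. The sketch below is only an illustration of that idea; the helper name `write_without_nul` is hypothetical and does not come from the slides.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper (not from the slides): writes the characters of `text`
// to `out` and returns how many bytes were written.
std::size_t write_without_nul(std::FILE* out, const char* text) {
    // strlen counts only the characters before the terminating '\0',
    // so the terminator itself is never written to the file.
    return std::fwrite(text, 1, std::strlen(text), out);
}
```

Calling `write_without_nul(fw, a)` with `a = "123"` writes exactly three bytes, which matches the first loop on the slide.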

  6. About Project One
     • For a make-up demo of Project 1, first upload your code to ftp://mpc.cs.nctu.edu.tw and create a directory named after your student ID.
     • Scores for the first demo: http://www.cs.nctu.edu.tw/~hhyou/ds.php

  7. Project Two
     • Input: a text file and a stop word list
     • Using argc and argv
     • ./a.out stopword textfile
     • Output: pairs of each word and its number of occurrences
     • To stdout (the screen)
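
A minimal sketch of the argument handling, assuming the stop word list comes first and the text file second, as the example command line suggests. The helper name `parse_args` and its parameters are hypothetical.

```cpp
#include <string>

// Hypothetical helper (not from the slides): checks the argument count and
// copies the two paths out of argv. The order "stop word list, then text
// file" is an assumption based on the example command line.
bool parse_args(int argc, char* argv[],
                std::string& stopword_path, std::string& text_path) {
    if (argc != 3) {
        return false;  // expected: program name plus two file paths
    }
    stopword_path = argv[1];
    text_path = argv[2];
    return true;
}
```

In `main`, a `false` return would typically lead to printing a usage message to stderr and exiting with a non-zero status.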

  8. Project Two
     • Text file (without stop words)
         Hello, I’m Billy, not bi|ly or 6illy or b.
     • Output
     • Hello,: 1
     • I’m: 1
     • Billy,: 1
     • not: 1
     • bi|ly: 1
     • or: 2
     • 6illy: 1
     • b.: 1

  9. Project Two
     • Text file (same)
     • Stop word list
     • and
     • not
     • or
     • Output
     • Hello,: 1
     • I’m: 1
     • Billy,: 1
     • bi|ly: 1
     • 6illy: 1
     • b.: 1

  10. Project Two
      • Text file
          a b c d e f g h i j a b c d e
      • Stop word list
          a b c d
      • Output
          e:2 ; f:1 ; g:1 ; h:1 ; i:1 ; j:1

  11. Project Two
      • Input
      • Text file
      • Words are separated by ' ', '\t', or '\n'.
      • Case sensitive.
      • "Do" and "do" are different words.
      • There are at most 2000 chars in one line.
      • There will be no Chinese input.
      • A text file may contain more than one line.
      • There might be consecutive '\t', ' ', or '\n'.
      • Program execution time is limited.
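
The splitting rules above can be sketched as follows. This is one possible approach, not the required one; the function name `split_words` is hypothetical. Runs of consecutive separators are skipped, and no case folding is done, so "Do" and "do" stay distinct.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the slides): splits one line into words,
// treating ' ', '\t', and '\n' as separators.
std::vector<std::string> split_words(const std::string& line) {
    const std::string separators = " \t\n";
    std::vector<std::string> words;
    std::size_t pos = 0;
    // Jump to the start of the next word, skipping any run of separators.
    while ((pos = line.find_first_not_of(separators, pos)) != std::string::npos) {
        std::size_t end = line.find_first_of(separators, pos);
        if (end == std::string::npos) {
            end = line.size();  // the last word runs to the end of the line
        }
        words.push_back(line.substr(pos, end - pos));
        pos = end;
    }
    return words;
}
```

Punctuation is kept as part of the word, which agrees with the earlier sample output where "Hello," and "b." are counted as they appear.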

  12. Project Two
      • Input
      • Stop word list
      • One word per line
      • No space or '\t' within a line
      • No more than 2000 chars per line
      • Correct
      • Haha
      • Hehe
      • kerker
      • Incorrect
      • 囧oo
      • A b
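
As an illustration of these constraints, a line of the stop word list could be checked like this. The function name `is_valid_stop_word_line` is hypothetical, and two points are assumptions rather than rules from the slides: empty lines are rejected, and any non-ASCII byte is rejected (which is how the sketch rules out input such as 囧oo).

```cpp
#include <string>

// Hypothetical helper (not from the slides): returns true if `line` looks
// like a well-formed stop word entry.
bool is_valid_stop_word_line(const std::string& line) {
    // Assumption: an empty line is not a word. The 2000-char limit is from the slide.
    if (line.empty() || line.size() > 2000) {
        return false;
    }
    for (char c : line) {
        unsigned char byte = static_cast<unsigned char>(c);
        // No space or tab inside a word; bytes above 127 are treated as
        // non-ASCII input (assumption) and rejected.
        if (byte == ' ' || byte == '\t' || byte > 127) {
            return false;
        }
    }
    return true;
}
```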

  13. Project Two
      • Word occurrence
      • Format: string + ' ' + number + '\n'
          A 3
          B 5
      • The order of the strings does not matter.
          B 5
          A 3
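
Since the output is checked by a program, it may help to build each line in one place. A minimal sketch, with the hypothetical name `format_pair`:

```cpp
#include <string>

// Hypothetical helper (not from the slides): builds one output line in the
// form string + ' ' + number + '\n'.
std::string format_pair(const std::string& word, long count) {
    return word + ' ' + std::to_string(count) + '\n';
}
```

For the word "A" seen 3 times this yields the line `A 3` followed by a newline.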

  14. Project Two
      • You can use any data structure to store the (word, occurrence) pairs, such as an array (but watch out for the large case).
      • One array for your strings, another for the occurrences
      • Your data structure must be fast at insertion and at selection (search).
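
One structure that is fast at both insertion and search is a hash table. The sketch below assumes the standard containers `std::unordered_set` and `std::unordered_map` are acceptable for the assignment, which the slides do not state; the function name `count_words_hashed` is hypothetical.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not from the slides): counts every word in
// `text_words` that is not a stop word, using hash tables for both lookups.
std::unordered_map<std::string, long> count_words_hashed(
        const std::vector<std::string>& stop_words,
        const std::vector<std::string>& text_words) {
    // Hash set of stop words: average O(1) membership test.
    std::unordered_set<std::string> stop(stop_words.begin(), stop_words.end());
    std::unordered_map<std::string, long> counts;
    for (const std::string& word : text_words) {
        if (stop.count(word) == 0) {
            ++counts[word];  // operator[] inserts the word with count 0 if new
        }
    }
    return counts;
}
```

Iteration order of an `unordered_map` is unspecified, which is fine here only because the order of the output lines does not matter. A balanced tree or a trie would be other reasonable choices.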

  15. Project Two
      • We’ll use a program to judge your homework.
      • Please pay close attention to the I/O format.
      • You cannot read the whole file at once.
      • You may read at most one line at a time.
      • We’ll release some test data.
      • Due: 11/21
      • Your bonus will depend on the efficiency of your program.

  16. Project Two
      • Large case
      • A lot of different words (more than 1000000)
      • A lot of words in a text file
      • 30%
      • One of them will be released
      • 10% per test case
      • We will release 2 normal test cases and 1 large test case for testing.

  17. Project Two
      • A simple algorithm
      • Assume STOPWORD has N words and TEXTFILE has M words.
      • We build SW_LIST to store the stop words and TXT_LIST to store the words of the text file.

  18. Project Two (Brute Force)
          Read in STOPWORD, store it as SW_LIST
          foreach ( word read from TEXTFILE )
          {
              if ( the word is in SW_LIST )
                  continue with the next word
              else if ( the word is in TXT_LIST )
                  add 1 to the count of the word
              else
                  insert the word into TXT_LIST with a count of 1
          }
      • Step costs noted on the slide: O(N), O(N), O(M), O(M)
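
The brute-force pseudocode can be sketched in C++ as follows, assuming the stop words and the text words have already been read into vectors. The function name `count_words_brute_force` and the `WordCount` alias are hypothetical. Because both lists are searched linearly, the whole run costs roughly O(M · (N + M)) word comparisons in the worst case.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias (not from the slides): one (word, occurrence) pair.
using WordCount = std::pair<std::string, long>;

// Hypothetical helper (not from the slides): brute-force counting with two
// plain lists, SW_LIST and TXT_LIST, both searched linearly.
std::vector<WordCount> count_words_brute_force(
        const std::vector<std::string>& sw_list,
        const std::vector<std::string>& text_words) {
    std::vector<WordCount> txt_list;
    for (const std::string& word : text_words) {
        // Linear scan of SW_LIST: up to N comparisons.
        bool is_stop_word = false;
        for (const std::string& stop : sw_list) {
            if (stop == word) {
                is_stop_word = true;
                break;
            }
        }
        if (is_stop_word) {
            continue;  // skip stop words entirely
        }
        // Linear scan of TXT_LIST: up to M comparisons.
        bool found = false;
        for (WordCount& entry : txt_list) {
            if (entry.first == word) {
                ++entry.second;
                found = true;
                break;
            }
        }
        if (!found) {
            txt_list.push_back(WordCount(word, 1));  // first occurrence
        }
    }
    return txt_list;
}
```

This is fine for the small examples but is unlikely to finish in reasonable time on the large case with more than 1000000 different words.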

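  19. Project Two
      • Faster programs for this assignment will earn a bonus.
      • Everyone's program will be run on a certain mystery workstation to see whose is fast and whose is slow.
      • If you have concerns about the fairness of the bonus, raise them before class on 11/6 (Thu).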

  20. Project Two – How to hand in
      • First create a folder named after your student ID on ftp://mpc.cs.nctu.edu.tw.
      • Upload C/C++ source files that compile and run to ftp://mpc.cs.nctu.edu.tw.

  21. Q & A
      • Any questions?
