IR Project
黃楹芸 90522017
孫怡明 90522026
Reference Collections
• The TREC Collection
  • Built under the TIPSTER program
  • Documents from all sub-collections are tagged with SGML to allow easy parsing
• FBIS (Foreign Broadcast Information Service)
  • Size: 470 MB
  • Number of documents: 130,471
  • Words per document (median): 322
  • Words per document (mean): 543.6
Document Parsing
• Process each document:
  • Extract the document ID
  • Segment the text into tokens (in our case, split on whitespace and newlines)
  • Case conversion (make all tokens lowercase)
  • Discard stopwords and other non-content words (e.g. numbers)
  • Word stemming
  • Count term frequencies and record positions
  • Update the indices
• Write the index out to file in alphabetical order, from a to z
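The parsing steps above can be sketched in Java roughly as follows. This is a minimal illustration, not the project's actual code: the class name `DocumentParser`, the tiny stopword list, and the punctuation trimming are assumptions, and `stem` is an identity placeholder where a real stemmer (for example a Porter stemmer) would go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class DocumentParser {
    // Tiny illustrative stopword list (assumption; the project's real list is not given).
    private static final Set<String> STOPWORDS =
            Set.of("a", "an", "and", "in", "of", "the", "to");

    // Placeholder for word stemming: returns the term unchanged (assumption).
    public static String stem(String term) {
        return term;
    }

    // Returns term -> 0-based token positions. A TreeMap keeps the terms in
    // alphabetical order, matching the "a to z" output order on the slide.
    // The term frequency of a term is the size of its position list.
    public static Map<String, List<Integer>> parse(String text) {
        Map<String, List<Integer>> positions = new TreeMap<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return positions;
        }
        // Segment the text on whitespace and newlines.
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            // Case conversion, then trim leading/trailing punctuation (assumption).
            String token = tokens[i].toLowerCase(Locale.ROOT)
                    .replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
            // Discard empty tokens, stopwords, and pure numbers.
            if (token.isEmpty() || STOPWORDS.contains(token) || token.matches("\\d+")) {
                continue;
            }
            positions.computeIfAbsent(stem(token), k -> new ArrayList<>()).add(i);
        }
        return positions;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> result = parse("The Quick brown fox 42 the fox");
        result.forEach((term, pos) ->
                System.out.println(term + " tf=" + pos.size() + " positions=" + pos));
    }
}
```

Positions here are counted over the raw token stream, so discarded stopwords and numbers still advance the position counter. Whether the project counted positions this way is not stated.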
Project Introduction
• Platform
  • a. CPU: Celeron 450 MHz
  • b. RAM: 256 MB
  • c. Operating system: Windows 2000 Server
  • d. Implementation: Java + JDBC
  • e. Data storage: SQL Server 2000
• Indexing method used
  • Inverted indexing
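As a rough illustration of inverted indexing, the sketch below maps each term to the documents that contain it, together with the term frequency in each document. It is an in-memory stand-in only: the project stored its index in SQL Server 2000 through JDBC, and its schema is not given here. The class name `InvertedIndex`, its methods, and the document IDs are all hypothetical.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndex {
    // term -> (document ID -> term frequency). TreeMap keeps terms sorted a to z.
    private final Map<String, Map<String, Integer>> index = new TreeMap<>();

    // Adds one parsed document, given its per-term frequencies.
    public void addDocument(String docId, Map<String, Integer> termFrequencies) {
        termFrequencies.forEach((term, tf) ->
                index.computeIfAbsent(term, k -> new TreeMap<>()).put(docId, tf));
    }

    // Looks up a single term; the query is lowercased to match the indexed tokens.
    public Map<String, Integer> search(String term) {
        Map<String, Integer> postings = index.get(term.toLowerCase(Locale.ROOT));
        return postings == null
                ? Collections.emptyMap()
                : Collections.unmodifiableMap(postings);
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        // "doc1" and "doc2" are made-up IDs for illustration.
        idx.addDocument("doc1", Map.of("information", 2, "retrieval", 1));
        idx.addDocument("doc2", Map.of("information", 1));
        System.out.println(idx.search("Information"));
    }
}
```

A lookup touches only the postings of the queried term, so its cost grows with the number of matching documents, not with the size of the whole collection.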
Implementation
• Our user interface
  • http://140.115.156.81/IR/
• Indexing time
  • 120 to 140 sec per file
  • Total: about 16 hours
• Searching time
  • "Information": 13,999 records, about 15 sec
  • "mobilize": 866 records, about 3 sec
• Index file size
  • 850 MB