Modern Information Retrieval

Presentation Transcript


  1. Modern Information Retrieval. Group 3: 87070300 陳國富, 87068800 王俊傑, 87070600 夏希璿

  2. Our Environment
     • CPU: Duron 700
     • RAM: 320 MB
     • OS: Microsoft Windows XP Professional
     • Database: MySQL
     • Programming language: PHP
     • Storage device: 30 GB 7200 rpm hard disk

  3. Framework. Figure 1: System architecture diagram. Components and labels shown: WWW, documents, read, query, report, Index Processor, Database, Ranking Processor, search results, store.

  4. Indexing Processing
     • Read the documents
     • Remove stop words
     • Compute a weight for each word
     • Discard words whose weight is too small
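The first two indexing steps can be sketched as follows. This is a minimal Python illustration, not the authors' PHP code: the stop-word list, the `tokenize` rule, and both function names are assumptions, since the slides do not say which list or tokenizer was used.

```python
import re

# Hypothetical stop-word list; the slides do not say which list was used.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


tokens = remove_stop_words(tokenize("The price of the stock in Taiwan"))
print(tokens)  # ['price', 'stock', 'taiwan']
```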

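  5. Indexing Processing (Cont.)
     Formula: Weight = Wn + log_e(T / Ts)
     • Wn: number of times the index term occurs in the document
     • T: total number of documents
     • Ts: number of documents that contain the index term
     • Index terms whose weight is too small are discarded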
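The weight formula can be sketched in Python as below. The function name `index_weights`, the input layout (a dict of token lists), and the `min_weight` cutoff are assumptions made for illustration; the slides do not give the actual threshold or data structures.

```python
import math
from collections import Counter


def index_weights(docs, min_weight=0.0):
    """Compute Weight = Wn + ln(T / Ts) for every term of every document.

    docs: dict mapping doc_id -> list of tokens (stop words already removed).
    min_weight: hypothetical cutoff; terms below it are discarded.
    Returns a dict mapping doc_id -> {term: weight}.
    """
    total_docs = len(docs)                      # T
    doc_freq = Counter()                        # Ts for each term
    for tokens in docs.values():
        doc_freq.update(set(tokens))            # count each document once

    weights = {}
    for doc_id, tokens in docs.items():
        kept = {}
        for term, count in Counter(tokens).items():   # count is Wn
            weight = count + math.log(total_docs / doc_freq[term])
            if weight >= min_weight:
                kept[term] = weight
        weights[doc_id] = kept
    return weights


docs = {"d1": ["stock", "stock", "taiwan"], "d2": ["stock", "play"]}
print(index_weights(docs))
```

In this toy collection "stock" occurs in both documents, so its logarithmic term is ln(2/2) = 0 and its weight equals its raw count, while "taiwan" and "play" each gain ln(2).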

  6. Indexing Processing (Cont.)
     • Examples:
     • "Play" weight: 5.965
     • "Taiwan" weight: 25.745
     • "Stock" weight: 13.922
     • The same index term has a different weight in each document

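  7. Search & Ranking
     • Let Query = (Q1, Q2, Q3, ..., Qn) be the query entered by the user, where n is the number of words in the query, and let (D1, D2, D3, ..., Dm) be the retrieved documents, where m is the number of results.
     • wij is the weight of Qi in Dj.
     • Wj = Σ wij (summed over i = 1..n) is the total weight of all query words in Dj.
     • DOCWj is the total weight of all index terms in document j.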

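  8. Search & Ranking (Cont.)
     • Ranking is determined by the following:
     • The number of query words the document contains
     • The share of the document's weight taken up by the query words = (Wj / DOCWj)
     • The document's total index weight (DOCWj)
     • Used as a ranking criterion only when the query has more than 2 words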
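One possible reading of these ranking criteria is sketched below in Python. Several points are assumptions, because the slides do not spell them out: the criteria are applied in the listed order as successive tie-breakers, larger values rank first, and the "more than 2 words" condition applies to the DOCWj criterion. The function name `rank` and the input layout are also hypothetical.

```python
def rank(query_terms, weights):
    """Rank documents for a query.

    query_terms: list of query words (Q1..Qn).
    weights: dict mapping doc_id -> {term: weight} (the index).
    Returns doc_ids, best first. Documents matching no query word are skipped.
    """
    # Assumption: DOCWj is used only when the query has more than 2 words.
    use_doc_weight = len(query_terms) > 2
    unique_terms = set(query_terms)

    scored = []
    for doc_id, term_weights in weights.items():
        matched = [t for t in unique_terms if t in term_weights]
        if not matched:
            continue
        w_j = sum(term_weights[t] for t in matched)      # Wj
        docw_j = sum(term_weights.values())              # DOCWj
        ratio = w_j / docw_j if docw_j else 0.0          # Wj / DOCWj
        key = (len(matched), ratio, docw_j if use_doc_weight else 0.0)
        scored.append((key, doc_id))

    # Assumption: compare criteria in listed order, larger values first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc_id for _, doc_id in scored]


index = {
    "d1": {"stock": 2.0, "taiwan": 1.0, "play": 1.0},
    "d2": {"stock": 1.0, "market": 3.0},
    "d3": {"taiwan": 3.0, "stock": 1.0},
}
print(rank(["stock", "taiwan"], index))  # ['d3', 'd1', 'd2']
```

Here d3 and d1 both contain the two query words, but the query accounts for all of d3's weight (4/4) and only three quarters of d1's (3/4), so d3 ranks first; d2 matches a single word and comes last.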

  9. Search & Ranking (Cont.)

     Recall   Precision (without document weighting)   Precision (with document weighting)
     0%       44.50%                                   47.77%
     10%      29.03%                                   28.80%
     20%      20.39%                                   21.26%
     30%      16.89%                                   17.10%
     40%      15.10%                                   15.97%
     50%      12.47%                                   12.79%
     60%      9.02%                                    9.00%
     70%      6.20%                                    6.29%
     80%      4.35%                                    4.38%
     90%      2.42%                                    2.58%
     100%     0.29%                                    0.29%

  10. Search & Ranking (Cont.)

     Level    With document weighting   Without document weighting
     0%       38.16                     32.39
     10%      16.78                     17.17
     20%      10.03                     8.52
     30%      7.18                      6.82
     40%      6.45                      4.96
     50%      3.97                      3.43
     60%      3.05                      3.04
     70%      2.43                      2.27
     80%      1.99                      1.97
     90%      0.97                      0.69
     100%     0.01                      0.01

  11. Search & Ranking (Cont.)

     Recall   Precision
     0%       61.22%
     10%      45.64%
     20%      36.99%
     30%      30.98%
     40%      29.30%
     50%      25.13%
     60%      17.34%
     70%      11.69%
     80%      7.73%
     90%      4.83%
     100%     0.68%
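The recall levels in these tables (0% to 100% in steps of 10%) match the common 11-point interpolated precision measure. The slides do not say how the figures were computed, so the Python sketch below is only an illustration of that standard measure for a single query, under the assumption that it is what was used; the function name and inputs are hypothetical.

```python
def eleven_point_precision(ranked_ids, relevant_ids):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query.

    ranked_ids: retrieved doc_ids in rank order (assumed to be unique).
    relevant_ids: the doc_ids judged relevant for the query.
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant document is required")

    # (recall, precision) measured at each rank where a relevant doc appears.
    points = []
    hits = 0
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / position))

    # Interpolation: best precision at any recall >= the given level.
    levels = [i / 10 for i in range(11)]
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


print(eleven_point_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```

Averaging these 11 values over all test queries would give one row set comparable in shape to the tables above.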

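  12. Conclusion
     • Precision is lower for longer queries
     • The system cannot analyze each word in the query
     • The words in a query do not necessarily carry equal importance
     • Example: "Actions Against International Terrorists"
     • The key words are "Against" and "Terrorists"
     • Choice of index terms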
