Crawling and Aligning Scholarly Presentations and Documents from the Web By SARAVANAN.S 09/09/2011 Under the guidance of A/P Min-Yen Kan
Motivation • More articles and more users • Searching for the right documents is difficult • Aim: find pairs of presentations and documents automatically
System Architecture • Used Yee Fan's Search Engine Wrapper (Google subsystem only) • The Google "filetype:" operator (PDF, PPT or PS) is added to the user query before it is sent to Google • Output (3-way): 1. Exact URL 2. Message when no free file is available 3. No result • Google's top results are then passed on to re-ranking (query construction is sketched below)
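A minimal sketch of the query-construction step described above. The function name is hypothetical; the actual Search Engine Wrapper interface is not shown in the slides, only the fact that a filetype restriction is appended to the user query.

```python
def build_query(user_query: str, file_type: str = "pdf") -> str:
    """Append Google's filetype: operator to the user query before submission."""
    assert file_type in {"pdf", "ppt", "ps"}, "only the three slide-deck file types"
    return f"{user_query} filetype:{file_type}"

# Example: restrict results for a paper title to PDF files
print(build_query("finding related pages in the world wide web", "pdf"))
# -> 'finding related pages in the world wide web filetype:pdf'
```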
Methodology (1): Re-Ranking • Similarity is computed between the user query and the retrieved documents for re-ranking • Methods used for computing similarity: Jaccard coefficient and Bilingual Evaluation Understudy (BLEU) • A threshold value keeps the system from considering documents with low similarity scores • Flow: Google's top results → similarity score computation (query title vs. each result's title, snippet, and URL) → results re-ranked by similarity score (sketched below)
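An illustrative re-ranking loop, assuming a `similarity(query, text)` function (Jaccard or BLEU, sketched on the following slides) and result records with title, snippet, and URL fields. Taking the maximum score over the three fields is an assumption made for this sketch; the slides do not specify how the three scores are combined.

```python
def rerank(query_title, results, similarity, threshold=0.5):
    """Score each Google result against the query title and drop low-similarity hits."""
    scored = []
    for r in results:
        # Assumption: keep the best score across title, snippet, and URL
        score = max(similarity(query_title, r[field])
                    for field in ("title", "snippet", "url"))
        if score >= threshold:
            scored.append((score, r))
    # Highest similarity first
    return [r for _, r in sorted(scored, key=lambda x: x[0], reverse=True)]
```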
Jaccard Measure • The Jaccard measure is used to compute similarity between the query title and each Google result's title, snippet, and URL • Simple word-by-word matching (see the sketch below). Problems: • Snippets have more words than titles • The union in Jaccard grows while the intersection stays the same, deflating the score. Example: Sentence 1: Finding related pages in the world wide web. Sentence 2: Finding related pages using the link structure of the WWW.
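A small sketch of word-level Jaccard similarity applied to the two example sentences above. Lower-casing and whitespace tokenisation are assumptions; the slides do not state the exact tokenisation used.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

s1 = "Finding related pages in the world wide web"
s2 = "Finding related pages using the link structure of the WWW"
print(round(jaccard(s1, s2), 3))  # 4 shared words out of a 13-word union -> 0.308
```

Note how the extra words in the second sentence enlarge the union and pull the score down even though the two titles clearly describe the same work, which is the problem the slide points out.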
BLEU metric Why BLEU? • n-gram similarity of words • Helps capture the sequential order of words when computing similarity between two texts • Word order matters with snippets: query terms may appear at arbitrary positions (an n-gram overlap sketch follows)
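A simplified n-gram overlap score in the spirit of BLEU, shown only to illustrate why word order matters more than in Jaccard. This is not the implementation used in the project and omits BLEU's clipping and brevity penalty.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(query: str, text: str, max_n: int = 2) -> float:
    """Average n-gram precision of the query against a candidate text."""
    q, t = query.lower().split(), text.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        q_grams, t_grams = ngrams(q, n), ngrams(t, n)
        if not q_grams:
            continue
        matches = sum(1 for g in q_grams if g in t_grams)
        scores.append(matches / len(q_grams))
    return sum(scores) / len(scores) if scores else 0.0
```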
Rules Special rules are used for better matching: Rule 1: Removing special symbols (On/Off) Rule 2: Stop-word removal (On/Off) Rule 3: URL filter by .edu (On/Off) Rule 4: Stemming (Porter stemming algorithm) (On/Off) All these rules are used with both methodologies (a combined sketch follows).
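A hedged sketch of the four on/off rules as preprocessing toggles. The stop-word list and the symbol pattern are assumptions, and Porter stemming is shown via NLTK's `PorterStemmer` rather than whatever implementation the project used.

```python
import re
from nltk.stem import PorterStemmer

STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "to"}  # assumed subset
stemmer = PorterStemmer()

def preprocess(text, strip_symbols=True, remove_stopwords=True, stem=True):
    """Apply the on/off text rules before similarity computation."""
    if strip_symbols:                       # Rule 1: remove special symbols
        text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.lower().split()
    if remove_stopwords:                    # Rule 2: stop-word removal
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:                                # Rule 4: Porter stemming
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens

def passes_edu_filter(url, edu_only=False):  # Rule 3: URL filter by .edu
    return (".edu" in url) if edu_only else True
```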
Methodology MIME-types: • To differentiate free PDFs from subscription-only ones, the MIME type is checked: an HTTP request returns the content type of the URL (sketch below). Dataset collection: queries from • Computer science • Medical science • Architecture • Mathematics
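A sketch of the MIME-type check, assuming the `requests` library; the original system may use a different HTTP client, but the idea is the same: a URL that actually serves `application/pdf` is treated as a free file, while an HTML response usually indicates a paywall or landing page.

```python
import requests

def is_free_pdf(url: str) -> bool:
    """HEAD the URL and check whether it serves a PDF rather than an HTML page."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=10)
        content_type = resp.headers.get("Content-Type", "")
        return "application/pdf" in content_type.lower()
    except requests.RequestException:
        return False
```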
Experiment • Experiments on • Jaccard measure (all special rules tested On/Off) • BLEU measure (all special rules tested On/Off) • Query set of about 50 queries • Threshold varied from 0.1 to 1.0 in all experiments • The highest recall at a high threshold is preferred • Experiment results • Jaccard similarity • BLEU similarity (threshold sweep sketched below)
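An illustrative threshold sweep over the 0.1 to 1.0 range. The `evaluate(threshold)` callback returning precision and recall is hypothetical; it stands in for running the re-ranker over the labelled query set at a given threshold.

```python
def sweep_thresholds(evaluate, step=0.1):
    """Try thresholds 0.1 .. 1.0 and return the one with the best F-score."""
    best = None
    for i in range(1, 11):
        threshold = round(i * step, 1)
        precision, recall = evaluate(threshold)
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        if best is None or f_score > best[1]:
            best = (threshold, f_score)
    return best  # (threshold, F-score)
```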
Experiment results of Jaccard (chart: best F-score achieved)
Experiment results of BLEU (chart: best F-score achieved)
Related Work Base references: • SlideSeer: A Digital Library of Aligned Document and Presentation Pairs [Kan, JCDL '07] • Learning to Rank for Information Retrieval [Liu et al., WWW '09] • Kairos: Proactive Harvesting of Research Paper Metadata from Scientific Conference Web Sites [Hänse, ICADL '09] Approaches to similarity computation: • BLEU: A Method for Automatic Evaluation of Machine Translation [Papineni et al., ACL '02] • Implementation of the BLEU algorithm for evaluating machine translations [Payson et al.]
Conclusion • Matching documents based on similarity score • Jaccard measure: Jaccard similarity computed between the query title and the document title, with the special-symbol-removal rule on, retrieves the best articles (threshold: 0.7, F-score: 0.9473) • BLEU metric: BLEU similarity computed between the query title and the document title, with the special-symbol-removal rule on, retrieves the best articles (threshold: 0.5, F-score: 0.8947)
Thank you Comments are welcome