MOVIE QUOTES SEARCH ENGINE (MQSE)
Industrial Project – Final Presentation
Students: Meytal Bialik, Zvi Cahana
Supervisors: Hayim Makabee, Oren Somekh
Technion – Israel Institute of Technology, Computer Science Department
19.6.12
Introduction
The Movie Quotes Search Engine project builds a search engine that lets a user search for terms appearing in movie dialogue. The project consists of two main components:
• A web application that serves as the user interface to the search engine.
• A crawling engine that maintains a searchable index and a content database.

Outline: Introduction • Goals • Methodology • System Diagram • Achievements • Testing • Screenshots • Conclusions
Goals
• Relevant search results
• Modern UI design
• Rich search options
• Video play option
• Browser agnostic website
• Large-scale movies database
• Incremental, priority-based crawling
Methodology
• IMDb & OpenSubtitles.org dump files
• SRT subtitle files
• OpenSubtitles.org XML-RPC API
• SQLite database
• Apache Lucene
• Java Servlets / JSP
• HTML5 / CSS / JavaScript
System Diagram
[Figure: system diagram; image not included in this transcript]
Achievements
• Crawling
  • Command-line tool
  • Dump files parsing
  • OpenSubtitles.org API based
  • Subtitles downloading & indexing
  • Cover art downloading
  • Multithreaded pipelined execution
  • Priority based
  • Index recovery
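As a rough illustration of what priority-based crawling could look like, the sketch below orders pending crawl jobs so that the highest-priority movie is handed out first. The slides do not say how priority is computed; using a popularity value is an assumption, and `CrawlQueue`, `Job`, `submit`, and `drain` are hypothetical names, not the project's code. A `PriorityBlockingQueue` is used because it is thread-safe, so several worker threads could poll it in a multithreaded crawler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

/** Hypothetical crawl scheduler: jobs with higher popularity are served first. */
final class CrawlQueue {

    /** A pending crawl job. Both fields are illustrative assumptions. */
    static final class Job {
        final String movieId;
        final int popularity;

        Job(String movieId, int popularity) {
            this.movieId = movieId;
            this.popularity = popularity;
        }
    }

    // Higher popularity first; ties are broken by movie id for a stable order.
    private final PriorityBlockingQueue<Job> pending = new PriorityBlockingQueue<>(
            16,
            (a, b) -> a.popularity != b.popularity
                    ? Integer.compare(b.popularity, a.popularity)
                    : a.movieId.compareTo(b.movieId));

    /** Adds a movie to the crawl backlog. */
    void submit(String movieId, int popularity) {
        pending.add(new Job(movieId, popularity));
    }

    /** Returns the next job to crawl, or null when the backlog is empty. */
    Job next() {
        return pending.poll();
    }

    /** Empties the backlog and returns the movie ids in the order they would be crawled. */
    List<String> drain() {
        List<String> order = new ArrayList<>();
        for (Job job = next(); job != null; job = next()) {
            order.add(job.movieId);
        }
        return order;
    }
}
```

In a real crawler each worker thread would call `next()` in a loop and pass the downloaded subtitle on to the indexing stage; that part is omitted here.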
Achievements
• Storage
  • SQLite-based database
  • Movies metadata (popularity, rating, IMDb link...)
  • Cover art
  • ~20000 subtitles downloaded & indexed
  • Local videos repository
Achievements
• Indexing
  • SRT files parsing & validating
  • SRT files filtering
    • Translator comments
    • Hearing impaired comments
    • Format tags
  • Partitioning into overlapping search units
  • Indexing using Lucene core
    • Stemming
    • Stop words removal
    • Actual indexing of the search units
  • ~250ms per average SRT file
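The filtering and partitioning steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the regular expressions are guesses at what format tags (`<i>...</i>`, `{...}`) and hearing-impaired comments (`[...]`, `(...)`) look like in SRT text, translator-comment filtering is not covered, and `SearchUnits`, `clean`, `partition`, `size`, and `overlap` are hypothetical names. Each search unit joins a fixed number of consecutive subtitle cues, and neighbouring units share `overlap` cues so that a quote spanning a unit boundary still falls entirely inside some unit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helpers for cleaning subtitle text and building overlapping search units. */
final class SearchUnits {

    // Assumed shape of format tags: HTML-like tags and brace-delimited directives.
    private static final Pattern FORMAT_TAG = Pattern.compile("<[^>]+>|\\{[^}]*\\}");

    // Assumed shape of hearing-impaired comments: text in square brackets or parentheses.
    private static final Pattern HI_COMMENT = Pattern.compile("\\[[^\\]]*\\]|\\([^)]*\\)");

    /** Strips format tags and hearing-impaired comments, then normalizes whitespace. */
    static String clean(String cueText) {
        String s = FORMAT_TAG.matcher(cueText).replaceAll("");
        s = HI_COMMENT.matcher(s).replaceAll("");
        return s.replaceAll("\\s+", " ").trim();
    }

    /**
     * Groups cues into units of at most {@code size} cues, where consecutive
     * units share {@code overlap} cues.
     */
    static List<String> partition(List<String> cues, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("require size > 0 and 0 <= overlap < size");
        }
        List<String> units = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < cues.size(); start += step) {
            int end = Math.min(start + size, cues.size());
            units.add(String.join(" ", cues.subList(start, end)));
            if (end == cues.size()) {
                break; // the last cue is covered; avoid emitting a redundant tail unit
            }
        }
        return units;
    }
}
```

Each resulting unit would then be handed to the indexer as one searchable document. A larger overlap improves the chance of matching quotes that straddle unit boundaries, at the cost of a larger index.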
Achievements
• Searching
  • Searching using Lucene core
  • Query parsing
  • Search operators support
  • Stemming
  • Stop words removal
  • Relevant buckets retrieval & ranking
  • Aggregating buckets to movies
  • Merging of overlapping buckets
  • Highlighting search words using Lucene core
  • Buckets trimming to most relevant text
  • Configurable weighted movie ranking
    • Lucene rank
    • Popularity
    • Rating
    • Year
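One plausible reading of "configurable weighted movie ranking" is a weighted sum of the four listed signals. The slides do not give the actual formula, so the sketch below is an assumption: it expects every signal to be normalized to the range [0, 1] beforehand, and `MovieRanker`, `Movie`, and the weight names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical ranker: final score is a weighted sum of four normalized signals. */
final class MovieRanker {

    /** A candidate movie; all signals are assumed to be normalized to [0, 1]. */
    static final class Movie {
        final String title;
        final double lucene, popularity, rating, year;

        Movie(String title, double lucene, double popularity, double rating, double year) {
            this.title = title;
            this.lucene = lucene;
            this.popularity = popularity;
            this.rating = rating;
            this.year = year;
        }
    }

    private final double wLucene, wPopularity, wRating, wYear;

    MovieRanker(double wLucene, double wPopularity, double wRating, double wYear) {
        this.wLucene = wLucene;
        this.wPopularity = wPopularity;
        this.wRating = wRating;
        this.wYear = wYear;
    }

    /** Combines the four signals using the configured weights. */
    double score(Movie m) {
        return wLucene * m.lucene
                + wPopularity * m.popularity
                + wRating * m.rating
                + wYear * m.year;
    }

    /** Returns movie titles ordered from highest to lowest score. */
    List<String> rank(List<Movie> movies) {
        List<Movie> sorted = new ArrayList<>(movies);
        sorted.sort((a, b) -> Double.compare(score(b), score(a)));
        List<String> titles = new ArrayList<>();
        for (Movie m : sorted) {
            titles.add(m.title);
        }
        return titles;
    }
}
```

Changing the weights changes the ordering: a ranker that weights only the text-match score favours the closest textual hit, while one that weights popularity heavily pushes well-known movies up even when their match is weaker.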
Achievements
• Web Application
  • JSP/HTML5/CSS/JavaScript based
  • Full support for IE9
  • Modern UI design
  • Search results snippets
  • Multiple hits per movie
  • Paging
  • Video play option
    • Per result snippet
    • Relevant scene
    • Captions
Testing
A testing platform compares the quality of search results across different system configurations.
• In each test, the search engine is queried with a famous quote
• A test passes if the relevant movie is found in the top-K results
Testing
We tested the system with a set of ~100 famous movie quotes. With a biased system configuration and K=9, we achieved a pass rate of ~90%.
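The pass criterion described in the testing slides can be sketched as a small evaluator. This is an illustrative sketch, not the project's testing platform: `TopKEvaluator` and `passRate` are hypothetical names, and the search engine is represented by an arbitrary function from a quote to a ranked list of movie titles.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical evaluator for the "relevant movie in the top-K results" criterion. */
final class TopKEvaluator {

    /**
     * Returns the fraction of quotes whose expected movie appears among the
     * first {@code k} results returned by {@code search}.
     */
    static double passRate(Map<String, String> expectedMovieByQuote,
                           Function<String, List<String>> search,
                           int k) {
        if (expectedMovieByQuote.isEmpty()) {
            return 0.0;
        }
        int passed = 0;
        for (Map.Entry<String, String> test : expectedMovieByQuote.entrySet()) {
            List<String> results = search.apply(test.getKey());
            // Only the first k results count toward a pass.
            List<String> topK = results.subList(0, Math.min(k, results.size()));
            if (topK.contains(test.getValue())) {
                passed++;
            }
        }
        return (double) passed / expectedMovieByQuote.size();
    }
}
```

Running the same quote set against several system configurations and comparing the resulting pass rates gives a simple way to tell which configuration returns the expected movie more often.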
Screenshots
[Figures: two slides of screenshots; images not included in this transcript]
Conclusions
• Lucene is a powerful search platform
• Optimal search results are difficult to define
• Subtitle files from public sources should be further validated
• HTML5 video support is still limited & browser dependent
• Source control systems make life easier