IST 441 Example Projects
Undergrad Project
• Find a customer with an interest in Xbox game forums.
• Build a search engine for Xbox game forums etc.
• Compare two approaches: Google CSE and LucidWorks.
• Steps:
  • Crawl websites (at most 5).
  • Determine the crawl depth and how to include/exclude certain pages and file types.
  • Extract information and build the index.
  • Experiment with different rankings (see the “relevancy workbench” app in your LucidWorks installation): http://ist441.ist.psu.edu:8988/relevancy/experiment
  • Perform searches and compare the precision@K values.
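The last step compares precision@K, i.e. the fraction of the top K results that are relevant to the query. A minimal sketch of that comparison follows, assuming you have hand-labeled a set of relevant URLs for each query. The function name, the URLs, and the two result lists are hypothetical placeholders, not output from Google CSE or LucidWorks.

```python
def precision_at_k(ranked_results, relevant, k):
    """Return the fraction of the top-k ranked results that are relevant.

    ranked_results: list of result identifiers (e.g. URLs), best first.
    relevant: set of identifiers judged relevant for the query.
    k: cutoff rank; the denominator is always k, even if fewer
       than k results were returned.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_results[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical relevance judgments and rankings for one query.
relevant_urls = {"forum/a", "forum/b", "forum/c"}
engine_one = ["forum/a", "forum/x", "forum/b", "forum/y", "forum/z"]
engine_two = ["forum/a", "forum/b", "forum/c", "forum/x", "forum/y"]

for k in (1, 3, 5):
    p1 = precision_at_k(engine_one, relevant_urls, k)
    p2 = precision_at_k(engine_two, relevant_urls, k)
    print(f"P@{k}: engine one = {p1:.2f}, engine two = {p2:.2f}")
```

Averaging these values over several queries gives a fairer comparison than a single query would.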
Graduate Project
• Crawl academic institution webpages in Qatar (it’s a small domain).
• Integrate a more powerful crawler, such as Nutch or Heritrix, with the LucidWorks system.
• Focused crawling, i.e. crawling for a specific type of page such as researchers’ home pages.
• Modify the parser to extract specific information, such as email addresses and phone numbers, from a web page.
• Modify the Solr schema and/or ranking functions.
• Compare search results with Google CSE.
• Discuss with the instructor for more information.
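The parser modification mentioned above could start from something like the sketch below, which pulls email addresses and phone numbers out of raw HTML with regular expressions. It is a standalone illustration, not the LucidWorks or Nutch parser API. The patterns are deliberately loose assumptions and would need tuning for real pages, and the sample page, address, and number are made up.

```python
import re
from html import unescape

# Loose patterns, assumed for illustration only; real pages need tuning.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
TAG_RE = re.compile(r"<[^>]+>")


def extract_contacts(html):
    """Return de-duplicated emails and phone numbers found in an HTML string."""
    # Strip tags (including attributes such as mailto: links), then
    # decode entities like &amp; so the patterns see plain text.
    text = unescape(TAG_RE.sub(" ", html))
    emails = sorted(set(EMAIL_RE.findall(text)))
    # Normalize phone numbers by keeping only digits and a leading '+'.
    phones = sorted({re.sub(r"[^\d+]", "", m) for m in PHONE_RE.findall(text)})
    return {"emails": emails, "phones": phones}


# Hypothetical researcher home page fragment.
sample_page = (
    '<p>Contact: <a href="mailto:jane.doe@example.edu.qa">'
    "jane.doe@example.edu.qa</a>, Tel: +974 4403 0000</p>"
)
print(extract_contacts(sample_page))
```

The extracted values could then be stored in dedicated fields of the index, which is where the Solr schema changes come in.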