FilipinianaWeb is a research project aiming to develop a grid-based search engine that focuses on Philippine-related web content. It incorporates intelligent document discovery mechanisms through general-purpose and focused web crawlers. The system includes filters for domain, language, geolocation, and Bayesian analysis. Future plans include integrating focused crawling and supporting other object formats like documents, images, and XML.
FilipinianaWeb • Nestor Michael C. Tiglao • Computer Networks Lab (CNL), University of the Philippines • 17th APAN Meetings & Joint Techs Workshop • Jan. 30, 2004
World Wide Web • Enormous growth (10 billion pages) • Imagine the Web without search engines • Need for intelligent document discovery mechanisms
Web Crawlers • Programs that retrieve Web pages • Two kinds: • General-purpose crawlers • Focused crawlers
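As a rough illustration of what a general-purpose crawler does, the sketch below walks pages breadth-first, extracting links and queueing any URL it has not seen. It is not the Gagambot implementation. The `fetch` callback and the `TOY_WEB` dictionary are assumptions introduced here so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return order


# Hypothetical in-memory "web" standing in for real HTTP fetches.
TOY_WEB = {
    "http://a.example/": '<a href="/b">b</a> <a href="http://c.example/">c</a>',
    "http://a.example/b": '<a href="/">home</a>',
    "http://c.example/": "no links here",
}

visited = crawl("http://a.example/", TOY_WEB.get)
```

In a real crawler, `fetch` would issue HTTP requests and respect robots.txt and politeness delays, which this sketch leaves out.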
Focused Crawler • Selectively seek out pages that are relevant to a pre-defined set of topics • Topics are specified by sample documents
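One simple way to picture "topics specified by sample documents" is to score each candidate page by its similarity to the samples. The sketch below uses cosine similarity over raw term counts, which is a stand-in chosen for illustration and not the classifier used in the project. The function names and the sample texts are assumptions.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def relevance(page_text, sample_docs):
    """Score a page by its best match against any sample document."""
    page = term_counts(page_text)
    return max(cosine(page, term_counts(doc)) for doc in sample_docs)


# Hypothetical sample documents describing the topic of interest.
SAMPLES = ["Philippine history and culture", "Filipino cuisine adobo sinigang"]

on_topic = relevance("A short history of Philippine culture", SAMPLES)
off_topic = relevance("Stock prices fell sharply today", SAMPLES)
```

A focused crawler could follow links only from pages whose score exceeds a chosen threshold, so that crawl effort concentrates on relevant regions of the Web.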
Research on Search Engines • Implemented the focused crawler on a Linux cluster using Beowulf and MPI (2002) • Philippine-specific search engine using the openMosix platform (2003)
Focused Crawler Architecture • Components shown in the diagram: User Interface, Sample Document, Classifier, Distiller, Crawler, Crawl Tables, Results
Why another search engine? • Existing Philippine search engines: Yehey.com, Alleba, Tanikalang Ginto, Pugad.com and EdsaWorld • These are actually web directories, not search engines • We need a better search engine
Unique Situation • Many Philippine-related sites are not registered under the .ph domains • Many sites are hosted outside the Philippines • English as the de facto language
System Design (Gagambot)
Filters • .ph domain filter • gov.ph, edu.ph • Language filter • ISO 639, ISO-8859-1/Latin-1 and Windows-1252 • Subset of Unicode characters: UTF-8 and US-ASCII
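The two filters on this slide might look roughly like the sketch below. It assumes the domain filter checks the URL's hostname and that the language filter inspects the charset declared in the HTTP Content-Type header. Both are assumptions, since the slides do not say how the checks are performed, and the ISO 639 language-code check is not covered here. All function names are hypothetical.

```python
from urllib.parse import urlsplit

# Character sets named on the slide, written as lowercase labels.
ACCEPTED_CHARSETS = {"iso-8859-1", "latin1", "windows-1252", "utf-8", "us-ascii"}


def in_ph_domain(url):
    """True if the URL's host is under the .ph top-level domain."""
    host = (urlsplit(url).hostname or "").lower()
    return host == "ph" or host.endswith(".ph")


def charset_of(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset":
            return value.strip('"').lower()
    return None


def passes_charset_filter(content_type):
    """True if the declared charset is one of the accepted ones."""
    return charset_of(content_type) in ACCEPTED_CHARSETS
```

For example, `in_ph_domain("http://www.up.edu.ph/")` accepts the page, while a page with no declared charset is rejected by `passes_charset_filter`. Because many Philippine-related sites sit outside .ph, a domain check like this can only be one signal among several.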
Filters 2 • GeoURL filter • Location-to-URL reverse directory • Finds URLs by their proximity to a given location (www.geourl.org) • Bayesian filter • Analyzes the textual content of the HTML document
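A Bayesian filter over page text can be sketched as a multinomial naive Bayes classifier with add-one smoothing. The slides do not specify the model or training data, so the class below, the two labels, and the training sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, label, text):
        self.doc_counts[label] += 1
        counts = self.word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            self.vocab.add(tok)

    def log_score(self, label, text):
        """Log prior plus summed log likelihood of the text's tokens."""
        counts = self.word_counts[label]
        total = sum(counts.values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokenize(text):
            score += math.log((counts[tok] + 1) / (total + len(self.vocab)))
        return score

    def classify(self, text):
        return max(self.word_counts, key=lambda label: self.log_score(label, text))


# Hypothetical training data; a real filter would need far more examples.
nb = NaiveBayes()
nb.train("philippine", "Manila is the capital of the Philippines")
nb.train("philippine", "Cebu and Davao are cities in the Philippines")
nb.train("other", "Paris is the capital of France")
nb.train("other", "Berlin and Munich are cities in Germany")

label = nb.classify("Travel guide to Manila and Cebu in the Philippines")
```

In practice the text would first be extracted from the HTML document, and the classifier would be trained on a labeled corpus of Philippine-related and unrelated pages.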
Current Plans • Develop FilipinianaWeb on a grid platform • Better filtering techniques • Integrate focused crawling • Support for other object formats: documents, images, XML, etc.
Conclusion • FilipinianaWeb is a work-in-progress and a proof-of-application • Grid infrastructure will help provide the computational and resource requirements of a production-level search engine