What won’t change
• Harvest’s basic design
• SOIF for inter-component communication
• Development model
http://harvest.sourceforge.net/

General Goals
• Increase search speed
• Shift focus to HTTP and HTML
• Internationalisation
• Improve scalability
• Increase availability
• Improve access control

General Goals
• Integrate other search systems into the Harvest system
• Remove all non-GPLed components
• Improve ranking
• Promote Harvest to attract more users and developers

Gatherer
• Shift focus to HTTP
• Improve gathering over slow connections
• Improve HTTP gatherer
• Create multiple Gatherers “on the fly” where possible
• Evaluate larbin and curl
• Migrate from GDBM to Sleepycat’s Berkeley DB for local disc cache management

Gatherer
• Remove local disc cache
• Implement candidate selection filter for HTTP enumerator based on MIME type
• Trust MIME type sent by HTTP servers
• Add HTTPS support
• Evaluate improvements of HTTP/1.1 over HTTP/1.0
• Replace unnesters with exploders
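
A candidate selection filter of the kind listed above might look roughly like the following. This is a minimal sketch, not Harvest code: the names `is_candidate`, `media_type` and `ALLOWED_TYPES`, and the particular types on the allow-list, are assumptions made for illustration. It trusts the `Content-Type` value reported by the server, as the slide proposes.

```python
# Hypothetical sketch: decide whether a URL is a gathering candidate
# from the Content-Type header its HTTP server reports.

# Assumed allow-list; a real Gatherer would make this configurable.
ALLOWED_TYPES = {"text/html", "text/plain", "application/pdf"}


def media_type(content_type_header):
    """Return the bare, lower-cased media type, dropping parameters."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_candidate(content_type_header, allowed=ALLOWED_TYPES):
    """True if the server-reported MIME type is on the allow-list."""
    if not content_type_header:
        return False  # no type reported: skip rather than guess
    return media_type(content_type_header) in allowed


if __name__ == "__main__":
    for header in ("text/html; charset=ISO-8859-1", "image/png", ""):
        print(repr(header), is_candidate(header))
```

Filtering on the reported type lets the enumerator discard unwanted objects before they are fetched in full, at the cost of relying on servers to report types correctly.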

Gatherer
• Improve object storage system
• Improve expiring objects
• Evaluate viability of an expire daemon
• Split file: and news: rootnodes into leafnodes
• Remove All-Templates
• Make SOIF objects shareable between Gatherer and Broker if possible

Summarizer
• Shift focus to HTML
• Improve existing HTML summarizers
• Create HTML summarizer which “understands” HTML
• Improve support for Microsoft Office documents

Broker
• Add Index Data’s Zebra as full-text indexer
• Implement method to retrieve a SOIF object by URL
• Improve temporary file/directory handling used for paging search results
• Improve SOIF object storage
• Extend “shell indexer” functionality
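
Retrieving a SOIF object by URL, as listed above, could be sketched as a store keyed on the object's URL. This is an illustration under stated assumptions, not the Broker's storage code: the classes `SoifObject` and `SoifStore` are hypothetical, and `to_text` produces only a simplified approximation of the SOIF layout that should be checked against the Harvest manual.

```python
# Hypothetical sketch: an in-memory store that returns a SOIF-like
# object by URL. Not Harvest's actual storage code.


class SoifObject:
    def __init__(self, url, attributes, template_type="FILE"):
        self.url = url
        self.template_type = template_type
        self.attributes = dict(attributes)

    def to_text(self):
        """Serialise in a simplified SOIF-like layout (approximation)."""
        lines = ["@%s { %s" % (self.template_type, self.url)]
        for name, value in self.attributes.items():
            size = len(value.encode("utf-8"))  # size counted in bytes
            lines.append("%s{%d}:\t%s" % (name, size, value))
        lines.append("}")
        return "\n".join(lines) + "\n"


class SoifStore:
    """Maps URL -> SoifObject so a lookup needs no scan of all objects."""

    def __init__(self):
        self._by_url = {}

    def add(self, obj):
        self._by_url[obj.url] = obj  # a later object replaces an earlier one

    def get_by_url(self, url):
        return self._by_url.get(url)  # None if the URL is unknown


if __name__ == "__main__":
    store = SoifStore()
    store.add(SoifObject("http://harvest.sourceforge.net/", {"Title": "Harvest"}))
    print(store.get_by_url("http://harvest.sourceforge.net/").to_text())
```

A persistent version would replace the dictionary with an on-disk key-value database, but the lookup interface could stay the same.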

Broker
• Implement a user interface in PHP
• Separate data from metadata when storing SOIF objects
• Minimise size of Registry
• Use cookies to save user preferences of the search interface
• Evaluate and write SOIF filter for Namazu

Broker
• Evaluate RDBMS (PostgreSQL, MySQL)
• Evaluate XQuery and SOAP

Documentation
• Switch from LinuxDoc to DocBook for manual and FAQ

Problems
• PostScript and PDF summarizers
• Apache’s MultiViews
• IMS Gathering
• Stemming and Soundex are language dependent
• Language recognition
• No free thesauri available