Search Bootstrapping
How / Where to get started
Crawling
• Start with Nutch
  • http://nutch.apache.org/
• Index directly to Solr
  • http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
• Create a seed list from the DMOZ RDF dump
  • http://www.dmoz.org/rdf.html
  • http://wiki.apache.org/nutch/NutchTutorial
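As a rough illustration of the seed-list step, the sketch below pulls page URLs out of a DMOZ content dump and keeps a random subset. It assumes each page entry opens with an `<ExternalPage about="...">` tag on one line; the function names, the sampling rate, and the file names in the usage note are illustrative, not part of Nutch.

```python
import html
import random
import re
from typing import Iterable, Iterator, List

# Assumed shape of a page entry in the dump:
#   <ExternalPage about="http://example.com/">
EXTERNAL_PAGE = re.compile(r'<ExternalPage\s+about="([^"]+)"')


def iter_page_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield every page URL found in the dump, one line at a time."""
    for line in lines:
        match = EXTERNAL_PAGE.search(line)
        if match:
            # Attribute values are XML-escaped (e.g. &amp;), so unescape them.
            yield html.unescape(match.group(1))


def sample_seeds(lines: Iterable[str], keep_one_in: int, seed: int = 0) -> List[str]:
    """Keep roughly one URL out of every `keep_one_in` (hypothetical knob)."""
    rng = random.Random(seed)
    return [url for url in iter_page_urls(lines) if rng.randrange(keep_one_in) == 0]
```

A possible use, assuming the dump has been downloaded and unpacked as `content.rdf.u8`: open it with `encoding="utf-8", errors="replace"`, pass the file object to `sample_seeds(f, 5000)`, and write the result one URL per line into a text file inside the directory handed to the crawler as its seed list. A line-based regex is used here instead of an XML parser so that a very large or slightly malformed dump can still be streamed.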
Understanding Content
• Entity Extraction
  • LingPipe: http://alias-i.com/lingpipe/
  • OpenNLP: http://incubator.apache.org/opennlp/
• Entity Identification / Taxonomies
  • Freebase: http://www.freebase.com/
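To make the entity identification step concrete, here is a minimal dictionary-lookup sketch. It is not LingPipe or OpenNLP, which use trained models; it only shows the idea of mapping mentions in text to taxonomy entries. The `TAXONOMY` table and its `ex:` identifiers are made up for illustration and are not Freebase data.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-taxonomy: surface form -> (made-up identifier, type).
TAXONOMY: Dict[str, Tuple[str, str]] = {
    "apache nutch": ("ex:nutch", "software"),
    "apache": ("ex:asf", "organization"),
    "new york": ("ex:nyc", "location"),
}


def identify_entities(
    text: str,
    taxonomy: Dict[str, Tuple[str, str]] = TAXONOMY,
    max_words: int = 3,
) -> List[Tuple[str, str, str]]:
    """Return (mention, identifier, type) for each taxonomy entry found in text."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase first so "apache nutch" wins over "apache".
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in taxonomy:
                entity_id, entity_type = taxonomy[phrase]
                found.append((phrase, entity_id, entity_type))
                i += n
                break
        else:
            # No entry starts at this token; move on.
            i += 1
    return found
```

In a fuller pipeline, a statistical extractor would propose the mentions and the lookup table would be replaced by queries against a real taxonomy; the longest-match rule shown here is one simple way to resolve overlapping names.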
Some Additional Links
• Basic Web Page Parser
  • https://github.com/pjaol/Webcrawler
• Example of OpenNLP usage
  • https://github.com/pjaol/entity_extractor
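For readers who want to see what a basic web page parser involves before opening the repositories, the following is a small standalone sketch using only the Python standard library. It is not the code from the linked projects; the class and function names are illustrative.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect visible text and absolute link URLs from one HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []
        self.text_parts: List[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html_text: str, base_url: str) -> Tuple[str, List[str]]:
    """Return (visible text, outgoing links) for a page."""
    parser = PageParser(base_url)
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

The extracted text is what would be passed to entity extraction or indexed, and the links are what a crawler would queue for the next fetch round.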