Apache Solr Beyond The Box

Apache Solr Beyond The Box. Chris Hostetter, 2008-11-05. http://people.apache.org/~hossman/apachecon2008us/ http://lucene.apache.org/solr/ Why Are We Here? Plugins! What, How, Where, When, Why? Solr Internals In A Nutshell; Real World Examples; Testing; Questions.


Presentation Transcript


  1. Apache Solr: Beyond The Box. Chris Hostetter, 2008-11-05. http://people.apache.org/~hossman/apachecon2008us/ http://lucene.apache.org/solr/

  2. Why Are We Here? Plugins! What, How, Where, When, Why?; Solr Internals In A Nutshell; Real World Examples; Testing; Questions

  3. What, How, Where, Who, When, Why?

  4. What Is Solr (To Users): Information Retrieval Application; Index/Query Via HTTP; Comprehensive HTML Administration Interfaces; Scalability - Efficient Replication To Other Solr Search Servers; Highly Configurable Caching; Flexible And Adaptable With XML Configuration; Customizable Request Handlers And Response Writers; Data Schema With Dynamic Fields And Unique Keys; Analyzers Created At Runtime From Tokenizers And TokenFilters

  5. What Is Solr (To Developers): Information Retrieval Application; Java5 WebApp (WAR) With A Web Services-ish API; Extensible Plugin Architecture; MVC-ish Framework Around The Java Lucene Search Library; Allows Custom Business Logic and Text Analysis Rules To Live Close To The Data; Abstracts Away The Tricky Stuff: Index Consistency, Data Replication, Cache Management

  6. How It Started

  7. When/Why To Write A Plugin “X can be done more efficiently closer to the data.” OR “To force X for all clients.”

  8. Solr Internals In A Nutshell

  9. 50,000' View (diagram): HTTP requests enter through SolrDispatchFilter and Java callers through EmbeddedSolrServer; both reach a CoreContainer holding several SolrCores. Inside a SolrCore: SolrRequestHandler, SolrQuery(Request/Response), QueryResponseWriter.

  10. MVC-ish
      SolrRequestHandler ... A Controller: handleRequest( SolrQueryRequest, SolrQueryResponse )
      SolrQueryRequest ... An Event (++): Input Parameters; List of ContentStreams; Maintains SolrCore & SolrIndexSearcher References
      SolrQueryResponse ... Model: Tree of "Simple" Objects and DocLists
      ResponseWriter ... View: write(Writer, SolrQueryRequest, SolrQueryResponse)

  11. Hello World
      public class HelloWorld extends RequestHandlerBase {
        public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) {
          String name = req.getParams().get("name");
          Integer age = req.getParams().getInt("age");
          rsp.add("greeting", "Hello " + name);
          rsp.add("yourage", age);
        }
        public String getVersion() { return "$Revision:$"; }
        public String getSource() { return "$Id:$"; }
        public String getSourceId() { return "$URL:$"; }
        public String getDescription() { return "Says Hello"; }
      }

  12. Hello World Output
      http://localhost:8983/solr/hello?name=Hoss&age=32&wt=xml
      <response>
        <lst name="responseHeader">
          <int name="status">0</int>
          <int name="QTime">1</int>
        </lst>
        <str name="greeting">Hello Hoss</str>
        <int name="yourage">32</int>
      </response>
      http://localhost:8983/solr/hello?name=Hoss&age=32&wt=json
      {
        "responseHeader":{
          "status":0,
          "QTime":1},
        "greeting":"Hello Hoss",
        "yourage":32
      }
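The two outputs above come from one handler: the handler fills a response model, and the `wt` parameter selects the writer that renders it. A minimal sketch of that split in plain Java follows. The class and method names are hypothetical stand-ins, not Solr classes, and the writers skip escaping and the `responseHeader` for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are not Solr classes.
class HelloModelViewSketch {

    /** "Controller": fills an ordered name/value model from request parameters. */
    static Map<String, Object> handle(Map<String, String> params) {
        Map<String, Object> rsp = new LinkedHashMap<>();
        rsp.put("greeting", "Hello " + params.get("name"));
        rsp.put("yourage", Integer.valueOf(params.get("age")));
        return rsp;
    }

    /** "View" 1: renders the model as XML-like elements tagged by value type (no escaping). */
    static String writeXml(Map<String, Object> rsp) {
        StringBuilder sb = new StringBuilder("<response>");
        for (Map.Entry<String, Object> e : rsp.entrySet()) {
            String tag = (e.getValue() instanceof Integer) ? "int" : "str";
            sb.append('<').append(tag).append(" name=\"").append(e.getKey()).append("\">")
              .append(e.getValue())
              .append("</").append(tag).append('>');
        }
        return sb.append("</response>").toString();
    }

    /** "View" 2: renders the same model as a JSON-like object (no escaping). */
    static String writeJson(Map<String, Object> rsp) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : rsp.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(e.getKey()).append("\":");
            if (e.getValue() instanceof Integer) sb.append(e.getValue());
            else sb.append('"').append(e.getValue()).append('"');
        }
        return sb.append('}').toString();
    }
}
```

The handler never mentions an output format, which is why the same plugin serves XML and JSON clients without changes.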

  13. Types Of Plugins: SolrRequestHandler, SearchComponent, QParserPlugin, ValueSourceParser, SolrHighlighter, SolrFragmenter, SolrFormatter, UpdateRequestProcessorFactory, QueryResponseWriter, Similarity(Factory), Analyzer, TokenizerFactory, TokenFilterFactory, FieldType, SolrCache, CacheRegenerator, SolrEventListener, UpdateHandler. Slide legend: Italics: Only One Per SolrCore. Color: Likelihood Of Needing To Write Your Own.

  14. Real World Examples

  15. Tibetan And Himalayan Digital Library Tools

  16. Tsheg Analysis Factories
      public class TshegBarTokenizerFactory extends BaseTokenizerFactory {
        public TokenStream create(Reader input) {
          return new TshegBarTokenizer(input);
        }
      }
      public class EdgeTshegTrimmerFactory extends BaseTokenFilterFactory {
        public TokenStream create(TokenStream input) {
          return new EdgeTshegTrimmer(input);
        }
      }

  17. DFLL

  18. DFLL: Faceted Browsing

  19. DFLL Category Metadata
      Category ID and Label: 3126 == “Tablet PCs”
      Category Query: tablet_form:[* TO *]
      Ordered List of Facets
        Facet ID and Label: 500016 == “OS Provided”
        Facet Display Info: Count vs. Alphabetical, etc...
        Ordered List of Constraints
          Constraint ID and Label: 111536 == “Apple OS X”
          Constraint Query: os:("OSX10.1" "OSX10.2" ...)

  20. DfllHandler Pseudo-Code
      Document catMetaDoc = searcher.getFirstMatch(catDocId)
      Metadata m = parseAndCacheMetadata(catMetaDoc, searcher)
      m = m.clone()
      DocListAndSet results = searcher.getDocListAndSet(m.catQuery, ...)
      response.add("products", results.docList)
      foreach (Facet f : m) {
        foreach (Constraint c : f) {
          c.setCount(searcher.numDocs(c.query, results.docSet))
        }
      }
      response.add("metadata", m.asSimpleObjects())

  21. Conceptual Picture (diagram)
      Query: tablet_form:[* TO *], os:("OSX10.1" "OSX10.2" ...), memory:[1GB TO *], sort price asc
      getDocListAndSet(Query,Query[],Sort,offset,n) returns:
        DocList: Section of ordered results
        DocSet: Unordered set of all results
      numDocs() counts against the DocSet:
        proc_manu:Intel = 594
        proc_manu:AMD = 382
        price:[0 TO 500] = 247
        price:[500 TO 1000] = 689
        manu:Dell = 104
        manu:HP = 92
        manu:Lenovo = 75
      Response
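The per-constraint counts in the DfllHandler pseudo-code amount to intersecting each constraint's matching documents with the unordered set of all results. A minimal sketch of that idea follows, using `java.util.BitSet` as a stand-in for a DocSet. The class, method names, and document ids are hypothetical, and this is not how Solr's `numDocs` is implemented internally.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: BitSet stands in for a DocSet of internal document ids.
class FacetCountSketch {

    /** Counts documents present in both sets, the role numDocs(query, docSet) plays in the pseudo-code. */
    static int intersectionSize(BitSet constraintDocs, BitSet resultDocs) {
        BitSet both = (BitSet) constraintDocs.clone(); // clone so the inputs stay unchanged
        both.and(resultDocs);
        return both.cardinality();
    }

    /** Computes one count per constraint label, preserving the configured constraint order. */
    static Map<String, Integer> countConstraints(Map<String, BitSet> constraints, BitSet resultDocs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : constraints.entrySet()) {
            counts.put(e.getKey(), intersectionSize(e.getValue(), resultDocs));
        }
        return counts;
    }

    /** Convenience builder for a set of document ids. */
    static BitSet docs(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) set.set(id);
        return set;
    }
}
```

Because the counts only need set membership, the unordered DocSet is enough; the ordered DocList is used solely for the page of products returned to the client.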

  22. DFLL Response
      <result name="products" numFound="394" start="0">...</result>
      <lst name="metadata">
        ...
        <lst name="500016">
          <int name="rankDir">0</int><int name="datatype">1</int>
          <int name="rating">88</int><str name="name">OS provided</str>
          <lst name="values">
            <lst name="111536">
              <int name="valueId">111536</int>
              <str name="label">Apple Mac OS X</str>
              <str name="rating">50</str>
              <int name="count">1</int>
            </lst>
            ...
          </lst>

  23. DfllCacheRegenerator
      SolrCore “Auto-warms” all SolrCaches when new versions of the index are opened for searching (after a commit).
      public interface CacheRegenerator {
        public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                      SolrCache newCache, SolrCache oldCache,
                                      Object oldKey, Object oldVal)
          throws IOException;
      }
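Auto-warming can be pictured as a loop over the old cache's entries that asks a regenerator to rebuild each value against the newly opened searcher. The sketch below models that loop in plain Java. The `Regenerator` interface, the `warm` method, and the use of a `Function` as the "searcher" are hypothetical simplifications, not Solr types, and the iteration order Solr uses is not modeled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of cache auto-warming; none of these types are Solr classes.
class AutoWarmSketch {

    /** Simplified analogue of a cache regenerator: rebuild one entry, return false to stop warming. */
    interface Regenerator<K, V> {
        boolean regenerateItem(Function<K, V> newSearcher, Map<K, V> newCache, K oldKey, V oldVal);
    }

    /** Rebuilds up to maxItems entries of oldCache against the new searcher. */
    static <K, V> Map<K, V> warm(Map<K, V> oldCache, Function<K, V> newSearcher,
                                 Regenerator<K, V> regen, int maxItems) {
        Map<K, V> newCache = new LinkedHashMap<>();
        int done = 0;
        for (Map.Entry<K, V> e : oldCache.entrySet()) {
            if (done++ >= maxItems) break;
            boolean keepGoing = regen.regenerateItem(newSearcher, newCache, e.getKey(), e.getValue());
            if (!keepGoing) break;
        }
        return newCache;
    }

    /** A regenerator that ignores the stale value and recomputes it from the key. */
    static <K, V> Regenerator<K, V> recompute() {
        return (searcher, newCache, oldKey, oldVal) -> {
            newCache.put(oldKey, searcher.apply(oldKey));
            return true;
        };
    }
}
```

A custom regenerator is needed whenever the cached values are application objects, such as parsed category metadata, that the framework cannot rebuild from the key on its own.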

  24. DataImportHandler

  25. DataImportHandler
      Builds and incrementally updates indexes based on configured SQL or XPath queries.
      <entity name="item" pk="ID"
              query="select * from ITEM"
              deltaQuery="select ID ... where ITEMDATE > '${dataimporter.last_index_time}'">
        <field column="NAME" name="name" />
        ...
        <entity name="f" pk="ITEMID"
                query="select DESC from FEATURE where ITEMID='${item.ID}'"
                deltaQuery="select ITEMID from FEATURE where UPDATEDATE > '${dataimporter.last_index_time}'"
                parentDeltaQuery="select ID from ITEM where ID=${f.ITEMID}">
          <field name="features" column="DESC" />
          ...
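The `${...}` placeholders in these queries are filled in at import time from values such as the last index time or the current parent row. The sketch below shows one plain-Java way to do that kind of substitution. The class and method names are hypothetical and this is not DataImportHandler's own resolver; a real importer would also need to consider quoting and SQL injection.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical placeholder resolver for illustration; not part of DataImportHandler.
class TemplateSketch {

    // Matches ${name} and captures the name between the braces.
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with vars.get(name); unknown placeholders are left untouched. */
    static String resolve(String template, Map<String, String> vars) {
        Matcher m = VAR.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.get(m.group(1));
            String replacement = (value != null) ? value : m.group(0);
            // quoteReplacement keeps '$' and '\' in the value from being treated as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, resolving the `parentDeltaQuery` above with `f.ITEMID` set to `42` yields `select ID from ITEM where ID=42`.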

  26. DataImportHandler Plugins. DataSource: FileDataSource, HttpDataSource, JdbcDataSource. EntityProcessor: FileListEntityProcessor, SqlEntityProcessor, CachedSqlEntityProcessor, XPathEntityProcessor. Transformer: DateFormatTransformer, NumberFormatTransformer, RegexTransformer, ScriptTransformer, TemplateTransformer.

  27. LocalSolr

  28. LocalSolr

  29. LocalUpdateProcessorFactory
      Uses lat/lon fields to compute Cartesian Tier info
      Adds grid boxes of various sizes as new fields
      <updateRequestProcessorChain name="standard" default="true">
        <processor class="....LocalUpdateProcessorFactory">
          <str name="latField">lat</str>
          <str name="lngField">lng</str>
          <int name="startTier">9</int>
          <int name="endTier">17</int>
        </processor>
        <processor class="solr.LogUpdateProcessorFactory" />
        <processor class="solr.RunUpdateProcessorFactory" />
      </updateRequestProcessorChain>
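The idea of indexing one grid cell per tier can be sketched with a simple equal-angle grid: tier t splits the world into 2^t by 2^t cells, so higher tiers give smaller cells. This is an assumption made for illustration only. LocalSolr's actual projection, cell numbering, and field names are not given in the slides and may well differ; `tier_N` below is a made-up field name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical grid scheme for illustration; not LocalSolr's actual tier computation.
class TierGridSketch {

    /** Cell id of a point at the given tier, numbering cells row by row from the south-west corner. */
    static long boxId(double lat, double lng, int tier) {
        long cells = 1L << tier; // cells per axis at this tier
        long col = clamp((long) Math.floor((lng + 180.0) / 360.0 * cells), cells);
        long row = clamp((long) Math.floor((lat + 90.0) / 180.0 * cells), cells);
        return row * cells + col;
    }

    /** Keeps points on the upper edge (lat 90, lng 180) inside the last cell. */
    static long clamp(long index, long cells) {
        return Math.max(0, Math.min(cells - 1, index));
    }

    /** One extra field per tier, as an update processor might add to a document before indexing. */
    static Map<String, Long> tierFields(double lat, double lng, int startTier, int endTier) {
        Map<String, Long> fields = new LinkedHashMap<>();
        for (int t = startTier; t <= endTier; t++) {
            fields.put("tier_" + t, boxId(lat, lng, t));
        }
        return fields;
    }
}
```

With `startTier` 9 and `endTier` 17, each document gains nine such fields. Two nearby points share a cell at coarse tiers and separate at fine ones, which lets a query pick the tier whose cell size suits the search radius.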

  30. LocalSolr Cartesian Tiers

  31. LocalSolrQueryComponent
      Use in place of default QueryComponent
      Augments regular query with DistanceQuery and DistanceSortSource
      Can use a custom SolrCache for distances for commonly used points
      <searchComponent name="geoquery" class="....LocalSolrQueryComponent" />
      <requestHandler name="geo" class="solr.SearchHandler">
        <arr name="components">
          <str>geoquery</str>
          ...
        </arr>
      </requestHandler>

  32. GuardianComponent

  33. GuardianComponent Goal
      When Searching Really Short Docs, Rule Out Matches That Are “Significantly” Longer Than The Query
      Increase Precision At The Expense Of Recall
      q = Dance Party
        Dance Party (1995)
        Dance Party (2005) (V)
        Dance Party, USA (2006)
        Workout Party... Let's Dance! (2004) (V)
        Shrek in the Swamp Karaoke Dance Party (2001) (V)

  34. Implementation: SearchComponent Configured To Run After QueryComponent; Post-Processes DocList; Pick MAX_LEN Based On Number Of Query Clauses; Re-analyze Stored “title” Field; Eliminate Any Results With More Than MAX_LEN Tokens In “title”
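The post-processing step can be sketched in plain Java: count the tokens in each stored title and drop titles whose count exceeds a limit derived from the query. The tokenization (split on anything that is not a letter or digit) and the rule MAX_LEN = 2 x query clauses are assumptions made for this example; the slides do not say how MAX_LEN is chosen or which analyzer is used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical length filter for illustration; not the GuardianComponent source.
class LengthGuardSketch {

    /** Counts tokens after splitting on runs of characters that are neither letters nor digits. */
    static int countTokens(String text) {
        int n = 0;
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) n++;
        }
        return n;
    }

    /** Assumed rule: allow titles up to twice as long as the query. */
    static int maxLen(int queryClauses) {
        return 2 * queryClauses;
    }

    /** Keeps only titles with at most maxLen tokens, preserving result order. */
    static List<String> filter(String query, List<String> titles) {
        int limit = maxLen(countTokens(query));
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            if (countTokens(title) <= limit) kept.add(title);
        }
        return kept;
    }
}
```

Under these assumptions, the query `Dance Party` gives a limit of 4 tokens, so the first three titles from the Goal slide survive and the two longer ones are removed.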

  35. Alternate Approach: <copyField source="title" dest="titleLen"/>; Write TokenCountingTokenFilter For titleLen; Write MaxLenQParserPlugin: Subclass Your Favorite QParser, Pick MAX_LEN Based On Number Of Query Clauses From Super, Add +titleLen:[* TO MAX_LEN] Clause To Query
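In this variant the length check moves into the query itself, so the index does the filtering. The sketch below only shows the shape of the rewritten query as a string. Counting clauses by whitespace and using MAX_LEN = 2 x clauses are assumptions made for the example, and a real QParser subclass would add a clause to the parsed query object rather than concatenate strings.

```java
// Hypothetical query rewrite for illustration; not the MaxLenQParserPlugin source.
class MaxLenQuerySketch {

    /** Counts whitespace-separated terms in the user's query. */
    static int clauseCount(String userQuery) {
        String trimmed = userQuery.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Wraps the user's query and adds a required range clause on the token-count field. */
    static String withLengthLimit(String userQuery) {
        int maxLen = 2 * clauseCount(userQuery); // assumed rule, not taken from the slides
        return "+(" + userQuery.trim() + ") +titleLen:[* TO " + maxLen + "]";
    }
}
```

For `Dance Party` this produces `+(Dance Party) +titleLen:[* TO 4]`. The trade-off against the SearchComponent approach is extra work at index time in exchange for no per-result re-analysis at query time.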

  36. Testing Your Plugins

  37. AbstractSolrTestCase
      public class YourTest extends AbstractSolrTestCase {
        ...
        public void testSomeStuff() throws Exception {
          assertU(adoc("id", "7", "description", "Travel Guide", "title", "Paris in 10 Days"));
          assertU(adoc("id", "42", "description", "Cool Book", "title", "Hitch Hiker's Guide to the Galaxy"));
          assertU(commit());
          assertQ("multi qf",
                  req("q", "guide", "qt", "dismax", "qf", "title^2 description^1")
                  ,"//*[@numFound='2']"
                  ,"//result/doc[1]/int[@name='id'][.='42']"
                  ,"//result/doc[2]/int[@name='id'][.='7']"
                  );
        }
      }

  38. Questions? http://lucene.apache.org/solr/
