Explore Solr internals, including MVC framework, plugin architecture, and real-world examples. Learn how to enhance Solr functionality and customize data processing efficiently.
Apache Solr: Beyond The Box
Chris Hostetter
2008-11-05
http://people.apache.org/~hossman/apachecon2008us/
http://lucene.apache.org/solr/

Why Are We Here?
  - Plugins! What, How, Where, When, Why?
  - Solr Internals In A Nutshell
  - Real World Examples
  - Testing
  - Questions

What Is Solr (To Users)
  - Information Retrieval Application
  - Index/Query Via HTTP
  - Comprehensive HTML Administration Interfaces
  - Scalability - Efficient Replication To Other Solr Search Servers
  - Highly Configurable Caching
  - Flexible And Adaptable With XML Configuration
      Customizable Request Handlers And Response Writers
      Data Schema With Dynamic Fields And Unique Keys
      Analyzers Created At Runtime From Tokenizers And TokenFilters

What Is Solr (To Developers)
  - Information Retrieval Application
  - Java5 WebApp (WAR) With A Web Services-ish API
  - Extensible Plugin Architecture
  - MVC-ish Framework Around The Java Lucene Search Library
  - Allows Custom Business Logic and Text Analysis Rules To Live Close To The Data
  - Abstracts Away The Tricky Stuff:
      Index Consistency
      Data Replication
      Cache Management

When/Why To Write A Plugin
  - “X can be done more efficiently closer to the data.”
  OR
  - “To force X for all clients.”

50,000' View
[Diagram; component labels: HTTP, Java, EmbeddedSolrServer, SolrDispatchFilter, CoreContainer, SolrCore (x3), SolrRequestHandler, SolrQuery(Request/Response), QueryResponseWriter]

MVC-ish
  - SolrRequestHandler ... A Controller
      handleRequest( SolrQueryRequest, SolrQueryResponse )
  - SolrQueryRequest ... An Event (++)
      Input Parameters
      List of ContentStreams
      Maintains SolrCore & SolrIndexSearcher References
  - SolrQueryResponse ... Model
      Tree of "Simple" Objects and DocLists
  - ResponseWriter ... View
      write(Writer, SolrQueryRequest, SolrQueryResponse)

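Hello World

    // Imports assume the Solr 1.3 package layout.
    import org.apache.solr.handler.RequestHandlerBase;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;

    public class HelloWorld extends RequestHandlerBase {
      public void handleRequestBody(SolrQueryRequest req,
                                    SolrQueryResponse rsp) {
        // Read the "name" and "age" request parameters.
        String name = req.getParams().get("name");
        Integer age = req.getParams().getInt("age");
        // Whatever is added to the response is rendered by the ResponseWriter.
        rsp.add("greeting", "Hello " + name);
        rsp.add("yourage", age);
      }
      public String getVersion() { return "$Revision:$"; }
      public String getSource() { return "$Id:$"; }
      public String getSourceId() { return "$URL:$"; }
      public String getDescription() { return "Says Hello"; }
    }
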
Hello World Output

http://localhost:8983/solr/hello?name=Hoss&age=32&wt=xml

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
      </lst>
      <str name="greeting">Hello Hoss</str>
      <int name="yourage">32</int>
    </response>

http://localhost:8983/solr/hello?name=Hoss&age=32&wt=json

    {
      "responseHeader":{
        "status":0,
        "QTime":1},
      "greeting":"Hello Hoss",
      "yourage":32
    }

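The same handler produced both outputs because the handler only fills a response model and a separate writer renders it. A minimal plain-Java sketch of that split follows. Every name in it (`MvcSketch`, `Handler`, `ResponseWriter`, `dispatch`) is a hypothetical stand-in and not one of Solr's classes, it uses Java 8 lambdas for brevity, and it omits the responseHeader and any escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Solr's classes.
final class MvcSketch {

    /** Controller: reads request parameters and fills the response model. */
    interface Handler {
        void handle(Map<String, String> params, Map<String, Object> rsp);
    }

    /** View: renders the response model in one output format. */
    interface ResponseWriter {
        String write(Map<String, Object> rsp);
    }

    static final Handler HELLO = (params, rsp) -> {
        rsp.put("greeting", "Hello " + params.get("name"));
        rsp.put("yourage", Integer.valueOf(params.get("age")));
    };

    // No escaping is done here; a real writer would have to escape values.
    static final ResponseWriter XML = rsp -> {
        StringBuilder sb = new StringBuilder("<response>");
        for (Map.Entry<String, Object> e : rsp.entrySet()) {
            String tag = (e.getValue() instanceof Integer) ? "int" : "str";
            sb.append('<').append(tag).append(" name=\"").append(e.getKey()).append("\">")
              .append(e.getValue()).append("</").append(tag).append('>');
        }
        return sb.append("</response>").toString();
    };

    static final ResponseWriter JSON = rsp -> {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, Object> e : rsp.entrySet()) {
            if (sb.length() > 1) sb.append(',');
            sb.append('"').append(e.getKey()).append("\":");
            if (e.getValue() instanceof Integer) sb.append(e.getValue());
            else sb.append('"').append(e.getValue()).append('"');
        }
        return sb.append('}').toString();
    };

    /** Runs the handler once, then renders with the writer chosen by "wt". */
    static String dispatch(Map<String, String> params) {
        Map<String, Object> rsp = new LinkedHashMap<>();
        HELLO.handle(params, rsp);
        ResponseWriter writer = "json".equals(params.get("wt")) ? JSON : XML;
        return writer.write(rsp);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("name", "Hoss");
        params.put("age", "32");
        params.put("wt", "json");
        System.out.println(dispatch(params));
    }
}
```

The handler never knows which format was requested; only `dispatch` looks at `wt`, which is the point of keeping the model a tree of simple objects.
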
Types Of Plugins
  - SolrRequestHandler
  - SearchComponent
  - QParserPlugin
  - ValueSourceParser
  - SolrHighlighter
  - SolrFragmenter
  - SolrFormatter
  - UpdateRequestProcessorFactory
  - QueryResponseWriter
  - Similarity(Factory)
  - Analyzer
  - TokenizerFactory
  - TokenFilterFactory
  - FieldType
  - SolrCache
  - CacheRegenerator
  - SolrEventListener
  - UpdateHandler
Legend: Italics: Only One Per SolrCore; Color: Likelihood Of Needing To Write Your Own

Tsheg Analysis Factories

    public class TshegBarTokenizerFactory
      extends BaseTokenizerFactory {
      public TokenStream create(Reader input) {
        return new TshegBarTokenizer(input);
      }
    }

    public class EdgeTshegTrimmerFactory
      extends BaseTokenFilterFactory {
      public TokenStream create(TokenStream input) {
        return new EdgeTshegTrimmer(input);
      }
    }

DFLL Category Metadata
  - Category ID and Label: 3126 == “Tablet PCs”
  - Category Query: tablet_form:[* TO *]
  - Ordered List of Facets
      Facet ID and Label: 500016 == “OS Provided”
      Facet Display Info: Count vs. Alphabetical, etc...
      Ordered List of Constraints
        Constraint ID and Label: 111536 == “Apple OS X”
        Constraint Query: os:("OSX10.1" "OSX10.2" ...)

DfllHandler Pseudo-Code

    Document catMetaDoc = searcher.getFirstMatch(catDocId)
    Metadata m = parseAndCacheMetadata(catMetaDoc, searcher)
    m = m.clone()
    DocListAndSet results = searcher.getDocListAndSet(m.catQuery, ...)
    response.add("products", results.docList)
    foreach (Facet f : m) {
      foreach (Constraint c : f) {
        c.setCount(searcher.numDocs(c.query, results.docSet))
      }
    }
    response.add("metadata", m.asSimpleObjects())

Conceptual Picture
[Diagram] getDocListAndSet(Query,Query[],Sort,offset,n) feeding a Query Response:
  - Inputs: tablet_form:[* TO *]; os:("OSX10.1" "OSX10.2" ...); memory:[1GB TO *]; price asc
  - DocList: Section of ordered results
  - DocSet: Unordered set of all results
  - numDocs() counts:
      proc_manu:Intel = 594
      proc_manu:AMD = 382
      price:[0 TO 500] = 247
      price:[500 TO 1000] = 689
      manu:Dell = 104
      manu:HP = 92
      manu:Lenovo = 75

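Each constraint count is the size of the overlap between the documents matching the constraint query and the unordered set of all results. A minimal sketch of that counting step in plain Java follows. `java.util.BitSet` stands in for Solr's DocSet, the class and method names are hypothetical, and the document ids are made up for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration; BitSet stands in for a DocSet (an assumption, not Solr's API).
final class FacetCountSketch {

    /** Builds a set of document ids. */
    static BitSet docs(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) set.set(id);
        return set;
    }

    /** Counts ids present in both sets, the role numDocs(query, docSet) plays in the pseudo-code. */
    static int intersectionSize(BitSet a, BitSet b) {
        BitSet copy = (BitSet) a.clone();  // keep the inputs unmodified
        copy.and(b);
        return copy.cardinality();
    }

    /** Computes one count per constraint against the full result set. */
    static Map<String, Integer> counts(BitSet results, Map<String, BitSet> constraints) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : constraints.entrySet()) {
            out.put(e.getKey(), intersectionSize(e.getValue(), results));
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet results = docs(1, 2, 3, 5, 8);               // made-up result ids
        Map<String, BitSet> constraints = new LinkedHashMap<>();
        constraints.put("manu:Dell", docs(2, 3, 9));
        constraints.put("manu:HP", docs(5));
        constraints.put("manu:Lenovo", docs(4, 6));
        System.out.println(counts(results, constraints));
    }
}
```

With these made-up ids the counts are 2, 1 and 0. The counts are taken against the whole result set rather than the page of ordered results, which is why the handler asks for both a DocList and a DocSet.
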
DFLL Response

    <result name="products" numFound="394" start="0">...</result>
    <lst name="metadata">
      ...
      <lst name="500016">
        <int name="rankDir">0</int><int name="datatype">1</int>
        <int name="rating">88</int><str name="name">OS provided</str>
        <lst name="values">
          <lst name="111536">
            <int name="valueId">111536</int>
            <str name="label">Apple Mac OS X</str>
            <str name="rating">50</str>
            <int name="count">1</int>
          </lst>
          ...
        </lst>

DfllCacheRegenerator
SolrCore “Auto-warms” all SolrCaches when new versions of the index are opened for searching (after a commit).

    public interface CacheRegenerator {
      public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                    SolrCache newCache,
                                    SolrCache oldCache,
                                    Object oldKey,
                                    Object oldVal)
        throws IOException;
    }

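The auto-warming loop can be pictured as walking the old cache and asking a regenerator to rebuild each entry against the new searcher. The sketch below is a simplified plain-Java analogue, not Solr's implementation: `AutoWarmSketch`, `Regenerator`, `warm`, `recompute` and the `limit` parameter are hypothetical, a `Map` stands in for SolrCache, and a `Function` stands in for the new searcher.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified, hypothetical analogue of cache auto-warming; not Solr's classes.
final class AutoWarmSketch {

    interface Regenerator<K, V> {
        /** Rebuilds one entry in the new cache; returns false to stop warming. */
        boolean regenerateItem(Function<K, V> newSearcher, Map<K, V> newCache,
                               K oldKey, V oldVal);
    }

    /** A regenerator that ignores the stale value and recomputes it from the key. */
    static <K, V> Regenerator<K, V> recompute() {
        return (searcher, newCache, oldKey, oldVal) -> {
            newCache.put(oldKey, searcher.apply(oldKey));
            return true;
        };
    }

    /** Warms at most `limit` entries of the old cache into a fresh cache. */
    static <K, V> Map<K, V> warm(Map<K, V> oldCache, Function<K, V> newSearcher,
                                 Regenerator<K, V> regenerator, int limit) {
        Map<K, V> newCache = new LinkedHashMap<>();
        int warmed = 0;
        for (Map.Entry<K, V> e : oldCache.entrySet()) {
            if (warmed++ >= limit) break;
            boolean keepGoing = regenerator.regenerateItem(
                    newSearcher, newCache, e.getKey(), e.getValue());
            if (!keepGoing) break;
        }
        return newCache;
    }

    public static void main(String[] args) {
        Map<Integer, String> oldCache = new LinkedHashMap<>();
        oldCache.put(3126, "cat3126@v1");
        oldCache.put(42, "cat42@v1");
        Function<Integer, String> newSearcher = id -> "cat" + id + "@v2";
        System.out.println(warm(oldCache, newSearcher,
                AutoWarmSketch.<Integer, String>recompute(), 10));
    }
}
```

A regenerator for cached category metadata would presumably re-parse the metadata document through the new searcher, which is what `recompute` imitates with a plain function call.
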
DataImportHandler
Builds and incrementally updates indexes based on configured SQL or XPath queries.

    <entity name="item" pk="ID"
            query="select * from ITEM"
            deltaQuery="select ID ... where ITEMDATE > '${dataimporter.last_index_time}'">
      <field column="NAME" name="name" />
      ...
      <entity name="f" pk="ITEMID"
              query="select DESC from FEATURE where ITEMID='${item.ID}'"
              deltaQuery="select ITEMID from FEATURE where UPDATEDATE > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from ITEM where ID=${f.ITEMID}">
        <field name="features" column="DESC" />
        ...

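The `${...}` placeholders in these queries are filled in before the SQL runs, for example with the last index time or a column of the parent row. The sketch below shows one way such substitution can work in plain Java. It is an illustration only, not DataImportHandler's resolver: `TemplateSketch` and `resolve` are hypothetical names, and the real resolution rules may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative placeholder substitution; NOT DataImportHandler's own resolver.
final class TemplateSketch {

    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with its value; unknown names are left untouched. */
    static String resolve(String template, Map<String, String> vars) {
        Matcher m = VAR.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("f.ITEMID", "17");  // made-up value for illustration
        System.out.println(
            resolve("select ID from ITEM where ID=${f.ITEMID}", vars));
    }
}
```

With the made-up value above, the parentDeltaQuery resolves to `select ID from ITEM where ID=17`.
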
DataImportHandler Plugins
  - DataSource
      FileDataSource
      HttpDataSource
      JdbcDataSource
  - EntityProcessor
      FileListEntityProcessor
      SqlEntityProcessor
      CachedSqlEntityProcessor
      XPathEntityProcessor
  - Transformer
      DateFormatTransformer
      NumberFormatTransformer
      RegexTransformer
      ScriptTransformer
      TemplateTransformer

LocalUpdateProcessorFactory
  - Uses lat/lon fields to compute Cartesian Tier info
  - Adds grid boxes of various sizes as new fields

    <updateRequestProcessorChain name="standard" default="true">
      <processor class="....LocalUpdateProcessorFactory">
        <str name="latField">lat</str>
        <str name="lngField">lng</str>
        <int name="startTier">9</int>
        <int name="endTier">17</int>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

LocalSolrQueryComponent
  - Use in place of default QueryComponent
  - Augments regular query with DistanceQuery and DistanceSortSource
  - Can use a custom SolrCache for distances for commonly used points

    <searchComponent name="geoquery"
                     class="....LocalSolrQueryComponent" />
    <requestHandler name="geo" class="solr.SearchHandler">
      <arr name="components">
        <str>geoquery</str>
        ...
      </arr>
    </requestHandler>

GuardianComponent Goal
  - When Searching Really Short Docs, Rule Out Matches That Are “Significantly” Longer Than Query
  - Increase Precision At The Expense Of Recall
  - q = Dance Party
      Dance Party (1995)
      Dance Party (2005) (V)
      Dance Party, USA (2006)
      Workout Party... Let's Dance! (2004) (V)
      Shrek in the Swamp Karaoke Dance Party (2001) (V)

Implementation
  - SearchComponent
  - Configured To Run After QueryComponent
  - Post-Processes DocList
      Pick MAX_LEN Based On Number Of Query Clauses
      Re-analyze Stored "title" Field
      Eliminate Any Results With More Than MAX_LEN Tokens In "title"

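The post-processing step amounts to a length filter over the matched titles. A minimal plain-Java sketch follows, using the example titles from the goal slide. It makes two loud assumptions that are not in the talk: MAX_LEN is taken to be twice the number of query tokens, and "re-analysis" is approximated by splitting on anything that is not a letter or digit. `GuardianSketch` and its methods are hypothetical names, not the component's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative length filter; the MAX_LEN policy and the tokenization are assumptions.
final class GuardianSketch {

    /** Counts tokens after splitting on anything that is not a letter or digit. */
    static int tokenCount(String text) {
        int count = 0;
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) count++;
        }
        return count;
    }

    /** Hypothetical policy: allow titles up to twice as long as the query. */
    static int maxLen(String query) {
        return 2 * tokenCount(query);
    }

    /** Keeps only the titles with at most MAX_LEN tokens, preserving order. */
    static List<String> filter(String query, List<String> titles) {
        int max = maxLen(query);
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            if (tokenCount(title) <= max) kept.add(title);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
            "Dance Party (1995)",
            "Dance Party (2005) (V)",
            "Dance Party, USA (2006)",
            "Workout Party... Let's Dance! (2004) (V)",
            "Shrek in the Swamp Karaoke Dance Party (2001) (V)");
        for (String title : filter("Dance Party", titles)) {
            System.out.println(title);
        }
    }
}
```

Under these assumptions MAX_LEN is 4 for the query "Dance Party", so the first three titles survive and the two longest are dropped. A different policy or analyzer would draw the line elsewhere.
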
Alternate Approach
  - <copyField source="title" dest="titleLen"/>
  - Write TokenCountingTokenFilter For titleLen
  - Write MaxLenQParserPlugin
      Subclass Your Favorite QParser
      Pick MAX_LEN Based On Number Of Query Clauses From Super
      Add +titleLen:[* TO MAX_LEN] Clause To Query

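The query-time half of this approach boils down to requiring an extra range clause next to the user's query. The sketch below builds that clause at the string level in plain Java. A real QParser subclass would add the clause to the parsed query object instead, so treat this as an approximation. `MaxLenQuerySketch`, the whitespace clause counting and the "twice the clause count" policy are all assumptions, and it presumes titleLen is indexed with a field type whose range queries compare numerically.

```java
// String-level illustration of adding a length clause; names and policy are hypothetical.
final class MaxLenQuerySketch {

    /** Hypothetical policy: allow titles up to twice as long as the query. */
    static int maxLen(int clauses) {
        return 2 * clauses;
    }

    /** Counts whitespace-separated clauses in the user's query. */
    static int clauseCount(String userQuery) {
        String trimmed = userQuery.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Requires both the user's query and the titleLen range clause. */
    static String withLengthFilter(String userQuery) {
        int max = maxLen(clauseCount(userQuery));
        return "+(" + userQuery.trim() + ") +titleLen:[* TO " + max + "]";
    }

    public static void main(String[] args) {
        System.out.println(withLengthFilter("Dance Party"));
    }
}
```

Compared with re-analyzing stored titles after the search, this pushes the length check into the index lookup itself, at the cost of an extra indexed field.
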
AbstractSolrTestCase

    public class YourTest extends AbstractSolrTestCase {
      ...
      public void testSomeStuff() throws Exception {
        // Index two documents, then commit so they become searchable.
        assertU(adoc("id", "7",
                     "description", "Travel Guide",
                     "title", "Paris in 10 Days"));
        assertU(adoc("id", "42",
                     "description", "Cool Book",
                     "title", "Hitch Hiker's Guide to the Galaxy"));
        assertU(commit());
        // Each XPath below must match the XML response.
        assertQ("multi qf",
                req("q", "guide", "qt", "dismax",
                    "qf", "title^2 description^1")
                ,"//*[@numFound='2']"
                ,"//result/doc[1]/int[@name='id'][.='42']"
                ,"//result/doc[2]/int[@name='id'][.='7']"
                );
      }
    }

Questions?
http://lucene.apache.org/solr/