Lucene/SOLR 2: Lucene search API
TU Delft Library, Digitale Productontwikkeling
• Starter: Searcher, Term, Sort, Filter
• Main course: Query, Similarity, QueryParser
• Dessert: Hits, Highlighter, SpellChecker
Egbert Gramsbergen
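org.apache.lucene.search.Searcher
Notation (a bastardised UML class diagram): each class box lists its methods; * marks a constructor; arrows show arguments and return values; ([]) marks an array variant; some arguments are optional.
Searcher methods:
• doc(int i) returns Document
• docFreq(Term) returns int (array variant: Term[] gives int[])
• explain(Query, int doc) returns Explanation
• search(Query, optional Filter, optional Sort) returns Hits
• getSimilarity, setSimilarity (Similarity)
• plus lower-level methods (performance tuning)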
org.apache.lucene.search.Searcher (implementations)
• IndexSearcher: constructed from a Directory, a String path, or an IndexReader
• MultiSearcher, ParallelMultiSearcher: constructed from a Searchable[]
• RemoteSearchable
• Directory: FSDirectory, RAMDirectory, DbDirectory, JEDirectory
• IndexReader: FilterIndexReader, MultiReader (IndexReader[]), ParallelReader
org.apache.lucene.index.Term
• Constructor: Term(String field, String text)
• Methods: createTerm, field, text, compareTo (returns int)
Use: among other things, the building block of Query and Filter.
org.apache.lucene.search.Sort
N.B. Lucene has no strongly typed fields; SOLR does.
• Sort constructors: Sort(SortField) or Sort(SortField[]); Sort(String field, boolean reverse) or Sort(String[] fields)
• Sort methods: setSort, getSort
• SortField constructor: String field, then optionally int type, boolean reverse, SortComparatorSource, or Locale
• SortField types (int): AUTO, CUSTOM, DOC, SCORE, INT, LONG, FLOAT, DOUBLE, STRING
• Locale constructor: String language, String country, String variant
org.apache.lucene.search.Filter
• Subclasses: BooleanFilter, ChainedFilter, DuplicateFilter, PrefixFilter, QueryWrapperFilter, RangeFilter, SpanFilter, CachingWrapperFilter, TermsFilter, more…
• Use: e.g. in faceted search
• Example: TermsFilter, with a constructor and addTerm(Term)
org.apache.lucene.search.Query (subclasses)
• TermQuery, BooleanQuery, PhraseQuery, MultiPhraseQuery, PrefixQuery, RangeQuery
• MultiTermQuery: FuzzyQuery, WildcardQuery, RegexQuery
• SpanQuery: SpanFirstQuery, SpanNearQuery, SpanNotQuery, SpanOrQuery, SpanRegexQuery, SpanTermQuery (subclass: BoostingTermQuery)
• ConstantScoreQuery, ConstantScoreRangeQuery, DisjunctionMaxQuery, FilteredQuery, MatchAllDocsQuery
• BoostingQuery, FuzzyLikeThisQuery, MoreLikeThisQuery
• ValueSourceQuery: FieldScoreQuery
• CustomScoreQuery
org.apache.lucene.search.Query (methods)
• Query: setBoost(float boost), getBoost, rewrite(IndexReader)
• TermQuery: constructor TermQuery(Term); getTerm
• PhraseQuery: add(Term, optional int position), getTerms (returns Term[]), setSlop(int slop)
org.apache.lucene.search.BooleanQuery
• Constructor: BooleanQuery(optional boolean disableCoord)
• Methods: add (BooleanClause or Query), getClauses (returns BooleanClause[]), setMinimumNumberShouldMatch(int)
• BooleanClause.Occur: MUST, MUST_NOT, SHOULD
Example: an "and/or-ish" query
    BooleanQuery bq;
    float andNess = 0.5f; // 0.0: OR (default), 1.0: AND
    …
    BooleanClause[] clauses = bq.getClauses();
    int numOpt = 0; // number of optional (SHOULD) clauses
    for (int i = 0; i < clauses.length; i++) {
        if (clauses[i].getOccur() == BooleanClause.Occur.SHOULD) numOpt++;
    }
    bq.setMinimumNumberShouldMatch(Math.round(numOpt * andNess));
    // NOTE: if there is no MUST clause, at least 1 SHOULD clause must match
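To see what the andNess setting does to the number of required SHOULD clauses, the arithmetic can be tried without a Lucene index. The sketch below is a stand-alone illustration only: the class name AndNessSketch and its Occur enum are hypothetical stand-ins, not Lucene classes.

```java
// Stand-alone sketch of the "andNess" arithmetic; no Lucene classes are used.
class AndNessSketch {
    // Hypothetical stand-in for BooleanClause.Occur.
    enum Occur { MUST, MUST_NOT, SHOULD }

    // Count the optional (SHOULD) clauses.
    static int countOptional(Occur[] clauses) {
        int n = 0;
        for (Occur o : clauses) {
            if (o == Occur.SHOULD) n++;
        }
        return n;
    }

    // andNess 0.0 behaves like OR, 1.0 like AND over the optional clauses.
    static int minimumShouldMatch(int numOptional, float andNess) {
        return Math.round(numOptional * andNess);
    }

    public static void main(String[] args) {
        Occur[] clauses = { Occur.SHOULD, Occur.MUST, Occur.SHOULD, Occur.SHOULD, Occur.SHOULD };
        int numOpt = countOptional(clauses);
        System.out.println(minimumShouldMatch(numOpt, 0.5f)); // prints 2
    }
}
```

With four optional clauses, andNess = 0.5 requires two of them to match; andNess = 1.0 requires all four.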
org.apache.lucene.search.function.CustomScoreQuery
• CustomScoreQuery constructor: Query, optionally one ValueSourceQuery or a ValueSourceQuery[]
• customScore(int doc, float subQueryScore, float or float[] valSrcScore(s)) returns float; override this method
• Default: subQueryScore * valSrcScores[0] * valSrcScores[1] * …
• FieldScoreQuery constructor: String field, FieldScoreQuery.Type (BYTE, SHORT, INT, FLOAT)
Use cases:
• Weighting in publication type and year (library)
• Geographic proximity (search "pizza")
Publication year: score = 1 + a/(1 + τ), with τ = (t - tp)/t0
[Plot: score against t - tp; axis marks a, 1 and t0]
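The publication-year formula and the default score combination can be checked numerically. The sketch below is plain Java, not a CustomScoreQuery subclass. The slide does not define its symbols, so the reading used here is an assumption: t is the current year, tp the publication year, t0 a time scale in years, and a the extra boost of a brand-new publication. The class and method names are hypothetical.

```java
// Stand-alone sketch of the scoring arithmetic; no Lucene classes are used.
class PubYearBoostSketch {
    // score = 1 + a / (1 + tau), tau = (t - tp) / t0
    // Assumed meaning: t = current year, tp = publication year, t0 = time scale.
    static double boost(double a, double t, double tp, double t0) {
        double tau = (t - tp) / t0;
        return 1.0 + a / (1.0 + tau);
    }

    // Default combination as stated on the slide:
    // subQueryScore * valSrcScores[0] * valSrcScores[1] * ...
    static float defaultCustomScore(float subQueryScore, float[] valSrcScores) {
        float score = subQueryScore;
        for (float v : valSrcScores) {
            score *= v;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(boost(1.0, 2008, 2008, 10)); // new publication: 2.0
        System.out.println(boost(1.0, 2008, 1998, 10)); // one time scale old: 1.5
    }
}
```

Under this reading the boost decays from 1 + a for a new publication towards 1 for very old ones, and equals 1 + a/2 when the age is exactly t0.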
org.apache.lucene.search.Similarity
This is where the real work is done: http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/search/Similarity.html
Query, Document → Score, according to the Vector Space model.
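As a rough feel for the factors behind the score, the sketch below reproduces the formulas of DefaultSimilarity as I recall them from the Lucene 2.x Javadoc (tf as a square root, idf as a logarithm plus one, length normalisation as an inverse square root). Treat them as an assumption and check them against the Javadoc linked above. The single-term score ignores coord, queryNorm and boosts, and the class name is hypothetical.

```java
// Stand-alone sketch of tf-idf style factors; no Lucene classes are used.
class SimilaritySketch {
    // Term frequency factor: sqrt(freq).
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse document frequency: log(numDocs / (docFreq + 1)) + 1.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Field length normalisation: 1 / sqrt(numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Simplified single-term score: tf * idf^2 * lengthNorm
    // (coord, queryNorm and boosts are left out on purpose).
    static float termScore(float freq, int docFreq, int numDocs, int numTerms) {
        float i = idf(docFreq, numDocs);
        return tf(freq) * i * i * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a 4-term field, present in 9 of 10 documents.
        System.out.println(termScore(4f, 9, 10, 4)); // prints 1.0
    }
}
```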
org.apache.lucene.queryParser.QueryParser
String → Query (hooray!)
Grammar notation: ::= definition, () nesting, * repetition, [] optional, | or
    Query  ::= ( Clause )*
    Clause ::= ["+"|"-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )
• "+" | "-": AND | NOT
• <TERM> ":": field
• <TERM>: single term or phrase
• "(" Query ")": nested query
Examples:
• aaa bbb ccc
• +aaa bbb -ccc
• "aaa bbb"
• title:aaa
• title:(+aaa bbb) AND author:"ddd e f"
• aaa* bb*b cc?c
• "aaa bbb"~10 (proximity/slop)
• year:[2000 TO 2005] (inclusive)
• price:{020 TO 100} (not inclusive)
• aaa^3 bbb (boost)
• "aaa bbb"^0.5
• 1\+1 (\ is the escape character)
• aaa~0.8 (fuzzy/min. similarity)
The query text also still passes through the Analyzer.
Range bounds are compared as Strings, so 20 only sorts before 100 when zero-padded to 020. Lucene: Strings only. SOLR: strongly typed fields!
NOT allowed: "aaa* bbb" (wildcard inside a phrase)
NOT allowed: *aaa, ?aaa (leading wildcard)
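The zero-padded bound in price:{020 TO 100} follows from plain String ordering, which can be demonstrated without Lucene. The sketch below only uses String.compareTo; the class StringRangeSketch and its methods are hypothetical names for illustration, not part of any Lucene API.

```java
// Stand-alone sketch of String-ordered range checks; no Lucene classes are used.
class StringRangeSketch {
    // Range test on String order, like [lower TO upper] (inclusive)
    // or {lower TO upper} (not inclusive).
    static boolean inRange(String lower, String upper, String value, boolean inclusive) {
        int lo = value.compareTo(lower);
        int hi = value.compareTo(upper);
        return inclusive ? (lo >= 0 && hi <= 0) : (lo > 0 && hi < 0);
    }

    // Zero-pad a non-negative number so that String order matches numeric order.
    static String pad(int value, int width) {
        return String.format("%0" + width + "d", value);
    }

    public static void main(String[] args) {
        System.out.println("20".compareTo("100") > 0);              // true: "20" sorts after "100"
        System.out.println(inRange("20", "100", "50", true));       // false: unpadded bounds
        System.out.println(inRange(pad(20, 3), "100", pad(50, 3), true)); // true: padded bounds
    }
}
```

Padding has to be applied both when indexing the field values and when building the range bounds, otherwise the two sides are compared on different scales.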
org.apache.lucene.queryParser.QueryParser (limits)
Not every Query can be built by QueryParser (too complicated, or to protect performance):
• "New Yor*"
• *ork
• "New York" within 10 words of "Broadway" and at most 5 words after the start of the field
Not every Query wants to be built by QueryParser. Design the interface, e.g.:
• free-text input (parsed by QueryParser)
• separate interface elements for:
  • fields
  • ranges
  • facets, more like this, …
org.apache.lucene.queryParser.QueryParser (class)
• Constructor: QueryParser(String defaultField, Analyzer)
• Methods: parse(String query) returns Query; setDefaultOperator(QueryParser.Operator); setPhraseSlop(int); setFuzzyMinSim(float); …
• QueryParser.Operator: AND_OPERATOR, OR_OPERATOR
• Analyzer: StandardAnalyzer, RussianAnalyzer, BrazilianAnalyzer, DutchAnalyzer, …
• Analyzer constructor arguments: File stopwords, String[] stopwords, or HashSet stopwords
org.apache.lucene.search.Hits
• Searcher.search returns Hits
• Hits: doc(int n) returns Document; score(int n) returns float; iterator returns HitIterator; length returns int
• HitIterator: next (returns Hit), hasNext (returns boolean), length (returns int)
• Hit: getDocument, getScore
• Document: get(String fieldName) returns String value; getFields returns List fields; …
• Field: name, getValue, …
N.B. Use HitCollector (low-level API) for large numbers of hits.
org.apache.lucene.search.highlight.Highlighter
• Highlighter constructor: optional Formatter, Scorer (fragmentScorer)
• Highlighter methods: setTextFragmenter(Fragmenter); getBestFragments(Analyzer, String fieldName, String text, int maxNumFragments) returns String[] bestFragments; …
• QueryScorer constructor: Query, optionally IndexReader and String fieldName
• Formatter: SimpleHTMLFormatter(String preTag, String postTag); GradientFormatter(Float maxScore, String minForegroundColor, String maxForegroundColor, String minBackgroundColor, String maxBackgroundColor); SpanGradientFormatter
• Fragmenter: SimpleFragmenter(int fragmentSize)
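The preTag/postTag idea behind a Formatter can be shown with a regular expression. The sketch below is not the Highlighter itself: it does no analysis, fragmenting or scoring, and the class TagHighlighterSketch is a hypothetical name. It only wraps whole-word, case-insensitive occurrences of one term in a pair of tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone sketch of tag-based highlighting; no Lucene classes are used.
class TagHighlighterSketch {
    // Wrap every whole-word, case-insensitive occurrence of term in preTag/postTag.
    static String highlight(String text, String term, String preTag, String postTag) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b", Pattern.CASE_INSENSITIVE);
        // quoteReplacement keeps '$' and '\' in the tags literal; $0 is the matched text.
        String replacement = Matcher.quoteReplacement(preTag) + "$0" + Matcher.quoteReplacement(postTag);
        return p.matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene in action: lucene search", "lucene", "<B>", "</B>"));
        // prints <B>Lucene</B> in action: <B>lucene</B> search
    }
}
```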
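org.apache.lucene.search.spell.SpellChecker
Based on an N-gram index.
• SpellChecker constructor: Directory (spellIndex)
• SpellChecker methods: indexDictionary(Dictionary); suggestSimilar(String word, int numSug, optionally IndexReader, String field, boolean morePopular) returns String[] words; setAccuracy(float minScore); …
• Dictionary: PlainTextDictionary (from a File, InputStream or Reader); LuceneDictionary (from an IndexReader and a String field)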
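The N-gram idea behind the spell index can be illustrated in a few lines. The sketch below is an assumption-laden simplification: it splits words into fixed-length character n-grams and counts how many two words share, which is roughly how candidate words are found. It is not the SpellChecker's actual indexing or ranking, and the class NGramSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-alone sketch of character n-grams; no Lucene classes are used.
class NGramSketch {
    // All character n-grams of a word, in order of appearance.
    static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Number of distinct n-grams that two words have in common.
    static int shared(String a, String b, int n) {
        Set<String> common = new HashSet<>(ngrams(a, n));
        common.retainAll(new HashSet<>(ngrams(b, n)));
        return common.size();
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 3));           // prints [luc, uce, cen, ene]
        System.out.println(shared("lucene", "lucine", 3)); // prints 1
    }
}
```

A misspelling in the middle of a word destroys several 3-grams at once, which is one reason to index more than one n-gram length.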