Software Libraries and Systems
Information Retrieval
Michal Laclavík
08.11.2007

Tools
• IR libraries & engines
  • Lucene
  • Egothor
  • Xapian
  • mnoGoSearch
• Lucene
  • Nutch
  • Ports
  • SearchBlox

Lucene Indexing
• IndexWriter
• Directory
  • FSDirectory, RAMDirectory
• Analyzer
• Document
  • Collection of fields
• Field
  • Keyword, UnIndexed, UnStored, Text

Lucene Indexing 2
• Indexing dates
• Boosting
  • Field.setBoost
• Indexing numbers
  • Adding leading zeros, analyzers
• Sorting
  • Not tokenized, Field.Keyword
• Directory
  • FSDirectory, RAMDirectory
• Term vector
  • Field.UnStored("subject", subject, true);

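The "adding leading zeros" point can be illustrated without Lucene: index terms are compared as strings, so numbers only sort and range-match correctly when they are padded to a fixed width. The sketch below is a minimal illustration using the Java standard library only; the class `ZeroPadDemo`, the helper `pad`, and the width of 5 are assumptions made for this example, not part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class ZeroPadDemo {
    // Hypothetical helper: left-pads a non-negative number with zeros to a fixed width.
    static String pad(int n, int width) {
        if (n < 0) {
            throw new IllegalArgumentException("only non-negative numbers are handled");
        }
        return String.format(Locale.ROOT, "%0" + width + "d", n);
    }

    public static void main(String[] args) {
        // Unpadded numbers sort as strings, which is not numeric order.
        List<String> raw = new ArrayList<>(Arrays.asList("7", "42", "100"));
        Collections.sort(raw);
        System.out.println("raw:    " + raw);

        // Padded numbers sort as strings in numeric order.
        List<String> padded = new ArrayList<>();
        for (int n : new int[] {7, 42, 100}) {
            padded.add(pad(n, 5));
        }
        Collections.sort(padded);
        System.out.println("padded: " + padded);
    }
}
```

The same reasoning applies to dates written as yyyyMMdd strings, where string order and chronological order coincide.
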
Lucene Searching
• IndexSearcher
• Term
• Query
  • Boolean, Phrase, Prefix, Range, Fuzzy (Levenshtein)
• TermQuery
• Hits

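Fuzzy matching is based on the Levenshtein (edit) distance between terms. As a rough illustration of that measure, here is a textbook dynamic-programming version in plain Java. It is a sketch of the distance itself, not Lucene's own implementation, and the class name `LevenshteinDemo` is made up for this example.

```java
class LevenshteinDemo {
    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b (two-row dynamic programming).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
    }
}
```
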
Lucene Searching 2
• Query q = QueryParser.parse("search", "field", new SimpleAnalyzer());
• +pubdate:[20040101 TO 20041231] Java AND (Jakarta OR Apache)
• Query.toString()
• Scoring
  • Similarity, DefaultSimilarity
• Sorting
  • By field, by multiple fields
• MultiFieldQueryParser
• Filtering

Lucene Searching 3
• Custom sort method
• Distance search

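A distance search orders hits by how far each one lies from a reference point instead of by relevance score. The sketch below shows only that ordering step in plain Java, assuming each hit carries integer x/y coordinates. The names `DistanceSortDemo`, `Place` and `sortByDistance` are hypothetical, and wiring such a comparator into Lucene's custom sort mechanism is not shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DistanceSortDemo {
    // Hypothetical hit: a name plus integer coordinates.
    static final class Place {
        final String name;
        final int x;
        final int y;

        Place(String name, int x, int y) {
            this.name = name;
            this.x = x;
            this.y = y;
        }
    }

    // Euclidean distance from the place to the reference point.
    static double distance(Place p, int fromX, int fromY) {
        return Math.hypot(p.x - fromX, p.y - fromY);
    }

    // Returns the place names ordered from nearest to farthest.
    static List<String> sortByDistance(List<Place> places, final int fromX, final int fromY) {
        List<Place> sorted = new ArrayList<>(places);
        sorted.sort(Comparator.comparingDouble(p -> distance(p, fromX, fromY)));
        List<String> names = new ArrayList<>();
        for (Place p : sorted) {
            names.add(p.name);
        }
        return names;
    }
}
```
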
Lucene Analysis
• Input: XY&Z Corporation - xyz@example.com
• WhitespaceAnalyzer
  • [XY&Z] [Corporation] [-] [xyz@example.com]
• SimpleAnalyzer (drops digits)
  • [xy] [z] [corporation] [xyz] [example] [com]
• StopAnalyzer
  • [xy] [z] [corporation] [xyz] [example] [com]
• StandardAnalyzer
  • [xy&z] [corporation] [xyz@example.com]

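The difference between the first two analyzers can be imitated in a few lines of plain Java: one splits on whitespace only, the other keeps runs of letters and lowercases them, which is also why it drops digits. This is a simplified sketch of the behaviour listed above, not Lucene's code; `AnalyzerSketch` and its method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

class AnalyzerSketch {
    // Splits on whitespace only; case and punctuation are left untouched.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Keeps maximal runs of letters, lowercased; digits and punctuation act as separators.
    static List<String> simpleTokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespaceTokens(text));
        System.out.println(simpleTokens(text));
    }
}
```
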
Lucene Analysis 2
• Indexing
• Querying
  • QueryParser analyzes the query text; a TermQuery is not analyzed
• Results
  • Tokens: position, type
  • Terms: position
• TokenStream, Tokenizer, TokenFilter

Lucene Analysis 3
• Synonyms, aliases
  • Same position (phrase query)
• UTF-8
  • Encodings, HTML characters
  • Content-type
• Nutch analysis
  • The quick

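Placing a synonym at the same position as the original term is what lets a phrase query match either word. A minimal way to picture this is a token stream in which each token carries a position increment: 1 for a regular term, 0 for an injected synonym. The sketch below models that idea in plain Java; `SynonymPositions`, `Token` and the quick/fast synonym pair are assumptions for illustration and do not use Lucene's classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SynonymPositions {
    // Hypothetical token: a term plus its position increment relative to the previous token.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Emits each term with increment 1 and each of its synonyms with increment 0.
    static List<Token> withSynonyms(List<String> terms, Map<String, List<String>> synonyms) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, 1));
            List<String> alternatives = synonyms.get(term);
            if (alternatives != null) {
                for (String alternative : alternatives) {
                    out.add(new Token(alternative, 0));
                }
            }
        }
        return out;
    }

    // Resolves increments to absolute positions, formatted as "position:term".
    static List<String> describe(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int position = -1; // the first token with increment 1 lands on position 0
        for (Token t : tokens) {
            position += t.positionIncrement;
            out.add(position + ":" + t.term);
        }
        return out;
    }

    // Small demonstration with one made-up synonym pair.
    static List<String> demo() {
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("quick", Arrays.asList("fast"));
        return describe(withSynonyms(Arrays.asList("quick", "fox"), synonyms));
    }
}
```

In the demonstration, "quick" and "fast" share position 0, so a phrase query for either "quick fox" or "fast fox" would see two adjacent positions.
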
Sandbox
• Development tools
  • Lucli (command-line interface)
  • Luke (index toolbox)
• Snowball analyzer
• T9 indexing example
• Highlighter
• Berkeley DB

Lucene Document Formats
• XML
  • SAX parser (Xerces)
  • Digester (Apache Jakarta)
• PDF
  • PDFBox.org
  • Built-in support
• HTML
  • JTidy.sf.net
  • NekoHTML
• Word
  • POI (Jakarta project)
  • TextMining.org
• RTF
  • javax.swing.text.rtf

Tools
• DocSearcher
• Docco
• SearchBlox

Lucene Ports
• CLucene
• dotLucene
• Plucene (Perl)
• Lupy (Python)
• PyLucene (GCJ + SWIG)

Nutch
• Built on Lucene
• Fetcher, searcher interface
• Scalable to several billion pages
• Ranking ???
• Hadoop
  • MapReduce implementation

Other Use Cases
• JGuru
• SearchBlox
• Alias-i

Linux Tools
• catdoc
  • xls, doc
• OpenOffice
• pdftotext (Xpdf)
• Encoding
  • enca

Other Libraries
• QTag
  • POS tagging
• Stemming
  • Snowball
  • Porter
  • Tvaroslovník, JULS
• SimMetrics
  • Similarity measures: Levenshtein, cosine
• GATE

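Alongside Levenshtein distance, the cosine measure is a common string similarity: each text becomes a vector of term counts, and the similarity is the cosine of the angle between the two vectors. The sketch below is a plain-Java illustration of that formula, assuming whitespace tokenization and lowercasing. It does not use the SimMetrics API, and `CosineSketch` is a made-up name.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CosineSketch {
    // Counts lowercased, whitespace-separated terms.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!term.isEmpty()) {
                Integer old = counts.get(term);
                counts.put(term, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    // Cosine of the angle between the two term-count vectors; 0.0 if either text has no terms.
    static double cosine(String a, String b) {
        Map<String, Integer> va = termCounts(a);
        Map<String, Integer> vb = termCounts(b);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = vb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int count : vb.values()) {
            normB += count * count;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```
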