Software Libraries and Systems
Information Retrieval
Michal Laclavík

Tools
• IR tools: Nutch + Hadoop
• IR API: Lucene
• Information gathering
  • Crawler: Nutch
  • Text operations: Lucene, GATE
  • Indexing: Lucene
  • Link processing: Nutch
• Document base: converters, compression, encoding
  • JavaMail
  • Tika: PDFBox, POI, TextMining
  • zip
• Search
  • Query formulation and query operations: Solr
  • Query processing: Solr
  • Returning results to the user interface: Solr
  • User feedback: ?
• Extraction
  • GATE
  • Ontea
  • Regexes
Bratislava, 4 November 2013

Tools
• IR libraries & engines
  • Lucene
  • Egothor
• Lucene
  • Nutch
  • Solr
  • Ports

Lucene Indexing
• IndexWriter
• Directory
  • FSDirectory, RAMDirectory, MMapDirectory
• Analyzer
• Document
  • Collection of fields
• Field
  • Keyword, UnIndexed, UnStored, Text

    // Open the on-disk index directory and configure the writer
    Directory dir = FSDirectory.open(new File(indexPath));
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43, analyzer);
    iwc.setOpenMode(OpenMode.CREATE);  // overwrite any existing index
    IndexWriter writer = new IndexWriter(dir, iwc);

    // Build one document from its fields and add it to the index
    Document doc = new Document();
    doc.add(new StringField("ctg", value, Field.Store.YES));    // indexed as a single token, stored
    doc.add(new TextField(fieldName, value2, Field.Store.NO));  // tokenized, not stored
    // VecTextField is not a Lucene core class; presumably a custom field type
    doc.add(new VecTextField("title", data, Field.Store.YES));
    writer.addDocument(doc);

Lucene Indexing 2
• Indexing dates
• Boosting
  • Field.setBoost
• Indexing numbers
  • Adding zeros, analyzers
• Sorting
  • Not tokenized, Field Keyword
• Directory
  • FSDirectory, RAMDirectory
• Term vector
  • Field.UnStored("subject", subject, true);

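The "adding zeros" bullet refers to left-padding numbers so that their string order matches their numeric order, since index terms are compared as strings. A minimal sketch in plain Java, without any Lucene dependency; the class name `PaddedNumbers` and the width of 10 digits are illustrative assumptions, not something the slides specify.

```java
import java.util.Arrays;
import java.util.Locale;

class PaddedNumbers {
    // Hypothetical fixed width; it must cover the largest value to be indexed.
    static final int WIDTH = 10;

    // Left-pad a non-negative number with zeros so string order equals numeric order.
    static String pad(long n) {
        if (n < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", n);
    }

    public static void main(String[] args) {
        String[] raw = {"7", "10", "200"};
        String[] padded = {pad(7), pad(10), pad(200)};
        Arrays.sort(raw);     // string order puts "7" last
        Arrays.sort(padded);  // padded strings sort numerically
        System.out.println(Arrays.toString(raw));     // [10, 200, 7]
        System.out.println(Arrays.toString(padded));  // [0000000007, 0000000010, 0000000200]
    }
}
```

The same reasoning is why dates written as yyyyMMdd, as in a range such as [20040101 TO 20041231], can be compared as plain strings.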
Lucene Searching
• IndexSearcher
• Term
• Query
  • Boolean, Phrase, Prefix, Range, Fuzzy (Levenshtein)
• TermQuery
• Hits

    // Open the index for reading
    directory = FSDirectory.open(new File(index));
    reader = DirectoryReader.open(directory);
    searcher = new IndexSearcher(reader);

    // Exact-match queries on three fields, each with its own boost
    queryL = new BooleanQuery();
    Query name = new TermQuery(new Term("name_exact", query));
    Query alias = new TermQuery(new Term("alias_exact", query));
    Query wiki = new TermQuery(new Term("wikipedia_exact", query));
    name.setBoost(0.40f);
    alias.setBoost(0.30f);
    wiki.setBoost(0.30f);
    // SHOULD: a document matches if at least one of the clauses matches
    ((BooleanQuery) queryL).add(name, Occur.SHOULD);
    ((BooleanQuery) queryL).add(alias, Occur.SHOULD);
    ((BooleanQuery) queryL).add(wiki, Occur.SHOULD);

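Fuzzy queries match terms that lie within a small Levenshtein (edit) distance of the query term. The sketch below shows only the textbook dynamic-programming distance in plain Java; it is not how Lucene implements fuzzy matching internally, and the class name `EditDistance` is an illustrative assumption.

```java
class EditDistance {
    // Levenshtein distance: minimum number of single-character insertions,
    // deletions and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;  // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;  // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;  // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "lucent"));   // 1
        System.out.println(levenshtein("kitten", "sitting"));  // 3
    }
}
```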
Lucene Searching 2
• Query q = QueryParser.parse("search", "field", new SimpleAnalyzer());
• +pubdate:[20040101 TO 20041231] Java AND (Jakarta OR Apache)
• Query.toString()
• Scoring
  • Similarity, DefaultSimilarity
• Sorting
  • By field, by multiple fields
• MultiFieldQueryParser
• Filtering

    // Search several fields at once, each with its own boost
    String[] fields = new String[] {"name", "alias", "text", "wikipedia"};
    Map<String, Float> boosts = new HashMap<String, Float>();
    boosts.put("name", 0.40f);
    boosts.put("alias", 0.30f);
    boosts.put("text", 0.20f);
    boosts.put("wikipedia", 0.10f);
    MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_43, fields, analyzer, boosts);
    queryL = parser.parse(query);
    // s is the IndexSearcher; return the topK best-scoring documents
    TopDocs results = s.search(queryL, topK);
    ScoreDoc[] hits = results.scoreDocs;

Lucene Searching 3
• Custom sort method
• Distance search

Lucene Analysis
• Input: XY&Z Corporation - xyz@example.com
• WhitespaceAnalyzer
  • [XY&Z] [Corporation] [-] [xyz@example.com]
• SimpleAnalyzer (drops numbers)
  • [xy] [z] [corporation] [xyz] [example] [com]
• StopAnalyzer
  • [xy] [z] [corporation] [xyz] [example] [com]
• StandardAnalyzer
  • [xy&z] [corporation] [xyz@example.com]

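The difference between the first two analyzers can be imitated in a few lines of plain Java. This is only a sketch of the tokenization rules (split on whitespace, versus split on non-letters and lowercase); it does not use the Lucene classes, and the names `TokenizeSketch`, `whitespace` and `simple` are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class TokenizeSketch {
    // WhitespaceAnalyzer-like: split on whitespace only, keep case and punctuation.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // SimpleAnalyzer-like: split on anything that is not a letter, then lowercase.
    // Digits are not letters, so numbers disappear from the token stream.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespace(input));  // [XY&Z, Corporation, -, xyz@example.com]
        System.out.println(simple(input));      // [xy, z, corporation, xyz, example, com]
    }
}
```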
Lucene Analysis 2
• Indexing
• Querying
  • QueryParser analyzes the query text; a TermQuery term is not analyzed
• Results
  • Tokens: position, type
  • Terms: position
• TokenStream, Tokenizer, TokenFilter

Lucene Analysis 3
• Synonyms, aliases
  • Same position (phrase query)
• UTF-8
  • Encodings, HTML characters
  • Content-type
• Nutch analysis

Sandbox
• Development tools
  • Lucli CLI
  • Luke: index toolbox
• Snowball analyzer
• T9 indexing example
• Highlighter
• BerkeleyDB

Lucene Doc format
• Apache Tika
• XML
  • SAX parser: Xerces
  • Digester: Apache Jakarta
• PDF
  • PDFBox.org
  • Built-in support
• HTML
  • JTidy.sf.net
  • NekoHTML
• Word
  • POI: Jakarta project
  • TextMining.org
• RTF
  • javax.swing.text.rtf

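Before an XML document can be indexed, its text has to be pulled out of the markup. A minimal sketch using the SAX parser that ships with the JDK; the class name `XmlTextExtractor` is an illustrative assumption, and a real converter would typically map individual elements to separate index fields.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XmlTextExtractor {
    // Collect all character data from an XML string and ignore the tags.
    static String extractText(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);  // called for each run of text
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        // Collapse runs of whitespace so the result is a single clean line.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Lucene</title> <body>in Action</body></doc>";
        System.out.println(extractText(xml));  // Lucene in Action
    }
}
```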
Lucene Ports
• CLucene
• dotLucene
• Plucene (Perl)
• Lupy (Python)
• PyLucene (GCJ + SWIG)

Nutch
• Built on Lucene
• Fetcher, searcher interface
• Scales to several billion pages
• Ranking ???
• Hadoop
  • MapReduce implementation

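The MapReduce model that Hadoop implements can be illustrated with the usual word-count example. The sketch below runs in a single JVM with plain Java collections and does not use the Hadoop API; the names `WordCountSketch`, `map`, `reduce` and `run` are illustrative assumptions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) {
                out.add(new SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // Reduce phase: sum all values that were emitted for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key (the "shuffle"), then reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or", "not to be")));
        // {be=2, not=1, or=1, to=2}
    }
}
```

In a real cluster the map and reduce calls run on different machines and the grouping step moves data over the network; only the two small functions have to be written by the user.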
Other Use Cases
• JGuru
• SearchBlox
• Alias-i

Linux tools
• catdoc
  • xls, doc
• OpenOffice
• pdftotext (Xpdf)
• Encoding
  • enca

Other libraries
• QTag
  • POS tagging
• Stemming
  • Snowball
  • Porter
  • Tvaroslovník (morphological dictionary), JÚĽŠ
• SimMetrics
  • Similarity measures: Levenshtein, cosine
• GATE

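Of the similarity measures listed for SimMetrics, the cosine measure compares two texts as term-frequency vectors. A minimal sketch in plain Java that does not use the SimMetrics API; the class name `CosineSketch` and the whitespace tokenization are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CosineSketch {
    // Count how often each whitespace-separated, lowercased token occurs.
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tf.merge(t, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine of the angle between the two term-frequency vectors:
    // 1.0 for identical bags of words, 0.0 when no term is shared.
    static double cosine(String a, String b) {
        Map<String, Integer> tfA = termFreq(a);
        Map<String, Integer> tfB = termFreq(b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : tfA.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = tfB.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : tfB.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;  // an empty text is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine("information retrieval", "information extraction"));
    }
}
```

Unlike the Levenshtein distance, which works on characters, this measure ignores word order and spelling variation and only looks at shared tokens.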
Demo
• From the previous lecture
  • GATE
  • http://gate.ac.uk/sale/talks/gate-course-july09/slides-pdf/slides.html
• Lucene

Other Tools
• Apache UIMA
  • Text processing (information extraction)
• OpenNLP
  • Machine learning for text analysis, e.g. information extraction
• MOSES
  • Machine-learning-based language translation

Available data resources in the Slovak language
• Corpus
  • http://korpus.juls.savba.sk/
• Organizations with data resources in multiple languages, usable for machine translation
  • http://www.tasr.sk/
  • http://www.sita.sk
  • http://www.skrivanek.com/
• Freely available resources
  • http://sk.wikipedia.org
  • http://sk.wiktionary.org
• Dictionaries
  • http://slovnik.azet.sk/
  • http://slovniky.lingea.sk/
  • http://www.sk-spell.sk.cx/mass-msas
• Data
  • http://sk-spell.sk.cx/
  • http://www.sk-spell.sk.cx/thesaurus/
  • http://www.sk-spell.sk.cx/biblia-sk/
  • http://www.sk-spell.sk.cx/OCR