Handy Tools and Frameworks … for our projects and work
Tools …
• Apache Lucene
• WEKA
• itpp
• Misc
  • *tex
  • imagemagick, inkscape
  • graphviz, gnuplot, gs
Tech presentation by Pavel Patz
http://lucene.apache.org/java/docs/index.html
APACHE LUCENE
Apache Lucene
• Apache Lucene is a free, open-source information retrieval software library
  • The name is Doug Cutting’s wife’s middle name (and her maternal grandmother’s first name)!
  • Also one of the most powerful open-source indexers / search engines
• Library for Java, C (with Perl and Python bindings), C++, Objective C, Delphi, Ruby, PHP, Common Lisp and C# (yep, even .NET)
• Fast and efficient solution
  • Indexes over 20 MB/minute on a 1.5 GHz Pentium M
  • Index size is roughly 20-30% of the size of the indexed text
• Widely adopted solution
  • Wikipedia (and MediaWiki as well), E.ON, Beagle, Strigi (desktop search), isoHunt, Eclipse, Jira, Digg (it!), abclinuxu.cz, BlogScope, CNET, European Bioinformatics Institute, etc.
Apache Lucene – Text processing
• Stemmers
  • Remove suffixes to find the root of a word
  • Vs. lemmatizers
• Create index storage, a.k.a. Directory
  • In a database, in RAM, or on the file system
• Create Analyzer
  • We need to separate tokens, find roots and exclude stop words somehow
• Create IndexWriter
  • Based on the Directory and the Analyzer
• For each “record” (file, row in a table…) create a Document and store it
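The analysis step on this slide (separate tokens, find roots, exclude stop words) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Lucene's Analyzer API: the stop-word list, the suffix list and the names `naive_stem` and `analyze` are assumptions made up for the example, and a real stemmer is considerably more careful.

```python
import re

# Hypothetical stop-word and suffix lists, chosen only for illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list[str]:
    """Lowercase, split into tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(analyze("The indexer is indexing the indexed documents"))
# ['indexer', 'index', 'index', 'document']
```

Note that "indexer" is left untouched: a suffix stripper only knows the suffixes it was given, which is one reason stemmers and lemmatizers behave differently.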
Apache Lucene – Directory & Indexing
• A Directory consists of documents
• A Document consists of fields
  • Like ID, content, timestamps: whatever you want to store
• Fields
  • Can be stored, compressed (useful for long strings), or not stored
  • Content of stored fields can be retrieved directly from the search result
  • Content can be indexed as
    • Tokenized
    • Not tokenized (for instance brand names such as “Faster Runner”)
    • Indexed without norms (= no length normalization or index-time boost in scoring)
    • Not indexed (but can be stored)
• Indexing
  • Each document and/or field can have its own “boost” value
  • The score of a result depends on many factors; the boost value multiplies the score of the document / field
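The difference between tokenized, not tokenized and stored-only fields is easier to see on a toy inverted index. The sketch below is plain Python, not Lucene code; the sample documents and the `TOKENIZED` / `NOT_TOKENIZED` sets are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Each document is a dict of field name -> value (hypothetical sample data).
documents = [
    {"id": "1", "brand": "Faster Runner", "content": "light running shoes"},
    {"id": "2", "brand": "Slow Walker", "content": "running socks"},
]

# Field handling mirrors the slide: tokenized, not tokenized, or stored only.
TOKENIZED = {"content"}
NOT_TOKENIZED = {"brand"}

# (field, term) -> set of document ids
index = defaultdict(set)

for doc in documents:
    for field, value in doc.items():
        if field in TOKENIZED:
            terms = value.lower().split()
        elif field in NOT_TOKENIZED:
            terms = [value]  # the whole value is a single term
        else:
            terms = []  # stored only, not searchable
        for term in terms:
            index[(field, term)].add(doc["id"])

print(sorted(index.get(("content", "running"), set())))      # ['1', '2']
print(sorted(index.get(("brand", "Faster Runner"), set())))  # ['1']
print(sorted(index.get(("brand", "Runner"), set())))         # []
```

The last lookup finds nothing because the brand was indexed as one term, so only the exact value matches.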
Apache Lucene – Search
• We have an index. So open it!
  • Use IndexSearcher; share a single instance (singleton) for better performance
• Prepare Query
  • Lucene has a simple query language
  • We should use the same analyzer for querying as for indexing
  • We can search in fields, boost parts of the query, make Boolean queries, etc.
• Execute Query
• Enjoy the results
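Why the same analyzer should be used for querying as for indexing can be shown with a toy index. This is a plain-Python sketch under stated assumptions, not the Lucene query API: `analyze` and `search_all` are hypothetical helpers, and the "analyzer" only lowercases and strips a trailing "s".

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase and strip a trailing 's' (illustration only)."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t
            for t in text.lower().split()]


docs = {1: "Fast search engines", 2: "Search libraries for Java", 3: "Java engines"}

# Index time: term -> set of document ids
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term].add(doc_id)


def search_all(query, analyzer):
    """Boolean AND: return ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in analyzer(query)]
    return sorted(set.intersection(*postings)) if postings else []


print(search_all("Search Engines", analyze))    # [1]
print(search_all("Search Engines", str.split))  # []
```

With a different analyzer at query time, the query terms "Search" and "Engines" never match the indexed terms "search" and "engine", so the same query returns nothing.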
Weka 3: Data Mining Software in Java
• Collection of machine learning algorithms for data mining tasks
• Library AND environment in one
• Tools for data pre-processing, classification, regression, clustering, association rules, and visualization
WEKA Tools
• Collection of machine learning algorithms for data mining tasks
• Library AND environment
• Tools for data pre-processing, classification, regression, clustering, association rules, and visualization
• Own data format (ARFF)
  • Text oriented, easily editable
• Many algorithms (classifiers, preprocessors)
  • Many parameters
  • Can be set in the GUI or through the API
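Because ARFF is plain text, a file can be produced by any script. The sketch below writes a tiny ARFF document from Python; the relation name, attributes and rows are made-up sample data, and `to_arff` is a hypothetical helper, not part of WEKA. It handles only simple numeric and nominal values without quoting.

```python
# Hypothetical toy data set; attribute names and values are made up.
attributes = [
    ("temperature", "numeric"),
    ("outlook", "{sunny,rainy}"),
    ("play", "{yes,no}"),
]
rows = [(21.5, "sunny", "yes"), (12.0, "rainy", "no")]


def to_arff(relation, attributes, rows):
    """Build ARFF text: header (@relation, @attribute) followed by @data rows."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} {kind}" for name, kind in attributes]
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


arff_text = to_arff("weather_demo", attributes, rows)
print(arff_text)
```

The header declares one `@attribute` line per column, and each line after `@data` is one comma-separated instance, which is what makes the format easy to edit by hand.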
WEKA Modules
• The WEKA GUI consists of several parts
  • Explorer
    • Data analysis, visualisation, model management
  • Knowledge Flow
    • Streaming data processing
  • Experimenter
    • Parameterized tests, statistics, performance evaluation, significance tests
  • CLI
    • Command line!
ITPP Intro
• Do you Matlab?
  • Nope? But there are a number of examples in *.m
  • … and the API is actually nice
• You can IT++
  • C++ library of mathematical, signal processing and communication classes and functions
  • IT++ makes extensive use of existing open-source or commercial libraries for increased functionality, speed and accuracy, in particular BLAS, LAPACK and FFTW
  • IT++ should work on GNU/Linux, Sun Solaris, Microsoft Windows (with Cygwin, MinGW/MSYS or Microsoft Visual C++) and Mac OS X
ITPP Features
• Basic mathematical features
  • templated vector and matrix classes
  • sparse vector and matrix classes
  • elementary functions on vectors and matrices
  • statistics classes and functions
  • matrix decompositions such as eigenvalue, Cholesky, LU, Schur, SVD, and QR
  • solving linear systems of equations (including over- and underdetermined)
  • random number generation (Mersenne Twister generator)
  • binary and Galois types (scalars, vectors, and matrices)
  • integration of 1-dimensional functions
  • unconstrained nonlinear optimization (Quasi-Newton search)
• Signal processing
  • filter functions and classes
  • frequency domain filtering
  • FFT, DFT, DCT, and Hadamard transforms
  • time and frequency domain windows
  • evaluating and finding roots of polynomials (and inverse operations)
  • filter design functions
  • fast independent component analysis (fast ICA)
• Communications
  • modulators (BPSK, PSK, PAM, QAM)
  • vector modulators (e.g. for OFDM and MIMO)
  • OFDM and CDMA modulators
  • pulse shaping filters (including RC and RRC)
  • binary symmetric (BSC) and additive white Gaussian noise (AWGN) channels
  • multipath fading channels (both frequency-flat and frequency-selective)
  • COST 207, COST 257, and ITU channel models
  • Hamming, extended Golay, and CRC codes
  • BCH and Reed-Solomon codes
  • convolutional and punctured convolutional codes
  • recursive convolutional codes, turbo codes, interleavers
• Protocol simulation
  • event-based simulation classes
  • signals and slots for simplified syntax
  • TCP clients and servers, selective repeat ARQ
  • queue classes, packet generators
• Source coding
  • Scalar Quantizer (SQ) and Vector Quantizer (VQ) classes and functions for training them
  • LPC, LSF, and cepstrum parameter calculation for speech processing
  • Gaussian Mixture Modeling
  • reading and saving several different audio file formats
  • reading and saving images in PNM format
Building ITPP
• Cygwin & Linux
• Autotools
  • ./configure [--without-blas --without-lapack --without-fft]
  • make -j
  • make -j install
((pdf)La | xe)TeX
• TeX makes beautiful PDFs
  • Looks professional; handles math and graphics
  • Typesetting can be done like SW development
  • Portable, vector-oriented, blah blah
  • Scriptable
Beautiful figures
• ImageMagick
  • Converts many formats (e.g. to PDF)
• GraphViz
  • Creates graphs from text files
  • Many layouts
  • PS, PDF, SVG output
  • Java/.NET alternative
• GNUPlot
  • Non-graph plots
  • Many flavors of graphs (pie charts, etc.)
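"Graphs from text files" means GraphViz reads a description in its DOT language, which a test program can emit directly. A minimal sketch, assuming a hypothetical edge list and helper name `to_dot`:

```python
# Hypothetical edge list describing a small pipeline.
edges = [
    ("test data", "test program"),
    ("test program", "results"),
    ("results", "report"),
]


def to_dot(edges, name="pipeline"):
    """Return a directed graph in GraphViz DOT syntax, one edge per line."""
    lines = [f"digraph {name} {{"]
    for src, dst in edges:
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


dot_text = to_dot(edges)
print(dot_text)
```

Saved to a file such as `pipeline.dot` (an assumed name), the text can be rendered with `dot -Tpdf pipeline.dot -o pipeline.pdf`; swapping `-Tpdf` for `-Tsvg` or `-Tps` gives the other output formats listed on the slide.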
All together
• Put it all together
  • Test data
  • Test program
  • Text output of results (gnuplot, graphviz)
  • Prepared source for the report (LaTeX)
• = On-demand generated seminar projects ;)
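The "text output of results" step can be as simple as having the test program write a data file and a matching gnuplot script. The sketch below only builds the two texts; the measurements, file names and helper names are hypothetical.

```python
# Hypothetical measurements: (input size, runtime in seconds).
results = [(100, 0.12), (200, 0.25), (400, 0.55)]


def gnuplot_data(results):
    """One whitespace-separated 'x y' pair per line, as gnuplot reads it."""
    return "".join(f"{x} {y}\n" for x, y in results)


def gnuplot_script(data_file, output_file):
    """Script that plots column 1 against column 2 into a PDF file."""
    return (
        "set terminal pdfcairo\n"
        f'set output "{output_file}"\n'
        'set xlabel "input size"\n'
        'set ylabel "runtime [s]"\n'
        f'plot "{data_file}" using 1:2 with linespoints title "runtime"\n'
    )


print(gnuplot_data(results))
print(gnuplot_script("results.dat", "results.pdf"))
```

Once both texts are written to disk, running gnuplot on the script produces a PDF that the prepared LaTeX source can include, so the whole report regenerates from fresh test data.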