Writing a Perl XS swig interface to the CLucene C++ text search engine

Explore the process of creating a Perl XS and SWIG interface to the CLucene C++ text search engine. Learn about different technical options, investigating CLucene, interfacing Perl to C++, and more. Understand the problem of Boolean search optimization with over 20,000 CVs for recruitment software. Delve into the high-level answers and the challenges faced during the implementation process.

Presentation Transcript


  1. Writing a Perl XS swig interface to the CLucene C++ text search engine
     Peter Edwards
     Perl XS and SWIG interface to CLucene C++ text search engine

  2. Introduction
     • Peter Edwards ~ background
     • Subject ~ writing a Perl XS swig interface to the CLucene C++ text search engine

  3. Aims
     • Give an idea of the process involved in selecting and using an external library from Perl
     • Introduction to extending Perl using XS, swig, GNU autotools
     • Entertainment
     • Audience: What is your background and interest?

  4. Topics
     • Understanding the Problem
     • The Answer (at a high level)
     • Technical Options
     • Investigating Options
     • Writing a perl / C++ Interface
     • Layers and Components
     • Lessons Learned
     (Side labels on the slide: Process, Extending Perl)

  5. Terms
     • Perl ~ Pathologically Eclectic Rubbish Lister
         $_ = "wftedskaebjgdpjgidbsmnjgc"; tr/a-z/oh, turtleneck Phrase Jar!/; print;
     • Perl XS ~ eXternal Subroutine
       allows a perl program to call a C language subroutine
       XS is also the “glue” language specifying the calling interface
       contains complex “perlguts” stuff that will destroy your sanity
     • SWIG ~ Simplified Wrapper and Interface Generator
       makes it easy to call a C/C++ library from many languages (perl, python, ruby, PHP…)
     • C++ ~ Object Oriented version of C programming language
     • text search ~ boolean searching of stemmed words, wildcards
     • CLucene ~ C++ text search engine based on Java Lucene

  6. Understanding the Problem
     • Recruitment software written in Perl
     • 20,000+ candidate Word CVs/resumes
     • Boolean searching using words or partial words and wildcards
       e.g. (“BA” or “MA”) and “literature”
     • Combined with SQL searching
       e.g. geographic area, skill profile codes, pay rate
     • Speed < 2 seconds
     • Old system used dtSearch proprietary s/w

  7. The Answer (at a high level)
     Load
     • Convert candidate CVs from Word to text using wvWare (OpenOffice) converter
     • Index text against candidate no.
     Search
     • Search text -> cand nos -> SQL temp table
     • Normal SQL search on other criteria

  8. Technical Options (at 2003/4)
     Proprietary
     • dtSearch ~ cost; hard to get cand nos out; Windows interface when perl app is Web
     Open Source
     • Java Lucene ~ slow but good API and power
     • C++ CLucene ~ alpha quality rewrite of Lucene in Visual C++ as degree project by Ben van Klinken
     • Perl CPAN (PLucene etc.), listed at
       http://search.cpan.org/modlist/String_Language_Text_Processing

  9. Investigating Perl Options
     • Wrote test harness to load 1000 CVs then do some searches
     • Tried about 5 CPAN modules
     • PLucene search speed okay for small volumes but exponential increase in insert time: >60 seconds per insert
     • Why? Tokenises doc, multi-lingual word stemming, adds doc id to reverse lookup index for each stem token
     • Other modules faster but search options weak
     Need to look further
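The indexing step described on this slide (tokenise the document, then record the document id against every token in a reverse lookup index) and the Boolean query from the problem statement, (“BA” or “MA”) and “literature”, can be sketched in a few lines of C. This is a toy under stated assumptions, not how PLucene or CLucene work internally: there is no stemming, the tables are fixed-size, document ids are limited to 0..31 so that a posting list fits in one bitmask, and every name (`Index`, `index_add`, `index_lookup`) is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_TERMS    64
#define MAX_DOCS     32   /* doc ids must be in 0..MAX_DOCS-1 */
#define MAX_TERM_LEN 32

/* One entry of the reverse lookup index: a term plus a bitmask of the
 * documents that contain it (bit i set means doc i has the term). */
typedef struct {
    char term[MAX_TERM_LEN];
    unsigned long docs;
} Posting;

/* Hypothetical toy index; must be zero-initialised before use. */
typedef struct {
    Posting entries[MAX_TERMS];
    int count;
} Index;

/* Return the entry for term, creating it if needed; NULL if the table is full. */
static Posting *find_or_add(Index *ix, const char *term) {
    for (int i = 0; i < ix->count; i++)
        if (strcmp(ix->entries[i].term, term) == 0)
            return &ix->entries[i];
    if (ix->count == MAX_TERMS)
        return NULL;
    Posting *p = &ix->entries[ix->count++];
    strncpy(p->term, term, MAX_TERM_LEN - 1);
    p->term[MAX_TERM_LEN - 1] = '\0';
    p->docs = 0;
    return p;
}

/* Split text on non-alphanumeric characters, lower-case each token and
 * record doc_id against it. Tokens longer than MAX_TERM_LEN-1 are truncated. */
void index_add(Index *ix, int doc_id, const char *text) {
    assert(doc_id >= 0 && doc_id < MAX_DOCS);
    char tok[MAX_TERM_LEN];
    int n = 0;
    for (const char *c = text; ; c++) {
        if (*c && isalnum((unsigned char)*c)) {
            if (n < MAX_TERM_LEN - 1)
                tok[n++] = (char)tolower((unsigned char)*c);
        } else {
            if (n > 0) {
                tok[n] = '\0';
                Posting *p = find_or_add(ix, tok);
                if (p)
                    p->docs |= 1UL << doc_id;
                n = 0;
            }
            if (!*c)
                break;
        }
    }
}

/* Bitmask of documents containing term (term must already be lower-case). */
unsigned long index_lookup(const Index *ix, const char *term) {
    for (int i = 0; i < ix->count; i++)
        if (strcmp(ix->entries[i].term, term) == 0)
            return ix->entries[i].docs;
    return 0;
}
```

With posting lists held as bitmasks, the example query becomes `(index_lookup(&ix, "ba") | index_lookup(&ix, "ma")) & index_lookup(&ix, "literature")`. The cost of an insert in this sketch grows with the number of distinct terms because of the linear scan in `find_or_add`; a real engine would use a hash or sorted term dictionary instead.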

  10. Investigating CLucene
      • Wrote similar C++ test harness
      • Speed good: search 20,000 CVs <1 second
        load 3 CVs per sec (mostly Word->text)
      • Code written as VC++ degree project and registered at SourceForge
      • Jimmy Pritts changed layout and added GNU autoconf files configure.ac Makefile.in to let it build cross-platform on Windows, cygwin, Linux
      • Had C DLL interface used by PHP wrapper
      Decided to write Perl wrapper

  11. Interfacing Perl to C++
      • When I wrote this wrapper, Perl to C++ interfacing via XS or SWIG was tricky and despite the optimism expressed at http://www.johnkeiser.com/perl-xs-c++.html I had difficulties mapping the CLucene API to XS
      • Reasons: C++ namespace mangling; object and method mapping; C++ memory garbage collection
      • So I decided to go via the C DLL wrapper to hide this complexity
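One common way a flat C layer hides objects from a scripting language is to hand out small integer handles and keep the real pointers in a private table. The C DLL signatures quoted later in the deck take a `const int resource` argument, which is consistent with that pattern, but the slides do not show how CLucene implements it. The following is therefore only a sketch: `FakeIndex`, `res_open`, `res_path`, `res_close` and the path used in the demo are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_RESOURCES 8

/* Stand-in for the C++ object a real wrapper would hide (hypothetical). */
typedef struct {
    char *path;
} FakeIndex;

/* Slot i holds the object for handle i+1; NULL means the slot is free. */
static FakeIndex *resources[MAX_RESOURCES];

/* Open an "index" and return a handle >= 1, or 0 on failure. */
int res_open(const char *path) {
    if (path == NULL)
        return 0;
    for (int i = 0; i < MAX_RESOURCES; i++) {
        if (resources[i] == NULL) {
            FakeIndex *ix = malloc(sizeof *ix);
            if (ix == NULL)
                return 0;
            size_t n = strlen(path) + 1;
            ix->path = malloc(n);
            if (ix->path == NULL) {
                free(ix);
                return 0;
            }
            memcpy(ix->path, path, n);
            resources[i] = ix;
            return i + 1;
        }
    }
    return 0; /* table full */
}

/* Translate a handle back to its object; NULL if the handle is invalid or stale. */
static FakeIndex *res_get(int handle) {
    if (handle < 1 || handle > MAX_RESOURCES)
        return NULL;
    return resources[handle - 1];
}

/* Path the handle was opened with, or NULL for an invalid handle. */
const char *res_path(int handle) {
    FakeIndex *ix = res_get(handle);
    return ix ? ix->path : NULL;
}

/* Release the object behind a handle; returns 1 on success, 0 if invalid. */
int res_close(int handle) {
    FakeIndex *ix = res_get(handle);
    if (ix == NULL)
        return 0;
    free(ix->path);
    free(ix);
    resources[handle - 1] = NULL;
    return 1;
}

/* Usage sketch: open, query, close; the stale handle is then rejected. */
void handle_demo(void) {
    int h = res_open("/tmp/cv-index"); /* hypothetical path */
    assert(h != 0);
    assert(strcmp(res_path(h), "/tmp/cv-index") == 0);
    int closed = res_close(h);
    assert(closed == 1);
    assert(res_path(h) == NULL);
}
```

Because only integers cross the language boundary, the caller never sees a C++ pointer, a mangled name or an object lifetime. Reserving 0 for failure also lets Perl code test the result directly, as the `CL_OPEN ( ... ) or confess` call in the OO wrapper slide does.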

  12. Perl XS
      • Always start with h2xs utility
      • Code is C with macro extensions
      • Write C code (XSUBs)
      • Call internal Perl routines (perlguts) to create variables, allocate arrays…
        newSViv(IV), sv_setiv(SV*, IV) ~ scalar integer variable
      • Complicated
      • Nyarlathotep / “Crawling Chaos”

  13. Enter SWIG
      • Creates XS for you from a .i definition file
      • Parses C/C++ .h header files to get types and function prototypes
      • Allows for inline C/XS code

  14. SWIG XS Sample
      From argv.i
          // Creates a new Perl array and places a NULL-terminated char ** into it
          %typemap(out) char ** {
              AV *myav;
              SV **svs;
              int i = 0,len = 0;
              /* Figure out how many elements we have */
              while ($1[len])
                  len++;
              svs = (SV **) malloc(len*sizeof(SV *));
              for (i = 0; i < len ; i++) {
                  svs[i] = sv_newmortal();
                  sv_setpv((SV*)svs[i],$1[i]);
              };
              myav = av_make(len,svs);
              free(svs);
              $result = newRV((SV*)myav);
              sv_2mortal($result);
              argvi++;
          }
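Stripped of the Perl API calls, the typemap above does two things: it walks a NULL-terminated `char **` to find its length, then copies each element into a newly allocated container. A plain C sketch of the same two steps follows. The function names `strv_len`, `strv_dup` and `strv_free` are hypothetical, and the destination here is another `char **` rather than a Perl array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count the entries of a NULL-terminated string array, as the
 * "while ($1[len]) len++;" loop in the typemap does. */
size_t strv_len(char **v) {
    assert(v != NULL);
    size_t len = 0;
    while (v[len])
        len++;
    return len;
}

/* Free an array produced by strv_dup (stops at the first NULL slot). */
void strv_free(char **v) {
    if (v == NULL)
        return;
    for (size_t i = 0; v[i]; i++)
        free(v[i]);
    free(v);
}

/* Deep-copy a NULL-terminated string array. Returns NULL if allocation
 * fails; the caller releases the result with strv_free. */
char **strv_dup(char **v) {
    size_t len = strv_len(v);
    char **out = malloc((len + 1) * sizeof *out);
    if (out == NULL)
        return NULL;
    out[0] = NULL;
    for (size_t i = 0; i < len; i++) {
        size_t n = strlen(v[i]) + 1;
        out[i] = malloc(n);
        if (out[i] == NULL) {
            strv_free(out); /* out[i] is NULL, so only earlier slots are freed */
            return NULL;
        }
        memcpy(out[i], v[i], n);
        out[i + 1] = NULL; /* keep the array terminated after every copy */
    }
    return out;
}
```

Keeping the array terminated after each copy means the cleanup path can reuse `strv_free` even when a later allocation fails.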

  15. Diagram of Layers (top to bottom)
      Perl OO Wrapper       CLucene.pm
      Low Level Perl        CLuceneWrap.pm   (SWIG generated)
      SWIG XS C Code        clucene_wrap.c   (SWIG generated)
      C DLL Interface       clucene_dll.o
      CLucene C++ Library   clucene.so

  16. CLucene C++ Interface
      src/CLucene/search/SearchHeader.h:
          #include "CLucene/StdHeader.h"
          #ifndef _lucene_search_SearchHeader_
          #define _lucene_search_SearchHeader_
          #include "CLucene/index/IndexReader.h"
          …
          using namespace lucene::index;
          namespace lucene{ namespace search{
              //predefine classes
              class Searcher;
              class Query;
              class Hits;
              class HitDoc {
              public:
                  float_t score;
                  int_t id;
                  lucene::document::Document* doc;
                  HitDoc* next; // in doubly-linked cache
                  HitDoc* prev; // in doubly-linked cache
                  HitDoc(const float_t s, const int_t i);
                  ~HitDoc();
              };

  17. CLucene C DLL Interface
      src/wrappers/dll/clucene_dll.h:
          #ifndef _DLL_CLUCENE
          #define _DLL_CLUCENE
          #include "CLucene/CLConfig.h"
          …
          #ifdef _UNICODE
          //unicode methods
          # define CL_UNLOCK CL_U_Unlock
          # define CL_OPEN CL_U_Open
          # define CL_DOCUMENT_INFO CL_U_Document_Info
          # define CL_ADD_FILE CL_U_Add_File
          …
          CLUCENEDLL_API int CL_U_Unlock(const wchar_t* dir);
          CLUCENEDLL_API int CL_U_Delete(const int resource, const wchar_t* query, const wchar_t* field);
          CLUCENEDLL_API int CL_U_Add_Field(const int resource, const wchar_t* field, const wchar_t* value, const int value_length, const int store, const int index, const int token);
          …

  18. SWIG Definition File
      clucene.i
          %module "FulltextSearch::CLuceneWrap"
          %{
          #include "clucene_dllp.h"
          %}
          // our definitions for CLucene variables and functions
          %include "clucene_perl.h"
          //%include "clucene_dll.h" // could use this but then would need to call CL_N_Search not CL_SEARCH etc.
          %include typemaps.i
          %include argv.i
          // helper functions where pointers to result buffers are expected
          // would be better done with a %typemap(out) if I knew enough about perlguts
          %inline %{
          int val_len;
          char * val;
          int CL_GetField1(int resource, char * field) {
              return CL_GETFIELD(resource,field,&val,&val_len);
          }
          …
          }
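The `%inline` helper on this slide sidesteps out-parameters: instead of teaching SWIG how to map `&val` and `&val_len`, it exposes a function that takes only by-value arguments and leaves the result in two module-level variables that the Perl side can read afterwards. A self-contained C sketch of that shape is below. `lookup_field` is a made-up stand-in for `CL_GETFIELD`, whose real signature and behaviour are not shown in the slides, and the field name and value are invented for the example.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for a library call that reports its result through
 * out-parameters. Hypothetical: this is NOT the CLucene function.
 * Returns 1 on success, 0 if the field is unknown. */
static int lookup_field(int resource, const char *field,
                        const char **value, int *value_len) {
    static const char example[] = "literature";
    (void)resource; /* unused in this toy */
    if (strcmp(field, "subject") == 0) {
        *value = example;
        *value_len = (int)strlen(example);
        return 1;
    }
    *value = NULL;
    *value_len = 0;
    return 0;
}

/* Module-level result buffers, mirroring val and val_len on the slide. */
const char *val;
int val_len;

/* Wrapper with only by-value arguments: easy for a generator to bind,
 * at the price of shared state that each call overwrites. */
int get_field1(int resource, const char *field) {
    assert(field != NULL);
    return lookup_field(resource, field, &val, &val_len);
}
```

The trade-off is that the globals are overwritten by every call and are not safe to share between threads, which is presumably why the slide's comment says a `%typemap(out)` would be the better solution.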

  19. SWIG-Generated XS
      CLuceneWrap.pm
          # This file was automatically generated by SWIG
          package FulltextSearch::CLuceneWrap;
          require Exporter;
          require DynaLoader;
          @ISA = qw(Exporter DynaLoader);
          package FulltextSearch::CLuceneWrapc;
          bootstrap FulltextSearch::CLuceneWrap;
          package FulltextSearch::CLuceneWrap;
          @EXPORT = qw( );
          # ---------- BASE METHODS -------------
          package FulltextSearch::CLuceneWrap;
          sub TIEHASH {
              my ($classname,$obj) = @_;
              return bless $obj, $classname;
          }
          sub CLEAR { }
          …
          # ------- FUNCTION WRAPPERS --------
          package FulltextSearch::CLuceneWrap;
          *CL_OPEN = *FulltextSearch::CLuceneWrapc::CL_OPEN;
          *CL_CLOSE = *FulltextSearch::CLuceneWrapc::CL_CLOSE;
          …
          # ------- VARIABLE STUBS --------
          package FulltextSearch::CLuceneWrap;
          *clucene_perl = *FulltextSearch::CLuceneWrapc::clucene_perl;
          *NULL = *FulltextSearch::CLuceneWrapc::NULL;
          *val_len = *FulltextSearch::CLuceneWrapc::val_len;
          *val = *FulltextSearch::CLuceneWrapc::val;
          *errstr = *FulltextSearch::CLuceneWrapc::errstr;
          …

  20. SWIG-Generated XS
      clucene_wrap.c
          #ifdef __cplusplus
          extern "C" {
          #endif
          XS(_wrap_CL_OPEN) {
              {
                  char *arg1 ;
                  int arg2 = (int) 1 ;
                  int result;
                  int argvi = 0;
                  dXSARGS;
                  if ((items < 1) || (items > 2)) {
                      SWIG_croak("Usage: CL_OPEN(path,create);");
                  }
                  if (!SvOK((SV*) ST(0))) arg1 = 0;
                  else arg1 = (char *) SvPV(ST(0), PL_na);
                  if (items > 1) {
                      arg2 = (int) SvIV(ST(1));
                  }
                  result = (int)CL_OPEN(arg1,arg2);
                  ST(argvi) = sv_newmortal();
                  sv_setiv(ST(argvi++), (IV) result);
                  XSRETURN(argvi);
                  fail:
                  ;
              }
              croak(Nullch);
          }

  21. CLucene.pm Perl OO Wrapper
      • Back into the realms of sanity
      • Normal OO package with methods
      • Calls XS wrapper functions
          sub open {
              my $this = shift;
              my %arg = @_;
              my $path = $arg{path} || $this->{path} || confess "path undefined";
              my $create = anyof ( $arg{create}, $this->{create}, 0 );
              $this->{resource} = FulltextSearch::CLuceneWrap::CL_OPEN ( $path, $create )
                  or confess "Failed to CL_OPEN $this->{path} create $create errstr ".$this->errstrglobal();
              $this->{path} = $path;
              $this;
          }

  22. Build Environment
      • Uses GNU autotools and m4 macro processor
      Definition files
      • configure.ac ~ top level build definitions
      • Makefile.am ~ makefile flags definitions
      Programs
      • libtool ~ generalised library building
      • aclocal ~ builds aclocal.m4 from configure.ac
      • autoconf ~ reads configure.ac to create configure script
      • autoheader ~ creates C header defines for configure
      • automake ~ creates Makefile.in from Makefile.am
      • autoreconf ~ manually remake whole tree of GNU build files

  23. Bootstrap shell script
          #!/bin/sh
          # Bootstrap the CLucene installation.
          mkdir -p ./build/gcc/config
          set -x
          libtoolize --force --copy --ltdl --automake
          aclocal
          autoconf
          autoheader
          automake -a --copy --foreign

  24. Autoconf configure.ac file
          dnl Process this file with autoconf to produce a configure script.
          dnl Written by Jimmy Pritts.
          dnl initialize autoconf and automake
          AC_INIT([clucene], [1])
          AC_PREREQ([2.54])
          AC_CONFIG_SRCDIR([src/CLucene.h])
          AC_CONFIG_AUX_DIR([./build/gcc/config])
          AC_CONFIG_HEADERS([config.h])
          AM_INIT_AUTOMAKE
          dnl Check for existence of a C and C++ compilers.
          AC_PROG_CC
          AC_PROG_CXX
          dnl Check for headers
          AC_HEADER_DIRENT
          dnl Configure libtool.
          AC_PROG_LIBTOOL
          dnl option to use UTF-8 as internal 8-bit charset to support characters in Unicode™
          AC_ARG_ENABLE(utf8,
            AC_HELP_STRING([--enable-utf8],[UTF-8 as internal 8-bit charset to support characters in Unicode™ (default=no)]),
            [AC_DEFINE([UTF8],[],[use UTF-8 as internal 8-bit charset to support characters in Unicode™])],enable_utf8=no)
          AM_CONDITIONAL(USEUTF8, test x$enable_utf8 = xyes)
          AC_CONFIG_FILES([Makefile src/Makefile examples/Makefile examples/demo/Makefile examples/tests/Makefile examples/util/Makefile wrappers/Makefile wrappers/dll/Makefile wrappers/dll/dlltest/Makefile])
          AC_OUTPUT

  25. Makefile.am files
      src/Makefile.am:
          AUTOMAKE_OPTIONS = 1.6
          include_HEADERS = CLucene.h
          lsrcdir = $(top_srcdir)/src/CLucene
          lib_LTLIBRARIES = libclucene.la
          libclucene_la_SOURCES =
          include CLucene/analysis/Makefile.am
          include CLucene/analysis/standard/Makefile.am
          include CLucene/debug/Makefile.am
          include CLucene/document/Makefile.am
          include CLucene/index/Makefile.am
          include CLucene/queryParser/Makefile.am
          include CLucene/search/Makefile.am
          include CLucene/store/Makefile.am
          include CLucene/util/Makefile.am
          include CLucene/Makefile.am
      ./Makefile.am:
          ## Makefile.am -- Process this file with automake to produce Makefile.in
          INCLUDES = -I$(top_srcdir)
          SUBDIRS = src wrappers examples .
      src/CLucene/document/Makefile.am:
          documentdir = $(lsrcdir)/document
          dochdir = $(includedir)/CLucene/document
          libclucene_la_SOURCES += $(documentdir)/DateField.cpp
          libclucene_la_SOURCES += $(documentdir)/Document.cpp
          libclucene_la_SOURCES += $(documentdir)/Field.cpp
          doch_HEADERS = $(documentdir)/*.h

  26. Recap
      • We saw how and why I selected an external library to use from Perl
      • We looked at GNU autotools to provide a cross-platform build environment
      • We investigated the layers of code needed to interface perl to a C++ library ~ SWIG, C, XS inline helpers, low and high level Perl modules

  27. Lessons Learned
      • Start off a new external library using GNU autotools and keeping in mind that the API should be easy to use through SWIG
      • Use SWIG not XS to wrap a C/C++ library
      • Always use h2xs to start a Perl extension
      • Open Source feedback and testing are more valuable than you expect (2 emails this week alone)

  28. Where to Get More Information
      • Perl XS
        http://en.wikipedia.org/wiki/XS_%28Perl%29
        http://www.perl.com/doc/manual/html/pod/perlguts.html
      • C++ / XS
        http://www.johnkeiser.com/perl-xs-c++.html
      • SWIG
        http://en.wikipedia.org/wiki/SWIG
        http://www.swig.org/
      • Lucene
        http://en.wikipedia.org/wiki/Lucene
      • CLucene
        http://sourceforge.net/projects/clucene/
      • Autoconf
        http://www.gnu.org/software/autoconf/
      • Book “Extending and Embedding Perl”, Jenness & Couzens (Manning, 2002)
      • Any Questions
      • These slides are at http://perl.dragonstaff.com/
