Explore the process of creating a Perl XS and SWIG interface to the CLucene C++ text search engine. Learn about different technical options, investigating CLucene, interfacing Perl to C++, and more. Understand the problem of Boolean search optimization with over 20,000 CVs for recruitment software. Delve into the high-level answers and the challenges faced during the implementation process.
Writing a Perl XS swig interface to the CLucene C++ text search engine Peter Edwards Perl XS and SWIG interface to CLucene C++ text search engine

Introduction
• Peter Edwards ~ background
• Subject ~ writing a Perl XS swig interface to the CLucene C++ text search engine

Aims
• Give an idea of the process involved in selecting and using an external library from Perl
• Introduction to extending Perl using XS, swig, GNU autotools
• Entertainment
• Audience: What is your background and interest?

Topics
• Understanding the Problem
• The Answer (at a high level)
• Technical Options
• Investigating Options
• Writing a perl / C++ Interface
• Layers and Components
• Lessons Learned
Sections: Process, Extending Perl

Terms
• Perl ~ Pathologically Eclectic Rubbish Lister
  $_ = "wftedskaebjgdpjgidbsmnjgc"; tr/a-z/oh, turtleneck Phrase Jar!/; print;
• Perl XS ~ eXternal Subroutine
  allows a perl program to call a C language subroutine
  XS is also the “glue” language specifying the calling interface
  contains complex “perlguts” stuff that will destroy your sanity
• SWIG ~ Simplified Wrapper and Interface Generator
  makes it easy to call a C/C++ library from many languages (perl, python, ruby, PHP…)
• C++ ~ Object Oriented version of C programming language
• text search ~ boolean searching of stemmed words, wildcards
• CLucene ~ C++ text search engine based on Java Lucene

Understanding the Problem
• Recruitment software written in Perl
• 20,000+ candidate Word CVs/resumes
• Boolean searching using words or partial words and wildcards
  e.g. (“BA” or “MA”) and “literature”
• Combined with SQL searching
  e.g. geographic area, skill profile codes, pay rate
• Speed < 2 seconds
• Old system used dtSearch proprietary s/w
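
The boolean query on this slide can be read as set arithmetic over candidate numbers. The following is a minimal sketch under stated assumptions, not how dtSearch or CLucene evaluate queries: `WordIndex`, `lookup`, `either`, `both` and `degree_and_literature` are hypothetical names, and terms are assumed to be already lower-cased.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <map>
#include <set>
#include <string>

// Hypothetical types: each indexed word maps to the candidate numbers
// whose CV contains it.
using CandSet = std::set<int>;
using WordIndex = std::map<std::string, CandSet>;

// Candidates whose CV contains `term`; an unknown term matches nobody.
CandSet lookup(const WordIndex& index, const std::string& term) {
    assert(!term.empty());  // assumption: callers pass normalised, non-empty terms
    auto it = index.find(term);
    return it == index.end() ? CandSet{} : it->second;
}

// OR: candidates present in either set.
CandSet either(const CandSet& a, const CandSet& b) {
    CandSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.end()));
    return out;
}

// AND: candidates present in both sets.
CandSet both(const CandSet& a, const CandSet& b) {
    CandSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.end()));
    return out;
}

// ("BA" or "MA") and "literature"
CandSet degree_and_literature(const WordIndex& index) {
    return both(either(lookup(index, "ba"), lookup(index, "ma")),
                lookup(index, "literature"));
}
```

The resulting set of candidate numbers is what would then be combined with the SQL criteria (area, skill codes, pay rate).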

The Answer (at a high level)
Load
• Convert candidate CVs from Word to text using wvWare (OpenOffice) converter
• Index text against candidate no.
Search
• Search text -> cand nos -> SQL temp table
• Normal SQL search on other criteria

Technical Options (at 2003/4)
Proprietary
• dtSearch ~ cost; hard to get cand nos out; Windows interface when perl app is Web
Open Source
• Java Lucene ~ slow but good API and power
• C++ CLucene ~ alpha quality rewrite of Lucene in Visual C++ as degree project by Ben van Klinken
• Perl CPAN (PLucene etc.) below
  http://search.cpan.org/modlist/String_Language_Text_Processing

Investigating Perl Options
• Wrote test harness to load 1000 CVs then do some searches
• Tried about 5 CPAN modules
• PLucene search speed okay for small volumes but exponential increase in insert time
  >60 seconds per insert
• Why? Tokenises doc, multi-lingual word stemming, adds doc id to reverse lookup index for each stem token
• Other modules faster but search options weak
Need to look further
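
The insert path described above (tokenise, stem, add the document id to a reverse lookup index) can be sketched in a few lines. This is only an illustration of the data structure, not PLucene's or CLucene's code: the names are hypothetical and the "stemmer" is a toy that strips a trailing "s", where a real engine would use a proper language-aware stemmer.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

// Reverse lookup index: stem -> candidate numbers whose CV contains it.
using Postings = std::map<std::string, std::set<int>>;

// Split text into lower-case alphanumeric tokens.
std::vector<std::string> tokenise(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current += static_cast<char>(std::tolower(ch));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Toy stemmer (assumption for illustration): drop a trailing "s" from
// words longer than three letters.
std::string stem(const std::string& token) {
    if (token.size() > 3 && token.back() == 's') {
        return token.substr(0, token.size() - 1);
    }
    return token;
}

// Index one document: every stem gains this candidate number.
void add_document(Postings& index, int cand_no, const std::string& text) {
    assert(cand_no > 0);  // assumption: candidate numbers are positive
    for (const std::string& token : tokenise(text)) {
        index[stem(token)].insert(cand_no);
    }
}
```

Every token of every document touches the index, so insert cost depends heavily on how the index is stored and updated; the sketch keeps everything in memory and says nothing about why one implementation is slower than another.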

Investigating CLucene
• Wrote similar C++ test harness
• Speed good:
  search 20,000 CVs <1 second
  load 3 CVs per sec (mostly Word->text)
• Code written as VC++ degree project and registered at SourceForge
• Jimmy Pritts changed layout and added GNU autoconf files (configure.ac, Makefile.in) to let it build cross-platform on Windows, cygwin, Linux
• Had C DLL interface used by PHP wrapper
Decided to write Perl wrapper

Interfacing Perl to C++
• When I wrote this wrapper, Perl to C++ interfacing via XS or SWIG was tricky and, despite the optimism expressed at http://www.johnkeiser.com/perl-xs-c++.html, I had difficulties mapping the CLucene API to XS
• Reasons: C++ name mangling and namespaces; object and method mapping; C++ memory management
• So I decided to go via the C DLL wrapper to hide this complexity
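
The idea of hiding C++ behind a flat C interface can be sketched as follows. This is a minimal illustration, not CLucene's actual DLL code: `demo::TextStore` and the `DEMO_*` functions are hypothetical. The `extern "C"` functions avoid name mangling, objects are reached only through integer handles, and the wrapper owns their lifetime.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a C++ library class; not CLucene's real API.
namespace demo {
class TextStore {
public:
    explicit TextStore(std::string path) : path_(std::move(path)) {}
    void add(const std::string& text) { docs_.push_back(text); }
    int size() const { return static_cast<int>(docs_.size()); }
private:
    std::string path_;
    std::vector<std::string> docs_;
};
}  // namespace demo

namespace {
// The wrapper owns every C++ object; callers only ever see the integer key.
std::map<int, std::unique_ptr<demo::TextStore>> g_stores;
int g_next_handle = 1;

demo::TextStore* find_store(int handle) {
    auto it = g_stores.find(handle);
    return it == g_stores.end() ? nullptr : it->second.get();
}
}  // namespace

extern "C" {

// Returns a positive handle, or 0 on failure.
int DEMO_Open(const char* path) {
    if (path == nullptr) return 0;
    try {
        int handle = g_next_handle++;
        bool inserted = g_stores.emplace(
            handle, std::unique_ptr<demo::TextStore>(new demo::TextStore(path))).second;
        assert(inserted);  // handles are never reused
        (void)inserted;
        return handle;
    } catch (...) {
        return 0;  // C++ exceptions must not cross the C boundary
    }
}

// Returns the new document count, or -1 for a bad handle or argument.
int DEMO_Add(int handle, const char* text) {
    demo::TextStore* store = find_store(handle);
    if (store == nullptr || text == nullptr) return -1;
    try {
        store->add(text);
        return store->size();
    } catch (...) {
        return -1;
    }
}

// Returns the document count, or -1 for an unknown handle.
int DEMO_Count(int handle) {
    demo::TextStore* store = find_store(handle);
    return store == nullptr ? -1 : store->size();
}

// Destroys the object behind the handle; returns 1 if it existed, else 0.
int DEMO_Close(int handle) {
    return g_stores.erase(handle) > 0 ? 1 : 0;
}

}  // extern "C"
```

A scripting-language binding then only has to map `int` and `const char*`, which is far easier for XS or SWIG than classes, namespaces and destructors.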

Perl XS
• Always start with h2xs utility
• Code is C with macro extensions
• Write C code (XSUBs)
• Call internal Perl routines (perlguts) to create variables, allocate arrays…
  newSViv(IV), sv_setiv(SV*, IV) ~ scalar integer variable
• Complicated
• Nyarlathotep / “Crawling Chaos”

Enter SWIG
• Creates XS for you from a .i definition file
• Parses C/C++ .h header files to get types and function prototypes
• Allows for inline C/XS code

Swig XS Sample
From argv.i
// Creates a new Perl array and places a NULL-terminated char ** into it
%typemap(out) char ** {
  AV *myav;
  SV **svs;
  int i = 0,len = 0;
  /* Figure out how many elements we have */
  while ($1[len])
    len++;
  svs = (SV **) malloc(len*sizeof(SV *));
  for (i = 0; i < len ; i++) {
    svs[i] = sv_newmortal();
    sv_setpv((SV*)svs[i],$1[i]);
  };
  myav = av_make(len,svs);
  free(svs);
  $result = newRV((SV*)myav);
  sv_2mortal($result);
  argvi++;
}
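
Stripped of the perlguts calls, the typemap above does two things: count the entries up to the terminating NULL, then copy each string into a container the caller owns. The sketch below shows that same shape in plain C++ as an analogy only; `copy_string_array` is a hypothetical name and no SWIG or Perl API is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Copy a NULL-terminated array of C strings into an owned container.
std::vector<std::string> copy_string_array(const char* const* items) {
    std::vector<std::string> out;
    if (items == nullptr) return out;

    // Pass 1: the array carries no length, so walk to the terminating NULL.
    std::size_t len = 0;
    while (items[len] != nullptr) ++len;

    // Pass 2: copy each string so the result outlives the C buffer.
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i) out.emplace_back(items[i]);

    assert(out.size() == len);
    return out;
}
```

In the typemap the "owned container" is a Perl array of mortal scalars, returned to the caller as a reference.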

Diagram of Layers
Perl OO Wrapper ~ CLucene.pm
Low Level Perl ~ CLuceneWrap.pm (SWIG generated)
SWIG XS C Code ~ clucene_wrap.c
C DLL Interface ~ clucene_dll.o
CLucene C++ Library ~ clucene.so

CLucene C++ Interface
src/CLucene/search/SearchHeader.h:
#include "CLucene/StdHeader.h"
#ifndef _lucene_search_SearchHeader_
#define _lucene_search_SearchHeader_
#include "CLucene/index/IndexReader.h"
…
using namespace lucene::index;
namespace lucene{ namespace search{
  //predefine classes
  class Searcher;
  class Query;
  class Hits;
  class HitDoc {
  public:
    float_t score;
    int_t id;
    lucene::document::Document* doc;
    HitDoc* next; // in doubly-linked cache
    HitDoc* prev; // in doubly-linked cache
    HitDoc(const float_t s, const int_t i);
    ~HitDoc();
  };

CLucene C DLL Interface
src/wrappers/dll/clucene_dll.h:
#ifndef _DLL_CLUCENE
#define _DLL_CLUCENE
#include "CLucene/CLConfig.h"
…
#ifdef _UNICODE
//unicode methods
# define CL_UNLOCK CL_U_Unlock
# define CL_OPEN CL_U_Open
# define CL_DOCUMENT_INFO CL_U_Document_Info
# define CL_ADD_FILE CL_U_Add_File
…
CLUCENEDLL_API int CL_U_Unlock(const wchar_t* dir);
CLUCENEDLL_API int CL_U_Delete(const int resource, const wchar_t* query, const wchar_t* field);
CLUCENEDLL_API int CL_U_Add_Field(const int resource, const wchar_t* field, const wchar_t* value, const int value_length, const int store, const int index, const int token);
…

SWIG Definition File
clucene.i
%module "FulltextSearch::CLuceneWrap"
%{
#include "clucene_dllp.h"
%}
// our definitions for CLucene variables and functions
%include "clucene_perl.h"
//%include "clucene_dll.h" // could use this but then would need to call CL_N_Search not CL_SEARCH etc.
%include typemaps.i
%include argv.i
// helper functions where pointers to result buffers are expected
// would be better done with a %typemap(out) if I knew enough about perlguts
%inline %{
int val_len;
char * val;
int CL_GetField1(int resource, char * field) {
  return CL_GETFIELD(resource,field,&val,&val_len);
}
…
%}
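
The helper pattern above (call a function that fills result buffers through pointers, then park the results in globals the script can read) can be sketched without SWIG. Everything here is hypothetical: `demo_get_field` merely stands in for a call shaped like `CL_GETFIELD`, and its return convention (1 on success, 0 on failure) is an assumption. The second wrapper shows the direction a typemap would take, returning the value directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical C-style getter: reports the value through two
// out-parameters and returns 1 on success, 0 on failure.
int demo_get_field(int resource, const char* field,
                   const char** val, int* val_len) {
    assert(field != nullptr && val != nullptr && val_len != nullptr);
    if (resource != 1 || std::strcmp(field, "name") != 0) return 0;
    static const char kValue[] = "Ada";
    *val = kValue;
    *val_len = static_cast<int>(sizeof(kValue) - 1);
    return 1;
}

// Style used on the slide: stash the outputs in globals that the
// scripting side reads after the call.
const char* g_val = nullptr;
int g_val_len = 0;

int demo_get_field1(int resource, const char* field) {
    return demo_get_field(resource, field, &g_val, &g_val_len);
}

// Alternative: hand back the status and value together, with no shared state.
std::pair<bool, std::string> demo_get_field_value(int resource, const char* field) {
    const char* val = nullptr;
    int len = 0;
    if (demo_get_field(resource, field, &val, &len) == 0) {
        return std::make_pair(false, std::string());
    }
    return std::make_pair(true, std::string(val, static_cast<std::size_t>(len)));
}
```

The global-variable version is easy to expose through a wrapper generator, but the globals keep their previous contents after a failed call and are not safe to share between threads, which is the usual argument for the typemap approach.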

SWIG-Generated XS
CLuceneWrap.pm
# This file was automatically generated by SWIG
package FulltextSearch::CLuceneWrap;
require Exporter;
require DynaLoader;
@ISA = qw(Exporter DynaLoader);
package FulltextSearch::CLuceneWrapc;
bootstrap FulltextSearch::CLuceneWrap;
package FulltextSearch::CLuceneWrap;
@EXPORT = qw( );
# ---------- BASE METHODS -------------
package FulltextSearch::CLuceneWrap;
sub TIEHASH {
    my ($classname,$obj) = @_;
    return bless $obj, $classname;
}
sub CLEAR { }
…
# ------- FUNCTION WRAPPERS --------
package FulltextSearch::CLuceneWrap;
*CL_OPEN = *FulltextSearch::CLuceneWrapc::CL_OPEN;
*CL_CLOSE = *FulltextSearch::CLuceneWrapc::CL_CLOSE;
…
# ------- VARIABLE STUBS --------
package FulltextSearch::CLuceneWrap;
*clucene_perl = *FulltextSearch::CLuceneWrapc::clucene_perl;
*NULL = *FulltextSearch::CLuceneWrapc::NULL;
*val_len = *FulltextSearch::CLuceneWrapc::val_len;
*val = *FulltextSearch::CLuceneWrapc::val;
*errstr = *FulltextSearch::CLuceneWrapc::errstr;
…

SWIG-Generated XS
clucene_wrap.c
#ifdef __cplusplus
extern "C" {
#endif
XS(_wrap_CL_OPEN) {
    {
        char *arg1 ;
        int arg2 = (int) 1 ;
        int result;
        int argvi = 0;
        dXSARGS;
        if ((items < 1) || (items > 2)) {
            SWIG_croak("Usage: CL_OPEN(path,create);");
        }
        if (!SvOK((SV*) ST(0))) arg1 = 0;
        else arg1 = (char *) SvPV(ST(0), PL_na);
        if (items > 1) {
            arg2 = (int) SvIV(ST(1));
        }
        result = (int)CL_OPEN(arg1,arg2);
        ST(argvi) = sv_newmortal();
        sv_setiv(ST(argvi++), (IV) result);
        XSRETURN(argvi);
        fail:
        ;
    }
    croak(Nullch);
}

CLucene.pm Perl OO Wrapper
• Back into the realms of sanity
• Normal OO package with methods
• Calls XS wrapper functions
sub open {
    my $this = shift;
    my %arg = @_;
    my $path = $arg{path} || $this->{path} || confess "path undefined";
    my $create = anyof ( $arg{create}, $this->{create}, 0 );
    # report $path here: $this->{path} is not set until the open succeeds
    $this->{resource} = FulltextSearch::CLuceneWrap::CL_OPEN ( $path, $create )
        or confess "Failed to CL_OPEN $path create $create errstr ".$this->errstrglobal();
    $this->{path} = $path;
    $this;
}

Build Environment
• Uses GNU autotools and m4 macro processor
Definition files
• configure.ac ~ top level build definitions
• Makefile.am ~ makefile flags definitions
Programs
• libtool ~ generalised library building
• aclocal ~ builds aclocal.m4 from configure.ac
• autoconf ~ reads configure.ac to create configure script
• autoheader ~ creates C header defines for configure
• automake ~ creates Makefile.in from Makefile.am
• autoreconf ~ manually remake whole tree of GNU build files

Bootstrap shell script
#!/bin/sh
# Bootstrap the CLucene installation.
mkdir -p ./build/gcc/config
set -x
libtoolize --force --copy --ltdl --automake
aclocal
autoconf
autoheader
automake -a --copy --foreign

Autoconf configure.ac file
dnl Process this file with autoconf to produce a configure script.
dnl Written by Jimmy Pritts.
dnl initialize autoconf and automake
AC_INIT([clucene], [1])
AC_PREREQ([2.54])
AC_CONFIG_SRCDIR([src/CLucene.h])
AC_CONFIG_AUX_DIR([./build/gcc/config])
AC_CONFIG_HEADERS([config.h])
AM_INIT_AUTOMAKE
dnl Check for existence of a C and C++ compilers.
AC_PROG_CC
AC_PROG_CXX
dnl Check for headers
AC_HEADER_DIRENT
dnl Configure libtool.
AC_PROG_LIBTOOL
dnl option to use UTF-8 as internal 8-bit charset to support characters in Unicode™
AC_ARG_ENABLE(utf8,
  AC_HELP_STRING([--enable-utf8],[UTF-8 as internal 8-bit charset to support characters in Unicode™ (default=no)]),
  [AC_DEFINE([UTF8],[],[use UTF-8 as internal 8-bit charset to support characters in Unicode™])],enable_utf8=no)
AM_CONDITIONAL(USEUTF8, test x$enable_utf8 = xyes)
AC_CONFIG_FILES([Makefile src/Makefile examples/Makefile examples/demo/Makefile examples/tests/Makefile examples/util/Makefile wrappers/Makefile wrappers/dll/Makefile wrappers/dll/dlltest/Makefile])
AC_OUTPUT

Makefile.am files
src/Makefile.am:
AUTOMAKE_OPTIONS = 1.6
include_HEADERS = CLucene.h
lsrcdir = $(top_srcdir)/src/CLucene
lib_LTLIBRARIES = libclucene.la
libclucene_la_SOURCES =
include CLucene/analysis/Makefile.am
include CLucene/analysis/standard/Makefile.am
include CLucene/debug/Makefile.am
include CLucene/document/Makefile.am
include CLucene/index/Makefile.am
include CLucene/queryParser/Makefile.am
include CLucene/search/Makefile.am
include CLucene/store/Makefile.am
include CLucene/util/Makefile.am
include CLucene/Makefile.am
./Makefile.am:
## Makefile.am -- Process this file with automake to produce Makefile.in
INCLUDES = -I$(top_srcdir)
SUBDIRS = src wrappers examples .
src/CLucene/document/Makefile.am:
documentdir = $(lsrcdir)/document
dochdir = $(includedir)/CLucene/document
libclucene_la_SOURCES += $(documentdir)/DateField.cpp
libclucene_la_SOURCES += $(documentdir)/Document.cpp
libclucene_la_SOURCES += $(documentdir)/Field.cpp
doch_HEADERS = $(documentdir)/*.h

Recap
• We saw how and why I selected an external library to use from Perl
• We looked at GNU autotools to provide a cross-platform build environment
• We investigated the layers of code needed to interface perl to a C++ library ~ SWIG, C, XS inline helpers, low and high level Perl modules

Lessons Learned
• Start off a new external library using GNU autotools and keeping in mind that the API should be easy to use through SWIG
• Use SWIG not XS to wrap a C/C++ library
• Always use h2xs to start a Perl extension
• Open Source feedback and testing are more valuable than you expect (2 emails this week alone)

Where to Get More Information
• Perl XS
  http://en.wikipedia.org/wiki/XS_%28Perl%29
  http://www.perl.com/doc/manual/html/pod/perlguts.html
• C++ / XS
  http://www.johnkeiser.com/perl-xs-c++.html
• SWIG
  http://en.wikipedia.org/wiki/SWIG
  http://www.swig.org/
• Lucene
  http://en.wikipedia.org/wiki/Lucene
• CLucene
  http://sourceforge.net/projects/clucene/
• Autoconf
  http://www.gnu.org/software/autoconf/
• Book “Extending and Embedding Perl”, Jenness & Cozens (Manning, 2002)
• Any Questions?
• These slides are at http://perl.dragonstaff.com/