Text Toolkit Rachit Arora Toolkits
Important Disclaimer THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE EFFECT OF: • CREATING ANY WARRANTY OR REPRESENTATION FROM IBM (OR ITS AFFILIATES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS); OR • ALTERING THE TERMS AND CONDITIONS OF THE APPLICABLE LICENSE AGREEMENT GOVERNING THE USE OF IBM SOFTWARE. The information on the new product is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information on the new product is for informational purposes only and may not be incorporated into any contract. The information on the new product is not a commitment, promise, or legal obligation to deliver any material, code or functionality. The development, release, and timing of any features or functionality described for our products remains at our sole discretion. THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION.
Agenda • Introduction to Text Toolkit • SystemT overview • Changes from Previous version • TextExtract operator parameters & features • Usage Scenarios • Helper Scripts • Tooling
Text Toolkit • The TextExtract operator allows IBM’s SystemT to be used within IBM InfoSphere Streams. • SystemT operation is defined by either modules or tam files, together with • external dictionaries • external tables • external views. • The updated version of the toolkit integrates with SystemT 2.0 and supports text extraction using modular AQL. • Both versions of the toolkit are shipped; the old toolkit (integration with SystemT 1.3) can be found in the deprecated folder under the toolkits directory.
SystemT • Extracts information from documents • E.g., get names and phone numbers from emails • The language is AQL • SQL-like, with special functions for text extraction • E.g., supports detagging for HTML • Input is always a Document, which has either the default schema (text) or a defined schema • Output fields may be Spans, ints, floats, strings, or lists of any of those • The Span type is a start and end offset in the text (e.g., 3,5) • The set of output views is defined in the AQL file; a document may produce multiple tuples on each output view • AQL files are contained inside a module • Modules compile to tam files • [Diagram: Document → modules → View1, View2, corresponding to: (stream<Type1> View1; stream<Type2> View2) = TextExtract(Document) {…]
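The Span idea above can be illustrated outside AQL. The following is a minimal Python sketch, not SystemT code: the `Span` class, the function names, and the simplified phone-number pattern are all hypothetical, and the end offset follows Python's slice convention (one past the last character), which is an assumption here rather than something stated in the slides. It shows how one document can yield several tuples for the same view, each carrying only offsets into the text.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a span: a pair of offsets into the document text."""
    begin: int  # offset of the first matched character
    end: int    # offset one past the last matched character (assumed convention)


def extract_phone_spans(text: str) -> List[Span]:
    """Return one Span per phone-number-like match (simplified pattern, illustration only)."""
    pattern = re.compile(r"\d{3}-\d{4}")
    return [Span(m.start(), m.end()) for m in pattern.finditer(text)]


def covered_text(text: str, span: Span) -> str:
    """Recover the matched substring from the document text and a span."""
    return text[span.begin:span.end]


doc = "Call Ann at 555-0100 or Bob at 555-0199."
spans = extract_phone_spans(doc)  # two results from a single document
for s in spans:
    print(s.begin, s.end, covered_text(doc, s))
```

Because a span stores offsets rather than text, downstream code can still recover the matched string as long as it has the original document.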
Changes in the parameters - Parameters removed • AQLFile: In the older version of the toolkit, an AQL file described a query on text and was compiled into an AOG file. This parameter is removed in the newer version of the toolkit. Modules are now the input to the operator. A module may contain one or more AQL files. The unit of compilation is now a module, not an AQL file, and a module is compiled into a tam file. • AOGFile: In the older version of the toolkit, either an AQL file or an AOG file was input to the operator. An AOG file was generated when an AQL file was compiled. This parameter is removed in the newer version of the toolkit. Instead, a module list containing tam files and a module lookup path can be specified as input to the operator. • dictionaryPath: This optional parameter in the older version of the toolkit specified the dictionary path for the operator. It is removed in the new toolkit because dictionaries are now specified at run time using the externalDictionary parameter. • label: This optional parameter in the old Text toolkit named the attribute of the input tuple to be passed as the label field to SystemT. It is removed in the new version of the toolkit. • Text: changed to inputDoc • Changes in the LanguageWare-related parameters
Changes in the parameters - Parameters added • moduleName - This optional parameter specifies a list of modules to be loaded. • modulePath - This optional parameter of type rstring specifies the location of the modules. • moduleOutputDir - This parameter of type rstring specifies the location where the compiled modules are written. • uncompiledModules - This optional parameter of type rstring specifies a list of modules to be compiled. • externalDictionary - This optional parameter of type rstring specifies external dictionary objects. • externalTable - This optional parameter of type rstring specifies external table objects. • externalView - This parameter specifies a list of attributes from the input port that are passed as external views. • outputMode - This optional parameter of type rstring takes one of two values: singlePort or multiPort.
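The createTypes sample output in this deck passes externalDictionary a string of the form "metricsIndicator_dictionaries.abbreviations=/…/abbreviation.dict". The Python sketch below splits such a string into its parts. It is an illustration only: reading the left-hand side as "module.dictionary" is an assumption drawn from that one sample, and `ExternalDictionary` and `parse_external_dictionary` are hypothetical names, not part of the toolkit.

```python
from typing import NamedTuple


class ExternalDictionary(NamedTuple):
    module: str      # assumed: module that declares the external dictionary
    dictionary: str  # assumed: dictionary name inside that module
    path: str        # file that supplies the dictionary entries at run time


def parse_external_dictionary(spec: str) -> ExternalDictionary:
    """Split a 'module.dictionary=path' string into its three parts."""
    qualified_name, sep, path = spec.partition("=")
    if not sep or not path:
        raise ValueError(f"expected 'module.dictionary=path', got {spec!r}")
    # Split on the last dot of the name only; the path may contain dots too.
    module, dot, dictionary = qualified_name.rpartition(".")
    if not dot or not module or not dictionary:
        raise ValueError(f"expected a qualified name before '=', got {qualified_name!r}")
    return ExternalDictionary(module, dictionary, path)


sample = ("metricsIndicator_dictionaries.abbreviations="
          "/homes/hny1/rdalavi1/samples/src/metricsIndicator_dictionaries/"
          "dictionaries/abbreviation.dict")
print(parse_external_dictionary(sample))
```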
Changes in Helper Script and samples • The Perl script createApp.pl was removed • createTypes.pl was updated to reflect the changes in the newer version of the toolkit • New samples were added to showcase toolkit usage
TextExtract operator parameters & features • Operator parameters passed to SystemT • moduleName: modules to be loaded, along with the modulePath to search for the modules • uncompiledModules: modules to be compiled and then loaded for extraction • externalDictionary and externalTable • externalView: read from the input port • languageCode and tokenizer • Input flexibility • rstring input in the case of the default schema • A tuple attribute as input when the module expects the input document in a specific schema; that attribute is passed to SystemT • languageCodeAttribute: specifies the language to use on a tuple-by-tuple basis • Output flexibility • Generous on type matches (e.g., a Span can match an rstring or a tuple<int32 begin, int32 end>) • Streams integration • Allows attributes to be passed from input to output (without being processed by SystemT) • passThrough: produces an output stream for documents that don’t produce tuples • outputViews: limits the output to selected views • singleTupleMode: the various output views are merged into a single output port (enabled by default) • createTypes.pl (helper script): takes a module as input and outputs a type, which can then be used to generate a composite for building SPL applications
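The "generous on type matches" bullet can be pictured with a small Python sketch. This is not the operator's implementation: `convert_span` is a hypothetical helper, and the assumption that an rstring attribute receives the text covered by the span (rather than some other rendering) is mine, not stated in the slides.

```python
from typing import Tuple, Union


def convert_span(text: str, begin: int, end: int,
                 declared_type: str) -> Union[str, Tuple[int, int]]:
    """Render one span in the form the declared output attribute type asks for."""
    if declared_type == "rstring":
        # Assumed behaviour: the attribute gets the text covered by the span.
        return text[begin:end]
    if declared_type == "tuple<int32 begin, int32 end>":
        # The attribute keeps the raw offsets instead of the text.
        return (begin, end)
    raise ValueError(f"no conversion from Span to {declared_type!r} in this sketch")


doc = "Call 555-0100"
print(convert_span(doc, 5, 13, "rstring"))
print(convert_span(doc, 5, 13, "tuple<int32 begin, int32 end>"))
```

The same extracted span can therefore feed either kind of output attribute, depending on what the output stream type declares.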
Scenario 1 • A user has uncompiled modules that need to be compiled and loaded. • Required parameters: • uncompiledModules: Specifies the path to the uncompiled modules. • moduleName: Specifies which of the uncompiled modules are to be loaded. • Optional parameters that can be used along with the required parameters specified above: • moduleOutputDir, externalView, externalDictionary, externalTable, languageCode, tokenizer, outputMode, passThrough (can be specified only if the singleTupleMode parameter is false), etc.
Scenario 2 • A user has uncompiled modules to be compiled and loaded, and compiled modules to be loaded. • Required parameters: • uncompiledModules: path to the uncompiled modules. • moduleName: Specifies the compiled modules that are to be loaded. • modulePath: Specifies the path to the compiled modules given in the moduleName parameter. • Optional parameters that can be used along with the required parameters specified above: • moduleOutputDir, externalView, externalDictionary, externalTable, languageCode, tokenizer, outputMode, passThrough (can be specified only if singleTupleMode parameter is false) etc.
Scenario 3 • A user has only compiled modules to be loaded. • Required parameters: • moduleName: Specifies the compiled module names that are to be loaded. • modulePath: Specifies the path to the compiled modules given in moduleName parameter. • Optional parameters that can be used along with the required parameters specified above: • externalView, externalDictionary, externalTable, languageCode, tokenizer, outputMode, passThrough (can be specified only if singleTupleMode parameter is false) etc.
Scenario 4 • A user has uncompiled modules that refer to other modules. • Required parameters: • uncompiledModules: Specifies the path to the uncompiled modules. • modulePath: Specifies the path to the modules referred to by the uncompiled modules. • Optional parameters that can be used along with the required parameters specified above: • externalView, externalDictionary, externalTable, languageCode, tokenizer, outputMode, passThrough (can be specified only if the singleTupleMode parameter is false), etc.
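The four scenarios differ only in which of uncompiledModules, moduleName, and modulePath are supplied. The Python sketch below encodes that as a lookup table, together with the passThrough restriction. It is a reading aid built from the slides, not toolkit code: the function names are hypothetical, parameter values are placeholders, and the real operator may accept or reject combinations differently.

```python
from typing import Dict, Optional

# (uncompiledModules given, moduleName given, modulePath given) -> scenario number
_SCENARIOS = {
    (True, True, False): 1,   # compile modules, then load the named ones
    (True, True, True): 2,    # compile some modules, load compiled ones from modulePath
    (False, True, True): 3,   # load compiled modules only
    (True, False, True): 4,   # compile modules that refer to modules on modulePath
}


def classify_scenario(params: Dict[str, object]) -> Optional[int]:
    """Return the scenario number matching the supplied parameters, or None."""
    key = ("uncompiledModules" in params,
           "moduleName" in params,
           "modulePath" in params)
    return _SCENARIOS.get(key)


def pass_through_allowed(params: Dict[str, object]) -> bool:
    """passThrough may be given only when singleTupleMode is false.

    singleTupleMode is treated as true when absent, since the slides
    describe it as enabled by default.
    """
    if "passThrough" not in params:
        return True
    return params.get("singleTupleMode", True) is False


# Placeholder values; only the presence of each key matters in this sketch.
print(classify_scenario({"uncompiledModules": "src/", "moduleName": "main"}))
print(pass_through_allowed({"passThrough": True, "singleTupleMode": False}))
```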
Helper Script - createTypes • createTypes.pl takes an AQL file or modules as input and outputs the output types of the AQL file or modules • optionally, a composite using the AQL/module • optionally, a main program using the composite • E.g., suppose there’s a pre-defined module with 25 output views, but only three are needed; you want a composite that reflects that simpler operator • E.g., suppose there are complex output types, and you want your downstream program to be immune to minor type changes; keeping the types defined separately might help
Sample output from createTypes
type toPrintmainType = rstring amount, rstring match, rstring metric, rstring metric_normalized;
type toPrintmetricsIndicator_featuresType = rstring metric, rstring amount, rstring match;
// a composite
public composite MainComposite3(input inputStream;
                                output toPrintmainStream,
                                       toPrintmetricsIndicator_featuresStream,
                                       passThroughStream) {
  param
    expression<rstring> $languageCode: "en";
    expression<boolean> $passThrough: true;
  graph
    ( stream<toPrintmainType> toPrintmainStream;
      stream<toPrintmetricsIndicator_featuresType> toPrintmetricsIndicator_featuresStream;
      stream<inputStream> passThroughStream) = com.ibm.streams.text.analytics::TextExtract(inputStream) {
      param
        moduleName: "main","metricsIndicator_dictionaries","metricsIndicator_features","metricsIndicator_udfs";
        modulePath: "/homes/hny1/rdalavi1/samples/bin/";
        externalDictionary: "metricsIndicator_dictionaries.abbreviations=/homes/hny1/rdalavi1/samples/src/metricsIndicator_dictionaries/dictionaries/abbreviation.dict";
        outputMode: "multiPort";
        languageCode: $languageCode;
        passThrough: $passThrough;
    }
}
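To make the type-generation step concrete, here is a minimal Python sketch of how type declarations like the ones in the sample could be produced from view schemas. It is not createTypes.pl and makes no claim about how that script works internally: `spl_type_decl`, `emit_types`, and the `Schema` alias are hypothetical, and the view schemas are supplied by hand instead of being read from a module.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Schema = List[Tuple[str, str]]  # (SPL type, attribute name) pairs, in output order


def spl_type_decl(type_name: str, schema: Schema) -> str:
    """Format one SPL type declaration from a list of typed attributes."""
    attrs = ", ".join(f"{spl_type} {name}" for spl_type, name in schema)
    return f"type {type_name} = {attrs};"


def emit_types(views: Dict[str, Schema],
               wanted: Optional[Iterable[str]] = None) -> List[str]:
    """Emit a declaration per view, optionally restricted to the wanted views."""
    names = list(views) if wanted is None else list(wanted)
    return [spl_type_decl(f"{name}Type", views[name]) for name in names]


views = {
    "toPrintmain": [("rstring", "amount"), ("rstring", "match"),
                    ("rstring", "metric"), ("rstring", "metric_normalized")],
    "toPrintmetricsIndicator_features": [("rstring", "metric"),
                                         ("rstring", "amount"),
                                         ("rstring", "match")],
}
for line in emit_types(views):
    print(line)
```

Passing `wanted=["toPrintmain"]` restricts the output to a single view, which mirrors the case where a module exposes many output views but only a few are needed downstream.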
Tooling • The latest version of the BigInsights tooling, shipped with BigInsights V2.0, will be used for both versions of the toolkit. • The tooling is available to Streams Studio via an update site; the user will need to update the shipped Studio to add the optional tooling. • Projects can be imported into the latest version (2.0) in two ways: • Compatibility mode: projects created with the older version of the toolkit need not be migrated to the modular version. • Migration to modular code: to use the project with modular AQL code, the text analytics properties for the project need to be migrated.