140 likes | 281 Views
IBM Content Analytics with Enterprise Search. BigInsights Integration. Mar 28 th , 2012. Challenge and Approach. Challenge Achieve massive scale-out Utilize cloud environment as resource pool Approach Keep compatibility with current version to respect existing customers
E N D
IBM Content Analytics with Enterprise Search BigInsights Integration Mar 28th, 2012
Challenge and Approach • Challenge • Achieve massive scale-out • Utilize cloud environment as resource pool • Approach • Keep compatibility with current version to respect existing customers • No end user impact • Seamless administration • Utilize current assets • UIMA Infrastructure • UIMA Annotators (LW, System-T, Takmi,…) • Various data source crawlers • … • Utilize BigInsights as scale-out infrastructure
Seamless Scale-out Scenario • ICA V3.0 offers 3 types of system configuration according to the volume of data * BigInsights is supported only on Linux POC with small data can be done on a single workstation Production system will be deployed to 1 to N servers Production system analyzing big data will utilize BigInsights
About InfoSphere BigInsights • IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. … BigInsights enhancesthis technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research…. • InfoSphere BigInsights is prereq • Version 1.3 is officially supported
Feature Overview: Collection on BigInsights • Search & Text Analytics Capability • UIMA • System-T • Gumshoe • Scale Out • IBM Hadoop • ILEL BigIndex • Flexible Job Flow • Orchestrator (a.k.a. MetaTracker) • Easy Data Manipulation • JAQL • Robust File System • GPFS (Shared Nothing Cluster version, not yet released)
UIMA Annotators - Gumshoe LA - Gumshoe GA - LanguageWare - TAKMI - User Custom System-T Analysis Pre-Processing Indexing ICA GA UIMA Analysis • Gumshoe Relevancy Cache ICA V3.0 Analytics Flow on BigInsights Operationby JAQL IBM InfoSphere BigInsights Job Flow controlledby Orchestrator(MetaTracker) IBM Content Analytics Document Processing Flow Custom Data • Link Analysis- Dup Doc Elimination- Facet Grouping • - Custom GA Slave Index BigIndex Orchestrator RDS RDS HDFS/GPFS Job Request Crawler Document Processing Flow Text Analytics / SearchRuntime Various Data source Indexing Service Process UI Importer Local Analysis (UIMA base) Regular OS GlobalAnalysis Other App. Exporter
How Jaql and Hadoop Map/Reduce works with ICA Example: Omit duplicated documents in RDS by Jaql/Hadoop parseRds=fn(rdsFiles:[{path:string,offset:long}],options,output:schematype=null) ( mapReduce({ input:rdsFileDescriptor(rdsFiles[*].path,keepRemoved=false), output:(if(isnull(output))(HadoopTemp())else(output)), map:fn(v) (v->transform [$.uri,$]), reduce: fn(k,v) ( d=v->sort by [$.sequenceNumber desc], if( d[0].code > 0 and d[0].code != 4050)([d[0]])else([]) ), … {uri=“B”,seqno=1,…} {uri=“C” seqno=101,…}, … {uri=“A”,seqno=100,…} … Key=“A”, value={uri=“A”, seqno=100,…} Key=“C”, value={uri=“C”, seqno=101,…} … {uri=“A”,seqno=100,…} {uri=“C” seqno=101,…}, … {uri=“A”,seqno=0,…} {uri=“B” seqno=1,…}, … Key=“A”, value={uri=“A”, seqno=0,…} Key=“B”, value={uri=“B”, seqno=1,…} … Key=“A”, Values=[ {uri=“A”, seqno=100,,…}, {uri=“A”, seqno=0,…} ] Key=“B”,Values=[ {uri=“B”, seqno=1,…}] Key=“C”,Values=[ {uri=“C”, seqno=101,…} ] Output Format Output Format Input Format Input Format Reduce Map Map Reduce RDS Json RDS Json
Easy Configuration • Specify BigInsights Sever Information Admin user can confirm the setting on Topology View • Specify “Use IBM BigInsights” while creating a collection • Then configuration files and ICA libraries, UIMA PEARs (including custom PEAR) and other required modules will be distributed to BIgInsights servers automatically
Storage requirement with BigInsights • ICA • ES_NODE_ROOT should be shared on all nodes to share configuration and other resoureces • BigInsights • Jaql and Map/Reduce uses local storage as temporary storage • HDFS will also uses local storage as a part of the file system • BigIndex also consumes local storage to merge indexes • It is strongly suggested to use GPFS with fibre as storage in performance/reliability reasons for small cluster
Storage requirement with BigInsights : HDFS • HDFS • Storage on each data node will used as a part of file system • Can increase capacity by adding storage on each data node or adding new data node with storage • Have replication of each blocks ( default : 3 ) • Each searcher process downloads index from HDFS to local file system
Storage requirement with BigInsights :Shared storage • Shared storage • High performance storage (i.e. GPFS with fibre) will be required • Each searcher must share the storage • ICA servers should use same storage as ES_NODE_ROOT • Using NFS has some requirement, please check release note
Custom Global Analysis by JAQL • Global Analysis • Obtain new information by examining all documents in a collection • Link Counting • Duplicated Document Detection • etc • Custom Global Analysis by JAQL • User can integrate his own Global Analysis logic using JAQL • Input is the result of ICA document processing (field, facet, content) • Output can be stored as a document field or facet • User Benefits • New data manipulation point across documents • Crawler plug-in, UIMA Annotator can manipulate data only within each document • Manipulate data using Map/Reduce from script like SQL • JAQL releases developers from Java programming of Map/Reduce