
IBM Content Analytics with Enterprise Search

IBM Content Analytics with Enterprise Search: BigInsights Integration, Mar 28th, 2012. Challenge: achieve massive scale-out and utilize the cloud environment as a resource pool. Approach: keep compatibility with the current version to respect existing customers.

louvain

Presentation Transcript


  1. IBM Content Analytics with Enterprise Search: BigInsights Integration (Mar 28th, 2012)

  2. Challenge and Approach
     • Challenge
       • Achieve massive scale-out
       • Utilize cloud environment as resource pool
     • Approach
       • Keep compatibility with current version to respect existing customers
         • No end user impact
         • Seamless administration
       • Utilize current assets
         • UIMA Infrastructure
         • UIMA Annotators (LW, System-T, Takmi, …)
         • Various data source crawlers
         • …
       • Utilize BigInsights as scale-out infrastructure

  3. Seamless Scale-out Scenario
     • ICA V3.0 offers 3 types of system configuration according to the volume of data
       • A POC with small data can be done on a single workstation
       • A production system is deployed to 1 to N servers
       • A production system analyzing big data utilizes BigInsights
     * BigInsights is supported only on Linux

  4. About InfoSphere BigInsights
     • IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. … BigInsights enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research….
     • InfoSphere BigInsights is a prerequisite
     • Version 1.3 is officially supported

  5. Feature Overview: Collection on BigInsights
     • Search & Text Analytics Capability
       • UIMA
       • System-T
       • Gumshoe
     • Scale Out
       • IBM Hadoop
       • ILEL BigIndex
     • Flexible Job Flow
       • Orchestrator (a.k.a. MetaTracker)
     • Easy Data Manipulation
       • JAQL
     • Robust File System
       • GPFS (Shared Nothing Cluster version, not yet released)

  6. ICA V3.0 Analytics Flow on BigInsights [architecture diagram; only its labels survive in the transcript]
     • Job flow controlled by Orchestrator (MetaTracker); operation by JAQL
     • UIMA Annotators: Gumshoe LA, Gumshoe GA, LanguageWare, TAKMI, User Custom; System-T
     • Stages on IBM InfoSphere BigInsights: Pre-Processing, UIMA Analysis, ICA GA (Link Analysis, Dup Doc Elimination, Facet Grouping), Custom GA, Indexing, Gumshoe Relevancy Cache
     • Storage and index: RDS, HDFS/GPFS, BigIndex, Slave Index
     • IBM Content Analytics document processing flow: Crawler (various data sources), Importer, Local Analysis (UIMA base), Global Analysis, Indexing Service, Exporter, Text Analytics / Search Runtime, UI
     • Other labels: Job Request, Custom Data, Other App., Regular OS Process

  7. How Jaql and Hadoop Map/Reduce works with ICA
     Example: Omit duplicated documents in RDS by Jaql/Hadoop

       parseRds=fn(rdsFiles:[{path:string,offset:long}],options,output:schematype=null) (
         mapReduce({
           input:rdsFileDescriptor(rdsFiles[*].path,keepRemoved=false),
           output:(if(isnull(output))(HadoopTemp())else(output)),
           map:fn(v) (v->transform [$.uri,$]),
           reduce: fn(k,v) (
             d=v->sort by [$.sequenceNumber desc],
             if( d[0].code > 0 and d[0].code != 4050)([d[0]])else([])
           ),
           …

     [Data-flow diagram: RDS → Input Format → Map → Reduce → Output Format → Json, drawn as two parallel pipelines]
     • Input records (RDS): {uri="A",seqno=100,…} {uri="C",seqno=101,…}, … and {uri="A",seqno=0,…} {uri="B",seqno=1,…}, …
     • Map output: Key="A", value={uri="A",seqno=100,…}; Key="C", value={uri="C",seqno=101,…}; Key="A", value={uri="A",seqno=0,…}; Key="B", value={uri="B",seqno=1,…}; …
     • Reduce input: Key="A", Values=[{uri="A",seqno=100,…}, {uri="A",seqno=0,…}]; Key="B", Values=[{uri="B",seqno=1,…}]; Key="C", Values=[{uri="C",seqno=101,…}]
     • Output records (Json): {uri="A",seqno=100,…}, {uri="B",seqno=1,…}, {uri="C",seqno=101,…}, …
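The same deduplication logic can be sketched in plain Python to make the map, group, and reduce steps explicit. This is only an illustration under stated assumptions, not the ICA implementation: the record layout, the sample `code` value 200, and the helper names `map_phase`, `shuffle`, `reduce_phase`, and `dedupe` are hypothetical. Only the `uri` key, the descending sort on `sequenceNumber`, and the `code > 0 and code != 4050` filter come from the slide.

```python
from collections import defaultdict

# 4050 is taken from the slide's reduce condition; the slide does not say what it means.
EXCLUDED_CODE = 4050


def map_phase(records):
    """Emit (uri, record) pairs, mirroring `v -> transform [$.uri, $]`."""
    for rec in records:
        yield rec["uri"], rec


def shuffle(pairs):
    """Group mapped values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Keep the record with the highest sequenceNumber if its code passes the filter."""
    newest = max(values, key=lambda r: r["sequenceNumber"])
    if newest["code"] > 0 and newest["code"] != EXCLUDED_CODE:
        return [newest]
    return []


def dedupe(records):
    """Run map, shuffle, and reduce in-process; keys are sorted for stable output."""
    result = []
    for _key, values in sorted(shuffle(map_phase(records)).items()):
        result.extend(reduce_phase(values))
    return result


# Hypothetical sample data; code=200 is an invented "valid" value.
records = [
    {"uri": "A", "sequenceNumber": 0, "code": 200},
    {"uri": "B", "sequenceNumber": 1, "code": 200},
    {"uri": "A", "sequenceNumber": 100, "code": 200},
    {"uri": "C", "sequenceNumber": 101, "code": EXCLUDED_CODE},
]
deduped = dedupe(records)
```

In this sample, URI "A" survives only with sequence number 100, and "C" is dropped because its newest record carries the excluded code.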

  8. Differences: In general

  9. Easy Configuration
     • Specify BigInsights server information
       • The admin user can confirm the setting on the Topology View
     • Specify "Use IBM BigInsights" while creating a collection
     • Configuration files, ICA libraries, UIMA PEARs (including custom PEARs), and other required modules are then distributed to the BigInsights servers automatically

  10. Storage requirement with BigInsights
     • ICA
       • ES_NODE_ROOT should be shared on all nodes to share configuration and other resources
     • BigInsights
       • Jaql and Map/Reduce use local storage as temporary storage
       • HDFS also uses local storage as a part of the file system
       • BigIndex also consumes local storage to merge indexes
     • For a small cluster, it is strongly suggested to use GPFS with fibre as storage, for performance and reliability reasons

  11. Storage requirement with BigInsights: HDFS
     • HDFS
       • Storage on each data node is used as a part of the file system
       • Capacity can be increased by adding storage to each data node or by adding new data nodes with storage
       • Each block is replicated (default: 3 replicas)
       • Each searcher process downloads the index from HDFS to its local file system
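Because each block is stored several times, usable HDFS capacity is roughly the raw disk capacity divided by the replication factor. The helper below is a back-of-the-envelope sketch, not a sizing tool: the function name and the `reserved_fraction` parameter (a share of local disk set aside for Jaql/MapReduce temporary files, BigIndex merges, and searcher index copies) are assumptions introduced for illustration.

```python
def usable_hdfs_capacity_tb(data_nodes, disk_tb_per_node, replication=3, reserved_fraction=0.0):
    """Estimate usable HDFS capacity in TB.

    replication defaults to 3, the default stated on the slide.
    reserved_fraction is a hypothetical share of local disk kept out of HDFS.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not 0.0 <= reserved_fraction < 1.0:
        raise ValueError("reserved_fraction must be in [0, 1)")
    raw_tb = data_nodes * disk_tb_per_node
    return raw_tb * (1.0 - reserved_fraction) / replication


# Four data nodes with 6 TB each: 24 TB raw, about 8 TB usable at replication 3.
baseline = usable_hdfs_capacity_tb(4, 6)
# Same cluster with a quarter of each disk reserved for temporary and index files.
with_reserve = usable_hdfs_capacity_tb(4, 6, reserved_fraction=0.25)
```

Adding a data node or adding disks to existing nodes both raise `raw_tb`, which matches the two ways of increasing capacity listed above.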

  12. Storage requirement with BigInsights: Shared storage
     • Shared storage
       • High-performance storage (e.g. GPFS with fibre) is required
       • Each searcher must share the storage
       • ICA servers should use the same storage as ES_NODE_ROOT
       • Using NFS has some requirements; please check the release notes

  13. Custom Global Analysis

  14. Custom Global Analysis by JAQL
     • Global Analysis
       • Obtain new information by examining all documents in a collection
         • Link Counting
         • Duplicated Document Detection
         • etc.
     • Custom Global Analysis by JAQL
       • Users can integrate their own Global Analysis logic using JAQL
       • Input is the result of ICA document processing (field, facet, content)
       • Output can be stored as a document field or facet
     • User Benefits
       • New data manipulation point across documents
         • Crawler plug-ins and UIMA Annotators can manipulate data only within each document
       • Manipulate data using Map/Reduce from an SQL-like script
         • JAQL frees developers from writing Map/Reduce programs in Java
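Link counting is a useful example of why global analysis needs a cross-document step: the number of inbound links to a document can only be known after every other document has been read. The sketch below illustrates the idea in Python rather than JAQL. The document layout and the field names `links` and `inlink_count` are hypothetical and do not reflect the ICA schema.

```python
from collections import Counter


def count_inbound_links(docs):
    """Return copies of docs with an added inlink_count field.

    Each doc is a dict with a 'uri' and an optional 'links' list of target URIs.
    Repeated links from one document to the same target are counted once.
    """
    counts = Counter()
    for doc in docs:
        for target in set(doc.get("links", [])):
            counts[target] += 1
    # Counter returns 0 for URIs that nothing links to.
    return [{**doc, "inlink_count": counts[doc["uri"]]} for doc in docs]


# Hypothetical collection of three documents.
docs = [
    {"uri": "A", "links": ["B", "C", "C"]},
    {"uri": "B", "links": ["C"]},
    {"uri": "C"},
]
annotated = count_inbound_links(docs)
```

The resulting `inlink_count` plays the role of the output "stored as a document field or facet" described above; a per-document crawler plug-in or annotator could not compute it, since it never sees the other documents.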
