1 / 53

An Effective Data Integration: Strategy to Drive Innovation on the InfoSphere Platform

An Effective Data Integration: Strategy to Drive Innovation on the InfoSphere Platform. Simon Tang InfoSphere Technical Manager IBM GCG. Pain Point: Understanding Core Information Assets. “This data does not look right” – Business User.

mark-price
Download Presentation

An Effective Data Integration: Strategy to Drive Innovation on the InfoSphere Platform

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. An Effective Data Integration: Strategy to Drive Innovation on the InfoSphere Platform Simon Tang InfoSphere Technical Manager IBM GCG

  2. Pain Point: Understanding Core Information Assets “This data does not look right” – Business User “I don’t have the information I need” – Business Analyst “We are not leveraging our information” - Architect “How can I see how this is used” –Governance Steward “I’m not sure what the business wants” - Developer “What systems will be impacted from this change” - DBA

  3. Low data quality costs companies $611 billion annually 25% of time is spent clarifying bad data Undetected defects will cost 10 to 100 times as much to fix upstream Impact of NOT Managing Core Information Assets Scrap and rework Increased $$$ 83% of data integration projects either overrun or fail Lack of consumer confidence Lost opportunities Inaccurate or incomplete data is a leading cause of failure in business-intelligence and CRM projects

  4. Subject Matter Experts Data/Business Analysts Architects GovernanceStewards Trusted Information1. Accurate2.Complete3. Insightful4. Real Time Who are Looking for Trusted Information? • Target Audience • Data/Business Analysts • Subject Matter Experts • Architects • Governance Stewards • What are they working on? • Information-centric projects: • BI & Data Warehousing • Master Data Management • Application Implementation, Consolidation or Migration • Information Architecture • Governance Initiatives • What do these roles do today? • Manage information manually in disconnected tools, documents, and spreadsheets • What is wrong with what they do today? • Time consuming – churn between business & IT • Imprecise & error prone – manual processes not thorough enough • No collaboration – different roles work in silos • Lacks audit trail – no ongoing record • Redundancy – duplication of effort & storage 5

  5. A Flexible Platform for Managing, Integrating, Analyzing and Governing Information Transactional& Collaborative Applications Business Analytics Applications Analyze Integrate Big Data Master Data Manage Cubes Integration Streams Data Data Warehouses Content External Information Sources Streaming Information Govern Security & Privacy Quality Lifecycle

  6. Order Proc SupplyChain Procure-ment CRM Challenges in Data Management • Touching data multiple times at its source – storing multiple times and updating multiple times • Inability to share common business rules across projects, processes and applications • Lack single, repeatable methodology for consistency across all projects • Inconsistent islands of information underlying applications • Complex, manual & costly copy synchronization • Inconsistent and poor quality data • Inability to exploit enterprise meta data across tools

  7. Convert information into a trusted strategic asset Discover and understand the data across heterogeneous systems Design trusted information structures for business optimization Govern that information over time Only IBM has invested to provide the breadth of capabilities to define and govern your information… • Business Vocabulary • Data Relationships • Data Quality Compliance • Data Models and Mapping • Business Specification Rules • Provenance of information

  8. Remedy: 10 Proven Strategies No single path is THE panacea to all corporate data problems - multiple approaches must be employed Consider where your organization’s most SIGNIFICANT data pain exists – take that approach first

  9. Strategy #1 – Understand Source Systems • Verify if characteristics of data conform to established / known business rules Business Analysis Data Analysis • Discovers actual characteristics of data • Report on the assessment and variances / exceptions

  10. 0 10 20 30 40 50 60 70 80 90 100 Strategy #1 – Understand Source Systems • Poor data quality costs U.S. businesses over $600 billion each year • Data deteriorates up to 3% every month • What is the key to integrating corporate data? – Having the right data before you start Ensuring adequate data quality Understanding source data Creating complex transformations Creating complex mappings Ensuring adequate performance Collecting and maintaining meta data Finding skilled programmers Providing access to meta data Ensuring adequate scalability Integrating 3rd party tools Ensuring adequate reliability

  11. Table & Primary Key Analysis Source 2 Column Analysis Source 1 Foreign Key & Duplicate Analysis Foreign Key & Duplicate Analysis Recommended Best Practices: Automated Data Profiling Advice: You won’t have the time, $ or energy to profile 100% quickly so go automated No coding

  12. Error Codes? Spelling Errors Lack of Standards in Synonyms, Acronyms, Abbreviations Assembly Instruction Size Part Strategy #2 – Build-in Data Quality • Same company / person? • Same address? • Same parts? • Same instructions?

  13. Original Building | Street | Unit Blk 1, 1 St, 05-00 05-00 Frist St, Block 1 1 First Str, #05-00 Block 1, First Str, #05-00 1, St, #05-00 Blk 1 |First St|05-00 Blk 1 |First St|05-00 1 |First St|#05-00 Blk 1 |First St|#05-00 1 |St |#05-00 Survive Match Standardize Building | Street | Unit Blk 1 |First St|05-00 Blk 1 |First St|05-00 1 |First St|#05-00 Blk 1 |First St|#05-00 1 |St |#05-00 Final Result #05-00, Blk 1, First St #05-00, 1, St Recommended Best Practices: Data Cleansing Data Re-Engineering

  14. From Data Model From ETL Tool From BI Tool From Database Customer CustomerNumber Name Address Comments CustomerTbl CustomerID Name Address Address1 Comments CustomerDetails CustomerNumber Name Address Remarks Customer ID Name Address1 Address2 Descr Which meta data is right? Which one is current? Which one should be used? Strategy #3 – Share Common Meta Data The Identifier of customers that are tracked for ordering purposes. Corporate customer identifiers are assigned by the Sales Data Controller according to the corporate data description and naming policy for reference identifiers. Unique identifier of customers that are tracked for ordering purposes. Values start with 02 for non-Corporate customers and 01 for Corporate customers. Customer’s identifier numbers. Values start with 01 for Corporate customers, 02 for non-Corporate customers, 03 for overseas-based Customers. <NULL>

  15. Modeling tool BI tool BI Repository ETL Tool + Processes Other sources’ definition files COBOL definition files Recommended Best Practices: Create a common repository Integrated Meta Data Repository Integrate by gathering in from diverse applications and sources

  16. Create a Common Vocabulary Category: Costs Term: Tax Expense Full Name: Tax to be paid on Gross Income “The expense due to taxes …..” (John Walsh is responsible for updates. 90% reliable source) Status: CURRENT Database = DB2 Schema = NAACCT Table = DLYTRANS Column = TAXVL data type = Decimal (14,2) Derivation: SUM(TRNTXAMT) Achieve a common vocabulary between business & technical users! Shared Metadata Server & Repository InfoSphere DataStage InfoSphere Business Glossary

  17. Collaborate and Share Feedback GL Organizational Unit STEWARD: Controllers Office FORMAT: X(7) DEFINITION: A seven digit number designating the organizational unit to which this account belongs. Author Standard Definitions Annotate and Share Feedback I’ve noticed that the last two digits of the GL Organizational Unit, which indicate the sub-department, are often blank.

  18. Extend Business Information • Categorize Information Assets according to Business Logic • Map Business Terms to Information Assets • Find and view relevant details of Information Assets • View the Stewardship of Information Assets

  19. Where does a Field of Data in this Report Come From? • Import & Browse Full BI Report Metadata • Navigate through report attributes • Visually navigate through data lineage across tools • Combines operational & design viewpoint

  20. Metadata Lineage available from Studio & Viewers IBM Confidential

  21. Access Business Glossary from Cognos Studios IBM Confidential

  22. Strategy #4 – Connect to Any System, Anywhere DB2, Informix, Netezza, ODBC, Oracle, Red Brick, SAS, Sybase, Teradata, etc WebSphere MQ, SeeBeyond, JMS, XML, EJB, Web Services, EXML, XMLS, EDI, SWIFT, etc Adabas, Allbase/SQL, Datacom/DB, DB2/400, DB2/OS390, Essbase, FOCUS, IDMS/SQL, IMS, NonStopSQL, RDB, VSAM, etc Oracle Applications, PeopleSoft, SAP R/3, SAP BW, Siebel

  23. Recommended Best Practices: Native Connectivity Software Advice: Go for pre-built connectors with little/no coding Do you wish to worry what will be your next application or database to connect to?

  24. Strategy #5 – Abandon Hand-coding These Visual BASIC, Java, C++, UNIX codes can be developed cheaply and they work … … but what happens when there is a new source or requirement? Cheap? Works? Maybe not.

  25. Recommended Best Practices: Graphical ETL Tools • Benefits: • Jobs are easy to develop, understand, debug and maintain • Robust, fully-tested, best practices approach to data migration or extraction

  26. Recommended Best Practices: Graphical ETL Tools • Benefits: • Complex transformations can be made very simple with mere point-and-click

  27. Workflow Process - Sequences • Workflow is as important as dataflow. • Dynamic workflow processes can be defined during the workflow itself. • DataStage can run external processes and perform complex evaluations inline. • Advanced concepts such as looping are supported.

  28. Physical Machine Utilization Disk Throughput Average Process Distribution Free Memory Whisker Box Percent CPU Utilization

  29. 44x as much Data and ContentOver Coming Decade Velocity Variety 2020 35 zettabytes Volume 2009 800,000 petabytes Strategy #6 – Implement a Highly Scalable Foundation Prediction: Your data volume is not going to get smaller

  30. Processing Throughput (Hundreds of Gigabytes) Processing Time (Hours) . . . . . . 32X 32 24X 24 16X 16 8X 8 1 1X 32 . . . 32 . . . 1 8 16 24 1 8 16 24 Number of Processors Number of Processors Strategy #6 – Implement a Highly Scalable Foundation 2 considerations in handling growth: You want these Not these or

  31. Scalable Hardware Platform Scalable Database Platform Database vendors have offered a scalable parallel relational database for more than 5 years. Hardware vendors have offered scalable parallel computers for more than 5 years. Data integration vendors are starting to offer “scalable” “parallel” platforms Scalable Data Integration Platform Strategy #6 – Implement a Highly Scalable Foundation Three Elements of a Scalable Infrastructure

  32. Shared Disk Shared Disk Shared Disk Shared Disk Shared Disk Shared Disk Shared Disk Shared Disk Shared Disk Shared Disk CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory SMP System SMP System SMP System SMP System SMP System SMP System SMP System SMP System SMP System SMP System Shared Disk CPU CPU CPU CPU CPU CPU CPU Shared Memory Shared Memory SMP System Recommended Best Practices: Parallelism Make sure you get this Not this

  33. Data Warehouse Source Data TRANSFORM ENRICH LOAD Application Execution: Sequential or Parallel Sequential 4-Way Parallel 64-Way Parallel Sort Time to MPP, GRID, and Clustered Systems Auto parallel-enabled and parallel-aware run-time execution Uniprocessor SMP System Process Join Scan Parallel Serial Parallel Recommended Best Practices: Parallelism One application assembly

  34. Strategy #7 – Architect for “Right-Time” • In an InformationWeek 2003 survey of 467 business professionals about how often their IT systems provide business managers with timely updates of primary products or services: • 3% no such process • 1% annually • 17% monthly • 13% weekly • 36% daily • 5% hourly • 8% every minute • In that same report: • “Whereas 57% of sites surveyed a year ago said that real-time business information was a key company focus, 70% see it that way today.”

  35. Appropriate Response Awareness Event Occurs campaign initiated . . . . . . . . . . . . . . . . . . . . . . . tuning customer churns . . . . . . . . . . . . . . . . . . . . . . . win-back Acceptable Latency fraud committed . . . . . . . . . . . . . . . . . . . . . . . prevention website click . . . . . . . . . . . . . . . . . . . . . . . offer made Recommended Best Practices: Right-Time Business Event Occurs Recognition Response Latency Latency Latency is defined as the elapsed time between when an eventoccurs and when an appropriate response or action is made

  36. Recommended Best Practices: Right-Time Business Event Occurs Recognition Latency • Improving the ability to recognize business events Recognition Response Latency • Improving the ability to respond to those events

  37. Information Server Log-Based Change Data Capture Monitoring and Configuration Key Benefits: • Low impact • Flexible implementation Database Web Services Message Queue InfoSphere Information Server TCP/IP DB2, Oracle, SQL Server, etc Database Logs Source Engine Target Engine Flat files • Heterogeneous platform support • Easy to use

  38. EDW IBM Information Server InfoSphere CDC & InfoSphere DataStage (ETL) Information ServerChange Data Capture Data Stage Consumption Direct Connect TCP via Data Stage operator Staging Table Out of the box Point Of Sale Message Queue Out of the box “CDC” Continuous DataStage DSX file format Flat File Native Teradata, DB2, Oracle, SQL Server, Sybase… Oracle DB Log IBM Information Server ETL Load Retail Including BalOp (ELT)

  39. Web applications EAI, BPM, EII Dashboards Portals Packaged Apps Master Data Stores Data Warehouses Legacy Apps Business Partner Data Strategy #8 – Extend Quality and Transformation Capabilities throughout the Enterprise • Hand-coded rules in each project/tool are not re-usable to other projects/tools • High costs associated with building & maintaining data access, data quality and transformation rules in each project

  40. Web Services Service-Oriented Architecture SOA Approach Message Queues, EAI get customer SOAP/HTTP Java, Application Servers JMS EJB Packaged Apps Master Data Stores Data Warehouses Legacy Apps Business Partner Data Recommended Best Practices: Data Integration Services • Service-Oriented Architecture (SOA) approach packages data integration logic of SOA-friendly applications as services • Services can be invoked as Web Services, EJB, JMS by any third-party applications

  41. Strategy #9 – Choose a Proven Deployment Methodology designed for Quick Success • Many available out there • How many and which are workable – who knows? • Be aware there are as much risks in deployment methodology as there in tools usage

  42. plan investigate proto-type iteration unittest End Derive BusinessValue operate design etc. manage systemtest Evaluate Results maint-enance 12 - 24 Weeks deploy develop Deploy Solution UAT monitor regressiontest Establish BusinessDrivers Prod-uction audit Start Recommended Best Practices: Iterative Deployment Plan

  43. A Blueprint DirectorThe GPS for your information project Palette free form “sketching” elements Diagram for a blueprint • Method browser (displaying method content) • Asset browser (browsing metadata repository) • Glossary explorer (showing glossary tree view) • Outline (zoom in/out view) • Blueprint explorer (shows tree view of the elements in the blueprint) Context specific property view

  44. Business and IT: Working Together Business Analyst • Collects business terms and business requirements; Converts into business rules in a spec Business Requirements • Business terms Successful Data Integration Project • extract • transform • load Create DataStage jobs and data flows that reflect business needs. Mapping specification created – critical to collaboration between IT and business • Takes those business rules and mapping spec and turns them into code, such as a DataStage job. Developer

  45. Track business requirements to application deployment • Single, central managed infrastructure to track requirements to deployment • Import Excel mapping spreadsheets • Define and link business terms to physical structures • Generate DataStage jobs with annotated to-do tasks for developer • Generate historical documentation for tracking Define mapping specification with business rules and terms Auto-generate DataStage jobs Flexible reporting and tracking

  46. Strategy #10 – Ensure Interoperability of Integration Infrastructures The Goal Connected, integrated, seamlessly The Reality Cobbled, piece-meal, manual-intensive

  47. Data Integration Projects require a Collaborative Effort Business user business terms business requirements Data Modeler Business Analyst Data Analyst transformation rules data model Developer • extract • transform • load data flow application

  48. 7 1 3 Establish Platform Import & Enhance Industry Model Understand Data Relationships Deliver Reports Cognos Discovery Data Architect Populates 5 6 2 4 Generate Logic to Load Warehouse Define Business Requirement & Glossary Assess, Monitor, Manage Data Quality Rules Map Sources to Target Model Links DataStage & QualityStage FastTrack Information Analyzer Business Glossary Metadata Server Simplification & Content: reduces project time, risk and cost! 49

  49. Information Services Director Publish SOA services for information integration and access Business Glossary QualityStage Federation Server Enterprise Data Dictionary Virtualize access to disparate information DataStage Data Quality: Standardize, Correct & Match Data Global Name Recognition Extract, Transform, and Load in Batch or Real-time Information Analyzer CDC & Replication Data Source Profiling & Problem Diagnosis Recognize & Classify Multi-cultural names Deliver and replicate changed data Metadata Server / Metadata Workbench / FastTrack Manage and track consistent metadata across information integration tasks and automate generation of data flow logic Parallel Processing Rich Connectivity to Applications, Data, and Content Recommended Best Practices: Integrated Tool Suites

More Related