An Effective Data Integration Strategy to Drive Innovation on the InfoSphere Platform
Simon Tang, InfoSphere Technical Manager, IBM GCG
Pain Point: Understanding Core Information Assets
• “This data does not look right” – Business User
• “I don’t have the information I need” – Business Analyst
• “We are not leveraging our information” – Architect
• “How can I see how this is used?” – Governance Steward
• “I’m not sure what the business wants” – Developer
• “What systems will be impacted by this change?” – DBA
Impact of NOT Managing Core Information Assets
• Low data quality costs companies $611 billion annually
• 25% of time is spent clarifying bad data
• Defects that go undetected at the source cost 10 to 100 times as much to fix downstream
• 83% of data integration projects either overrun or fail
• Inaccurate or incomplete data is a leading cause of failure in business-intelligence and CRM projects
• The result: scrap and rework, increased costs, lack of consumer confidence, lost opportunities
Who Is Looking for Trusted Information?
• Target audience: Data/Business Analysts, Subject Matter Experts, Architects, Governance Stewards
• What are they working on? Information-centric projects:
  • BI & Data Warehousing
  • Master Data Management
  • Application Implementation, Consolidation or Migration
  • Information Architecture
  • Governance Initiatives
• What do these roles do today? Manage information manually in disconnected tools, documents, and spreadsheets
• What is wrong with what they do today?
  • Time consuming – churn between business & IT
  • Imprecise & error prone – manual processes are not thorough enough
  • No collaboration – different roles work in silos
  • Lacks audit trail – no ongoing record
  • Redundancy – duplication of effort & storage
Trusted information is: 1. Accurate, 2. Complete, 3. Insightful, 4. Real Time
A Flexible Platform for Managing, Integrating, Analyzing and Governing Information
• Manage: big data, master data, data warehouses, content, and streaming information from external information sources
• Integrate & Analyze: feed transactional & collaborative applications and business analytics applications (cubes, streams)
• Govern: security & privacy, quality, and lifecycle
Challenges in Data Management
(Silo applications: order processing, supply chain, procurement, CRM)
• Touching data multiple times at its source – storing and updating it multiple times
• Inability to share common business rules across projects, processes and applications
• Lack of a single, repeatable methodology for consistency across all projects
• Inconsistent islands of information underlying applications
• Complex, manual & costly copy synchronization
• Inconsistent and poor-quality data
• Inability to exploit enterprise metadata across tools
Only IBM has invested to provide the breadth of capabilities to define and govern your information:
• Convert information into a trusted strategic asset
• Discover and understand the data across heterogeneous systems
• Design trusted information structures for business optimization
• Govern that information over time
Capabilities include: business vocabulary, data relationships, data quality compliance, data models and mapping, business specification rules, and provenance of information
Remedy: 10 Proven Strategies
• No single approach is a panacea for all corporate data problems – multiple approaches must be employed
• Consider where your organization’s most SIGNIFICANT data pain exists, and take that approach first
Strategy #1 – Understand Source Systems
• Business analysis: verify whether the characteristics of the data conform to established / known business rules
• Data analysis: discover the actual characteristics of the data
• Report on the assessment and on variances / exceptions
Strategy #1 – Understand Source Systems
• Poor data quality costs U.S. businesses over $600 billion each year
• Data deteriorates up to 3% every month
• What is the key to integrating corporate data? Having the right data before you start
[Survey chart: top data-integration challenges, ranked by percentage of respondents]
• Ensuring adequate data quality
• Understanding source data
• Creating complex transformations
• Creating complex mappings
• Ensuring adequate performance
• Collecting and maintaining metadata
• Finding skilled programmers
• Providing access to metadata
• Ensuring adequate scalability
• Integrating 3rd-party tools
• Ensuring adequate reliability
Recommended Best Practices: Automated Data Profiling
• Column analysis on each source
• Table & primary key analysis
• Foreign key & duplicate analysis across sources
• No coding required
Advice: You won’t have the time, money or energy to profile 100% of your data quickly, so automate it
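The profiling steps above can be sketched in plain Python. The table names and values below are hypothetical, and this is only an illustration of what an automated profiling tool performs at scale, not its actual implementation:

```python
from collections import Counter

def profile_column(values):
    """Column analysis: infer basic characteristics of one column."""
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(freq),
        "most_common": freq.most_common(3),
        "unique": len(freq) == len(non_null),  # candidate primary key
    }

def foreign_key_coverage(child_values, parent_values):
    """Foreign-key analysis: what fraction of child values exist in the parent?"""
    parent = set(parent_values)
    hits = sum(1 for v in child_values if v in parent)
    return hits / len(child_values)

# Hypothetical source tables
customers = [101, 102, 103, 104]
orders_cust = [101, 101, 103, 999]  # 999 is an orphan row

print(profile_column(customers))
print(foreign_key_coverage(orders_cust, customers))  # 0.75
```

A coverage ratio below 1.0 flags orphan rows before any integration job is built, which is exactly the kind of variance/exception the assessment report should surface.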
Strategy #2 – Build-in Data Quality
Typical defects in part data (parts, sizes, assembly instructions): error codes, spelling errors, and a lack of standards in synonyms, acronyms, and abbreviations, which make it hard to answer:
• Same company / person?
• Same address?
• Same parts?
• Same instructions?
Recommended Best Practices: Data Cleansing / Data Re-Engineering
Original addresses:
• Blk 1, 1 St, 05-00
• 05-00 Frist St, Block 1
• 1 First Str, #05-00
• Block 1, First Str, #05-00
• 1, St, #05-00
Standardize (Building | Street | Unit):
• Blk 1 | First St | 05-00
• Blk 1 | First St | 05-00
• 1 | First St | #05-00
• Blk 1 | First St | #05-00
• 1 | St | #05-00
Match: the first four records resolve to the same address; the last is distinct
Survive: the best value of each field is kept
Final result:
• #05-00, Blk 1, First St
• #05-00, 1, St
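The match and survive stages can be illustrated with a short Python sketch over the standardized records from the slide. Standardization itself (token parsing, synonym tables for Blk/Block, St/Str, and so on) is assumed to have happened upstream; the match key and survivorship rule here are simplified illustrations, not QualityStage's actual probabilistic matching:

```python
from collections import Counter, defaultdict

# Standardized records: (building, street, unit)
records = [
    ("Blk 1", "First St", "05-00"),
    ("Blk 1", "First St", "05-00"),
    ("1",     "First St", "#05-00"),
    ("Blk 1", "First St", "#05-00"),
    ("1",     "St",       "#05-00"),
]

def match_key(rec):
    """Match: group records on normalized street + unit digits."""
    building, street, unit = rec
    return (street.lower(), unit.lstrip("#"))

def survive(group):
    """Survive: keep the most frequent value of each field."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*group))

groups = defaultdict(list)
for rec in records:
    groups[match_key(rec)].append(rec)

for key, group in groups.items():
    print(survive(group))
```

The first four records fall into one match group and survive as a single golden record; the fifth, missing its street name, stays separate, mirroring the two-row final result above.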
Strategy #3 – Share Common Metadata
The same customer entity is defined differently in each tool:
• From the data model: Customer (CustomerNumber, Name, Address, Comments) – “The identifier of customers that are tracked for ordering purposes. Corporate customer identifiers are assigned by the Sales Data Controller according to the corporate data description and naming policy for reference identifiers.”
• From the ETL tool: CustomerTbl (CustomerID, Name, Address, Address1, Comments) – “Unique identifier of customers that are tracked for ordering purposes. Values start with 02 for non-Corporate customers and 01 for Corporate customers.”
• From the BI tool: CustomerDetails (CustomerNumber, Name, Address, Remarks) – “Customer’s identifier numbers. Values start with 01 for Corporate customers, 02 for non-Corporate customers, 03 for overseas-based Customers.”
• From the database: Customer (ID, Name, Address1, Address2, Descr) – <NULL>
Which metadata is right? Which one is current? Which one should be used?
Recommended Best Practices: Create a Common Repository
Build an integrated metadata repository by gathering metadata in from diverse applications and sources:
• Modeling tool
• BI tool and BI repository
• ETL tool + processes
• COBOL definition files
• Other sources’ definition files
Create a Common Vocabulary
Achieve a common vocabulary between business & technical users through a shared metadata server & repository linking InfoSphere Business Glossary and InfoSphere DataStage. Example term:
• Category: Costs
• Term: Tax Expense
• Full Name: Tax to be paid on Gross Income
• Definition: “The expense due to taxes …” (John Walsh is responsible for updates; 90% reliable source)
• Status: CURRENT
• Physical asset: Database = DB2, Schema = NAACCT, Table = DLYTRANS, Column = TAXVL, data type = Decimal (14,2)
• Derivation: SUM(TRNTXAMT)
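The glossary entry above links one business term to one physical column. A minimal sketch of that structure, using the slide's "Tax Expense" example (this models the idea of a glossary-to-asset link in plain Python, not Business Glossary's actual repository schema):

```python
from dataclasses import dataclass, field

@dataclass
class PhysicalAsset:
    """Where a term is physically realized."""
    database: str
    schema: str
    table: str
    column: str
    data_type: str

@dataclass
class GlossaryTerm:
    """A business-vocabulary entry with its technical bindings."""
    name: str
    category: str
    full_name: str
    definition: str
    status: str
    steward: str
    assets: list = field(default_factory=list)
    derivation: str = ""

tax_expense = GlossaryTerm(
    name="Tax Expense",
    category="Costs",
    full_name="Tax to be paid on Gross Income",
    definition="The expense due to taxes ...",
    status="CURRENT",
    steward="John Walsh",
    assets=[PhysicalAsset("DB2", "NAACCT", "DLYTRANS", "TAXVL", "Decimal(14,2)")],
    derivation="SUM(TRNTXAMT)",
)
print(tax_expense.assets[0].column)
```

Because the term carries both the business definition and the physical binding, a business user and a developer can resolve "which column is Tax Expense?" from the same record.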
Collaborate and Share Feedback
Author standard definitions, then annotate and share feedback. Example:
• Term: GL Organizational Unit
• Steward: Controllers Office
• Format: X(7)
• Definition: A seven-digit number designating the organizational unit to which this account belongs.
• Annotation: “I’ve noticed that the last two digits of the GL Organizational Unit, which indicate the sub-department, are often blank.”
Extend Business Information • Categorize Information Assets according to Business Logic • Map Business Terms to Information Assets • Find and view relevant details of Information Assets • View the Stewardship of Information Assets
Where Does a Field of Data in This Report Come From?
• Import & browse full BI report metadata
• Navigate through report attributes
• Visually navigate through data lineage across tools
• Combines operational & design viewpoints
Metadata Lineage available from Studio & Viewers IBM Confidential
Access Business Glossary from Cognos Studios
Strategy #4 – Connect to Any System, Anywhere
• Databases: DB2, Informix, Netezza, ODBC, Oracle, Red Brick, SAS, Sybase, Teradata, etc.
• Messaging & standards: WebSphere MQ, SeeBeyond, JMS, XML, EJB, Web Services, EDI, SWIFT, etc.
• Mainframe & legacy: Adabas, Allbase/SQL, Datacom/DB, DB2/400, DB2/OS390, Essbase, FOCUS, IDMS/SQL, IMS, NonStopSQL, RDB, VSAM, etc.
• Packaged applications: Oracle Applications, PeopleSoft, SAP R/3, SAP BW, Siebel
Recommended Best Practices: Native Connectivity Software
Advice: Go for pre-built connectors with little or no coding. Do you really want to worry about which application or database you will need to connect to next?
Strategy #5 – Abandon Hand-Coding
Hand-written Visual Basic, Java, C++, or UNIX code can be developed cheaply, and it works … but what happens when there is a new source or requirement? Cheap? Works? Maybe not.
Recommended Best Practices: Graphical ETL Tools
• Benefits:
  • Jobs are easy to develop, understand, debug and maintain
  • Robust, fully-tested, best-practices approach to data migration or extraction
  • Complex transformations can be made very simple with mere point-and-click
Workflow Process - Sequences • Workflow is as important as dataflow. • Dynamic workflow processes can be defined during the workflow itself. • DataStage can run external processes and perform complex evaluations inline. • Advanced concepts such as looping are supported.
[Operational monitoring charts: physical machine utilization, disk throughput, average process distribution, free memory, and percent CPU utilization (box-and-whisker plots)]
Strategy #6 – Implement a Highly Scalable Foundation
Prediction: your data volume is not going to get smaller.
• 44x as much data and content over the coming decade: from 800,000 petabytes (2009) to 35 zettabytes (2020)
• Growth comes in three dimensions: volume, velocity, variety
Strategy #6 – Implement a Highly Scalable Foundation
Two considerations in handling growth: as the number of processors grows from 1 to 32, you want processing throughput (hundreds of gigabytes) to scale linearly (1X up to 32X) and processing time (hours) to fall proportionally – not curves that flatten out.
Strategy #6 – Implement a Highly Scalable Foundation
Three elements of a scalable infrastructure:
• Scalable hardware platform – hardware vendors have offered scalable parallel computers for more than 5 years
• Scalable database platform – database vendors have offered scalable parallel relational databases for more than 5 years
• Scalable data integration platform – data integration vendors are only starting to offer “scalable”, “parallel” platforms
Recommended Best Practices: Parallelism
Make sure you get this: many SMP systems, each with its own CPUs, shared memory and shared disk, cooperating in parallel (an MPP / clustered architecture). Not this: a single SMP system.
Recommended Best Practices: Parallelism
One application assembly – scan, join, process; transform, enrich, and load from source data to the data warehouse – whose execution can be sequential or parallel: serial on a uniprocessor, 4-way parallel on an SMP system, 64-way parallel on MPP, grid, and clustered systems, with auto parallel-enabled and parallel-aware run-time execution.
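The "one assembly, any degree of parallelism" idea can be sketched in Python: the transform is written once, and the runtime partitions the data across workers based on a degree chosen at execution time. This is an illustration only (threads stand in for the separate processes or nodes a real parallel engine would use, and the row schema is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def transform(row):
    """The transform/enrich step, written exactly once."""
    return {**row, "total": row["qty"] * row["price"]}

def run(rows, degree=1):
    """Same assembly, executed serially or N-way parallel at run time."""
    if degree == 1:
        return [transform(r) for r in rows]
    # Partition rows round-robin across the workers
    parts = [rows[i::degree] for i in range(degree)]
    with ThreadPoolExecutor(max_workers=degree) as pool:
        chunks = pool.map(lambda p: [transform(r) for r in p], parts)
    out = []
    for chunk in chunks:
        out.extend(chunk)
    return out

rows = [{"qty": q, "price": 2.0} for q in range(8)]
print(sorted(r["total"] for r in run(rows, degree=4)))
```

The point is that `transform` never changes when `degree` does; scaling from a uniprocessor to a 64-way cluster is a runtime configuration decision, not a rewrite.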
Strategy #7 – Architect for “Right-Time”
In an InformationWeek 2003 survey, 467 business professionals were asked how often their IT systems provide business managers with timely updates on primary products or services:
• 3% no such process
• 1% annually
• 17% monthly
• 13% weekly
• 36% daily
• 5% hourly
• 8% every minute
In that same report: “Whereas 57% of sites surveyed a year ago said that real-time business information was a key company focus, 70% see it that way today.”
Recommended Best Practices: Right-Time
Latency is defined as the elapsed time between when a business event occurs and when an appropriate response or action is made; awareness and response must each fit within an acceptable latency. Examples:
• website click → offer made
• fraud committed → prevention
• customer churns → win-back
• campaign initiated → tuning
Recommended Best Practices: Right-Time
• Recognition latency: improving the ability to recognize business events
• Response latency: improving the ability to respond to those events
Information Server Log-Based Change Data Capture
Changes are read from database logs (DB2, Oracle, SQL Server, etc.) by a source engine and delivered over TCP/IP to a target engine, which feeds InfoSphere Information Server, databases, web services, message queues, or flat files, with central monitoring and configuration. Key benefits:
• Low impact
• Flexible implementation
• Heterogeneous platform support
• Easy to use
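The essence of log-based CDC on the apply side is replaying a stream of captured changes against the target, touching only the rows that changed. A minimal sketch with a hypothetical change-record format (a real CDC engine reads these from database logs; the keys, operations, and row shapes here are invented for illustration):

```python
# Hypothetical change records, as a CDC source engine might emit them
change_log = [
    {"op": "INSERT", "key": 1, "row": {"name": "Ann", "city": "Austin"}},
    {"op": "INSERT", "key": 2, "row": {"name": "Bob", "city": "Boston"}},
    {"op": "UPDATE", "key": 1, "row": {"city": "Dallas"}},
    {"op": "DELETE", "key": 2, "row": None},
]

def apply_changes(target, log):
    """Replay captured changes against the target, in log order."""
    for rec in log:
        if rec["op"] == "INSERT":
            target[rec["key"]] = dict(rec["row"])
        elif rec["op"] == "UPDATE":
            target[rec["key"]].update(rec["row"])   # only changed columns arrive
        elif rec["op"] == "DELETE":
            target.pop(rec["key"], None)
    return target

target = apply_changes({}, change_log)
print(target)  # {1: {'name': 'Ann', 'city': 'Dallas'}}
```

This is why log-based capture is low impact: the source tables are never re-scanned, and the target receives only deltas instead of full refreshes.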
IBM Information Server: InfoSphere CDC & InfoSphere DataStage (ETL)
Example retail flow: point-of-sale transactions land in an Oracle database; Information Server Change Data Capture reads the Oracle DB log continuously and delivers changes to DataStage out of the box – via direct connect (TCP via a DataStage operator), a staging table, a message queue, or a flat file in DataStage DSX format. IBM Information Server then performs the ETL load (including Balanced Optimization / ELT) into the EDW on native Teradata, DB2, Oracle, SQL Server, Sybase, and more.
Strategy #8 – Extend Quality and Transformation Capabilities Throughout the Enterprise
Consumers such as web applications, EAI / BPM / EII, dashboards, and portals draw on packaged apps, master data stores, data warehouses, legacy apps, and business partner data, yet:
• Hand-coded rules in each project/tool are not re-usable by other projects/tools
• High costs are associated with building & maintaining data access, data quality and transformation rules in each project
Recommended Best Practices: Data Integration Services
• A Service-Oriented Architecture (SOA) approach packages data integration logic as services that SOA-friendly applications can consume
• Services can be invoked as Web Services (SOAP/HTTP), EJB, or JMS (message queues, EAI) by any third-party application – e.g., a “get customer” service running on Java application servers, drawing on packaged apps, master data stores, data warehouses, legacy apps, and business partner data
Strategy #9 – Choose a Proven Deployment Methodology Designed for Quick Success
• Many methodologies are available out there
• How many are workable, and which ones – who knows?
• Be aware that there is as much risk in the deployment methodology as there is in tool usage
Recommended Best Practices: Iterative Deployment Plan
A 12–24-week iterative cycle: establish business drivers → plan → investigate → design → prototype → develop → unit test → system test → regression test → UAT → deploy → production (operate, monitor, manage, audit, maintenance) → evaluate results → derive business value, then iterate.
Blueprint Director – The GPS for Your Information Project
• Palette of free-form “sketching” elements
• Diagram view for a blueprint
• Method browser (displaying method content)
• Asset browser (browsing the metadata repository)
• Glossary explorer (showing the glossary tree view)
• Outline (zoom in/out view)
• Blueprint explorer (tree view of the elements in the blueprint)
• Context-specific property view
Business and IT: Working Together
• Business Analyst: collects business terms and business requirements, and converts them into business rules in a specification
• The mapping specification created is critical to collaboration between IT and business
• Developer: takes those business rules and the mapping spec and turns them into code, such as a DataStage job
• Result: a successful data integration project – DataStage jobs and data flows (extract, transform, load) that reflect business needs
Track Business Requirements to Application Deployment
• Single, centrally managed infrastructure to track requirements through to deployment
• Import Excel mapping spreadsheets
• Define and link business terms to physical structures
• Define mapping specifications with business rules and terms
• Auto-generate DataStage jobs with annotated to-do tasks for the developer
• Generate historical documentation for tracking; flexible reporting and tracking
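Generating jobs with annotated to-do tasks from a mapping spec can be sketched roughly as follows. The spec rows, column names, and output format are all hypothetical; FastTrack's actual job generation produces DataStage artifacts, not Python, so this only illustrates the spec-to-skeleton idea:

```python
# Hypothetical mapping specification, as might be imported from a spreadsheet:
# (source column, target column, business rule or blank)
mapping = [
    ("CUST_NO", "CustomerID",   "prefix '01' for corporate customers"),
    ("CUST_NM", "CustomerName", ""),
    ("TAX_AMT", "TaxExpense",   "SUM(TRNTXAMT)"),
]

def generate_job(mapping):
    """Emit a job skeleton; rules become to-do annotations for the developer."""
    lines = ["# --- generated job skeleton ---"]
    for src, tgt, rule in mapping:
        if rule:
            lines.append(f"# TODO({tgt}): implement rule: {rule}")
        lines.append(f"{tgt} = source['{src}']")
    return "\n".join(lines)

print(generate_job(mapping))
```

The value is traceability: every generated line points back to a spec row, so requirements, rules, and deployed jobs stay linked in one place.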
Strategy #10 – Ensure Interoperability of Integration Infrastructures
• The goal: connected, integrated, seamless
• The reality: cobbled together, piecemeal, manual-intensive
Data Integration Projects Require a Collaborative Effort
• Business user: business terms, business requirements
• Business Analyst
• Data Modeler: data model
• Data Analyst: transformation rules
• Developer: data flow application (extract, transform, load)
Simplification & content reduce project time, risk and cost:
1. Establish the platform (Metadata Server)
2. Define business requirements & glossary (Business Glossary)
3. Import & enhance an industry model (data architect populates)
4. Understand data relationships (discovery) and assess, monitor & manage data quality rules (Information Analyzer)
5. Map sources to the target model (FastTrack)
6. Generate logic to load the warehouse (links to DataStage & QualityStage)
7. Deliver reports (Cognos)
Recommended Best Practices: Integrated Tool Suites
• Information Services Director – publish SOA services for information integration and access
• Business Glossary – enterprise data dictionary
• Federation Server – virtualize access to disparate information
• DataStage – extract, transform, and load in batch or real time
• QualityStage – data quality: standardize, correct & match data
• Global Name Recognition – recognize & classify multi-cultural names
• Information Analyzer – data source profiling & problem diagnosis
• CDC & Replication – deliver and replicate changed data
• Metadata Server / Metadata Workbench / FastTrack – manage and track consistent metadata across information integration tasks and automate generation of data flow logic
• Foundation: parallel processing and rich connectivity to applications, data, and content