
Information Integration

Lecture 9, May 13, 2010. ISM 158: Information Integration. Instructor: Pankaj Mehra. Teaching Assistant: Raghav Gautam.


Presentation Transcript


  1. Lec. 9 May 13, 2010 ISM 158 Information Integration Instructor: Pankaj Mehra Teaching Assistant: Raghav Gautam

  2. Enterprise Information

  3. Centralized versus Distributed? • Distributed systems occur naturally • State of the art does not allow complex queries or deep analysis against distributed information • Centralization may also be favored due to lower costs of infrastructure, licenses, and labor, as well as its ability to enforce tighter integrity constraints and other information management policies • Ultimately, the decision needs to take into account issues of ownership and control • Technology considerations are often secondary; even so, rational rules for resolving these considerations exist, as described in the Distributed Computing Economics paper

  4. Contrasting Business & Technical Information [Diagram contrasting a business domain with a technical domain; the original layout is not recoverable. Labels: Real-time information; Unstructured sources; Inconsistent information; SQL schema & query; Search federation; Ad hoc query; Central control; Central archive; Steering Dashboards; Schema evolution; Metadata scaling; Complex metadata; Simpler data fusion; Data mining; Pivoting (twice); XML or WS schema & query; Heavy data processing; Simple metadata fusion; ETL (twice); Deep linguistics; Centralized metadata; Distributed complex controls; Stable schemata; Streaming A/V; Visualization; Data bandwidth scaling; Distributed archives; File schema & query; Structured sources]

  5. The Guiding Principles • Privacy and security • Compliance / auditability • Retention requirements • Business value • Information quality • It is a bad idea to address the following as afterthoughts • Scale • Availability • Integrity • The ability to embed function close to data is fundamental to scalable information processing • In order to deliver the best performance/$, systems tend to scale out from the technology sweet spot of the day • Redundancy should be configured in from the start, along with mechanisms for early detection and isolation of faults • Optimize availability by optimizing recovery
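The principle of embedding function close to data can be illustrated with a minimal sketch. Everything here is hypothetical and not from the lecture: the `PARTITIONS` list stands in for independent storage nodes, and the "records moved" count is only a rough proxy for network cost.

```python
# Hypothetical partitions standing in for independent data nodes.
PARTITIONS = [
    [{"region": "west", "amount": 120}, {"region": "east", "amount": 80}],
    [{"region": "west", "amount": 40}, {"region": "west", "amount": 10}],
    [{"region": "east", "amount": 300}],
]


def ship_data_then_compute(partitions, region):
    """Move every record to the caller, then filter and sum centrally.

    Returns (total, number of items moved across the assumed network).
    """
    moved = [rec for part in partitions for rec in part]
    total = sum(rec["amount"] for rec in moved if rec["region"] == region)
    return total, len(moved)


def compute_near_data(partitions, region):
    """Filter and sum inside each partition; move one partial sum per partition.

    Returns (total, number of items moved across the assumed network).
    """
    partials = [
        sum(rec["amount"] for rec in part if rec["region"] == region)
        for part in partitions
    ]
    return sum(partials), len(partials)
```

Both functions return the same total, but the second moves one number per partition instead of every record, and that gap widens as partitions grow.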

  6. Scalable Content Processing • Enterprise information is complex • Diversity of information sources and formats • Entails complex integration and processing flows • Metadata generation and indexing • Content indexing • Protection and security [Diagram labels: content; data; storage; scalable processing; connectors (shown twice), e.g. JCR API; scalable repository]

  7. Scale out architecture used under cloud information services • Smart Cells • Scalable distributed system of self-contained, all-inclusive data repositories • Principles • Scale-out • Federation • Intelligence close to data • Pluggable platforms supporting proprietary and 3rd-party storage services • Example • Platforms for Information Lifecycle Management services [Diagram: a grid of repeated Smart Cell boxes. Other labels: Attribute indexing; Content indexing; Supported protocols and APIs; Storage: Block, File, Object & Fragment; Smart Query Fabric]

  8. Considerations in Distributed Information Management • Information is distributed across heterogeneous sources and has varied provenance • Integration • Information management requires information about information • Metadata • Useful information is timely and findable • Real-time integration and caching • Indexing • Semantic analysis • Context
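As a rough illustration of the considerations above, the following sketch maps records from two heterogeneous sources onto one schema and keeps provenance as metadata. The source names (`crm`, `billing`), the field mappings, and the sample rows are all invented for the example; real sources would need far richer metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Integrated:
    customer_id: str
    email: str
    source: str  # provenance: which system the record came from


# Hypothetical per-source metadata: source field name -> common field name.
FIELD_MAPS = {
    "crm": {"cust_id": "customer_id", "email_addr": "email"},
    "billing": {"customerId": "customer_id", "contact": "email"},
}

# Hypothetical sample rows with different schemas for the same entity.
CRM_ROWS = [{"cust_id": "C1", "email_addr": "Ann@Example.com"}]
BILLING_ROWS = [
    {"customerId": "C1", "contact": "ann@example.com"},
    {"customerId": "C2", "contact": "bob@example.com"},
]


def normalize(source, row):
    """Rename fields using the source's mapping and tag the record with its source."""
    mapping = FIELD_MAPS[source]
    fields = {mapping[key]: value for key, value in row.items() if key in mapping}
    return Integrated(
        customer_id=fields["customer_id"],
        email=fields["email"].lower(),  # simple cleanup so values compare equal
        source=source,
    )


def integrate(sources):
    """Group normalized records by customer_id, preserving where each came from."""
    index = {}
    for name, rows in sources.items():
        for row in rows:
            record = normalize(name, row)
            index.setdefault(record.customer_id, []).append(record)
    return index
```

Calling `integrate({"crm": CRM_ROWS, "billing": BILLING_ROWS})` yields one entry per customer, each holding the contributing records with their `source` tag, so later consumers can still ask where a value originated.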

  9. Dimensions of Integration

  10. Ecosystem of integration products • Metadata • Determines information richness • Service Orientation • Determines protocol richness • Future • Integration as syndication • Integration aaS [Chart with axes Metadata and Service-orientedness. Entries: WS-based SOA (Microsoft, IBM); Uniform access (MOSS, Attivio); JSR 170 ECI (Day); RSS-based (NewsGator); XML-based EII (BEA LiquidData, Mark Logic); Pure EAI (Tibco, SAG); SQL-based EII (SAP, Oracle, Composite)]

  11. Points for Discussion in class • Consider a healthcare patient information scenario. • Is it mainly transactional or mainly analytic? • Would you lean toward a distributed (EAI) approach or a centralized one (warehouse)? • Consider a scenario in which a company wants to drill down into the root causes of customer complaints. • Again, centralized or distributed? • Identifying the root cause • Tracking the problem • Would real-time integration become a requirement?
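One way to make the centralized versus distributed trade-off in these discussion questions concrete is the toy sketch below. The `Source`, `Warehouse`, and `federated_query` names are hypothetical and do not model any product named in the lecture; the only point shown is that a warehouse answers from its last load, while a federated query reads the sources at request time.

```python
class Source:
    """A hypothetical operational system that owns its own rows."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = list(rows)

    def query(self, predicate):
        return [row for row in self.rows if predicate(row)]


class Warehouse:
    """Centralized approach: copy rows from every source into one store."""

    def __init__(self):
        self.rows = []

    def load(self, sources):
        # A stand-in for ETL: extract each row, tag it with its source, load a copy.
        self.rows = [dict(row, source=s.name) for s in sources for row in s.rows]

    def query(self, predicate):
        return [row for row in self.rows if predicate(row)]


def federated_query(sources, predicate):
    """Distributed approach: ask each source at request time and merge the answers."""
    return [dict(row, source=s.name) for s in sources for row in s.query(predicate)]


def is_battery_complaint(row):
    return row["topic"] == "battery"
```

In this sketch, a complaint appended to a source after `Warehouse.load` is visible to `federated_query` immediately but reaches the warehouse only on the next load, which is the staleness cost that a real-time integration requirement would have to address.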

  12. Points to ponder at home • Pros of integration • Connecting the dots • Single view of … • Quality control over • Inconsistency • Staleness • Gaps • Cons of integration • Loss of context • Often, read only • Cost • Duplication • Scale • Losing battle? • Risk

  13. Where to learn more • Data Integration: The Relational Logic Approach by Michael Genesereth, Morgan & Claypool Publishers, 2010

  14. Upcoming guest lectures in May • Dr. V. Galotra, Oracle • SOA Deep Dive • Rahul Nim, Efficient Frontier • Online marketing

  15. Questions?

  16. News PRESENTATION
