Content Integration for E-Business Joe Hellerstein
New Generation of e-Business on the Internet • Companies moving beyond marketing, storefronts • Attempting to do operations on the Internet • procurement • supply chain • customer relationships • etc. • In a cross-enterprise environment • Requires cross-enterprise content integration • catalog integration is the procurement instance of this problem
Content Integration • Content integration across enterprises • Not the “in-house” data warehousing problem • Not the Enterprise App Integration (EAI) problem • “Operational” data must be integrated • As opposed to historical (trend) data • E.g. pricing, availability, supply chain • Structured and unstructured data • Not just relational or XML queries • Not just text search • A combination of the two: logic meets statistics
The “Butterfly” • Everybody’s favorite picture c. 1/2000: • At issue (6/2001) is how many butterflies there are and who owns them • Not a startup opportunity (Transora vs. Chemdex) • Perhaps one of the wings is smaller than the other (Home Depot) • [Figure: butterfly diagram; labels: Suppliers, Marketplace, Buyers]
Road Map • Setting • Scenarios & Terminology • Characteristics and Challenges of Content Integration • Research Evangelism
Some Scenarios for Content Integration • Catalog Management: Integration and Syndication • “MRO” (Maintenance, Repair and Operations) a la Grainger • Thousands of suppliers, run by a “content manager” • Availability and Pricing • Travel industry • Necessitates live, cross-enterprise querying • Supply Chain Management • E.g. auto industry • Increase in production requires the entire supply chain (“the cows”) • Contractual information along with catalog and availability
Marketing: The EcoSystem and its Terminology • Enterprise Application Integration (EAI): App Glue • Imperative, message-oriented programming (scripting languages) • Transactional networking (persistent queues) • Gateways to popular packaged apps • Vendors: WebMethods, BEA, CrossWorlds, Netfish, MQseries, etc. • Data Integration: Warehousing and associated processes • Intra-enterprise, for “business intelligence” (historical trends) • Vendors: Informatica, Ascential, DBMS vendors • Content Management: Tools for content creation • Web page and graphic design • Versioning and configuration management • Vendors: Vignette, Interwoven, etc.
Road Map • Setting • Scenarios & Terminology • Characteristics and Challenges of Content Integration • Content Access, Mapping and Transformation • Query Processing • Research Evangelism
Content Integration: Characteristics and Challenges • New integration challenges for e-business • cross-enterprise • operational • data-centric (not app-centric) • structured/unstructured • Two main thrusts • Content Access, Mapping and Transformation • Query Processing
Content Access: Relationships with Providers • Varying relationships with content providers • Direct DBMS access (typically in-house) • Direct access to federated apps (SAP, etc.) • Gateway vendors a la Merant, NEON, Attunity, etc. • Arm’s-length relationships • HTML screen scraping • XML messaging • Relationships evolve over time! (MySimon example)
Content Mapping • Syntactic and semantic integration • Formatting/normalization is one piece of the puzzle • XML, HTML, Relational, etc. • Semantics is much harder • E.g. “price”. E.g. “delivery”. • Semantics gate the process • A “content manager” must own the transformation task • Ease of use critical • Home Depot has 60,000 suppliers! • Standards can help a bit (e.g. UDDI) • But graphical tools are the name of the game
Schemas and Taxonomies • Cross-enterprise = multiple schemas • Even if standards prevail (very optimistic) • Early e-catalog systems were locked into one schema • Great for service companies, e.g. Requisite • Tools are sounding the death knell • Taxonomies are critical • Natural for browsing, especially with dirty data • “Black Ink”, “India ink”, “fountain pen ink, black” • Taxonomy per vertical markets, plus standards like UNSPSC • Office Supplies->Ink and lead refills->India ink • Taxonomy as data: query it, browse it, etc. • Integration task includes taxonomy integration!
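To make "taxonomy as data" concrete, here is a minimal Python sketch that stores a taxonomy as a tree and supports both browsing (list the sub-categories) and querying (collect every item under a category). The class and function names are illustrative assumptions, and the "Pen refills" category is hypothetical; only the Office Supplies path and the three ink names come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One taxonomy category; children and items are ordinary data."""
    name: str
    children: dict = field(default_factory=dict)
    items: list = field(default_factory=list)


def add_item(root, path, item):
    """File an item under a category path, creating categories as needed."""
    node = root
    for name in path:
        node = node.children.setdefault(name, Node(name))
    node.items.append(item)


def find(root, path):
    """Walk from the root to the category named by path."""
    node = root
    for name in path:
        node = node.children[name]
    return node


def browse(root, path):
    """Browsing: the sub-categories directly under a category."""
    return sorted(find(root, path).children)


def query(root, path):
    """Querying: every item filed at or below a category."""
    found, stack = [], [find(root, path)]
    while stack:
        node = stack.pop()
        found.extend(node.items)
        stack.extend(node.children.values())
    return sorted(found)


root = Node("root")
ink = ["Office Supplies", "Ink and lead refills", "India ink"]
refills = ["Office Supplies", "Ink and lead refills", "Pen refills"]  # hypothetical category
add_item(root, ink, "India ink")
add_item(root, ink, "Black Ink")
add_item(root, refills, "fountain pen ink, black")

print(browse(root, ["Office Supplies", "Ink and lead refills"]))
print(query(root, ["Office Supplies"]))
```

Because the taxonomy is plain data, integrating two taxonomies reduces to mapping paths in one tree onto paths in another, which is the task the mapping tools have to support.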
Themes in Content Access and Mapping • Scalability in human terms • “Content managers”, not geeks • The name of the game: semi-automatic tools • Statistical (“fuzzy”) techniques to provide hints (not silver bullets) • Integrated into graphical programming-by-example interfaces • Problem domains: • Wrapper generation • Data cleaning • Schema mapping • Taxonomy mapping • Syndication • One of the key “systems” challenges today
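A rough sketch of what a statistical "hint" might look like in a taxonomy-mapping tool: score each candidate category against a supplier's product name and show the content manager a ranked shortlist to confirm or reject. This uses Python's standard difflib as a stand-in similarity measure; it is an assumption chosen for illustration, not the technique any particular product uses, and the function names are hypothetical.

```python
import difflib
import re


def normalize(text):
    """Lowercase, drop punctuation, and sort tokens so word order does not matter."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(tokens))


def suggest(name, categories, k=3):
    """Return the k best (score, category) hints; a person makes the final call."""
    target = normalize(name)
    scored = [
        (difflib.SequenceMatcher(None, target, normalize(c)).ratio(), c)
        for c in categories
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical target categories for a supplier's item names.
categories = ["Toner", "India ink", "Staples"]
for score, category in suggest("ink, India", categories, k=2):
    print(f"{score:.2f}  {category}")
```

The scores are hints, not answers: a graphical tool would present them as defaults that the content manager can override.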
Road Map • Setting • Scenarios & Terminology • Characteristics and Challenges of Content Integration • Research Evangelism
Query Processing Issues • Content to be integrated is increasingly “uncacheable” • Arm’s-length accessibility • Business rules, not data • E.g. custom content throughout the dataflow • Volatile information • E.g. Availability • Yet a great deal of content is cacheable and slowly changing • Upshot: need a combined technology • Prefetch/Cache/Replicate when possible • Query live when impossible
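The "cache when possible, query live when impossible" upshot can be sketched as a small wrapper around a content source. This is a minimal illustration under stated assumptions: the class name, the time-to-live policy, and the convention that a TTL of zero marks content as uncacheable are all choices made for the example, not part of the talk.

```python
import time


class ContentSource:
    """Wraps a live fetch function; caches results when the content allows it."""

    def __init__(self, fetch_live, ttl_seconds, clock=time.monotonic):
        self.fetch_live = fetch_live  # callable(key) -> value, hits the provider
        self.ttl = ttl_seconds        # 0 means uncacheable: always query live
        self.clock = clock            # injectable so the policy can be tested
        self.cache = {}               # key -> (value, time fetched)

    def get(self, key):
        now = self.clock()
        hit = self.cache.get(key)
        if self.ttl > 0 and hit is not None and now - hit[1] < self.ttl:
            return hit[0]             # slowly changing content: serve the copy
        value = self.fetch_live(key)  # volatile content or cache miss: go live
        if self.ttl > 0:
            self.cache[key] = (value, now)
        return value


# Hypothetical sources: descriptions change slowly, availability is volatile.
descriptions = ContentSource(lambda sku: f"description of {sku}", ttl_seconds=3600)
availability = ContentSource(lambda sku: {"sku": sku, "in_stock": True}, ttl_seconds=0)
print(descriptions.get("A1"), availability.get("A1"))
```

The same query can then mix both kinds of source, with only the volatile parts paying the cost of a live cross-enterprise call.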
Federated Query Processing • DBMS community must shed our materialization myopia! • ETL/Warehousing was inelegant and limited • What do we do on a “cache miss”?? • Should be no distinction between materialized views and queries! • Federated Query Processing • Query across multiple sources • Choose among multiple replicas, materialized views • Consider staleness • This is the natural extension of the modern database vision • Cohera uses Mariposa’s economic model to do this • Decouples optimization, cost estimation, storage and processing
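One way to picture choosing among replicas and materialized views while considering staleness is as a bidding round: each site that can answer a subquery quotes a price and reports how old its copy is, and the broker takes the cheapest bid that is fresh enough. The sketch below is loosely inspired by the idea of an economic model; it is a deliberately simplified assumption and not the actual Mariposa or Cohera protocol. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    site: str           # who offers to answer the subquery
    price: float        # quoted cost (could reflect load, latency, fees)
    staleness_s: float  # age of the site's copy in seconds; 0 = live source


def choose_site(bids, max_staleness_s):
    """Pick the cheapest bid whose data is fresh enough, or None if none qualify."""
    fresh_enough = [b for b in bids if b.staleness_s <= max_staleness_s]
    if not fresh_enough:
        return None
    return min(fresh_enough, key=lambda b: b.price)


bids = [
    Bid("supplier-live", price=10.0, staleness_s=0),
    Bid("replica-a", price=2.0, staleness_s=600),
    Bid("mview-b", price=1.0, staleness_s=86400),
]

# A pricing query that tolerates hour-old data takes the replica;
# an availability query that tolerates none must go to the live source.
print(choose_site(bids, max_staleness_s=3600).site)
print(choose_site(bids, max_staleness_s=0).site)
```

Note that the materialized view and the live source are just two bids for the same subquery, which is the point of erasing the distinction between views and queries.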
Standard Queries Required • Hand-coded queries are brittle: you want ad-hoc • Don’t buy a handful of beans • Need support for standard query languages • SQL and XPath today • SQL/XQuery tomorrow • Everybody knows this! • Part of industrial religion • Oracle on one side • Dotcoms on the other side • You might get by claiming to be “XML compliant” • But most people have cottoned on by now
IR capabilities need to be in the engine • The best-integrated data will still be noisy (product names, etc) • Text search on taxonomies, names, descriptions • Still no good integration of DBMS and IR engines • Storage (compression huge in IR) • Index concurrency (many updates per doc in IR) • Query optimization challenges • Note: this is not semi-structured querying! • Integration of logic + statistics is the real model/query challenge • Plus HCI issues • Unify: “query”, “browse”, “mine”, “rank” • Cohera integrates AltaVista into the engine & optimizer
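As a toy illustration of "logic + statistics" in one query, the sketch below applies an exact structured predicate (a price ceiling) and then ranks the surviving rows by a crude text score (the fraction of query words found in the description). The scoring function, row layout, and names are assumptions for the example; a real engine would use a proper IR ranking and let the optimizer decide which part to run first.

```python
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def text_score(query, doc):
    """Statistics: fraction of query tokens that also appear in the document."""
    q, d = Counter(tokens(query)), Counter(tokens(doc))
    overlap = sum(min(q[t], d[t]) for t in q)
    return overlap / max(sum(q.values()), 1)


def search(rows, query, max_price):
    """Logic (price predicate) first, then rank the survivors by text score."""
    hits = [
        (text_score(query, r["description"]), r)
        for r in rows
        if r["price"] <= max_price
    ]
    hits = [(score, r) for score, r in hits if score > 0]
    hits.sort(key=lambda h: h[0], reverse=True)  # stable: ties keep input order
    return [r["sku"] for _, r in hits]


# Hypothetical catalog rows; the descriptions reuse the noisy names from earlier.
rows = [
    {"sku": "A1", "description": "Black Ink", "price": 4.0},
    {"sku": "A2", "description": "fountain pen ink, black", "price": 6.0},
    {"sku": "A3", "description": "India ink", "price": 30.0},
    {"sku": "A4", "description": "Stapler, black", "price": 5.0},
]
print(search(rows, "black ink", max_price=10.0))
```

The result is a ranked list rather than a set, which is where the usual relational model and the IR model have to be reconciled.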
Core Systems Issues Remain Important • Availability, Scalability, Load Balancing • All critically important in the B2B space • Availability: you don’t even control the components! Outage = news. • Scalability: MRO wants to grow up to very big installations • Load Balancing: need to respect SLAs, etc. • Need adaptive, load-balancing, federated QP • 100s to 1000s of “sites” • Replication is key to availability, but optimizer must understand it • Cohera’s economic model adapts for each query • Other models being studied (see DE Bulletin 6/2000) • Compile-time, centralized optimizers (R*, et al) will break
Query Processing: Themes • Standards • Logic + Statistics • Adaptivity to changing performance, load, failures • Optimizer Scalability
So What Really Matters Today? • Cohera sells because… • Customers need the content integration workbench today • They are in integration pain! • Comes in multiple guises (e-catalog, supplier enablement, etc.) • Smart tools start cutting the pain immediately • Customers want an open, standard solution • Plain old SQL and relational schemas (vs. Requisite, e.g.) • XML “in the bottom”, “out the top” for messaging/integration • Customers want federated querying…tomorrow • For today, they’ll settle for a centralized solution • Want the flexibility to grow in that direction • Federated query engine works fine centralized • The converse clearly not true
Road Map • Setting • Scenarios & Terminology • Characteristics and Challenges of Content Integration • Research Evangelism
Research Evangelism • Semi-Automatic Tools • Statistical + logical techniques, with a user in the loop • E.g. Potter’s Wheel [Raman/Hellerstein, VLDB ’01] http://control.cs.berkeley.edu • schema integration algebra • interactive visualization • programming-by-example • statistical inferencing for discrepancies and domain detection • A new class of “systems” work! • “Tools”/“Apps” must be part of our agenda • Many systems challenges here, especially on the stat/HCI side • Architectural elegance, API design, extensibility, scalability, etc.
Research Evangelism, Cont. • Adaptive Query Processing • Critical to the federated B2B space • Unpredictable world, you don’t control the components • Also critical to the ubiquitous computing space • Sensors are the next challenge • Who’s the DBA of your housepaint? The freeway lines? • Economic optimization (Mariposa) is one model • Finer-Grained adaptivity possible (Eddies, SIGMOD 2K) • See http://telegraph.cs.berkeley.edu for examples, ideas, SW
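To give a feel for fine-grained adaptivity, here is a much simplified, deterministic stand-in for the idea of re-deciding operator order per tuple: each incoming item is routed through the predicates in order of their observed pass rate, most selective first, and the statistics update as items flow. This is an illustrative assumption only; it is not the Eddies routing algorithm, and the class and predicate names are hypothetical.

```python
class AdaptiveFilter:
    """Applies all predicates, reordering them per item by observed selectivity."""

    def __init__(self, predicates):
        self.predicates = dict(predicates)           # name -> callable(item) -> bool
        self.seen = {n: 0 for n in self.predicates}
        self.passed = {n: 0 for n in self.predicates}

    def pass_rate(self, name):
        # Smoothed estimate so unseen predicates start at 0.5 rather than 0 or 1.
        return (self.passed[name] + 1) / (self.seen[name] + 2)

    def order(self):
        # Lowest pass rate first: reject items as early as possible.
        return sorted(self.predicates, key=self.pass_rate)

    def accept(self, item):
        for name in self.order():
            self.seen[name] += 1
            if not self.predicates[name](item):
                return False
            self.passed[name] += 1
        return True


f = AdaptiveFilter({
    "cheap": lambda x: x < 90,      # passes most items
    "even": lambda x: x % 2 == 0,   # passes about half
})
kept = [x for x in range(100) if f.accept(x)]
print(len(kept), f.order())
```

The answer is the same whatever the order; only the work done changes, which is why the ordering can safely keep adapting as selectivities, load, or source performance drift.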
Research Evangelism, Cont. • Tired of research on relational? Choose wisely! • One big direction here is to integrate IR • Another is to abandon languages in favor of interfaces • query+browse+mine: semi-automatic GUIs again! • XML is critical to business, but under control • We’re doing fine in this space, thank you • XQuery will push (merge with?) SQL • The end-result will resemble things you’ve seen before • But text search is eating our lunch! • Intellectual impact in the last decade? • Industrial impact in the last decade? • Text search is mostly “just” an access method + a sort metric • Integrate into our composable algebras and architectures! • Teach it in our undergrad classes
Summary • Content Integration is a new, challenging industrial space • Cohera provides the first complete solution • Access with varying relationships, formats • Support for multiple schemas and taxonomies • Support for custom syndication • Support for distributed data, both cacheable and uncacheable • Ad hoc querying • Fuzzy & structured search • Availability, Scalability, Load Balancing • Smart graphical tools for content managers • A fertile area for research as well • Join the fun!