Getting Data to Applications: Why Do We Fail, and How We Can Do Better? Arnon Rosenthal, Frank Manola, Scott Renner
Toward an Industrial Revolution for Data Interoperability: Incremental, (full) Interfaces, Incentives Arnon Rosenthal, Frank Manola, Scott Renner
Goal: A Common Operational Picture (COP) • The user sees data values, assembled and expressed in the user’s own terms • The “Common Operational Picture” warehouse or federation: an integrated subset of information sources, with presentations for different users • [Diagram: a view tier (logistics, mapmaker, intelligence, operations) above the COP, which draws on a source tier (NIMA info products; naval, ground, and air sensors)]
Current Status • Read-only access is insufficiently ambitious for a guiding vision, but it is driving many industrial solutions • Proposed architectures (e.g., messaging) often don’t fit: metadata; operations (update / annotate / subscribe); fusion • Numerous initiatives are likely to fail, e.g., common operational pictures • Today’s technology: costly, little reuse, skill-intensive
Toward Attainable Goals (and more realistic slogans) • “Give everyone transparent (read) access to all data.” (Any success stories?) The vision of perfection crowds out the ability to live with imperfection! • Restate the challenge: prepare data/software systems to work with partners -- including unknown future ones • Connection-creation as a core competence for IT • Describe each service that is offered or wanted (e.g., some operation on some data) • Reduce the cost of establishing the software connection • Reuse the knowledge captured when a connection is built
What Do We Mean by “Industrial Revolution”? • Small tasks • Each with one skill • Many atomic steps become automatable • Each produces reusable knowledge (as opposed to motivating a few lines within a program) • “Market-driven” (as connections are made) rather than giant initiatives
Future of Large Info Management Architectures • Consensus among researchers for scalable sharing: • Each data resource describes what it offers • Each consumer describes what it wants • Discovery and brokering processes create a connection • (prototypes automate some cases) • Is it really so different from today? Each functional task is already performed by today’s developers • Key difference: “describe and generate”
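A minimal sketch of “describe and generate”, with invented names (Description, broker, the depotDB offer) that are not from any product: offers and wants are stated with the same constructs, and a broker emits a connection spec instead of hand-written glue code.

```python
# Illustrative only: a provider publishes an offer, a consumer publishes a want,
# and a broker matches them and notes what still has to be transformed.
from dataclasses import dataclass

@dataclass
class Description:
    concept: str         # what the data means, in a shared vocabulary (e.g., "FuelDepot")
    representation: str  # element representation (e.g., "miles", "jpeg")
    protocol: str        # access protocol (e.g., "ODBC", "XML")

def broker(offers, want):
    """Discovery: find an offer for the same concept; brokering: record the mismatches."""
    for name, offer in offers.items():
        if offer.concept == want.concept:
            return {
                "source": name,
                "convert_representation": (offer.representation, want.representation),
                "convert_protocol": (offer.protocol, want.protocol),
            }
    return None

offers = {"depotDB": Description("FuelDepot", "miles", "ODBC")}
want = Description("FuelDepot", "km", "XML")
print(broker(offers, want))
# {'source': 'depotDB', 'convert_representation': ('miles', 'km'),
#  'convert_protocol': ('ODBC', 'XML')}
```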
A word from our sponsor: We’re Hiring • Researcher / Consultants, Prototypers, Systems Engineers (or make us an offer) • Main offices: suburbs of Boston and Washington DC • Also jobs in Norfolk, Montgomery, St. Louis, San Diego, … + Europe, Asia • We’re a nonprofit working mostly for the US government (A good place to learn. So you’ll get more stock options later) • US Citizens and Permanent residents only (so MITRE can get you a security clearance)
Talk Outline • Why do current approaches so often fail? • We act as if we believe ridiculous things -- in architectures and in design discussions • Where should we try to go? Incremental Interoperability • Aim to revolutionize -- incrementally • How to Start Moving in this Direction? • Scope of talk: Create logical connectivity -- development and logical admin • Omits: Systems planning, execution performance (cache selection, indexing, dissemination)
Tacit Assumptions -- and Antidotes -- 2 • “End State” fallacies: • Architectures are for a perfect end state (?) Systems conform and consumers benefit only when transition is complete (?) • You’ll add flexibility later (?) • Config. mgt. is a sufficient strategy for change (?) Advice Nuggets • Architect for manageable, adaptable, imperfect systems (for 2001, 2002, … 2999) • Transitional states are within the architecture • Architect for adaptability. How to contract for it? • Config. management is only a brake
Tacit Assumptions -- and Antidotes -- 3 • Mandates will elicit good quality metadata (?) • Local administrators will rush to keep you up to date (?) Advice Nuggets • Active (operational) metadata is kept accurate • Passive metadata is untested, and soon too obsolete to drive automated processing (except browsing) • More carrots, fewer sticks • If your tools use the metadata to ease the providers’ tasks, you’ll get better metadata • Calls for metadata should include an exploitation plan
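A small sketch of what “active” metadata can mean, with invented names (column_metadata, read_in_km) that are not from the talk: the same units declaration both documents a column and drives the conversion the provider’s own tools perform, so a wrong or stale entry is noticed rather than quietly rotting.

```python
# Illustrative only: metadata that is exercised on every read stays accurate.
column_metadata = {"DIST": {"units": "miles"}}

CONVERSIONS = {
    ("miles", "km"): lambda v: v * 1.609344,
    ("km", "km"):    lambda v: v,
}

def read_in_km(column, value):
    units = column_metadata[column]["units"]
    try:
        return CONVERSIONS[(units, "km")](value)
    except KeyError:
        raise ValueError(f"no conversion from declared units {units!r}; fix the metadata")

print(read_in_km("DIST", 10.0))   # 16.09344 -- the declaration is tested on every use
```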
Tacit Assumptions -- and Antidotes -- 4 • “Midpoint” fallacy: Design a compromise interface (msg?) Build around and above it (?) • “Message interface” fallacy: “Send message Mxyz” is a fine interface between systems (?) • Support interfaces procedurally (e.g., Java + parser) (?) • Describe the “natural” interface • One interface supports all subsets • Connectors are separate & declarative (e.g., SQL + fns?) • On the consumer’s interface, generate • operations (e.g., query, update, subscribe) • metadata, e.g., units, error, access controls
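A hedged sketch of a declarative connector in the sense above, with invented names (connector, generate_query): the connector is data (field mappings plus conversion functions), and a consumer-side operation is generated from it rather than hand-coded against one message format.

```python
# Illustrative only: a declarative mapping from a provider's natural interface
# to the consumer's vocabulary, from which a query operation is generated.
MILES_TO_KM = 1.609344

connector = {
    "source_table": "DEPOT",                      # provider's natural interface
    "fields": {                                   # consumer name -> (source name, transform)
        "depot_id":    ("ID",      lambda v: v),
        "distance_km": ("DIST_MI", lambda v: v * MILES_TO_KM),
    },
}

def generate_query(connector):
    """Generate a consumer-facing query operation from the declarative connector."""
    def query(rows):
        for row in rows:
            yield {out: fn(row[src]) for out, (src, fn) in connector["fields"].items()}
    return query

query = generate_query(connector)
print(list(query([{"ID": 7, "DIST_MI": 10.0}])))
# [{'depot_id': 7, 'distance_km': 16.09344}]
```

The same declarative mapping could, in principle, drive generation of update and subscribe operations and of derived metadata (units, error bounds), which is the point of keeping it separate from procedural code.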
Tacit Assumption 6 -- Interoperability Metaphor: Universal Plug • Two prongs are too simple • Important element of truth: design to plug into the “infosphere”, not into one neighbor
A Better Interoperability Metaphor: A Multi-Pin Connector • All the pins have to fit -- and many are compound • [Diagram: a multi-pin connector with numbered pins, labeled with Data, SQL, XML, CORBA/DCOM, transactions] • Data: each attribute has semantics, format, quality • Track resolution of each pin’s issues
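One way to read “track resolution of each pin’s issues” is as a simple checklist per connection; a sketch with assumed pin names (none of them come from the slide):

```python
# Illustrative only: a connection is ready when every "pin" has a recorded resolution.
pins = {
    "attribute semantics": None,
    "format / quality":    None,
    "access language":     None,   # e.g., SQL
    "object protocol":     None,   # e.g., CORBA / DCOM
    "document encoding":   None,   # e.g., XML
    "transactions":        None,
}

def resolve(pin, decision):
    pins[pin] = decision

def unresolved():
    return [p for p, d in pins.items() if d is None]

resolve("access language", "SQL-92")
resolve("document encoding", "XML 1.0")
print("still open:", unresolved())
```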
Organization of the Section • Why do current approaches so often fail? • Where Should We Want to Go? • Approach • Taxonomy of needed capabilities • How to Start Moving in this Direction? • Research Agenda: Risk Mitigation
Transition is the steady state, with good ways to cope • Descriptions of sources and consumers exist -- sometimes • When you build the next connection, capture more: you’re still funded to build connections, and there is no giant process cutover • Discovery and brokering tools work with whatever descriptions they find • Integration contractors already do discovery and brokering! • Manually, with too little reuse! • For everything, there are multiple ways to do it • Choose one, but work with those who chose differently • Connections and transforms are partially known
Steps to Connect a Consumer to Provider(s) (with metadata reuse) • Obtain descriptions of each player • Use same form for consumers’ needs as for providers • May employ intermediary vocabularies • Discover potential (source, consumer) pairs • Obtain transforms for • Element representations (e.g., miles → km; jpeg → gif) • Object and set representations (e.g., ODBC → XML) • Protocols (e.g., DCOM → CORBA) • Pull versus push, whole versus changes • Generate the entire connection (tuned for efficiency) • What vendor can supply the framework?
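A minimal sketch of the “obtain transforms / generate the connection” steps, assuming a toy transform registry and invented helper names (find_chain, generate_connection); real brokering would also weigh cost, quality, and protocol choices.

```python
# Illustrative only: registered element transforms are chained to generate a connection.
transforms = {
    ("miles", "km"):  lambda v: v * 1.609344,
    ("km", "meters"): lambda v: v * 1000.0,
}

def find_chain(src, dst, seen=()):
    """Discover a chain of registered transforms from src to dst (simple depth-first search)."""
    if src == dst:
        return []
    for (a, b), fn in transforms.items():
        if a == src and b not in seen:
            rest = find_chain(b, dst, seen + (b,))
            if rest is not None:
                return [fn] + rest
    return None

def generate_connection(src, dst):
    chain = find_chain(src, dst)
    if chain is None:
        raise ValueError(f"no transform path {src} -> {dst}")
    def connection(value):
        for fn in chain:
            value = fn(value)
        return value
    return connection

miles_to_meters = generate_connection("miles", "meters")
print(miles_to_meters(2.0))   # 3218.688
```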
Metadata Drives Connection Creation (when there is enough metadata) • [Diagram: a new “want” from a consumer feeds a discovery process over the repository / knowledge base; a brokering process draws on the transform library, and the generated connection is executed]
Connection Creation Drives Metadata • [Diagram: the same pipeline, with metadata capture tools feeding the repository / knowledge base and the transform library as each connection is built]
Connection Creation Drives Vocabularies (?) • [Diagram: the pipeline extended with vocabulary and interface creation tools and an optimizer, still feeding the repository / knowledge base and the transform library]
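A sketch of the feedback loop the three diagrams describe, with invented names (repository, connect): when no matching offer has been described yet, the connection is built by hand, but the work leaves a description behind, so the next “want” for the same concept is served automatically.

```python
# Illustrative only: connection creation drives metadata, and metadata drives
# the next connection.
repository = {"offers": {}}                      # knowledge base of source descriptions

def connect(want_concept, captured_offer=None):
    matches = [o for o in repository["offers"].values() if o["concept"] == want_concept]
    if matches:                                                  # metadata-driven path
        return f"generated connection to {matches[0]['source']}"
    if captured_offer is not None:                               # manual path, but it
        repository["offers"][captured_offer["source"]] = captured_offer  # leaves metadata behind
        return f"hand-built connection to {captured_offer['source']}"
    return "no connection -- describe a source first"

print(connect("FuelDepot"))                                              # nothing known yet
print(connect("FuelDepot", {"source": "depotDB", "concept": "FuelDepot"}))  # build + capture
print(connect("FuelDepot"))                                              # now it generates
```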
Toward an “industrial revolution” for IT: Re-imagine Existing Processes as Simpler Steps • Each step should • Require just one or two skills • Benefit from existing resources -- metadata and transforms • Be fully automated (sometimes) • Produce reusable resources for later steps • Key challenges: • Incentives: it must be made easier to generate from resource atoms than to code it all yourself! • To support these incentives, we may need tools that assemble the atomic components into a solution
Data Descriptions: A Taxonomy (foil 1 of 2) • Data admin for requirements parallels admin for offers! • Use same constructs • Enables (partly) automated comparisons • Interpretation: element semantics, element representation, schema • Scope and completeness of what you provide (population), e.g., images of: all US air-fuel depots since 1970; some NATO fuel depots since 1990 • Delivery style (push/pull, whole/changes) • (Is offer/need model adequate for update transactions?)
Data Descriptions: A Taxonomy (foil 2 of 2) • Quality of service • Data quality, timeliness, attribution, completeness, obligation (to continue providing), cost, … • Guidance for data merging (match-up, conflict resolution) • Server information (catch-all), e.g.: access language, protocols, address, security domains, …
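One possible shape for a source description and a matching consumer “want”, covering the taxonomy on these two foils; the field names and every value are illustrative, not a standard schema.

```python
# Illustrative only: the offer and the want use the same constructs,
# which is what makes (partly) automated comparison possible.
depot_offer = {
    "interpretation": {                        # element semantics, representation, schema
        "concept": "FuelDepot",
        "elements": {"location": {"datum": "WGS-84"}, "capacity": {"units": "barrels"}},
    },
    "population": "all US air-fuel depots since 1970; some NATO fuel depots since 1990",
    "delivery": {"style": "pull", "granularity": "whole"},     # vs. push / changes-only
    "quality_of_service": {"timeliness": "daily", "obligation": "ongoing", "cost": "none"},
    "merging_guidance": {"match_key": "facility_number"},      # match-up, conflict resolution
    "server": {"access_language": "SQL", "protocol": "ODBC", "security_domain": "US-only"},
}

depot_want = {
    "interpretation": {"concept": "FuelDepot"},
    "delivery": {"style": "push", "granularity": "changes"},
}

# A comparison only has to walk the shared structure.
same_concept = depot_offer["interpretation"]["concept"] == depot_want["interpretation"]["concept"]
delivery_gap = depot_offer["delivery"] != depot_want["delivery"]
print(same_concept, delivery_gap)   # True True -- a broker must bridge the delivery style
```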
Talk Outline • Why do current approaches so often fail? • Discussion of a “low risk” approach • What the goal system looks like • How it evolves • Tool and technology details • How to Start Moving in this Direction? How to: • Simplify the task of interfacing to a particular system • Establish more connections • Make created interfaces “first class” • Research Agenda: Risk Mitigation
Getting Started along the New Road • Provide help in creating needed interfaces • Focus on individual programs, small initiatives • Give incremental benefits, to keep all aboard • What’s the minimum to give some benefits? • Separate existing work into atomic tasks that require fewer skills, and are sometimes automatable • No giant cutovers, with massive retraining, coordination • Issues • What does each program need to do? • What requires coalitions, or central funding? (e.g., repository, brokers)
Tasks (examples) • Define vocabularies for • Metadata (how to say “means the same”, or “distanceUnits = km”, or “Corba3.0 interface”) • Aspects to be brokered (of scope, representation, …) • Frequently-exchanged domain data (Part#, Facility#) • Describe portions of systems in terms of these vocabularies • Be opportunistic, e.g., when building new connections • Provide transforms among major representations, protocols • Provide brokers for various aspects (simple brokers first) • “Partial brokering” must help metadata providers
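A small sketch of the three kinds of vocabulary entries listed above; the relation names and the specific terms are made up for illustration.

```python
# Illustrative only: one possible encoding of a shared vocabulary.
vocabulary = {
    # how to say "means the same"
    "synonyms": [("Supply.PartNumber", "Repair.Part#"), ("Facility#", "FacilityId")],
    # aspects to be brokered: representation choices a broker can translate between
    "aspects": {"distanceUnits": ["km", "miles"], "interface": ["Corba3.0", "ODBC"]},
    # frequently exchanged domain data, with an agreed canonical form (pattern is illustrative)
    "domain_data": {"Part#": {"canonical": "stock number", "pattern": r"\d{4}-\d{2}-\d{3}-\d{4}"}},
}

def means_the_same(a, b):
    return (a, b) in vocabulary["synonyms"] or (b, a) in vocabulary["synonyms"]

print(means_the_same("Facility#", "FacilityId"))   # True
```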
Who Will Be Most Interested? (Suggested Initial Targets) • Find a system that needs multiple interfaces (to customers and/or feeders) • Good candidates • Non-dominant players who must connect to multiple others • A dominant player with poor ease-of-connecting (MIDB?) • Issue: how soon until it’s helpful? • Generate, based on the system’s own entries in the metadata repository • Transformers are quickly helpful (esp. harder ones, e.g., coordinates, image formats) • Perhaps attach to a DBMS, or to an XML engine?
Example Initiatives (and their benefits) • Publish interface in one formalism (with description), e.g., SQL • Tools generate the additional interfaces, without disturbing the original publisher, e.g., XML, CORBA, DCOM, html, … • Publish interface in one vocabulary, for all exported info, e.g., Supply • Tools generate the “closest feasible” interface in other vocabularies that have been related to it, e.g., Repair, Procurement, Defense finance, … • Transform representations (image format, coord system) • Provide interfaces as (root concept, well-known modifier) • Derive metadata, additional operations (e.g., update)
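A sketch of the first initiative, “publish once, generate the rest”, using an invented toy schema and generator name (generate_xml_interface): the provider publishes a single described interface, and a tool emits an additional rendering (here, XML for a relational row) without touching the publisher.

```python
# Illustrative only: an extra interface is generated from the published description.
published = {"table": "DEPOT", "columns": ["ID", "NAME", "CAPACITY"]}   # SQL-style interface

def generate_xml_interface(published):
    def to_xml(row):
        cells = "".join(f"<{c.lower()}>{row[c]}</{c.lower()}>" for c in published["columns"])
        return f"<{published['table'].lower()}>{cells}</{published['table'].lower()}>"
    return to_xml

to_xml = generate_xml_interface(published)
print(to_xml({"ID": 7, "NAME": "Depot-7", "CAPACITY": 120000}))
# <depot><id>7</id><name>Depot-7</name><capacity>120000</capacity></depot>
```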
Summary: Try an approach that hasn’t failed consistently! • Identified pitfalls that are too rarely avoided • Described incremental steps toward large-scale data admin for diverse, changing, incomplete systems • Generate connections from reusable resources (system metadata, vocabulary metadata, transforms) -- active metadata • Separation of skills; use point and click • Incentives: make “provide resource + generate” easier than writing connecting code • Connection-creation creates more reusable resources • Projects cooperate to create vocabularies and acquire tools • It’s a low-risk approach -- begin prototyping
Challenges for Database Researchers • Better brokering for matching requirements to sets of views • Assume multiple ontologies, spotty connection, incremental improvement • Explain the shortfalls, understandably • Scalable fusion (to match objects, resolve data conflicts) without n x n pairwise administration • Pragmatic • Acquisition guidance, e.g., metrics on flexibility (what should be in each acquisition contract?) • Combine techniques for learning metadata? No more discovery heuristics! • Automate physical DBA work (caching, optimization)