E N D
MASSACHUSETTS INSTITUTE OF TECHNOLOGYSLOAN SCHOOL OF MANAGEMENTINFORMATION TECHNOLOGIES GROUPData Transformation & Model Integrationin support of ICEWS :COntext INterchange (COIN) ProjectPresentation & DemonstrationStuart Madnick (smadnick@mit.edu) Michael Siegel (msiegel@mit.edu) August 22, 2006 2003-07-01
Data Transformation & Model Integration Issue: Automation for Scalability & Maintainability • ICESW Challenges (from the ICEWS preliminary proposal, dated 4 May 2006): • “no inter-agency system exists to integrate broad data sets.” • need “proper, seamless integration of existing tools, methodologies, and data ...” • “Social network models, dynamic events data systems and state stability models all validated – just need to be integrated seamlessly.” • Problem: • Easy to solve (manually) when only 3-4 data sources and 2-3 models • But, if grow to many data sources and models – becomes major effort • Hard to scale to large number of sources and models • Hard to change (to incrementally add new sources and models) • Hard to evolve (changing specifications of existing sources and models) • Hard to automate the integration and maintainance • Solution: • Use COntext INterchange (COIN) Technology (for Data Extraction and Integration) • Basic theory funded as part of DARPA Intelligent Integration of Information (I3) • (Uses declarative knowledge about data sources/models and reasoning) • Since then, further refinement and increasingly complex prototypes • Ready to extend to model integration application and large-scale prototyping • Stuart Madnick (smadnick@mit.edu), Michael Siegel (msiegel@mit.edu), Nazli Choucri (nchoucri@mit.edu)
Data Transformation & Integration in Support of Model Operation – Hypothetical Example Specific Data Needed for Each Model Semi-Processed Data Sources Raw Data Sources Civilian casualties (per week) Civilian casualties (per day) Model M1 Military casualties (per week) Web sites Filtered Individual Economic Sector Performance (in Euros) Military casualties (per day) Semi-Structured Data Sources Newspapers Some Data Extraction can be Automated via COIN Technology Total casualties (per day) Model M2 Total Economic Sector Performance (in local currency) Individual Economic Sector Performance (in local currency1) Data Transformation & Integration Automated via COIN Technology Databases 1 Interesting challenge, especially with time series data, is that semantics change over time – e.g., currency in France changed from Francs to Euros; Yugoslavia changed several times. COIN handles this easily as part of its temporal semantics capabilities.
Model Input/Output Transformation & Integration – Hypothetical Example Projected Individual Economic Sector-by-Sector Performance (in local currency) Recent Individual Economic Sector-by-Sector Performance (in local currency) Model M3 Projected Total Economic Sector-by-Sector Performance (in Euros) Data Transformation & Integration Automated via COIN Technology Model M5 Refined Projected Total Casualties (per week) Projected Total Casualties (per week) The automation of model integration – especially to deal with many autonomous evolving models, with differing input/outputs, is an important technological advance proposed. Recent Civilian Casualties1 (per day) Projected Civilian Casualties (per day) Model M4 Recent Military Casualties (per day) Projected Military Casualties (per day) 1 Many more subtleties to data semantics, such as differing definitions of what is a “casualty” in different data sources and as used in different models - that must be reconciled.
Shared Conversion Ontologies Creation Context Mediator Source Receiver Context Context civ_ casualties Select civ_casualties/7 From source1 Where region=“12” Select civ_casualties From source1 Where region=“12” Context 20 Transformation 140 Source Receiver Simplified Overview of COntext INterchange (COIN) Advanced Technology Concept: Time Modifiers:DaysWeeks f() Weeks Days Specialized symbolic equation solving techniques used to dynamically create comprehensive conversion programs from small conversion components Light-weight Ontologies with Context Modifiers Mediation & Transformation uses an integrated framework of abductive and constraint logic programming Declarative description of Source’s actual semantics Declarative description of Receiver’s desired/expected semantics 2 1 3
Agenda • Introduction to the Context Interchange (COIN) Project – Motivation "[A]lthough there are many private and public databases that contain information potentially relevant to counterterrism programs, they lack the necessary contextdefinitions (i.e., metadata) and access tools to enable interoperation with other databases and the extraction of meaningful and timely information." National Research Council (2002), "Making the Nation Safer" (emphasis added) • Description of the COIN Context Mediation Technology • Demonstration of the COIN Technology
COntext INterchange (COIN) Project CONTEXT MEDIATION * Automatic conflict detection and conversion - Derived data - Source selection - Source attribution Web Pages INPUT PROCESSING * Automatic web wrapping - Semi-structured text -Multi-source query plan and execution OUTPUT PROCESSING ODBC Driver Web - Publishing Appli- cations Receivers Sources TRUSTED AGENTS Data bases Browsers APPLICATIONS: Financial services, electronic commerce, asset visibility, in-transit visibility.
Key COIN Technologies • Web Wrapper • Extractselected information from web (HTML+XML) • Allows web to be treated as large relational SQL database • Handles dynamic web sites, cookies, “login”, etc. • Performs SQL Joins & Unions involving DB’s + Web sources • Context Mediator • Resolvesemantic (meaning)differences • Enable meaningful aggregation & comparison
Context Context Context Role Of Context 08-07-09 07-08-09 $ ? 09-07-08 • CONTEXT VARIATIONS: • - GEOGRAPHIC ( US vs. UK ) • - FUNCTIONAL (CASH MGMT vs. LOANS ) • - ORGANIZATIONAL ( CITIBANK vs. CHASE ) Data: Databases Web data E-mail
Representational Ontological Temporal Types of Context
Example : Context Differences ( from multiple web sources) Daimler Benz ( DCX ) Financial Data P/E Ratio ABC 11.6 Bloomberg 5.57 DBC 19.19 MarketGuide 7.46
Complementary Aggregation Example • Q: How did CO2 emissions (total, per GDP, per capita) change over time (between 1990 and 2000) in Yugoslavia? • Context 1: YUG as a geographic region bounded before the breakup • Context 2: YUG as a legal autonomous state • Related effort: • Laboratory for Information Globalization and Harmonization Technologies and Studies ( LIGHTS ) Project
World Bank’s World Dev. Indicator DB; UN Statistic Division; Statistics Bureaus OAK Ridge’s CDIAC DB; WRI; GSSD; EPAs Olsen (Web) In 1000 tons per year GDP in billions local currency; Population in millions Total CO2 in 1000 tons per year; GDP in billions USD; CO2/Capita in tons per person; CO2/GDP in tons per million USD; GDP/Capita in USD per person Many sources needed: Meanings in sources & users might differ
The 1999 Overture Unit-of-measure mixup tied to loss of $125Million Mars Orbiter “NASA’s Mars Climate Orbiter was lost because engineers did not make a simple conversion from English units to metric, an embarrassing lapse that sent the $125 million craft off course. . . . . . . The navigators ( JPL ) assumed metric units of force per second, or newtons. In fact, the numbers were in pounds of force per second as supplied by Lockheed Martin ( the contractor ).” Source: Kathy Sawyer, Boston Globe, October 1, 1999, page 1.
Shared Conversion Ontologies Libraries Context Management Administrator Context Mediator Source Receiver Context Context part length Select partlengthx3.35 From catalog Where partno=“12AY” Select partlength From catalog Where partno=“12AY” Context 55.25 Transformation 17 Source Receiver The Context Interchange Approach Concept: Length MetersFeet f() meters feet 2 1 3
COIN Conceptual Model (Ontology)
Another Context Example (Basis for Demo) Context Mediation Services Company Name DAIMLER-BENZ 614,995 Net Income * 97,736,992 Sales Datastream Company Name DAIMLER-BENZ AG * Net Income 346,577 Sales 56,268,168 WorldScope Company Name DAIMLER BENZ CORP Net Income 615,000,000 * Sales 97,737,000,000 Appl. Users & Disclosure * O&A DEM-USD Exchange Rate Systems 1.00 German Mark= 0.58 US Dollar as 12/31/93 OANDA Web Server * Wrapper Services
Disclosure Worldscope DataStream Country of USD Country of Currency Incorporation Incorporation Used Money Amount Money Amount Money Amount Currency As_Of_Date As_Of_Date As_Of_Date Conversion 3 Letters 3 Letters 2 Letters Currency Symbols 1 1000 1000 Scale Fact or Disclosure Names Worldscope Names DataStream Names Company Names American with ‘/’ as American with ‘/’ as European with ‘ - ’ as Date Style separator separator separator Olsen (OANDA) Web Source uses 3 Letter Currency Symbols and European Date Style with ‘/’ as a separator Some Context Differences Context Definitions
exchange- Rate number string curTypeSym fromCur toCur country- Name officialCurrency scaleFactor currency- Type dateFmt txnDate format currency countryIncorp date fyEnding company- Financials company- Name company Domain Model • Some currency context possibilities: • Currency is stated explicitly as part of record • Currency not stated, but the same for all (e.g., US $) • Currency not stated or constant, but inferred by country Inheritance Attribute Modifier
COIN System Architecture SERVER PROCESSES MEDIATOR PROCESSES CLIENT PROCESSES Web Client COIN N SQL Compiler ( cgi -scripts) Repository SQL Query HTTPD-Daemon HTTPD-Daemon Context Datalog Mediator N Query WWW Gateway SQL Query Mediated Query Optimizer Wrapper Optimized ODBC-compliant Apps Query Plan Executioner HTTPD-Daemon Results (e.g Microsoft Excel) ODBC-Driver Web-site Data Store for HTTPD-Daemon Intermediate Results
System Demonstration Single Source Queries with Mediation Q6. Scenario: Using Context Interchange, you can look at the Disclosure data using Datastream Context. Query: Find out from Disclosure what Net Income for DAIMLER-BENZ was. Use Datastream Context. Capabilities Demonstrated: Ability to perform Scale Factor Conversion, Date Format Conversion, Company Name Conversion.
Demonstration @ context2.mit.edu Source Context
Conflict Detection and Mediation Mediated Query in Datalog Date convert Scale factor convert Name convert
Mediated SQL Query & Result Mediated SQL Query Adjust scale factor Date format conversion Name conversion Final results – from Disclosure but in Datastream context
More Complex Example (4 sources: DB + Web) Databases Web source select WorldcAF.TOTAL_ASSETS, DiscAF.NET_SALES, DiscAF.NET_INCOME, DStreamAF.TOTAL_EXTRAORD_ITEMS_PRE_TAX, quotes.Last from WorldcAF, DiscAF, DStreamAF, quotes where WorldcAF.COMPANY_NAME = "DAIMLER-BENZ AG" and DStreamAF.AS_OF_DATE = "01/05/94" and WorldcAF.COMPANY_NAME = DStreamAF.NAME and WorldcAF.COMPANY_NAME = DiscAF.COMPANY_NAME and WorldcAF.COMPANY_NAME = quotes.Cname;
Generated SQL (1st Part) select worldcaf.total_assets, discaf.net_sales, ((discaf.net_income*0.001)*olsen.rate), (dstreamaf2.total_extraord_items_pre_tax*olsen2.rate), quotes.Last from (select date1, 'European Style -', '01/05/94', 'American Style /' from datexform where format1='European Style -' and date2='01/05/94' and format2='American Style /') datexform, (select dt_names, 'DAIMLER-BENZ AG' from name_map_dt_ws where ws_names='DAIMLER-BENZ AG') name_map_dt_ws, (select ds_names, 'DAIMLER-BENZ AG' from name_map_ds_ws where ws_names='DAIMLER-BENZ AG') name_map_ds_ws, (select 'DAIMLER-BENZ AG', ticker, exc from ticker_lookup2 where comp_name='DAIMLER-BENZ AG') ticker_lookup2, (select 'DAIMLER-BENZ AG', latest_annual_financial_date, current_outstanding_shares, net_income, sales, total_assets, country_of_incorp from worldcaf where company_name='DAIMLER-BENZ AG') worldcaf, (select country, currency from currencytypes where currency <> 'USD') currencytypes, (select exchanged, 'USD', rate, date from olsen where expressed='USD') olsen, (select company_name, latest_annual_data, current_shares_outstanding, net_income, net_sales, total_assets, location_of_incorp from discaf) discaf,
Generated SQL (Continued - Partial) (select as_of_date, name, total_sales, total_extraord_items_pre_tax, earned_for_ordinary, currency from dstreamaf) dstreamaf, (select as_of_date, name, total_sales, total_extraord_items_pre_tax, earned_for_ordinary, currency from dstreamaf) dstreamaf2, (select char3_currency, char2_currency from currency_map where char3_currency <> 'USD') currency_map, (select country, currency from currencytypes where currency <> 'USD') currencytypes2, (select exchanged, 'USD', rate, '01/05/94' from olsen where expressed='USD' and date='01/05/94') olsen2, (select Cname, Last from quotes) quotes where currencytypes.country = discaf.location_of_incorp and currencytypes.currency = olsen.exchanged and dstreamaf.currency = dstreamaf2.currency and dstreamaf2.currency = currency_map.char2_currency and olsen.date = discaf.latest_annual_data and currency_map.char3_currency = currencytypes2.currency and currencytypes2.currency = olsen2.exchanged and name_map_dt_ws.dt_names = dstreamaf2.name and name_map_ds_ws.ds_names = discaf.company_name and ticker_lookup2.ticker = quotes.Cname and datexform.date1 = dstreamaf2.as_of_date and currencytypes.currency <> 'USD' and currency_map.char3_currency <> 'USD' union select worldcaf2.total_assets, discaf2.net_sales, ((discaf2.net_income*0.001)*olsen3.rate), dstreamaf4.total_extraord_items_pre_tax, quotes2.Last from (select date1, 'European Style -', '01/05/94', 'American Style /' from datexform where format1='European Style -' and date2='01/05/94' and format2='American Style /') datexform2, (select dt_names, 'DAIMLER-BENZ AG' from name_map_dt_ws where ws_names='DAIMLER-BENZ AG') name_map_dt_ws2, (select ds_names, 'DAIMLER-BENZ AG' from name_map_ds_ws where ws_names='DAIMLER-BENZ AG') name_map_ds_ws2, (select 'DAIMLER-BENZ AG', ticker, exc from ticker_lookup2 where comp_name='DAIMLER-BENZ AG') ticker_lookup22, (select 'DAIMLER-BENZ AG', latest_annual_financial_date, current_outstanding_shares, net_income, sales, total_assets, country_of_incorp from worldcaf where company_name='DAIMLER-BENZ AG') worldcaf2, (select country, currency from currencytypes where currency <> 'USD') currencytypes3, (select exchanged, 'USD', rate, date from olsen where expressed='USD') olsen3, (select company_name, latest_annual_data, current_shares_outstanding, net_income, net_sales, total_assets, location_of_incorp from discaf) discaf2, (select as_of_date, name, total_sales, total_extraord_items_pre_tax, earned_for_ordinary, currency from dstreamaf) dstreamaf3, (select 'USD', char2_currency from currency_map where char3_currency='USD') currency_map2, etc
Execution Trace (1st Part - Partials) Parallel Execution . . . Retrieving data From Web source
Execution Trace (Continued - Partials) . . . Stock price returned From Web source Another Web source used (for currency conversion) . . .
The 1805 Overture In 1805, the Austrian and Russian Emperors agreed to join forces against Napoleon. The Russians promised that their forces would be in the field in Bavaria by Oct. 20. The Austrian staff planned its campaign based on that date in the Gregorian calendar. Russia, however, still used the ancient Julian calendar, which lagged 10 days behind. The calendar difference allowed Napoleon to surround Austrian General Mack's army at Ulm and force its surrender on Oct. 21, well before the Russian forces could reach him, ultimately setting the stage for Austerlitz. Source: David Chandler, The Campaigns of Napoleon, New York: MacMillan 1966, pg. 390.
Summary • Tremendous opportunity to gather and integrate information from many diverse sources • But … need to overcome many context challenges • Context-type “metadata” plays a critical role • COIN technology can be an important aid for semantically meaningful information integration: • - Scalable • - Extensible • - Application Domain Merging • - Reuse and extension of ontologies and contexts
Appendix – Some Useful Reference Material • Documents Overview: http://computer.org/conferen/proceed/meta/1999/papers/84/smadnick.html"Metadata Jones and the Tower of Babel: The Challenge of Large-Scale Semantic Heterogeneity", (IEEE Meta-Data Conference)Theory of COIN: http://web.mit.edu/smadnick/www/wp/1997-03.pdf“Context Interchange: New Features and Formalisms for the Intelligent Integration of Information” (ACM TOIS) Contact us for more … • Web sites Main COIN web site: http://context2.mit.edu Miscellaneous demos: http://context2.mit.edu/coin/demos/ Self-explanatory demo: http://interchange.mit.edu:8080/gcms_v4/airCarMergeTop.html (Airfare and Car rental applications, includes ontology merging. Caution: still under development)
Appendix: Sample Applications • Airfare, Car Rental and Merged Travel • Weather • Global Price Comparison • Airfare Aggregation • Disaster Relief • TASC Financial Example • Web Services Demo • Corporate Householding • Counter-Terrorism Intelligence
Appendix: COIN Web-Wrapper Technology User or Program(via SQL Query) Select Edgar.Net_income From Edgar Where Edgar.Ticker=intc and Edgar.Form=10-Q Web page spec file * Web Wrapper Generator HTML Side SQL Side Ticker Net Income 1,983 INTC Data record returned * Spec file contains: Schema, Navigation rules, and Extraction rules.