ITCS 6163

ITCS 6163 View Maintenance

Implementing a Warehouse • Monitoring: Sending data from sources • Integrating: Loading, cleansing,... • Processing: Query processing, indexing, ... • Managing: Metadata, Design, ...

new Monitoring • Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … • Incremental vs. Refresh

Monitoring Techniques • Periodic snapshots • Database triggers • Log shipping • Data shipping (replication service) • Transaction shipping • Polling (queries to source) • Application level monitoring è Advantages & Disadvantages!!

Monitoring Issues • Frequency • periodic: daily, weekly, … • triggered: on “big” change, lots of changes, ... • Data transformation • convert data to uniform format • remove & add fields (e.g., add date to get history) • Standards (e.g., ODBC) • Gateways

Monitoring Products • Gateways: Info Builders EDA/SQL, Oracle Open Connect, Informix Enterprise Gateway, … • Data Shipping: Oracle Replication Server, Praxis OmniReplicator, … • Transaction Shipping: Sybase Replication Server, Microsoft SQL Server • Extraction: Aonix, ETI, CrossAccess, DBStar • Monitoring/Integration products later on

Client Client Query & Analysis Metadata Warehouse Integration Source Source Source Integration • Data Cleaning • Data Loading • Derived Data Query &analysis integration

Change detection • Detect & send changes to integrator • Different classes of sources • Cooperative • Queryable • Logged • Snapshot/dump

Data transformation • Convert data to uniform format • Byte ordering, string termination • Internal layout • Remove, add, & reorder attributes • Add (regeneratable) key • Add date to get history

Data transformation (2) • Sort tuples • May use external utilites • Can be much faster (10x) than SQL engine • E.g., perl script to reorder attributes

External functions (EFs) • Special transformation functions • E.g., Yen_to_dollars • User defined • Specified in warehouse table definition • Aid in integration • Must be applied to updates, too

Data integration • Rules for matching data from different sources • Build composite view of data • Eliminate duplicate, unneeded attributes

billing DB customer1(Joe) merged_customer(Joe) service DB customer2(Joe) Data Cleaning • Migration (e.g., yen ð dollars) • Scrubbing: use domain-specific knowledge (e.g., social security numbers) • Fusion (e.g., mail list, customer merging) • Auditing: discover rules & relationships(like data mining)

Data cleansing • Find (& remove) duplicate tuples • E.g., Jane Doe & Jane Q. Doe • Detect inconsistent, wrong data • Attributes that don’t match • E.g., city, state and zipcode • Patch missing, unreadable data • Want to “backflush” clean data • Notify sources of errors found

Loading Data • Incremental vs. refresh • Off-line vs. on-line • Frequency of loading • At night, 1x a week/month, continuously • Parallel/Partitioned load

Derived Data • Derived Warehouse Data • indexes • aggregates • materialized views (next slide) • When to update derived data? • Incremental vs. refresh

The “everything is a view” view • Pure programs: e.g., “can queries.” Always the same cost. No data is materialized. (DBMSs) • Derived data: Materialized views. Data always there but must be updated. (Good for warehouses.) • Pure data: Snapshot. Procedure is thrown away! Not maintainable. • Approximate: Snapshot+refresh procedure applied in some conditions. (Quasi-copies). Approximate models (e.g., statistical). (Quasi-cubes).

does not exist at any source Materialized Views • Define new warehouse relations using SQL expressions

Integration Products • Monitoring & Integration: Apertus, Informatica, Prism, Sagent, … • Merging: DataJoiner, SAS,… • Cleaning: Trillum, ... • Typically take warehouse off-line • Typically refresh • or simple incremental: e.g., Red Brick Table Management Utility, Prism

Client Client Query & Analysis Metadata Warehouse Integration Source Source Source Managing • Metadata • Warehouse Design • Tools Query &analysis integration

Metadata • Administrative • definition of sources, tools, ... • schemas, dimension hierarchies, … • rules for extraction, cleaning, … • refresh, purging policies • user profiles, access control, ...

Metadata • Business • business terms & definition • data ownership, charging • Operational • data lineage • data currency (e.g., active, archived, purged) • use stats, error reports, audit trails

Tools • Development • design & edit: schemas, views, scripts, rules, queries, reports • Planning & Analysis • what-if scenarios (schema changes, refresh rates), capacity planning • Warehouse Management • performance monitoring, usage patterns, exception reporting • System & Network Management • measure traffic (sources, warehouse, clients) • Workflow Management • “reliable scripts” for cleaning & analyzing data

Tools - Products • Management Tools • HP Intelligent Warehouse Advisor, IBM Data Hub, Prism Warehouse Manager • System & Network Management • HP OpenView, IBM NetView, Tivoli

Current State of Industry • Extraction and integration done off-line • Usually in large, time-consuming, batches • Everything copied at warehouse • Not selective about what is stored • Query benefit vs storage & update cost • Query optimization aimed at OLTP • High throughput instead of fast response • Process whole query before displaying anything

Future Directions • Better performance • Larger warehouses • Easier to use • What are companies & research labs working on?

Research (1) • Incremental Maintenance • Data Consistency • Data Expiration • Recovery • Data Quality • Error Handling (Back Flush)

Research (2) • Rapid Monitor Construction • Temporal Warehouses • Materialization & Index Selection • Data Fusion • Data Mining • Integration of Text & Relational Data

Items sold Sales Make warehouse self-maintainable • Add auxiliary tables to minimize update cost • Original + auxiliary are self-maintainable • E.g., auxiliary table of all unsold catalog items • Some updates may still be self-maintainable • E.g., insert into catalog if item (the join attribute) is a key Catalog

Detection of self-maintainability • Most algorithms are at table level • Most algorithms are compile-time • Tuple levelat runtime [Huyn 1996, 1997] • Use state of tables and update to determine if self-maintainable • E.g., check whether sale is for item previously sold

Warehouse maintenance • Current systems ignore integration of new data • Or assume warehouse can be rebuilt periodically • Depend on long “downtime” to regenerate warehouse • Technology gap: continuous incremental maintenance

Maintenance research • Change detection • Data consistency • Single table consistency • Multiple table consistency • Expiration of data • Crash recovery

Snapshot change detection • Compare old & new snapshots • Join-based algorithms • Hash old data, probe with new • Window algorithm • Sliding window over snapshots • Good for local changes

Integrated data consistency • Conventional maintenance inadequate • Sources report changes but: • No locking, no global transactions (sources don’t communicate, coordinate with each other) • Inconsistencies caused by interleaving of updates

Example anomaly • table Sold = catalog x sale x emp • insert into sale[hat, Sue] • delete from catalog[$12, hat] Sold price item clerk age sale catalog emp price item item clerk clerk age Sue 26 $12 hat

catalog price item Sold price item clerk age emp clerk age sale catalog Sue 26 item clerk price item $12 hat Anomaly (2) $12,hat,Sue,26 ignored Q1 = catalog  [hat, Sue] A(Q1)= [$12,hat, Sue] Q2 = [$12,hat, Sue]  emp delete from catalog [$12, hat] insert into sale [hat, Sue] A(Q2)= [$12,hat,Sue,26] hat Sue

Choices to deal with anomalies • Keep all relations in the DW (storage-expensive!) • Run all queries as distributed (may not be feasible! --legacy systems-- + poor performance) • Use specialized algorithms. E.g., Eager Compensation Algorithm (ECA), STROBE.

Sold price clerk catalog catalog sale sale price item price item item clerk item clerk $12 hat hat Sue Another anomaly example V = [$12, Sue]-= V V = [$12, Sue]-= V WRONG! $12 Sue Delete(catalog[$12,hat]) Q1= p,c([$12,hat] sale) A(Q1) =  A(Q2) =  Delete(sale[hat,Sue])

Depts Store City Add. NY Madison Ave catalog Store Dept City City Add. Shoes NY Yet another anomaly example Depts = Dept(catalog  Store) Shoes Bags Shoes Bags Bags Q1= Dept(catalog [NY,Madison Av]) Q2= Dept([Bags,NY]  Store) Insert(catalog[Bags,NY]) A(Q1) = [[Shoes],[Bags]] A(Q2) = [[Bags]] Insert(Store[NY, Madison Av. Bags NY

Eager Compensating Algorithm(ECA) Principle: send compensating queries to offset the effect of concurrent updates ONLY GOOD IF ALL THE SOURCE RELATIONS ARE STORED IN ONE NODE (ONE SOURCE).

Depts Store City Add. NY Madison Ave catalog Store Dept City City Add. Shoes NY Anomaly example revisited (ECA) Depts = Dept(catalog  Store) Q2= Dept([Bags,NY]  Store) -Dept([Bags,NY]  [NY,Madison Ave]] Shoes Bags Q1= Dept(catalog [NY,Madison Av]) Insert(catalog[Bags,NY]) A(Q1) = [[Shoes],[Bags]] A(Q2) =  Insert(Store[NY, Madison Av. Bags NY

ECA Algorithm SOURCE DATA WAREHOUSE (DW) S_upi: Execute Ui W_upi: receive Ui send Ui to DW Qi=V(Ui)-QQSQj(Ui) trigger W_upi at DW UQS = UQS + {Qi} Send Qi to S trigger S_qui at S S_qui : Receive Qi W_ansi: Receive Ai let Ai = Qi(ssi) COL = COL + Ai Send Ai to DW UQS = UQS - {Qi} trigger W_ansi at DW if UQS =  MV=MV+COL COL =  ssi = current source state UQS = unanswered query set

ECA-key Avoids the need for compensating queries. Necessary condition: the view contains key attributes for each of the base tables (e.g., star schema)

UQS=UQS+{Q2}={Q1,Q2} A1 = {[bag,Jane]} COL = {[bags,Jane], [bags,Sue]} COL = {[hat,Sue]} COL = {[bags,Jane]} UQS = {Q2} UQS =  hat Sue COL =  Q1= i,d(catalog [hat,Jane]) UQS = {Q1} Sells Item clerk bags Sue bagsJane Insert(catalog[bag,acc])) catalog catalog catalog emp Item dept. Item dept. Item dept. item clerk hat acc bags acc acc Sue hat acc bags acc Example of ECA-key Q2= i,c([bags,acc] emp) A(Q2) = {[bags,Sue],[bags,Jane]} Delete(catalog,[hat,acc]) Insert (sale[acc,Jane]) acc Jane

Strobe algorithm ideas • Apply actions only after a set of interleaving updates are all processed • Wait for sources to quiesce • Compensate effects of interleaved updates • Subtract effects of later updates before installing changes • Can combine these ideas STROBE IS A FAMILY OF ALGORITHMS

Strobe Terminology • The materialized view MV is the current state of the view at the warehouse V(ws). • Given a query Q that needs to be evaluated, the function next_source(Q) returns the pair (x,Qi), where x is the next source to contact and Qithe portion of the query that can be answered by x. • Example: if V = r1  r2  r3, and U and update received from r2, then Q = (r1  U  r3) and next_source(Q) = (r1, r1  U)

Strobe terminology (2) Source_evaluation(Q): /returns answers to Q/ Begin i = 0; WQ = Q; A0 = Q;(x,Q1)  next_source(WQ); While x is not nil do Let i = i + 1; Send Qi to source x; When x returns Ai, let WQ = WQ(Ai); Let (x,Qi+1)  next_source(WQ); Return(Ai); End

Strobe Algorithm Source DW -After exec. Ui, send Ui to DW AL =  -When receiving Qi When Ui is received Compute Aiover ss[x] if a deletion Send Aito DW Qj  UQS add Ui to pend(Qj) Add key_del(MV,Ui) to AL if an insertion Qi = V(Ui), pend(Qi) =  Ai= source_evaluate(Qi); Uj  pend(Qi), key_del(Ai,Uj); Add insert(MV,Ai) to AL When UQS =  , apply AL to MV as a single transaction, without adding duplicate tuples to MV Reset AL

Pend(Q1) =  AL ={key_del(MV,U2)} AL =  Apply key_del(A12,U2)  A2 =  Add nothing to AL UQS =  MV =  Q1=catalog [hat,Sue]emp Pend(Q1) = U2 catalog U2=Del([$12,hat]) A11=[$12,hat,Sue] Sold Q11=(catalog[hat,Sue]) price item A12=[$12,hat,Sue,26] Q12 =[$12,hat,Sue]  emp U1=Insert(sale, [hat, Sue]) price item clerk age sale catalog emp price item item clerk clerk age Sue 26 $12 hat Example with Strobe hat Sue

AL = {del([hat,Sue]} MV =  AL = {ins([shoes,Jane]} MV = {[shoes,Jane]} T1 = {delete(sale,[hat,Sue]), insert(sale,[shoes,Jane])} sale sale sale sale sale item clerk item clerk item clerk item clerk item clerk shoes Jane hat Sue hat Sue shoes, Jane Transaction-Strobe

ITCS 6163

ITCS 6163

Presentation Transcript

ITCS 6114

ITCS 6163

ITCS 6010

ITCS 937

ITCS 6114

ITCS 6010

ITCS 6265

ITCS 6114

ITCS 6265

ITCS 6163

ITCS 6163 Data Warehousing

ITCS 6010

ITCS 6114

ITCS 6114

ITCS 6114

ITCS-3190

ITCS 6114

ITCS 6114

ITCS 6010

ITCS 6114