340 likes | 447 Views
Database 12c Row Pattern Matching. Beating the Best Pre-12c Solutions [CON3450]. Stew ASHTON Oracle OpenWorld 2014. Photo Opportunity. Presentation available on http:// www.slideshare.net / stewashton /row-patternmatching12coow14 For exact link: See @ StewAshton on Twitter
E N D
Database 12c Row Pattern Matching Beating the Best Pre-12c Solutions [CON3450] Stew ASHTON Oracle OpenWorld 2014
Photo Opportunity • Presentation available on http://www.slideshare.net/stewashton/row-patternmatching12coow14 • For exact link: • See @StewAshton on Twitter • Or see http://stewashton.wordpress.com
Agenda • Who am I? • Pre-12c solutions compared to row pattern matching with MATCH_RECOGNIZE • For all sizes of data • Thinking in patterns • Watch out for “catastrophic backtracking” • Other things to keep in mind (time permitting)
Who am I? • 33 years in IT • Developer, Technical Sales Engineer, Technical Architect • Aeronautics, IBM, Finance • Mainframe, client-server, Web apps • 25 years as an American in Paris • 9 years using Oracle database • Performance analysis • Replace Java with SQL • 2 years as internal “Oracle Development Expert”
1) “Fixed Difference” • Identify and group rows with consecutive values • My presentation: print slides to keep • Math: subtract known consecutives • If A-1 = B-2 then A = B-1 • Else A <> B-1 • Consecutive becomes equality,non-consecutive becomes inequality • “Consecutive” = fixed difference of 1
1) Pre-12c select min(page) firstpage, max(page) lastpage, count(*) cnt FROM ( SELECT page, page – Row_Number() over(order by page) as grp_id FROM t ) GROUP BY grp_id;
Think “match a row pattern” • PATTERN • Uninterrupted series of input rows • Described as a list of conditions (“regular expressions”) PATTERN (A B*) "A" : 1 row, "B" : 0 or more rows, as many as possible • DEFINE each row condition [A undefined = TRUE] B AS page = PREV(page)+1 • Each series that matches the pattern is a “match” • "A" and "B" identify the rows that meet their conditions
Input, Processing, Output SELECT * FROM t MATCH_RECOGNIZE ( ORDER BY page MEASURES A.pagefirstpage, LAST(page) lastpage, COUNT(*) cnt ONE ROW PER MATCH AFTER MATCH SKIP PAST LAST ROW PATTERN (A B*) DEFINE B AS page = PREV(page)+1 ); • Define input • Order input • Process pattern • using defined conditions • Output: rows per match • Output: columns per row • Go where after match? SELECT * FROM t MATCH_RECOGNIZE ( ORDER BY page PATTERN (A B*) DEFINE B AS page = PREV(page)+1 ONE ROW PER MATCH MEASURES A.pagefirstpage, LAST(page) lastpage, COUNT(*) cnt AFTER MATCH SKIP PAST LAST ROW );
1) Run_Stats comparison For one million rows: “Latches” are serialization devices: fewer means more scalable
2) “Start of Group” • Identify group boundaries, often using LAG() • 3 steps instead of 2: • For each row: if start of group, assign 1Else assign 0 • Running total of 1s and 0s produces a group identifier • Group by the group identifier
2) Requirement Merge contiguous date ranges in same group
with grp_starts as ( select a.*, case when start_ts= lag(end_ts) over( partition by group_name order by start_ts ) then 0 else 1 end grp_start from t a ), grps as ( select b.*, sum(grp_start) over( partition by group_name order by start_ts ) grp_id from grp_startsb) select group_name, min(start_ts) start_ts, max(end_ts) end_ts from grps group by group_name, grp_id;
2) Match_Recognize SELECT * FROM t MATCH_RECOGNIZE( PARTITION BY group_name ORDER BY start_ts MEASURES A.start_tsstart_ts, end_tsend_ts, next(start_ts) - end_ts gap PATTERN(A B*) DEFINE B AS start_ts = prev(end_ts) ); • New this time: • Added PARTITION BY • MEASURESadded gap using row outside the match! • ONE ROW PER MATCHandSKIP PAST LAST ROWare the defaults One solution replaces two methods: simple!
2) Run_Stats comparison For 500,000 rows:
2) Predicate pushing Select * from <view> where group_name = 'X'
3) “Bin fitting”: fixed size • Requirement • Order by study_site • Put in “bins” with size = 65,000 max
SELECT s first_site, MAX(e) last_site, MAX(sm) sum_cnt FROM ( SELECT s, e, cnt, sm FROM t MODEL MEASURES (study_site s, study_site e, cnt, cntsm) RULES ( sm[ > 1] = CASE WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000 THEN cnt[cv()] ELSE sm[cv() - 1] + cnt[cv()] END, s[ > 1] = CASE WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000 THEN s[cv()] ELSE s[cv() - 1] END ) ) GROUP BY s; DIMENSION BY (row_number() over(order by study_site) rn) rn [cv() – 1] [cv()] [cv()] [cv()] [cv() – 1] [cv()] rn [cv() - 1] [cv()] [cv()] [cv()] [cv() – 1] • DIMENSION with row_numberorders data and processing • rn can be used like a subscript • cv() means current row • cv()-1 means previous row
New this time: • PATTERN(A+) replaces (A B*)means 1 or more rows • Why?In previous examples I used PREV(), which returns NULL on the first row. SELECT * FROM t MATCH_RECOGNIZE ( ORDER BY study_site MEASURES FIRST(study_site) first_site, LAST(study_site) last_site, SUM(cnt) sum_cnt PATTERN (A+) DEFINE A AS SUM(cnt) <= 65000 ); One solution replaces 3 methods: simpler!
3) Run_Stats comparison For one million rows:
4) “Bin fitting”: fixed number • Requirement • Distribute values in 3 “bins” as equally as possible • “Best fit decreasing” • Sort values in decreasing order • Put each value in least full bin
4) Brilliant pre 12c solution SELECT bin, Max (bin_value) bin_value FROM ( SELECT * FROM items MODEL DIMENSION BY (Row_Number() OVER (ORDER BY item_value DESC) rn) MEASURES ( item_name, item_value, Row_Number() OVER (ORDER BY item_value DESC) bin, item_valuebin_value, Row_Number() OVER (ORDER BY item_value DESC) rn_m, 0 min_bin, Count(*) OVER () - 3 - 1 n_iters ) RULES ITERATE(100000) UNTIL (ITERATION_NUMBER >= n_iters[1]) ( min_bin[1] = Min(rn_m) KEEP (DENSE_RANK FIRST ORDER BY bin_value)[rn<= 3], bin[ITERATION_NUMBER + 3 + 1] = min_bin[1], bin_value[min_bin[1]] = bin_value[CV()] + Nvl(item_value[ITERATION_NUMBER+4], 0)) ) WHERE item_name IS NOT NULL group by bin;
SELECT * from items MATCH_RECOGNIZE ( ORDER BY item_valuedesc MEASURES sum(bin1.item_value) bin1, sum(bin2.item_value) bin2, sum(bin3.item_value) bin3 PATTERN ((bin1|bin2|bin3)+) DEFINE bin1 AS count(bin1.*) = 1 OR sum(bin1.item_value)-bin1.item_value <= least( sum(bin2.item_value), sum(bin3.item_value) ), bin2 AS count(bin2.*) = 1 OR sum(bin2.item_value)-bin2.item_value <= sum(bin3.item_value) ); PATTERN ((bin1|bin2|bin3)+) bin1 AS count(bin1.*) = 1 OR sum(bin1.item_value)-bin1.item_value <= least( sum(bin2.item_value), sum(bin3.item_value) ), bin2 AS count(bin2.*) = 1 OR sum(bin2.item_value)-bin2.item_value <= sum(bin3.item_value) • ()+ = 1 or more of whatever is inside • '|' = alternatives, “preferred in the order specified” • Bin1 condition: • No rows here yet, • Or this bin least full • Bin2 condition • No rows here yet, or • This bin less full than 3
4) Run_Stats comparison For 10,000 rows:
Backtracking • What happens when there is no match??? • “Greedy” quantifiers - * + {2,} • are not that greedy • Take all the rows they can, BUTgive rows back if necessary – one at a time • Regular expression engines will test all possible combinations to find a match
Repeating conditions select 'match' from ( select level n from dual connect by level <= 100 ) match_recognize( pattern(a b* c) define b as n > prev(n) , c as n = 0 ); Runs in 0.005 secs select 'match' from ( select level n from dual connect by level <= 100 ) match_recognize( pattern(a b* b* b* c) define b as n > prev(n) , c as n = 0 ); Runs in 5.4 secs
Imprecise Conditions SELECT * FROM Ticker MATCH_RECOGNIZE ( PARTITION BY symbol ORDER BY tstamp MEASURES FIRST(tstamp) AS start_tstamp, LAST(tstamp) AS end_tstamp AFTER MATCH SKIP TO LAST UP PATTERN (STRT DOWN+ UP+ DOWN+ UP+) DEFINE DOWN AS price < PREV(price), UP AS price > PREV(price), STRT AS price >= nvl(PREV(PRICE),0) ); Runs in 0.02 seconds CREATE TABLE Ticker ( SYMBOL VARCHAR2(10), tstamp DATE, price NUMBER ); insert into ticker select 'ACME', sysdate + level/24/60/60, 10000-level from dual connect by level <= 5000; SELECT * FROM Ticker MATCH_RECOGNIZE ( PARTITION BY symbol ORDER BY tstamp MEASURES FIRST(tstamp) AS start_tstamp, LAST(tstamp) AS end_tstamp AFTER MATCH SKIP TO LAST UP PATTERN (STRT DOWN+ UP+ DOWN+ UP+) DEFINE DOWN AS price < PREV(price), UP AS price > PREV(price) ); Runs in 24 seconds INMEMORY: 13 seconds
Keep in Mind • Backtracking • Precise conditions • Test data with no matches • To debug:Measures classifier() cl, match_number() mnAll rows per match with unmatched rows • No DISTINCT, no LISTAGG • MEASURES columns must have aliases • “Reluctant quantifier” = ? = JDBC bind variable • “Pattern variables” are range variables, not bind variables
Output Row “shape” ORA-00918, anyone?
Questions? More details at:stewashton.wordpress.com