Shailendra Mishra Director (CEP)
Case Study – Collusion Detection in Multi-player Games
• Consider the following problem:
  • (A) commits an identity theft.
  • (A) acquires (n) credit cards as a result of the identity theft.
  • (A) goes to an online gaming site and uses the credit cards to play online poker with his (s) friends.
  • (A) loses all his money to his (s) friends.
  • The online gaming company now has to pay his friends.
• Analysis of the problem:
  • Assume, for a moment, that (A) did not commit identity theft.
  • (A) is playing a game with his friends, fair or otherwise.
  • The results of this game generate a stream of wins and losses by (A) to any number i of his friends, where 1 <= i <= s.
  • The problem is to detect whether the pattern of wins and losses is genuine or not.
• More formally, we are asking: “When is a certain number of occurrences of a particular subsequence unlikely to be fortuitous?”
Modeling the Collusion Detection Problem
• Let T be an ordered sequence of events.
• Let W be the window of observation, of size w, within which the analysis is confined.
• Formally, consider an alphabet Д of cardinality |Д|.
• Consider an event sequence T = t1, t2, …, tn of length n over Д.
• We then define an episode over Д as one of the following:
  • A single pattern S = s1s2s3…sm of length m.
  • A set of patterns S1, S2, …, Sd.
  • The set of all distinct permutations of S, for cases where ordering within the window of observation does not matter.
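The three kinds of episode listed above can be illustrated with a small sketch. It assumes a hypothetical encoding in which each event is a single character ("W" for a win by (A), "L" for a loss); the function names are illustrative and do not come from the slides.

```python
from collections import Counter


def is_subsequence(pattern, window):
    """Single pattern: the symbols of `pattern` occur in `window` in order."""
    remaining = iter(window)
    # `sym in remaining` consumes the iterator up to the first match,
    # so later symbols can only match at later positions.
    return all(sym in remaining for sym in pattern)


def contains_any(patterns, window):
    """Set of patterns: at least one of S1..Sd occurs in order."""
    return any(is_subsequence(p, window) for p in patterns)


def contains_parallel(pattern, window):
    """All permutations of S: order is ignored, so only symbol counts matter."""
    needed = Counter(pattern)
    available = Counter(window)
    return all(available[sym] >= count for sym, count in needed.items())
```

For the window "LWWL", the pattern "WLL" is not found as an ordered pattern, but it is found once ordering is ignored, because the window holds one "W" and two "L" events.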
Formal Statement of the Problem
• Assume the event sequence is generated by a memoryless (Bernoulli) or Markov source.
• Restated formally: we are interested in $\Omega^{\exists}(n, w, m)$, the number of windows containing at least one occurrence of S when the window is slid over n consecutive events of T.
• To address this:
  • Compute the expected value $E[\Omega^{\exists}(n, w, m)]$.
  • Compute the variance $\mathrm{Var}[\Omega^{\exists}(n, w, m)]$.
  • Show that $\Omega^{\exists}(n, w, m)$, suitably normalized, converges to a normal distribution.
• This allows us to set a threshold $\tau(n, w, m)$ such that, for a given confidence level $\beta$, $P\{\Omega^{\exists}(n, w, m) > \tau(n, w, m)\} < \beta$.
• It follows that if more than $\tau(n, w, m)$ such windows are observed, that count is highly unlikely to have been generated by randomness.
Formulation of the Equivalent Pattern Matching Problem
• Given an alphabet Д = {a1, a2, …, a|Д|} and a pattern S = s1s2…sm of length m.
• Search for occurrences of S as a subsequence within a window W of size w in another sequence, the event sequence T = t1t2…tn of length n.
• A valid occurrence of S in T corresponds to a set of integers i1, i2, …, im such that the following hold:
  • $1 \le i_1 < i_2 < \dots < i_m \le n$
  • $t_{i_1} = s_1,\ t_{i_2} = s_2,\ \dots,\ t_{i_m} = s_m$
  • $i_m - i_1 < w$
• We now estimate $\Omega^{\exists}(n, w, m, S, Д)$, the number of windows that contain at least one occurrence of S when the window is slid over n consecutive events of the event sequence T over alphabet Д.
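The counting problem above can be sketched directly by brute force. This is only an illustration: it assumes events are encoded as single characters and counts only the full windows of size w (there are n - w + 1 of them), which is one possible convention; the slides do not say how windows at the ends of T are treated.

```python
def contains_episode(window, pattern):
    """True if `pattern` occurs in `window` as an ordered subsequence."""
    remaining = iter(window)
    return all(sym in remaining for sym in pattern)


def count_windows(events, pattern, w):
    """Observed number of size-w windows of `events` containing `pattern`.

    Any occurrence found inside a window of size w automatically
    satisfies i_m - i_1 < w.
    """
    return sum(
        contains_episode(events[start:start + w], pattern)
        for start in range(len(events) - w + 1)
    )
```

With events "LWLLWL", pattern "WL" and w = 3, the windows are "LWL", "WLL", "LLW" and "LWL"; three of the four contain the pattern.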
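Theorems & Results – Gwadera, Atallah & Szpankowski (Purdue)
• Consider a memoryless source with $p_i$ being the probability of generating symbol $a_i \in$ Д.
• Also, assume $P(S) = \prod_{i=1}^{m} p_i$, where in this product $p_i$ is the probability of the i-th pattern symbol $s_i$.
• Result 1: Probability that a window of size w contains at least one occurrence of episode S. For all m and $w \ge m$ we have:
  $P^{\exists}(w, m) = P(S) \sum_{i=0}^{w-m} \; \sum_{n_1 + \dots + n_m = i} \; \prod_{k=1}^{m} q_k^{n_k}$, where $q_k = 1 - p_k$ and each $n_k \ge 0$.
• Result 2: Now let m be fixed and let $i \ne j \Rightarrow p_i \ne p_j$. Then, for any $\varepsilon > 0$, as $w \to \infty$:
  $P^{\exists}(w, m) = 1 - P(S) \sum_{i=1}^{m} \frac{(1 - p_i)^{w}}{p_i} \prod_{j \ne i} \frac{1}{p_j - p_i} + O(\varepsilon^{w})$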
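Both results can be evaluated numerically and checked against each other. The sketch below is an illustration under stated assumptions: `p` is a list holding the probability of each pattern symbol s1..sm in order, the function names are made up for this example, and the inner sum of Result 1 is computed by expanding the product of 1/(1 - q_k z) as a power series instead of enumerating every tuple (n_1, …, n_m).

```python
from math import prod


def p_exists_sum(p, w):
    """Result 1: P(S) times the sum over i = 0..w-m of the inner sums."""
    m = len(p)
    if w < m:
        return 0.0
    # h[d] ends up as the sum of prod(q_k ** n_k) over all n_1+...+n_m = d.
    h = [1.0] + [0.0] * (w - m)
    for pk in p:
        qk = 1.0 - pk
        for d in range(1, w - m + 1):
            h[d] += qk * h[d - 1]  # series division by (1 - qk * z)
    return prod(p) * sum(h)


def p_exists_closed_form(p, w):
    """Leading terms of Result 2; requires all p_i to be distinct."""
    total = 0.0
    for i, pi in enumerate(p):
        others = prod(pj - pi for j, pj in enumerate(p) if j != i)
        total += (1.0 - pi) ** w / (pi * others)
    return 1.0 - prod(p) * total
```

For example, with p = [0.5, 0.3, 0.2] and w = 6 the two functions agree to within floating-point error, and for a single-symbol pattern both reduce to 1 - (1 - p_1)^w.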
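Computation of Bounds
• Assume a memoryless source. Then, for x = O(1) and for fixed m and w:
  $\lim_{n \to \infty} P\left\{ \frac{\Omega^{\exists}(n, w, m) - E[\Omega^{\exists}(n, w, m)]}{\sqrt{\mathrm{Var}[\Omega^{\exists}(n, w, m)]}} < x \right\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^{2}/2}\, dt$
• Now let's establish the threshold $\tau(n, w, m)$. First, for a given $\beta$, we find an $\alpha_0$ such that
  $\beta = \frac{1}{\sqrt{2\pi}} \int_{\alpha_0}^{\infty} e^{-t^{2}/2}\, dt = P\{N(0, 1) > \alpha_0\}$, where N(0, 1) is the standard normal distribution.
• We set the threshold:
  $\tau(n, w, m) = E[\Omega^{\exists}(n, w, m)] + \alpha_0 \sqrt{\mathrm{Var}[\Omega^{\exists}(n, w, m)]}$
• As long as we are in the region where the central limit theorem applies:
  $P\{\Omega^{\exists}(n, w, m) > \tau(n, w, m)\} \le \beta$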
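The threshold computation can be sketched with Python's standard library. This is a minimal illustration that assumes the expected value and variance of the window count have already been obtained by some other means and are passed in as plain numbers; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def alpha0_for(beta):
    """Solve P{N(0, 1) > alpha0} = beta for alpha0."""
    return NormalDist().inv_cdf(1.0 - beta)


def threshold(mean, variance, beta):
    """Expected count plus alpha0 standard deviations.

    `mean` and `variance` are assumed inputs; they are not derived here.
    """
    return mean + alpha0_for(beta) * sqrt(variance)
```

For beta = 0.05 the value of alpha0 is roughly 1.645, so a window count more than about 1.645 standard deviations above its expected value would be flagged at that level, provided the normal approximation holds.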
Q & A
Shailendra Mishra Director (CEP)
SQL Standards Update
• Pattern matching proposal: version 12 of the review draft has been circulated.
  • Participants: Coral8 <some parts>, IBM, Oracle, Streambase.
  • BEA Systems also reviewed the draft.
  • Status: the 12th version of the draft is ready and has been circulated.
  • Objective: submit a working draft to ANSI SQL.
• Discussing a streams language proposal with IBM.
  • Participants: IBM & Oracle.
  • Status: exchanged documents regarding language specifications.
  • Objective: submit a working draft to ANSI SQL.
• Discussing a convergence language proposal with Streambase.
  • Participants: IBM & Streambase.
  • Status: the convergence proposal has been under discussion for the last 6 months.
  • Objective: submit a paper to Transactions on Database Systems (TODS).
Pattern Query With ONE ROW PER MATCH
SELECT a_symbol,
       a_tstamp,      /* start time */
       a_price,       /* start price */
       max_c_tstamp,  /* inflection time */
       last_c_price,  /* low price */
       max_f_tstamp,  /* end time */
       last_f_price,  /* end price */
       matchno
FROM Ticker MATCH_RECOGNIZE (
       PARTITION BY Symbol
       MEASURES A.Symbol       AS a_symbol,
                A.Tstamp       AS a_tstamp,
                A.Price        AS a_price,
                MAX (C.Tstamp) AS max_c_tstamp,
                LAST (C.Price) AS last_c_price,
                MAX (F.Tstamp) AS max_f_tstamp,
                LAST (F.Price) AS last_f_price,
                MATCH_NUMBER   AS matchno
       ONE ROW PER MATCH
       AFTER MATCH SKIP PAST LAST ROW
       MAXIMAL MATCH
       PATTERN (A B C* D E* F+)
       DEFINE B AS (B.price < PREV(B.price)),
              C AS (C.price <= PREV(C.price)),
              D AS (D.price > PREV(D.price)),
              E AS (E.price >= PREV(E.price)),
              F AS (F.price >= PREV(F.price) AND F.price > A.price))
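A procedural reading of the pattern (A B C* D E* F+) may help make its meaning concrete. The sketch below is not the draft's matching algorithm: it assumes a single partition given as (tstamp, price) pairs already ordered by time, takes the longest match from each starting row, and skips past the last matched row. The function name and the dictionary layout are illustrative.

```python
def find_v_shapes(rows):
    """Return one dict per match of A B C* D E* F+ over (tstamp, price) rows."""
    tstamp = [t for t, _ in rows]
    price = [p for _, p in rows]
    n = len(rows)
    matches = []
    a = 0
    while a < n:  # A has no DEFINE condition, so any row can start a match
        j = a + 1
        found = False
        if j < n and price[j] < price[j - 1]:  # B: one strictly falling row
            j += 1
            c_first = j
            while j < n and price[j] <= price[j - 1]:  # C*: non-increasing rows
                j += 1
            c_last = j - 1 if j > c_first else None
            if j < n and price[j] > price[j - 1]:  # D: one strictly rising row
                j += 1
                run_first = j
                while j < n and price[j] >= price[j - 1]:  # E* F+: non-decreasing
                    j += 1
                # F+ needs at least one row, and the run must end above A's price.
                found = j > run_first and price[j - 1] > price[a]
        if found:
            matches.append({
                "a_tstamp": tstamp[a],
                "a_price": price[a],
                "max_c_tstamp": None if c_last is None else tstamp[c_last],
                "last_c_price": None if c_last is None else price[c_last],
                "max_f_tstamp": tstamp[j - 1],
                "matchno": len(matches) + 1,
            })
            a = j  # skip past the last row of the match
        else:
            a += 1
    return matches
```

Because the rows after D are non-decreasing, the final row exceeds A's price whenever any row in that run does, so checking only the last row is enough to decide whether F+ can be satisfied.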
Pattern Query With ALL ROWS PER MATCH
SELECT a_symbol,
       a_tstamp,      /* start time */
       a_price,       /* start price */
       max_c_tstamp,  /* inflection time */
       last_c_price,  /* low price */
       max_f_tstamp,  /* end time */
       last_f_price,  /* end price */
       matchno
FROM Ticker MATCH_RECOGNIZE (
       PARTITION BY Symbol
       MEASURES A.Symbol                AS a_symbol,
                A.Tstamp                AS a_tstamp,
                A.Price                 AS a_price,
                MAX (C.Tstamp) OVER ()  AS max_c_tstamp,
                LAST (C.Price) OVER ()  AS last_c_price,
                MAX (F.Tstamp) OVER ()  AS max_f_tstamp,
                LAST (F.Price) OVER ()  AS last_f_price,
                MATCH_NUMBER            AS matchno,
                CLASSIFIER              AS classy
       ALL ROWS PER MATCH
       AFTER MATCH SKIP PAST LAST ROW
       MAXIMAL MATCH
       PATTERN (A B C* D E* F+)
       DEFINE B AS (B.price < PREV(B.price)),
              C AS (C.price <= PREV(C.price)),
              D AS (D.price > PREV(D.price)),
              E AS (E.price >= PREV(E.price)),
              F AS (F.price >= PREV(F.price) AND F.price > A.price))
MATCH_RECOGNIZE syntax
The full syntax of the MATCH_RECOGNIZE clause is as follows:
• PARTITION BY - optional
• MEASURES - optional, but we expect this will always be used
• { ONE ROW | ALL ROWS } PER MATCH - defaults to ONE ROW
• AFTER MATCH SKIP { TO NEXT ROW | PAST LAST ROW | TO <variable> | TO LAST <variable> | TO FIRST <variable> } - defaults to AFTER MATCH SKIP PAST LAST ROW
• { MAXIMAL | INCREMENTAL } MATCH - defaults to MAXIMAL MATCH
• PERMUTE - optional
• PERMUTE EXPAND - optional
• PATTERN - mandatory
• SUBSET - optional
• DEFINE - mandatory
• CLASSIFIER - optional (ALL ROWS PER MATCH only)
• MATCH_NUMBER - optional
Q & A