LAHAR: Extracting Events from Probabilistic Streams
Chris Re, Julie Letchner, Magdalena Balazinska and Dan Suciu
University of Washington
What is a Lahar?
This is a Lahar. It’s a massive, fast stream of dirt(y data).
May 18, 1980, ~8:27am … a few minutes later
Our system, Lahar, processes queries on massive, dirty streams of data.
Lahar -- SIGMOD 2008 -- Christopher Re
Event Queries
• Motivating app: RFID
• Event queries as in Cayuga, SASE and Snoop
• Complex sequences using projections, predicates, …
Query: “Alert when Joe enters 422”
Joe entered office 422 at t=8, i.e. Joe outside 422, then inside 422
(Figure labels: A, B, C, D, E)
Challenges: Tracking Joe’s Location
(Figure: 6th floor of the CS building with antennas; the blue ring is Joe’s location)
Two problems: missed readings and granularity mismatch
• Propose: infer location, keep probabilities, and query with Lahar
• Model-Based View [Deshpande et al.] of an HMM
Lahar retains probabilities, achieves higher quality (P/R) and is still efficient.
Outline
• RFID streams to probabilistic streams
• Lahar queries on probabilistic streams
• Query algorithms: Regular and Extended Regular
• Experiments
Tracking Joe’s Location
(Figure: 6th floor of the CS building with antennas; the blue ring is ground truth)
Probabilities via particle filter
(Figure: 6th floor of the CS building with antennas; the blue ring is ground truth)
Each orange particle is a guess of Joe’s location.
Particles guess many locations per timestep, so the data are uncertain.
From particles to a probabilistic stream
Query the particle filter output via At(tag,loc), a model-based view.
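The step from particles to a probabilistic stream can be sketched as follows. This is a minimal illustration, not the system's implementation: it assumes equally weighted particles, and the function name `particles_to_marginal`, the location names and the particle counts are all hypothetical.

```python
from collections import Counter

def particles_to_marginal(particle_locations):
    """Turn one timestep's particle locations into a distribution over locations.

    Assumption: every particle carries the same weight, so the probability of
    a location is simply the fraction of particles that sit in it.
    """
    counts = Counter(particle_locations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one particle")
    return {loc: n / total for loc, n in counts.items()}

# Hypothetical particles for tag "joe" at a single timestep.
particles = ["Hall3"] * 4 + ["Hall4"] * 2 + ["O422"] * 4
at_joe = particles_to_marginal(particles)  # one timestep of At("joe", loc)
```

Each timestep then contributes one such distribution, which is what an At(tag,loc) tuple with a probability represents.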
Semantics of the Model
At(tag,loc) defines a set of possible streams (worlds); each world has probability Prob = 0.2 * 0.6 * …
A query q returns the probability that q is true at each time t.
“Joe enters 422” @ t=8: (0.4+0.2) * 0.6 = 0.36, where 0.4+0.2 is the probability that Joe is outside 422 (in Hall3, Hall4)
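The arithmetic on this slide can be checked with a short sketch. It assumes that timesteps are independent (the paper also treats Markovian correlations, which this sketch ignores). The values 0.4, 0.2 and 0.6 come from the slide; the remaining entries, the location name `O422` and both function names are assumptions added so that each distribution sums to 1.

```python
from itertools import product

# Marginals per timestep. 0.4, 0.2 (t=7) and 0.6 (t=8) are from the slide;
# the other entries are assumed fill-ins so each timestep sums to 1.
stream = {
    7: {"Hall3": 0.4, "Hall4": 0.2, "O422": 0.4},
    8: {"Hall4": 0.4, "O422": 0.6},
}

def prob_enters(stream, room, t):
    """P(outside `room` at t-1 and inside `room` at t), assuming independence."""
    outside_before = sum(p for loc, p in stream[t - 1].items() if loc != room)
    inside_now = stream[t].get(room, 0.0)
    return outside_before * inside_now

def prob_enters_by_worlds(stream, room, t):
    """Same probability, obtained by summing over all possible worlds."""
    times = sorted(stream)
    total = 0.0
    for world in product(*(stream[u].items() for u in times)):
        locs = {u: loc for u, (loc, _) in zip(times, world)}
        p = 1.0
        for _, q in world:
            p *= q
        if locs[t - 1] != room and locs[t] == room:
            total += p
    return total
```

Enumerating worlds is exponential in the number of timesteps, which is why the direct product form is the one worth streaming.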
Outline
• RFID streams to probabilistic streams
• Lahar queries on probabilistic streams
• Query algorithms: Regular and Extended Regular
• Experiments
Lahar Queries by Example
Inspired by Cayuga [Demers et al 2006, White et al 2007]
Alert when Joe is in hallway 4 and later in office 422 (Joe in Hall4, Joe in 422)
Alert when Joe is in hallway 4, and immediately in office 422 (Joe in Hall4, Joe in 422)
Challenge with probabilities: the naïve approach is exponential, and this is unavoidable (#P)
A hierarchy of Lahar queries
• Regular Queries (Efficient, streamable): “Alert when Joe enters 422”
• Extended Regular (Efficient, streamable): “Alert when anyone enters 422”
• Safe (Efficient, but not streamable)
• Unsafe (Inefficient)
Outline
• RFID streams to probabilistic streams
• Lahar queries on probabilistic streams
• Query algorithms: Regular and Extended Regular
• Experiments
Review: A non-probabilistic example
• Alert me when Joe enters 422
(Figure: automaton with state 1 (Joe in Hall4) and final state 2 (Joe in 422); state sets {}, {1}, {2}; accept at t = 8)
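A minimal sketch of the deterministic case, assuming a simplified two-state automaton: state 1 means the latest reading was in Hall4, and state 2 (final) means Joe was just seen in 422 right after Hall4. The location names, timestamps and function names are hypothetical, and the paper's compiled automaton is more general than this.

```python
def step(active, loc):
    """Advance the set of active automaton states on one location reading."""
    nxt = set()
    if loc == "Hall4":
        nxt.add(1)                      # a match can start (or restart) here
    if loc == "O422" and 1 in active:
        nxt.add(2)                      # final state: Joe just entered 422
    return nxt

def accept_times(readings):
    """Return the timesteps at which the final state 2 is active."""
    active, hits = set(), []
    for t, loc in readings:
        active = step(active, loc)
        if 2 in active:
            hits.append(t)
    return hits

# Hypothetical deterministic readings for Joe.
readings = [(6, "Hall3"), (7, "Hall4"), (8, "O422")]
```

On these readings the active set goes from {} to {1} to {2}, so the query accepts at t = 8.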
… now with probabilities
• Alert me when Joe enters 422
Distribution on states: {} 1.0, then {} 0.5, {1} 0.5, then {} 0.65, {1} 0.05, {2} 0.3
Accept at t=8 with p = 0.3
(Figure: automaton with state 1 (Joe in Hall4) and final state 2 (Joe in 422))
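With probabilities, the set of active states becomes a distribution over states that is pushed forward one timestep at a time. The sketch below assumes independent timesteps and a simplified deterministic automaton (0 = no progress, 1 = last seen in Hall4, 2 = just entered 422). The input marginals are hypothetical: they are chosen so that the accept probability is 0.3 as on the slide, but the other entries are not meant to reproduce the slide's numbers.

```python
def delta(state, loc):
    """Transition function of the simplified automaton (an assumption)."""
    if loc == "Hall4":
        return 1
    if loc == "O422" and state == 1:
        return 2
    return 0

def step_distribution(dist, marginal):
    """Push a distribution over states through one timestep's marginal."""
    nxt = {0: 0.0, 1: 0.0, 2: 0.0}
    for state, p_state in dist.items():
        for loc, p_loc in marginal.items():
            nxt[delta(state, loc)] += p_state * p_loc
    return nxt

# Hypothetical marginals for Joe at t=7 and t=8.
stream = [
    {"Hall3": 0.5, "Hall4": 0.5},
    {"Hall3": 0.3, "Hall4": 0.1, "O422": 0.6},
]

dist = {0: 1.0, 1: 0.0, 2: 0.0}
history = []
for marginal in stream:
    dist = step_distribution(dist, marginal)
    history.append(dist)
```

Only the current distribution is kept between timesteps, so the memory used depends on the number of automaton states and not on the length of the stream.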
Lies in the preceding slides… (technical details)
• Richer predication: “Alert when Joe enters any office”
• Translate query and input into an alphabet
• Key technical detail:
• Alphabet is small in data
• Streamable
• See paper for compilation
Extension to Extended Regular
“Alert when anyone enters 422”
(Obs 1) Each per-person query is regular
(Obs 2) The per-person queries use disjoint sets of events, hence they are probabilistically independent
• Algorithm:
• (Obs 1) suggests running an automaton for each person
• (Obs 2) suggests multiplying to get the probability that any is true
Space = O(# persons), not # timesteps: can stream
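The multiplication step in (Obs 2) can be sketched as below. It assumes that each person's automaton has already produced an accept probability for the current timestep and that these events are independent across persons, as the slide states. The function name and the sample probabilities are hypothetical.

```python
import math

def prob_any(per_person_probs):
    """P(at least one person's query is true), for independent per-person events.

    Uses the complement: 1 - product over persons of (1 - p).
    """
    return 1.0 - math.prod(1.0 - p for p in per_person_probs.values())

# Hypothetical accept probabilities at one timestep, one entry per person.
accept_now = {"joe": 0.3, "sue": 0.5}
p_anyone = prob_any(accept_now)  # 1 - 0.7 * 0.5
```

One number per person is kept per timestep, which matches the O(# persons) space bound.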
Summary of Contributions
• Regular Queries (Efficient, streamable)
• Compiled to an automaton, streaming, O(1) space
• Extended Regular (Efficient, streamable)
• Streaming with O(m) space, where m is the number of persons
• See paper for Markovian correlations, more sophisticated predication, complete compilation and static analysis algorithms
• Safe (Efficient, but not streamable)
• Unsafe (Inefficient, most #P-hard)
Outline
• RFID streams to probabilistic streams
• Lahar queries on probabilistic streams
• Query algorithms: Regular and Extended Regular
• Experiments
Experimental Setup
• Quality: How is P/R affected by keeping probabilities?
• 52 objects, 352 locations, 10k sq. ft.
• 2 x 30 min trace with a 10 min break in between
• Participants marked down true locations
• Query: “Alert when anyone enters a coffee room”
• Baseline: Most Likely Estimate (MLE)
• Each timestep, each person: most likely location
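A small sketch of why the MLE baseline can lose events. It assumes independent timesteps; the location names, the numbers and all function names are hypothetical and are not taken from the experiment.

```python
def mle_trace(stream):
    """Baseline: keep only the most likely location at each timestep."""
    return {t: max(dist, key=dist.get) for t, dist in stream.items()}

def enters_deterministic(trace, room, t):
    """Deterministic 'enters' check on a single collapsed trace."""
    return trace[t - 1] != room and trace[t] == room

def enters_probability(stream, room, t):
    """Probability of 'enters' when the full distributions are kept."""
    outside_before = 1.0 - stream[t - 1].get(room, 0.0)
    return outside_before * stream[t].get(room, 0.0)

# Hypothetical marginals for one person.
stream = {
    7: {"Hall4": 0.45, "Coffee": 0.55},
    8: {"Hall4": 0.10, "Coffee": 0.90},
}
trace = mle_trace(stream)
```

Here the MLE trace is Coffee at both timesteps, so the baseline reports no entry, while the probabilistic evaluation still assigns the entry a probability of about 0.4.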
Quality: Realtime – Improve over MLE?
• Declare an event “true” if its Pr > threshold
• Vary threshold
(Figure panels: Precision, Recall, F1)
10% improvement in F1
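The thresholding rule and the three metrics can be sketched as follows. The event probabilities and ground-truth timesteps are hypothetical, and the function name is an assumption.

```python
def precision_recall_f1(event_probs, truth, threshold):
    """Declare an event true when its probability exceeds the threshold,
    then score the declared events against the ground-truth timesteps."""
    predicted = {t for t, p in event_probs.items() if p > threshold}
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical query output (timestep -> probability) and ground truth.
event_probs = {3: 0.9, 8: 0.36, 12: 0.2, 15: 0.05}
truth = {3, 8}

strict = precision_recall_f1(event_probs, truth, threshold=0.5)
loose = precision_recall_f1(event_probs, truth, threshold=0.1)
```

Lowering the threshold trades precision for recall, which is why the slide varies it.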
Performance: Is the cost too high?
Synthetic data, same query
Related Work
• Event Queries – Deterministic
• Cayuga, SASE, SnoopIB
• Model-Based Views
• BBQ; recently, Kanagal et al., ICDE 08
• Probabilistic Databases
• Mystiq, Trio, MayBMS, Maryland, Purdue, MCDB
• Particle Filters on HMMs
• Doucet, Godsill
Conclusion
• Showed Lahar
• Processed output of several inference tasks (HMMs)
• Applies more generally than just RFID
• Quality (F1) gains by keeping probability
• Performance usable in real-time
• Lots of concurrent tags
• No indexing!
Overview of Regular Query Algorithm (NB: example to follow)
• Compile an event query q
• Automaton (A) over a language L
• Mapping (M) events to subsets of L
• Runtime – Input is set of events E
• Map E into subsets of L via M
• Maintain set of possible states of A
Deterministic vs. probabilistic (slide annotations): stays same; stays same; distribution; distribution
Size of distribution depends only on the query, q. For details, see paper.
Why are ER queries hard?
• Regular Queries ~ Regular Expressions
• Mapping is non-trivial
• Inspired by Cayuga [Demers et al. 06]
• Queries have #P combined complexity
• Encode mDNF as regular expression
• Intuition: n-sized automaton leads to
• Extended regular ~ 1 NFA per person
• k persons implies O(k)-size automaton
• Exponential cost
When ER, can avoid blowup
Regular and Extended Regular
• Query is regular if no variable is shared between subgoals
• Query is extended regular if any variable shared by two subgoals is shared by all subgoals
(Figure annotation: p is shared between subgoals)
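The two definitions can be turned into a small check. This is only a sketch of the definitions as stated on the slide, not the paper's static analysis: a query is represented as a list of variable sets, one per subgoal, and the representation, the function name and the label "other" (for queries that fall in neither class) are assumptions.

```python
def classify(subgoals):
    """Classify a query given the set of variables used by each subgoal."""
    var_sets = [set(s) for s in subgoals]
    # Variables that occur in at least two different subgoals.
    shared = {
        v
        for i, a in enumerate(var_sets)
        for b in var_sets[i + 1:]
        for v in a & b
    }
    if not shared:
        return "regular"
    if all(shared <= s for s in var_sets):
        return "extended regular"
    return "other"

# "Joe enters 422": both subgoals use constants only.
joe_enters = [set(), set()]
# "Anyone enters 422": person variable p appears in every subgoal.
anyone_enters = [{"p"}, {"p"}]
# p is shared by the first two subgoals but missing from the third.
mixed = [{"p"}, {"p"}, {"q"}]
```

Under this sketch, `joe_enters` is regular, `anyone_enters` is extended regular, and `mixed` falls outside both classes.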
Correlations
Sequencing by example
• Sequencing is parameterized [Cayuga]
Semicolon means “the next event among those that match the next goal”. Semicolon is not “after”.
(Figure: events over time)
Compilation by example
• Each goal “corresponds” to two letters:
• move (m) – the query should advance
• accept (a) – the next subgoal accepts
(Figure annotations: Does not contain; Does contain; Final; any other maps to empty set)
Subtle example…
• What about:
(Figure annotations: Does not contain; Does contain; Final; any other maps to empty set)
Motivating Apps
• RFID apps
• Diary and Active Calendar application
• Alert if I go to a database meeting
• Supply chain
• Alert if Mach 3 razors are being stolen
• Many independent HMMs
• Elder care [Intel/UW]
• Alert if elder takes their medicine with water
• Activity recognition
• Financial applications on predictive HMM
• Alert if head-and-shoulders market
Compile Select and Filter
• Intuition: goal maps to two letters:
• match (m): matches filter
• accept (a): accepted by select
The language and automaton are the same for both queries.
(Figure annotations: Does not contain; Does contain; Final)
Wrinkle in the language: Filter v. Selection
“Alert next time Joe is in 502 after he is in 501”: Yes
“Alert if the next place Joe is in after 501 is 502”: No
(Figure: At events over time)
Recap of Algorithms
• Regular Queries
• Compiled them to an NFA, then used image
• Data complexity O(1)
• Extended Regular
• Several regulars multiplied together
• Depends on number of distinct people in the data, not number of time steps
Quality: Archived – Improve over Viterbi?
• Smoothing v. Viterbi (MAP)
• Lahar keeps track of Markovian correlations
• Viterbi leverages correlations for MAP estimate
(Figure panels: Precision, Recall, F1)
Approx. 30% gain in F1