Effective Variation Management for Pseudo Periodical Streams Lv-an Tang, Bin Cui, Hongyan Li, Gaoshan Miao, Dongqing Yang, Xinbiao Zhou School of EECS Peking University
Summary • Introduction • Related Work • Variation Management for Pseudo Periodical Stream • Experiments • Conclusion
Pseudo Periodical Stream • Data appears to repeat with a certain period • Small variations exist between different periods • Common in domains such as medicine and seismology • Typical stream variations: gradual evolution rather than burst changes
An Example of Pseudo Periodical Stream • The respiratory data repeats about every 3.2 seconds • Reflects the evolution of the patient’s illness during five hours
Variation Management on Data Stream • Data streams are widely applied in many domains • Stock market analysis • Road traffic control • Medical signal processing • Online variation management -- an important task • When did the variation occur? (Detect variations) • What is the variation? / How does it change? (Describe variations) • Why does it change in this way? (Help understand variations)
Major Technical Challenges • Value Type • Traditional algorithms: discrete values (enumerative) or time series (equidistant intervals) • Data stream: continuous real numbers with variable sampling frequencies • Training Sets or Models • Traditional algorithms need several training sets or predefined models • The data stream evolves, so such models may soon stop working • On the contrary, the system is required to generate such models as output
Major Technical Challenges II • Variation Type • Not only abnormal values and distributions • Also the structure within a period (shape) • Noise: unpredictable, random • In many applications, the variations are monitored manually • Our contribution: a new method named Pattern Growth Graph (PGG) to detect and store variations over pseudo periodical streams
Summary • Introduction • Related Work • Variation Management for Pseudo Periodical Stream • Experiments • Conclusion
Data Stream Management Systems • Data stream work can be loosely classified into two categories: DSMS and Online Data Mining • Data Stream Management Systems (DSMS) • Such as STREAM, Aurora, TelegraphCQ… • Mainly focus on answering predefined SQL queries • Do not try to find data features or to monitor variations
Online data mining • Variation management is an important part of online data mining • Three classes according to the algorithms • Symbolic Approaches • Mathematical Transformation • Predefined Models • Symbolic Approaches: Tarzan and SAX • Space: the entire time series/data stream must be held in memory • SAX has poor precision
Mathematical Transformation • Mathematical Transformation: Discrete Wavelet Transform (DWT) and Fast Fourier Transform (FFT) • Require a fixed data length and a fixed sampling frequency (equidistant intervals) • The Haar wavelet transform can only be performed on 2^n data items, e.g., the data length must be 1024 or 2048 • Predefined Models: Using Zigzag to detect events in financial streams (SIGMOD 04) • Too domain specific • Users cannot provide such models in advance; in fact, they would like them as the output
Summary • Introduction • Related Work • Variation Management for Pseudo Periodical Stream • Experiments • Conclusion
Task Specification by Respiration Stream • Variation: detect stream variations online in one pass • Wave: the smallest unit of concern is not a single point, but the values in a certain period, represented as a wave • Alarms: F is actually noise caused by body movements • Summary: a summary with an acceptable error bound is very helpful
Wave Splitting I • Variation: the difference from old data • Detected by comparing the old data with the incoming stream • Comparing at every incoming item wastes too many resources • Comparing once per wave is much more efficient • How to divide the stream according to the data features?
Wave Splitting II • A fixed-length window accumulates error • Observation: waves start and end at valley points that are smaller than a certain value
Upper Bound of Valley Points • Initially defined by the user • Updated with the average value of past valley points
Valley Sections • Valley Section: an approximately flat section that represents the time interval between two events • It is also worth studying as part of the wave • Take the last point of the section as the cut point
Two Problems in Online Matching I • Problem 1: The data stream’s sampling frequency is usually high (>100Hz), so waves should be simplified • Problem 2: How to compare two waves that have different time lengths and may not have data at the same time points? • A: {(10,0.5), (20, 1.0), (25, 1.3), …(90, 50.5)} 22 data items • B: {(11,0.5), (25, 1.2), (30, 1.7) … (87, 50)} 20 data items
Two Problems in Online Matching II • Solution 1: Piecewise Linear Representation (PLR) • This makes Problem 2 harder: once patterns are simplified into segments, how do we compare segments with points?
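The wave-splitting slides above can be illustrated with a minimal sketch. It assumes the stream arrives as (timestamp, value) pairs, treats any run of points at or below the bound as a valley section, and cuts at the last point of that section. The function name `split_waves` and the `margin` added to the running average of past valley minima are hypothetical; the slides only say that the bound is updated with the average of past valley points.

```python
from typing import Iterable, Iterator, List, Tuple

Point = Tuple[float, float]  # (timestamp, value)


def split_waves(stream: Iterable[Point], initial_bound: float,
                margin: float = 0.1) -> Iterator[List[Point]]:
    """Yield waves that end at the last point of each valley section."""
    bound = initial_bound            # user-defined starting upper bound
    valley_sum, valley_count = 0.0, 0
    wave: List[Point] = []
    in_valley = False                # currently inside a valley section?
    seen_peak = False                # does the current wave hold a non-valley point?
    section_min = float("inf")       # lowest value of the current valley section

    for point in stream:
        _, value = point
        if value <= bound:
            in_valley = True
            section_min = min(section_min, value)
        elif in_valley:
            # The valley section just ended: update the bound from the
            # average of past valley minima (plus the assumed margin).
            valley_sum += section_min
            valley_count += 1
            bound = valley_sum / valley_count + margin
            if seen_peak:
                yield wave           # wave[-1] is the cut point
                wave = []
            in_valley = False
            section_min = float("inf")
            seen_peak = True         # the current point opens the next wave
        else:
            seen_peak = True
        wave.append(point)
    # Points after the last cut form an incomplete wave and are not yielded.
```

Because the function is a generator that reads each item once, it fits the one-pass setting; a leading valley section is simply kept as the start of the first wave.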
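The slides do not say which segmentation algorithm is used, so the following is only a sketch of one common way to simplify a wave: a greedy sliding-window approach that extends a segment while every skipped point stays within `max_error` of the straight line between the segment's end points. The names `plr` and `max_error` are assumptions, and timestamps are assumed to be strictly increasing.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]      # (timestamp, value)
Segment = Tuple[Point, Point]    # (start point, end point)


def _fits(points: Sequence[Point], i: int, j: int, max_error: float) -> bool:
    """True if all points strictly between i and j lie near the line i -> j."""
    (t0, v0), (t1, v1) = points[i], points[j]
    for t, v in points[i + 1:j]:
        expected = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        if abs(v - expected) > max_error:
            return False
    return True


def plr(points: Sequence[Point], max_error: float) -> List[Segment]:
    """Greedy piecewise linear representation of one wave."""
    segments: List[Segment] = []
    start, n = 0, len(points)
    while start < n - 1:
        end = start + 1
        # Extend the segment as long as the error bound still holds.
        while end + 1 < n and _fits(points, start, end + 1, max_error):
            end += 1
        segments.append((points[start], points[end]))
        start = end
    return segments
```

A triangular wave of seven points, for example, collapses into two segments, one for the rise and one for the fall.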
Wave-pattern Matching • In real applications, two sequences are assumed to match if their paths roughly coincide • PLR segments record paths of old data • Testing whether the incoming stream items are on the paths • The intensity of variations can be determined by the number of matching items
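A rough sketch of the "is the item on the path?" test follows. It assumes a pattern is a list of line segments with times relative to the pattern start, aligns the incoming wave by its offset from its own first item, and counts an item as matching when it lies within a hypothetical `tolerance` of the interpolated path. The function name `match_ratio` and the alignment rule are illustrative assumptions, not the authors' stated procedure.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def match_ratio(segments: Sequence[Segment], wave: Sequence[Point],
                tolerance: float) -> float:
    """Fraction of wave items that lie on the pattern's path."""
    if not wave or not segments:
        return 0.0
    pattern_start = segments[0][0][0]
    wave_start = wave[0][0]
    matched = 0
    for t, v in wave:
        # Shift the item onto the pattern's time axis.
        offset = t - wave_start + pattern_start
        for (t0, v0), (t1, v1) in segments:
            if t0 <= offset <= t1:
                expected = v0 + (v1 - v0) * (offset - t0) / (t1 - t0)
                if abs(v - expected) <= tolerance:
                    matched += 1
                break  # only the first covering segment is checked
        # Items beyond the pattern's time range count as un-matched.
    return matched / len(wave)
```

Interpolating along the segment is what allows points to be compared with segments even when the two waves were sampled at different time points; the returned ratio can then serve as a measure of variation intensity.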
Record the Patterns • Observation: Many patterns differ from earlier ones in only a few segments • Most stream variations are gradual evolutions rather than burst mutations • Recording patterns in a simple list not only ignores their relationships but also causes storage redundancy • Utilize the similarity among patterns and reuse the unchanged parts • Pattern Growth Graph (PGG) is designed to store patterns and the variation history
Pattern Growth Graph • Implemented as a bi-directional linked list • New segments are generated only for the un-matched data • New patterns seem to grow from the old one • [Figure: Pattern 1 (base pattern, segments 1-8) and Pattern 2 (growth pattern, segments 1'-4') plotted as waves and linked in the graph between Start and End]
Construct Full Wave-pattern • New problem: wave-pattern matching needs the full pattern to compare against, while PGG stores only the new parts • Fortunately, the full pattern can be constructed by propagating the pointers • [Figure: Pattern 1 (segments 1-9), Pattern 2 (segments 1'-3') and Pattern 3 (segments 1"-2") linked between Start and End, with a step-by-step pointer propagation trace (columns: Pattern, Left 1, Right 1, Left 2, Right 2; Step 0, Step 1, Step 2, Final) that stops at a collision]
Problems for PGG size • Waves in data stream: n; PGG size: k • Time complexity of the PGG-based matching algorithm is O(k*n) • In the worst case, each incoming wave introduces a new pattern: overall time cost is O(n^2) • As PGG grows larger, the algorithm becomes time-consuming • PGG cannot use “forgetting functions” • It is hard to delete from PGG • Some uncommon patterns may have higher domain significance
Rank the Patterns • Observation: The most frequent pattern and the patterns similar to it have the highest probability of matching the incoming wave • Matching probability factor • Patterns with smaller probability are not deleted, but have lower priority when compared • When a pattern gets a match, the system increases not only its own rank but also the ranks of its “family” patterns
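The idea of storing only the new parts and rebuilding the full pattern on demand can be sketched as follows. This is a deliberately simplified, index-based stand-in, not the bi-directional linked list with pointer propagation described on the slides: each hypothetical `Pattern` keeps its own segments plus two assumed indices, `branch_at` and `rejoin_at`, that say where it leaves and rejoins its parent's full pattern.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pattern:
    own_segments: List[str]              # only the segments that changed
    parent: Optional["Pattern"] = None   # pattern this one grew from
    branch_at: int = 0                   # leading parent segments reused
    rejoin_at: int = 0                   # parent index where reuse resumes

    def full(self) -> List[str]:
        """Rebuild the complete segment sequence by walking up to the base."""
        if self.parent is None:
            return list(self.own_segments)
        base = self.parent.full()
        return base[:self.branch_at] + self.own_segments + base[self.rejoin_at:]


# Usage: a base pattern with eight segments and a growth pattern that
# replaces segments 3-5 with two new ones (labels are illustrative).
base = Pattern([str(i) for i in range(1, 9)])
growth = Pattern(["1'", "2'"], parent=base, branch_at=2, rejoin_at=5)
print(growth.full())
```

In this sketch the growth pattern stores two segments instead of seven, which is the storage saving the graph is designed for; the cost is that a full pattern must be reassembled before matching.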
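The quadratic worst case can be checked with a short sum. Assume, as a simplification, that matching the i-th wave costs one comparison per pattern already stored, and that every earlier wave added a new pattern, so i - 1 patterns exist when wave i arrives:

```latex
\sum_{i=1}^{n} (i-1) \;=\; \frac{n(n-1)}{2} \;=\; O(n^2)
```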
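The slides do not define the matching probability factor, so the following sketch substitutes a simple score: a match adds 1 to the pattern's own score and a smaller, hypothetical `family_boost` to each related pattern, and candidates are compared in descending score order. The class name `PatternRanker` and all parameter values are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Set


class PatternRanker:
    """Orders patterns for comparison; never deletes any of them."""

    def __init__(self, family_boost: float = 0.5) -> None:
        self.family_boost = family_boost
        self.scores: Dict[str, float] = {}
        self.family: Dict[str, Set[str]] = {}

    def add(self, name: str, relatives: Iterable[str] = ()) -> None:
        """Register a pattern and link it symmetrically to its relatives."""
        self.scores.setdefault(name, 0.0)
        self.family.setdefault(name, set())
        for other in relatives:
            self.scores.setdefault(other, 0.0)
            self.family[name].add(other)
            self.family.setdefault(other, set()).add(name)

    def record_match(self, name: str) -> None:
        """Raise the rank of the matched pattern and of its family."""
        self.scores[name] += 1.0
        for other in self.family[name]:
            self.scores[other] += self.family_boost

    def comparison_order(self) -> List[str]:
        """Patterns to try first come first; ties keep insertion order."""
        return sorted(self.scores, key=lambda n: self.scores[n], reverse=True)
```

Low-scoring patterns stay in the order list, only later, which matches the requirement that uncommon patterns are kept because they may carry domain significance.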
Reconstruct the Stream View with PGG • Queries on traditional DSMS • Predefined; hard to answer once the data items have passed by • Example: answer “the patient's ECG in the past five hours” • Record all patterns’ occurrence times in PGG • Reconstruct the stream view with PGG patterns • Consumes only about 4% of the original stream's storage space, yet provides an approximate stream view within a 5% relative error bound
Track Pattern Evolution • To answer “Why does it change in this way?” • When the user selects a pattern of interest, PGG can trace its source
False Alarm • A successful system needs to reduce the false alarms caused by noise • The major problem: noise comes from many sources, takes various forms and is hard to model
Noise Recognition • A shortcut: consider the pattern’s evolution history • Some strategies to reduce false alarms on medical streams: • Unusual values in growth patterns: the patient's condition has worsened -- Warning • A new pattern that matches successive waves: the underlying pathological mechanism may have changed fundamentally -- Warning • A series of new patterns that all fail to match the previous/following waves -- suspected noise
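Reconstruction from recorded occurrence times can be sketched as a lookup. The sketch assumes a hypothetical log of (start time, pattern id) pairs sorted by start time, and patterns stored as line segments with times relative to the pattern start; the function name `value_at` and this storage layout are illustrative, not taken from the slides.

```python
import bisect
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def value_at(t: float,
             occurrences: Sequence[Tuple[float, str]],
             patterns: Dict[str, List[Segment]]) -> Optional[float]:
    """Approximate stream value at time t from patterns and their start times."""
    starts = [start for start, _ in occurrences]
    i = bisect.bisect_right(starts, t) - 1   # last occurrence starting at or before t
    if i < 0:
        return None                          # t precedes the recorded stream
    start, pattern_id = occurrences[i]
    offset = t - start
    for (t0, v0), (t1, v1) in patterns[pattern_id]:
        if t0 <= offset <= t1:
            return v0 + (v1 - v0) * (offset - t0) / (t1 - t0)
    return None                              # t falls past the end of that pattern
```

A query such as "the past five hours" then becomes a loop over the requested time range, and the answer is approximate because it is interpolated from the simplified patterns rather than read from the raw items.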
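Two of the strategies above can be turned into a small labeling rule, shown here as a sketch under strong assumptions: each wave has already been assigned a pattern id, `known` holds the patterns seen before the window, and a new pattern counts as "matching successive waves" when the neighbouring wave has the same id. The check for unusual values in growth patterns is omitted because the slides give no threshold. The function name `label_waves` is hypothetical.

```python
from typing import List, Sequence, Set


def label_waves(pattern_ids: Sequence[str], known: Set[str]) -> List[str]:
    """Label each wave as normal, warning, or suspected noise."""
    labels: List[str] = []
    for i, pid in enumerate(pattern_ids):
        if pid in known:
            labels.append("normal")          # matched an existing pattern
            continue
        prev_same = i > 0 and pattern_ids[i - 1] == pid
        next_same = i + 1 < len(pattern_ids) and pattern_ids[i + 1] == pid
        if prev_same or next_same:
            labels.append("warning")         # new pattern that persists
        else:
            labels.append("suspected noise") # isolated new pattern
    return labels
```

Because the rule looks at the following wave, an online system using it would have to delay its verdict on a new pattern by one wave.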
Summary • Introduction • Related Work • Variation Management for Pseudo Periodical Stream • Experiments • Conclusion
Experimental Setup • Data Set • Medical streams: Six real pathology signals including ECG, respiration... (over 25,000,000 data points) • Earthquake waves: Pacific earthquake wave data from the NGA project (100,000 data points) • Sunspot data: All the sunspot records between the years 1850 and 2001 (55,000 data points) • Environment: Intel Pentium 4 3.0GHz CPU with 1GB RAM, Windows XP Professional, JDK 1.5.0…
Effect of Rank Function • At the beginning, the effect is insignificant • After three million data points, the naive algorithm’s performance decreases rapidly • In the end, the rank algorithm outperforms the naive algorithm by about 300%
Reconstruct the Stream View • The ECG data stream (more than 10M data items) can be represented with only 420 patterns • This compression result is due to two factors • PLR simplification reduces the size of the patterns to about 20% • PGG further reduces it to about 3.31% by compressing the repeating and similar patterns (the patterns need only 0.3%; the remaining 3% stores the occurrence times of the patterns)
Compared with Other Methods • Compared PGG with SAX (symbolic approaches), Discrete Haar Wavelet Transformation (mathematical transformation) and Zigzag (predefined models) • The average processing throughput is 60K-70K items/sec • Much higher than real applications need
Variation Detection & Noise Recognition • Two important measurements: • Sensitivity (High Positive Rate): The algorithm sends alarms at meaningful variations • Selectivity (Low Negative Rate): The algorithm does not send false alarms on noise • The two measurements conflict • Increasing sensitivity to find more variations will inevitably cause more false alarms • In a medical environment, sensitivity is much more important -- missing a meaningful variation may cost the patient’s life
Best Results of Sensitivity on Respiration Stream • Zigzag sends a false alarm at almost every noise section • DWT and SAX can hardly distinguish real variations from noise
Results of Noise Recognition on Other Streams • For the other streams, we take precision as the main measurement • PGG performs accurately and stably • Zigzag is volatile across datasets: • Good on three blood pressure signals (ABP, CVP and ICP; meaningful variations are outliers) • Poor on PLETH (meaningful variations are in the inner structure)
Discussion • Zigzag: focuses on extreme data points, strongly influenced by outliers • SAX: good at finding variations over a long period using frequency statistics -- more suitable for time series • DWT: only effective for signals with strict periods • With its effective data structure, PGG discovers and records as many features of the data stream as possible • The recorded information helps distinguish between meaningful variations and noise
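The two measurements can be computed from labeled outcomes, as in the sketch below. It assumes each evaluated section is recorded as a pair (is a meaningful variation, alarm raised) and uses the standard definitions: sensitivity as the share of meaningful variations that triggered an alarm, and selectivity as the share of noise sections that triggered none. The slides do not give formulas, so these definitions and the function name are assumptions.

```python
from typing import Sequence, Tuple


def sensitivity_selectivity(events: Sequence[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (sensitivity, selectivity) for (is_meaningful, alarm_raised) pairs."""
    tp = sum(1 for real, alarm in events if real and alarm)          # caught variations
    fn = sum(1 for real, alarm in events if real and not alarm)      # missed variations
    tn = sum(1 for real, alarm in events if not real and not alarm)  # quiet on noise
    fp = sum(1 for real, alarm in events if not real and alarm)      # false alarms
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    selectivity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, selectivity
```

Lowering the alarm threshold moves sections from the "missed" count to the "caught" count but also from "quiet on noise" to "false alarms", which is the conflict the slide describes.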
Summary • Introduction • Related Work • Variation Management for Pseudo Periodical Stream • Experiments • Conclusion
Conclusion • Streams are split as waves and represented by PLR patterns • Detect variations by online wave-pattern matching • Pattern Growth Graph stores the variation history • Reconstruct the stream view with high accuracy • Effectively distinguish meaningful variations from noises
Future Work • Extend PGG to multiple streams • Implement the PGG method in other application domains such as weather forecasting and financial analysis • Combine with other methods, like Zigzag…
Thank You Very Much! Questions and suggestions are welcome