
Effective Variation Management for Pseudo Periodical Streams




  1. Effective Variation Management for Pseudo Periodical Streams Lv-an Tang, Bin Cui, Hongyan Li, Gaoshan Miao, Dongqing Yang, Xinbiao Zhou School of EECS Peking University

  2. Summary • Introduction • Related Work • Variation Management for Pseudo Periodical Stream • Experiments • Conclusion

  3. Pseudo Periodical Stream • Pseudo Periodical Stream • The data appears to repeat over a certain period • Tiny variations exist between different periods • Common in domains such as medicine and seismology • Typical stream variations are gradual evolutions rather than burst changes

  4. An Example of Pseudo Periodical Stream • The respiration data repeats about every 3.2 seconds • It reflects the evolution of the patient’s illness over five hours

  5. Variation Management on Data Stream • Data streams are widely applied in many domains • Stock market analysis • Road traffic control • Medical signal processing • Online variation management -- an important task • When did the variation occur? (Detect variations) • What is the variation? / How does it change? (Describe variations) • Why does it change in this way? (Help understand variations)

  6. Major Technical Challenges • Value Type • Traditional algorithms: discrete values (enumerative) or time series (equidistant intervals) • Data stream: continuous real values with variable sampling frequencies • Training Sets or Models • Traditional algorithms rely on training sets or predefined models • A data stream evolves, so such models may soon stop working • Instead, the system is required to generate such models as its output

  7. Major Technical Challenges II • Variation Type • Not only abnormal values and distributions • But also the structure within a period (the wave shape) • Noises: unpredictable and random • In many applications the variations are still monitored manually • Our contribution: a new method named Pattern Growth Graph (PGG) to detect and store variations over pseudo periodical streams

  8. Summary • Introduction • Related Work • Variation Management for Pseudo Periodical Stream • Experiments • Conclusion

  9. Data Stream Management Systems • Data stream work can be loosely classified into two categories: DSMS and online data mining • Data Stream Management Systems (DSMS) • Such as STREAM, Aurora, TelegraphCQ… • Mainly focus on answering predefined SQL queries • They do not try to discover data features or monitor variations

  10. Online data mining • Variation management is an important part of online data mining • Three classes according to the algorithms • Symbolic approaches • Mathematical transformations • Predefined models • Symbolic approaches: Tarzan and SAX • Space: they keep the entire time series/data stream in memory • SAX’s precision is limited

  11. Mathematical Transformation • Mathematical transformations: Discrete Wavelet Transform (DWT) and Fast Fourier Transform (FFT) • Require a fixed data length and a fixed sampling frequency (equidistant intervals) • The Haar wavelet transform can only operate on 2^n data items, e.g., the data length must be 1024 or 2048 • Predefined models: using Zigzag to detect events in financial streams (SIGMOD 04) • Too domain specific • Users cannot provide such models in advance; in fact, they want them as the output

  12. Summary • Introduction • Related Work • Variation Management for Pseudo Periodical Stream • Experiments • Conclusion

  13. Task Specification by Respiration Stream • Variation: detect stream variations online in one pass • Wave: the smallest unit of concern is not a single point but the values within a certain period, represented as a wave • Alarms: wave F is actually noise caused by body movements • Summary: a summary with an acceptable error bound is very helpful

  14. System Framework

  15. Wave Splitting I • Variation: the difference from old data • Detected by comparing the old data with the incoming stream • Comparing at every incoming item wastes too many resources • Comparing once per wave is much more efficient • How can the stream be divided according to its data features?

  16. Wave Splitting II • A fixed-length window accumulates error • Observation: waves start and end at valley points, whose values are smaller than a certain bound

  17. Upper Bound of Valley Points • Initially defined by the user • Updated with the average value of past valley points

  18. Valley Sections • Valley section: an approximately flat section representing the time interval between two events • It is also worth studying as one part of the wave • Take the last point of the section as the cut point
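
  The slides do not give pseudo-code for the splitter; the following is a minimal Python sketch of the valley-based splitting described above, assuming the stream is an iterable of (time, value) pairs and using the adaptive bound from the previous slide (all names are hypothetical):

      def split_waves(stream, init_bound):
          """Split a stream into waves; each wave ends at the last point of a
          valley section (values below the current upper bound)."""
          bound = init_bound            # upper bound of valley points, user-defined at first
          valley_values = []            # past valley values, used to update the bound
          wave, in_valley = [], False
          for t, v in stream:
              if v <= bound:
                  in_valley = True      # inside an approximately flat valley section
                  valley_values.append(v)
              elif in_valley:
                  # the valley section just ended: its last point is the cut point
                  yield wave
                  wave = [wave[-1]]     # the next wave starts from the cut point
                  in_valley = False
                  bound = sum(valley_values) / len(valley_values)   # adapt the bound
              wave.append((t, v))
          if wave:
              yield wave                # emit the (possibly partial) final wave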

  19. Two Problems in Online Matching I • Problem 1: The data stream’s sampling frequency is usually high (>100 Hz), so waves should be simplified • Problem 2: How to compare two waves that have different time lengths and may not have data at the same time points? • A: {(10, 0.5), (20, 1.0), (25, 1.3), … (90, 50.5)} 22 data items • B: {(11, 0.5), (25, 1.2), (30, 1.7), … (87, 50)} 20 data items

  20. Two Problems in Online Matching II • Solution 1: Piecewise Linear Representation (PLR) • This makes Problem 2 harder: patterns are simplified into segments, so how do we compare segments with points?
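
  The slides name Piecewise Linear Representation but do not specify the exact segmentation algorithm used; below is a minimal sliding-window PLR sketch (Python, hypothetical names) that extends each segment while every point stays within a maximum error of the line joining the segment’s endpoints:

      def plr(points, max_err):
          """Greedy sliding-window PLR: each segment is stored as its two endpoints."""
          if len(points) < 2:
              return []
          segments, start = [], 0
          for end in range(2, len(points) + 1):
              if end == len(points) or _max_deviation(points[start:end + 1]) > max_err:
                  segments.append((points[start], points[end - 1]))   # close the segment
                  start = end - 1
          return segments

      def _max_deviation(pts):
          """Largest vertical distance of any point from the line through
          the first and last points of pts."""
          (t0, v0), (t1, v1) = pts[0], pts[-1]
          if t1 == t0:
              return 0.0
          slope = (v1 - v0) / (t1 - t0)
          return max(abs(v - (v0 + slope * (t - t0))) for t, v in pts)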

  21. Wave-pattern Matching • In real applications, two sequences are assumed to match if their paths roughly coincide • PLR segments record the paths of old data • Test whether the incoming stream items lie on those paths • The intensity of the variation can be determined by the number of matching items
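
  A minimal sketch of this path test, assuming the stored segments and the incoming wave share the same relative time axis and that "on the path" means within a fixed vertical tolerance tol (both assumptions are mine, not stated on the slide):

      def match_ratio(segments, wave, tol):
          """Fraction of incoming (time, value) items that lie on the stored PLR
          paths; segments is a list of ((t0, v0), (t1, v1)) pairs."""
          hits = 0
          for t, v in wave:
              for (t0, v0), (t1, v1) in segments:
                  if t0 <= t <= t1:
                      # interpolate the segment's value at time t
                      expected = v0 if t1 == t0 else v0 + (v1 - v0) * (t - t0) / (t1 - t0)
                      if abs(v - expected) <= tol:
                          hits += 1
                      break
          return hits / len(wave) if wave else 0.0

  A higher ratio means the incoming wave follows an old pattern closely; a low ratio signals a variation.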

  22. Record the Patterns • Observation: many patterns have only a few of their segments changed • Most stream variations are gradual evolutions rather than burst mutations • Recording patterns in a simple list not only ignores their relationships but also causes storage redundancy • Instead, exploit the similarity among patterns and reuse the unchanged parts • The Pattern Growth Graph (PGG) is designed to store the patterns and the variation history

  23. Pattern Growth Graph • Implemented as a bi-directional linked list • New segments are generated only for the un-matched data • New patterns seem to grow from the old ones
  [Figure: Pattern 1 (base pattern, segments 1-8, from Start to End) and Pattern 2 (growth pattern) share the unchanged segments; only the changed segments 1', 2', 3', 4' are stored for Pattern 2]
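
  The slide describes the PGG as a bi-directional linked list in which growth patterns store only their new segments. One possible in-memory layout is sketched below (Python; all names are hypothetical and this is not the authors' actual implementation):

      class Segment:
          """One PLR segment; shared by a base pattern and the growth patterns that reuse it."""
          def __init__(self, start, end):
              self.start, self.end = start, end   # (time, value) endpoints
              self.next = {}   # pattern id -> following segment (forward pointer)
              self.prev = {}   # pattern id -> preceding segment (backward pointer)

      class Pattern:
          """A wave-pattern node in the PGG."""
          def __init__(self, pid, base=None):
              self.pid = pid
              self.base = base          # pattern this one grew from (None for a base pattern)
              self.children = []        # growth patterns derived from this one
              self.new_segments = []    # only the segments that differ from the base
              self.occurrences = []     # timestamps of waves that matched this pattern
              self.rank = 0.0           # matching probability factor (see slide 26)
              if base is not None:
                  base.children.append(self)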

  24. Construct Full Wave-pattern • New problem: wave-pattern matching needs the full pattern to compare against, while the PGG only stores the new parts • Fortunately, the full pattern can be constructed by propagating the pointers
  [Figure: worked example of building Pattern 3's full pattern by propagating its left/right pointers step by step through Pattern 2 into Pattern 1 until the pointers collide]
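
  Building on the sketch above, and simplifying the pointer propagation to the case where a growth pattern's new segments cover one contiguous time span (the slide's version walks left/right pointers until they collide), the full pattern could be assembled roughly as follows:

      def full_pattern(pattern):
          """Assemble a pattern's complete segment list by combining its own new
          segments with the untouched segments inherited from its base pattern(s)."""
          if pattern.base is None:
              return list(pattern.new_segments)      # a base pattern is already complete
          inherited = full_pattern(pattern.base)     # recurse toward the base pattern
          if not pattern.new_segments:
              return inherited
          lo = pattern.new_segments[0].start[0]      # start time of the changed span
          hi = pattern.new_segments[-1].end[0]       # end time of the changed span
          kept = [s for s in inherited if s.end[0] <= lo or s.start[0] >= hi]
          return sorted(kept + pattern.new_segments, key=lambda s: s.start[0])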

  25. Problems for PGG size • Number of waves in the data stream: n; PGG size: k • The time complexity of the PGG-based matching algorithm is O(k·n) • In the worst case, every incoming wave introduces a new pattern, so the overall time cost is O(n²) • When the PGG grows large, the algorithm becomes time-consuming • The PGG cannot simply adopt “forgetting functions” • Deletion in a PGG is hard • Some uncommon patterns may have high domain significance

  26. Rank the Patterns • Observation: the most frequent pattern and its similar patterns have the highest probability of matching the incoming wave • Each pattern carries a matching probability factor • Patterns with smaller probability are not deleted; they simply get lower priority for comparison • When one pattern gets a match, the system increases not only its own rank but also the ranks of its “family”
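
  The exact matching probability factor is not reproduced on the slide; the sketch below only illustrates the idea of boosting a matched pattern together with its family, using a hypothetical decay weight and the Pattern class sketched earlier:

      def update_ranks(matched, decay=0.5):
          """Raise the rank of the matched pattern and, with decreasing weight,
          the ranks of its base patterns and direct growth patterns."""
          boost, p = 1.0, matched
          while p is not None:              # walk up the chain of base patterns
              p.rank += boost
              boost *= decay
              p = p.base
          for child in matched.children:    # growth patterns derived from the matched one
              child.rank += decay

      def match_order(patterns):
          """Patterns are never deleted; low-rank ones are simply tried later."""
          return sorted(patterns, key=lambda p: p.rank, reverse=True)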

  27. Reconstruct the Stream View with PGG • Queries on a traditional DSMS are predefined and hard to answer once the data items have passed by • Example: answer “show the patient's ECG for the past five hours” • Record every pattern’s occurrence times in the PGG • Reconstruct the stream view from the PGG patterns • This consumes only about 4% of the original stream’s storage space, yet provides an approximate stream view within a 5% relative error bound
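
  A rough sketch of the reconstruction step under my earlier assumptions (segment times are relative to the wave start, occurrence timestamps are absolute, and full_pattern is the helper sketched above):

      def reconstruct_view(patterns, t_start, t_end):
          """Rebuild an approximate stream view for [t_start, t_end] by replaying,
          in time order, the full pattern of every recorded occurrence."""
          occurrences = [(t, p) for p in patterns for t in p.occurrences
                         if t_start <= t <= t_end]
          occurrences.sort(key=lambda tp: tp[0])
          view = []
          for t, p in occurrences:
              for seg in full_pattern(p):
                  # shift the pattern's relative times to the absolute occurrence time t
                  view.append(((t + seg.start[0], seg.start[1]),
                               (t + seg.end[0], seg.end[1])))
          return view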

  28. Track Pattern Evolution • To answer “Why does it change in this way?” • The user selects an interesting pattern, and the PGG tracks it back to its source

  29. False Alarm • A successful system needs to reduce the false alarms introduced by noise • The major problem: noises come from many sources, take various forms, and are hard to model

  30. Noise Recognition • A shortcut: consider the pattern’s evolution history • Some strategies to reduce false alarms on medical streams: • Unusual values in growth patterns: the patient’s condition has worsened -- warning • A new pattern that matches successive waves: the underlying pathological mechanism might have changed fundamentally -- warning • A series of new patterns that all fail to match the previous/following waves -- suspected to be noise
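
  A toy, rule-based restatement of these three strategies (Python; the boolean inputs are hypothetical signals that a real system would derive from the PGG evolution history):

      def classify(is_new, matches_neighbors, has_unusual_values, recent_new_unmatched):
          """Map a wave's evolution-history signals to a warning / noise decision."""
          if not is_new and has_unusual_values:
              # unusual values inside a growth pattern: the condition has worsened
              return "WARNING"
          if is_new and matches_neighbors:
              # a brand-new pattern confirmed by successive waves: a fundamental change
              return "WARNING"
          if is_new and not matches_neighbors and recent_new_unmatched:
              # an isolated run of unmatched new patterns: probably body-movement noise
              return "SUSPECTED NOISE"
          return "NORMAL"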

  31. System Framework

  32. Summary • Introduction • Related Work • Variation Management for Pseudo Periodical Stream • Experiments • Conclusion

  33. Experimental Setup • Data Set • Medical streams: six real pathology signals including ECG, respiration... (over 25,000,000 data points) • Earthquake waves: the Pacific earthquake wave data from the NGA project (100,000 data points) • Sunspot data: all sunspot records between 1850 and 2001 (55,000 data points) • Environment: Intel Pentium 4 3.0GHz CPU with 1GB RAM, Windows XP Professional, JDK 1.5.0…

  34. Effect of Rank Function • At the beginning, the effect is insignificant • After three million data points, the naive algorithm’s performance degrades rapidly • In the end, the ranked algorithm outperforms the naive one by about 300%

  35. Reconstruct the Stream View • The ECG data stream (more than 10M data items) can be represented with only 420 patterns • This compression is achieved by two factors • PLR simplification reduces the patterns to about 20% of the original size • The PGG further reduces this to about 3.31% by compressing repeating and similar patterns (the patterns themselves need only 0.3%; the remaining 3% stores their occurrence times)

  36. Compared with Other Methods • We compared PGG with SAX (symbolic approach), the Discrete Haar Wavelet Transform (mathematical transformation) and Zigzag (predefined model) • The processing rate averages 60K-70K items/sec • Much higher than real applications need

  37. Variation Detection & Noise Recognition • Two important measurements: • Sensitivity (high true-positive rate): the algorithm sends alarms at meaningful variations • Selectivity (low false-alarm rate): the algorithm does not send false alarms on noise • The two measurements conflict • Increasing sensitivity to find more variations inevitably causes more false alarms • In a medical environment, sensitivity is much more important -- missing a meaningful variation may cost the patient’s life

  38. Best Results of Sensitivity on Respiration Stream • Zigzag sends a false alarm at almost every noise section • DWT and SAX can hardly distinguish real variations from noise

  39. Results of Noise Recognition on Other Streams • For the other streams, we take precision as the main measurement • PGG performs accurately and stably • Zigzag is volatile across datasets: • Good on the three blood-pressure signals (ABP, CVP and ICP, where meaningful variations are outliers) • Poor on PLETH (where meaningful variations lie in the inner structure)

  40. Discussion • Zigzag: focuses on extreme data points, so it is strongly influenced by outliers • SAX: good at finding features over long periods using frequency statistics -- more suitable for time series • DWT: only effective for signals with strict periods • With its effective data structure, PGG discovers and records as many features of the data stream as possible • The recorded information helps distinguish meaningful variations from noise

  41. Summary • Introduction • Related Work • Variation Management for Pseudo Periodical Stream • Experiments • Conclusion

  42. Conclusion • Streams are split into waves and represented by PLR patterns • Variations are detected by online wave-pattern matching • The Pattern Growth Graph stores the variation history • The stream view can be reconstructed with high accuracy • Meaningful variations are effectively distinguished from noise

  43. The System Interface

  44. Future Work • Extend PGG to multiple streams • Implement the PGG method in other application domains such as weather forecasting and financial analysis • Combine with other methods, like Zigzag…

  45. Thank You Very Much! Questions and suggestions are welcome
