This paper presents a real-time streaming pattern detection system for eCommerce that uses a Domain Specific Language (DSL) to express customer behaviours and Spark for processing. The system compiles behaviour patterns into Deterministic Finite Automata (DFAs) and detects them efficiently over event streams. Evaluation results show that the system scales to large volumes of customers and events, detecting patterns for millions of customers with subsecond latency.
Real Time Streaming Pattern Detection for eCommerce
AUTHORS - William Braik, Floréal Morandat, Jean-Rémy Falleri, Xavier Blanc
PRESENTED BY KRITI NARSAPUR (Student id: 1294630)
Contents • Introduction • Background • Pattern Detection • Evaluation • Conclusion
Introduction • Pattern detection over event streams • Challenges of real-time pattern detection: • Efficiency • Scalability • Existing approach – measures web traffic in a batch fashion
Introduction contd. • Experimented approach: • Domain Specific Language (DSL) – expresses customers' behaviours • DSL semantics – a compilation process transforms patterns into Deterministic Finite Automata (DFAs) • Spark – Big Data streaming platform – runs the pattern detection algorithm in real time • Cdiscount requirement: • Handle customer behaviour detection – 1 million customers send around 400 events each per day – latency < 1 sec
Background: Cdiscount architecture • Each event of the stream is a pair e = (t, d)
Pattern detection: A DSL to express behaviour patterns • Patterns: sequences of events • An event is matched according to its action type • The DSL also supports the complement of an action type • Two non-contiguous operators, which ignore all events that do not match the pattern: • FollowedBy • KleenePlus (+) • Time constraints: Interval (operator) and Window (pattern) • Data constraints • Negative Acceptation Condition (NAC)
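To make the operators above concrete, here is a minimal Python sketch that models the DSL's constructs as combinator objects. The names (`Action`, `KleenePlus`, `FollowedBy`, `Window`) mirror the slide's vocabulary but are illustrative only, not the paper's actual DSL syntax.

```python
# Hypothetical sketch of the DSL's pattern operators as Python
# combinators; the real DSL described in the paper has its own syntax.
from dataclasses import dataclass


@dataclass
class Action:
    """Match a single event by its action type."""
    name: str


@dataclass
class KleenePlus:
    """One or more repetitions (the slide's KleenePlus+ operator)."""
    inner: object


@dataclass
class FollowedBy:
    """Non-contiguous sequence: non-matching events are ignored."""
    left: object
    right: object


@dataclass
class Window:
    """Time constraint over the whole pattern."""
    pattern: object
    seconds: int


# "A customer views one or more products, then exits, within 10 minutes."
pattern = Window(FollowedBy(KleenePlus(Action("View")), Action("Exit")),
                 seconds=600)
print(pattern)
```

Modelling patterns as a small expression tree like this is what makes the compilation step on the next slide possible: the tree can be walked to emit NFA states and transitions.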
Pattern detection: From patterns to automata • DFAs are used to detect patterns (the NFA is only an intermediate representation) • Translate each pattern into its corresponding DFA • Run one DFA per customer • Memory usage is proportional to (number of simultaneous customers) × (number of patterns to detect) • Two-step transformation: • Generate the NFA • Convert the NFA into the corresponding DFA • Run the pattern detection using Spark • Example pattern – P : View + Exit
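The detection step can be sketched as a transition table plus a per-customer current state. This toy DFA encodes the slide's example pattern P : View + Exit (one or more View events followed by an Exit); the transition table and the non-contiguous "ignore unmatched events" behaviour are assumptions for illustration, not the paper's exact construction.

```python
# Toy DFA for P : View + Exit.
# States: 0 = start, 1 = one or more View seen, 2 = accepting.
DFA = {
    (0, "View"): 1,
    (1, "View"): 1,
    (1, "Exit"): 2,
}
ACCEPT = {2}


def step(state, event_type):
    # Non-contiguous semantics (as in FollowedBy / KleenePlus+):
    # events that match no transition are simply ignored.
    return DFA.get((state, event_type), state)


def detect(events):
    """Run the DFA over one customer's event stream."""
    state = 0
    for event_type in events:
        state = step(state, event_type)
        if state in ACCEPT:
            return True
    return False


print(detect(["View", "Search", "View", "Exit"]))  # pattern matched
print(detect(["Exit", "Exit"]))                    # never matched
```

Because each running automaton only needs its current-state integer, memory grows linearly with customers × patterns, exactly as the slide states.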
Evaluation • Goal – assess whether a given cluster of machines, with a set pool of memory and CPU resources, is capable of detecting patterns efficiently • Total number of automata that the engine runs, A = C * P • C : number of simultaneous customers handled by the system • P : number of patterns to be observed • Throughput of events, T = C * E • E : event rate per customer • Measure the maximum values of A and T that the given cluster supports
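Plugging the Cdiscount figures from the introduction into these two formulas gives a quick back-of-the-envelope capacity estimate. The customer count and per-day event rate come from the earlier slide; the number of patterns P is an assumed value for illustration.

```python
# Capacity estimate from the slide's formulas A = C * P and T = C * E.
C = 1_000_000        # simultaneous customers (from the Cdiscount slide)
P = 3                # patterns monitored per customer (assumed value)
E = 400 / 86_400     # ~400 events per customer per day -> events/second

A = C * P            # automata the engine must keep in memory
T = C * E            # aggregate event throughput (events per second)
print(A, round(T))
```

The resulting throughput of roughly 4,600 events per second is consistent with the ~5,000 events-per-second workload used on the results slide.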
Evaluation contd.: Protocol • Maximum of A – simulate the creation of new automata until the performance criterion is no longer met • Maximum of T – generate a stream with a given throughput and check how the system performs under stress • Phase I – creates as many automata as needed to reach A • Measures the memory footprint • Phase II – keeps the system working without increasing the number of automata in memory • Measures the maximum detection latency
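The two-phase protocol can be outlined as a simple benchmark loop. This is a scaled-down sketch, not the paper's harness: the target count, the one-integer automaton representation, and the dummy transition are all assumptions.

```python
# Illustrative two-phase benchmark mirroring the slide's protocol.
import time

TARGET_A = 100_000   # scaled-down automaton count for this sketch

# Phase I: create as many automata as needed to reach the target, so
# the memory footprint can be measured (each automaton here is just
# its current-state integer, keyed by customer id).
automata = {}
for cid in range(TARGET_A):
    automata[cid] = 0

# Phase II: keep the system working without growing the automaton
# pool, and time a full detection pass to estimate latency.
start = time.perf_counter()
for cid in automata:
    automata[cid] = (automata[cid] + 1) % 3   # dummy DFA transition
latency = time.perf_counter() - start
print(len(automata))
```

Separating the two phases matters: Phase I stresses memory (how many automata fit), while Phase II stresses latency at a fixed memory footprint, matching the two criteria the paper measures.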
Evaluation contd.: Results • Ran 30 configurations in total: • T ∈ {1000, 2500, 5000, 10000, 15000, 20000, 30000, 35000, 40000} • A ∈ {0.5M, 1M, 2M} • Phase I – detection latency increases • Phase II – the curve stabilizes (T = 5000 events per second)
Conclusion • This study provided: • A DSL – for expressing behaviour patterns • A compiler – to translate them into DFAs • A detection engine • Experimental results showed that, at 5000 events per second, the system can handle: • 1 million customers with subsecond detection latency • 2 million customers with latency lower than 2 seconds