Mining Logs for Long-Term Patterns Boris Novikov, Elena Michailova, Ekaterina Ivannikova, Alice Pigul Saint Petersburg University
Motivation Goal: • To predict system behavior based on knowledge extracted from data Solution: • Short-term patterns • Long-term patterns Applicability: • Storage system performance • Financial market movements • Geographical and ecological prediction
Some Data are Needed System logs: • Low-level system I/O and other storage system logs • Application logs (e.g. DBMS performance monitoring) • Transaction logs Problems: • Huge volume • Hard to get realistic data
The Data • A production DB for a medium-size company • The business area: sea transportation • DBMS: Oracle 10g • Database size: approx. 90 GB (operational) • Query execution statistics • Only summary data were available
The Data Structure • The fields: • SQL id • Elapsed time • Executions • CPU • Start interval time • End interval time • The data are aggregated over 1-hour intervals • Query IDs are available, but not the SQL text
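As a rough sketch of how one hourly record could be represented (field names are illustrative, not the actual Oracle statistics column names):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class QuerySnapshot:
    """One row of hourly query execution statistics."""
    sql_id: str            # opaque query identifier (no SQL text available)
    elapsed_time: float    # total elapsed time within the interval, seconds
    executions: int        # number of executions within the interval
    cpu_time: float        # total CPU time within the interval, seconds
    interval_start: datetime
    interval_end: datetime
```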
Pattern • Queries • indicate business functions; • the links to data may be found via query parameters. • Group of queries • might indicate business processes. • A pattern is a set of queries with significant resource consumption at the same or close intervals
Why is it Hard? • Several business processes are interactive and hence chaotic • The processes might be too small or too large
Algorithm • Preparing data • Looking for Patterns • Finding Periods • Validation
Cleaning • Remove queries which are not helpful but produce significant workload • Chaotic queries with high intensity • Uniformly distributed queries • Trivial periods (working hours) • Anomalies (occasional very high workload)
How to Clean? Queries producing • Nearly constant workload for (almost all) snapshots • Nearly constant ratio to total workload for the snapshot • Anomalies Approach • Variance • Frequency of occurrence
Removing Chaotic Queries The minimum variance threshold is 0.5; drop no more than 9% of queries
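A minimal sketch of one plausible reading of this step: a query is treated as chaotic when the variance of its mean-normalized workload exceeds the 0.5 threshold, and at most 9% of all queries are dropped. The normalization and the exact ranking rule are assumptions, not taken from the slides.

```python
import numpy as np

def drop_chaotic_queries(workload, min_variance=0.5, max_drop_ratio=0.09):
    """workload: dict mapping sql_id -> np.array of per-snapshot load values.
    Drops at most max_drop_ratio of the queries, choosing those whose
    normalized variance is highest (above min_variance)."""
    variances = {}
    for sql_id, series in workload.items():
        mean = series.mean()
        if mean > 0:
            variances[sql_id] = (series / mean).var()  # scale-free variance
    # rank most chaotic (highest variance) first
    ranked = sorted(variances, key=variances.get, reverse=True)
    limit = int(len(workload) * max_drop_ratio)
    dropped = {q for q in ranked[:limit] if variances[q] >= min_variance}
    return {q: s for q, s in workload.items() if q not in dropped}
```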
Algorithm • Preparing data • Looking for Patterns • Finding Periods • Validation
Mining Patterns • Patterns: • related queries are grouped together. • The number of patterns depends on: • number of queries in a group; • correlation measure.
The Interconnectedness of the Data A correlation measure M is computed: • between queries q1 and q2 • when adding a query q to a group G • New groups are deleted when M < threshold
The Impact of the Threshold M > 0.6, which corresponds to more than 4/5 of the snapshots
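The slides do not give the formula for M; the note that M > 0.6 corresponds to more than 4/5 of the snapshots suggests a co-occurrence-style measure. A hedged sketch using a Jaccard-like overlap of active snapshots (an assumption, not the authors' actual definition):

```python
def cooccurrence(active_a, active_b):
    """active_a, active_b: sets of snapshot indices where each query is
    active (significant resource consumption). A Jaccard-style measure
    in [0, 1]; 1 means the queries always appear in the same snapshots."""
    if not active_a or not active_b:
        return 0.0
    return len(active_a & active_b) / len(active_a | active_b)

def can_join_group(group, candidate, threshold=0.6):
    """Admit a candidate query (its active-snapshot set) into a group only
    if its measure against every current member stays above the threshold."""
    return all(cooccurrence(member, candidate) >= threshold for member in group)
```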
Algorithm • Preparing data • Looking for Patterns • Finding Periods • Validation
Periods • Transform patterns to a binary sequence: If ({q1, q2, ..., qn} ⊆ snapshot_i) then Binary[i] = 1 else Binary[i] = 0 • Cycle c = (p, o), where p is the period and o is the offset. Example: a ship arrives at the port every Friday: p = 7 days, o = 5th day, cycle c = (7, 5)
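A minimal sketch of this binarization, assuming each snapshot is given as the set of query IDs active in it:

```python
def to_binary(pattern, snapshots):
    """pattern: set of query ids; snapshots: list of sets of query ids.
    Binary[i] = 1 iff every query of the pattern occurs in snapshot i."""
    return [1 if pattern <= snap else 0 for snap in snapshots]
```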
Algorithms • Exact periods • Approximate periods: allow missing or extra entries
Exact periods mining • Detection of cycles with 100% support: 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 – cycle (4, 1) • All possible candidate cycles (p, o) are generated (Cand). • The binary sequence is scanned once and all non-periodic cycles are excluded from Cand; the cycles remaining in Cand are the periodic ones:
if BinSeq[i] = 0
  for (p = P_min; p <= P_max; p++)
    Cand.delete(p, i mod p)
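A runnable version of this elimination scheme, translating the slide's pseudocode directly: every candidate cycle starts in the set, and a single scan over the binary sequence removes each cycle whose expected position holds a 0.

```python
def exact_periods(binseq, p_min, p_max):
    """Return all cycles (p, o) with 100% support in binseq."""
    # generate all candidate cycles (p, o) with 0 <= o < p
    cand = {(p, o) for p in range(p_min, p_max + 1) for o in range(p)}
    for i, bit in enumerate(binseq):
        if bit == 0:
            # a zero at position i kills cycle (p, i mod p) for every p
            for p in range(p_min, p_max + 1):
                cand.discard((p, i % p))
    return cand

# the slide's example: 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
seq = [0, 1, 0, 0] * 4 + [0, 1]
assert (4, 1) in exact_periods(seq, 2, 6)
```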
Detection of periods with a given minimum support (1) • 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 – cycle (4, 1) with support 80% • Baseline algorithm:
for (p = P_min; p <= P_max; p++)
  for (o = 0; o < p; o++)
    s = CalculateSup(p, o);
    if (s >= Sup_min) Add cycle (p, o) to Result;
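A direct Python rendering of the baseline, where CalculateSup is taken to be the fraction of positions i ≡ o (mod p) holding a 1:

```python
def calculate_sup(binseq, p, o):
    """Support of cycle (p, o): fraction of positions i = o, o+p, o+2p, ...
    where binseq[i] == 1."""
    hits = [binseq[i] for i in range(o, len(binseq), p)]
    return sum(hits) / len(hits) if hits else 0.0

def periods_with_support(binseq, p_min, p_max, sup_min):
    """Baseline algorithm from the slide: test every candidate cycle."""
    return [(p, o)
            for p in range(p_min, p_max + 1)
            for o in range(p)
            if calculate_sup(binseq, p, o) >= sup_min]
```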
Algorithm • Preparing data • Looking for Patterns • Finding Periods • Validation
Validation • The goal: • Compare found groups with known business processes (departures) • Analysis: • Departures processed in other systems were not found at all • Some departures weren't found • Some groups weren't associated with known processes • Different ports correlate with different groups • Possible reason: different cargo types
Validation Summary
Detected departures = |{departures : |t_dep - t_gr| < 2}| / |{departures}|
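A short sketch of this metric, assuming t_dep is a departure time and t_gr the nearest detected group-activity time, both in the same units (e.g., days), with a tolerance of 2:

```python
def detected_ratio(departure_times, group_times, tolerance=2):
    """Fraction of departures that have some detected group activity
    within |t_dep - t_gr| < tolerance."""
    detected = sum(
        1 for t_dep in departure_times
        if any(abs(t_dep - t_gr) < tolerance for t_gr in group_times)
    )
    return detected / len(departure_times) if departure_times else 0.0
```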
Conclusions • Mining summary data is computationally feasible and can produce reasonably good precision • Topics for future work: • Convert business process patterns into data access patterns • Compare alternative mining approaches • Evaluate the techniques on other classes of data • Define a framework for adaptive self-tuning