Mining Logs for Long-Term Patterns Boris Novikov, Elena Michailova, Ekaterina Ivannikova, Alice Pigul Saint Petersburg University
Motivation Goal: • To predict system behavior based on knowledge extracted from data Solution: • Short-term patterns • Long-term patterns Applicability: • Storage system performance • Financial market movements • Geographical and ecological prediction
Some Data are Needed System logs: • Low-level system I/O and other storage system logs • Application logs (e.g. DBMS performance monitoring) • Transaction logs Problems: • Huge volume • Hard to get realistic data
The Data • A production DB for a medium-size company • The business area: sea transportation • DBMS: Oracle 10g • Database size: approx. 90 GB (operational) • Query execution statistics • Only summary data were available
The Data Structure • The fields: • SQL id • Elapsed time • Executions • CPU • Start interval time • End interval time • The data are aggregated over 1-hour intervals • Query IDs are available, but not the SQL text
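As a rough sketch of how one hourly record could be represented (field names are illustrative, not the actual Oracle statistics column names):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class QuerySnapshot:
    """One row of hourly query execution statistics."""
    sql_id: str            # opaque query identifier (no SQL text available)
    elapsed_time: float    # total elapsed time within the interval, seconds
    executions: int        # number of executions within the interval
    cpu_time: float        # total CPU time within the interval, seconds
    interval_start: datetime
    interval_end: datetime
```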
Pattern • Queries • indicate business functions; • the links to data may be found via query parameters. • Group of queries • might indicate business processes. • A pattern is a set of queries with significant resource consumption at the same or close intervals
Why is it Hard? • Several business processes are interactive and hence chaotic • The processes might be too small or too large
Algorithm • Preparing data • Looking for Patterns • Finding Periods • Validation
Cleaning • Remove queries which are not helpful but produce significant workload • Chaotic queries with high intensity • Uniformly distributed queries • Trivial periods (working hours) • Anomalies (occasional very high workload)
How to Clean? Queries producing • Nearly constant workload for (almost all) snapshots • Nearly constant ratio to total workload for the snapshot • Anomalies Approach • Variance • Frequency of occurrence
Removing Chaotic Queries The minimum variance threshold is 0.5; drop no more than 9% of queries
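A minimal sketch of one plausible reading of this step: a query is treated as chaotic when the variance of its mean-normalized workload exceeds the 0.5 threshold, and at most 9% of all queries are dropped. The normalization and the exact ranking rule are assumptions, not taken from the slides.

```python
import numpy as np

def drop_chaotic_queries(workload, min_variance=0.5, max_drop_ratio=0.09):
    """workload: dict mapping sql_id -> np.array of per-snapshot load values.
    Drops at most max_drop_ratio of the queries, choosing those whose
    normalized variance is highest (above min_variance)."""
    variances = {}
    for sql_id, series in workload.items():
        mean = series.mean()
        if mean > 0:
            variances[sql_id] = (series / mean).var()  # scale-free variance
    # rank most chaotic (highest variance) first
    ranked = sorted(variances, key=variances.get, reverse=True)
    limit = int(len(workload) * max_drop_ratio)
    dropped = {q for q in ranked[:limit] if variances[q] >= min_variance}
    return {q: s for q, s in workload.items() if q not in dropped}
```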
Algorithm • Preparing data • Looking for Patterns • Finding Periods • Validation
Mining Patterns • Patterns: • related queries are grouped together. • The number of patterns depends on: • number of queries in a group; • correlation measure.
The Interconnectedness of the Data A correlation measure M is computed: • between queries q1 and q2 • when adding a query q to a group G • New groups are deleted when M < threshold
The Impact of the Threshold M > 0.6, which corresponds to more than 4/5 of the snapshots
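The slides do not give the formula for M; the note that M > 0.6 corresponds to more than 4/5 of the snapshots suggests a co-occurrence-style measure. A hedged sketch using a Jaccard-like overlap of active snapshots (an assumption, not the authors' actual definition):

```python
def cooccurrence(active_a, active_b):
    """active_a, active_b: sets of snapshot indices where each query is
    active (significant resource consumption). A Jaccard-style measure
    in [0, 1]; 1 means the queries always appear in the same snapshots."""
    if not active_a or not active_b:
        return 0.0
    return len(active_a & active_b) / len(active_a | active_b)

def can_join_group(group, candidate, threshold=0.6):
    """Admit a candidate query (its active-snapshot set) into a group only
    if its measure against every current member stays above the threshold."""
    return all(cooccurrence(member, candidate) >= threshold for member in group)
```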
Algorithm • Preparing data • Looking for Patterns • Finding Periods • Validation
Periods • Transform patterns to a binary sequence: If ({q1, q2, ..., qn} ⊆ snapshot_i) then Binary[i] = 1 else Binary[i] = 0 • Cycle c = (p, o), where p is the period and o is the offset. Example: a ship arrives at the port every Friday: p = 7 days, o = 5th day, cycle c = (7, 5)
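A minimal sketch of this binarization, assuming each snapshot is given as the set of query IDs active in it:

```python
def to_binary(pattern, snapshots):
    """pattern: set of query ids; snapshots: list of sets of query ids.
    Binary[i] = 1 iff every query of the pattern occurs in snapshot i."""
    return [1 if pattern <= snap else 0 for snap in snapshots]
```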
Algorithms • Exact periods • Approximate periods: allow missing or extra entries
Exact periods mining • Detection of cycles with 100% support: 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 – cycle (4, 1) • All possible candidate cycles (p, o) are generated (Cand). • The binary sequence is scanned once and all non-periodic cycles are excluded from Cand; the cycles remaining in Cand are the periodic ones:
if BinSeq[i] = 0
  for (p = P_min; p <= P_max; p++)
    Cand.delete(p, i mod p)
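A runnable version of this elimination scheme, translating the slide's pseudocode directly: every candidate cycle starts in the set, and a single scan over the binary sequence removes each cycle whose expected position holds a 0.

```python
def exact_periods(binseq, p_min, p_max):
    """Return all cycles (p, o) with 100% support in binseq."""
    # generate all candidate cycles (p, o) with 0 <= o < p
    cand = {(p, o) for p in range(p_min, p_max + 1) for o in range(p)}
    for i, bit in enumerate(binseq):
        if bit == 0:
            # a zero at position i kills cycle (p, i mod p) for every p
            for p in range(p_min, p_max + 1):
                cand.discard((p, i % p))
    return cand

# the slide's example: 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
seq = [0, 1, 0, 0] * 4 + [0, 1]
assert (4, 1) in exact_periods(seq, 2, 6)
```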
Detection of periods with a given minimum support (1) • 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 – cycle (4, 1) with support 80% • Baseline algorithm:
for (p = P_min; p <= P_max; p++)
  for (o = 0; o < p; o++)
    s = CalculateSup(p, o);
    if (s >= Sup_min) Add cycle (p, o) to Result;
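A direct Python rendering of the baseline, where CalculateSup is taken to be the fraction of positions i ≡ o (mod p) holding a 1:

```python
def calculate_sup(binseq, p, o):
    """Support of cycle (p, o): fraction of positions i = o, o+p, o+2p, ...
    where binseq[i] == 1."""
    hits = [binseq[i] for i in range(o, len(binseq), p)]
    return sum(hits) / len(hits) if hits else 0.0

def periods_with_support(binseq, p_min, p_max, sup_min):
    """Baseline algorithm from the slide: test every candidate cycle."""
    return [(p, o)
            for p in range(p_min, p_max + 1)
            for o in range(p)
            if calculate_sup(binseq, p, o) >= sup_min]
```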
Algorithm • Preparing data • Looking for Patterns • Finding Periods • Validation
Validation • The goal: • Compare found groups with known business processes (departures) • Analysis: • Departures processed in other systems were not found at all • Some departures weren't found • Some groups weren't associated with known processes • Different ports correlate with different groups • Possible reason: different cargo types
Validation Summary
Detected departures = |{departures : |t_dep - t_gr| < 2}| / |{departures}|
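A short sketch of this metric, assuming t_dep is a departure time and t_gr the nearest detected group-activity time, both in the same units (e.g., days), with a tolerance of 2:

```python
def detected_ratio(departure_times, group_times, tolerance=2):
    """Fraction of departures that have some detected group activity
    within |t_dep - t_gr| < tolerance."""
    detected = sum(
        1 for t_dep in departure_times
        if any(abs(t_dep - t_gr) < tolerance for t_gr in group_times)
    )
    return detected / len(departure_times) if departure_times else 0.0
```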
Conclusions • Mining summary data is computationally feasible and can produce reasonably good precision • Topics for future work: • Convert business process patterns into data access patterns • Compare alternative mining approaches • Evaluate the techniques on other classes of data • Define a framework for adaptive self-tuning