Analyzing System Logs: A New View of What's Important Sivan Sabato Elad Yom-Tov Aviad Tsherniak Saharon Rosset IBM Research SysML07 (Second Workshop on Tackling Computer Systems Problems with Machine Learning Techniques) Presented By Hassan Wassel
Introduction • System logs are a critical tool for system administrators. • They are massive in volume. • We need to rank their messages according to importance. • Previous work: • Ranking using expert rules • Visualization • Analysis of a single machine's log
What is Important? • This paper proposes that an important message is one that appears with a probability higher than expected. • Represent messages of the same type by one message type. • Calculate the empirical distribution of probabilities and rank the messages by it. • Systems are not homogeneous.
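The ranking idea on this slide can be sketched as follows. This is a minimal illustration, not the paper's exact scoring rule: the slide does not give the score definition, so the sketch assumes that each log is a dictionary of message-type counts and that a message type's score is the fraction of reference logs in which it appears strictly less often than in the observed log. The names `empirical_scores` and `rank_messages` are hypothetical.

```python
from bisect import bisect_left


def empirical_scores(reference_logs, observed_log):
    """Score each message type in observed_log against reference logs.

    Assumption: the score is the fraction of reference logs whose count
    for that message type is strictly smaller than the observed count.
    A message type missing from a reference log counts as 0.
    """
    scores = {}
    for msg_type, observed in observed_log.items():
        history = sorted(log.get(msg_type, 0) for log in reference_logs)
        scores[msg_type] = bisect_left(history, observed) / len(history)
    return scores


def rank_messages(reference_logs, observed_log):
    """Return message types ordered from most to least surprising."""
    scores = empirical_scores(reference_logs, observed_log)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores, key=lambda m: (-scores[m], m))


reference = [
    {"disk_error": 0, "login": 10},
    {"disk_error": 0, "login": 12},
    {"disk_error": 1, "login": 11},
    {"login": 9},
]
observed = {"disk_error": 5, "login": 10}
print(rank_messages(reference, observed))
```

In this toy data the frequent but ordinary `login` message ranks below the rare `disk_error` message, because its count is typical for the reference logs.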
Algorithm • Use K-means clustering to divide system logs into classes. • Estimate the empirical distribution of each class. • Given a system log, identify its class and rank messages according to its P
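Clustering • K-means tries to minimize the objective function $J = \sum_{j} \sum_{i} d^2(X_i, Z_j)$, where the inner sum runs over the patterns $X_i$ assigned to cluster $j$ and $Z_j$ is that cluster's center • Inputs: • Number of clusters • Distance matrix • Outputs: • Membership matrix • Objective function value • (Diagram labels: Features, Patterns, Clusters, Patterns)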
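The objective on this slide can be illustrated with a small sketch of Lloyd's iteration, the usual way to run K-means. It is written under stated assumptions: it takes feature vectors rather than a precomputed distance matrix, uses squared Euclidean distance for d², picks initial centers at random from the data, and returns a list of cluster labels instead of a membership matrix. The function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and center update.

    Returns (labels, centers, objective), where objective is the sum of
    squared distances from each point to its assigned center (J).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so J cannot improve further
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    objective = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, objective


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, objective = kmeans(points, k=2)
print(labels, objective)
```

Lloyd's iteration only finds a local minimum of J, so in practice it is commonly restarted from several random initializations and the run with the lowest objective is kept.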
Dimensionality Problem • The data consisted of 3,000 system logs with 15,000 message types; however, it is sparse. • Measuring distances over these 15,000 features is computationally intensive. • Solution: Dimensionality reduction
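Feature Construction • Use the Spearman correlation between every two system logs • $\mathrm{Corr}(x, y) = 1 - \frac{6\,\lVert r_x - r_y \rVert^2}{N(N^2 - 1)}$, where $r_x$ and $r_y$ are the rank vectors of logs $x$ and $y$ and $N$ is the length of the rank vectors • From a k logs × n message types matrix to a k × k similarity matrix. • Question: How to calculate rank vectors?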
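One possible answer to the rank-vector question is sketched below. It assumes each log is a list of message-type counts of equal length, and it assigns tied counts the average of the ranks they span, which is a common convention but not one confirmed by the slides. The correlation uses the standard Spearman denominator N(N² - 1); that closed form is exact only when there are no ties, so on sparse logs with many zero counts it should be read as an approximation. All function names are hypothetical.

```python
def rank_vector(counts):
    """Return 1-based ranks of counts, averaging ranks over ties."""
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    ranks = [0.0] * len(counts)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and counts[order[end + 1]] == counts[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation via the squared rank-difference formula."""
    n = len(x)
    rx, ry = rank_vector(x), rank_vector(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def similarity_matrix(logs):
    """Map k logs over n message types to a k x k correlation matrix."""
    return [[spearman(a, b) for b in logs] for a in logs]


logs = [[10, 0, 0, 5], [8, 1, 0, 4], [0, 7, 9, 1]]
for row in similarity_matrix(logs):
    print(row)
```

Each row of the resulting matrix can then serve as a k-dimensional feature vector for one log, replacing the original n-dimensional count vector.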
Evaluation • Compare Spearman correlation to other feature construction schemes. • Histogram of pairwise distances • Maximal mutual information • Improvement in score
Comment • Future Work • Correlation based clustering • Feature extraction + choice of distance measure • Bi-clustering • Fuzzy Clustering • Evaluation • Use of human expertise to evaluate the ranking. • Clustering index
Thank you! Pros and Cons!