550 likes | 1.06k Views
Vik Singh w/ Jim Gray, Mark Manasse (BARC - eScience) MSR SVC Mountain View 8/23/2006. SQL Text Mining. Road Map. Motivations Methodology & Algorithms Experiments & Analysis Application Future Work. 1. Motivations (1): SDSS SkyServer. Multi-TB Astronomy Archive
E N D
Vik Singh w/ Jim Gray, Mark Manasse (BARC - eScience) MSR SVC Mountain View 8/23/2006 SQL Text Mining
Road Map • Motivations • Methodology & Algorithms • Experiments & Analysis • Application • Future Work 1
Motivations (1): SDSS SkyServer • Multi-TB Astronomy Archive • 5 Yrs, 178M Hits, 77M Views, 1M Unique IP's20M SQL Queries (9.8M Unique) • SQL access to telescope data • Wish to categorize user’s SQL queries • Schema redesign, caching, query recommendation, popular templates, segment users 2
(2): Not much prior work • Could not find research showing how to characterize SQL • But many sites and databases maintain query logs • Fortunately there is related work • NLP, IR, Machine Learning 3
Methodology • Use unsupervised learning (K-Means) • Cluster centers give us query templates • Cluster sizes tell us popularity • Do term analysis over these segments • More interesting than total aggregate term stats • Can isolate types of users (Bots v. Mortals) 4
K-Means Algorithm defKMeans(k, docs): clusters = InitialClusterCenters(k, docs) while true: change = AssignToClusters(docs, clusters) if change: clusters = RecomputeClusterCenters(k, docs) else: break return docs, clusters 5
3 Key Factors to Clustering • Distance function • Choosing K • We’ll undershoot K • Then break these clusters into ‘tighter’ ones • Choosing the initial centroids • Not covered in this talk, but we have some cool tricks • For now assume traditional approaches (BuckShot, Random) 6
1st Step: Distance Function • How syntactically similar are two queries? • Not necessarily functionally similar • SQL term order • Column Fields and WHERE conditionals can be reordered & represent the same query • Solution: Compare token combinations • N-Grams (or Shingles) 7
We also need to clean the SQL • ‘Templatize’ SQL queries • Remove unnecessary, uncommon features • 24 REGEX cleaners and substitutions • Ex. Substitute ‘STRING’, ‘NUM’, ‘COMPARE’ • Goal here is to MAXIMIZE similarity 8
Before & After SELECTs.ra, s.dec FROM #upload u, SpecObj s WHERE u.up_plate=s.plate and u.up_mjd=s.mjd and u.up_fiber=s.fiberid select p.objID, rc.name as rc3_name, s.name as stetson_name, p.ra, p.dec, ph.name as type, p.u, p.g, p.r, p.i, p.z, o.distance from(((PhotoPrimary p inner join PhotoType ph on p.type = ph.value) left join RC3 rc on p.objid = rc.objid) left join Stetson s on p.objid = s.objid), dbo.fGetNearbyObjEq(180,-0.5,3) o where o.objid = p.objid and p.type = ph.value order by o.distance select ra dec from temp specobj where up_plate compare plate logic up_mjd compare mjd logic up_fiber compare fiberid select objid name name ra dec name u g r i z distance from photoprimary inner join phototype on type compare value left join rc3 on objid compare objid left join stetson on objid compare objid fgetnearbyobjeq where objid compare objid logic type compare value orderby distance 9
Feature Vector • What sized N-Grams should we use? • Unigrams & Bigrams most common • Any more than Tri-Grams usually results in worse clusters • But this assumes unstructured text • We have a highly constrained language • And we want to capture Joins select objid name name ra dec name u g r i z distance fromphotoprimary inner join phototype on type compare value • (Need at least size 8-grams here) • At the same time we want good results too - consistent with the literature • So bias smaller grams since they more likely to occur 10
Feature Strategy – ‘Use ‘Em All’ • Generate all 1 … 8 sized N-grams for a query • Sort the tokens within an N-gram • Why? • Increases similarity matches, decreases # N-grams (better for memory) • SQL is highly constrained - unlikely the terms reordered represent a different style query • Larger N-gram matches are unlikely, should be rewarded similarity if they are within the same N-distance neighborhood of terms • Jaccard’s Similarity Measure • |Intersection(Q1_n, Q2_n)| / |Union(Q1_n, Q2_n)| • Compute the Jaccard for each N-gram set separately, then take a weighted Fibonacci mean favoring smaller grams • Since there can be duplicate terms, we append each N-gram with a rolling index • ‘Multi-Set Jaccard’ 11
Ex. Distance between 2 Queries • A: ‘select objid ra dec r z from galaxy specobj’ • B: ‘select objid ra dec from galaxy fgetnearbyobjeq where objid compare objid’ • For simplicity, just up to size 3 • Not Shown • Sort tokens with N-Gram • “ra_dec_from” => “dec_from_ra” • Index repeated N-grams within same set size • Ex. objid_1, objid_2 Distance = (3*(0.43)+2*(0.31)+1*(0.17)) / 6 = 0.35 12
But .. this still won’t scale for 20M • We’re producing 1 … 8 grams each • Producing close to 1000+ grams per query • This is like full-scale document clustering! • 8 Jaccard’s over strings in each comparison step • Piping results in and out of SQL • O(N^2) clustering algorithm • Only have 3 months & a single machine … 13
But there are only 80K templates! • First, we only need to cluster the distinct queries (9.8M) • Second, we found MANY queries reduce down to a common template after cleaning • Number of unique query templates ~ 80K • Over 99.5% reduction! • Filter out the queries which result in errors • Brings it down to ~77K • Clustering these ~77K templates is equivalent to clustering all 20M! • We maintain the query to template mappings and book-keep template counts • Just factor the counts at the end to scale the solution for 20M 14
Let’s Cluster • IronPython + SQL Server 2005 • Didn't use Data Mining package in SQL Server • Its Clustering optimized for Euclidean distances • K = 24 (based on previous work at JHU) • Then search clusters for tighter groups • Within 70% similarity and of size >= 10 • Total clusters: 194 • Computation Time: ~22 Hrs • 15 Iterations 15
Ex. A Popular Cluster Found • Represents 18% of all the SQL queries (3,100,655 hits) • Sum of Square Errors/Cluster Size =3.58006287097031E-06 • 0.061 variance omitting top template from cluster • Tiny variance among queries 16
And since we collect N-grams … • … We can find the common phrases within a cluster • Ex. Top 15 Most PopularEight-Grams 17
Two Example Applications • Bot Detection • Query Recommendation 18
App 1: Bot Detection • Belief:Little or no variance and large cluster sizes correspond to Bot agents • Can isolate bot from human SQL traffic • Useful since we don’t have the user-agent strings in the SQL logs 19
App 1: Bot Detection • Significant # of EXACT query template matches • Top 10 Query Templates 20
Check the logs for the top template • Sample of SQL log records matching the template • Even have the same # of whitespaces & case • Just diff numbers being passed into the TVF’s • Steady rate: 2x hour for weeks at a time • Product usage pattern (All DR3, then all DR2) • All from the same user IP (NASA) • 2nd highest # queries on our top org’s list • Smells like a bot 21
App 2: Query Recommendation • Since we have a distance function … we can even return similar queries • New users, students, visitors wishing to query the sky • But there’s a learning curve • SQL (writing 3-way spatial query joins not easy) • Schemas • Optimizing & Debugging • Because queries are quite repetitive, why not suggest known correct ones to the user • Spelling Correction! 22
App 2: Example Bad User Query: SELECT TOP 10 ph.ra,ph.dec,str(ph.g - ph.r,11 ?) as color,ISNULL(s.bestObjId, 0) as bestObjId, 'ugri' FROM #x x, #upload up, BESTDR2..PhotoObjAll as ph LEFT OUTER JOIN? SpecObjAll s ON ph.objID = s.bestObjID WHERE (ph.type=3 OR ?) AND up.up_id = x.up_id ? x.objID=p ?.objID ORDER BY x.up_id => It’s ‘Cleaned’ representation: select top num ra dec str isnull bestobjid from temp temp photoobjall left outer join specobjall on objid compare bestobjid where type compare num logic logic up_id compare up_id objid compare objid orderby up_id 23
App 2: Similar Query Results * Can computed quickly by comparing only against queries within the same length neighborhood 24
Conclusion • Not discussed • Optimizations, use of SQL Server, how we broke down clusters (‘Finesse Algorithm’), initializing centers, defining a center (‘Clustroid’), theoretical properties, weighting scheme • Future Work • More cleaning, using SQL parse trees, other features representations, better n-grams (wrapping edges), optimizations (min-hashing) • Findings • We found that queries in cleaned representation fall into a small group of clusters (20M => 77K) • Queries follow very repetitive template patterns, enabling us to effectively find bots, query outliers, and do query recommendation 25
References • SkyServer Site Logs: http://skyserver.sdss.org/log/ • Sloan Digital Sky Survey SkyServer project’s website: http://skyserver.sdss.org/ • G. Abdulla, “Analysis of SDSS SQL server log files”, UCRL-MI-215756-DRAFT. Lawrence Livermore National Laboratory, 2005 • T. Malik, R. Burns, A. Chaudhary. Bypass Caching: Making Scientific Databases Good Network Citizens. In Proceedings of the 21st International Conference on Data Engineering, 2005. • Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig: Syntactic Clustering of the Web. Computer Networks 29(8-13): 1157-1166 (1997) • Fast and effective text mining using linear-time document clusteringB Larsen, C Aone - Proceedings of the fifth ACM SIGKDD international conference …, 1999 - portal.acm.org • Web mining research: a survey R Kosala, H Blockeel - ACM SIGKDD Explorations Newsletter, 2000 - portal.acm.org • Mehran Sahami, Timothy D. Heilman: A web-based kernel function for measuring the similarity of short text snippets. WWW 2006: 377-386 • Plan to publish our web and SQL traffic research in a MSR TR as well as make the database, docs, slides & code available at http://skyserver.wordpress.com this week 26
QA 27
Gives us another optimization • Current Design Inefficient • Many Jaccard’s over very large sets of strings • Let’s use hashing to speed up Jaccard and enable us to store more features in memory • Hash(Template) => [Hashes(1-Grams), Hashes(2-Grams) …] • We can use Python’s 32-bit String hash function • But what about collisions? • Not many unique N-grams • SQL & Schema vocabulary small • What about hashing query templates? • Birthday Problem • We can expect collisions with probability greater than a 50% after hashing: • ½ + sqrt(1/4 + 2*2^32 * ln(2)) = ~77K
Clustering • Primarily two flavors of clustering • Expectation-Maximization (EM) • K-Means • Don’t want EM • Model-based clustering for unobservable data • K-Means is actually just a special case of EM • Clusters modelled by spherical Gaussian distributions • Each data item is assigned to one cluster • The mixture weights are equal
Now how do we compare two sets of N-grams? -- TODO • Jaccard’s Measure • Cardinality of the intersection of the two sets divided by the cardinality of the union of the two sets • Traditional, established metric for similarity • Captures Euclidean and Cosine properties • Since sets remove duplicates, we index n-grams with numbers to denote repetition • Multi-Set Jaccard’s Measure
(3): How well can we classify SQL? • Motivation for converting unstructured to structured text • Can optimize a common layer for SQL style representation • Reusable ML development framework • Although translating to SQL is not trivial • Has nice properties • Highly constrained, clean vocabulary, relational, propositional, parseable
But we have 8 sets of n-grams … -- TODO • We compute the Jaccard separately for each size n-gram • Not a good idea to mix sizes (lose independence) • Wish to prioritize smaller grams for better results and to maximize similarity • Since the Jaccard of smaller grams will result in a higher similarity value than larger grams • Prioritize the first 3 grams • Reverse Fibonacci sequence enables us to grow the weights proportionally and exponentially • 78% goes to first 3 grams • Comforting to know we’re taking into account larger grams (22% sounds reasonable to us) • Then take the weighted mean of the Jaccard’s scaled by their respective Fibonacci number • This is arbitrary, but we liked the results • Compared to exponential (2 based) and just plain ol’ unigrams with no scaling • Gives us something in between these two
Finally, an example: Distance Between 2 SQL Queries -- TODO • [include two queries, their n-grams, jaccard of each, multiplied by the fibonacci element, then the mean] • For now, say n-grams from 1-3 (so it can fit on a slide)
2nd Step: Choosing Initial Centers • First, what’s a center? • Common clustering done over numerical vectors • Euclidean distance and midpoint (‘Centroid’) • But we’re dealing with sets of n-grams • No clear average midpoint formula • So instead we use ‘Clustroids’ • SQL queries represent centers • Since we can’t do average midpoint, we find the query minimizes square error with everybody else within the cluster
Initialization • Hill-Climbing • No really good method that escapes local minima • Simulated Annealing • Genetic K-Means • Difficult to apply to our features and distance metric • Buckshot (1992) [include cite]) • Hierarchical Agglomerative Clustering (HAC) • Dendrogram Graph • Start w/ SQRT(N) samples, Greedily combine clusters (All-Pairs) until left with K Clustroids • O(N^2–N*K)
But this just does not scale for 20M -- TODO • How many Jaccards for just BuckShot [include] • Plus how many Jaccards for KMeans • (probably more, stops when the clusters are stables, which is unknown number of steps) • One can set a max number of iterations • So how about sample more • Just recomputing clustroids is n^2 • Need to do all-pairs to find center with min square error • Sample sqrt(n) to make it linear
Can we initialize centers faster? -- TODO • BuckShot is slow • Can we do this step faster and produce better clusters? • Idea: Cuckoo Clustering • Inspired by Cuckoo Hashing and Genetic KMeans [include cites]
Side Excursion: Cuckoo Clustering – TODO (Should be extra slides) • [describe algorithm] • K buckets • Elect better leaders • Spill over to next best bucket • [Analyze run time] • [include pseudo code]
Cuckoo Clustering – TODO (extra slides) • Nests (capacity 2) • K = 4 • Sample size = 8
Clustering 116,425 SQL Queries • SQL Queries chosen uniformly at random • K = 56 • Max K-Means Iterations = 20 • MSEWC = Mean Square Error Within Cluster
Plus … -- TODO • Each query template and the number of times it occurred • The most popular ones should represent initial centers • And, cleaned queries grouped by the same length should be assigned to the same cluster initially • Get one iteration of AssignClusters for free • Now we can do KMeans
Bootstrap Example -- TODO • [include example, before and after group bys]
Experiment -- TODO • Setup • SQL Server 2005 • IronPython <== I love this language • Did not use Data Mining package in SQL Server • 'Microsoft Clustering' designed around and optimized for Euclidean vectors • Not easy to write a SQL Server C++ plug-in for my style feature sets + Jaccard + sampling logic • Computation Time: [?] • KMeans Iterations: [?]
Some Optimizations – TODO (extra slides) • Pruning when we find similarity = 1 • Hashing • Speeds up Jaccard and reduces memory • Hash(cleanedQuery) => List of Hash(N-gram)’s • Use Python’s 32-bit string hash function • Collisions? • Not many unique N-grams • Number of SQL keywords + SDSS schema fields very small • Birthday Problem: We can expect collisions with probability greater than a ½ after hashing: • ½ + sqrt(1/4 + 2*2^32 * ln(2)) = ~77k queries • We have ~77k unique cleaned queries • Good enough
We can do better: ‘Finesse’ -- TODO • Idea: KMeans does the hard work of grouping similar queries together • Intuition: We can find 'tighter' query groups within a cluster fast • We choose a threshold (say within 85% similarity) • Value of K not really important now • [include algorithm]
Finesse Algorithm -- TODO • [include algorithm] • Going backwards from 1.0 to threshold in decremental step values key to finding optimal clusters and for speed
A Popular Cluster Found • Represents 18% of all the SQL queries (3,100,655 hits) • Sum of Square Errors/Cluster Size =3.58006287097031E-06 • Meaning tiny variation within cluster
And since we collect N-grams … • … We can find the common phrases • Ex. Top 15 Most PopularEight-Grams