Learn the basics of Bayesian Inference and how to use a Naïve Bayesian Classifier to predict user inputs on web forms. Understand the Bayesian approach for form predictions and building classifiers. Explore the logic behind likelihood probability computation and challenges like missing values and privacy concerns.
Challenges and successes in predicting web form user inputs • Sumit Amar • Research Developer • Microsoft Corporation • samar@microsoft.com
Objectives • Motivation behind logging user actions • How to log your web application usage • Basics of Bayesian Inference • A Naïve Bayesian Classifier design to build form predictions
Source: http://www.magnetism.co.nz/Files/Blogs/Using%20Web%20Forms(1).png
Web UI Instrumentation • Designed to capture user interactions such as text inputs, dropdown selections, and checkbox selections • Little to no code required to plug into existing websites • Batches multiple interactions • Online or offline propagation (directly to a DB, or to a file that is later loaded into a DB) • Cross-browser • Can be pipelined to analytics systems (such as Omniture)
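The batching idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `InteractionBatcher` class, its `send` callback, the `batch_size` parameter, and the page name `ask.php` are hypothetical, and the talk's actual framework presumably runs in the browser rather than in Python.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Interaction = Dict[str, str]


@dataclass
class InteractionBatcher:
    """Buffers interactions and passes them to `send` in batches (hypothetical helper)."""

    send: Callable[[List[Interaction]], None]
    batch_size: int = 3
    _buffer: List[Interaction] = field(default_factory=list)

    def record(self, page: str, control: str, value: str) -> None:
        # Queue one interaction; flush once the batch is full.
        self._buffer.append({"page": page, "control": control, "value": value})
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is buffered as one batch, then empty the buffer.
        if self._buffer:
            self.send(list(self._buffer))
            self._buffer.clear()


# Usage: collect batches in a list instead of posting them to a server.
sent_batches: List[List[Interaction]] = []
batcher = InteractionBatcher(send=sent_batches.append, batch_size=2)
batcher.record("ask.php", "txtName", "Alice")
batcher.record("ask.php", "txtLocation", "Seattle")  # second record fills the batch
batcher.record("ask.php", "txtQuestion", "Pricing")
batcher.flush()  # e.g. on page unload, send the remainder
```

Sending one request per batch rather than per interaction is what keeps the logging overhead low; the batch size is a tuning knob, not a value taken from the talk.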
Rationale to instrument web interfaces • Understand user behavior, intentions, and trends • Gauge usability of the system • Capture true performance metrics • Generate test automation code or smoke tests • Use data mining to enhance user experience
Bayesian approach to building predictions for form entries • Based on Thomas Bayes’ ~250-year-old theorem • P(H|E) = (P(E|H) * P(H)) / P(E) • Probability of a hypothesis given the evidence = probability of the evidence given the hypothesis * probability of the hypothesis, normalized by the probability of the evidence
Bayesian approach to building predictions for form entries • For example, for the six observations below, the probability that H = 0 given E = 1 is P(H|E) = (P(E|H) * P(H)) / P(E) = (2/3 * 3/6) / (3/6) ≈ 0.667
E | H
1 | 0
1 | 0
2 | 8
2 | 0
3 | 5
1 | 6
• However, E could span multiple columns, i.e. E = [C1, C2, ..., Cn], where each C is a column
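The arithmetic in this example can be checked with a short sketch. The six (E, H) pairs come from the slide; the function name `posterior` is an illustrative choice, and the sketch assumes the example asks for H = 0 given E = 1, which is the only reading that matches the fractions 2/3, 3/6, and 3/6.

```python
from fractions import Fraction

# The six (E, H) observations from the slide.
rows = [(1, 0), (1, 0), (2, 8), (2, 0), (3, 5), (1, 6)]


def posterior(rows, e, h):
    """Return P(H=h | E=e) via Bayes' theorem, using counts from `rows`."""
    n = len(rows)
    h_count = sum(1 for _, hv in rows if hv == h)
    e_count = sum(1 for ev, _ in rows if ev == e)
    both = sum(1 for ev, hv in rows if ev == e and hv == h)

    p_e_given_h = Fraction(both, h_count)  # P(E|H) = 2/3
    p_h = Fraction(h_count, n)             # P(H)   = 3/6
    p_e = Fraction(e_count, n)             # P(E)   = 3/6
    return p_e_given_h * p_h / p_e


result = posterior(rows, e=1, h=0)
print(result, round(float(result), 3))  # 2/3 0.667
```

Counting directly gives the same answer: of the three rows with E = 1, two have H = 0.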
Building a classifier for the form data • Data is captured with the instrumentation framework • But it contains far more data than the classifier needs
Building a classifier for the form data • [Filtered view of] captured data • But the data is still not in the format the classifier needs
Building a classifier for the form data • Transposed form of the data (computed on page loads) • Because E = (C1, C2, ..., Cn), where Cx = input/evidence variables • Let C1 = txtName, C2 = txtLocation, H = txtQuestion • For each hypothesized value of the output variable: P(E|H) = P(C1|H) * P(C2|H) --- (i) • Likelihood = (i) * P(H) • Probability = likelihood normalized to the 0-1 range
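The transposition step might look like the following sketch. The control names txtName, txtLocation, and txtQuestion come from the slide; the log layout of (submission id, control, value) tuples, the sample values, and the function name `transpose` are assumptions, since the slides do not show the captured schema.

```python
from collections import defaultdict

# Hypothetical filtered log: one row per captured interaction.
log_rows = [
    (1, "txtName", "Alice"),
    (1, "txtLocation", "Seattle"),
    (1, "txtQuestion", "Pricing"),
    (2, "txtName", "Bob"),
    (2, "txtLocation", "Redmond"),
    (2, "txtQuestion", "Support"),
]


def transpose(log_rows, columns):
    """Pivot one-row-per-interaction logs into one row per form submission."""
    by_submission = defaultdict(dict)
    for submission_id, control, value in log_rows:
        if control in columns:
            by_submission[submission_id][control] = value
    # A control that was never filled in for a submission becomes None.
    return [
        tuple(fields.get(column) for column in columns)
        for _, fields in sorted(by_submission.items())
    ]


table = transpose(log_rows, ["txtName", "txtLocation", "txtQuestion"])
print(table)
# [('Alice', 'Seattle', 'Pricing'), ('Bob', 'Redmond', 'Support')]
```

Each output row then holds the evidence columns C1 and C2 next to the hypothesis H, which is the shape the likelihood formula (i) expects.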
Probability computation logic • Based on the hypothesis variable and the resource (page), look up the classifier source table • Retrieve the cardinality of each distinct hypothesis value by grouping the possible hypotheses (used for the P(H) calculation) • Create a likelihood dictionary keyed by evidence variable name, whose values map each hypothesis value to its likelihood P(E|H) • For each input/evidence variable E: retrieve all possible hypotheses H where the evidence had the value of E, then compute P(E|H) for each H and store it in a list keyed by the hypothesis value • Multiply all P(E|H) values, i.e. P(E|H) = P(C1|H) * P(C2|H) * ... * P(Cn|H), to obtain the likelihoods • Multiply by P(H), i.e. the count of H divided by the total count of all hypotheses • Normalize the likelihoods to bring them into the 0 to 1 probability range • Return each possible hypothesis value along with its probability
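The steps above can be sketched as follows. This is an illustration under stated assumptions, not the talk's PHP/MySQL demo code: the table is an in-memory list of dictionaries rather than a database lookup, the sample rows are invented, and the function name `predict` is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical transposed classifier table (one row per past submission).
table = [
    {"txtName": "Alice", "txtLocation": "Seattle", "txtQuestion": "Pricing"},
    {"txtName": "Alice", "txtLocation": "Seattle", "txtQuestion": "Pricing"},
    {"txtName": "Alice", "txtLocation": "Redmond", "txtQuestion": "Support"},
    {"txtName": "Bob", "txtLocation": "Seattle", "txtQuestion": "Support"},
    {"txtName": "Bob", "txtLocation": "Redmond", "txtQuestion": "Support"},
]


def predict(table, evidence, hypothesis_column):
    """Return {hypothesis value: probability} for the given evidence values."""
    # Cardinality of each distinct hypothesis value, used for P(H).
    h_counts = Counter(row[hypothesis_column] for row in table)
    total = sum(h_counts.values())

    # likelihoods[column][h] = P(column == observed value | H == h)
    likelihoods = defaultdict(dict)
    for column, value in evidence.items():
        for h, h_count in h_counts.items():
            matches = sum(
                1 for row in table
                if row[hypothesis_column] == h and row.get(column) == value
            )
            likelihoods[column][h] = matches / h_count

    # Multiply the per-column likelihoods, then multiply by P(H).
    scores = {}
    for h, h_count in h_counts.items():
        score = h_count / total
        for column in evidence:
            score *= likelihoods[column][h]
        scores[h] = score

    # Normalize so the returned values sum to 1.
    norm = sum(scores.values())
    if norm == 0:
        return {h: 0.0 for h in scores}
    return {h: score / norm for h, score in scores.items()}


probs = predict(table, {"txtName": "Alice", "txtLocation": "Seattle"}, "txtQuestion")
print(probs)  # Pricing is about 0.857, Support about 0.143
```

Note that this sketch applies no smoothing, so an evidence value never seen with a hypothesis drives that hypothesis to zero; whether the original demo handles this differently is not stated in the slides.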
Challenges and recommendations • Missing values in inputs: Monte Carlo sampling, Gaussian approximation, and several more • Privacy: don’t log PII (personally identifiable information) • Performance: batch requests; use longer intervals/timeouts
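One simple way to follow the privacy recommendation is to drop sensitive controls before anything is logged. The sketch below assumes a deny-list approach; the control names in `PII_CONTROLS`, the interaction layout, and the function name `strip_pii` are all hypothetical, and deciding which fields count as PII is left to the site owner.

```python
# Hypothetical deny-list of form controls that must never be logged.
PII_CONTROLS = {"txtEmail", "txtPhone", "txtPassword"}


def strip_pii(interactions, pii_controls=PII_CONTROLS):
    """Return only the interactions whose control is not on the deny-list."""
    return [item for item in interactions if item["control"] not in pii_controls]


captured = [
    {"control": "txtLocation", "value": "Seattle"},
    {"control": "txtEmail", "value": "someone@example.com"},
    {"control": "txtQuestion", "value": "Pricing"},
]

safe_to_log = strip_pii(captured)
print([item["control"] for item in safe_to_log])  # ['txtLocation', 'txtQuestion']
```

Filtering before the batch is sent means the sensitive values never leave the page, which is a stronger guarantee than deleting them after they reach the log store.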
Resources • Sumit Amar – samar@microsoft.com • Slides – www.amar.co.in/sumit/Web2.0TalkPredictingInputs.ppt • Demo code (PHP/MySQL) – www.amar.co.in/sumit/i.zip