Classification based on online learning • Keywords: reinforcement learning and inverted indexes
Classification learning at different levels • Supervised • Given appropriate “targets” for the data, learn to map from data to target • Unsupervised • Given data, identify “targets” embedded in the data without hints or clues • Reinforcement • Given data and a policy, guess a target, receive a reward or punishment, and adjust the policy
A motivating example: SPAM • Imagine you wish to create a system to reduce spam • You cannot always predict who will send you mail; only by considering the content (e.g., the subject) can you determine whether a mail is spam or not
SPAM detection using a classifier • Hence, the goal is to “learn” to distinguish good subjects from bad ones • This is a classic classification problem
Learning a mapping function • The mapping assigns to each item in the “Mailbox” a relevance value in the range [0, 1]
(Relevance) Learning based on RL • [Figure: agent-user interaction loop; the agent in situation St chooses action At and receives reward Rt+1 from the user in the new situation St+1] • The goal is to learn the policy: πt(s, a) = Pr{At = a | St = s} • At time t, given situation s, the policy provides the probability that the agent’s action will be a • Credit: Based on a model by Gillian Hayes
SPAM prediction • Given that at time t there are n emails (situation S), predict for each email its relevance (i.e., the action, a value between 0 and 1) so as to maximize the reward R from the user • Given a reward (a thumbs up or down, so to speak), adjust the policy and the expectation about future rewards
Several Key Aspects of RL • The policy π(s, a) helps deduce actions • The value function predicts future (expected) rewards • What we actually need is a rank-ordered relevance vector over the items (i.e., emails) • The learning step λ: the rate at which the policy is updated (associated with the policy function)
A High-Level RL Algorithm for spam classification • Initialize the policy and reward representations • Look at which emails have arrived • Predict the relevance of each email based on the policy function • Collect rewards from the user • Check whether the policy has converged; if not, update the policy at rate λ • Update the expected reward
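Below is a minimal Python sketch of this loop, assuming made-up names (rl_spam_loop, simulated_user_reward), an illustrative learning step lam, and a simple convergence tolerance; the running-average and unit-vector update details anticipate the later slides, and none of these specifics come from the slides themselves.

```python
import numpy as np

# Illustrative sketch of the high-level loop; the simulated user feedback and
# all parameter values are assumptions, not taken from the slides.

def simulated_user_reward(true_class, chosen_class):
    """Stand-in for the user's thumbs up/down: 1.0 if the agent picked the
    class the user would have picked, 0.0 otherwise."""
    return 1.0 if chosen_class == true_class else 0.0

def rl_spam_loop(email_true_classes, n_classes, lam=0.1, tol=1e-3, max_rounds=200):
    policy = np.full(n_classes, 1.0 / n_classes)   # P: uniform at the start
    reward_est = np.zeros(n_classes)               # Y-hat: no reward knowledge yet
    counts = np.zeros(n_classes)                   # bookkeeping for running averages
    rng = np.random.default_rng(0)
    for _ in range(max_rounds):
        old_policy = policy.copy()
        for true_class in email_true_classes:                 # emails that arrived
            chosen = rng.choice(n_classes, p=policy)          # predict via the policy
            r = simulated_user_reward(true_class, chosen)     # collect the reward
            counts[chosen] += 1
            reward_est[chosen] += (r - reward_est[chosen]) / counts[chosen]  # running average
            E = np.zeros(n_classes)                           # unit vector E(k)
            E[np.argmax(reward_est)] = 1.0
            policy += lam * (E - policy)                      # update P at rate lam
        if np.abs(policy - old_policy).max() < tol:           # convergence check
            break
    return policy, reward_est
```

Calling rl_spam_loop([0, 2, 1, 0], n_classes=3), for instance, would run four simulated emails per round until the policy settles.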
An Implementation of RL: The Actual RL Environment in Spam Classification • [Figure: the mapping from emails to relevance R is split into two stages, F1 and F2] • Basically, the mapping function is decomposed into two functions: F1, semantic classification, and F2, relevance classification
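As a rough illustration of that split, the sketch below invents a keyword-based F1 and reads F2 straight out of the estimated relevance vector; the keyword lists, class names, and function names are placeholders, not the actual classifiers used.

```python
# Hypothetical decomposition sketch; keyword lists and class names are invented.

def f1_semantic_class(subject, class_keywords):
    """F1 (semantic classification): map an email subject to a class index."""
    subject = subject.lower()
    for idx, keywords in enumerate(class_keywords):
        if any(word in subject for word in keywords):
            return idx
    return len(class_keywords) - 1        # fall back to the last (catch-all) class

def f2_relevance(class_idx, reward_est):
    """F2 (relevance classification): map a class to its estimated relevance in [0, 1]."""
    return reward_est[class_idx]

# Example: three classes -- work, newsletters, and a catch-all likely-spam class.
classes = [["meeting", "report"], ["newsletter", "digest"], []]
idx = f1_semantic_class("Weekly digest: new offers", classes)   # -> 1
relevance = f2_relevance(idx, [0.9, 0.4, 0.1])                  # -> 0.4
```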
RL Implementation • From the agent’s perspective the environment is stochastic • The agent has to choose an action, i.e., pick the best class and present it to the user, based on its policy and reward representations
The Policy and Reward Functions • The policy (action prediction) function is a vector of action probabilities over the set of classes Ci, where Pr(Ci) represents the probability that class Ci should be selected as the best (first) class • The reward function is based on an estimated relevance vector over the set of classes Ci, where Pr(Ci) represents the probability that the user will provide the maximum reward for class Ci
Details on the key RL functions I • The reward (estimated relevance probability) vector is Ŷ = (Ŷ1, …, Ŷn), where n is the number of classes • In the absence of knowledge about expected rewards, all elements of Ŷ are initialized to zero • Recall that each element Ŷi represents the probability that the corresponding class Ci will receive the maximum reward
Details on the key RL functions II • The policy (action probability) vector is P = (P1, …, Pn), where n is the number of classes • In the absence of information, all elements of P are initially set equal to each other (i.e., 1/n) • Pi is the probability that the corresponding class Ci will be selected as the best (first) class
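A short sketch of these two initializations (the class count n = 5 is just an example value):

```python
import numpy as np

n = 5                          # number of classes; example value only
reward_est = np.zeros(n)       # Y-hat: no knowledge of expected rewards yet
policy = np.full(n, 1.0 / n)   # P: every class equally likely at the start
```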
The RL Spam Detection Algorithm • Classify incoming emails into classes Ci (i = 1, …, n) • Select the best class (the one most likely not to be spam) by sampling the distribution in the policy representation P • Generate a pseudo-random number (PRN) and select from P accordingly (explained next) • Create a ranked list of classes Ci based on the policy and the expected reward representations (explained later) • Collect the rewards from the user and update the reward representation Ŷ • Update the policy vector P
How to select an action? • By sampling the action probability distribution (the policy function) • Assume a simple P representation of [0.2, 0.4, 0.1, 0.3] • A PRN between 0 and 1 is generated; class 1 is selected if the PRN lies between 0 and 0.2, class 2 if it lies between 0.2 and 0.6, class 3 if it lies between 0.6 and 0.7, and class 4 otherwise • As learning progresses, the action probability vector converges to a unit vector, so a single class emerges as the dominant one
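The sketch below reproduces this interval logic for the slide's example distribution, using np.searchsorted over the cumulative sums; classes are 0-indexed here, so the slide's class 1 is index 0.

```python
import numpy as np

def select_action(policy, rng=None):
    """Sample a class index from the action probability vector P using a PRN."""
    rng = rng or np.random.default_rng()
    prn = rng.random()                                   # PRN in [0, 1)
    return int(np.searchsorted(np.cumsum(policy), prn, side="right"))

P = np.array([0.2, 0.4, 0.1, 0.3])
# PRN in [0, 0.2) -> class 0, [0.2, 0.6) -> class 1, [0.6, 0.7) -> class 2, else class 3
print(select_action(P))
```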
Generating a ranked list of relevance values • Choose the class with the highest value in the action probability vector as the top class α(k) at instance k • Then generate a vector q(k) as follows: qi(k) = 1 + Ŷi(k) if α(k) = Ci, and qi(k) = Ŷi(k) otherwise, where Ŷi(k) is the ith element of the vector Ŷ at instance k • Since Ŷi(k) is bounded by 1, the top element of q(k) always corresponds to the class α(k); the remaining elements of q(k) follow the order in the estimated relevance probability vector Ŷ(k)
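A small sketch of this ranking rule, where top is the index of the class chosen as best (for example, by the sampling step above):

```python
import numpy as np

def ranked_classes(top, reward_est):
    """Rank classes: q_i = 1 + Yhat_i for the top class, q_i = Yhat_i otherwise."""
    q = np.asarray(reward_est, dtype=float).copy()
    q[top] += 1.0              # Y-hat is bounded by 1, so the top class stays first
    return np.argsort(-q)      # class indices ordered by q, largest first

print(ranked_classes(top=1, reward_est=[0.4, 0.2, 0.7]))   # -> [1 2 0]
```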
Updating the estimated relevance / reward and the action / policy vectors • After collecting the reward (a value between 0 and 1), each element of the Ŷ vector (representing the corresponding class Ci) is updated as a running average of the rewards received for that class • Then a vector E(k) is generated at instance k such that all elements of E are set to zero except the element corresponding to the top class in the freshly updated Ŷ vector, which is set to 1 • The P vector is subsequently updated as P(k+1) = P(k) + λ(E(k) - P(k)), where 0 < λ < 1 is the learning step • Hence, an optimal unit vector (over actions) is computed at every instance based on the current estimate of the relevance probability vector (rewards)
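A sketch of this update step; the counts array used for the running average and the name lam for the learning step λ are assumed bookkeeping details, while the P update follows the formula above.

```python
import numpy as np

def update(policy, reward_est, counts, chosen_class, reward, lam=0.1):
    """One update of Y-hat (running average) and P (toward the unit vector E)."""
    counts[chosen_class] += 1
    # running average of the rewards received for the chosen class
    reward_est[chosen_class] += (reward - reward_est[chosen_class]) / counts[chosen_class]
    # E(k): unit vector with a 1 at the class with the highest estimated relevance
    E = np.zeros_like(policy)
    E[np.argmax(reward_est)] = 1.0
    # P(k+1) = P(k) + lam * (E(k) - P(k)),  0 < lam < 1
    policy += lam * (E - policy)
    return policy, reward_est, counts
```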
Some known limitations of RL • Estimating λ, the learning step • Convergence that is too fast or too slow • Initial latency of learning • Scarcity of rewards • Radical shifts in reward patterns • Re-learning may be slow
Acknowledgement • Snehasis Mukhopadhyay – IUPUI, Indianapolis • Gillian Hayes – Univ. of Edinburgh