Probabilistic Model of Sequences


  1. Probabilistic Model of Sequences Ata Kaban The University of Birmingham

  2. Sequence • Example 1: a b a c a b a b a c • Example 2: 1 0 0 1 1 0 1 0 0 1 • Example 3: 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 • Roll a six-sided die N times: you get a sequence. • Roll it again: you get another sequence. • Here is a sequence of characters; can you see it? • What is a sequence? • Alphabet 1 = {a, b, c}, Alphabet 2 = {0, 1}, Alphabet 3 = {1, 2, 3, 4, 5, 6}

  3. Probabilistic Model • Model = system that simulates the sequence under consideration • Probabilistic model = model that produces different outcomes with different probabilities • It includes uncertainty • It can therefore simulate a whole class of sequences and assign a probability to each individual sequence • Could you simulate any of the sequences on the previous slide?

  4. Random sequence model • Back to the die example (the die can possibly be loaded) • Model of a roll has 6 parameters: p1, p2, p3, p4, p5, p6 • Here, p_i is the probability of throwing i • To be probabilities, these must be non-negative and must sum to one. • What is the probability of the sequence [1, 6, 3]? p1*p6*p3 • NOTE: in the random sequence model, the individual symbols in a sequence do not depend on each other. This is the simplest sequence model.
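
A minimal sketch of this model in Python (the fair-die parameter values below are an illustrative choice, not from the slides):

```python
# Random sequence model: each symbol is drawn independently,
# so the probability of a sequence is the product of the symbol probabilities.

def sequence_probability(sequence, p):
    """p maps each symbol of the alphabet to its probability
    (non-negative, summing to one)."""
    prob = 1.0
    for symbol in sequence:
        prob *= p[symbol]
    return prob

# A fair six-sided die: p1 = ... = p6 = 1/6 (illustrative choice).
fair_die = {i: 1 / 6 for i in range(1, 7)}
print(sequence_probability([1, 6, 3], fair_die))   # p1*p6*p3 = (1/6)**3
```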

  5. Maximum Likelihood parameter estimation • The parameters of a probabilistic model are typically estimated from a large set of trusted examples, called a training set. • Example (t=tail, h=head): [t t t h t h h t] • Count up the frequencies: t: 5, h: 3 • Compute probabilities: • p(t)=5/(5+3), p(h)=3/(5+3) • These are the Maximum Likelihood (ML) estimates of the parameters of the coin. • Does it make sense? • What if you know the coin is fair?
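
A sketch of the same frequency counting in code, using the coin sequence from this slide:

```python
from collections import Counter

def ml_estimate(sequence):
    """Maximum Likelihood estimates for the random sequence model:
    the relative frequencies of the symbols in the training sequence."""
    counts = Counter(sequence)
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}

training = ['t', 't', 't', 'h', 't', 'h', 'h', 't']
print(ml_estimate(training))   # {'t': 0.625, 'h': 0.375}, i.e. 5/8 and 3/8
```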

  6. Overfitting • A fair coin has probabilities p(t)=0.5, p(h)=0.5 • If you throw it 3 times and get [t, t, t], then the ML estimates for this sequence are p(t)=1, p(h)=0. • Consequently, from these estimates, the probability of e.g. the sequence [h, t, h, t] = …………. • This is an example of what is called overfitting. Overfitting is the greatest enemy of Machine Learning! • Solution 1: get more data • Solution 2: build what you already know into the model. (We will return to this later in the module.)
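
A small numerical illustration of the problem; the pseudo-count remedy at the end is one standard way of building prior knowledge into the model and is an addition here, not something stated on the slide:

```python
# ML estimates from the short training sequence [t, t, t]:
p_t, p_h = 3 / 3, 0 / 3            # p(t) = 1, p(h) = 0

# Any future sequence containing 'h' is then assigned probability zero,
# e.g. the sequence [h, t, h, t]:
print(p_h * p_t * p_h * p_t)       # 0.0

# One common remedy (an assumption here, not from the slides): add pseudo-counts
# that encode the prior belief that the coin is roughly fair.
counts = {'t': 3 + 1, 'h': 0 + 1}  # one imaginary observation per side
total = sum(counts.values())
print({s: n / total for s, n in counts.items()})   # {'t': 0.8, 'h': 0.2}
```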

  7. Why is it called Maximum Likelihood? • It can be shown that using the frequencies as the probability estimates maximises the probability of the observed data given the model parameters, P(Data|parameters), which is called the likelihood.

  8. Probabilities • Have two dice, D1 and D2 • The probability of rolling i given die D1 is called P(i|D1). This is a conditional probability. • Pick a die at random with probability P(Dj), j=1 or 2. The probability of picking die Dj and rolling i is called a joint probability and is P(i,Dj)=P(Dj)P(i|Dj). • For any events X and Y, P(X,Y)=P(X|Y)P(Y) • If we know P(X,Y), then the so-called marginal probability P(X) can be computed by summing the joint probability over all values of Y: P(X) = Σ_Y P(X,Y) = Σ_Y P(X|Y)P(Y)

  9. Now, we show that maximising P(Data|parameters) for the random sequence model leads to the frequency-based computation that we did intuitively.
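
The derivation shown on the original slide is not in this transcript; below is a standard sketch of the argument for the die model, with observed counts n_1, ..., n_6 and a Lagrange multiplier for the sum-to-one constraint.

```latex
% Log-likelihood of counts n_1,...,n_6 under parameters p_1,...,p_6,
% with a Lagrange multiplier \lambda for the constraint \sum_i p_i = 1:
L(p,\lambda) = \sum_i n_i \log p_i + \lambda \Bigl( 1 - \sum_i p_i \Bigr)
% Setting the derivative with respect to p_i to zero:
\frac{\partial L}{\partial p_i} = \frac{n_i}{p_i} - \lambda = 0
\;\Longrightarrow\; p_i = \frac{n_i}{\lambda}
% The constraint then gives \lambda = \sum_j n_j, hence the ML estimate
\hat{p}_i = \frac{n_i}{\sum_j n_j}
% i.e. exactly the relative frequency used on the previous slides.
```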

  10. Why did we bother? Because in more complicated models we cannot ‘guess’ the result.

  11. Markov Chains • Further examples of sequences: • Bio-sequences • Web page request sequences while browsing • These are no longer random sequences; they have a time structure, where each symbol depends on what came before it. • How many parameters would such a model have? • We need to make simplifying assumptions to end up with a reasonable number of parameters • The first-order Markov assumption: each observation depends only on the immediately previous one, not on the longer history • Markov Chain = a sequence model that makes the Markov assumption

  12. Markov Chains • The probability of a Markov sequence factorises as P(x1, x2, ..., xN) = P(x1)*P(x2|x1)*P(x3|x2)*...*P(xN|x(N-1)) • The alphabet's symbols are also called states • Once the parameters are estimated from training data, the Markov chain can be used for prediction • Amongst others, Markov Chains are successful for web browsing behaviour prediction
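
A minimal sketch of this computation; the two-state chain below uses made-up transition probabilities purely for illustration:

```python
def markov_probability(sequence, initial, transition):
    """Probability of a sequence under a first-order Markov chain:
    P(x1) * P(x2|x1) * ... * P(xN|x(N-1))."""
    prob = initial[sequence[0]]
    for prev, curr in zip(sequence, sequence[1:]):
        prob *= transition[prev][curr]
    return prob

# Illustrative two-state chain over the alphabet {'a', 'b'}:
initial = {'a': 0.5, 'b': 0.5}
transition = {'a': {'a': 0.9, 'b': 0.1},
              'b': {'a': 0.4, 'b': 0.6}}
print(markov_probability(['a', 'a', 'b', 'b'], initial, transition))
# = 0.5 * 0.9 * 0.1 * 0.6
```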

  13. Markov Chains • A Markov Chain is stationary if its transition probabilities are the same at every time step. • We assume stationary models here. • Then the parameters of the model consist of the transition probability matrix & the initial state probabilities.

  14. ML parameter estimation • We can derive how to compute the parameters of a Markov Chain from data using Maximum Likelihood, as we did for random sequences. • The ML estimate of the transition matrix is again very intuitive: the estimated probability of moving from state i to state j is the number of observed transitions from i to j, divided by the total number of transitions out of state i. • Remember that P(X,Y)=P(X|Y)P(Y), so the likelihood factorises into transition probabilities and each row of the matrix can be maximised separately, exactly as in the coin example.
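
A sketch of this counting procedure (the training sequence at the bottom is an illustrative one, not the exercise sequence from slide 20):

```python
from collections import defaultdict

def estimate_transition_matrix(sequence):
    """ML estimate: count the transitions i -> j, then normalise each row
    by the total number of transitions out of state i."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, curr in zip(sequence, sequence[1:]):
        counts[prev][curr] += 1
    return {i: {j: n / sum(row.values()) for j, n in row.items()}
            for i, row in counts.items()}

# Illustrative training sequence:
print(estimate_transition_matrix(['a', 'b', 'a', 'a', 'b', 'b', 'a']))
# {'a': {'b': 2/3, 'a': 1/3}, 'b': {'a': 2/3, 'b': 1/3}}
```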

  15. Simple example • If it is raining today, it will rain tomorrow with probability 0.8; so the probability that it will not rain tomorrow is 0.2 • If it is not raining today, it will rain tomorrow with probability 0.6; so the probability that it will not rain tomorrow is 0.4 • Build the transition matrix • Be careful which numbers need to sum to one and which don't. Such a matrix is called a stochastic matrix. • Q: It rained all week, including today. What does this model predict for tomorrow? Why? What does it predict for a day from tomorrow? (*Homework)
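
A sketch of this example in code; rows of the matrix are indexed by today's weather and columns by tomorrow's, and each row sums to one. The two-step prediction comes from multiplying the matrix by itself; evaluating it is left as the homework.

```python
import numpy as np

# States: 0 = rain, 1 = no rain.  Rows are today's state, columns tomorrow's;
# each row of a stochastic matrix sums to one (the columns need not).
P = np.array([[0.8, 0.2],
              [0.6, 0.4]])

# It rained today: the distribution over tomorrow's weather is the 'rain' row.
print(P[0])          # [0.8, 0.2]

# The distribution two days ahead is the corresponding row of P @ P
# (evaluating it is the homework exercise).
P2 = P @ P
```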

  16. Examples of Web Applications • HTTP request prediction: • Predict the probabilities of the next requests from a user based on the history of requests from that client. • Adaptive Web navigation: • Build a navigation agent which suggests which other links would be of interest to the user, based on the statistics of previous visits. • The predicted link does not strictly have to be a link present in the Web page currently being viewed. • Tour generation: • Given a starting URL as input, generate a sequence of states (URLs) by running the Markov chain.

  17. Building Markov Models from Web Log Files • A Web log file is a collection of records of user requests for documents on a Web site, for example: 177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] "GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 "http://www.ulst.ac.uk/studaffairs/accomm.html" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)" • The transition matrix can be seen as a graph • Link pair: (r - referrer, u - requested page, w - hyperlink weight) • Link graph: the state diagram of the Markov Chain • a directed weighted graph • a hierarchy from the homepage down to multiple levels
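
A sketch of how (referrer, requested page) link pairs could be extracted from such records; the regular expression below is an assumption about the combined log format, not code from the lecture:

```python
import re

# Combined log format: ... "GET <url> HTTP/1.1" status size "<referrer>" "<agent>"
LOG_PATTERN = re.compile(r'"GET (?P<u>\S+) [^"]*" \d+ \d+ "(?P<r>[^"]*)"')

record = ('177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] '
          '"GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 '
          '"http://www.ulst.ac.uk/studaffairs/accomm.html" '
          '"Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"')

m = LOG_PATTERN.search(record)
if m:
    # Link pair (r = referrer, u = requested page); counting such pairs over the
    # whole log gives the hyperlink weights w of the link graph.
    print(m.group('r'), '->', m.group('u'))
```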

  18. Link Graph: an example (University of Ulster site) Zhu et al. 2002 State diagram: - Nodes: states - Weighted arrows: number of transitions

  19. Experimental Results (Sarukkai, 2000) • Simulations: • 'Correct link' refers to the actual link chosen at the next step. • 'Depth of the correct link' is measured by counting the number of links which have a probability greater than or equal to that of the correct link. • Over 70% of correct links are in the top 20 scoring states. • Difficulties: very large state space

  20. Simple exercise • Build the Markov transition matrix of the following sequence: [a b b a c a b c b b d e e d e d e d] State space: {…………….}

  21. Further topics • Hidden Markov Model • Does not make the Markov assumption on the observed sequence • Instead, it assumes that the observed sequence was generated by another sequence which is unobservable (hidden), and this other sequence is assumed to be Markovian • More powerful • Estimation is more complicated • Aggregate Markov model • Useful for clustering sub-graphs of a transition graph

  22. HMM at an intuitive level • Suppose that we know all the parameters of the following HMM, as shown on the state-diagram below. What is the probability of observing the sequence [A,B] if the initial state is S1? The same question if the initial state is chosen randomly with equal probabilities. ANSWER: If the initial state is S1: 0.2*(0.4*0.8+0.6*0.7) = 0.148. In the second case: 0.5*0.148+0.5*0.3*(0.3*0.7+0.7*0.8) = 0.1895.
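
The state diagram itself is not reproduced in this transcript, so the sketch below uses parameters read off from the worked answer (inferred, not quoted from the slide) and reproduces both probabilities with a two-step forward computation:

```python
# Parameters inferred from the worked answer above (the original state diagram
# is not included in this transcript):
emit = {'S1': {'A': 0.2, 'B': 0.8},
        'S2': {'A': 0.3, 'B': 0.7}}
trans = {'S1': {'S1': 0.4, 'S2': 0.6},
         'S2': {'S1': 0.7, 'S2': 0.3}}

def prob_AB(initial):
    """Forward computation of P([A, B]) for a given initial state distribution:
    sum over the hidden state emitting A and the hidden state emitting B."""
    total = 0.0
    for s1, p1 in initial.items():              # hidden state emitting A
        for s2, p12 in trans[s1].items():       # hidden state emitting B
            total += p1 * emit[s1]['A'] * p12 * emit[s2]['B']
    return total

print(prob_AB({'S1': 1.0}))                # 0.148
print(prob_AB({'S1': 0.5, 'S2': 0.5}))     # 0.1895
```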

  23. Conclusions • Probabilistic Model • Maximum Likelihood parameter estimation • Random sequence model • Markov chain model --------------------------------- • Hidden Markov Model • Aggregate Markov Model

  24. Any questions?
