Introduction to information theory LING 572 Fei Xia, Dan Jinguji Week 1: 1/10/06
Today • Information theory • Hw #1 • Exam #1
Information theory • Reading: M&S 2.2 • Information theory is the use of probability theory to quantify and measure "information". • Basic concepts: • Entropy • Cross entropy and relative entropy • Joint entropy and conditional entropy • Entropy of a language and perplexity • Mutual information
Entropy • Entropy is a measure of the uncertainty associated with a distribution. • It gives the lower bound on the average number of bits needed to transmit messages drawn from that distribution. • An example: • Display the results of horse races. • Goal: minimize the number of bits needed to encode the results.
An example • Uniform distribution: p_i = 1/8 for each of the 8 horses => 3-bit codes (000, 001, ..., 111). • Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64) => codes (0, 10, 110, 1110, 111100, 111101, 111110, 111111), with expected code length 2 bits. • The uniform distribution has higher entropy (3 bits vs. 2 bits). • MaxEnt: make the distribution as "uniform" as possible.
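A minimal Python sketch of this example (not part of the original slides): it computes the entropy of both distributions and checks that the expected length of the prefix code above equals the entropy of the non-uniform distribution.

    import math

    def entropy(probs):
        # H(p) = -sum_x p(x) * log2 p(x); terms with p(x) = 0 contribute 0
        return -sum(p * math.log2(p) for p in probs if p > 0)

    uniform = [1/8] * 8
    skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
    code_lengths = [1, 2, 3, 4, 6, 6, 6, 6]  # lengths of the codes 0, 10, ..., 111111

    print(entropy(uniform))   # 3.0 bits: 3-bit codes for 8 equally likely horses
    print(entropy(skewed))    # 2.0 bits
    print(sum(p * l for p, l in zip(skewed, code_lengths)))  # 2.0: expected code length matches entropy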
Cross Entropy • Entropy: $H(X) = -\sum_x p(x) \log_2 p(x)$ • Cross entropy: $H(X, q) = -\sum_x p(x) \log_2 q(x)$ • Cross entropy is a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).
Relative Entropy • Also called Kullback-Leibler (KL) divergence: $D(p \| q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}$ • Another "distance" measure between probability functions p and q. • KL divergence is asymmetric (not a true distance): in general $D(p \| q) \neq D(q \| p)$.
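A small sketch of cross entropy and KL divergence (the distributions p and q below are made-up illustrations, not from the slides):

    import math

    def cross_entropy(p, q):
        # H(p, q) = -sum_x p(x) * log2 q(x)
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    def kl_divergence(p, q):
        # D(p || q) = sum_x p(x) * log2(p(x) / q(x)) = H(p, q) - H(p)
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.9, 0.1]  # "true" distribution
    q = [0.5, 0.5]  # our estimate of p
    print(cross_entropy(p, q))  # 1.0 bit
    print(kl_divergence(p, q))  # ~0.531 bits
    print(kl_divergence(q, p))  # ~0.737 bits: D(p||q) != D(q||p), so not a true distance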
Reading assignment #1 • Read M&S 2.2: Essential Information Theory • Questions: For a random variable X, p(x) and q(x) are two distributions, where p is the true distribution: • p(X=a)=p(X=b)=1/8, p(X=c)=1/4, p(X=d)=1/2 • q(X=a)=q(X=b)=q(X=c)=q(X=d)=1/4 • (a) What is H(X)? (b) What is H(X, q)? (c) What is the KL divergence D(p||q)? (d) What is D(q||p)?
Joint and conditional entropy • Joint entropy: $H(X, Y) = -\sum_x \sum_y p(x, y) \log_2 p(x, y)$ • Conditional entropy: $H(Y|X) = -\sum_x \sum_y p(x, y) \log_2 p(y|x) = H(X, Y) - H(X)$
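A sketch of joint and conditional entropy from a small joint distribution table (the numbers are invented for illustration):

    import math

    def entropy(probs):
        # H = -sum p * log2 p, skipping zero-probability entries
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # p(x, y) as a nested list: rows are values of X, columns are values of Y
    joint = [[1/4, 1/4],
             [1/2, 0.0]]

    h_xy = entropy([p for row in joint for p in row])  # H(X, Y) = 1.5 bits
    h_x = entropy([sum(row) for row in joint])         # H(X) = 1.0 bit (marginal of X)
    h_y_given_x = h_xy - h_x                           # chain rule: H(Y|X) = H(X,Y) - H(X) = 0.5
    print(h_xy, h_x, h_y_given_x)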
Entropy of a language (per-word entropy) • The entropy of a language L: $H(L) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 p(x_{1n})$ • If we assume the language is "nice" (stationary and ergodic), the cross entropy can be calculated from a single long sample: $H(L, m) = -\lim_{n \to \infty} \frac{1}{n} \log_2 m(x_{1n})$
Per-word entropy (cont) • $p(x_{1n})$ can be calculated by n-gram models • Ex: unigram model: $p(x_{1n}) = \prod_{i=1}^{n} p(x_i)$
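A sketch of per-word cross entropy under a unigram model (the toy corpus is invented; MLE estimates with no smoothing, so every test word must also occur in the training data):

    import math
    from collections import Counter

    def unigram_cross_entropy(train_tokens, test_tokens):
        # estimate unigram probabilities p(w) from training counts (MLE, unsmoothed)
        counts = Counter(train_tokens)
        total = sum(counts.values())
        # per-word cross entropy: -(1/n) * sum_i log2 p(w_i) over the n test tokens
        n = len(test_tokens)
        return -sum(math.log2(counts[w] / total) for w in test_tokens) / n

    train = "the cat saw the dog".split()
    test = "the dog saw the cat".split()
    h = unigram_cross_entropy(train, test)
    print(h)       # ~1.92 bits per word
    print(2 ** h)  # ~3.79: the corresponding perplexity (next slide)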
Perplexity • Perplexity is $2^H$. • Perplexity is the weighted average number of choices a random variable has to make. => We learned how to calculate perplexity in LING570.
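Worked example using the horse-race numbers above: the uniform distribution has H = 3 bits, so its perplexity is $2^3 = 8$, i.e., on average the model is as uncertain as if it were choosing among 8 equally likely outcomes; the non-uniform distribution has H = 2 bits and perplexity $2^2 = 4$.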
Mutual information • It measures how much information X and Y share: $I(X;Y) = \sum_x \sum_y p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)} = D(p(x, y) \| p(x) p(y))$ • Mutual information is symmetric: $I(X;Y) = I(Y;X)$
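A sketch that computes mutual information from a joint distribution table and checks its symmetry (the table reuses the made-up numbers from the joint-entropy sketch above):

    import math

    def mutual_information(joint):
        # I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
        p_x = [sum(row) for row in joint]
        p_y = [sum(col) for col in zip(*joint)]
        return sum(pxy * math.log2(pxy / (p_x[i] * p_y[j]))
                   for i, row in enumerate(joint)
                   for j, pxy in enumerate(row) if pxy > 0)

    joint = [[1/4, 1/4],
             [1/2, 0.0]]
    swapped = [list(col) for col in zip(*joint)]  # transpose: swap the roles of X and Y
    print(mutual_information(joint))    # ~0.311 bits
    print(mutual_information(swapped))  # same value: I(X;Y) = I(Y;X)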
Summary on Information theory • Reading: M&S 2.2 • Information theory is the use of probability theory to quantify and measure "information". • Basic concepts: • Entropy • Cross entropy and relative entropy • Joint entropy and conditional entropy • Entropy of a language and perplexity • Mutual information
Hw1 • Q1-Q5: Information theory • Q6: Condor submit • Q7: Hw10 from LING570. • You are not required to turn in anything for Q7. • If you want feedback on this, you can choose to turn it in. • It won’t be graded. You get 30 points for free.
Q6: condor submission • http://staff.washington.edu/brodbd/orientation.pdf • Especially Slides #22-#28.
For a command we can run as: mycommand -a -n <mycommand.in >mycommand.out The submit file (save it as *.cmd) might look like this:
    # the command to run
    Executable = mycommand
    Universe = vanilla
    getenv = true
    # STDIN, STDOUT, and STDERR for the job
    input = mycommand.in
    output = mycommand.out
    error = mycommand.error
    # a log file that stores the results of the condor submission
    Log = /tmp/brodbd/mycommand.log
    # the arguments for the command
    arguments = "-a -n"
    transfer_executable = false
    Queue
Submitting and monitoring jobs on condor • Submission: condor_submit mycommand.cmd => returns a job number • List the job queue: condor_q • Status changes from "I" (idle) to "R" (running) • "H" (held) means the job failed; look at the log file specified in *.cmd • When the job disappears from the queue, it has finished; you will receive an email • Use "man condor_q" etc. to learn more about these commands.
The path names for files in *.cmd • In the *.cmd file: Executable = aa194.exec, input = file1 • The environment (e.g., ~/.bash_profile) might not be set up properly • Condor assumes the files are in the current directory (the directory where the job was submitted) => Use full path names if needed.
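A hypothetical illustration (the directory /home2/username is made up): the same *.cmd lines with full path names, which avoid the current-directory assumption:

    # full path names instead of aa194.exec and file1
    Executable = /home2/username/aa194.exec
    input = /home2/username/file1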