Probability and Statistics for Data Mining. COMP5318.
Question 1 • Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?
Probability • Probability is the mathematical language for reasoning about uncertainty. • We need to make decisions in the presence of uncertainty, which is ever present. • Example: The Earth is warming, a phenomenon known as Global Warming (GW). Is modern human activity the cause of GW? • Physics-driven approach • Data-driven approach
Experiments and Observation • When an experiment is carried out we observe the outcome, which is often uncertain. • If it were not uncertain, why carry out the experiment? • We look into a random shopping basket. Does it contain a packet of “Tofu”? • We toss a coin. Does it land on “Heads”? • We ask a question: “Is it raining in Broome, WA, right now?”
Building Blocks of Probability • The space of all possible outcomes is called the sample space. • Choosing the sample space is non-trivial. • Single Coin Toss: the space is {H, T}. • Shopping Basket: the space of all possible combinations of all items sold in the store. • Shopping Basket (a coarser alternative): {Tofu, Not-Tofu}.
Events • Events are subsets of the sample space. Events are often defined in familiar terms. • In the shopping basket scenario: • A vegetarian shopping basket is an event. • It consists of all possible vegetarian item combinations. • Throw of a die. The event we are looking for could be: Even Number = {2,4,6}, where the sample space = {1,2,3,4,5,6}
Events • Let G be the set of all galaxies. Characterize each galaxy by three numbers: • d: distance from Earth • a: major axis • b: minor axis • Elliptic Galaxies (EG) • EG = {(a,b,d) | a/b > 1.5} • Distant Spiral Galaxies (DSG) • DSG = {(a,b,d) | a/b <= 1.5 and d > 10}
Events • Let G be the set of all genes. Each gene can be “on” or “off”. Let E correspond to the event: all genes which are “on” when the skin cells are “starved”.
Events are Sets • At the most basic level events are sets. Therefore we can carry out set union, difference and intersection on events. • For example: • E1: shopping baskets which contain Tofu • E2: shopping baskets which contain Milk • E1 ∪ E2: shopping baskets which contain Tofu or Milk (or both)
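The set operations on events can be tried directly with Python sets. This is a minimal sketch on the die sample space from the earlier slide; the second event, `greater_than_three`, is a hypothetical event introduced only for illustration.

```python
# Sample space for one throw of a die, with events represented as Python sets.
sample_space = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
greater_than_three = {4, 5, 6}  # hypothetical second event, not from the slides

union = even | greater_than_three          # even OR greater than three
intersection = even & greater_than_three   # even AND greater than three
difference = even - greater_than_three     # even but NOT greater than three
complement = sample_space - even           # not even

print(union, intersection, difference, complement)
```

The same operators apply to the shopping basket events E1 and E2 once each basket is given an identifier.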
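Probability • Let S be the space of all possible elementary outcomes. Let F = Power(S) be the power set of S. Then a probability P is a function P : F → [0,1] that satisfies the following properties (axioms): • P(S) = 1 • P(A) ≥ 0 for every event A in F • If A1, A2, … are pairwise disjoint events then P(A1 ∪ A2 ∪ …) = P(A1) + P(A2) + …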
Interpretation of Probability • Physical or Ontological: Long-term frequency • 50% chance that a coin will land on heads. • 20% of all Woolworths shopping baskets are vegetarian. • 22% of all Woolworths shopping baskets in Northbridge Plaza are vegetarian. • Epistemological: Degree of Belief • 20% chance that my neighbours are watering their lawn on “dry” days. • 99% chance that the green immovable object outside my house is a tree. • 90% chance that Australia will win the cricket world cup.
Example • Two coin tosses. Let H1 be the event that heads occurs on toss 1 and H2 the event that heads occurs on toss 2. All outcomes are equally likely. • Sample space = {HH, HT, TH, TT} • H1 = {HH, HT} • H2 = {HH, TH} • P(H1 ∪ H2) = P(H1) + P(H2) − P(H1 ∩ H2) = ½ + ½ − ¼ = ¾
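The two-coin calculation can be checked by enumerating the sample space. The sketch below assumes equally likely outcomes, as the slide does; the helper name `prob` is only illustrative.

```python
from fractions import Fraction
from itertools import product

# All outcomes of two coin tosses: ('H','H'), ('H','T'), ('T','H'), ('T','T').
sample_space = set(product("HT", repeat=2))

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

h1 = {w for w in sample_space if w[0] == "H"}  # heads on toss 1
h2 = {w for w in sample_space if w[1] == "H"}  # heads on toss 2

direct = prob(h1 | h2)                                       # count the union directly
inclusion_exclusion = prob(h1) + prob(h2) - prob(h1 & h2)    # P(H1) + P(H2) - P(H1 ∩ H2)
print(direct, inclusion_exclusion)
```

Both expressions give 3/4, because the outcome HH would otherwise be counted twice.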
Example • Two events A and B are independent if • P(A ∩ B) = P(A)P(B) • P(A ∩ B) is also written as P(AB) and P(A,B). • If A and B are disjoint events with P(A) > 0 and P(B) > 0, then A and B cannot be independent: • P(A ∩ B) = 0, yet P(A)P(B) > 0 • Except for this case, you cannot determine independence by looking at a Venn diagram.
Question • A shopping basket can either be kosher or not. The probability that it will be kosher is 3/4. Examine 10 baskets at a checkout counter. What is the probability that there will be at least one kosher basket?
Answer • Let E be the event “At least one kosher basket.” Let NKi be the event that the i-th basket is non-kosher. • P(E) = 1 − P(NK1 ∩ NK2 ∩ … ∩ NK10) • = 1 − P(NK1)P(NK2)…P(NK10) (by independence) • = 1 − (1/4)^10 ≈ 0.999999
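The product rule for independence can be tested on the two-coin sample space. This is a sketch under the assumption of equally likely outcomes; `independent` is a hypothetical helper, and the event "tails on toss 1" is added here only to show a disjoint pair.

```python
from fractions import Fraction
from itertools import product

sample_space = set(product("HT", repeat=2))

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

def independent(a, b):
    """True when P(A ∩ B) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)

h1 = {w for w in sample_space if w[0] == "H"}  # heads on toss 1
h2 = {w for w in sample_space if w[1] == "H"}  # heads on toss 2
t1 = sample_space - h1                         # tails on toss 1, disjoint from h1

h1_h2_independent = independent(h1, h2)
h1_t1_independent = independent(h1, t1)
print(h1_h2_independent, h1_t1_independent)
```

The disjoint pair fails the test because its intersection has probability 0 while both events have probability 1/2.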
Example • For an Online Book Seller (OBS) the conversion rate is 1/100, i.e., on average one in every 100 visitors ends up making a purchase. What is the probability that at least one purchase will be made in 10 consecutive visits (by distinct customers)?
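One way to work this out is through the complement, "no purchase in any of the 10 visits". The sketch below assumes the visits are independent, which the slide suggests by requiring distinct customers but does not state; `at_least_one` is a hypothetical helper name.

```python
from fractions import Fraction

def at_least_one(p, n):
    """P(at least one success in n independent trials with success probability p)."""
    return 1 - (1 - p) ** n

# Online Book Seller: conversion rate 1/100, 10 visits (independence is an assumption).
obs = at_least_one(Fraction(1, 100), 10)
print(float(obs))
```

Under that assumption the answer is 1 − (0.99)^10, a little under 10%. The same helper with p = 3/4 and n = 10 gives the kosher basket answer, 1 − (1/4)^10.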
Example • Two people take turns to sink a basketball. P1 succeeds with probability 1/3 and P2 with probability 1/4. What is the probability that P1 succeeds before P2? • Requires clever setting up of the events. • Let E be the event that P1 succeeds before P2. • Let Ai be the event that P1 succeeds before P2 on the i-th trial. • Ai ∩ Aj = Ø for i ≠ j, and E = A1 ∪ A2 ∪ A3 ∪ … (the union of Ai over i = 1 to ∞)
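Because the Ai are disjoint, P(E) is the sum of the P(Ai). The sketch below assumes P1 shoots first and that all shots are independent; neither assumption is stated on the slide. Under them, Ai requires both players to miss in each of the first i − 1 rounds and P1 to score on attempt i.

```python
from fractions import Fraction

p1 = Fraction(1, 3)  # P1 success probability
p2 = Fraction(1, 4)  # P2 success probability
both_miss = (1 - p1) * (1 - p2)  # one full round with no basket

def p_a(i):
    """P(Ai): i - 1 rounds of misses, then P1 scores (assumes P1 shoots first)."""
    return both_miss ** (i - 1) * p1

closed_form = p1 / (1 - both_miss)               # sum of the geometric series
partial_sum = sum(p_a(i) for i in range(1, 51))  # first 50 terms as a numerical check
print(closed_form, float(partial_sum))
```

With these numbers a round with no basket has probability 1/2, so the series sums to (1/3)/(1/2) = 2/3.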
Conditional Probability • Very Important Concept • P(A|B) is the “fraction of occurrences of B in which A also occurs” • P(A|B) = P(A ∩ B)/P(B); P(B) > 0 • For a fixed B, P(.|B) is a probability • Therefore if A1 and A2 are disjoint then • P(A1 ∪ A2 | B) = P(A1|B) + P(A2|B) • Note: in general, P(A | B ∪ C) ≠ P(A|B) + P(A|C) • Also, in general, P(A|B) ≠ P(B|A)
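The definition and the warning that P(A|B) and P(B|A) differ can be seen on a single die throw. This is a minimal sketch; the events `even` and `at_least_five` and the helper `cond_prob` are chosen here for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

def cond_prob(a, b):
    """P(A | B) = P(A ∩ B) / P(B); requires P(B) > 0."""
    if not b:
        raise ValueError("conditioning event must have positive probability")
    return prob(a & b) / prob(b)

even = {2, 4, 6}
at_least_five = {5, 6}

p_even_given_high = cond_prob(even, at_least_five)  # only 6 is even among {5, 6}
p_high_given_even = cond_prob(at_least_five, even)  # only 6 is >= 5 among {2, 4, 6}
print(p_even_given_high, p_high_given_even)
```

The two conditional probabilities are 1/2 and 1/3: same intersection, different conditioning event.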
Standard Example • Suppose a test is positive. What is the probability of disease? • D is disease status (+/−); the test is either positive or negative.
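The slide's own numbers are not reproduced here, so the sketch below uses purely hypothetical values (1% prevalence, 90% sensitivity, 10% false-positive rate) to show how P(disease | positive test) follows from the definition of conditional probability.

```python
from fractions import Fraction

# All three numbers are hypothetical, chosen only to illustrate the calculation.
prevalence = Fraction(1, 100)          # P(D+)
sensitivity = Fraction(9, 10)          # P(test+ | D+)
false_positive_rate = Fraction(1, 10)  # P(test+ | D-)

# P(test+) by splitting on disease status.
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# P(D+ | test+) = P(D+ and test+) / P(test+)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(p_disease_given_positive, float(p_disease_given_positive))
```

With these made-up inputs a positive test leaves the probability of disease at 1/12, which illustrates why P(disease | positive) should not be confused with P(positive | disease).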
Standard Data Mining Example • Suppose the data above closely resembles the behaviour of the population at large. What is the chance that those who buy Diapers will also buy Beer? • P(Beer | Diaper) = P(Diaper ∩ Beer)/P(Diaper) = 0.6/0.8 = 0.75 • Is Diaper an event?
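The transaction table the slide refers to is not reproduced here. The sketch below uses a hypothetical five-basket data set, chosen so that it matches the slide's figures of 0.8 and 0.6, to show how the conditional probability is computed from raw transactions. Treating "Diaper" as the set of baskets that contain a diaper is what makes it an event.

```python
from fractions import Fraction

# Hypothetical transactions, constructed to reproduce the slide's 0.8 and 0.6.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for t in transactions if items <= t)
    return Fraction(hits, len(transactions))

p_diaper = support({"Diaper"})                # P(Diaper)
p_both = support({"Diaper", "Beer"})          # P(Diaper ∩ Beer)
p_beer_given_diaper = p_both / p_diaper       # P(Beer | Diaper)
print(p_diaper, p_both, p_beer_given_diaper)
```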
Conditional Independence • If A and B are independent and P(B) > 0 then P(A|B) = P(A) • In general, P(AB) = P(A|B)P(B) • Law of Total Probability.
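The slide names the Law of Total Probability without stating it. A standard statement is given below, where the partition $B_1, \ldots, B_n$ of the sample space $S$ is notation introduced here rather than taken from the slides.

```latex
% Assumption: B_1, ..., B_n are pairwise disjoint, their union is S,
% and P(B_i) > 0 for every i.
\[
  P(A) \;=\; \sum_{i=1}^{n} P(A \cap B_i) \;=\; \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)
\]
```

The first equality follows from additivity over the disjoint pieces $A \cap B_i$, and the second from the product rule P(AB) = P(A|B)P(B).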
Question 1 • Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?
Answer to Question 1 • But what do G=F and D=Y mean? We have not even formally defined them.