440 likes | 654 Views
DATA 220 Mathematical Methods for Data Analysis September 17 Class Meeting. Department of Applied Data Science San Jose State University Fall 2019 Instructor: Ron Mak www.cs.sjsu.edu/~mak. Some Counting Principles.
E N D
DATA 220Mathematical Methods for Data AnalysisSeptember 17 Class Meeting Department of Applied Data ScienceSan Jose State UniversityFall 2019Instructor: Ron Mak www.cs.sjsu.edu/~mak
Some Counting Principles • To perform statistics, it is important to know how to count the size of a population or the size of a sample drawn from the population. • Example: Your video streaming service has these types of movies: You want to watch three movies tonight, one of each type. There are different combinations of movies you can watch. The number of ways of making a sequence of independent choices is the product of the number of choices at each step.
Some Counting Principles, cont’d • Example: There are 26 letters in the alphabet. How many 1-letter sequences can you make? 2-letter sequences? etc. The number of sequences of k objects chosen from a collection of n objects is nk.
Some Counting Principles, cont’d • Example: This time, there can be no repeated letters within a sequence. Within a sequence, each position has one fewer number of choices than the preceding position. The number of sequences of k objects chosen without repetition from a collection of n objects is
Some Counting Principles, cont’d • Example: How many 3-digit numbers are there using the digits 1 through 9 that have no repeated digits? • Example: How many of those 504 numbers are odd, i.e., how many have 1, 3, 5, 7, or 9 as their third digit?How many choices are there for the third digit? It depends on what we chose for the first two digits!
Some Counting Principles, cont’d • Solution #1: There are 4 even digits: 2, 4, 6, and 8.There are 5 odd digits: 1, 3, 5, 7, and 9.
Some Counting Principles, cont’d • Solution #2: Instead of determining the numbers of digit choices from left to right, start with the third digit and work from right to left:
Factorial Notation • The product is n factorial, written n! • n! grows rapidly:
Factorial Notation, cont’d • Factorial notation can simplify formulas. • Example: Out of 15 children, you must choose a team of 9 children. Therefore, the number of teams you can make is and 6 children are left behind.We could have written this product all the way down to 1, i.e., as 15!, but then we must remove the final six products, by cancellation: Math shorthand only! We wouldn’t want to actually calculate this way.Just multiply the numbers from 15 down to 7 as before.
Factorial Notation, cont’d • In the children’s teams example, n = 15 children total and k = 9 children chosen for the team. The number of sequences of k objects chosen without repetition from a collection of n objects is The number of sequences of k objects chosen without repetition from a collection of n objects is
Count the Complement • More details about the streaming videos: • Total possible combinations of three movies, one from each type, is • Tonight, you again want to watch three movies, one of each type, but you do not want all three movies to have car chases. • With this restriction, how many movie combinations do you have?
Count the Complement, cont’d • Solution: • The total choices without restrictions is • The number of disallowed triple features is • Therefore, the number of allowable choices must be Choose three movies,one from each type. But not all three can have car chases.
Count the Complement, cont’d • Example: You want to line up 15 children. However, Mary and John don’t get along, so you must keep them separate. How many allowable line-ups? • Solution: The total number of possible line-ups: 15!How many ways can we put Mary and John together in line? There are 14 adjacent pairs of positions in a line-up, and either Mary or John can be in the first position of the pair: bad pairs. For each bad pair, there are 13! ways to line up the remaining children, for a total of unallowable line-ups.Therefore, the number of allowable line-ups is
Uncertainty • Data scientists make conclusions (inferences) about an entire collection of data (the population) based on evidence from samples drawn from the population. • Use the methods of analytical statistics. • Because a sample is not necessarily identical to the population from which it was drawn, there is some uncertainty about the inferences. • For example, how accurately does the sample mean and sample standard deviation represent the corresponding population measures?
Probability • Probability theory creates mathematical models to study chance or randomness. • Probability is the tool that enables us to make inferences. • Our newly enhanced ability to count will be extremely useful.
Classical Interpretation of Probability • Originally arose from studying games of chance. • The probability that a flipped fair coin will land heads is 1/2. • The probability that a card drawn from a shuffled deck of 52 cards is an ace is 4/52.
Classical Interpretation, cont’d • Each distinct result is an outcome. • An event is a collection of outcomes. • The probability of an event E is the ratio of the number of outcomes Nefavorable to event Eto the total number of possible outcomes N. • Example: Let Ne = 4 ace cards and N = 52 total cards. Then P(drawing an ace card) = 4/52.
Relative Frequency Interpretation of Probability • An empirical approach that uses experiments to count the occurrences of event E. • Repeat an experiment a number of times. If event E occurs 30% of the time, then 0.3 can be a good approximation to the probability of event E. • If n is the number of trials of the experiment and event E occurs on ne of those trials, then • The approximation improves with larger values of n.
Some Basic Probability Laws • For any event A: • The probability of event A occurring ranges from never (probability 0) to always (probability 1). • The complement of an event Ais the event that A does not occur: • If A and B are mutually exclusive events (both cannot occur simultaneously), then:
Some Basic Probability Laws, cont’d • Example: In an arbitrary line-up of 15 children, what is the probability of a bad line-up, where Mary and John are together? • Example: What is the probability of a good line-up? • Example: What is the probability that a die roll will result in either an even number or the number 7? =
Break • 15 minutes
Random Variables • random variable: A numerical variable whose value depends on the outcome of a chance experiment. • Computers use algorithms to generate pseudo-random values. • True random values are often derived from measurements of natural phenomena.
Discrete Random Variables • A discrete random variable has a set of values that is a collection of isolated points on the number line. • A common set values for a discrete random variable is a subset of integers. • In a dataset, we can treat a sequence of numerical values as discrete random values if the values don’t depend on each other. • Example: The ages of the passengers in the Titanic Survival dataset.
Continuous Random Variables • A continuous random variable has a set of possible values in an entire interval of the number line. • The values are floating-point (real) numbers. • Example: You break a meter stick in two. The distance x from one end of the stick to where the break occurs is a continuous random value. x = 0.2 m is possible. So is x = 0.72 m. In fact, any value of x from 0 to 1 meter is possible. Therefore, x is a continuous random variable. What if you rounded x to the nearest millimeter?
Probability Distributions of Random Variables • Let x be a discrete random variable associated with an experiment whose results are determined by chance. • The outcome that occurs when the experiment is performed determines which value of x is observed. • The total probability for all outcomes is 1. • The probability distribution of x describes how much of this total probability is placed on each possible x value.
Graphs of Probability Distributions • We can graph a probability distribution. • The x axis is the possible values of variable x. • The y axis is the probability of each x value. • There are several common probability distributions, each with a different shape when the distributions are graphed. • The more random values from a particular distribution are graphed, the closer the graphgets to the theoretical shape of the distribution.
Graphs of Probability Distributions, cont’d • By graphing the random values of a variable, we can often determine its probability distribution. • A typical task of data analysis is to examine a sequence of values in a dataset and determine its probability distribution. • Knowing the probability distribution enables us to apply more advanced analytical procedures that are appropriate to that type of distribution.
Graphs of Probability Distributions, cont’d • Example: We drew a histogram of the ages of the Titanic passengers:From the rough shape of the graph, we may inferthat the ages have a normal distribution.
Uniform Probability Distribution • Within the interval of possible values for this continuous random variable x, every interval of equal width has an equal probability of being observed. https://courses.lumenlearning.com/odessa-introstats1-1/chapter/the-uniform-distribution/ http://www.a-levelmathstutor.com/stat-discrete-rand-vars4.php
Normal Probability Distribution • The distribution of this continuous random variable x forms a bell curve. https://statisticsbyjim.com/basics/normal-distribution/
Exponential Distribution • The values of this continuous random variable are often concerned with the amount of time until some specific event occurs. • There are fewer large values and more small values. • Examples: • The amount of time until a customer finishes browsing and actually purchases something in your store. • The length of time of a phone call. • The amount of time you need to wait until the bus arrives.
Exponential Distribution, cont’d • Example: The length of time, in minutes, that a postal clerk spends with a customer: https://courses.lumenlearning.com/introstats1/chapter/the-exponential-distribution/
Exponential Distribution, cont’d • Example: The probability that a postal clerk will spend 4 to 5 minutes with a randomly selected customer: https://courses.lumenlearning.com/introstats1/chapter/the-exponential-distribution/
Binomial Distribution • Properties of a binomial experiment: • It consists of a fixed number n of trials. • Each trial can result in only two mutually exclusive outcomes, label success (S) and failure (F). • The outcomes of different trials are independent. • The probability p that a trial results in S is the same for each trial.
Binomial Distribution, cont’d • This discretebinomial variablex is the number of successes observed when the experiment is performed. • Variable x has a binomial distribution. • It’s related to anormal distributionfor a continuousrandom variable. You will perform 20 trials of abinomial experiment, where theprobability of success of eachtrial is p = 0.25. What is theprobability that you’ll havex successes from the 20 trials? http://www.real-statistics.com/binomial-and-related-distributions/binomial-distribution/
Poisson Distribution • The values of this discrete random variable x with a Poisson distribution is the number of events over an interval of time or space. • Example: The number of cars arriving at a toll booth during a given 5-minute period of time. The event is the arrival of the car. The time is the 5 minutes. • Example: The number of plastic particles in a liter of water sampled from the ocean. The event is the discovery of a plastic particle. The space is the liter of sampled water. The amount of time between Poisson events has an exponential distribution.
Poisson Distribution, cont’d • The graph increasingly resembles a normal curve as the number of values of x increases. https://www.sciencedirect.com/topics/social-sciences/poisson-distribution
Animated Graphs • Up until now, the graphs we’ve drawn with the Python packages were static (nothing moves). • Often it can be insightful to draw dynamic (animated) graphs. • Update the graph to include new data values. • The matplotlib.animation.FuncAnimationfunction repeatedly calls a function you’ve written. • It pauses a given duration in milliseconds between calls to your function.
Demo: Animated Graph of Die Throws • If you roll a die, which face comes up is a random variable. • It’s a uniformly distributed random variable that can have face values 1, 2, 3, 4, 5, or 6. • Each value has the probability 1/6. • The more times you roll the die, the closer the frequencies of the face values become equal. • Animate the frequency (probability) distribution bar chart as the number of rolls increases.
Demo: Animated Graph of Die Throws • Function show_bar_chart(written for the Titanic dataset analysis) draws a bar chart of the frequency of each of the six face values. • Function update_framegenerates new random face values each time it’s called. Then it calls function show_bar_chart. • matplotlib.animation.FuncAnimationcalls function update_framefor each animation frame, which occurs every 25 milliseconds (or any duration you specify).
Demo: Animated Graph of Die Throws, cont’d • Run the program on the command line of a terminal window (not in a Jupyter notebook): • This program requires two arguments on the command line, the number of frames (e.g., 100) and the number of die rolls per frame (e.g., 10). • The following Python statements obtain the values of the first and second command-line parameters. ipython -iRollDieDynamic.py 100 10 import sys number_of_frames = int(sys.argv[1]) rolls_per_frame = int(sys.argv[2])
Lab Assignment #4: Animated Graphs • In four Python programs, draw animated bar charts of the following probability distributions: • normal • exponential • binomial • Poisson • Each frame should update the graph with more new random values. • You choose the parameters (range, mean, standard deviation, etc.) for each graph. You can draw frequency distributiongraphs instead. Use the dynamic dieprogram as a model
Lab Assignment #4, cont’d • Each animated graph should display: • A running count of how many random values are graphed. • The parameters used (mean, standard deviation, etc.) to generate the random values. • Labeled x and y axes. • A label at the top of each bar that indicates the bar’s value. • Go online to find the which Python functions will generate random values for each distribution.
Lab Assignment #4, cont’d • You can write Python programs instead of creating Jupyter notebooks. • Seaborn has difficulty displaying animation inside of notebooks. • Try the Eclipse Python plugin! • Submit to Canvas: Assignment #4 • Due Monday, September 23 at 11:59 PM