Financial Data Mining and Analysis • References: • Prof. Hua Chen’s lecture notes (National Taiwan University) • William J. Holstein, U.S. News and World Report, Business & Technology section, 12/21/98 • Prof. Juran’s lecture note 1 (Columbia University) • J.H. Friedman (1999), Data Mining and Statistics, technical report, Dept. of Statistics, Stanford University
Main Goal • Study statistical tools useful in managerial decision making. • Most management problems involve some degree of uncertainty. • People have poor intuitive judgment of uncertainty. • IT revolution... abundance of available quantitative information • data mining: large databases of info, ... • market segmentation & targeting • stock market data • almost anything else you may want to know... • What conclusions can you draw from your data? • How much data do you need to support your conclusions?
Applications in Management • Operations management • e.g., model uncertainty in demand, production function... • Decision models • portfolio optimization, simulation, simulation-based optimization... • Capital markets • understand risk, hedging, portfolios, betas... • Derivatives, options, ... • it is all about modeling uncertainty • Operations and information technology • dynamic pricing, revenue management, auction design, ... • Data mining... many applications
Portfolio Selection • You want to select a stock portfolio of companies A, B, C, … • Information: Stock Annual returns by year A 10%, 14%, 13%, 27%, … B 16%, 27%, 42%, 23%, … • Questions: • How do we measure the volatility of each stock? • How do we quantify the risk associated with a given portfolio? • What is the tradeoff between risk and returns?
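The three questions on this slide can be made concrete with a small calculation. The sketch below is illustrative only: it uses just the four annual returns listed for A and B (the years hidden behind the "…" are unknown), measures volatility by the sample standard deviation, and assumes a hypothetical 50/50 portfolio. The helper names are made up for this example.

```python
import math
import statistics

# Annual returns from the slide; only the four listed years are used.
returns_a = [0.10, 0.14, 0.13, 0.27]
returns_b = [0.16, 0.27, 0.42, 0.23]


def sample_covariance(xs, ys):
    """Sample covariance with an n - 1 divisor."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def portfolio_stats(weight_a, xs, ys):
    """Mean return and standard deviation of a two-stock portfolio."""
    weight_b = 1.0 - weight_a
    mean = weight_a * statistics.mean(xs) + weight_b * statistics.mean(ys)
    var = (weight_a ** 2 * statistics.variance(xs)
           + weight_b ** 2 * statistics.variance(ys)
           + 2 * weight_a * weight_b * sample_covariance(xs, ys))
    return mean, math.sqrt(var)


vol_a = statistics.stdev(returns_a)   # volatility of stock A
vol_b = statistics.stdev(returns_b)   # volatility of stock B
# Hypothetical 50/50 split between A and B (an assumption, not from the slide).
mean_half, vol_half = portfolio_stats(0.5, returns_a, returns_b)

print(f"A: sd={vol_a:.4f}  B: sd={vol_b:.4f}")
print(f"50/50 portfolio: mean={mean_half:.4f} sd={vol_half:.4f}")
```

Varying `weight_a` between 0 and 1 traces out the risk and return tradeoff for this pair of stocks.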
Introduction • Premise: All business becomes information driven. • The concept of data mining is becoming increasingly popular as a business information management tool, where it is expected to reveal knowledge structures that can guide decisions under limited certainty. • Competitiveness: How do you collect and exploit information to your advantage? • The challenges • Most corporate data systems are not ready. • Can they share information? • What is the quality of the input information? • Most data techniques come from the empirical sciences; the world is not a laboratory. • Defining good metrics; abandoning gut rules of thumb may be too "risky" for the manager. • Communicating success, setting the right expectations.
A visualization of a Naive Bayes model for predicting who in the U.S. earns more than $50,000 in yearly salary. The higher the bar, the greater the amount of evidence a person with this attribute value earns a high salary.
Data Mining and Statistics • Data mining is used to discover patterns and relationships in data, with an emphasis on large observational data bases. • It sits at the common frontiers of several fields, including Data Base Management, Artificial Intelligence, Machine Learning, Pattern Recognition, and Data Visualization. • From a statistical perspective, it can be viewed as computer-automated exploratory data analysis of large, complex data sets. • Many organizations have large transaction-oriented data bases used for inventory, billing, accounting, etc. These data bases were very expensive to create and are costly to maintain. For a relatively small additional investment, DM tools offer to discover highly profitable nuggets of information hidden in these data. • Data, especially large amounts of it, reside in data base management systems (DBMS). • Conventional DBMS are focused on online transaction processing (OLTP), that is, the storage and fast retrieval of individual records for purposes of data organization. They are used to keep track of inventory, payroll records, billing records, invoices, etc.
Data Mining Techniques • Data mining is an analytic process designed to • explore data (usually large amounts of - typically business or market related - data) in search of consistent patterns and/or systematic relationships between variables, and • validate the findings by applying the detected patterns to new subsets of data. • The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one with the most direct business applications. • The process of data mining consists of three stages: • the initial exploration, • model building or pattern identification with validation and verification, • deployment (i.e., the application of the model to new data in order to generate predictions).
Stage 1: Exploration • It usually starts with data preparation, which may involve cleaning data, data transformations, selecting subsets of records and - in the case of data sets with large numbers of variables ("fields") - performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods being considered). • Depending on the nature of the analytic problem, this first stage of the process of data mining may involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods, in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.
Stage 2: Model building and validation • This stage involves considering various models and choosing the best one based on predictive performance, that is, • explaining the variability in question and • producing stable results across samples. • This may sound like a simple operation, but in fact it sometimes involves a very elaborate process. • A variety of techniques have been developed for the "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best. • These techniques - which are often considered the core of predictive data mining - include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
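A minimal sketch of "competitive evaluation of models" follows. It does not implement bagging, boosting, or stacking; it only shows the underlying idea of fitting several candidate models on the same training data and comparing their errors on held-out data. The data are synthetic and the two candidate models (a constant predictor and a least-squares line) are assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: y depends linearly on x plus noise.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 3.0 + 2.0 * x + random.gauss(0, 1)) for x in xs]
train, test = data[:150], data[150:]   # hold out the last 50 rows


def fit_mean(rows):
    """Baseline model: always predict the training mean of y."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y


def fit_line(rows):
    """Least-squares straight line y = a + b * x."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    b = (sum((x - mx) * (y - my) for x, y in rows)
         / sum((x - mx) ** 2 for x, _ in rows))
    a = my - b * mx
    return lambda x: a + b * x


def mse(model, rows):
    """Mean squared prediction error of a model on the given rows."""
    return sum((y - model(x)) ** 2 for x, y in rows) / len(rows)


candidates = {"mean": fit_mean(train), "line": fit_line(train)}
scores = {name: mse(model, test) for name, model in candidates.items()}
best = min(scores, key=scores.get)   # model with the lowest held-out error
print(scores, "->", best)
```

Scoring on data that were not used for fitting is what checks the "stable results across samples" requirement.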
Models for Data Mining • In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organization. • In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements. • CRISP (Cross-Industry Standard Process for data mining) was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining. • The Six Sigma methodology is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities.
CRISP • CRISP postulates the following general sequence of steps for data mining projects:
Six Sigma • This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to be gaining favor worldwide. • It postulates a sequence of so-called DMAIC steps: Define (D) → Measure (M) → Analyze (A) → Improve (I) → Control (C). • It grew out of the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including "production of services," i.e., service industries). • Define. This phase is concerned with the definition of project goals and boundaries, and the identification of issues that need to be addressed to achieve the higher sigma level. • Measure. The goal of this phase is to gather information about the current situation, to obtain baseline data on current process performance, and to identify problem areas. • Analyze. The goal of this phase is to identify the root cause(s) of quality problems, and to confirm those causes using the appropriate data analysis tools. • Improve. The goal of this phase is to implement solutions that address the problems (root causes) identified during the previous (Analyze) phase. • Control. The goal of the Control phase is to evaluate and monitor the results of the previous phase (Improve).
Sampling • Objective: Determine the average amount of money spent in the Central Mall. • Sampling: A Central City official randomly samples 12 people as they exit the mall. • He asks them the amount of money spent and records the data. • Data for the 12 people:
Person  $ spent    Person  $ spent    Person  $ spent
1       $132       5       $123       9       $449
2       $334       6       $5         10      $133
3       $33        7       $6         11      $44
4       $10        8       $14        12      $1
• The official is trying to estimate the mean and variance of the population based on a sample of 12 data points.
Population versus Sample • A population is usually a group we want to know something about: • all potential customers, all eligible voters, all the products coming off an assembly line, all items in inventory, etc.... • Finite population: {u1, u2, ... , uN} versus infinite population • A population parameter is a number (θ) relevant to the population that is of interest to us: • the proportion (in the population) that would buy a product, the proportion of eligible voters who will vote for a candidate, the average number of M&M's in a pack.... • A sample is a subset of the population that we actually do know about (by taking measurements of some kind): • a group who fill out a survey, a group of voters that are polled, a number of randomly chosen items off the line.... • {x1, x2, ... , xn} • A sample statistic g(x1, x2, ... , xn) is often the only practical estimate of a population parameter. • We will use g(x1, x2, ... , xn) as a proxy for θ, but remember the difference between them.
Average Amount of Money Spent in the Central Mall • A sample (x1, x2, ... , xn) • Its mean is the sum of the values divided by the number of observations. • The sample mean, the sample variance, and the sample standard deviation are $107, 20,854 (in squared dollars), and $144.40, respectively. • We estimate that on average $107 is spent per shopper, with a standard deviation of $144.40.
The variance s² of a set of observations is the average of the squares of the deviations of the observations from their mean; for a sample of n observations, the sum of squared deviations is divided by n − 1. • The standard deviation s is the square root of the variance s². • How far are the observations from the mean? s² and s will be • large if the observations are widely spread about their mean, • small if they are all close to the mean.
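The definitions above can be checked directly against the 12 amounts recorded at the Central Mall. The short sketch below computes the mean, the sample variance (divisor n − 1), the population-style variance (divisor n) for comparison, and the standard deviation; the variable names are chosen for this example.

```python
import math

# Dollars spent by the 12 sampled shoppers (persons 1 to 12).
spent = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]

n = len(spent)
mean = sum(spent) / n
sum_sq_dev = sum((x - mean) ** 2 for x in spent)   # sum of squared deviations

var_sample = sum_sq_dev / (n - 1)   # sample variance s^2
var_divide_n = sum_sq_dev / n       # plain average of squared deviations
sd_sample = math.sqrt(var_sample)   # sample standard deviation s

print(f"mean={mean:.0f}  s^2={var_sample:.0f}  s={sd_sample:.2f}")
```

With these data the mean is $107, the sample variance is 20,854 squared dollars, and the standard deviation is about $144.41. The large standard deviation relative to the mean reflects how widely the amounts are spread, from $1 to $449.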
Stock Market Indexes • A stock market index is a statistical measure that shows how the prices of a group of stocks change over time. • Price-Weighted Index: DJIA • Market-Value-Weighted Index: Standard and Poor’s 500 Composite Index • Equally Weighted Index: Wilshire 5000 Equity Index • Price-Weighted Index: It shows the change in the average price of the stocks that are included in the index. • Notation: the price per share is P0 in the current period and P1 in the next period. • The number of shares outstanding is Q0 in the current period and Q1 in the next period.
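To see how the three weighting schemes differ, the sketch below computes the one-period change of each index type from P0, Q0, P1, and Q1. The three stocks and their numbers are hypothetical, and the formulas are simplified textbook versions; published indexes such as the DJIA use an adjusted divisor rather than a plain average.

```python
# Hypothetical three-stock market: name -> (P0, Q0, P1, Q1).
stocks = {
    "A": (10.0, 2000, 12.0, 2000),
    "B": (50.0, 200, 45.0, 200),
    "C": (100.0, 100, 110.0, 100),
}


def price_weighted_change(data):
    """Change in the simple average of share prices."""
    avg0 = sum(p0 for p0, _, _, _ in data.values()) / len(data)
    avg1 = sum(p1 for _, _, p1, _ in data.values()) / len(data)
    return avg1 / avg0 - 1


def value_weighted_change(data):
    """Change in total market value (price times shares outstanding)."""
    cap0 = sum(p0 * q0 for p0, q0, _, _ in data.values())
    cap1 = sum(p1 * q1 for _, _, p1, q1 in data.values())
    return cap1 / cap0 - 1


def equally_weighted_change(data):
    """Average of the individual stocks' percentage price changes."""
    return sum(p1 / p0 - 1 for p0, _, p1, _ in data.values()) / len(data)


pw = price_weighted_change(stocks)
vw = value_weighted_change(stocks)
ew = equally_weighted_change(stocks)
print(f"price-weighted {pw:.2%}, value-weighted {vw:.2%}, equally weighted {ew:.2%}")
```

The same price moves give three different index changes because high-priced stocks dominate the first index, large companies dominate the second, and every stock counts equally in the third.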
Data Analysis • Statistical thinking is understanding variation and how to deal with it. • Move as far as possible to the right on this continuum: Ignorance → Uncertainty → Risk → Certainty • Information science: learning from data • Probabilistic inference based on mathematics • What is Statistics? • What is the connection, if any? • Fields including Data Base Management, Artificial Intelligence
Probability: the study of randomness It is based on a lecture given by Professor Costis Maglaras at Columbia University.
Randomness • Coin tossing. • A phenomenon is random • if individual outcomes are uncertain but there is a regular distribution of outcomes in a large number of repetitions.
Probability • The probability of any outcome of a random phenomenon is • its long-term relative frequency, i.e., • the proportion of times the outcome would occur in a very long series of repetitions (empirical). • Trials need to be independent. • Computer simulation is a good tool to study random behavior. • The uses of probability • Began with gambling. • Now applied to analyze data in astronomy, mortality data, traffic flow, telephone interchange, genetics, epidemics, investment...
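As a small illustration of probability as long-run relative frequency, the sketch below simulates independent tosses of a fair coin and reports the proportion of heads for increasing numbers of tosses. The seed and the toss counts are arbitrary choices for this example.

```python
import random


def relative_frequency(n_tosses, seed=1):
    """Proportion of heads in n_tosses independent fair-coin tosses."""
    rng = random.Random(seed)   # seeded so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


freqs = {n: relative_frequency(n) for n in (10, 100, 10_000, 100_000)}
for n, f in freqs.items():
    print(f"{n:>7} tosses: proportion of heads = {f:.4f}")
```

Individual tosses are unpredictable, but the proportion of heads settles near 0.5 as the number of tosses grows, which is the regularity the slide describes.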
Probability Terms • Random Experiment: A process leading to at least 2 possible outcomes with uncertainty as to which will occur. • Event: An event is a subset of all possible outcomes of an experiment. • Intersection of Events: Let A and B be two events. Then the intersection of the two events, denoted A ∩ B, is the event that both A and B occur. • Union of Events: The union of the two events, denoted A ∪ B, is the event that A or B (or both) occurs. • Complement: Let A be an event. The complement of A (denoted Aᶜ) is the event that A does not occur. • Mutually Exclusive Events: A and B are said to be mutually exclusive if at most one of the events A and B can occur. • Basic Outcomes: The simple indecomposable possible results of an experiment. One and exactly one of these outcomes must occur. The set of basic outcomes is mutually exclusive and collectively exhaustive. • Sample Space: The totality of basic outcomes of an experiment.
Basic Probability Rules 1. For any event A, 0 ≤ P(A) ≤ 1. 2. If A and B can never both occur (they are mutually exclusive), then P(A and B) = P(A ∩ B) = 0. 3. P(A or B) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B). 4. If A and B are mutually exclusive events, then P(A or B) = P(A ∪ B) = P(A) + P(B). 5. P(Aᶜ) = 1 − P(A). Independent Events • Two events A and B are said to be independent if the fact that A has occurred or not does not affect your assessment of the probability of B occurring. Likewise, the fact that B has occurred or not does not affect your assessment of the probability of A occurring. 6. If A and B are independent events, then P(A and B) = P(A ∩ B) = P(A) P(B).
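The rules above can be verified by brute force on a small sample space. The sketch below assumes two fair dice, so that all 36 basic outcomes are equally likely, and uses exact fractions; the events A, B, and C are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered results of rolling two fair dice.
space = set(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event (a subset of equally likely outcomes)."""
    return Fraction(len(event), len(space))


A = {o for o in space if o[0] == 6}       # first die shows 6
B = {o for o in space if o[1] % 2 == 0}   # second die is even
C = {o for o in space if o[0] == 1}       # first die shows 1 (excludes A)

rule3 = prob(A | B) == prob(A) + prob(B) - prob(A & B)          # general union
rule4 = prob(A & C) == 0 and prob(A | C) == prob(A) + prob(C)   # mutually exclusive
rule5 = prob(space - A) == 1 - prob(A)                          # complement
rule6 = prob(A & B) == prob(A) * prob(B)                        # independence

print(rule3, rule4, rule5, rule6)
```

A and B concern different dice, so they are independent and rule 6 holds; A and C cannot both occur, so they are mutually exclusive but not independent.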
Probability models • Two parts in coin tossing. • A list of possible outcomes. • A probability for each outcome. • The Sample space S of a random phenomenon is the set of all possible outcomes. • Examples. S={heads, tails}={H,T} • General analysis is possible.
Event • An event is an outcome or a set of outcomes (that is, a subset of the sample space). • Example: in four coin tosses, the event "exactly two heads" is A = {HHTT, HTHT, HTTH, THHT, THTH, TTHH}. • Two events A and B are independent if knowing that one occurs does not change the probability that the other occurs. • If A and B are independent, P(A and B) = P(A)P(B). • The heads of successive coin tosses are {independent, not independent}. • The colors of successive cards dealt from the same deck are {independent, not independent}.
Conditional Probability • P(A∩B) = P(A|B)P(B) = P(B|A)P(A) • In these simple calculations, we are making use of the conditional probability formula: P(A|B) = P(A holds given that B holds) = P(A∩B)/P(B) • Rearranged as P(A|B) = P(B|A)P(A)/P(B), this relationship is known as Bayes' Law, after the English clergyman Thomas Bayes (1702-1761), who first derived it. Bayes' Law was later generalized by the French mathematician Pierre-Simon Laplace (1749-1827).
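A short worked case may help here. The sketch below assumes a standard 52-card deck with 26 red and 26 black cards, enumerates every ordered pair of distinct cards, and applies the conditional probability formula to the colors of two successive cards. It also answers the earlier question about whether those colors are independent.

```python
from fractions import Fraction
from itertools import permutations

# Standard deck reduced to colors: 26 red (R) and 26 black (B) cards.
deck = ["R"] * 26 + ["B"] * 26
deals = list(permutations(range(52), 2))   # ordered pairs of distinct cards


def prob(pred):
    """Probability that a random two-card deal satisfies pred."""
    return Fraction(sum(1 for d in deals if pred(d)), len(deals))


def first_red(d):
    return deck[d[0]] == "R"


def second_red(d):
    return deck[d[1]] == "R"


p_a = prob(first_red)                                   # P(A)
p_b = prob(second_red)                                  # P(B)
p_ab = prob(lambda d: first_red(d) and second_red(d))   # P(A and B)

p_b_given_a = p_ab / p_a                 # conditional probability formula
p_a_given_b = p_b_given_a * p_a / p_b    # Bayes' Law

print(p_b, p_b_given_a, p_a_given_b)
```

The second card is red with probability 1/2 overall but 25/51 once the first card is known to be red, so the colors of successive cards from the same deck are not independent.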
Random Variables • A random variable is a variable whose value is a numerical outcome of a random phenomenon. • Sample spaces need not consist of numbers. • Examples: the number of heads in four coin tosses, …
Random Variable • A random variable is called discrete if it has countably many possible values; otherwise, it is called continuous. • The following quantities would typically be modeled as discrete random variables: • The number of defects in a batch of 20 items. • The number of people preferring one brand over another in a market research study. • The following would typically be modeled as continuous random variables: • The yield on a 10-year Treasury bond three years from today. • The proportion of defects in a batch of 10,000 items. • Sometimes, we approximate a discrete random variable with a continuous one if the possible values are very close together; e.g., stock prices are often treated as continuous random variables.
Distribution: discrete • The rule that assigns specific probabilities to specific values for a discrete random variable is called its probability mass function or pmf. • If X is a discrete random variable, then we denote its pmf by PX. • For any value x, PX(x) is the probability of the event that X = x; i.e., PX(x) = P(X = x) = probability that the value of X is x. • We always use capital letters for random variables. Lower-case letters like x and y stand for possible values (i.e., numbers). • The pmf gives us one way to describe the distribution of a random variable. Another way is provided by the cumulative probability function (cpf), denoted by FX and defined by FX(x) = P(X ≤ x). • It is the probability that X is less than or equal to x. • The pmf gives the probability that the random variable lands on a particular value; the cpf gives the probability that it lands on or below a particular value. In particular, FX is always a nondecreasing function.
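As an illustration of a pmf and a cumulative probability function, the sketch below takes X to be the number of heads in four fair coin tosses (the example from the Random Variables slide), builds PX by enumerating the 16 equally likely outcomes, and defines FX from it. The function and variable names are chosen for this example.

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely outcomes of four fair coin tosses.
outcomes = list(product("HT", repeat=4))

# pmf: P_X(k) = P(X = k), where X counts the heads.
pmf = {}
for outcome in outcomes:
    k = outcome.count("H")
    pmf[k] = pmf.get(k, 0) + Fraction(1, len(outcomes))


def cdf(x):
    """Cumulative probability function F_X(x) = P(X <= x)."""
    return sum(p for k, p in pmf.items() if k <= x)


for k in sorted(pmf):
    print(k, pmf[k], cdf(k))
```

The pmf values sum to 1, and the cumulative function rises from 1/16 at x = 0 to 1 at x = 4 without ever decreasing.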
Distribution: continuous • The distribution of a continuous random variable cannot be specified through a probability mass function because if X is continuous, then P(X = x) = 0 for all x; i.e., the probability of any particular value is zero. Instead, we must look at probabilities of ranges of values. • The probabilities of ranges of values of a continuous random variable are determined by a density function. It is denoted by fX. The area under a density is always 1. • The probability that X falls between two points a and b is the area under fX between the points a and b. The familiar bell-shaped normal curve is an example of a density. • The cumulative distribution function or cdf of a continuous random variable is obtained from the density in much the same way a cpf is obtained from the pmf of a discrete distribution. • The cdf of X, denoted by FX, is given by FX(x) = P(X≦ x). • FX(x) is the area under the density fX to the left of x.
Expectation • The expected value of a random variable X is denoted by E[X]. • It can be thought of as the “average” value attained by the random variable. • The expected value of a random variable is also called its mean, in which case we use the notation μX. • The formula for the expected value of a discrete random variable is E[X] = Σx x PX(x). • That is, the expected value is the sum, over all possible values x, of x times its probability PX(x). • The expected value of a continuous random variable cannot be expressed as a sum; instead it is an integral involving the density. • If g is a function, then the expected value of g(X) is E[g(X)] = Σx g(x) PX(x); for example, with g(x) = x², E[X²] = Σx x² PX(x). • The variance of a random variable X is denoted by either Var[X] or σX². • The variance is defined by σX² = E[(X − μX)²] = E[X²] − (E[X])². • For a discrete distribution, we can write the variance as Σx (x − μX)² PX(x).
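The formulas above can be checked on a concrete discrete distribution. The sketch below again assumes X is the number of heads in four fair coin tosses, computes E[X], E[X²], and the variance from both forms of the definition, and uses exact fractions so the two variance formulas can be compared exactly.

```python
from fractions import Fraction
from math import comb

# pmf of X = number of heads in four fair tosses: P_X(k) = C(4, k) / 16.
pmf = {k: Fraction(comb(4, k), 16) for k in range(5)}


def expect(g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * P_X(x)."""
    return sum(g(x) * p for x, p in pmf.items())


mean = expect()                                  # E[X]
second_moment = expect(lambda x: x ** 2)         # E[X^2]
var_def = expect(lambda x: (x - mean) ** 2)      # E[(X - mu)^2]
var_short = second_moment - mean ** 2            # E[X^2] - (E[X])^2

print(mean, second_moment, var_def, var_short)
```

Both variance formulas give the same value, as the identity on the slide requires.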
Discrete random variable • A discrete random variable X has a finite number of possible values. • The probability distribution of X lists the values and their probabilities. • The probabilities p1, p2, ..., pk must satisfy: • Every probability pi is a number between 0 and 1. • p1 + p2 + ... + pk = 1. • Probability histogram • Possible values of X and corresponding probabilities.
Commonly Used Continuous Distribution The Normal Distribution • History: • Abraham de Moivre (1667-1754) first described the normal distribution in 1733. • Adolphe Quetelet (1796-1874) used the normal distribution to describe the concept of l'homme moyen (the average man), thus popularizing the notion of the bell-shaped curve. • Carl Friedrich Gauss (1777-1855) used the normal distribution to describe measurement errors in geography and astronomy.
Bernoulli Processes and the Binomial Distribution • An airline reservations switchboard receives calls for reservations, and it is found that • When a reservation is made, there is a good chance that the caller will actually show up for the flight. In other words, there is some probability p (say for now p = 0.9) that the caller will show up and buy the ticket the day of departure. • Consider a single person making a reservation. This particular reservation can either result in the person on the flight (a success) or a “no show” (a failure). Let X (a random variable) represent the result of a particular reservation. That is, we could assign a value of 1 to X if the person shows up for the flight (X = 1), and let X = 0 if the person does not. Then, P(X = 0) = 1 - p and P(X = 1) = p. • The airline is not particularly interested in the decision made by any one individual, but is more concerned with the behavior of the total number of people with reservations. • Suppose each passenger carried on the plane provides a revenue of $100 for the airline and each bumped passenger (passengers that do not find a seat due to overbooking) results in a loss of $200 for the airline. • If a plane holds 16 people, not including pilots and crew, how many reservations should be taken?
Bernoulli process • This is an example of a Bernoulli process, named for the Swiss mathematician James Bernoulli (1654-1705). • A Bernoulli process is a sequence of n identical trials of a random experiment such that each trial: • (1) produces one of two possible complementary outcomes that are conventionally called success and failure and • (2) is independent of any other trial so that the probability of success or failure is constant from trial to trial. • Note that the success and failure probabilities are assumed to be constant from trial to trial, but they are not necessarily equal to each other. • In our example, the probability of a success is 0.9 and the probability of a failure is 0.1. • The number of successes in a Bernoulli process is a binomial random variable. • Random Variable: A numerical value determined by the outcome of an experiment.
Analysis • If the airline takes 16 reservations, what is the probability that there will be at least one empty seat? P(at least one empty seat) = 1 − P(all 16 show up) = 1 − (0.9)^16 = 0.815. An 81.5% chance of having at least one empty seat! So the airline would be foolish not to overbook. • Suppose we take 20 reservations for a particular flight, and let Y be the number of people who show up. • Y is a binomial random variable that takes on an integer value between 0 and 20. • What is the probability function or distribution of Y? P(Y = k) = C(20, k) (0.9)^k (0.1)^(20−k) for k = 0, 1, …, 20. • What is the probability of getting exactly 16 passengers? P(Y = 16) = C(20, 16) (0.9)^16 (0.1)^4 = 0.08978 • P(Y ≤ 16) = 0.133, P(Y = 17) = 0.190, P(Y = 18) = 0.285, P(Y = 19) = 0.270, P(Y = 20) = 0.122 • Consider B = number of people bumped. The load L is Y − B. • Call the airline's total revenue R, so that R = 100L − 200B. • E(R) = E(100L − 200B) = 100E(L) − 200E(B) = 1,182.81.
How many reservations?
Reservations   20       19       18       17       16
E(Load)        15.943   15.839   15.599   15.132   14.396
E(Bumps)       2.057    1.261    0.600    0.167    0.000
E(Revenue)     $1,183   $1,332   $1,440   $1,480   $1,440
• In this case, the best strategy is to take 17 reservations. • Expected Value: The expected value (or mean or expectation) of a random variable X with probability function P(X = x) is E(X) = Σ x P(X = x), where the summation is over all x that have P(X = x) > 0. It is sometimes denoted μX or μ. • Variance: The variance of a random variable X with probability function P(X = x) is Var(X) = Σ (x − E(X))² P(X = x), where the summation is over all x such that P(X = x) > 0. It is sometimes denoted σ²(X) or σ².
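The overbooking analysis can be reproduced with a few lines of code. The sketch below uses the numbers from the example (show-up probability 0.9, 16 seats, $100 revenue per carried passenger, $200 loss per bumped passenger) and assumes, as the Bernoulli model does, that passengers show up independently. The function names are made up for this illustration.

```python
from math import comb

P_SHOW = 0.9             # probability that a reservation shows up
SEATS = 16               # seats on the plane
REVENUE_PER_SEAT = 100   # dollars earned per passenger carried
COST_PER_BUMP = 200      # dollars lost per bumped passenger


def binom_pmf(k, n, p):
    """P(Y = k) when Y is binomial with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def expected_outcomes(n_reservations):
    """Return (E[load], E[bumps], E[revenue]) for a given number of reservations."""
    e_load = e_bumps = 0.0
    for y in range(n_reservations + 1):
        prob = binom_pmf(y, n_reservations, P_SHOW)
        e_load += min(y, SEATS) * prob        # passengers actually carried
        e_bumps += max(y - SEATS, 0) * prob   # passengers left without a seat
    e_revenue = REVENUE_PER_SEAT * e_load - COST_PER_BUMP * e_bumps
    return e_load, e_bumps, e_revenue


table = {n: expected_outcomes(n) for n in range(16, 21)}
best = max(table, key=lambda n: table[n][2])

for n, (load, bumps, revenue) in sorted(table.items(), reverse=True):
    print(f"{n} reservations: E(Load)={load:.3f} E(Bumps)={bumps:.3f} E(Revenue)=${revenue:,.0f}")
print("best number of reservations:", best)
```

The computed values agree with the table up to small rounding differences, and expected revenue peaks at 17 reservations. Changing `COST_PER_BUMP` shows how sensitive the best overbooking level is to the penalty for bumping.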
Inference: Mean, Proportion, CLT, Bootstrap
From Probability to Statistics • In all our probability calculations, we have assumed that we know all quantities needed to solve the problem: • To find the expected return and standard deviation of a portfolio, we assumed we knew the mean and standard deviation of the returns of the underlying stocks. • To find the proportion of bags below the 8-ounce minimum, we assumed we knew the mean and standard deviation of the weight of chips in each bag. • In practice, these types of parameters are not given to us; we must estimate them from data. • Statistical analysis usually proceeds along the following lines: • Postulate a probability model (usually including unknown parameters) for a situation involving uncertainty; e.g., assume that a certain quantity follows a normal distribution. • Use data to estimate the unknown parameters in the model. • Plug the estimated parameters into the model in order to make predictions from the model.
How do we start? • The first step, picking a model, must be based on an understanding of the situation to be modeled. • Which assumptions are plausible? • Which are not? • These questions are answered by judgment, not by precise statistical techniques. • Examples: • Assume that daily changes in a stock price follow a normal distribution. • Use historical data to estimate the mean and standard deviation. • Once we have estimates, we might use the model to predict future price ranges or to value an option on the stock. • Assume that demand for a fashion item is normally distributed. • Use historical data to estimate the mean and standard deviation. • Once we have estimates, we might use the model to set production levels.
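The postulate, estimate, and plug-in sequence can be sketched for the stock price example. The daily changes below are made-up numbers, not real market data, and the normal model is simply assumed, as in the slide; the sketch uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical daily price changes in dollars (illustrative values only).
changes = [0.5, -0.3, 1.2, -0.8, 0.1, 0.4, -0.6, 0.9, -0.2, 0.3]

# Step 1: postulate a model. Daily changes are assumed to be normal.
# Step 2: estimate the unknown parameters from the data.
mu_hat = mean(changes)
sigma_hat = stdev(changes)

# Step 3: plug the estimates into the model and make predictions.
model = NormalDist(mu_hat, sigma_hat)
low, high = model.inv_cdf(0.025), model.inv_cdf(0.975)   # central 95% range
p_big_drop = model.cdf(-1.0)   # chance of a one-day drop of more than $1

print(f"mean={mu_hat:.3f} sd={sigma_hat:.3f}")
print(f"95% range for tomorrow's change: [{low:.2f}, {high:.2f}]")
print(f"P(change < -1) = {p_big_drop:.3f}")
```

Whether the normal assumption is plausible is a matter of judgment about the situation; the code only carries out the estimation and prediction steps once that choice has been made.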
How do we get data and make inferences? • The first step in understanding the process of estimation is understanding basic properties of sampled data and sample statistics, since these are the basis of estimation. • When we talk about sampling it is always in the context of a fixed underlying population: • If we look at 50 daily changes in IBM stock, we are looking at a sample of size 50 from the population of all daily changes in IBM stock. • If the population is very large (as in these examples), we generally treat it as though it were infinite; this simplifies matters. Thus, we are primarily concerned with finite samples from infinite populations. • A single observation drawn from a population is a random variable. Its distribution is the population distribution; e.g., • The distribution of a randomly selected daily change in IBM stock is the distribution over all daily changes.
Random Sample • A random sample from a population is a set of randomly selected observations from that population. If X1,…, Xn are a random sample, then • they are independent; • they are identically distributed, all with the distribution of the underlying population. • A sample statistic is any quantity calculated from a random sample. The most familiar example of a sample statistic is the sample mean X̄, given by X̄ = (X1 + X2 + … + Xn)/n • The sample mean gives an estimate of the population mean μ = E[Xi].
Distribution of the Sample Mean • Every sample statistic is a random variable. • Randomness is introduced through the sampling mechanism. • As noted above, the sample mean X̄ of a random sample X1,…, Xn is an estimate of the population mean μ = E[Xi]. • How good an estimate is it? • How can we assess the uncertainty in the estimate? • To answer these questions, we need to examine the sampling distribution of the sample mean; that is, the distribution of the random variable X̄. • Assume that the underlying population is normal with mean μ and variance σ². • This means that Xi ~ N(μ, σ²) for all i. • The Xi's are independent, since we assume we have a random sample. • The sum of independent normal random variables is normally distributed. The usual rules for means and variances apply: • The expected value of the sum is the sum of the expected values. • The variance of the sum is the sum of the variances (by independence). • Any linear transformation of a normal random variable is normal; in particular, multiplication by a constant preserves normality.
Distribution of the Sample Mean • Using these two facts, we find that if Xi ~ N(μ, σ²) for all i, then • X1 + X2 + … + Xn ~ N(nμ, nσ²); • X̄ = (X1 + X2 + … + Xn)/n ~ N(μ, σ²/n). • The sample mean from a normal population has a normal distribution. • First consequence: • The expected value of the sample mean is the population mean; "on average" the sample mean correctly estimates the underlying mean. • The standard deviation of a sample statistic is called its standard error. Thus, we have shown that the standard error of the sample mean is σ/√n, where σ is the underlying standard deviation and n is the sample size. • Second consequence: • Because the standard error of the sample mean is σ/√n, the uncertainty in this estimate decreases as the sample size n increases. (That's good.) • The uncertainty (as measured by the standard deviation) decreases rather slowly: to cut the standard deviation in half, we need to collect four times as much data, because of the square root. (That's not so good, but that's life.)
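The standard error result can be checked by simulation. The sketch below draws many random samples from a normal population, computes the mean of each, and compares the spread of those sample means with σ/√n. The population parameters, sample size, number of repetitions, and seed are hypothetical choices for this example.

```python
import math
import random
import statistics

rng = random.Random(42)   # seeded so the run is repeatable

# Hypothetical normal population and sampling plan.
MU, SIGMA = 50.0, 10.0
N = 25            # sample size
REPS = 10_000     # number of repeated samples

sample_means = [
    statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(REPS)
]

empirical_mean = statistics.fmean(sample_means)
empirical_se = statistics.stdev(sample_means)   # spread of the sample means
theoretical_se = SIGMA / math.sqrt(N)           # sigma / sqrt(n)

print(f"mean of sample means = {empirical_mean:.3f} (population mean {MU})")
print(f"sd of sample means   = {empirical_se:.3f} (theory {theoretical_se:.3f})")
```

Rerunning with `N = 100` should roughly halve the standard error, illustrating that four times as much data is needed to cut the uncertainty in half.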
Example: • Suppose the number of miles driven each week by US car owners is normally distributed with a standard deviation of σ = 75 miles. • Suppose we plan to estimate the population mean number of miles driven per week by US car owners using a random sample of size n = 100. • What is the probability that our estimate will differ from the true value by more than 10 miles? • Denote the population mean by μ and the sample mean by X̄. • We need to find P(|X̄ − μ| > 10). • The standard error of X̄ is σ/√n = 75/10 = 7.5, so by symmetry of the normal distribution this probability is 2 P(Z > 10/7.5) ≈ 2 P(Z > 1.33) = 2(1 − 0.9082) = 0.1836. Thus, the probability that our estimate will be off by more than 10 miles is 18.36%. • If the underlying population is not normal, what can be done?
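The calculation in this example can be reproduced numerically. The sketch below uses the standard normal distribution from the Python standard library; it computes the probability once with the unrounded z value and once with z rounded to two decimals, which is how a printed normal table would be used.

```python
from math import sqrt
from statistics import NormalDist

sigma = 75.0       # population standard deviation (miles)
n = 100            # sample size
tolerance = 10.0   # allowed estimation error (miles)

std_error = sigma / sqrt(n)   # standard error of the sample mean
z = tolerance / std_error     # error expressed in standard errors

standard_normal = NormalDist()
# P(|X-bar - mu| > 10) = 2 * P(Z > z), by symmetry of the normal curve.
p_off = 2 * (1 - standard_normal.cdf(z))
# Same calculation with z rounded to two decimals, as with a printed table.
p_off_table = 2 * (1 - standard_normal.cdf(round(z, 2)))

print(f"standard error = {std_error}, z = {z:.4f}")
print(f"unrounded z: {p_off:.4f}   z rounded to 1.33: {p_off_table:.4f}")
```

The unrounded value is about 0.182; rounding z to 1.33 gives about 0.1835, which matches the 18.36% table-based answer to within table rounding.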
Central Limit Theorem • By the central limit theorem, regardless of the underlying population (provided it has a finite variance), the distribution of the sample mean tends towards N(μ, σ²/n) as n becomes large. • If we accept the use of this approximation, we don't need to assume that the number of miles driven per week in the example is normally distributed (as long as our sample size n is large). • We will use this normal approximation repeatedly to assess the error in X̄ as an estimate of μ. • How large should n be for the normal approximation to be accurate? • There is no simple answer (it depends on the underlying distribution), but n ≥ 30 is a reasonable rule of thumb. • If the underlying population is finite of size N, and if the sample size n is not a small proportion of N, we apply the finite-population correction to the standard error: (σ/√n) × √((N − n)/(N − 1)).
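A quick simulation can show the central limit theorem at work on a clearly non-normal population. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (a strongly skewed distribution chosen only for illustration), takes samples of size n = 30, and checks how often the sample mean falls within 1.96 standard errors of the population mean. Under the normal approximation this should happen about 95% of the time.

```python
import math
import random
import statistics

rng = random.Random(7)   # seeded so the run is repeatable

RATE = 1.0               # exponential population: mean 1, sd 1, right-skewed
N = 30                   # sample size (the rule-of-thumb minimum)
REPS = 10_000            # number of repeated samples

mu = sigma = 1.0 / RATE
std_error = sigma / math.sqrt(N)

sample_means = [
    statistics.fmean([rng.expovariate(RATE) for _ in range(N)])
    for _ in range(REPS)
]

# Fraction of sample means within 1.96 standard errors of the true mean.
within = sum(abs(m - mu) <= 1.96 * std_error for m in sample_means) / REPS

print(f"mean of sample means = {statistics.fmean(sample_means):.3f}")
print(f"sd of sample means   = {statistics.stdev(sample_means):.3f} (theory {std_error:.3f})")
print(f"fraction within 1.96 standard errors = {within:.3f}")
```

Trying smaller values of `N` shows the approximation getting worse for this skewed population, which is why the required sample size depends on the underlying distribution.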
Sampling Distribution of the Sample Proportion • Consider estimating any of the following quantities: • Proportion of voters who will vote for a third-party candidate in the next election. • Proportion of visits to a web site that result in a sale. • Proportion of shoppers who prefer crunchy over creamy. • In each of these examples, we are trying to estimate a population proportion. Denote a generic population proportion by the symbol p. • We estimate a population proportion using a sample proportion. • For example, if a poll surveys 1000 voters and finds that 85 of those surveyed plan to vote for a third-party candidate, then the sample proportion is 8.5%. • The population proportion is what the poll would find if it could ask every voter in the population. • Denote the sample proportion by the symbol p̂. • Once we have collected a random sample, the sample proportion is known. We use it to estimate the true, unknown population proportion p.
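The polling example can be written out in a few lines. The sketch below computes the sample proportion for 85 of 1000 voters and, as an assumption that goes beyond this slide, also attaches the usual estimated standard error sqrt(p̂(1 − p̂)/n), which treats each response as an independent Bernoulli trial.

```python
import math

n_polled = 1000      # voters surveyed
n_third_party = 85   # of those, planning to vote for a third-party candidate

# Sample proportion: the estimate of the unknown population proportion p.
p_hat = n_third_party / n_polled

# Estimated standard error of the sample proportion (standard formula,
# assumed here; it is not stated on the slide itself).
se_hat = math.sqrt(p_hat * (1 - p_hat) / n_polled)

print(f"sample proportion = {p_hat:.3f}")
print(f"estimated standard error = {se_hat:.4f}")
```

A standard error of roughly 0.9 percentage points gives a sense of how far the 8.5% figure could plausibly be from the true population proportion.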