Agenda 1. Apply for an HSC account 2. Circle your name on the roster 3. Course website: http://personal.health.usf.edu/ywu/cat.html 4. Last day for dropping: 09/31/07; last day for withdrawing: 11/03/07 5. Syllabus 6. Lecture
Outline of Today’s Lecture • Effective teaching and learning • SAS basics • Statistics basics
Why SAS • There are many other packages that can be used to perform statistical analyses, for example, SPSS, Stata, S-PLUS, etc. • What sets SAS apart is its comprehensive data manipulation capabilities, which the other statistical analysis packages do not match. • SAS is the de facto standard in the pharmaceutical industry for the statistical analysis reports that companies submit to the FDA.
Overview of SAS • Original data set (input) → Data manipulation → Statistical analysis → Results (output)
Original Data Set Input • The ideal situation is one where you have a SAS data set to work with in the first place. • In reality, however, the original data set you are supplied with is not always a SAS data set but may be one of the following: • Plain text file • Excel file • SPSS file
Original Data Set Input • How do you manipulate or analyze an existing external SAS data set? • For SAS to access an existing external SAS data set, you need to tell the SAS system where the data set is located on your computer. This can be accomplished by the two-step procedure that follows.
Original Data Set Input • Step I: create a user-defined SAS library using the SAS statement libname. For example, let’s say that the external existing SAS file, lbw.sas7bdat, is located at c:\yougui\cat_07. Then the SAS statement libname dog "c:\yougui\cat_07"; creates a SAS library called dog that links with the physical location c:\yougui\cat_07
Original Data Set Input • Step II: use the two-level naming mechanism (library.dataset) to access the SAS data set. For example, proc freq data=dog.lbw; table race; run;
Original Data Set Input • Note: There is a built-in SAS library called work that can be used to save SAS data sets created during a SAS session. data work.lbw1; set dog.lbw; if age>70 then age_high=1; else age_high=0; run;
Original Data Set Input • Note: However, be aware that when a SAS session is closed, the SAS data sets saved in the work library are deleted. Therefore, save the SAS data sets that you want to keep permanently in a user-defined SAS library, NOT in the temporary SAS library work.
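A minimal sketch of keeping a derived data set permanently, reusing the library dog and the folder from the earlier slides. The output name lbw1 is only an illustrative choice, and the code assumes the folder exists and is writable.

```sas
/* Point the library reference dog at the folder used on the earlier slides */
libname dog "c:\yougui\cat_07";

/* Writing to dog.lbw1 (instead of work.lbw1) stores lbw1.sas7bdat in that folder, */
/* so the data set is still there after the SAS session is closed.                 */
data dog.lbw1;
    set dog.lbw;
    if age > 70 then age_high = 1;
    else age_high = 0;
run;
```

In a later session, the same libname statement is all that is needed to access dog.lbw1 again.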
Original Data Set Input • How do you manipulate or analyze an existing external non-SAS data set? • The idea is to convert the non-SAS data set into a SAS data set and then treat the converted data set as an existing external SAS data set, as discussed on the previous slides.
Original Data Set Input • Some non-SAS data sets can be converted within the SAS system, for example, plain text files and Excel files. • Others can only be converted with special conversion software, for example, DBMS/COPY.
Original Data Set Input • SAS code for converting an Excel file: PROC IMPORT OUT= WORK.lbw DATAFILE= "C:\yougui\cat_07\lbw.xls" DBMS=EXCEL REPLACE; SHEET="data$"; GETNAMES=YES; MIXED=NO; SCANTEXT=YES; USEDATE=YES; SCANTIME=YES; RUN;
Original Data Set Input • SAS code for converting a plain text file: data work.lbw; infile 'c:\yougui\cat_07\lbw.txt'; input low age lwt race smoke ptl ht ui ftv; run;
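After a conversion it is worth checking that the variables and the first few records look as expected. A short sketch, assuming the data set work.lbw was created by one of the conversions above:

```sas
/* List the variables, their types and the number of observations */
proc contents data=work.lbw;
run;

/* Print the first 5 observations to eyeball the values */
proc print data=work.lbw (obs=5);
run;
```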
Original Data Set Input • On some occasions, you might want to create a SAS data set from scratch, meaning that you only have the data on a piece of paper.
data work.test;
input age;
datalines;
35
50
45
24
;
run;
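With a plain input statement, SAS reads one observation per data line. If the values are typed on a single line, a trailing @@ in the input statement tells SAS to keep reading from the same line. A small sketch; the data set name work.test2 is hypothetical:

```sas
data work.test2;
    /* The trailing @@ holds the current line so that all four values are read */
    input age @@;
    datalines;
35 50 45 24
;
run;
```

Without the @@, only the first value on the line (35) would be read into the data set.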
Data Manipulation • Data manipulation is required when the variables that are needed in a statistical analysis are not available in the original data set. • Data manipulation can be as simple as creating a new variable based on variables in the original data set, for example,
Data Manipulation • data work.lbw1; set dog.lbw; if age>70 then age_high=1; else age_high=0; run; • In some cases, data manipulation can be so complicated that it takes pages of SAS code to get the data set that you need for the analysis. This is where your SAS programming skills come into play.
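A slightly richer sketch of the same idea: deriving a character grouping variable from age. The cut-offs, the labels, and the names lbw2 and age_group are illustrative assumptions, not part of the original course material.

```sas
data work.lbw2;
    set dog.lbw;
    length age_group $ 8;      /* reserve room for the longest label */
    /* In SAS a missing numeric value compares as smaller than any number, */
    /* so missing ages are handled first to keep them out of 'under 20'.   */
    if missing(age) then age_group = ' ';
    else if age < 20 then age_group = 'under 20';
    else if age < 30 then age_group = '20 to 29';
    else age_group = '30 plus';
run;
```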
Statistical Analysis • When you have a SAS data set that contains all the variables that are needed for a statistical analysis, you need to identify the right SAS procedure that can be used to carry out that intended analysis. For example, the procedure freq will be used to do a frequency analysis on the variable RACE in the data set lbw.sas7bdat
Statistical Analysis proc freq data=dog.lbw; table race; run; • At the analysis stage, the key is to identify the right analysis for the statistical problem you have. This is where your statistical knowledge comes into play.
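The same procedure can also cross-tabulate two variables. The sketch below assumes that dog.lbw contains the variables race and smoke (both appear in the input statement for the text file on an earlier slide); the chi-square test is added only as an illustration of a procedure option.

```sas
/* Two-way frequency table of race by smoke, with a chi-square test of association */
proc freq data=dog.lbw;
    tables race*smoke / chisq;
run;
```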
Results Report • The results of a statistical analysis can be viewed and printed from the SAS output window. In most cases, the SAS output contains a lot of information that is not useful for your statistical problem. Therefore, it is important for you to know which part of the SAS output is relevant to the answer to the statistical problem you have.
Results Report • After you obtain the relevant results from the SAS output, the most challenging job is interpreting them. This is the point where your statistical knowledge and understanding make a difference.
Note • For additional free SAS learning materials online, visit the web site: http://www.ats.ucla.edu/stat/sas/ • For those of you who want to become a SAS programmer, I recommend that you get both the entry-level and the advanced-level SAS certificates before you graduate; you can go to the SAS home page http://www.sas.com to look for learning materials for these two certificates (not free, it might cost $200).
Three Important Statistical Concepts • Population: a set of measurements of interest. For example, the blood pressures of people in Florida. • Sample: a subset of measurements selected from a population, denoted by $x_1, x_2, \ldots, x_n$. • Parameter: a characteristic of a population. For example, the mean blood pressure of people in Florida, denoted by $\mu$.
Why Statistics • Q: Is it possible to know the true value of a population parameter, $\mu$, based on the sample? A: No, because $\mu$ depends on the whole population and the sample is only a small portion of the population. • Note: If the answer to the question above were yes, statistics as a scientific discipline would not exist on the planet.
Why Statistics • Methodologies in statistics can be used to answer the following two typical research questions that are related to the population parameter, $\mu$: • What is the range of the likely values of $\mu$? • Is a hypothesis about $\mu$ true or not?
Statistical Inferences • The two research questions correspond to the two classical types of statistical inferences: • Confidence interval • Hypothesis testing
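Statistical Inference: Confidence Interval • Confidence interval: an interval that covers the true parameter with a certain confidence level. • Example: let $X_1, \ldots, X_n$ be a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$. The 95% confidence interval of $\mu$ is $\bar{X} \pm t_{0.975,\,n-1}\, S/\sqrt{n}$, where $\bar{X}$ is the sample mean and $S$ is the sample standard deviation.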
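In SAS, a t-based confidence interval for a mean can be requested from PROC MEANS with the CLM keyword. The sketch below reuses the small data set work.test (variable age) created from scratch on an earlier slide; treating those four values as a random sample from a normal population is an assumption made purely for illustration.

```sas
/* N, mean, standard deviation and the two-sided 95% confidence limits for the mean */
proc means data=work.test n mean std clm alpha=0.05;
    var age;
run;
```

Changing alpha=0.05 to alpha=0.10 would give a 90% interval instead.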
Statistical Inference: Confidence Interval • Interpretation: Let (80, 120) be the 95% confidence interval of $\mu$ calculated from a particular sample. The following statement can be made regarding this interval: • We are 95% confident that (80, 120) covers the true value of $\mu$.
Statistical Inference: Confidence Interval • How to understand this statement: The CI formula is derived from the probability statement $P\left(\bar{X} - t_{0.975,\,n-1}\, S/\sqrt{n} \le \mu \le \bar{X} + t_{0.975,\,n-1}\, S/\sqrt{n}\right) = 0.95$. • What this probability statement means is that the probability that the RANDOM interval $\bar{X} \pm t_{0.975,\,n-1}\, S/\sqrt{n}$ covers $\mu$ is 0.95. • Consider 100 samples, including the sample at hand, and construct a 95% confidence interval from each of them, which yields 100 95% confidence intervals. Of these 100 intervals, about 95 cover the true value of $\mu$.
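The repeated-sampling interpretation can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the true mean (100), the standard deviation (15), the sample size (9), the seed, and all data set and variable names are hypothetical choices.

```sas
/* Draw 100 samples of size 9 from a normal distribution with mean 100 and SD 15 */
data work.sim;
    call streaminit(2007);          /* fix the seed so the run is repeatable */
    do sample = 1 to 100;
        do i = 1 to 9;
            x = rand('normal', 100, 15);
            output;
        end;
    end;
run;

/* One t-based 95% confidence interval per sample */
proc means data=work.sim noprint alpha=0.05;
    by sample;
    var x;
    output out=work.ci lclm=lower uclm=upper;
run;

/* covers = 1 if the interval contains the true mean, 0 otherwise */
data work.ci;
    set work.ci;
    covers = (lower <= 100) and (100 <= upper);
run;

/* Count how many of the 100 intervals cover the true mean */
proc freq data=work.ci;
    tables covers;
run;
```

The count of covering intervals varies from run to run with the seed, but it should be close to 95.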
Statistical Inference: Hypothesis Testing • Objective: test whether or not the population parameter $\mu$ is equal to a hypothesized value $\mu_0$, i.e. H0: $\mu = \mu_0$ (this is not the most general form of a hypothesis, but it is sufficient for understanding the ideas). • Idea: look at the sample to see whether it provides evidence against the null hypothesis, and make the final decision according to that evidence: reject the null hypothesis if the evidence is against it.
Statistical Inference: Hypothesis Testing • How to find the evidence against H0 • Estimate the population parameter • Compute a “distance” measure (test statistic) between the estimate and the hypothesized value • A large distance is evidence against H0 • Example: a large value of $|\bar{X} - \mu_0|$ is evidence against H0: $\mu = \mu_0$. Reject H0 if $|\bar{X} - \mu_0|$ is greater than a critical value, c.
Statistical Inference: Hypothesis Testing • Errors: Regardless of what critical value, c, is used for constructing the rejection region, two types of errors are inevitable: type I error and type II error. • Type I error: rejecting the null hypothesis when the null hypothesis is true. Its probability is usually denoted by $\alpha$, also called the significance level. • Type II error: accepting the null hypothesis when the null hypothesis is false.
Statistical Inference: Hypothesis Testing • Is it possible to find a c such that both the type I and type II errors are minimized? No: the larger c is, the smaller the type I error is BUT the bigger the type II error is. • Strategy: control the type I error ($\alpha$) and leave the type II error open, i.e. find a c such that the corresponding type I error is as small as 0.05.
Statistical Inference: Hypothesis Testing • Why not control type II error? Usually the null hypothesis is specified in such a way that rejecting it when it is true is more dangerous than accepting it when it is false. • Example: H0: having a disease VS H1: not having the disease
Statistical Inference: Hypothesis Testing • Determination of the critical value c for a given type I error $\alpha$: it depends on the sampling distribution of the test statistic.
Statistical Inference: Hypothesis Testing • Two approaches to reporting a hypothesis testing result: • Approach 1: draw a final conclusion as to whether H0 should be rejected or not. • Approach 2: report the P-value and let the readers draw the final conclusion themselves.
Hypothesis Testing: Approach 1 • Procedure to implement approach 1: • State the null hypothesis H0 and the alternative hypothesis H1. • Identify the test statistic and its sampling distribution under H0. • Determine the critical value and construct the rejection region. • Compute the value of the test statistic from the observed data. • Reject H0 if the observed test statistic is in the rejection region and fail to reject H0 otherwise.
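For a test about a single mean, PROC TTEST carries out the computational steps of this procedure. The sketch below reuses the data set work.test (variable age) from an earlier slide; the hypothesized value 40 is a hypothetical choice made only for illustration.

```sas
/* One-sample t test of H0: mean age = 40 against the two-sided alternative */
proc ttest data=work.test h0=40 alpha=0.05;
    var age;
run;
```

The output reports the observed t statistic and its degrees of freedom, which can be compared with the critical value to decide whether the statistic falls in the rejection region.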
Hypothesis Testing: Approach 1 • Example: • Hypothesis: • Significance level: $\alpha = 0.05$ • Test statistic: • The distribution of the test statistic under H0: $t$ distribution with 8 degrees of freedom
Hypothesis Testing: Approach 1 • Example (cont.) • Rejection region: $|t| > 2.306$ • Compute the observed t: • Conclusion: Since the observed t is greater than 2.306, i.e. in the rejection region, we reject the null hypothesis at significance level 0.05.
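The critical value 2.306 matches the 0.975 quantile of a t distribution with 8 degrees of freedom, i.e. a two-sided test at significance level 0.05. The degrees of freedom are inferred from that value, not stated on the slide. A quick way to look up such a critical value in SAS:

```sas
data _null_;
    alpha = 0.05;
    df = 8;                               /* assumed degrees of freedom */
    c = quantile('T', 1 - alpha/2, df);   /* two-sided critical value */
    put c= 6.3;                           /* written to the SAS log */
run;
```

The value written to the log should be about 2.306.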
Hypothesis Testing: Approach 1 • Q: What decision do you make if the observed test statistic does not fall into the rejection region? Are you going to accept H0 or fail to reject H0? • A: If you accept H0 in this case, the error you may be making is a type II error, whose probability is usually unknown. This is why we usually say “fail to reject H0”.
Hypothesis Testing: Approach 2 • The problem with approach 1: consider a scenario where we have two samples and both of them lead to the rejection of the null hypothesis, but the observed test statistic calculated from sample 1 is close to the critical value while the one from sample 2 is far from it. The much stronger evidence for rejection in sample 2 is not reflected by approach 1.
Hypothesis Testing: Approach 2 • To overcome the problem with approach 1, the concept of the “P-value” was invented. • P-value: the probability, computed under H0, of observing a test statistic as extreme as or more extreme than the one actually observed. This probability reflects how far the observed test statistic is from the critical value. The smaller the P-value is, the further into the rejection region the observed test statistic lies.
Hypothesis Testing: Approach 2 • P-value when the test is one-tailed (right-tailed): the area under the sampling distribution density curve to the right of the observed test statistic.
Hypothesis Testing: Approach 2 • P-value when the test is two-tailed: 2 times the area under the sampling distribution density curve to the right of the observed test statistic if the observed test statistic is located in the right tail, and 2 times the area to its left otherwise.
Hypothesis Testing: Approach 2 • Procedure to implement approach 2: • State the null hypothesis H0 and the alternative hypothesis H1. • Identify the test statistic and its sampling distribution under H0. • Compute the value of the test statistic from the observed data. • Compute the P-value.
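The last step can be sketched for a t statistic with the SAS function PROBT, which returns the area to the left of a value under a t distribution. The observed statistic 2.5 and the 8 degrees of freedom below are hypothetical numbers chosen only for illustration.

```sas
data _null_;
    t  = 2.5;     /* hypothetical observed test statistic */
    df = 8;       /* hypothetical degrees of freedom */
    /* Right-tailed P-value: area to the right of t */
    p_right = 1 - probt(t, df);
    /* Two-tailed P-value: twice the area beyond |t| in one tail */
    p_two = 2 * (1 - probt(abs(t), df));
    put p_right= p_two=;     /* written to the SAS log */
run;
```

Using abs(t) makes the two-tailed calculation work whether the observed statistic falls in the left or the right tail.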
Hypothesis Testing: Approach 2 • Example: • Hypothesis: • Test statistic: • The distribution of the test statistic under H0: • Compute the observed statistic:
A Question to Ponder • Q: How is the confidence interval related to the P-value? Specifically, • If the 95% CI for $\mu$ contains the hypothesized value $\mu_0$, what can you say about the P-value for testing the null hypothesis H0: $\mu = \mu_0$? • If the P-value is less than 0.05, what can you say about whether the 95% CI for $\mu$ contains the hypothesized value $\mu_0$ or not?