Chocolate Cake Seminar Series on Statistical Applications
Today’s Talk: Non-poisonous Poisson Regression At Your Fingertips
By Dr. Olga Korosteleva

Outline of Presentation
• Poisson Regression: Applications in SAS and SPSS
• Zero-inflated Poisson Regression: Applications in SAS (no SPSS)
The Poisson Distribution
Count data are observations that assume only non-negative integer values: 0, 1, 2, etc. Count data have a Poisson distribution if the frequencies of the values have the following features:
• Small-valued observations are quite common.
• Starting at some value, frequencies decrease very rapidly.
• The average of the observations is approximately equal to their variance.
Example of Data with Poisson Distribution
One of the questions in a Health Sciences survey asked how many times a respondent visited a doctor in the past month. The frequencies of the responses were tabulated for a sample of 150 people (the frequency table is not reproduced here). Note that
• 0 visits is quite a common response,
• 1, 2, or 3 visits are the most frequent observations,
• starting with 4 visits, frequencies quickly decrease,
• the average is 2.60 visits and is nearly equal to the variance, which is 2.63 visits squared.
These are the features of a variable distributed according to a Poisson distribution.
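Formula for Poisson Distribution
The Poisson distribution is discrete, with the probability function given by
$$P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!},$$
where $y = 0, 1, 2, \dots$ and $\lambda > 0$. By definition, $0! = 1$. Here $\lambda$ is both the mean and the variance of $Y$, and is termed the rate. Note that the probabilities of small values are reasonably high, and for larger values the probabilities decrease very fast: $P(Y=0) = e^{-\lambda}$, $P(Y=1) = \lambda e^{-\lambda}$, $P(Y=2) = \frac{\lambda^{2}}{2} e^{-\lambda}$, $P(Y=3) = \frac{\lambda^{3}}{6} e^{-\lambda}$, …, $P(Y=y) = \frac{\lambda^{y}}{y!} e^{-\lambda}$, …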
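As a quick numerical check of the standard Poisson probability function, $P(Y=y) = \lambda^{y} e^{-\lambda} / y!$, the sketch below tabulates the probabilities and confirms that the mean and the variance both come out equal to the rate. It is written in Python rather than SAS purely for illustration; the function name `poisson_pmf` and the choice of rate 2.6 (close to the average in the doctor-visits example) are assumptions, not part of the talk.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) = lam**y * exp(-lam) / y! for y = 0, 1, 2, ..."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

lam = 2.6  # assumed rate, chosen near the sample mean in the doctor-visits example

# Truncate the infinite support at 60; the tail beyond that is negligible here.
probs = [poisson_pmf(y, lam) for y in range(60)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

for y in range(9):
    print(y, round(probs[y], 4))          # probabilities fall off quickly for larger y
print("mean:", round(mean, 4), "variance:", round(var, 4))
```

Printing the first few probabilities shows the pattern described on the slide: small counts carry most of the probability, and the tail drops off fast.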
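The Poisson Regression Model
The Poisson regression model specifies that the dependent variable $Y$, given independent variables $x_1, \dots, x_k$, follows a Poisson distribution with the probability function
$$P(Y = y \mid x_1, \dots, x_k) = \frac{\lambda^{y} e^{-\lambda}}{y!}, \quad y = 0, 1, 2, \dots,$$
where the rate $\lambda = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)$, or, equivalently, $\ln \lambda = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$.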
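To make the log-linear rate concrete, here is a minimal sketch of fitting a Poisson regression with one predictor, $\ln \lambda = \beta_0 + \beta_1 x$, by maximum likelihood using Newton-Raphson steps. It is an illustration in Python, not a description of what SAS or SPSS do internally; the function name `fit_poisson` and the tiny data set are hypothetical.

```python
import math

def fit_poisson(x, y, iters=25):
    """Maximum likelihood fit of ln(lambda) = b0 + b1*x; returns (b0, b1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0    # start from the intercept-only fit
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: X^T (y - mu)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix X^T diag(mu) X = [[a, b], [b, c]]
        a = sum(mu)
        b = sum(mi * xi for xi, mi in zip(x, mu))
        c = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = a * c - b * b
        # Newton step: solve the 2x2 linear system explicitly
        b0 += (c * g0 - b * g1) / det
        b1 += (a * g1 - b * g0) / det
    return b0, b1

# Hypothetical data: x is a 0/1 indicator, y is a count.
x = [0, 0, 0, 1, 1, 1]
y = [1, 2, 3, 4, 5, 6]
b0, b1 = fit_poisson(x, y)
print("fitted mean when x=0:", math.exp(b0))
print("fitted mean when x=1:", math.exp(b0 + b1))
```

With a single 0/1 predictor, the fitted means should reproduce the two group averages, which gives a simple way to sanity-check the fit.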
Interpretation of Coefficients
• If $x_1$ is continuous, then the quantity $100\%\,(\exp(\hat{\beta}_1) - 1)$ represents the estimated percent change in the mean response when $x_1$ is increased by one unit and the other variables are held fixed.
• If $x_1$ is an indicator of one level of a categorical variable with several levels, then $100\%\,\exp(\hat{\beta}_1)$ represents the estimated mean response for that level as a percentage of the mean response for the reference level, provided the other variables are unchanged.
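The reasoning behind both interpretations can be written out in one line, assuming the log-linear rate $\lambda = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)$. Increasing $x_1$ by one unit while holding the other variables fixed multiplies the mean response by $e^{\beta_1}$:

```latex
\frac{\lambda(x_1 + 1, x_2, \dots, x_k)}{\lambda(x_1, x_2, \dots, x_k)}
  = \frac{\exp\bigl(\beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \dots + \beta_k x_k\bigr)}
         {\exp\bigl(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k\bigr)}
  = e^{\beta_1},
\qquad\text{so}\qquad
\text{percent change} = 100\% \,\bigl(e^{\beta_1} - 1\bigr).
```

For a 0/1 indicator, the same ratio compares the level coded 1 with the reference level coded 0, which is why $100\%\,e^{\beta_1}$ reads as a percent ratio.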
Goodness-of-Fit Test
• A measure of goodness of fit of the Poisson regression model is obtained by computing the deviance statistic of a base model against the full model. The base model includes only the intercept, while the full model includes the intercept and all the $x$-variables. The deviance is defined as -2 multiplied by the log-likelihood ratio: deviance = -2 ( ln L(base model) - ln L(full model) ).
• The deviance is used as a test statistic for testing H0: the base model has a good fit against H1: the full model has a good fit. Under H0, the deviance has a chi-squared distribution with degrees of freedom equal to the number of $x$-variables in the full model.
• If the deviance is large (formally, p-value < 0.05), then H0 is rejected and the conclusion is that the full model has a good fit.
Numerical Example: SAS Application
In the doctor visits example, age and health status were also recorded for each respondent. The code below uses the Poisson regression model to regress the number of doctor visits on age and health.

    data docvisits;
      input n_visits age health $ 9. @@;
      datalines;
    0 18 excellent 1 59 good 2 54 fair 3 37 fair 4 48 good
    more data lines
    1 52 good 2 48 good 3 43 good 4 57 fair 8 71 poor
    ;

    data recoded; /* create dummy variables manually; 'excellent' is the reference level */
      set docvisits;
      if health='poor' then poor=1; else poor=0;
      if health='fair' then fair=1; else fair=0;
      if health='good' then good=1; else good=0;
    run;

    proc countreg data=recoded type=poisson; /* full model */
      model n_visits=age poor fair good;
    run;

    proc countreg data=recoded type=poisson; /* base model */
      model n_visits=;
    run;
Numerical Example: Results The relevant output for the full model is
Numerical Example: Results The relevant output for the base model is
Numerical Example: Results
Goodness of Fit:
• The deviance = -2 (-277.00259 - (-237.17235)) = 79.66048, hence the P-value = $P(\chi^2(4) > 79.66048) < 0.0001$. The conclusion is that the full model has a good fit.
Interpretation of Betas:
• For every one-year increase in age, the estimated mean number of doctor visits increases by 100% (Exp(0.026002) - 1) = 2.6343%.
• The estimated mean number of doctor visits for people in poor health is 100% Exp(0.58838) = 180.1068% of that for people in excellent health.
• The estimated mean number of doctor visits for people in fair health is 100% Exp(0.253346) = 128.8329% of that for people in excellent health.
• The estimated mean number of doctor visits for people in good health is 100% Exp(0.172519) = 118.8294% of that for people in excellent health.
Significance:
• Only age and poor health are significant predictors in this model.
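The arithmetic on this slide can be reproduced with a few lines of Python, shown here as a sketch rather than as part of the SAS workflow. The log-likelihoods and coefficients are the ones quoted on the slide; the helper names are hypothetical, and the chi-squared tail formula used is a closed form that holds only for an even number of degrees of freedom (which covers the 4 degrees of freedom here).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail P(chi2(df) > x); this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

def percent_change(beta):
    """Percent change in the mean response per unit increase of a continuous predictor."""
    return 100 * (math.exp(beta) - 1)

def percent_ratio(beta):
    """Mean response for a level as a percentage of the reference-level mean."""
    return 100 * math.exp(beta)

loglik_base = -277.00259   # intercept-only model, as quoted on the slide
loglik_full = -237.17235   # model with age, poor, fair, good
deviance = -2 * (loglik_base - loglik_full)
p_value = chi2_sf_even_df(deviance, df=4)

print("deviance:", round(deviance, 5), "p-value:", p_value)
print("age:", round(percent_change(0.026002), 4), "% change per year")
for name, beta in [("poor", 0.58838), ("fair", 0.253346), ("good", 0.172519)]:
    print(name, round(percent_ratio(beta), 4), "% of the excellent-health mean")
```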
Numerical Example: SPSS Application
• SPSS Basic Syntax

    GENLIN n_visits BY health (ORDER=ASCENDING) WITH age
      /MODEL health age INTERCEPT=YES DISTRIBUTION=POISSON LINK=LOG
      /PRINT FIT SUMMARY SOLUTION (EXPONENTIATED).

• SPSS Point-N-Click Instructions
Analyze → Generalized Linear Models → Generalized Linear Models → Poisson loglinear (fill in the bubble) → Response tab → Identify dependent variable → Predictors tab → Identify factors and covariates → Model tab → Identify the model → Statistics tab → Include exponential parameter estimates (check the box) → hit OK.
More Examples of Poisson Regression
• The number of cars in line in front of you at a McDonald’s drive-through. Predictors may include the day of the week and the hour of the day.
• The number of deaths from myocardial infarction last year. The predictors may include the type of hospital (private or public) where the patient was treated and the total number of patients admitted last year.
• The number of field mice per acre in mid-summer at a particular location. The predictors may include the lowest winter temperature and the total rainfall in the spring at that location.
• The number of accidents last month on a specified stretch of a freeway. The predictors may include day of the week, time of the day (morning/afternoon/evening), traffic direction (eastbound/westbound), and sunlight brightness.
Zero-Inflated Poisson (ZIP) Regression
Example. Suppose you randomly choose 100 students and ask them how many cigarettes they smoked yesterday. Some students will report that they smoked zero cigarettes. There are two possible reasons for that: either they don’t smoke at all, or they happened not to smoke a single cigarette that day.
Definition. A structural zero is recorded when the behavior is not in the respondent’s behavioral repertoire under study (e.g., the person doesn’t smoke).
Definition. A chance zero is recorded when the behavior is normally in the respondent’s behavioral repertoire under study but did not occur during the studied time frame (e.g., the person just happened not to smoke yesterday).
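Zero-Inflated Poisson (ZIP) Regression
The presence of structural zeros inflates the number of zeros beyond what the Poisson model allows, which makes the model invalid. A zero-inflated Poisson (ZIP) model is used instead. In the ZIP model, the response variable
$$Y = \begin{cases} 0, & \text{with probability } \pi, \\ \text{Poisson}(\lambda), & \text{with probability } 1 - \pi, \end{cases}$$
that is,
$$P(Y = 0) = \pi + (1 - \pi)\, e^{-\lambda}, \qquad P(Y = y) = (1 - \pi)\, \frac{\lambda^{y} e^{-\lambda}}{y!}, \quad y = 1, 2, \dots$$
Here
$$\lambda = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k) \quad\text{and}\quad \pi = \frac{\exp(\gamma_0 + \gamma_1 z_1 + \dots + \gamma_m z_m)}{1 + \exp(\gamma_0 + \gamma_1 z_1 + \dots + \gamma_m z_m)},$$
where $x_1, \dots, x_k$ are the predictors, $\beta_0, \dots, \beta_k$ are the regression coefficients, $z_1, \dots, z_m$ are the zero-inflated predictors responsible for inflation of the number of zeros in the model, and $\gamma_0, \dots, \gamma_m$ are the zero-inflated coefficients. The parameters of the model to be estimated from the data are $\beta_0, \dots, \beta_k$ and $\gamma_0, \dots, \gamma_m$.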
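A small Python sketch can make the mixture structure tangible. It assumes the usual ZIP formulation: a structural zero occurs with probability $\pi$ (modeled through a logistic link), and otherwise the count is Poisson with rate $\lambda$ (modeled through a log link). The function names and the coefficient values below are hypothetical and serve only to exercise the formulas.

```python
import math

def zip_pmf(y, lam, pi):
    """P(Y = y) for a zero-inflated Poisson: rate lam, structural-zero probability pi."""
    poisson = lam ** y * math.exp(-lam) / math.factorial(y)
    if y == 0:
        return pi + (1 - pi) * poisson    # structural zeros plus chance zeros
    return (1 - pi) * poisson

def rate(betas, xs):
    """lambda = exp(beta_0 + beta_1*x_1 + ...); betas[0] is the intercept."""
    return math.exp(betas[0] + sum(b * x for b, x in zip(betas[1:], xs)))

def inflation_prob(gammas, zs):
    """pi from a logistic link; gammas[0] is the intercept."""
    eta = gammas[0] + sum(g * z for g, z in zip(gammas[1:], zs))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients and predictor values, for illustration only.
lam = rate([0.5, 0.04], [30])
pi = inflation_prob([-0.2, -1.0], [1])

print("P(Y=0) under plain Poisson:", math.exp(-lam))
print("P(Y=0) under ZIP:          ", zip_pmf(0, lam, pi))
```

Comparing the two printed values shows how the structural-zero component raises the probability of a zero above what a plain Poisson model with the same rate would give.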
Numerical Example: SAS Application

    data smoking;
      input gender $ age cigarettes @@;
      datalines;
    F 54 6 M 37 0 F 48 12 M 27 0 M 55 0 M 32 0 F 49 12 F 45 11
    more data lines
    F 19 7 F 35 2 M 39 0 M 43 6
    ;

    data smoking; /* recode gender as a 0/1 indicator */
      set smoking;
      if gender='F' then female=1; else female=0;
      keep female age cigarettes;
    run;

    /* The base model */
    proc countreg;
      model cigarettes = / dist=zip;
      zeromodel cigarettes ~ ;
    run;

    /* Model: x1=age, z1=female */
    proc countreg;
      model cigarettes = age / dist=zip;
      zeromodel cigarettes ~ female;
    run;
Numerical Example: SAS Output The output is Base Model: ZIP:
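Numerical Example: Results
• Goodness of Fit: The deviance = -2 (-120.94052 - (-103.75030)) = 34.38044, and the P-value = $P(\chi^2(2) > 34.38044) < 0.0001$. The conclusion is that the full model has a good fit.
• Fitted Model: In the fitted model, $\hat{\lambda} = \exp(\hat{\beta}_0 + 0.047176 \cdot \text{age})$ and $\hat{\pi} = \exp(\hat{\gamma}_0 - 1.467176 \cdot \text{female}) \,/\, (1 + \exp(\hat{\gamma}_0 - 1.467176 \cdot \text{female}))$, where $\hat{\beta}_0$ and $\hat{\gamma}_0$ denote the estimated intercepts of the Poisson part and the zero-inflation part, as given in the SAS output.
• Interpretation of Coefficients:
  • For a one-year increase in age, the estimated mean number of cigarettes smoked in one day increases by 100% (Exp(0.047176) - 1) = 4.83% (significant predictor).
  • The estimated odds in favor of non-smoking for women are 100% Exp(-1.467176) = 23.06% of those for men (significant predictor).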
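The odds interpretation follows from the logistic form of the zero-inflation part. As a sketch, write $\pi$ for the probability of a structural zero (non-smoking) and assume $\ln\bigl(\pi / (1 - \pi)\bigr) = \gamma_0 + \gamma_1 \cdot \text{female}$, where the symbols $\gamma_0$ and $\gamma_1$ are notation introduced here for the zero-inflated intercept and coefficient:

```latex
\frac{\text{odds of non-smoking for women}}{\text{odds of non-smoking for men}}
  = \frac{\exp(\gamma_0 + \gamma_1 \cdot 1)}{\exp(\gamma_0 + \gamma_1 \cdot 0)}
  = e^{\gamma_1},
\qquad
e^{\hat{\gamma}_1} = e^{-1.467176} \approx 0.2306 .
```

The intercept cancels in the ratio, so the 23.06% figure depends only on the coefficient of the female indicator.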
Numerical Example: SPSS Application • SPSS currently doesn’t have a procedure that fits zero-inflated Poisson models.
More Examples of ZIP Regression
• The number of days a school child is tardy over a period of one year. Predictors may include gender and standardized test scores in math and language arts. Many school children are never tardy.
• The number of fish caught by a party at a lake during a weekend. Predictors may include the number of people in the group, the number of children, and whether they came with a camper (a motor vehicle designed for travel). Some visitors don’t fish.
• The number of insurance claims submitted by a policyholder over a one-year period. Predictors may include gender, age, and previous-year record. Many policyholders don’t have claims to submit.
• Non-adherence to medication: how many times a patient in a study didn’t take her/his medication during the last week. Predictors may include age, race, BMI, primary disease, and comorbidity. Many patients always take their medication.
Potential Problem with Poisson Regression
• In Poisson regression, it is assumed that the mean and the variance of the response variable are approximately the same. This is rarely the case with real-life data.
• Often the variance is much larger than the mean. This situation is called overdispersion.
• There is a formal test for overdispersion, and the suggested remedy is to fit a Negative Binomial regression model instead.
• We will cover this regression during one of the presentations in the fall.
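Before turning to a formal test, a rough first look at overdispersion is to compare the sample variance with the sample mean. The Python sketch below computes that ratio; it is an informal descriptive check, not the formal test mentioned on the slide, and both the function name `dispersion_index` and the counts are hypothetical.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean; values near 1 are consistent with Poisson."""
    return variance(counts) / mean(counts)

# Hypothetical counts with a few large values, made up for illustration.
counts = [0, 0, 0, 0, 1, 1, 2, 3, 8, 15]
ratio = dispersion_index(counts)
print("mean:", mean(counts), "variance:", round(variance(counts), 2), "ratio:", round(ratio, 2))
```

A ratio well above 1, as with these made-up counts, suggests that the variance exceeds the mean by more than a Poisson model would allow.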