Regression & Correlation

Outline • X, Y & Regression Models • Simple linear regression (SLR) • The logic of SLR: SST = SSR + SSE • SLR: ANOVA table & R-square • Comparison of SLR, ANOVA, and the two-sample t test • Multiple Linear Regression • Pearson’s correlation coefficient (r) • Relationships among R², r, and b • Relationships among Z, t, F, and χ²

X and Y

Univariate analysis: one X, one Y • Note: methods marked with * require the following assumptions: • Normality • Independence.. • Abbreviations • Cat.: categorical; Num.: numerical

Multivariate analysis: multiple Xs, one Y • Note: methods marked with * require the following assumptions: • Multivariate normality • Independence.. • Abbreviations • Cat.: categorical; Num.: numerical • CART: classification and regression tree • ANOVA: analysis of variance • ANCOVA: analysis of covariance • MANOVA: multivariate analysis of variance • GEE: generalized estimating equations

Regression Models • Mathematical models to describe the relationship between Y and X • Uses of regression models • Adjustment • Prediction • Finding important factors for Y

Regression Models • Definition: • Mathematical models to describe the relationship between Y and X • Purpose: • Find important factors for Y and/or • Prediction

Simple linear regression (SLR) • Model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

SLR Example • Is there a linear relationship between age and cholesterol?

SLR: parameter estimation • The least squares method • Point estimate:

The logic of SLR: SST = SSR + SSE • Total amount unexplained at Xi (deviation from the mean) = amount at Xi explained by regression + amount at Xi unexplained by regression: $Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$ • SST = SSR + SSE
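The decomposition can be checked numerically. The sketch below is only an illustration: the five data points are hypothetical (they are not from the slides), and the variable names are chosen here for readability.

```python
# Hypothetical toy data (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Total, regression (explained), and error (unexplained) sums of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print(f"SSR + SSE = {ssr + sse:.3f}")
```

The identity holds exactly only for the least-squares line; for any other line the cross-product term does not vanish.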

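SLR: parameter estimation • The least squares method • min SSE: $\min_{b_0, b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$ • Point estimate • Set the partial derivatives of SSE with respect to the intercept and the slope to zero and solve for both • Intercept: $b_0 = \bar{Y} - b_1 \bar{X}$ • Slope: $b_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) \,/\, \sum (X_i - \bar{X})^2$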
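A minimal sketch of the least squares estimates, assuming the usual closed-form solution of the two normal equations. The function names `fit_slr` and `sse` and the data are hypothetical, introduced only for this illustration.

```python
def fit_slr(x, y):
    """Return (b0, b1) that minimize SSE = sum((y_i - b0 - b1 * x_i) ** 2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared residuals for the line b0 + b1 * x."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_slr(x, y)
best = sse(x, y, b0, b1)

# Nudging the coefficients away from the estimates can only increase SSE.
worse = [sse(x, y, b0 + d0, b1 + d1)
         for d0 in (-0.1, 0.1) for d1 in (-0.1, 0.1)]
print(b0, b1, best, min(worse))
```

Comparing `best` with the perturbed values is a quick sanity check that the closed-form estimates sit at the minimum of the SSE surface.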

SLR example: Regression line • Estimated model: CHOL = -57.5964988786446 + 5.65024919013205 × Age

SLR: ANOVA table & R-square • R² = 0.82, p = 0.0001
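The ANOVA table for SLR follows directly from the three sums of squares. The sketch below builds it for hypothetical data with n = 10 (these are not the age and cholesterol data behind the slide), so that the resulting F can be compared with the critical value F0.05(1, 8) = 5.32 quoted in the worked example of this deck.

```python
# Hypothetical data, n = 10 (not the data used in the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [3.0, 5.0, 4.0, 7.0, 8.0, 7.0, 10.0, 12.0, 11.0, 13.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

df_reg, df_err = 1, n - 2      # SLR has one predictor
msr = ssr / df_reg
mse = sse / df_err
f_stat = msr / mse
r2 = ssr / sst                 # R-square = SSR / SST

print(f"{'Source':<12}{'SS':>10}{'df':>4}{'MS':>10}{'F':>10}")
print(f"{'Regression':<12}{ssr:>10.3f}{df_reg:>4}{msr:>10.3f}{f_stat:>10.3f}")
print(f"{'Error':<12}{sse:>10.3f}{df_err:>4}{mse:>10.3f}")
print(f"{'Total':<12}{sst:>10.3f}{n - 1:>4}")
print(f"R-square = {r2:.3f}")
```

In SLR the F statistic and R-square carry the same information, since F = (n - 2) · R² / (1 - R²).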

SLR: qualitative covariate • Example: • X = treatment, 1 or 0 • Y = SBP • Hypothesis • H0: β1 = 0 • H1: β1 ≠ 0 • Comparison with the test of means: • H0: μ1 = μ0 • H1: μ1 ≠ μ0 • Note: β1 = μ1 - μ0

Comparison of SLR, ANOVA, and the two-sample t test • 2-s t → ANOVA • 2-s t → SLR: H0: μ1 = μ0 → H0: β1 = 0 • Dummy variables: K groups require K-1 dummy variables • ANOVA → SLR: H0: μ1 = μ2 = μ3 → H0: β1 = β2 = 0
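The equivalence between the two-sample t test and SLR on a 0/1 dummy variable can be seen numerically. This is a sketch under stated assumptions: the SBP values are hypothetical, and the t test is the pooled-variance (equal variance) version.

```python
import math

# Hypothetical SBP values (mmHg); x = 1 for treatment, x = 0 for control.
control = [120.0, 125.0, 130.0, 128.0]
treated = [135.0, 140.0, 138.0, 145.0]


def mean(values):
    return sum(values) / len(values)


# Two-sample t test with pooled variance.
n0, n1 = len(control), len(treated)
m0, m1 = mean(control), mean(treated)
ss0 = sum((v - m0) ** 2 for v in control)
ss1 = sum((v - m1) ** 2 for v in treated)
pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
t_two_sample = (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# SLR of Y on the dummy variable X.
x = [0.0] * n0 + [1.0] * n1
y = control + treated
n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx               # slope: equals m1 - m0
b0 = y_bar - b1 * x_bar        # intercept: equals m0
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
t_slope = b1 / math.sqrt(sse / (n - 2) / s_xx)

print(b1, m1 - m0)
print(t_slope, t_two_sample)
```

With K groups the same idea extends to K-1 dummy variables, and the overall F test of the regression matches the one-way ANOVA F test.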

Multiple Linear Regression • Model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i$ • Example: Is Age a predictor for SBP adjusting for Sex?
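A minimal sketch of fitting SBP on Age while adjusting for Sex, by solving the normal equations (X'X)b = X'y in plain Python. Everything here is an assumption made for illustration: the helper names `solve` and `fit_mlr`, the coding male = 1 and female = 0, and the data, which are generated without noise from SBP = 100 + 0.5·Age + 5·Sex so that the recovered coefficients are known in advance.

```python
def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def fit_mlr(rows, y):
    """rows: one list of predictor values per subject. Returns [b0, b1, ..., bk]."""
    design = [[1.0] + list(r) for r in rows]         # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: sex coded male = 1, female = 0.
ages = [30, 40, 50, 60, 35, 45, 55, 65]
sexes = [1, 1, 1, 1, 0, 0, 0, 0]
# Noise-free SBP, so the fit should recover 100, 0.5, and 5.
sbp = [100 + 0.5 * a + 5 * s for a, s in zip(ages, sexes)]

b0, b_age, b_sex = fit_mlr(list(zip(ages, sexes)), sbp)
print(b0, b_age, b_sex)
```

Here `b_age` is the change in SBP per year of age with sex held fixed, which is what "adjusting for Sex" means in the example.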

MLR: example • [Figure: SBP versus Age, plotted for males and females]

Pearson’s correlation coefficient (r) • Relationship between X and Y • Properties of Pearson’s r • Range: -1 ≤ r ≤ 1 • Unitless • Good for normally distributed X and Y • r can be viewed as the cosine of the angle between two (mean-centered) vectors in multidimensional space • Spearman’s correlation coefficient • Pearson’s r for ranked X and Y • Good for non-normally distributed X and Y

Spearman’s Rho: rank correlation • Relationship between X and Y • Spearman’s correlation coefficient • Pearson’s r for ranked X and Y • Good for non-normally distributed X and Y
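Both coefficients can be sketched in a few lines. Pearson's r is computed below as the cosine between the mean-centered vectors, and Spearman's rho as Pearson's r applied to ranks. The function names and the data (y = x³, monotone but not linear) are hypothetical choices for this illustration.

```python
import math


def pearson_r(x, y):
    """Cosine of the angle between the mean-centered vectors of x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    dy = [yi - y_bar for yi in y]
    dot = sum(a * b for a, b in zip(dx, dy))
    return dot / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Pearson's r computed on the ranks of x and y."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: a monotone but nonlinear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

print(pearson_r(x, y), spearman_rho(x, y))
```

Because the relationship is perfectly monotone, the rank correlation reaches 1 while Pearson's r stays below 1, which is why rank correlation is preferred when linearity or normality is doubtful.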

Assumptions in Regression • Linear • Independent • Normal distribution • Equal Variance • Note: for all values of x, the errors ε • are independent, • are normally distributed, • have the same SD σ = σ(ε), • have mean μ = 0 • Model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ • β0 and β1 are the unknown parameters • ε = random error fluctuations

Relationships among R², r, and b • r and b: $b = r \cdot s_Y / s_X$ • r²: Coefficient of Determination: • The proportion of the variability among the observed values of Y that is explained by the linear regression of Y on X.
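The links among r, b, and R² in simple linear regression can be verified on a small example. The data below are hypothetical, and the sketch assumes the sample standard deviations use the n - 1 divisor.

```python
import math

# Hypothetical toy data (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = s_xy / math.sqrt(s_xx * s_yy)      # Pearson's r
b1 = s_xy / s_xx                       # regression slope
sd_x = math.sqrt(s_xx / (n - 1))
sd_y = math.sqrt(s_yy / (n - 1))

# R-square from the sums of squares of the fitted line.
b0 = y_bar - b1 * x_bar
ssr = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
r_square = ssr / s_yy

print(f"r = {r:.4f}, b = {b1:.4f}, r * sd_y / sd_x = {r * sd_y / sd_x:.4f}")
print(f"r^2 = {r ** 2:.4f}, R-square = {r_square:.4f}")
```

Since the standard deviations are positive, r and b always share a sign, but a large r does not imply a large b: the slope also depends on the ratio of the two spreads.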

Relationship between r and b: they always have the same sign • [Figure panels: large r with small b; small r with large b]

Standard deviations associated with the regression line (1):

Standard deviations associated with the regression line (2): • The Standard Error of the Estimate • SE of the regression line (RL) • SE of prediction

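Standard deviations associated with the regression line (3): • Note (a): variance of b1: $\mathrm{Var}(b_1) = \sigma^2 / \sum (X_i - \bar{X})^2$ • Note (b): variance of b0: $\mathrm{Var}(b_0) = \sigma^2 \left( 1/n + \bar{X}^2 / \sum (X_i - \bar{X})^2 \right)$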
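The standard errors on these slides can be sketched with the usual textbook formulas, with σ estimated by the standard error of the estimate. This is an illustration under stated assumptions: the data are hypothetical (n = 10, not the textbook data), the helper `interval` is a name introduced here, and the critical value t(0.975, 8) = 2.306 is supplied by hand because the standard library has no t quantile function.

```python
import math

# Hypothetical data, n = 10 (not the textbook data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [3.0, 5.0, 4.0, 7.0, 8.0, 7.0, 10.0, 12.0, 11.0, 13.0]
T_CRIT = 2.306   # t(0.975, df = 8); 2.306 ** 2 is about 5.32 = F0.05(1, 8)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

s = math.sqrt(sse / (n - 2))                        # standard error of the estimate
se_b1 = s / math.sqrt(s_xx)                         # SE of the slope
se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / s_xx)    # SE of the intercept


def interval(x0, new_observation):
    """95% interval at x0 for the mean response or for one new observation."""
    extra = 1.0 if new_observation else 0.0
    se = s * math.sqrt(extra + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    centre = b0 + b1 * x0
    return centre - T_CRIT * se, centre + T_CRIT * se


ci_mean = interval(7.0, new_observation=False)   # SE of the regression line
ci_pred = interval(7.0, new_observation=True)    # SE of prediction
t_slope = b1 / se_b1

print(se_b0, se_b1, t_slope)
print(ci_mean, ci_pred)
```

The prediction interval for an individual is always wider than the confidence interval for the group mean at the same x0, because it adds the variance of a single new observation.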

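Example: • Blood cholesterol was measured in 10 men aged 30-39 at baseline (X) and again 10 years later (Y); the two measurements are compared below (source: 彭游生物統計學, ROC year 89, p. 374). Questions: • What are the regression coefficient and the intercept? • What is the correlation coefficient r? • Is the correlation coefficient statistically significant? Given: F0.05(1, 8) = 5.32 • What proportion of the variation in cholesterol 10 years later is attributable to the variation in cholesterol 10 years earlier? • Is the sample regression coefficient statistically significant? • A man's current cholesterol is 350; predict his cholesterol 10 years later and its 95% CI • A group of men has a mean cholesterol of 350; what is their cholesterol 10 years later and its 95% CI? • Partial solution: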

Example: partial solution (continued)

Logistic Regression • Topic: prediction when Y is a categorical variable • Predicting a nominal or categorical outcome • Diseased or not; died or not • Odds Ratio • Study designs: • Cross-sectional study • Cohort study (follow-up study) • Case-control study • Clinical trial

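Odds ratio • Odds are another way of expressing a probability • Odds are what bookmakers quote as betting odds • Odds ratio (OR) • Disease rate in the exposed group: p1 = A / (A+B) • Disease rate in the control group: p0 = C / (C+D) • In World Cup betting, Brazil pays 1 for every 1 staked, while China pays 100 for every 1 staked • What is the odds ratio of Brazil to China?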
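A short sketch of odds and the odds ratio from a 2x2 table, using p1 = A / (A+B) and p0 = C / (C+D) as defined above. The counts are hypothetical. The betting part rests on an assumption about how to read the quoted prices: a payout of k per 1 staked is treated as odds of 1 to k of winning, which is one common reading and not necessarily the one intended on the slide.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


def prob(o):
    """Convert odds back to a probability."""
    return o / (1 + o)


def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: exposed (a diseased, b not), control (c diseased, d not)."""
    p1 = a / (a + b)
    p0 = c / (c + d)
    return odds(p1) / odds(p0)      # algebraically equal to (a * d) / (b * c)


# Hypothetical counts.
example_or = odds_ratio(20, 80, 10, 90)

# Betting example, under the reading stated above (an assumption).
brazil_odds = 1 / 1       # pays 1 per 1 staked -> odds of winning 1:1
china_odds = 1 / 100      # pays 100 per 1 staked -> odds of winning 1:100
betting_or = brazil_odds / china_odds

print(example_or, betting_or)
```

Under that reading Brazil's odds of winning are 100 times China's, even though the implied probabilities (1/2 versus 1/101) differ by a factor of only about 50.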

Study designs in epidemiology: • Cross-sectional study • Cohort study (follow-up study) • Case-control study • Clinical trial

Bias in epidemiology • Selection bias • Information bias • Misclassification • Confounding

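Cross-sectional study • Purpose: • Prevalence surveys • Health administration needs • Key point: • Study subjects must be representative: random sampling • Limitation: • No temporal sequence, so causality cannot be established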

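Case-control study • Purpose: • Causal analysis • Compare exposure rates between the case group and the control group • Key points: • Selection of controls • Controls must represent the exposure experience of the source population from which the cases arose • Limitations: • Temporality • Recall bias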

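Cohort study (follow-up study) • Purpose: • Causal analysis • Compare disease incidence between the exposed and unexposed groups • Key point: • Follow-up • Limitation: • Loss to follow-up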

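Confounding factors • [Diagram: Obesity, Cholesterol, MI] • Definition of a confounder: • It is independently associated with the disease; it is itself a risk factor • It is associated with the risk factor under study • A confounder must not be a mediator (intermediate variable): • X1 → X2 → Y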

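Clinical trial • Purpose: evaluate the effect of an intervention • Interventions: drug treatment, health education • Key points: • Randomization: controls for confounding factors • Placebo effect • Limitation: • Ethical issues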

Relationships among study designs • Case-control study • Matched case-control study • Cohort study • Matched cohort study • Randomized clinical trial • Complete matched cohort study • Causality and correlation • Y = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 … • [Diagram labels: E, covariate, confounder]

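Logistic regression: • Simple linear regression: $Y = \beta_0 + \beta_1 X + \varepsilon$ • Logistic regression: Y is a binary categorical variable • How can p = P(Y = 1), which lies in (0, 1), be mapped to (-∞, ∞)? • Logistic transformation: $\operatorname{logit}(p) = \ln \frac{p}{1 - p} = \beta_0 + \beta_1 X$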
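The logistic (logit) transformation and its inverse can be sketched as follows. The function names are chosen here for illustration; the inverse is what turns a linear predictor back into a probability.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the whole real line: ln(p / (1 - p))."""
    return math.log(p / (1 - p))


def inv_logit(z):
    """Map any real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))


for p in (0.001, 0.2, 0.5, 0.8, 0.999):
    z = logit(p)
    print(f"p = {p:<6} logit = {z:8.3f}  back to p = {inv_logit(z):.3f}")
```

Probabilities near 0 map to large negative values and probabilities near 1 to large positive values, so a linear model on the logit scale can never predict a probability outside (0, 1).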

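Logistic regression coefficients and the OR • OR = exp(beta) • If X is a categorical variable with three or more groups, exp(beta) is the OR compared with the reference group • If X is a continuous variable, exp(beta) is the OR for each one-unit increase in X • If the model has several X variables, the interpretation is the same, with the added condition "holding the other X variables constant" • Example: • X is sex, male x = 1, female x = 0; Y is whether suicide occurred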
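For a single binary X, the fitted logistic regression reproduces the observed log odds in each group, so the coefficients can be read straight off a 2x2 table. The sketch below assumes that simple case; the counts are hypothetical (they are not suicide data), and `or_for_increase` is a helper name introduced here for the continuous case.

```python
import math


def log_odds(events, non_events):
    """Log odds of the outcome within one group."""
    return math.log(events / non_events)


# Hypothetical 2x2 counts.
a, b = 30, 70    # x = 1: Y = 1, Y = 0
c, d = 15, 85    # x = 0: Y = 1, Y = 0

beta0 = log_odds(c, d)                       # log odds in the reference group
beta1 = log_odds(a, b) - log_odds(c, d)      # difference in log odds
odds_ratio = math.exp(beta1)                 # equals (a * d) / (b * c)


def or_for_increase(beta, units):
    """OR for a `units`-unit increase in a continuous X with coefficient beta."""
    return math.exp(beta * units)


print(beta0, beta1, odds_ratio)
print(or_for_increase(0.1, 10))
```

The effect of a continuous X is multiplicative on the odds scale: a coefficient of 0.1 per unit gives an OR of exp(1), about 2.72, for a 10-unit increase, not 10 times the one-unit OR.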

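Textbook example: LR • Men with unintentional injury • Soderstrom, 1997; Table 10-5, p. 247 • Conclusions: • White patients who came to the emergency room on weekend nights were more likely to have an elevated blood alcohol concentration (BAC > 50 mg/dL) • Age showed no statistically significant difference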

Relationships among Z, t, F, and χ² • Z² and chi-square • Population mean known: • Definition: [formula] or [formula] • Conclusion:

Relationships among Z, t, F, and χ² • F and chi-square • Population mean unknown: • Definition: • Conclusion:

Relationships among Z, t, F, and χ²
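For reference, the standard textbook relationships among the four distributions are listed below. They are not reproduced from the slides' own formulas, and they assume independent samples from a normal population with variance σ².

```latex
\begin{align*}
% Z and chi-square
Z \sim N(0,1) \;&\Rightarrow\; Z^2 \sim \chi^2_{1} \\
% Population mean \mu known
\sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^2 &\sim \chi^2_{n} \\
% Population mean unknown, estimated by \bar{X}
\sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{\sigma} \right)^2
  = \frac{(n-1)\,s^2}{\sigma^2} &\sim \chi^2_{n-1} \\
% t built from Z and an independent chi-square
t_{\nu} &= \frac{Z}{\sqrt{\chi^2_{\nu} / \nu}} \\
% F as a ratio of independent scaled chi-squares
F_{\nu_1, \nu_2} &= \frac{\chi^2_{\nu_1} / \nu_1}{\chi^2_{\nu_2} / \nu_2} \\
% Squaring t gives F with 1 numerator degree of freedom
t_{\nu}^2 &\sim F_{1, \nu}
\end{align*}
```

A quick numerical check: t(0.975, 8) = 2.306 and 2.306² ≈ 5.32, which is the F0.05(1, 8) value quoted in the worked example; likewise 1.96² ≈ 3.84, the 5% critical value of χ² with 1 degree of freedom.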
