180 likes | 317 Views
Introduction to resampling in MATLAB. So you've done an experiment. Two independent datasets: control experimental Have n numbers in each dataset, representing independent measurements Question: Did the experimental manipulation have an effect, i.e. did it make a difference?.
E N D
So you've done an experiment... • Two independent datasets: • control • experimental • Have n numbers in each dataset, representing independent measurements • Question: Did the experimental manipulation have an effect, i.e. did it make a difference?
The question restated • Is there a significant (i.e. “real”) numerical difference between the control and experimental datasets? • What is the probability that the two datasets are subsets of two different/independent distributions? • What is the probability that the two datasets are NOT subsets of the same underlying distribution?
Answering the question • What do we need? • A number(s) to represent/summarize each dataset → datadescriptors • A test for comparing these numbers → test statistic • A way to assess the significance of the test statistic → distribution of test statistics generated from datasets that ARE from the same underlying distribution
Descriptors and tests • Know your data! • Visualize it • Does it have a central tendency? (i.e. is histogram vaguely mound shaped?) • If yes, use the mean as descriptor • If no, median may be better • Typical test: compute difference (subtraction) between descriptors
We'll discuss how to assess the significance of your test statistic soon, that's the resampling part... • But first let's visualize some data
Simulate and inspect data ctrl = 10.*(5.75+randn(25,1)); exp = 10.*(5+randn(25,1)); boxplot([ctrl exp]) groups = {'control''experimental'} boxplot([ctrl exp],'labels',groups)
Central tendency? hist(ctrl) hist(exp) title('Control group') title('Experimental group')
Compute descriptors and test stat • Our data looks pretty mound shaped, so let's use the mean as data descriptor • Take the absolute value of the difference as our test stat ctrl_d = mean(ctrl); exp_d = mean(exp); test_stat = abs(ctrl_d - exp_d);
Assessing the test statistic • What we would really like to test: • The probability that the two datasets are subsets of two different/independent distributions • Problem: • We don't know what those hypothetical independent distributions are, if we did we wouldn't have to go through all of this!
Assessing the test statistic • Solution: • Compute the probability that the two datasets are NOT subsets of the same underlying distribution • How to do this? • Start by assuming datasets are subsets of same distribution → null hypothesis • See what test statistic looks like when null hypothesis is true
Assessing the test statistic • Need to generate a distribution of test statistic generated when datasets really ARE from the same underlying distribution → distribution of test stat under null hypothesis • Then we can quantify how (a)typical the value of our test stat is under null hypothesis • if typical, our datasets are likely from same distribution → no effect of experiment • if atypical, there is a good chance datasets are from different distributions → experiment had an effect
Generate the null distribution using bootstrap • Basic procedure: • Combine datasets • Randomly sample (w/ replacement) from combined dataset to create two pseudo datasets w/ same n as real datasets • Compute descriptors and test statistic for pseudo datasets • Repeat 10,000 times, keeping track of pseudo test stat
Combine datasets and resample box = [ctrl; exp]; % combine datasets into 'box' % could use concat() pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist' % to store results of resampling % --- start resampling --- for trials = 1:10000 % resample 10000 times pseudo_ctrl = ... % create pseudo groups to be sample(length(ctrl), box); % same size as actual groups pseudo_exp = ... sample(length(exp), box); pseudo_ctrl_d = ... % compute pseudo group mean(pseudo_ctrl); % descriptors pseudo_exp_d = ... mean(pseudo_exp); pseudo_stat = ... % compute pseudo test stat abs(pseudo_ctrl_d - pseudo_exp_d); pseudo_stat_dist(trials) = ... % store pseudo test stat to pseudo_stat; % build null distribution end
Now what? • Compute how many values of pseudo test stat are greater than our actual test stat, then divide by the total number of data points in the null distribution • This numerically approximates the likelihood of getting our actual test stat (i.e. the likelihood of seeing a difference as big as we see) if our two datasets were truly from the same underlying distribution → p-value
Compute p-val and visualize null distribution bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats % bigger than actual stat pval = bigger / length(pseudo_stat_dist) % divide by total number % of pseudo test stats to % compute p-value % could use proportion() hist(pseudo_stat_dist) % show histogram of pseudo title('Pseudo test stat distribution') % test stats xlabel('Pseudo test stat') ylabel('Frequency')
How to interpret the p-value? • Assuming that the null hypothesis is true, the p-value is the likelihood that we would see a value of the test statistic that is greater than the value of our actual test statistic Restated → • Assuming both datasets really are from the same underlying distribution, the p-value is the likelihood that we would see a difference as big as we do
function [] = bootstrap2groupsMEANS (ctrl, exp) % % function [] = bootstrap2groupsMEANS (ctrl, exp) % % 'ctrl' and 'exp' are assumed to be column vectors, this is % necessary for creation of 'box' % % This function will use bootstrap to compute the probability % that data in 'ctrl' and 'exp' are from the same distribution, % using the group means as descriptors, and the absolute value % of the difference between means as the test statistic % (similar to Student's t-test) % --- prepare for resampling --- ctrl_d = mean(ctrl); % compute descriptors exp_d = mean(exp); % 'ctrl_d' and 'exp_d' test_stat = abs(ctrl_d - exp_d) % compute test statistic 'test_stat' box = [ctrl; exp]; % combine datasets into 'box' % could use concat() pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist' % to store results of resampling % --- start resampling --- for trials = 1:10000 % resample 10000 times pseudo_ctrl = ... % create pseudo groups to be sample(length(ctrl), box); % same size as actual groups pseudo_exp = ... sample(length(exp), box); pseudo_ctrl_d = ... % compute pseudo group mean(pseudo_ctrl); % descriptors pseudo_exp_d = ... mean(pseudo_exp); pseudo_stat = ... % compute pseudo test stat abs(pseudo_ctrl_d - pseudo_exp_d); pseudo_stat_dist(trials) = ... % store pseudo test stat to pseudo_stat; % build null distribution end % --- end resampling --- bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats % bigger than actual stat pval = bigger / length(pseudo_stat_dist) % divide by total number % of pseudo test stats to % compute p-value % could use proportion() hist(pseudo_stat_dist) % show histogram of pseudo title('Pseudo test stat distribution') % test stats xlabel('Pseudo test stat') ylabel('Frequency')