Part II – with interactions of genes in mind Min-Te Chao 2002/10/28
So far, all methods are one-gene-at-a-time • At first these methods are simple and intuitive, but then they become complicated. • E.g., Efron has to use a tricky logistic regression to estimate the prior density, which is not easy.
The general problem with microarray data is that, although it resembles a regression setup, the “design matrix” is never of full rank.
In the setup Y = X*\beta + error, X is n by p, with n < 100 and p > 1000. I have seen a case with n = 7 but p > 6000.
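A minimal sketch (with hypothetical dimensions) of why classical least squares breaks down in this setting: the design matrix has rank at most n, so X'X is singular and \beta is not identifiable.

```python
import numpy as np

# Hypothetical microarray-like dimensions: far more genes than samples.
n, p = 7, 6000
rng = np.random.default_rng(0)
X = rng.normal(size=(n, p))      # expression levels, n samples by p genes
y = rng.normal(size=n)           # phenotype / response

print(np.linalg.matrix_rank(X))  # at most n = 7, never p = 6000
# X'X is p-by-p but has rank <= n, so it is singular; the usual OLS
# estimate (X'X)^{-1} X'y does not exist.
```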
Let us say there is a way to “do the statistical problem” (say, with traditional methods) with a smaller p, say p = p_1 = 3 or 30, depending on the value of n we have. • Let us assume a model with the first p_1 parameters only (the other betas are all 0, say)
With our traditional method, we may find the likelihood function – with n observations and p_1 parameters • Then we go through the textbook method to do inference about the selected p_1 parameters • And obtain an estimator of the p_1-dim parameter (together with a sd or p-value)
Repeat the procedure B times, each time with a simple random sample of size p_1, drawn without replacement from the p genes in the problem.
In this way we change an unsolvable problem (in our classical statistical sense) into B problems, all of which can be done with traditional methods • It is very time-consuming, but sometimes it works
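A minimal sketch of this repeated random-subsets idea, assuming a plain least-squares fit on each subset (the function name and the choice p_1 = 3 are illustrative, not from the slides):

```python
import numpy as np

def random_subset_fits(X, y, p1=3, B=1000, seed=0):
    """Fit OLS B times, each time on p1 columns sampled without replacement."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    results = []
    for _ in range(B):
        cols = rng.choice(p, size=p1, replace=False)  # SRS of p1 genes
        Xs = X[:, cols]
        # Classical least squares is well posed here because p1 < n.
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        results.append((cols, beta))
    return results
```

Each entry records which genes were in the sub-sample and their fitted coefficients; in practice one would also keep the standard errors or p-values from the textbook inference step.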
Lo, Shaw-Hwa and Tian Zheng (2002). Backward haplotype transmission association algorithm – a fast multi-marker screening method. To appear in Human Heredity.
Instead of genes, they use markers. • p markers, n patients • For each patient, we have data from the father and the mother • So we have n pieces of parent–child data.
The problem is to identify which are the disease-causing markers
They pick out r markers at a time, r << p • A statistic T(r) is constructed, which measures the “amount of information” in an n-patient, r-marker sub-problem • Markers in this sub-problem are deleted one by one, the least important first, until all remaining markers are important
This gives us group 1 of important markers. • We do the same thing for another subset of r markers and get group 2 of important markers, … • Do this B times, with B pretty large, say 5000
Combine all the returned markers; those with the highest frequencies are selected. • More specifically, markers whose returning frequencies exceed the 3rd quartile plus 1.8 times the IQR are selected (about 3.1 sd above the mean) • This corresponds to roughly a 10^{-3} type I error.
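A minimal sketch of the screening-and-thresholding step, assuming a user-supplied importance measure standing in for T(r) and a simple retention cutoff for the backward deletion (the function names, the cutoff tau, and the deletion details are illustrative, not taken from the paper):

```python
import numpy as np
from collections import Counter

def screen_markers(importance, p, r=10, B=5000, tau=0.0, seed=0):
    """Count how often each marker survives backward deletion in B random
    r-marker sub-problems, then keep markers above Q3 + 1.8 * IQR.

    `importance(subset)` is assumed to return one score per marker in
    `subset` (a stand-in for the statistic T(r) on the sub-problem);
    `tau` is an assumed cutoff below which a marker is "unimportant".
    """
    rng = np.random.default_rng(seed)
    counts = Counter()
    for _ in range(B):
        subset = list(rng.choice(p, size=r, replace=False))
        # Backward deletion: drop the least important marker until every
        # remaining marker clears the retention cutoff.
        while len(subset) > 1:
            scores = importance(subset)
            worst = int(np.argmin(scores))
            if scores[worst] >= tau:
                break
            subset.pop(worst)
        counts.update(subset)

    freqs = np.array([counts.get(j, 0) for j in range(p)])
    q1, q3 = np.percentile(freqs, [25, 75])
    threshold = q3 + 1.8 * (q3 - q1)   # about 3.1 sd above the mean
    return np.flatnonzero(freqs > threshold), freqs
```

The Q3 + 1.8*IQR rule matches the slide: for a normal distribution that is roughly 3.1 standard deviations above the mean, i.e. about a 10^{-3} one-sided tail.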
The difficult part of the problem is to formulate a likelihood function for the r selected markers. • The next problem is to derive a test statistic, together with its properties. But these are problem-specific…
It is the generality of the setup that is important. • Because it considers r markers at a time, the likelihood function is with respect to the r selected markers. If there is any interaction between 2 or 3 markers, this process has the potential to pick it up
This is not possible with all the one-gene-at-a-time processes.
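A minimal sketch of why working with r markers at a time allows interactions to be modeled, assuming a design matrix with pairwise products of the selected markers (purely illustrative; the paper's likelihood is built for parent–child transmission data, not a regression):

```python
import numpy as np
from itertools import combinations

def design_with_interactions(X, cols):
    """Main effects plus all pairwise interactions for the selected markers.

    With r markers there are only r + r*(r-1)/2 columns, so the model can
    be fit with n observations; with all p markers at once it could not.
    """
    main = X[:, cols]
    pairs = [X[:, i] * X[:, j] for i, j in combinations(cols, 2)]
    return np.column_stack([main] + pairs)

# e.g. r = 3 markers -> 3 main-effect columns + 3 interaction columns
```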
All known methods, data mining or not, for the analysis of microarray-type data are ad hoc and rather primitive. • The amount of theory is limited. • The tendency is that these methods will eventually become statistical in nature, because an assessment of risk is still a very important factor in scientific work
Subject-matter relevance is the key • Other keys: good data; other scientists; effective computation; don't wait