Spanish Inquisition
Final Project Week 2 - 4/29/09
Breast Cancer Gene Expression Data
Leon Kay, Yan Tran, Chris Thomas
Weka Filtering
• Used CFS with BestFirst Search
• Reduced the number of attributes from 1544 to 125
• CFS stands for Correlation-based Feature Selection. Basic hypothesis: “A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.” [1]
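The hypothesis quoted above is captured in [1] by a subset merit score, k·r_cf / sqrt(k + k(k-1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. The following is a minimal sketch of that score, not Weka's implementation; the function name `cfs_merit` and its list-based inputs are assumptions made for illustration, and the correlation values are taken as already computed.

```python
import math

def cfs_merit(feature_class_corrs, feature_feature_corrs):
    """Merit of a feature subset under the CFS heuristic from [1].

    feature_class_corrs: one correlation per feature in the subset (feature vs. class).
    feature_feature_corrs: one correlation per pair of features in the subset.
    Both inputs are hypothetical, precomputed values in [0, 1].
    """
    k = len(feature_class_corrs)
    if k == 0:
        return 0.0
    r_cf = sum(feature_class_corrs) / k
    # With a single feature there are no pairs, so the redundancy term is zero.
    r_ff = (sum(feature_feature_corrs) / len(feature_feature_corrs)
            if feature_feature_corrs else 0.0)
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# Two equally predictive features: uncorrelated with each other vs. fully redundant.
independent = cfs_merit([0.8, 0.8], [0.0])
redundant = cfs_merit([0.8, 0.8], [1.0])
print(independent > redundant)  # True
```

A fully redundant second feature adds nothing: the merit falls back to that of a single feature, which is how the score rewards subsets whose features are predictive of the class but not of each other.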
CFS Algorithm - Searching
• Any search algorithm can be plugged into CFS. The author describes three: forward selection, backward elimination, and best first. All three are essentially greedy heuristic searches, which avoid evaluating every possible feature subset and so reduce the cost of generating the feature subset.
• “Best first can start with either no features or all features. In the former, the search progresses forward through the search space adding single features; in the latter the search moves backward through the search space deleting single features. To prevent the best first search from exploring the entire feature subset search space, a stopping criterion is imposed. The search will terminate if five consecutive fully expanded subsets show no improvement over the current best subset.” [1]
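The forward variant of the best first search described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not Weka's BestFirst code: the function name `best_first_forward`, the `evaluate` callback (standing in for a subset score such as the CFS merit), and the toy scoring function are all hypothetical.

```python
import heapq
from itertools import count

def best_first_forward(n_features, evaluate, max_stale=5):
    """Forward best first search over feature subsets.

    Starts from the empty subset, always expands the highest-scoring
    unexpanded subset, and stops after `max_stale` consecutive expansions
    that fail to improve on the best subset found so far.
    """
    start = frozenset()
    best_subset, best_score = start, evaluate(start)
    tie = count()  # tie-breaker so the heap never has to compare frozensets
    open_heap = [(-best_score, next(tie), start)]
    visited = {start}
    stale = 0
    while open_heap and stale < max_stale:
        _, _, subset = heapq.heappop(open_heap)
        improved = False
        for f in range(n_features):
            child = subset | {f}
            if f in subset or child in visited:
                continue
            visited.add(child)
            score = evaluate(child)
            heapq.heappush(open_heap, (-score, next(tie), child))
            if score > best_score:
                best_subset, best_score = child, score
                improved = True
        stale = 0 if improved else stale + 1
    return best_subset, best_score

def toy_score(subset):
    """Hypothetical score: features 0 and 1 help, every other feature hurts."""
    useful = {0, 1}
    return len(subset & useful) - 0.5 * len(subset - useful)

subset, score = best_first_forward(4, toy_score)
print(sorted(subset), score)  # [0, 1] 2.0
```

Unlike plain forward selection, the open list lets the search back up to an earlier subset when the current path stops improving; the stale counter plays the role of the five-subset stopping criterion quoted above.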
Accuracy (Error Rate) of algorithms before and after applying CFS/BestFirst filtering
ROC – Receiver Operating Characteristic
• ROC graphs “depict the tradeoff between hit rates and false alarm rates of classifiers” [2]
• “one point in ROC space is better than another if it is to the northwest (tp rate is higher, fp rate is lower, or both) of the first” [2]
• The Area Under the Curve (AUC) summarizes an ROC curve as a single number that can be used to compare classifiers.
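Both ideas above can be illustrated in a few lines. The sketch below assumes binary labels (1 = positive, 0 = negative) and numeric classifier scores; the function names `is_northwest` and `auc` are hypothetical. The AUC is computed from its rank interpretation noted in [2]: the probability that a randomly chosen positive is scored higher than a randomly chosen negative, with ties counted as one half.

```python
def is_northwest(a, b):
    """True if ROC point a = (fp_rate, tp_rate) is better than point b:
    tp rate is higher, fp rate is lower, or both."""
    (fpr_a, tpr_a), (fpr_b, tpr_b) = a, b
    return fpr_a <= fpr_b and tpr_a >= tpr_b and a != b

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

print(is_northwest((0.1, 0.9), (0.3, 0.7)))           # True
print(auc([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))        # 1.0
print(auc([0.9, 0.4, 0.7, 0.6], [1, 1, 0, 0]))        # 0.5
```

The pairwise loop is quadratic in the number of samples, which is fine for a dataset of this size; a sort-based computation would be the usual choice for larger data.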
MeV Analysis
• Initial Hierarchical Clustering
FLJ13710 and GATA3
• Lowly expressed in basal-like samples.
• Highly expressed in luminal samples.
GATA3
• GATA3 levels are a known indicator of breast cancer prognosis. (Basal-like has a worse prognosis than luminal.)
• Associated with estrogen receptor alpha, which is often highly expressed in the early stages of breast cancer.
FLJ13710
• Mentioned in a paper on finding prognostic signatures for breast cancer.
• Couldn’t find any in-depth studies on this gene.
References
[1] Mark Hall, “Correlation-based Feature Selection for Machine Learning”, http://www.cs.waikato.ac.nz/~mhall/thesis.pdf
[2] Tom Fawcett, “An introduction to ROC analysis”, doi:10.1016/j.patrec.2005.10.010 (enter into http://dx.doi.org/)
[3] Wilson, Brian J., Giguère, Vincent. “Meta-analysis of human cancer microarrays reveals GATA3 is integral to the estrogen receptor alpha pathway”, Molecular Cancer 2008, 7:49. http://www.molecular-cancer.com/content/7/1/49
[4] Hayashi, SI., et al. “The expression and function of estrogen receptor alpha and beta in human breast cancer and its clinical application”, http://erc.endocrinology-journals.org/cgi/content/abstract/10/2/193
[5] “Suppl. Table 2: List of probe sets significantly differentially expressed between luminal cell lines and basal cell lines. Probe sets are ordered according to decreasing DS (discriminating score).” www.nature.com/onc/journal/v25/n15/extref/1209254x4.xls
[6] Carrivick, L., et al. “Identification of Prognostic Signatures in Breast Cancer Microarray Data using Bayesian Techniques.” http://www.enm.bris.ac.uk/cig/pubs/2005/rs4.pdf