Explore the HDLSS asymptotics concept from a dual view perspective, studying distance to origin, pairwise distance, and angle from origin. Also, investigate the NCI 60 Cancer Cell Lines Microarray Data and visualize the effect of DWD adjustment on class separation.
Object Oriented Data Analysis, Last Time
HDLSS Asymptotics
• In the spirit of classical mathematical statistics
• But limit as $d \to \infty$, not the usual $n \to \infty$
• Saw variation goes into random rotation
• Modulo rotation, have fixed structure
• Convergence to vertices of unit simplex
• Gave statistical insights (e.g. methods all come together for high d)
HDLSS Asymptotics
Interesting idea from Travis Gaydos: interpret from viewpoint of dual space
Recall from Aug. 25: for
• Distance to origin:
• Pairwise distance:
• Angle from origin:
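The formulas recalled on this slide are not preserved in the transcript. As a reference point only, the standard HDLSS geometric representation for independent standard Gaussian vectors reads as follows. This is an assumption about the setting, not a copy of the slide, and the symbols $Z$, $Z_1$, $Z_2$ are introduced here for illustration.

```latex
% Assumed setting: Z, Z_1, Z_2 i.i.d. N_d(0, I_d), with d -> infinity and n fixed.
\begin{align*}
\|Z\| &= \sqrt{d} + O_p(1), \\
\|Z_1 - Z_2\| &= \sqrt{2d} + O_p(1), \\
\mathrm{Angle}(Z_1, Z_2) &= 90^\circ + O_p\!\left(d^{-1/2}\right).
\end{align*}
```

After dividing lengths by $\sqrt{d}$, the data points sit approximately at the vertices of a regular simplex, which matches the "convergence to vertices of unit simplex" remark from last time.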
HDLSS Asymptotics – Dual View
Study these for simple examples:
Over range
Look in dual space:
• Dimension (easy to visualize)
• Entries of appear as points:
• Relate HDLSS phenomena to these points
HDLSS Asymptotics – Dual View
Notes:
• Upper left: dual view of data
• Upper right: dual view of squares
• Lower: renormalized to see data addition:
  • Lower right: study distance to origin
  • Lower left: study pairwise distance
HDLSS Asymptotics – Dual View
Summary of insights: Distance to origin
• Study via
• Expected to converge to 1
• Upper & lower right show squares
• Average shown in green:
• Can see convergence to 1 (stable green lines)
HDLSS Asymptotics – Dual View
Summary of insights: Pairwise distance
• Study via
• Expected to converge to
• Lower left shows Sqrt(Average)
• Can see convergence (stable red line)
HDLSS Asymptotics – Dual View
Summary of insights: Angle from origin
• Study via
• Expected to converge to 0
• Upper left text shows convergence
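The three summaries above can be checked with a small simulation. This is a minimal sketch, assuming standard Gaussian data and assuming the normalized quantities squared norm over d, squared pairwise distance over d, and cosine of the angle. The slides' own "study via" statistics are not preserved, so these choices and the function name `hdlss_summaries` are illustrative.

```python
import math
import random


def hdlss_summaries(d, rng):
    """Draw Z1, Z2 ~ N_d(0, I) and return three normalized summaries.

    Returns (|Z1|^2 / d, |Z1 - Z2|^2 / d, cos(angle between Z1 and Z2)).
    Under the assumed Gaussian model these should approach 1, 2 and 0
    as d grows.
    """
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    sq_norm1 = sum(a * a for a in z1)
    sq_norm2 = sum(b * b for b in z2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(z1, z2))
    inner = sum(a * b for a, b in zip(z1, z2))
    cos_angle = inner / math.sqrt(sq_norm1 * sq_norm2)
    return sq_norm1 / d, sq_dist / d, cos_angle


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for d in (10, 100, 1000, 10000):
        r, p, c = hdlss_summaries(d, rng)
        print(f"d={d:6d}  |Z|^2/d={r:.3f}  |Z1-Z2|^2/d={p:.3f}  cos={c:+.3f}")
```

Taking the square root of the second summary gives a quantity comparable to the "Sqrt(Average)" display, which in this Gaussian setup should settle near $\sqrt{2}$.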
HDLSS Asymptotics – Dual View
Would be interesting to try: study (i.e. explore conditions for)
• Consistency
• Strong inconsistency
for PCA direction vectors, from this viewpoint
Perhaps other things as well…
NCI 60 Data
Recall from:
• Aug. 28
• Aug. 30
NCI 60 Cancer Cell Lines Microarray Data
• Explored data combination
• cDNA & Affymetrix measurements
• Right answer is known
Interesting Benchmark Data Set
• NCI 60 Cell Lines
• Interesting benchmark, since same cells
• Data available on the web:
  http://discover.nci.nih.gov/datasetsNature2000.jsp
• Both cDNA and Affymetrix platforms
• Different from Breast Cancer Data
  • Which had no common samples
NCI 60
Leaving out many slides studied on 8/28/07
NCI 60: Fully Adjusted Data, Melanoma Cluster
• BREAST.MDAMB435
• BREAST.MDN
• MELAN.MALME3M
• MELAN.SKMEL2
• MELAN.SKMEL5
• MELAN.SKMEL28
• MELAN.M14
• MELAN.UACC62
• MELAN.UACC257
NCI 60: Fully Adjusted Data, Leukemia Cluster
• LEUK.CCRFCEM
• LEUK.K562
• LEUK.MOLT4
• LEUK.HL60
• LEUK.RPMI8266
• LEUK.SR
Another DWD Appl’n: Visualization
• Recall PCA limitations
• DWD uses class info
• Hence can “better separate known classes”
• Do this for pairs of classes (DWD just on those, ignore others)
• Carefully choose pairs in NCI 60 data
• Shows effectiveness of adjustment
DWD Visualization of NCI 60 Data
• Most cancer types clearly distinct (Renal, CNS, Ovar, Leuk, Colon, Melan)
  • Using these carefully chosen directions
• Others less clear cut
  • NSCLC (at least 3 subtypes)
  • Breast (4 published subtypes)
• DWD adjustment was very effective (very few black connectors visible)
DWD Views of NCI 60 Data
• Interesting question:
  • Which clusters are really there?
• Issues:
  • DWD great at finding dir’ns of separation
  • And will do so even if no real structure
  • Is this happening here?
• Or: which clusters are important?
  • What does “important” mean?
Real Clusters in NCI 60 Data
• Simple visual approach:
  • Randomly relabel data (cancer types)
  • Recompute DWD dir’ns & visualization
  • Get heuristic impression from this
• Deeper approach:
  • Formal hypothesis testing
  • (Done later)
Revisit Real Data (Cont.)
Heuristic Results:
Strong Clust’s   Weak Clust’s   Not Clust’s
Melanoma         CNS            NSCLC
Leukemia         Ovarian        Breast
Renal            Colon
Later: will find way to quantify these ideas, i.e. develop statistical significance
Needed final verification of Cross-platform Normal’n
• Is statistical power actually improved?
• Is there benefit to data combo by DWD?
• More data, more power?
Will study now
Real Clusters in NCI 60 Data?
From Aug. 30: Simple visual approach:
• Randomly relabel data (cancer types)
• Recompute DWD dir’ns & visualization
• Get heuristic impression from this
• Some types appeared signif’ly different
• Others did not
Deeper approach: Formal hypothesis testing
HDLSS Hypothesis Testing
Approach: DiProPerm Test (DIrection – PROjection – PERMutation)
Ideas:
• Find an appropriate direction vector
• Project data into that 1-d subspace
• Construct a 1-d test statistic
• Analyze significance by permutation
HDLSS Hypothesis Testing – DiProPerm test
DiProPerm Test Context:
• Given 2 sub-populations, X & Y
• Are they from the same distribution?
• Or significantly different?
$H_0: \mathcal{L}_X = \mathcal{L}_Y$ vs. $H_1: \mathcal{L}_X \neq \mathcal{L}_Y$
HDLSS Hypothesis Testing – DiProPerm test
Reasonable direction vectors:
• Mean Difference
• SVM
• Maximal Data Piling
• DWD (used in the following)
• Any good discrimination direction…
HDLSS Hypothesis Testing – DiProPerm test
Reasonable projected 1-d statistics:
• Two sample t-test (used here)
• Chi-square test for different variances
• Kolmogorov-Smirnov
• Any good distributional test…
HDLSS Hypothesis Testing – DiProPerm test
DiProPerm Test Steps:
• For original data:
  • Find direction vector
  • Project data, compute true test statistic
• For (many) random relabellings of data:
  • Find direction vector
  • Project data, compute perm’d test stat
• Compare:
  • True stat among population of perm’d stat’s
  • Quantile gives p-value
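The steps above can be sketched in a few lines. This is a minimal illustration, not the implementation behind the slides. It swaps in the Mean Difference direction (one of the listed alternatives) for DWD, uses a Welch-form two-sample t statistic, and adds a +1 correction to the empirical p-value. All function names, including `diproperm` and `make_gaussian`, are hypothetical.

```python
import math
import random
import statistics


def md_direction(x, y):
    """Unit vector along mean(x) - mean(y); a simple stand-in for DWD."""
    diff = [statistics.fmean(cx) - statistics.fmean(cy)
            for cx, cy in zip(zip(*x), zip(*y))]
    norm = math.sqrt(sum(v * v for v in diff))
    return [v / norm for v in diff]


def project(rows, w):
    """1-d projection scores of each data row onto direction w."""
    return [sum(a * b for a, b in zip(row, w)) for row in rows]


def t_stat(u, v):
    """Two-sample t statistic (unequal-variance form)."""
    se = math.sqrt(statistics.variance(u) / len(u)
                   + statistics.variance(v) / len(v))
    return (statistics.fmean(u) - statistics.fmean(v)) / se


def diproperm(x, y, n_perm=200, seed=0):
    """Return (true statistic, permuted statistics, empirical p-value)."""
    rng = random.Random(seed)

    def stat(a, b):
        # The direction is re-fit for every labelling, as in the steps above.
        w = md_direction(a, b)
        return t_stat(project(a, w), project(b, w))

    true_stat = stat(x, y)
    pooled = list(x) + list(y)
    perm_stats = []
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)  # random relabelling of the pooled data
        perm_stats.append(stat(shuffled[:len(x)], shuffled[len(x):]))
    # Upper-tail proportion; the +1 terms are an assumption of this sketch.
    exceed = sum(s >= true_stat for s in perm_stats)
    return true_stat, perm_stats, (1 + exceed) / (n_perm + 1)


def make_gaussian(n, d, shift, rng):
    """Hypothetical toy data: n rows from N_d(shift * 1, I)."""
    return [[rng.gauss(shift, 1.0) for _ in range(d)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(42)
    x = make_gaussian(15, 50, 0.0, rng)
    y = make_gaussian(15, 50, 1.0, rng)
    true_stat, perm_stats, p = diproperm(x, y)
    print(f"true stat = {true_stat:.2f}, p-value = {p:.4f}")
```

Because the mean-difference direction always points from one class mean to the other, the projected statistic is non-negative, so a one-sided upper-tail comparison is the natural choice here.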
HDLSS Hypothesis Testing – DiProPerm test
Remarks:
• Generally can’t use standard null dist’ns…
  • e.g. Student’s t-table, for t-statistic
• Because direction and projection give nonstandard context
  • I.e. violate traditional assumptions
  • E.g. DWD finds separating directions even when there is no real difference
  • So reading the statistic against standard tables gives a completely invalid test
• This motivates permutation approach
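A quick way to see why the t-table fails is to draw both classes from the same distribution and look at the projected statistic. This is a rough sketch, assuming standard Gaussian data and using the Mean Difference direction as a stand-in for DWD. The function name `null_projected_t` is hypothetical.

```python
import math
import random
import statistics


def null_projected_t(n, d, rng):
    """t statistic for two same-distribution samples after projecting
    onto their own mean-difference direction (stand-in for DWD)."""
    x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    # Direction is chosen from the same data that is then tested.
    diff = [statistics.fmean(cx) - statistics.fmean(cy)
            for cx, cy in zip(zip(*x), zip(*y))]
    norm = math.sqrt(sum(v * v for v in diff))
    w = [v / norm for v in diff]
    px = [sum(a * b for a, b in zip(row, w)) for row in x]
    py = [sum(a * b for a, b in zip(row, w)) for row in y]
    se = math.sqrt(statistics.variance(px) / n + statistics.variance(py) / n)
    return (statistics.fmean(px) - statistics.fmean(py)) / se


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 10, 100, 1000):
        print(f"d={d:5d}  projected t under H0 = {null_projected_t(15, d, rng):.1f}")
```

In this Gaussian setup the statistic grows roughly like $\sqrt{d}$ even though there is no real difference, so a nominal cutoff near 2 would reject almost every time. Comparing against permuted statistics, which are inflated in the same way, corrects for this.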
DiProPerm Simple Example 1, Totally Separate
• Clearly distinct populations in this example
• Ignore this “Extreme Labelling” for now
• Will become important later…
DiProPerm Simple Example 1, Totally Separate
. . .
Repeat this 1,000 times
To get: