Object Orie’d Data Analysis, Last Time
HDLSS Asymptotics
• In spirit of classical mathematical statistics
• But limit as d → ∞, not usual n → ∞
• Saw variation goes into random rotation
• Modulo rotation, have fixed structure
• Convergence to vertices of unit simplex
• Gave statistical insights (e.g. methods all come together for high d)

HDLSS Asymptotics
Interesting Idea from Travis Gaydos:
• Interpret from viewpoint of dual space
Recall from Aug. 25, three quantities:
• Distance to origin
• Pairwise distance
• Angle from origin
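The three quantities recalled here can be checked numerically. The following is a minimal sketch, assuming standard Gaussian data (an assumption, since the slide does not state the distribution). For such data the distance to the origin divided by sqrt(d) is close to 1, the pairwise distance divided by sqrt(d) is close to sqrt(2), and the cosine of the angle at the origin is close to 0, i.e. the angle is close to 90 degrees. The function name `hdlss_geometry` and its parameters are hypothetical.

```python
import numpy as np


def hdlss_geometry(d, n=3, seed=0):
    """Rescaled geometric summaries of n standard Gaussian vectors in R^d."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))  # rows are data vectors (assumed N(0, I))
    lengths = np.linalg.norm(z, axis=1)
    # Distance to origin divided by sqrt(d): near 1 for large d.
    scaled_norms = lengths / np.sqrt(d)
    # Distance between the first two vectors divided by sqrt(d): near sqrt(2).
    scaled_pair = float(np.linalg.norm(z[0] - z[1]) / np.sqrt(d))
    # Cosine of the angle at the origin between the first two vectors: near 0.
    cosine = float(z[0] @ z[1] / (lengths[0] * lengths[1]))
    return scaled_norms, scaled_pair, cosine


if __name__ == "__main__":
    # Watch the summaries stabilize as the dimension grows with n held fixed.
    for d in (10, 1_000, 100_000):
        norms, pair, cosine = hdlss_geometry(d)
        print(d, np.round(norms, 3), round(pair, 3), round(cosine, 4))
```

The sample size n stays fixed while d grows, which is the HDLSS limit rather than the usual large-n limit.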

HDLSS Asymptotics – Dual View
Study these for simple examples:
• Over a range of dimensions d
• Look in dual space:
  • Dimension n (easy to visualize)
  • Entries of the data vectors appear as points
• Relate HDLSS phenomena to these points

HDLSS Asymptotics – Dual View

HDLSS Asymptotics – Dual View
Notes:
• Upper left: Dual view of data
• Upper right: Dual view of squares
• Lower: renormalized to see data addition:
  • Lower right: study distance to origin
  • Lower left: study pairwise distance

HDLSS Asymptotics – Dual View
Summary of insights: Distance to Origin
• Study via average of squared entries
• Expected to converge to 1
• Upper & Lower Right show squares
• Average shown in green:
  • Can see convergence to 1 (stable green lines)

HDLSS Asymptotics – Dual View
Summary of insights: Pairwise Distance
• Study via Sqrt(Average) of squared differences
• Expected to converge to a fixed limit (√2 for standard Gaussian data)
• Lower Left shows Sqrt(Average)
• Can see convergence (stable red line)

HDLSS Asymptotics – Dual View
Summary of insights: Angle from origin
• Study via a summary expected to converge to 0
• Upper left text shows convergence
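The dual-view summaries for distance to origin, pairwise distance and angle can be imitated with running averages over the dimension. This is a minimal sketch, assuming n = 2 data vectors with independent standard Gaussian entries (an assumption, as the slides do not record the examples used). Each column of the array is one point in the two-dimensional dual space. The function name `dual_running_averages` and its parameters are hypothetical, and the average of products is used here as an assumed stand-in for the angle summary.

```python
import numpy as np


def dual_running_averages(d_max=20_000, seed=1):
    """Running averages over dimension for two standard Gaussian data vectors."""
    rng = np.random.default_rng(seed)
    # Column j holds the j-th entries of both vectors: one point in the dual space R^2.
    x = rng.standard_normal((2, d_max))
    dims = np.arange(1, d_max + 1)
    # Distance to origin: average of squared entries of the first vector (limit 1).
    avg_sq = np.cumsum(x[0] ** 2) / dims
    # Pairwise distance: square root of average squared difference (limit sqrt(2)).
    root_avg_sq_diff = np.sqrt(np.cumsum((x[0] - x[1]) ** 2) / dims)
    # Angle: average of entrywise products, an assumed stand-in (limit 0).
    avg_prod = np.cumsum(x[0] * x[1]) / dims
    return avg_sq, root_avg_sq_diff, avg_prod


if __name__ == "__main__":
    avg_sq, root_diff, avg_prod = dual_running_averages()
    for d in (10, 1_000, 20_000):
        print(d, round(avg_sq[d - 1], 3), round(root_diff[d - 1], 3), round(avg_prod[d - 1], 3))
```

Plotting each returned array against the dimension gives curves that flatten out, in the way the stable lines in the figures are described.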

HDLSS Asymptotics – Dual View
Would be interesting to try:
• Study (i.e. explore conditions for) Consistency and Strong Inconsistency of PCA direction vectors, from this viewpoint
• Perhaps other things as well…

NCI 60 Data
Recall from:
• Aug. 28
• Aug. 30
NCI 60 Cancer Cell Lines Microarray Data
• Explored Data Combination
• cDNA & Affymetrix Measurements
• Right answer is known

Interesting Benchmark Data Set
• NCI 60 Cell Lines
• Interesting benchmark, since same cells
• Data available on the Web:
  • http://discover.nci.nih.gov/datasetsNature2000.jsp
• Both cDNA and Affymetrix Platforms
• Different from Breast Cancer Data
  • Which had no common samples

NCI 60: Raw Data, Platform Colored

NCI 60: Raw Data

NCI 60: Raw Data, Before DWD Adjustment

NCI 60: Before & After DWD Adjustment

NCI 60
Leave out many slides studied on 8/28/07

NCI 60: Fully Adjusted Data, Platform Colored

NCI 60: Fully Adjusted Data, Melanoma Cluster
BREAST.MDAMB435 BREAST.MDN MELAN.MALME3M MELAN.SKMEL2 MELAN.SKMEL5 MELAN.SKMEL28 MELAN.M14 MELAN.UACC62 MELAN.UACC257

NCI 60: Fully Adjusted Data, Leukemia Cluster
LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266 LEUK.SR

Another DWD Appl’n: Visualization
• Recall PCA limitations
• DWD uses class info
• Hence can “better separate known classes”
• Do this for pairs of classes (DWD just on those, ignore others)
• Carefully choose pairs in NCI 60 data
• Shows Effectiveness of Adjustment

NCI 60: Views using DWD Dir’ns (focus on biology)

DWD Visualization of NCI 60 Data
• Most cancer types clearly distinct (Renal, CNS, Ovar, Leuk, Colon, Melan)
  • Using these carefully chosen directions
• Others less clear cut
  • NSCLC (at least 3 subtypes)
  • Breast (4 published subtypes)
• DWD adjustment was very effective (very few black connectors visible)

DWD Views of NCI 60 Data
• Interesting Question:
  • Which clusters are really there?
• Issues:
  • DWD great at finding dir’ns of separation
  • And will do so even if no real structure
  • Is this happening here?
• Or: which clusters are important?
  • What does “important” mean?

Real Clusters in NCI 60 Data
• Simple Visual Approach:
  • Randomly relabel data (Cancer Types)
  • Recompute DWD dir’ns & visualization
  • Get heuristic impression from this
• Deeper Approach:
  • Formal Hypothesis Testing
  • (Done later)

Random Relabelling #1

Revisit Real Data

Revisit Real Data (Cont.)
Heuristic Results:
Strong Clust’s   Weak Clust’s   Not Clust’s
Melanoma         CNS            NSCLC
Leukemia         Ovarian        Breast
Renal            Colon
Later: will find way to quantify these ideas, i.e. develop statistical significance

Needed final verification of Cross-platform Normal’n
• Is statistical power actually improved?
• Is there benefit to data combo by DWD?
• More data ⇒ more power?
• Will study now

Real Clusters in NCI 60 Data?
From Aug. 30: Simple Visual Approach:
• Randomly relabel data (Cancer Types)
• Recompute DWD dir’ns & visualization
• Get heuristic impression from this
• Some types appeared signif’ly different
• Others did not
Deeper Approach: Formal Hypothesis Testing

HDLSS Hypothesis Testing
Approach: DiProPerm Test
DIrection – PROjection – PERMutation
Ideas:
• Find an appropriate Direction vector
• Project data into that 1-d subspace
• Construct a 1-d test statistic
• Analyze significance by Permutation

HDLSS Hypothesis Testing – DiProPerm test
DiProPerm Test Context:
• Given 2 sub-populations, X & Y
• Are they from the same distribution?
• Or significantly different?
H0: L_X = L_Y vs. H1: L_X ≠ L_Y (L denotes the distribution of each sub-population)

HDLSS Hypothesis Testing – DiProPerm test
Reasonable Direction vectors:
• Mean Difference
• SVM
• Maximal Data Piling
• DWD (used in the following)
• Any good discrimination direction…

HDLSS Hypothesis Testing – DiProPerm test
Reasonable Projected 1-d statistics:
• Two sample t-test (used here)
• Chi-square test for different variances
• Kolmogorov-Smirnov
• Any good distributional test…

HDLSS Hypothesis Testing – DiProPerm test
DiProPerm Test Steps:
• For original data:
  • Find Direction vector
  • Project Data, Compute True Test Statistic
• For (many) random relabellings of data:
  • Find Direction vector
  • Project Data, Compute Perm’d Test Stat
• Compare:
  • True Stat among population of Perm’d Stat’s
  • Quantile gives p-value
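The steps above can be sketched in a few lines. This is a minimal illustration, not the implementation behind these slides. It swaps in the mean difference direction (one of the reasonable directions listed earlier) for DWD, uses a Welch-type two-sample t statistic, and adds 1 to numerator and denominator of the p-value so that it is never exactly 0. All of these are assumptions made for the sketch, and the names `mean_diff_direction`, `two_sample_t` and `diproperm` are hypothetical.

```python
import numpy as np


def mean_diff_direction(x, y):
    """Unit vector along the difference of class means (stand-in for DWD)."""
    w = x.mean(axis=0) - y.mean(axis=0)
    return w / np.linalg.norm(w)


def two_sample_t(px, py):
    """Welch-type two-sample t statistic for 1-d projected data."""
    se = np.sqrt(px.var(ddof=1) / len(px) + py.var(ddof=1) / len(py))
    return float((px.mean() - py.mean()) / se)


def diproperm(x, y, n_perm=1000, seed=0):
    """Direction, Projection, Permutation test; rows of x and y are data vectors."""
    rng = np.random.default_rng(seed)
    # Original data: find direction, project, compute the true statistic.
    w = mean_diff_direction(x, y)
    true_stat = two_sample_t(x @ w, y @ w)
    pooled = np.vstack([x, y])
    n_x = len(x)
    perm_stats = np.empty(n_perm)
    for b in range(n_perm):
        # Random relabelling: the direction is recomputed for every relabelling.
        idx = rng.permutation(len(pooled))
        xp, yp = pooled[idx[:n_x]], pooled[idx[n_x:]]
        wp = mean_diff_direction(xp, yp)
        perm_stats[b] = two_sample_t(xp @ wp, yp @ wp)
    # Position of the true statistic among the permuted ones gives the p-value.
    p_value = (1 + np.sum(perm_stats >= true_stat)) / (1 + n_perm)
    return true_stat, perm_stats, float(p_value)
```

Recomputing the direction inside the loop is the essential design choice: the permuted statistics then carry the same direction-finding optimism as the true statistic, so the comparison is fair.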

HDLSS Hypothesis Testing – DiProPerm test
Remarks:
• Generally can’t use standard null dist’ns…
  • e.g. Student’s t-table, for t-statistic
• Because Direction and Projection give nonstandard context
  • I.e. violate traditional assumptions
  • E.g. DWD finds separating directions
  • Giving completely invalid test
• This motivates Permutation approach
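A small simulation illustrates why the t-table is not usable here. This is a sketch under assumptions: both classes are drawn from the same standard Gaussian distribution, so H0 holds, and the mean difference direction stands in for DWD. The t statistic computed after projecting onto the data-driven direction is compared with the one from a direction chosen without looking at the labels. The names `two_sample_t` and `naive_t_under_null` are hypothetical.

```python
import numpy as np


def two_sample_t(px, py):
    """Welch-type two-sample t statistic for 1-d projected data."""
    se = np.sqrt(px.var(ddof=1) / len(px) + py.var(ddof=1) / len(py))
    return float((px.mean() - py.mean()) / se)


def naive_t_under_null(n=20, d=1000, seed=0):
    """t statistics under H0 for a data-driven and a label-free direction."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    y = rng.standard_normal((n, d))  # same distribution as x, so H0 holds
    # Data-driven direction (mean difference, standing in for DWD).
    w = x.mean(axis=0) - y.mean(axis=0)
    w /= np.linalg.norm(w)
    # Direction drawn at random, without using the class labels.
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return two_sample_t(x @ w, y @ w), two_sample_t(x @ u, y @ u)


if __name__ == "__main__":
    t_data_driven, t_random = naive_t_under_null()
    print("data-driven direction:", round(t_data_driven, 2))
    print("random direction:", round(t_random, 2))
```

With these sizes the data-driven value is typically far beyond the usual t-table cutoffs even though the two classes do not differ, which is the invalidity the remarks describe.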

DiProPerm Simple Example 1, Totally Separate
• Clearly Distinct Populations in This Example
• Ignore this “Extreme Labelling” for now
• Will become important later…

DiProPerm Simple Example 1, Totally Separate

DiProPerm Simple Example 1, Totally Separate

DiProPerm Simple Example 1, Totally Separate

DiProPerm Simple Example 1, Totally Separate
. . .
Repeat this 1,000 times
To get:
