Object Orie’d Data Analysis, Last Time

Explore the HDLSS asymptotics concept from a dual view perspective, studying distance to origin, pairwise distance, and angle from origin. Also, investigate the NCI 60 Cancer Cell Lines Microarray Data and visualize the effect of DWD adjustment on class separation.

nkang

Presentation Transcript


  1. Object Orie’d Data Analysis, Last Time: HDLSS Asymptotics • In spirit of classical mathematical statistics • But limit as d → ∞, not usual n → ∞ • Saw variation goes into random rotation • Modulo rotation, have fixed structure • Convergence to vertices of unit simplex • Gave statistical insights (e.g. methods all come together for high d)

  2. HDLSS Asymptotics • Interesting Idea from Travis Gaydos: Interpret from viewpoint of dual space • Recall from Aug. 25 the results for: • Distance to origin • Pairwise distance • Angle from origin
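
For reference, these three quantities are usually stated as follows for standard Gaussian data. This is a sketch under the assumption that $Z, Z_1, Z_2 \sim N_d(0, I_d)$ are independent, which is the common setting for these results and not something this transcript spells out:

```latex
\begin{align*}
\|Z\| &= \sqrt{d} + O_p(1) \\
\|Z_1 - Z_2\| &= \sqrt{2d} + O_p(1) \\
\operatorname{Angle}(Z_1, Z_2) &= 90^\circ + O_p\!\left(d^{-1/2}\right)
\end{align*}
```

After dividing the first two lines by $\sqrt{d}$, the limits are $1$ and $\sqrt{2}$; the cosine of the angle tends to $0$.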

  3. HDLSS Asymptotics – Dual View • Study these for simple Examples: • Over range • Look in dual space: • Dimension (easy to visualize) • Entries of appear as points: • Relate HDLSS phenomena to these points

  4. HDLSS Asymptotics – Dual View

  5. HDLSS Asymptotics – Dual View

  6. HDLSS Asymptotics – Dual View

  7. HDLSS Asymptotics – Dual View

  8. HDLSS Asymptotics – Dual View

  9. HDLSS Asymptotics – Dual View

  10. HDLSS Asymptotics – Dual View

  11. HDLSS Asymptotics – Dual View Notes: • Upper left: Dual view of data • Upper right: Dual view of squares • Lower: renormalized to see data addition: • Lower right: study distance to origin • Lower left: study pairwise distance

  12. HDLSS Asymptotics – Dual View Summary of insights: Distance to Origin • Study via • Expected to converge to 1 • Upper & Lower Right shows squares • Average shown in green: • Can see convergence to 1 (stable green lines)

  13. HDLSS Asymptotics – Dual View Summary of insights: Pairwise Distance • Study via • Expected to converge to • Lower Left shows Sqrt(Average) • Can see convergence (stable red line)

  14. HDLSS Asymptotics – Dual View Summary of insights: Angle from origin • Study via • Expected to converge to 0 • Upper left text shows Convergence
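
The three summaries above can be checked with a small simulation. This is a minimal sketch, not the code behind the slides: it assumes independent standard Gaussian vectors, and it assumes the studied quantities are the mean squared entry, the root mean squared difference, and the cosine of the angle. The function name `hdlss_summaries` is hypothetical.

```python
import numpy as np

def hdlss_summaries(d, rng):
    """Return three normalized summaries for one pair of N_d(0, I) vectors."""
    z1 = rng.standard_normal(d)
    z2 = rng.standard_normal(d)
    # Squared distance to origin divided by d: expected to approach 1
    mean_square = np.mean(z1 ** 2)
    # Pairwise distance divided by sqrt(d): expected to approach sqrt(2)
    pair_dist = np.sqrt(np.mean((z1 - z2) ** 2))
    # Cosine of the angle between the vectors: expected to approach 0
    cos_angle = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))
    return mean_square, pair_dist, cos_angle

rng = np.random.default_rng(0)
for d in (10, 100, 1_000, 10_000, 100_000):
    a, b, c = hdlss_summaries(d, rng)
    print(f"d={d:>6}  mean square={a:.3f}  pairwise={b:.3f}  cos(angle)={c:+.3f}")
```

As d grows, the printed values should settle near 1, about 1.414, and 0, matching the stable lines described on the slides.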

  15. HDLSS Asymptotics – Dual View Would be interesting to try: • Study (i.e. explore conditions for): • Consistency • Strong Inconsistency • for PCA direction vectors, from this viewpoint • Perhaps other things as well…

  16. NCI 60 Data Recall from: • Aug. 28 • Aug. 30 NCI 60 Cancer Cell Lines Microarray Data • Explored Data Combination • cDNA & Affymetrix Measurements • Right answer is known

  17. Interesting Benchmark Data Set • NCI 60 Cell Lines • Interesting benchmark, since same cells • Data Web available: • http://discover.nci.nih.gov/datasetsNature2000.jsp • Both cDNA and Affymetrix Platforms • Different from Breast Cancer Data • Which had no common samples

  18. NCI 60: Raw Data, Platform Colored

  19. NCI 60: Raw Data

  20. NCI 60: Raw Data, Before DWD Adjustment

  21. NCI 60: Before & After DWD Adjustment

  22. NCI 60: Leave out many slides studied on 8/28/07

  23. NCI 60: Fully Adjusted Data, Platform Colored

  24. NCI 60: Fully Adjusted Data, Melanoma Cluster: BREAST.MDAMB435, BREAST.MDN, MELAN.MALME3M, MELAN.SKMEL2, MELAN.SKMEL5, MELAN.SKMEL28, MELAN.M14, MELAN.UACC62, MELAN.UACC257

  25. NCI 60: Fully Adjusted Data, Leukemia Cluster: LEUK.CCRFCEM, LEUK.K562, LEUK.MOLT4, LEUK.HL60, LEUK.RPMI8266, LEUK.SR

  26. Another DWD Appl’n: Visualization • Recall PCA limitations • DWD uses class info • Hence can “better separate known classes” • Do this for pairs of classes (DWD just on those, ignore others) • Carefully choose pairs in NCI 60 data • Shows Effectiveness of Adjustment

  27. NCI 60: Views using DWD Dir’ns (focus on biology)

  28. DWD Visualization of NCI 60 Data • Most cancer types clearly distinct (Renal, CNS, Ovar, Leuk, Colon, Melan) • Using these carefully chosen directions • Others less clear cut • NSCLC (at least 3 subtypes) • Breast (4 published subtypes) • DWD adjustment was very effective (very few black connectors visible)

  29. DWD Views of NCI 60 Data • Interesting Question: • Which clusters are really there? • Issues: • DWD great at finding dir’ns of separation • And will do so even if no real structure • Is this happening here? • Or: which clusters are important? • What does “important” mean?

  30. Real Clusters in NCI 60 Data • Simple Visual Approach: • Randomly relabel data (Cancer Types) • Recompute DWD dir’ns & visualization • Get heuristic impression from this • Deeper Approach • Formal Hypothesis Testing • (Done later)

  31. Random Relabelling #1

  32. Random Relabelling #2

  33. Random Relabelling #3

  34. Random Relabelling #4

  35. Revisit Real Data

  36. Revisit Real Data (Cont.) Heuristic Results: • Strong Clust’s: Melanoma, Leukemia, Renal • Weak Clust’s: CNS, Ovarian, Colon • Not Clust’s: NSCLC, Breast • Later: will find way to quantify these ideas, i.e. develop statistical significance

  37. Needed final verification of Cross-platform Normal’n • Is statistical power actually improved? • Is there benefit to data combo by DWD? • More data ⇒ more power? • Will study now

  38. Real Clusters in NCI 60 Data? From Aug. 30: Simple Visual Approach: • Randomly relabel data (Cancer Types) • Recompute DWD dir’ns & visualization • Get heuristic impression from this • Some types appeared signif’ly different • Others did not Deeper Approach: Formal Hypothesis Testing

  39. HDLSS Hypothesis Testing Approach: DiProPerm Test (DIrection – PROjection – PERMutation) Ideas: • Find an appropriate Direction vector • Project data into that 1-d subspace • Construct a 1-d test statistic • Analyze significance by Permutation

  40. HDLSS Hypothesis Testing – DiProPerm test DiProPerm Test Context: • Given 2 sub-populations, X & Y • Are they from the same distribution? • Or significantly different? H_0: L_X = L_Y vs. H_1: L_X ≠ L_Y

  41. HDLSS Hypothesis Testing – DiProPerm test Reasonable Direction vectors: • Mean Difference • SVM • Maximal Data Piling • DWD (used in the following) • Any good discrimination direction…

  42. HDLSS Hypothesis Testing – DiProPerm test Reasonable Projected 1-d statistics: • Two sample t-test (used here) • Chi-square test for different variances • Kolmogorov-Smirnov • Any good distributional test…

  43. HDLSS Hypothesis Testing – DiProPerm test DiProPerm Test Steps: • For original data: • Find Direction vector • Project Data, Compute True Test Statistic • For (many) random relabellings of data: • Find Direction vector • Project Data, Compute Perm’d Test Stat • Compare: • True Stat among population of Perm’d Stat’s • Quantile gives p-value
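
The steps above can be sketched in code as follows. This is a minimal illustration, not the implementation used for the slides: it assumes the mean-difference direction (one of the directions listed as reasonable, swapped in here for DWD because it needs no optimizer) and a pooled-variance two-sample t-statistic. The function names `mean_diff_direction`, `t_stat`, and `diproperm`, the simulated data, and the `(1 + count) / (n_perm + 1)` form of the empirical p-value are all assumptions made for the example.

```python
import numpy as np

def mean_diff_direction(x, y):
    """Unit vector from the mean of y to the mean of x (rows are observations)."""
    v = x.mean(axis=0) - y.mean(axis=0)
    return v / np.linalg.norm(v)

def t_stat(a, b):
    """Two-sample t-statistic with pooled variance for 1-d samples a and b."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled * (1 / na + 1 / nb))

def diproperm(x, y, n_perm=1000, seed=0):
    """Return (true statistic, permuted statistics, empirical p-value)."""
    rng = np.random.default_rng(seed)
    data = np.vstack([x, y])
    n_x = len(x)

    def stat(xs, ys):
        w = mean_diff_direction(xs, ys)   # Direction
        return t_stat(xs @ w, ys @ w)     # Projection, then 1-d statistic

    true_stat = stat(x, y)
    perm_stats = np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.permutation(len(data))  # random relabelling of all observations
        # The direction is recomputed for every relabelling
        perm_stats[i] = stat(data[idx[:n_x]], data[idx[n_x:]])
    # Share of permuted statistics at least as large as the true one
    p_value = (1 + np.sum(perm_stats >= true_stat)) / (n_perm + 1)
    return true_stat, perm_stats, p_value

# Hypothetical example: two groups of 20 points in 500 dimensions, one shifted
rng = np.random.default_rng(1)
x = rng.standard_normal((20, 500))
y = rng.standard_normal((20, 500)) + 0.5
true_stat, perm_stats, p_value = diproperm(x, y, n_perm=200)
print(f"true t = {true_stat:.1f}, largest permuted t = {perm_stats.max():.1f}, p = {p_value:.4f}")
```

The key design point is that the direction is refit inside the permutation loop, so the permuted statistics reflect the same direction-finding step as the true one.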

  44. HDLSS Hypothesis Testing – DiProPerm test Remarks: • Generally can’t use standard null dist’ns… • e.g. Student’s t-table, for t-statistic • Because Direction and Projection give nonstandard context • I.e. violate traditional assumptions • E.g. DWD finds separating directions • Giving completely invalid test • This motivates Permutation approach
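
A small simulation illustrates why the t-table fails here. This sketch uses the mean-difference direction in place of DWD (an assumption made to keep the example dependency-free) and hypothetical sizes `n` and `d`. Both groups come from the same distribution, so the null hypothesis is true, yet the projected t-statistic is very large because the direction was chosen from the data to separate the groups.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 1000                      # hypothetical sizes: few samples, many dimensions
x = rng.standard_normal((n, d))      # both groups drawn from the SAME distribution
y = rng.standard_normal((n, d))

w = x.mean(axis=0) - y.mean(axis=0)  # mean-difference direction, fitted to the data
w /= np.linalg.norm(w)
px, py = x @ w, y @ w                # 1-d projections onto that direction

# Pooled-variance two-sample t-statistic on the projections
pooled = ((n - 1) * px.var(ddof=1) + (n - 1) * py.var(ddof=1)) / (2 * n - 2)
t = (px.mean() - py.mean()) / np.sqrt(pooled * 2 / n)
print(f"t = {t:.1f}")  # compare with the nominal 5% two-sided cutoff of roughly 2
```

Comparing this value with Student's t-table would reject a true null hypothesis, which is the invalid test the remarks warn about.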

  45. DiProPerm Simple Example 1, Totally Separate • Clearly Distinct Populations in This Example • Ignore this “Extreme Labelling” for now • Will become important later…

  46. DiProPerm Simple Example 1, Totally Separate

  47. DiProPerm Simple Example 1, Totally Separate

  48. DiProPerm Simple Example 1, Totally Separate

  49. DiProPerm Simple Example 1, Totally Separate

  50. DiProPerm Simple Example 1, Totally Separate . . . Repeat this 1,000 times To get:
