Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics • In spirit of classical math’l statistics • But limit as d → ∞, not usual n → ∞ • Saw variation goes into random rotation • Modulo rotation, have fixed structure • Convergence to vertices of unit simplex • Gave statistical insights (e.g. methods all come together for high d)

HDLSS Asymptotics • Interesting Idea from Travis Gaydos: Interpret from viewpoint of dual space • Recall from Aug. 25: • Distance to origin • Pairwise Distance • Angle from origin
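The three quantities recalled above can be checked numerically. The sketch below assumes independent standard Gaussian vectors in dimension d; this setting and the function name `hdlss_geometry` are illustrative choices, not taken from the slides. In that setting, distance to the origin divided by sqrt(d) should settle near 1, pairwise distance divided by sqrt(d) near sqrt(2), and pairwise angles near 90 degrees as d grows.

```python
import numpy as np

def hdlss_geometry(d, n=3, seed=0):
    """For n standard Gaussian vectors in dimension d, return
    (norms / sqrt(d), pairwise distances / sqrt(d), pairwise angles in degrees).
    The Gaussian model is an assumption made for this illustration."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    norms = np.linalg.norm(z, axis=1)
    i, j = np.triu_indices(n, k=1)            # all distinct pairs
    dists = np.linalg.norm(z[i] - z[j], axis=1)
    cosines = np.sum(z[i] * z[j], axis=1) / (norms[i] * norms[j])
    angles = np.degrees(np.arccos(cosines))
    return norms / np.sqrt(d), dists / np.sqrt(d), angles

for d in (10, 1000, 100000):
    r, p, a = hdlss_geometry(d)
    print(d, r.round(3), p.round(3), a.round(1))
```

The sample size n stays fixed while d grows, which is the HDLSS limit rather than the usual large-sample limit.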

HDLSS Asymptotics – Dual View • Study these for simple Examples: • Over range • Look in dual space: • Dimension (easy to visualize) • Entries of appear as points: • Relate HDLSS phenomena to these points

HDLSS Asymptotics – Dual View

HDLSS Asymptotics – Dual View • Notes: • Upper left: Dual view of data • Upper right: Dual view of squares • Lower: renormalized to see data addition: • Lower right: study distance to origin • Lower left: study pairwise distance

HDLSS Asymptotics – Dual View • Summary of insights: Distance to Origin • Study via • Expected to converge to 1 • Upper & Lower Right shows squares • Average shown in green • Can see convergence to 1 (stable green lines)

HDLSS Asymptotics – Dual View • Summary of insights: Pairwise Distance • Study via • Expected to converge to • Lower Left shows Sqrt(Average) • Can see convergence (stable red line)

HDLSS Asymptotics – Dual View • Summary of insights: Angle from origin • Study via • Expected to converge to 0 • Upper left text shows Convergence

HDLSS Asymptotics – Dual View • Would be interesting to try: • Study (i.e. explore conditions for): • Consistency • Strong Inconsistency • for PCA direction vectors, from this viewpoint • Perhaps other things as well…

NCI 60 Data Recall from: • Aug. 28 • Aug. 30 NCI 60 Cancer Cell Lines Microarray Data • Explored Data Combination • cDNA & Affymetrix Measurements • Right answer is known

Interesting Benchmark Data Set • NCI 60 Cell Lines • Interesting benchmark, since same cells • Data Web available: • http://discover.nci.nih.gov/datasetsNature2000.jsp • Both cDNA and Affymetrix Platforms • Different from Breast Cancer Data • Which had no common samples

NCI 60: Raw Data, Platform Colored

NCI 60: Raw Data

NCI 60: Raw Data, Before DWD Adjustment

NCI 60: Before & After DWD adjustment

NCI 60 Leave out many slides studied on 8/28/07

NCI 60: Fully Adjusted Data, Platform Colored

NCI 60: Fully Adjusted Data, Melanoma Cluster • BREAST.MDAMB435 • BREAST.MDN • MELAN.MALME3M • MELAN.SKMEL2 • MELAN.SKMEL5 • MELAN.SKMEL28 • MELAN.M14 • MELAN.UACC62 • MELAN.UACC257

NCI 60: Fully Adjusted Data, Leukemia Cluster • LEUK.CCRFCEM • LEUK.K562 • LEUK.MOLT4 • LEUK.HL60 • LEUK.RPMI8266 • LEUK.SR

Another DWD Appl’n: Visualization • Recall PCA limitations • DWD uses class info • Hence can “better separate known classes” • Do this for pairs of classes (DWD just on those, ignore others) • Carefully choose pairs in NCI 60 data • Shows Effectiveness of Adjustment

NCI 60: Views using DWD Dir’ns (focus on biology)

DWD Visualization of NCI 60 Data • Most cancer types clearly distinct (Renal, CNS, Ovar, Leuk, Colon, Melan) • Using these carefully chosen directions • Others less clear cut • NSCLC (at least 3 subtypes) • Breast (4 published subtypes) • DWD adjustment was very effective (very few black connectors visible)

DWD Views of NCI 60 Data • Interesting Question: • Which clusters are really there? • Issues: • DWD great at finding dir’ns of separation • And will do so even if no real structure • Is this happening here? • Or: which clusters are important? • What does “important” mean?

Real Clusters in NCI 60 Data • Simple Visual Approach: • Randomly relabel data (Cancer Types) • Recompute DWD dir’ns & visualization • Get heuristic impression from this • Deeper Approach • Formal Hypothesis Testing • (Done later)

Random Relabelling #1

Revisit Real Data

Revisit Real Data (Cont.) Heuristic Results: • Strong Clust’s: Melanoma, Leukemia, Renal • Weak Clust’s: CNS, Ovarian, Colon • Not Clust’s: NSCLC, Breast • Later: will find way to quantify these ideas, i.e. develop statistical significance

Needed final verification of Cross-platform Normal’n • Is statistical power actually improved? • Is there benefit to data combo by DWD? • More data ⇒ more power? • Will study now

Real Clusters in NCI 60 Data? From Aug. 30: Simple Visual Approach: • Randomly relabel data (Cancer Types) • Recompute DWD dir’ns & visualization • Get heuristic impression from this • Some types appeared signif’ly different • Others did not Deeper Approach: Formal Hypothesis Testing

HDLSS Hypothesis Testing Approach: DiProPerm Test • DIrection – PROjection – PERMutation • Ideas: • Find an appropriate Direction vector • Project data into that 1-d subspace • Construct a 1-d test statistic • Analyze significance by Permutation

HDLSS Hypothesis Testing – DiProPerm test DiProPerm Test Context: • Given 2 sub-populations, X & Y • Are they from the same distribution? • Or significantly different? H0: LX = LY vs. H1: LX≠LY

HDLSS Hypothesis Testing – DiProPerm test Reasonable Direction vectors: • Mean Difference • SVM • Maximal Data Piling • DWD (used in the following) • Any good discrimination direction…

HDLSS Hypothesis Testing – DiProPerm test Reasonable Projected 1-d statistics: • Two sample t-test (used here) • Chi-square test for different variances • Kolmogorov-Smirnov • Any good distributional test…

HDLSS Hypothesis Testing – DiProPerm test DiProPerm Test Steps: • For original data: • Find Direction vector • Project Data, Compute True Test Statistic • For (many) random relabellings of data: • Find Direction vector • Project Data, Compute Perm’d Test Stat • Compare: • True Stat among population of Perm’d Stat’s • Quantile gives p-value
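The steps above can be sketched in a few lines. This is a minimal illustration, not the implementation used for the slides: it swaps in the mean difference direction for DWD (both appear in the list of reasonable directions), uses a Welch-style two-sample t statistic, and computes the p-value with an add-one convention. The function names `mean_diff_direction`, `t_statistic` and `diproperm` are hypothetical.

```python
import numpy as np

def mean_diff_direction(x, y):
    """Unit vector from the mean of y to the mean of x (stand-in for DWD)."""
    w = x.mean(axis=0) - y.mean(axis=0)
    return w / np.linalg.norm(w)

def t_statistic(a, b):
    """Two-sample t statistic (unequal-variance form) for 1-d samples."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return (a.mean() - b.mean()) / np.sqrt(va / len(a) + vb / len(b))

def diproperm(x, y, n_perm=1000, seed=0):
    """Direction, projection, permutation test for rows of x versus rows of y."""
    rng = np.random.default_rng(seed)

    def stat(a, b):
        w = mean_diff_direction(a, b)      # Direction
        return t_statistic(a @ w, b @ w)   # Projection + 1-d statistic

    true_stat = stat(x, y)
    pooled = np.vstack([x, y])
    n_x = len(x)
    perm_stats = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(len(pooled))  # random relabelling
        # The direction is recomputed for every relabelling.
        perm_stats[b] = stat(pooled[idx[:n_x]], pooled[idx[n_x:]])
    # The statistic is positive by construction, so an upper-tail
    # quantile is used; the "+1" terms are a choice made in this sketch.
    p_value = (1 + np.sum(perm_stats >= true_stat)) / (n_perm + 1)
    return true_stat, perm_stats, p_value

# Usage on simulated data with a clear mean shift (illustrative only).
rng = np.random.default_rng(1)
x = rng.standard_normal((20, 100)) + 3.0
y = rng.standard_normal((20, 100))
true_stat, perm_stats, p_value = diproperm(x, y, n_perm=200)
print(true_stat, perm_stats.max(), p_value)
```

Recomputing the direction inside the loop is the key design point: each permuted statistic gets the same chance to find a separating direction as the original labelling did.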

HDLSS Hypothesis Testing – DiProPerm test Remarks: • Generally can’t use standard null dist’ns… • e.g. Student’s t-table, for t-statistic • Because Direction and Projection give nonstandard context • I.e. violate traditional assumptions • E.g. DWD finds separating directions • Giving completely invalid test • This motivates Permutation approach
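A small simulation illustrates why the t-table fails here. The sketch assumes two samples drawn from the same standard Gaussian distribution, so the null hypothesis holds, and again uses the mean difference direction as a stand-in for DWD. The cutoff 2.02 is the approximate two-sided 5% critical value of Student's t with 38 degrees of freedom; the function name `projected_t_under_null` is hypothetical.

```python
import numpy as np

def projected_t_under_null(n=20, d=1000, seed=0):
    """Pooled two-sample t statistic after projecting two samples from the
    SAME distribution onto their own mean difference direction."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    y = rng.standard_normal((n, d))
    w = x.mean(axis=0) - y.mean(axis=0)
    w /= np.linalg.norm(w)
    px, py = x @ w, y @ w
    pooled_var = (px.var(ddof=1) + py.var(ddof=1)) / 2
    return (px.mean() - py.mean()) / np.sqrt(pooled_var * 2 / n)

# Repeat over independent null data sets and count naive rejections.
t_values = np.array([projected_t_under_null(seed=s) for s in range(200)])
rejection_rate = np.mean(np.abs(t_values) > 2.02)
print(t_values.min(), rejection_rate)
```

Because the direction is chosen from the same data that is then tested, the projected samples look separated even when no real difference exists, so comparing against the t-table rejects far more often than the nominal 5%.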

DiProPerm Simple Example 1, Totally Separate • Clearly Distinct Populations in This Example • Ignore this “Extreme Labelling” for now • Will become important later…

DiProPerm Simple Example 1, Totally Separate . . . Repeat this 1,000 times To get:
