DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN HEALTH SCIENCES
B. K. Sinha, Ex-Faculty, ISI, Kolkata
April 17, 2012
What is Data Integration?
• Integration of multiple indicators
• Several different indicators exist
• The aim is an aggregate (overall) measure, obtained by an objective and statistically sound approach
• Multiple Criteria Decision Making [MCDM]
• Advocated by Hwang & Yoon (1981): Multiple Attribute Decision Making: Methods and Applications: A State-of-the-Art Survey. Springer-Verlag, Berlin
HEALTH ISSUES
• Air / Surface / Water Pollution: different sources and their effects
• US EPA: TRI Data Base
• Toxic Release Inventory [TRI] data
• EPA's 33/50 Program
• TRI data for 17 chemicals over the years 1987-1994, for 50 States & DC
TOXIC RELEASE INVENTORY [TRI]: US EPA
• Toxic chemicals: benzene, cadmium, carbon tetrachloride, chloroform, cyanide, lead, mercury, nickel, toluene, m-xylene, ...
• TRI data are expressed as percentages (%)
• The lower the better; the higher the worse
NATURE OF DATA & PROBLEM
States vs Chemicals: TRI Data [Coded]
State   Benzene
I        7%
II      12%
III     17%
IV       9%
V       14%
VI       6%
VII     15%
VIII    16%
• Q. Which State is the least hit by benzene? Ans. VI
• Q. And which is the worst hit? Ans. III
• With a single chemical there is no problem at all in ranking the States from best to worst.
ADD ONE MORE CHEMICAL...
States vs Chemicals: TRI Data
State   Benzene   Cadmium
I        7%       13%
II      12%        9%
III     17%        4%
IV       9%       11%
V       14%       10%
VI       6%       11%
VII     15%        9%
VIII    16%       11%
• Q. Combine the two chemicals: which State is the worst? How should they be combined?
AND ADD MORE...
• States \ Chemicals [TRI Data]
        Be    Cd    Ca    Tr    Ch    Cy   ...
I :      7%   13%   21%    2%   34%   21%  ...
• CONCEPT OF DATA MATRIX
• $X = ((X_{ij})),\ 1 \le i \le K,\ 1 \le j \le N$
• K locations & N data sources
• Data integration for an overall TRI index for global comparison
Data Matrix
States vs Chemicals: TRI Data
State   Be    Cd    Ca    Tr    Ch    Cy    L
I        7%   13%   21%    2%   34%   21%   17%
II      12%    9%   18%    3%   42%   28%   11%
III     17%    4%   23%    7%   22%   19%   23%
IV       9%   11%   17%    5%   25%   23%   19%
V       14%   10%   13%    8%   21%   19%   25%
VI       6%   11%   19%    5%   33%   21%   22%
VII     15%    9%   13%    4%   38%   19%   28%
VIII    16%   11%   10%    5%   33%   20%   25%
Application Areas
• Disease prevalence statistics
• Disease symptom statistics
• Health statistics
• Demographic statistics
• Human Development Index statistics
• Data integration is a common problem; the techniques are quite general.
Nature of Data
• Locations versus features: quantitative data giving the impacts of the features on the locations, based on the similarity principle
• Purpose: overall ranking of the locations based on combined evidence from a pool of features
• Features may or may not have equal importance in the process of combining evidence
Aggregate Methods
• Some kind of "aggregate": pooling the TRI data into a single value for each State, for overall comparison
• TRI data for State I: total TRI = 115 [over 7 features]; average TRI = 16.43%
• Compute the average for each State and compare the averages across all States
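The averaging step can be sketched in a few lines of Python. This is only an illustration: the dictionary `TRI` and the variable names are assumptions made here, with the values copied from the data matrix slide (columns Be, Cd, Ca, Tr, Ch, Cy, L).

```python
# TRI percentages per state, columns in the order Be, Cd, Ca, Tr, Ch, Cy, L.
# (Names are illustrative; values are copied from the data matrix slide.)
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}

# Total and arithmetic mean of the seven features for each state.
totals = {state: sum(row) for state, row in TRI.items()}
averages = {state: totals[state] / len(row) for state, row in TRI.items()}

# Lower TRI is better, so the best-placed state comes first.
ranking = sorted(averages, key=averages.get)

for state in ranking:
    print(f"{state:>4}: total = {totals[state]:3d}, average = {averages[state]:.2f}%")
```

With equal treatment of all chemicals, State IV has the smallest average and State VII the largest.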
Aggregate Methods
• Arithmetic mean [AM], geometric mean [GM], harmonic mean [HM]
• Use of the median as representative of TRI
• TRI data: State I, median = 17%; State II, median = 12%; State III, median = 19%; etc.
• Q. Are all chemicals equally harmful? Ans. Possibly NOT!
• Q. Are all features equally important?
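As a rough sketch, the alternative aggregates can be computed with Python's standard `statistics` module. The dictionary `TRI` and the `summaries` structure are names assumed here for illustration; the values come from the data matrix slide.

```python
from statistics import mean, median, geometric_mean, harmonic_mean

# TRI percentages per state, columns in the order Be, Cd, Ca, Tr, Ch, Cy, L.
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}

# One summary record per state: AM, GM, HM and median of its seven scores.
summaries = {
    state: {
        "AM": mean(row),
        "GM": geometric_mean(row),
        "HM": harmonic_mean(row),
        "median": median(row),
    }
    for state, row in TRI.items()
}

for state, s in summaries.items():
    print(f"{state:>4}: AM={s['AM']:.2f} GM={s['GM']:.2f} "
          f"HM={s['HM']:.2f} median={s['median']}")
```

For positive data the three means always satisfy HM ≤ GM ≤ AM, so the choice of mean can change how strongly a single large value affects a State's score.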
Concept of Weight
• Based on the subject specialist's knowledge
• Choice of weights: relative importance
• Wts. [TRI]: 2.0 3.5 1.0 4.5 5.0 7.0 2.0
• Interpretation of the weights: total of weights = 25.0
• Rel. wts.: 2.0/25 = 8%, 3.5/25 = 14%, etc., for all chemicals
• Total of rel. wts. = 1 or 100%
• Use the rel. wts. to compute the weighted AM, GM, HM, etc.
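A minimal sketch of the weighting step, assuming the weights apply to the chemicals in the column order Be, Cd, Ca, Tr, Ch, Cy, L. The names `raw_weights`, `rel_weights` and `weighted_am` are illustrative.

```python
# TRI percentages per state, columns in the order Be, Cd, Ca, Tr, Ch, Cy, L.
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}

# Specialist-assigned weights from the slide (assumed to follow the column order).
raw_weights = [2.0, 3.5, 1.0, 4.5, 5.0, 7.0, 2.0]

# Relative weights: each weight divided by the total (25.0), so they sum to 1.
total_weight = sum(raw_weights)
rel_weights = [w / total_weight for w in raw_weights]

# Weighted arithmetic mean of each state's scores.
weighted_am = {
    state: sum(w * x for w, x in zip(rel_weights, row))
    for state, row in TRI.items()
}

for state, value in weighted_am.items():
    print(f"{state:>4}: weighted AM = {value:.2f}%")
```

For State I the weighted AM is 17.62%, compared with the unweighted average of 16.43%.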
Use of Ranks
• Convert scores into ranks for each item: the TRI data matrix becomes a rank matrix (shown for Benzene; likewise for Cadmium, etc.)
State   Benzene TRI score   Rank
I        7%                 2
II      12%                 4
III     17%                 8
IV       9%                 3
V       14%                 5
VI       6%                 1
VII     15%                 6
VIII    16%                 7
Rank Matrix
States vs Chemicals: TRI Data (ranks)
State   Be   Cd   Ca   Tr   Ch   Cy   L
I       2
II      4
III     8
IV      3    etc., for each chemical
V       5
VI      1
VII     6
VIII    7
• Then use "aggregate" methods based on the ranks
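The conversion to ranks can be sketched as follows. The slides do not say how tied scores are ranked (Cadmium, for example, has three States at 11%); this sketch assumes tied values share the average of the ranks they occupy. The function `average_ranks` and the other names are illustrative.

```python
# TRI percentages per state, columns in the order Be, Cd, Ca, Tr, Ch, Cy, L.
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}

def average_ranks(values):
    """Rank ascending (1 = smallest); tied values share the mean of their ranks.

    The tie rule is an assumption; the slides do not specify one.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

states = list(TRI)
columns = [list(col) for col in zip(*TRI.values())]
rank_columns = [average_ranks(col) for col in columns]

# Rank matrix: one row of ranks per state, then a simple aggregate (rank sum).
rank_matrix = {s: [rc[i] for rc in rank_columns] for i, s in enumerate(states)}
rank_sums = {s: sum(r) for s, r in rank_matrix.items()}

for s in states:
    print(f"{s:>4}: ranks = {rank_matrix[s]}, rank sum = {rank_sums[s]}")
```

Any of the aggregate methods (sum, mean, median, weighted mean) can then be applied to the rows of `rank_matrix` instead of the raw scores.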
Why Ranks?
• With raw scores, aggregate methods are sensitive to outliers (extreme values that are too high or too low)
• One remedy: use of a trimmed mean
• Ranking is recommended for robust results
Less Known Methods
• TOPSIS method
• ELECTRE method [computation-intensive]
• Concepts of the TOPSIS method:
• Features and locations; Ideal location and Anti-Ideal location
• Distance from the Ideal and from the Anti-Ideal
• Within-feature variation
• Composite index
TOPSIS METHOD
Technique for Ordering Preferences by Similarity to Ideal Solution
Uses the concepts of
• Ideal & Anti-Ideal locations
• Distance from the Ideal & Anti-Ideal locations
• Weight of features
• Sum of squares for each feature
Philosophy for TOPSIS
• TOPSIS (Technique for Ordering Preferences by Similarity to Ideal Solution)
• In the absence of a natural course of action for an overall summary measure and ranking, the next best alternative is to assign the top rank to the location that has the shortest distance from the Ideal and the farthest distance from the Anti-Ideal.
Concepts: Ideal & Anti-Ideal States
States vs Chemicals: TRI Data
State        Be    Cd    Ca    Tr    Ch    Cy    L
I             7%   13%   21%    2%   34%   21%   17%
II           12%    9%   18%    3%   42%   28%   11%
III          17%    4%   23%    7%   22%   19%   23%
IV            9%   11%   17%    5%   25%   23%   19%
V            14%   10%   13%    8%   21%   19%   25%
VI            6%   11%   19%    5%   33%   21%   22%
VII          15%    9%   13%    4%   38%   19%   28%
VIII         16%   11%   10%    5%   33%   20%   25%
Ideal         6%    4%   10%    2%   21%   19%   11%
Anti-Ideal   17%   13%   23%    8%   42%   28%   28%
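Because lower TRI is better, the Ideal row in the table is the column-wise minimum and the Anti-Ideal row is the column-wise maximum. A minimal sketch (the names `TRI`, `ideal` and `anti_ideal` are illustrative):

```python
# TRI percentages per state, columns in the order Be, Cd, Ca, Tr, Ch, Cy, L.
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}

# Transpose the matrix so that each tuple holds one chemical across all states.
columns = list(zip(*TRI.values()))

# Lower is better: the hypothetical best state takes every column minimum,
# the hypothetical worst state takes every column maximum.
ideal = [min(col) for col in columns]
anti_ideal = [max(col) for col in columns]

print("Ideal:     ", ideal)
print("Anti-Ideal:", anti_ideal)
```

If a feature were of the "higher is better" kind, the roles of `min` and `max` would be swapped for that column.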
Ideal & Anti-Ideal
• Hypothetical locations!
• Absolute best / worst States (hypothetical)
• They set the limits for the others
• Ranking of the others: which States are better placed?
• Those closer to the Ideal: distance from the Ideal is small
• AND ALSO far from the Anti-Ideal: distance from the Anti-Ideal is large
Concepts of Distance
• Euclidean distance
Ideal:        6%    4%   10%    2%   21%   19%   11%
Anti-Ideal:  17%   13%   23%    8%   42%   28%   28%
• Squared distance between the Ideal and the Anti-Ideal = (6-17)^2 + (4-13)^2 + ... + (11-28)^2
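Written out in full for the seven chemicals (unweighted, using the Ideal and Anti-Ideal rows above), the squared distance works out as follows; the symbol $d^2$ is introduced here only for illustration:

```latex
\begin{aligned}
d^2(\text{Ideal}, \text{Anti-Ideal})
  &= (6-17)^2 + (4-13)^2 + (10-23)^2 + (2-8)^2 + (21-42)^2 + (19-28)^2 + (11-28)^2 \\
  &= 121 + 81 + 169 + 36 + 441 + 81 + 289 \\
  &= 1218
\end{aligned}
```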
Computations
• Distance between a location and the Ideal [ID] or the Anti-Ideal [NID]
I:      7%   13%   21%    2%   34%   21%   17%
ID:     6%    4%   10%    2%   21%   19%   11%
NID:   17%   13%   23%    8%   42%   28%   28%
• Sq. Dist. [I vs ID]: (7 - 6)^2 = 1, (13 - 4)^2 = 81, ...
• Sq. Dist. [I vs NID]: (7 - 17)^2 = 100, (13 - 13)^2 = 0, ...
Sq. Dist. Comp. vs Ideal
Locations \ Features (Chemicals)
State     1     2     3     4     5     6     7
I         1    81   121     0   169     4    36
II       36    25    64     1   441    81     0
III     121     0   169    25     1     0   144
IV        9    49    49     9    16    16    64
V        64    36     9    36     0     0   196
VI        0    49    81     9   144     4   121
VII      81    25     9     4   289     0   289
VIII    100    49     0     9   144     1   196
Sq. Dist. Comp. vs Anti-Ideal
Locations \ Features (Chemicals)
State     1     2     3     4     5     6     7
I       100     0     4    36    64    49   121
II       25    16    25    25     0     0   289
III       0    81     0     1   400    81    25
IV       64     4    36     9   289    25    81
V         9     9   100     0   441    81     9
VI      121     4    16     9    81    49    36
VII       4    16   100    16    16    81     0
VIII      1     4   169     9    81    64     9
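Both tables of squared-distance components can be regenerated from the data matrix. The sketch below assumes the Ideal and Anti-Ideal are the column minima and maxima; the helper `squared_components` and the other names are illustrative.

```python
# TRI percentages per state, columns in the order Be, Cd, Ca, Tr, Ch, Cy, L.
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}

columns = list(zip(*TRI.values()))
ideal = [min(col) for col in columns]       # hypothetical best state
anti_ideal = [max(col) for col in columns]  # hypothetical worst state

def squared_components(row, reference):
    """Feature-by-feature squared differences between a state and a reference."""
    return [(x - r) ** 2 for x, r in zip(row, reference)]

sq_vs_ideal = {s: squared_components(row, ideal) for s, row in TRI.items()}
sq_vs_anti = {s: squared_components(row, anti_ideal) for s, row in TRI.items()}

print("State  vs Ideal")
for s, comps in sq_vs_ideal.items():
    print(f"{s:>5}  " + " ".join(f"{c:4d}" for c in comps))

print("State  vs Anti-Ideal")
for s, comps in sq_vs_anti.items():
    print(f"{s:>5}  " + " ".join(f"{c:4d}" for c in comps))
```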
Choice of Weights
• Wts. [TRI]: 2.0 3.5 1.0 4.5 5.0 7.0 2.0
• Rel. wts.: .08 .14 .04 .18 .20 .28 .08
• Sum of squares for each feature over all locations
• Be: 7^2 + 12^2 + 17^2 + 9^2 + 14^2 + 6^2 + 15^2 + 16^2 = 1276
Computation of Feature-wise Sum of Squares
Feature    Sum of Squares
1 [Be]     1276
2 [Cd]      810
3 [Ca]     2382
4 [Tr]      217
5 [Ch]     8092
6 [Cy]     3678
7 [L]      3818
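The feature-wise sums of squares can be checked directly against the displayed data matrix. This is a small sketch; the names `labels` and `sum_sq` are illustrative.

```python
# TRI percentages per state, columns in the order Be, Cd, Ca, Tr, Ch, Cy, L.
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}
labels = ["Be", "Cd", "Ca", "Tr", "Ch", "Cy", "L"]

# For each chemical, add the squared scores over all eight states.
columns = list(zip(*TRI.values()))
sum_sq = [sum(x * x for x in col) for col in columns]

for label, ss in zip(labels, sum_sq):
    print(f"{label:>2}: {ss}")
```

These sums act as a per-feature scale: dividing a squared distance by the feature's sum of squares makes the contribution unit-free.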
Formation of Composite Indices
• Ingredients: distances, weights & sums of squares
• Composite Index [CI]: two components, derived from the Ideal and Anti-Ideal locations
• For each location, add over all features: squared distance × weight of feature, divided by sum of squares of feature
Details of Computations
• State I: $L_2[\mathrm{I}, \mathrm{IDR}] = \left[ (7-6)^2 \times 0.08 / 1276 + \dots \right]^{1/2}$
• State I: $L_2[\mathrm{I}, \mathrm{NIDR}] = \left[ (7-17)^2 \times 0.08 / 1276 + \dots \right]^{1/2}$
• Composite Index: $CI = L_2[\mathrm{I}, \mathrm{IDR}] \,/\, \{ L_2[\mathrm{I}, \mathrm{IDR}] + L_2[\mathrm{I}, \mathrm{NIDR}] \}$
• CI is a ratio between 0 and 1: the smaller the ratio, the better the placement of the State in the overall comparison across States
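Putting the ingredients together, the composite index can be sketched end to end. This is an illustrative implementation under stated assumptions, not the author's own code: Ideal and Anti-Ideal are the column minima and maxima, the relative weights are the ones listed on the weights slide, and the sums of squares are computed from the displayed data matrix. All variable and function names are made up for the sketch.

```python
from math import sqrt

# TRI percentages per state, columns in the order Be, Cd, Ca, Tr, Ch, Cy, L.
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}
raw_weights = [2.0, 3.5, 1.0, 4.5, 5.0, 7.0, 2.0]
rel_weights = [w / sum(raw_weights) for w in raw_weights]

columns = list(zip(*TRI.values()))
ideal = [min(col) for col in columns]               # lower is better
anti_ideal = [max(col) for col in columns]
sum_sq = [sum(x * x for x in col) for col in columns]

def weighted_l2(row, reference):
    """sqrt of the sum over features of (squared distance x weight / sum of squares)."""
    total = sum(
        w * (x - r) ** 2 / ss
        for x, r, w, ss in zip(row, reference, rel_weights, sum_sq)
    )
    return sqrt(total)

d_ideal = {s: weighted_l2(row, ideal) for s, row in TRI.items()}
d_anti = {s: weighted_l2(row, anti_ideal) for s, row in TRI.items()}

# CI = distance to Ideal / (distance to Ideal + distance to Anti-Ideal).
ci = {s: d_ideal[s] / (d_ideal[s] + d_anti[s]) for s in TRI}

# Smaller CI means a better-placed state, so rank 1 goes to the smallest CI.
ranking = sorted(ci, key=ci.get)
for rank, s in enumerate(ranking, start=1):
    print(f"{rank}. {s:>4}  L2[ID]={d_ideal[s]:.4f}  L2[NID]={d_anti[s]:.4f}  CI={ci[s]:.4f}")
```

A State that coincides with the Ideal would get CI = 0 and one that coincides with the Anti-Ideal would get CI = 1; real States fall in between.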
Computational Details: Ideal
Locations \ Features: Sq. Distance wrt Ideal × Weight / SS of Feature
State   Be        Cd        Ca        Tr        Ch        Cy        L
I       .000063   .014000   .002032   0         .004177   .000305   .000754
II      .002257   .004321   .001075   .000829   .010900   .006166   0
III     .007586   0         .002838   .020737   .000025   0         .003017
IV      .000564   .008469   .000823   .007465   .000395   .001218   .001341
V       .004013   .006222   .000151   .029862   0         0         .004107
VI      0         .008469   .001360   .007465   .003559   .000305   .002535
VII     .005078   .004321   .000151   .003318   .007143   0         .006056
VIII    .006270   .008469   0         .007465   .003559   .000076   .004107
Computational Details: Anti-Ideal
Locations \ Features: Sq. Distance wrt Anti-Ideal × Weight / SS of Feature
State   Be        Cd        Ca        Tr        Ch        Cy        L
I       .006270   0         .000067   .029862   .001582   .003730   .002535
II
Final Ranking Table
State   L2[., IDR]   L2[., NIDR]   CI   Rank
I
II
III
IV      etc.
V
VI
VII
VIII
Choice of Weights
• Internal & external importance of environmental factors
• Use of Shannon's entropy measure
• Define $p_{ij} = X_{ij} / \sum_i X_{ij}$, the proportion of item $j$ attributable to location $i$
• Compute for each item $j$: $\varepsilon(j) = -\sum_i p_{ij} \ln p_{ij} / \ln K$
• Use $w(j) = (1 - \varepsilon(j)) / \sum_r (1 - \varepsilon(r))$
• Alternatively, use $w(j)$ proportional to $cv^2$ of item $j$, where the coefficient of variation [cv] is computed from the data matrix
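Both data-driven weighting schemes can be sketched from the data matrix. The entropy weights follow the formulas above with K = 8 locations. For the cv-based weights, the slide does not say which variance divisor is used; the sketch assumes the population variance (divisor K). All names are illustrative.

```python
from math import log

# TRI percentages per state, columns in the order Be, Cd, Ca, Tr, Ch, Cy, L.
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}
K = len(TRI)                          # number of locations
columns = list(zip(*TRI.values()))    # one tuple per chemical

def normalized_entropy(col):
    """Shannon entropy of the column proportions, divided by ln K."""
    total = sum(col)
    proportions = [x / total for x in col]
    return -sum(p * log(p) for p in proportions if p > 0) / log(K)

entropies = [normalized_entropy(col) for col in columns]

# Entropy weights: features whose values are more uneven get more weight.
slack = sum(1 - e for e in entropies)
w_entropy = [(1 - e) / slack for e in entropies]

def cv_squared(col):
    """Squared coefficient of variation (population variance: an assumption)."""
    m = sum(col) / len(col)
    variance = sum((x - m) ** 2 for x in col) / len(col)
    return variance / m ** 2

cv2 = [cv_squared(col) for col in columns]
w_cv = [c / sum(cv2) for c in cv2]

for name, we, wc in zip(["Be", "Cd", "Ca", "Tr", "Ch", "Cy", "L"], w_entropy, w_cv):
    print(f"{name:>2}: entropy weight = {we:.4f}, cv^2 weight = {wc:.4f}")
```

Either weight vector can replace the specialist-assigned relative weights in the composite index computation.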
Extensions
• Ranking depends critically on the choice of distance measure and the choice of weights
• Distance measure: squared distance [L2-norm]
• Alternative: mean deviation [L1-norm]
Results of TOPSIS Analysis : Two Sets of Weights [Entropy & CV] & Two Distance Measures [L-1 & L-2]
Questions
• Q1. Can the indicators be expressed in their original units of measurement, or do we need percentages only?
• Ans. Yes, original units will do, since the formulae are unit-free. Also see the US Original Pollution Data at the end.
• Q2. What about interdependence among the indicators?
Questions
• Ans. [to Q2] The indicators are believed to be uncorrelated. If there is any functional dependence among them, only the smallest subset of them should be used.
• Q3. What about PCA?
• Ans. PCA will not lead to a ranking of the locations. Also, the linear combinations of the indicators will be difficult to interpret.