Measurements and Data
Topics
• Types of Data
• Distance Measurement
• Data Transformation
• Forms of Data
• Data Quality
Types of Measurement
• Ordinal
  – e.g., excellent=5, very good=4, good=3…
• Nominal
  – e.g., color, religion, profession
  – Need non-metric methods
• Ratio
  – e.g., weight
  – has concatenation property: two weights add to balance a third, 2+3=5
• Interval
  – e.g., temperature, calendar time
Examples of Metrics
• Euclidean Distance dE
  – Standardized (divide each variable by its standard deviation, i.e. each squared difference by the variance)
  – Weighted dWE
• Minkowski measure
  – Manhattan Distance
• Mahalanobis Distance dM
  – Use of Covariance
• Binary data Distances
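The Euclidean variants listed above can be sketched in a few lines. This is only an illustration: the function names, the sample vectors and the standard deviations are assumptions, and standardization is taken to mean dividing each variable by its standard deviation.

```python
import math

def euclidean(x, y):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance; w[k] is the (assumed) weight of variable k."""
    return math.sqrt(sum(wk * (a - b) ** 2 for a, b, wk in zip(x, y, w)))

def standardized_euclidean(x, y, std):
    """Euclidean distance after dividing each variable by its standard deviation."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, std)))

# Hypothetical data: the second variable has a much larger scale than the first.
x, y = [1.0, 200.0], [4.0, 600.0]
print(euclidean(x, y))                                 # dominated by the second variable
print(standardized_euclidean(x, y, std=[1.0, 100.0]))  # 5.0
```

Standardizing is the special case of weighting in which each weight is one over the variance of that variable.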
Use of Covariance in Distance
• Similarities between cups
• Suppose we measure cup height 100 times and diameter only once
  – height will dominate although 99 of the height measurements are not contributing anything
• They are very highly correlated
• To eliminate redundancy we need a data-driven method
  – approach is not only to standardize data in each direction but also to use covariance between variables
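One distance that uses covariance in this way is the Mahalanobis distance, sqrt((x − y)ᵀ Σ⁻¹ (x − y)). The sketch below is restricted to two variables so the matrix inverse can be written by hand; the function name and the covariance matrices are illustrative assumptions, not values from the slides.

```python
import math

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance for two variables, given a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - y[0], x[1] - y[1]]
    q = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

uncorrelated = [[1.0, 0.0], [0.0, 1.0]]  # identity: reduces to Euclidean distance
correlated = [[1.0, 0.9], [0.9, 1.0]]    # hypothetical, highly correlated pair

print(mahalanobis_2d([0.0, 0.0], [1.0, 1.0], uncorrelated))
print(mahalanobis_2d([0.0, 0.0], [1.0, 1.0], correlated))
```

With the correlated matrix, a displacement along the shared direction counts for less than it does under the identity matrix, which is how the redundancy between correlated variables is discounted.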
Covariance between two Scalar Variables
• A scalar value to measure how x and y vary together
• Large positive value
  – if large values of x tend to be associated with large values of y and small values of x with small values of y
• Large negative value
  – if large values of x tend to be associated with small values of y
• With d variables can construct a d × d matrix of covariances
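A minimal sketch of the covariance and the d × d covariance matrix follows. It assumes the sample form that divides by n − 1 (the slides do not say which normalization they use), and the variable names and values are made up for illustration.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (assumes division by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def covariance_matrix(columns):
    """d x d matrix whose (i, j) entry is the covariance of variables i and j."""
    d = len(columns)
    return [[covariance(columns[i], columns[j]) for j in range(d)] for i in range(d)]

# Hypothetical measurements on four objects.
height = [1.0, 2.0, 3.0, 4.0]
diameter = [2.0, 4.0, 6.0, 8.0]  # rises with height: positive covariance
price = [8.0, 6.0, 4.0, 2.0]     # falls as height rises: negative covariance

for row in covariance_matrix([height, diameter, price]):
    print(row)
```

The diagonal of the matrix holds the variances of the individual variables, and the matrix is symmetric.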
Correlation Coefficient
Value of Covariance is dependent upon ranges of x and y
Dependency is removed by dividing values of x by their standard deviation and values of y by their standard deviation
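The effect of dividing by the standard deviations can be checked numerically: rescaling a variable changes its covariance with another variable but not the correlation coefficient. The sketch below uses made-up data and an illustrative function name.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    # The 1/n (or 1/(n - 1)) factors cancel between numerator and denominator.
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the same lengths expressed in two different units.
metres = [1.0, 2.0, 3.0, 5.0]
centimetres = [100.0 * m for m in metres]
score = [2.0, 1.0, 4.0, 6.0]

print(correlation(metres, score))
print(correlation(centimetres, score))  # same up to rounding, despite the change of range
```

The result always lies between -1 and +1, which is what makes entries of a correlation matrix directly comparable.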
Correlation Matrix
Housing related variables across city suburbs (d=11)
11 x 11 pixel image (White 1, Black -1)
Columns 12-14 have values -1, 0, 1 for pixel intensity reference
Remaining columns represent the correlation matrix
Variables 3 and 4 are highly negatively correlated with Variable 2
Variable 5 is positively correlated with Variable 11
Variables 8 and 9 are highly correlated
[Figure: correlation matrix image with reference columns for -1, 0, +1]
Generalizing Euclidean Distance
Minkowski or Lλ metric
• λ=2 gives the Euclidean metric
• λ=1 gives the Manhattan or City-block metric
• λ=∞ yields the largest coordinate difference, max over k of |x_k − y_k|
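As a sketch, the Lλ metric can be written as (Σ_k |x_k − y_k|^λ)^(1/λ), with λ=∞ handled as the standard limiting case, the largest coordinate difference. The function name and the test points are assumptions made for illustration.

```python
import math

def minkowski(x, y, lam):
    """Minkowski (L-lambda) distance between two equal-length vectors."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(lam):
        # Limiting case: the largest single coordinate difference.
        return max(diffs)
    return sum(d ** lam for d in diffs) ** (1.0 / lam)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))         # 7.0, Manhattan / city-block
print(minkowski(x, y, 2))         # 5.0, Euclidean
print(minkowski(x, y, math.inf))  # 4.0, maximum coordinate difference
```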
Distance Measures for Binary Data
• Most obvious measure is Hamming Distance normalized by number of bits
  – Its complement, the simple matching coefficient, is the proportion of variables on which objects have same value
• If we don't care about irrelevant properties had by neither object we have Jaccard Coefficient
  – Example: two documents do not have certain terms
• Dice Coefficient extends this argument
  – If 00 matches are irrelevant then 10 and 01 mismatches should have half relevance
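The binary measures can all be expressed through four counts: n11 (both objects have the property), n10 and n01 (only one has it) and n00 (neither has it). The sketch below uses the usual textbook forms; the function names and the two example vectors are assumptions.

```python
def binary_counts(a, b):
    """Count the four (a_k, b_k) combinations over two equal-length 0/1 vectors."""
    n11 = sum(1 for p, q in zip(a, b) if (p, q) == (1, 1))
    n10 = sum(1 for p, q in zip(a, b) if (p, q) == (1, 0))
    n01 = sum(1 for p, q in zip(a, b) if (p, q) == (0, 1))
    n00 = sum(1 for p, q in zip(a, b) if (p, q) == (0, 0))
    return n11, n10, n01, n00

def normalized_hamming(a, b):
    """Proportion of positions on which the two vectors differ."""
    n11, n10, n01, n00 = binary_counts(a, b)
    return (n10 + n01) / (n11 + n10 + n01 + n00)

def simple_matching(a, b):
    """Proportion of positions on which the two vectors agree."""
    n11, n10, n01, n00 = binary_counts(a, b)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard(a, b):
    """Ignores 00 matches; undefined if neither vector contains a 1."""
    n11, n10, n01, _ = binary_counts(a, b)
    return n11 / (n11 + n10 + n01)

def dice(a, b):
    """Like Jaccard, but 10 and 01 mismatches carry half weight."""
    n11, n10, n01, _ = binary_counts(a, b)
    return 2 * n11 / (2 * n11 + n10 + n01)

# Hypothetical term-presence vectors for two documents.
doc_a = [1, 1, 0, 0, 1, 0, 0, 0]
doc_b = [1, 0, 0, 0, 1, 1, 0, 0]
print(normalized_hamming(doc_a, doc_b))  # 0.25
print(simple_matching(doc_a, doc_b))     # 0.75
print(jaccard(doc_a, doc_b))             # 0.5
print(dice(doc_a, doc_b))
```

The many shared absences inflate the matching coefficient, while Jaccard and Dice look only at terms that at least one document contains.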
Transforming the Data
Model depends on form of data
If Y is a function of X² then we could use a quadratic function or choose U = X² and use a linear fit
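A small sketch of the second option: transform X to U = X² and fit a straight line Y = aU + b by ordinary least squares. The data are synthetic (generated from Y = 3X² + 1) and the helper name is an assumption.

```python
def linear_fit(us, ys):
    """Least-squares slope and intercept for the line y = a * u + b."""
    n = len(us)
    mu, my = sum(us) / n, sum(ys) / n
    slope = (sum((u - mu) * (y - my) for u, y in zip(us, ys))
             / sum((u - mu) ** 2 for u in us))
    return slope, my - slope * mu

# Synthetic data in which Y depends on X only through X squared.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x ** 2 + 1.0 for x in xs]

us = [x ** 2 for x in xs]  # the transformation U = X^2
a, b = linear_fit(us, ys)
print(a, b)                # close to 3 and 1
```

After the transformation the linear fit recovers the coefficients, so the simpler model suffices.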
V1 is non-linearly related to V2
V3 = 1/V2 is linearly related to V1
[Figure: scatter plot with axes V1 and V2]
Variance increases
Square root transformation keeps the variance constant
Forms of Data
• Standard Data (Data Matrix)
• Multirelational Data
• String
  – Sequence of symbols from a finite alphabet
• Event Sequence
  – Sequence of pairs of the form {event, occurrence time}
Multirelational Data (multiple data matrices)
Payroll Database: Name, Department-Name, Age, Salary
Department Table: Department-Name, Budget, Manager
Can be combined together to form a data matrix with fields name, department-name, age, salary, budget, manager
Or create as many rows as department-names
Flattening requires needless replication (storage issues)
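Flattening can be sketched as a join of the two tables on the department name. All records below are invented for illustration, and the dictionary-based layout is only one possible representation.

```python
# Hypothetical payroll table: one row per employee.
payroll = [
    {"name": "Ann", "department_name": "Sales", "age": 34, "salary": 52000},
    {"name": "Bob", "department_name": "Sales", "age": 45, "salary": 61000},
    {"name": "Cy", "department_name": "Research", "age": 29, "salary": 58000},
]

# Hypothetical department table, keyed by department name.
department_table = {
    "Sales": {"budget": 900000, "manager": "Dee"},
    "Research": {"budget": 400000, "manager": "Eli"},
}

def flatten(payroll_rows, departments):
    """Join each payroll row with the record of its department."""
    return [{**row, **departments[row["department_name"]]} for row in payroll_rows]

flat = flatten(payroll, department_table)
for row in flat:
    print(row)
```

The Sales budget and manager are now stored once per Sales employee, which is the replication the slide warns about.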
Data Quality for Individual Measurements
• Data Mining Depends on Quality of data
• Many interesting patterns discovered may result from measurement inaccuracies.
• Sources of error
  – Errors in measurement
  – Carelessness
  – Instrumentation failure
Precision and Accuracy
• Precise Measurement
  – Small variability (measured by variance)
  – Repeated measurements yield same value
  – Many digits of precision is not necessarily accurate (results of calculations give many digits)
• Accurate
  – Not only small variability but close to true value
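The distinction can be made concrete by computing two numbers for a set of repeated measurements: the variance (precision) and the difference between the mean and the true value (bias). The readings and the true value below are invented, and the function name is an assumption.

```python
import statistics

def precision_and_accuracy(measurements, true_value):
    """Return (variance, bias): small variance = precise, small |bias| as well = accurate."""
    variance = statistics.variance(measurements)
    bias = statistics.mean(measurements) - true_value
    return variance, bias

true_value = 10.0                          # hypothetical true length
precise_only = [10.51, 10.49, 10.50, 10.50]  # tightly clustered but offset
accurate = [10.01, 9.99, 10.00, 10.00]       # tightly clustered around the true value

precise_var, precise_bias = precision_and_accuracy(precise_only, true_value)
accurate_var, accurate_bias = precision_and_accuracy(accurate, true_value)
print(precise_var, precise_bias)
print(accurate_var, accurate_bias)
```

Both sets have almost the same small variance, but only the second has a mean close to the true value.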
Data Quality for Collections of Data
• Collections of Data
  – Much of statistics is concerned with inference from a sample to a population
  – How to infer things from a fraction about entire population
  – Two sources of error: sample size and bias