Measurements and Data

twyla

Presentation Transcript


  1. Measurements and Data

  2. Topics
     • Types of Data
     • Distance Measurement
     • Data Transformation
     • Forms of Data
     • Data Quality

  3. Types of Measurement
     • Ordinal
       – e.g., excellent = 5, very good = 4, good = 3, …
     • Nominal
       – e.g., color, religion, profession
       – Need non-metric methods
     • Ratio
       – e.g., weight
       – has concatenation property: two weights add to balance a third, 2 + 3 = 5
     • Interval
       – e.g., temperature, calendar time
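The difference between ratio and interval scales can be checked numerically. The sketch below is an illustration added here, not part of the slides: the conversion helpers `kg_to_lb` and `c_to_f` and the sample values are assumptions. It shows that a ratio of weights survives a change of unit, while a ratio of temperatures does not; only ratios of temperature differences are preserved.

```python
def kg_to_lb(kg):
    """Convert kilograms to pounds (ratio scale: zero stays zero)."""
    return kg * 2.20462


def c_to_f(celsius):
    """Convert Celsius to Fahrenheit (interval scale: the zero point moves)."""
    return celsius * 9.0 / 5.0 + 32.0


# Ratio scale: "twice as heavy" means the same thing in any unit.
ratio_kg = 6.0 / 3.0
ratio_lb = kg_to_lb(6.0) / kg_to_lb(3.0)

# Interval scale: "twice as hot" depends on the unit chosen.
ratio_c = 20.0 / 10.0
ratio_f = c_to_f(20.0) / c_to_f(10.0)

# Ratios of differences are still meaningful on an interval scale.
diff_ratio_c = (30.0 - 20.0) / (20.0 - 10.0)
diff_ratio_f = (c_to_f(30.0) - c_to_f(20.0)) / (c_to_f(20.0) - c_to_f(10.0))

print(ratio_kg, ratio_lb)
print(ratio_c, ratio_f)
print(diff_ratio_c, diff_ratio_f)
```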

  4. Examples of Metrics
     • Euclidean Distance dE
       – Standardized (divide by variance)
       – Weighted dWE
     • Minkowski measure
       – Manhattan Distance
     • Mahalanobis Distance dM
       – Use of Covariance
     • Binary data Distances
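A minimal sketch of the plain, weighted, and standardized Euclidean distances listed above. The function names and the cup measurements are hypothetical, and "standardized" is read here as weighting each squared difference by the reciprocal of that variable's sample variance, which is one common interpretation rather than something the slides spell out.

```python
import math
import statistics


def euclidean(x, y):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def weighted_euclidean(x, y, weights):
    """Euclidean distance with one non-negative weight per variable."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


# Hypothetical cups: height in millimetres, diameter in metres.
cups = [[100.0, 0.08], [120.0, 0.06], [90.0, 0.10]]

# Sample variance of each variable (column).
variances = [statistics.variance(col) for col in zip(*cups)]

# Unscaled distance is dominated by the variable with the larger numbers.
raw = euclidean(cups[0], cups[1])

# Dividing each squared difference by the variance puts both variables
# on a comparable footing.
standardized = weighted_euclidean(cups[0], cups[1], [1.0 / v for v in variances])

print(raw, standardized)
```

With these made-up numbers the raw distance is almost entirely the height difference, while the standardized distance lets the diameter contribute as well.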

  5. Use of Covariance in Distance
     • Similarities between cups
     • Suppose we measure cup-height 100 times and diameter only once
       – height will dominate although 99 of the height measurements are not contributing anything
     • They are very highly correlated
     • To eliminate redundancy we need a data-driven method
       – the approach is not only to standardize data in each direction but also to use covariance between variables
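One data-driven distance that uses covariance is the Mahalanobis distance named on the metrics slide. The sketch below handles only two variables so that the covariance matrix can be inverted by hand; the function names and the height/diameter pairs are assumptions made for illustration.

```python
import math


def covariance_2d(points):
    """Return (sxx, sxy, syy), the sample covariance entries of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points x and y for covariance cov."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # must be non-zero for the inverse to exist
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form with the inverse matrix (1/det) * [[syy, -sxy], [-sxy, sxx]].
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)


# Hypothetical cups: (height, diameter), strongly correlated.
cups = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
cov = covariance_2d(cups)

# Two displacements with the same Euclidean length.
along = mahalanobis_2d((3.0, 3.0), (4.0, 4.0), cov)   # follows the correlation
across = mahalanobis_2d((3.0, 3.0), (4.0, 2.0), cov)  # goes against it

print(along, across)
```

A step that follows the correlation counts as short because the second variable adds little new information, while a step against it counts as long. With an identity covariance the function reduces to the Euclidean distance.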

  6. Covariance between two Scalar Variables
     • A scalar value to measure how x and y vary together
     • Large positive value
       – if large values of x tend to be associated with large values of y and small values of x with small values of y
     • Large negative value
       – if large values of x tend to be associated with small values of y
     • With d variables can construct a d × d matrix of covariances
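A short sketch of the covariance between two variables and the d × d covariance matrix built from it. The slide's own formula did not survive extraction, so the n − 1 divisor used here is an assumption (dividing by n is also common), and the three variables are made-up data.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def covariance_matrix(columns):
    """d x d matrix whose (i, j) entry is the covariance of columns i and j."""
    return [[covariance(a, b) for b in columns] for a in columns]


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # rises with x: positive covariance
z = [8.0, 6.0, 4.0, 2.0]   # falls as x rises: negative covariance

matrix = covariance_matrix([x, y, z])
for row in matrix:
    print(row)
```

The diagonal entries are the variances of the individual variables, and the matrix is symmetric.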

  7. Correlation Coefficient
     Value of Covariance is dependent upon ranges of x and y
     Dependency is removed by dividing values of x by their standard deviation and values of y by their standard deviation
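The effect of dividing by the standard deviations can be seen by changing the unit of one variable. In this sketch (function names and data are illustrative assumptions) the covariance scales with the unit, while the correlation coefficient does not.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)


x_m = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical lengths in metres
y = [2.1, 3.9, 6.2, 7.8, 10.1]
x_cm = [100.0 * v for v in x_m]      # the same lengths in centimetres

cov_m, cov_cm = covariance(x_m, y), covariance(x_cm, y)
corr_m, corr_cm = correlation(x_m, y), correlation(x_cm, y)

print(cov_m, cov_cm)     # differ by a factor of 100
print(corr_m, corr_cm)   # agree up to rounding
```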

  8. Correlation Matrix
     Housing-related variables across city suburbs (d = 11)
     11 × 11 pixel image (White 1, Black -1)
     Columns 12-14 have values -1, 0, 1 for pixel intensity reference
     Remaining represent correlation matrix
     Variables 3 and 4 are highly negatively correlated with Variable 2
     Variable 5 is positively correlated with Variable 11
     Variables 8 and 9 are highly correlated
     (Figure: correlation matrix image with a reference strip for -1, 0, +1)

  9. Generalizing Euclidean Distance
     Minkowski or Lλ metric
     • λ = 2 gives the Euclidean metric
     • λ = 1 gives the Manhattan or City-block metric
     • λ = ∞ yields the maximum metric: the largest absolute difference over the variables
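A minimal sketch of the Minkowski metric for the three values of λ listed above. The function name `minkowski` and the test points are assumptions; the λ = ∞ case is handled as the limit of the general formula, which is the largest absolute coordinate difference.

```python
import math


def minkowski(x, y, lam):
    """Minkowski (L-lambda) distance between two equal-length vectors."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(lam):
        # Limit as lambda grows: only the largest difference survives.
        return max(diffs)
    return sum(d ** lam for d in diffs) ** (1.0 / lam)


p, q = (0.0, 0.0), (3.0, 4.0)

manhattan = minkowski(p, q, 1)          # 3 + 4
euclid = minkowski(p, q, 2)             # sqrt(9 + 16)
maximum = minkowski(p, q, math.inf)     # max(3, 4)

print(manhattan, euclid, maximum)
```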

  10. Distance Measures for Binary Data
      • Most obvious measure is Hamming Distance normalized by number of bits
        – Its complement is the proportion of variables on which the objects have the same value
      • If we don't care about irrelevant properties had by neither object we have the Jaccard Coefficient
        – Example: two documents do not have certain terms
      • Dice Coefficient extends this argument
        – If 00 matches are irrelevant then 10 and 01 mismatches should have half relevance
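The three binary measures can be written in terms of the four match counts n11, n10, n01 and n00. The sketch below uses their standard definitions; the function names and the two bit vectors are illustrative, and the case where a denominator is zero (for example, two all-zero vectors under Jaccard) is deliberately left unhandled.

```python
def match_counts(a, b):
    """Return (n11, n10, n01, n00) for two equal-length 0/1 vectors."""
    n11 = sum(1 for p, q in zip(a, b) if p == 1 and q == 1)
    n10 = sum(1 for p, q in zip(a, b) if p == 1 and q == 0)
    n01 = sum(1 for p, q in zip(a, b) if p == 0 and q == 1)
    n00 = sum(1 for p, q in zip(a, b) if p == 0 and q == 0)
    return n11, n10, n01, n00


def normalized_hamming(a, b):
    """Proportion of positions where the vectors differ."""
    n11, n10, n01, n00 = match_counts(a, b)
    return (n10 + n01) / (n11 + n10 + n01 + n00)


def jaccard(a, b):
    """Similarity that ignores 0-0 matches."""
    n11, n10, n01, _ = match_counts(a, b)
    return n11 / (n11 + n10 + n01)


def dice(a, b):
    """Like Jaccard, but mismatches carry half the weight of 1-1 matches."""
    n11, n10, n01, _ = match_counts(a, b)
    return 2 * n11 / (2 * n11 + n10 + n01)


# Hypothetical term-presence vectors for two documents.
doc_a = [1, 1, 0, 0, 0, 0, 1, 0]
doc_b = [1, 0, 0, 0, 0, 0, 1, 1]

print(normalized_hamming(doc_a, doc_b))
print(1 - normalized_hamming(doc_a, doc_b))  # proportion of agreeing positions
print(jaccard(doc_a, doc_b))
print(dice(doc_a, doc_b))
```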

  11. Transforming the Data
      Model depends on form of data
      If Y is a function of X² then we could use a quadratic function, or choose U = X² and use a linear fit
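A small sketch of the second option: transform X to U = X² and fit a straight line to (U, Y). The least-squares helper `fit_line` and the noise-free data generated from Y = 3X² + 1 are assumptions chosen so that the recovered coefficients are easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x ** 2 + 1.0 for x in xs]   # Y depends on X squared

us = [x ** 2 for x in xs]               # transformed variable U = X^2
slope, intercept = fit_line(us, ys)     # linear in U

print(slope, intercept)
```

Fitting a line to the untransformed pairs (X, Y) would leave systematic curvature in the residuals; after the transformation the linear model is exact for this data.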

  12. V1 is non-linearly related to V2; V3 = 1/V2 is linearly related to V1
      (Figure: plots of V1 against V2 and of V1 against V3)

  13. Variance increases
      Square root transformation keeps the variance constant
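The slide does not say what kind of data is plotted. As one illustration, assume count data following a Poisson distribution, where the variance equals the mean and therefore grows with it. The sketch below computes the variance of X and of √X exactly from the Poisson probabilities (no sampling); the function name and the chosen means are assumptions.

```python
import math


def poisson_variances(lam):
    """Return (Var[X], Var[sqrt(X)]) for X ~ Poisson(lam), by direct summation."""
    upper = int(lam + 20.0 * math.sqrt(lam) + 50)  # far into the negligible tail
    p = math.exp(-lam)                             # P(X = 0)
    mean = mean_sq = mean_root = 0.0
    for k in range(upper + 1):
        mean += k * p
        mean_sq += k * k * p
        mean_root += math.sqrt(k) * p
        p *= lam / (k + 1)                         # P(X = k + 1) from P(X = k)
    # Var[X] = E[X^2] - E[X]^2 and Var[sqrt X] = E[X] - E[sqrt X]^2.
    return mean_sq - mean ** 2, mean - mean_root ** 2


results = {lam: poisson_variances(lam) for lam in (10.0, 40.0, 160.0)}
for lam, (var_raw, var_root) in results.items():
    print(lam, var_raw, var_root)
```

Under this assumption the raw variance grows in step with the mean, while the variance after the square root stays close to one quarter.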

  14. Forms of Data
      Standard Data (Data Matrix)
      Multirelational Data
      String
      • Sequence of symbols from a finite alphabet
      Event Sequence
      • Sequence of pairs of the form {event, occurrence time}

  15. Multirelational Data (multiple data matrices)
      Payroll Database: Name, Department Name, Age, Salary
      Department Table: Department Name, Budget, Manager
      Can be combined together to form a data matrix with fields name, department-name, age, salary, budget, manager
      Or create as many rows as department-names
      Flattening requires needless replication (storage issues)
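Flattening the two tables into one data matrix can be sketched as a join on the department name. All names and figures below are invented for illustration; only the field names come from the slide.

```python
# Hypothetical payroll table: one row per employee.
payroll = [
    {"name": "Ann", "department_name": "Sales", "age": 34, "salary": 52000},
    {"name": "Bob", "department_name": "Sales", "age": 45, "salary": 61000},
    {"name": "Cy", "department_name": "Research", "age": 29, "salary": 58000},
]

# Hypothetical department table, keyed by department name.
departments = {
    "Sales": {"budget": 900000, "manager": "Dee"},
    "Research": {"budget": 1500000, "manager": "Eli"},
}

# Flatten: copy each department's fields onto every matching employee row.
flat = [{**row, **departments[row["department_name"]]} for row in payroll]

for row in flat:
    print(row)

# The Sales budget and manager are now stored once per Sales employee.
sales_copies = sum(1 for row in flat if row["department_name"] == "Sales")
print(sales_copies)
```

The replication the slide warns about is visible directly: each department's budget and manager appear once for every employee in that department.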

  16. Data Quality for Individual Measurements
      • Data Mining depends on quality of data
      • Many interesting patterns discovered may result from measurement inaccuracies
      • Sources of error
        – Errors in measurement
        – Carelessness
        – Instrumentation failure

  17. Precision and Accuracy
      • Precise Measurement
        – Small variability (measured by variance)
        – Repeated measurements yield same value
        – Many digits of precision is not necessarily accurate (results of calculations give many digits)
      • Accurate
        – Not only small variability but close to true value
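The distinction can be made concrete with two made-up sets of repeated measurements of a quantity whose true value is assumed to be 10.0. Both sets have the same small variance, so both are precise, but only one is close to the true value.

```python
import statistics

TRUE_VALUE = 10.0  # assumed known here purely for illustration


def summarize(measurements, true_value):
    """Return (variance, bias): spread of the readings and offset of their mean."""
    variance = statistics.variance(measurements)
    bias = statistics.mean(measurements) - true_value
    return variance, bias


# Hypothetical instrument with a constant offset: precise but not accurate.
offset_readings = [12.01, 12.02, 11.99, 12.00, 11.98]

# Hypothetical well-calibrated instrument: precise and accurate.
good_readings = [10.01, 9.99, 10.02, 9.98, 10.00]

offset_var, offset_bias = summarize(offset_readings, TRUE_VALUE)
good_var, good_bias = summarize(good_readings, TRUE_VALUE)

print(offset_var, offset_bias)
print(good_var, good_bias)
```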

  18. Data Quality for Collections of Data
      • Collections of Data
        – Much of statistics is concerned with inference from a sample to a population
        – How to infer things from a fraction about entire population
        – Two sources of error: sample size and bias
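The two sources of error behave differently, which a small simulation can illustrate. Everything in this sketch is an assumption made for the example: the population (the integers 1 to 1000), the sample sizes, the fixed random seed, and the biased sampling scheme that only ever sees the upper half of the population.

```python
import random
import statistics

rng = random.Random(0)                     # fixed seed so the run is repeatable
population = list(range(1, 1001))
true_mean = statistics.mean(population)    # 500.5


def spread_of_sample_means(n, repeats=200):
    """Standard deviation of the sample mean over repeated random samples of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)


# Error from sample size: shrinks as the sample grows.
spread_small = spread_of_sample_means(25)
spread_large = spread_of_sample_means(400)

# Error from bias: sampling only the upper half stays wrong at any sample size.
upper_half = [v for v in population if v > 500]
biased_mean = statistics.mean(rng.sample(upper_half, 400))
bias_error = biased_mean - true_mean

print(spread_small, spread_large)
print(bias_error)
```

Drawing more observations reduces the first kind of error but does nothing for the second, since the biased sample never reaches the lower half of the population.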
