Matching Lecture 11
Topics • ID parade Frames • Matching Examples • Fuzzy Matching • Metric Spaces • Scales of measurement
ID Parade Frames • Classifying volunteers as clean • Matching suspect to volunteers • Reservation of parade facility, officers, volunteers • Managing long-running process from decision to hold parade to payment of volunteers • Accounting – payment to volunteers and billing of police authorities • Historical record and analysis
Merging multiple frames • Each frame produces its own model of the actors. • E.g. Models of volunteer • For matching with suspect • For classification • For payment • For reservation • For database, problem is called ‘view integration’
Miscellaneous Matching applications • Many systems have a matching task at their core: • Shazam – sound sample matching • De-duping mailing lists • CD DB – CD recognition • COTS selection • IS development selection • fingerprint matching • patient/donor matching for transplant surgery • blood typing and matching • patients to clinical trials • interns to placements in hospitals • DNA samples • search request to locate relevant documents • incoming news items to information subscribers • number plate recognition in London’s Congestion Charging System • speech and writing recognition • patterns to material to minimise wastage
Shazam - 2580 • Shazam is a mobile phone application • It can recognise 1.7 million tracks from a 30 sec sample – new tracks added at 5,000 a week • The track details are texted back within about 30secs • It costs 50p + 9p call charge (surcharge only if successful) • Your personal page shows the tracks you have tagged • www.shazam.com
De-duping • A catalogue from O’Reilly • C Wallace, West England University, Coldharbour Lane, Frenchay, Bristol BS16 1QY • Ms C Wallace, Univ. of the West of England, Frenchay Campus, Coldharbour Lane, Bristol BS16 1QY • One person or two? • Mailing lists are reported with 25 – 40% duplicates.
CD DB • Database of 2.5 million CD’s, track details and supporting matter run by gracenote (www.gracenote.com) • Used by media players to obtain track info • Player sends signature of CD [sequence of track lengths in 1/4sec] to match against the database (via HTTP) • Application searches DB for best match and returns track info to media player. • Matching algorithm described in US Patent 6,061,680
Commercial Off-The-Shelf Software (COTS) • Software exists for most business needs: • payroll • order processing • general ledger • human resources • e-commerce • e.g. SAP, Sage • but analysts need to match business needs to COTS capability, and customise generic software for local business rules.
Chatbots • Chatbots like ALICE simulate a human response to typed input • Most are for fun or annoyance • Increasingly being used for customer service, helpdesks, marketing • Based on matching patterns in text • The patterns are in an XML application called AIML
Police ID parade • Currently: • Suspect matched to Volunteers visually by officer • Information System • Suspect and Volunteers modelled in database • System provides list of matching volunteers
Matching in general • Matching tasks typically involve: • two sets of individuals, e.g. • the suspect / sampled track / DNA sample – The Requirement • the volunteers / 1.7 million stored tracks / DNA on file – The Resource • ‘adequate’ representations of both • a ‘fitness’ function which calculates how well matched a Resource is to the Requirement • a process to achieve the matching goal • Matching processes: • Single or Batch? • Single: One Req to many Resources • Batch: Many Reqs to many Resources (e.g. cutting) • Automatic, Interactive, Assistive • Automatic: Matching fully automated • Interactive: User makes final selection, adjusts weights • Assistive: Computer produces analyses which aid human selection
Single Allocation • Allocation to a single Requirement: • ‘long list’ the Resources: eliminate the obviously unsuitable • compute fitness between Requirement and each remaining Resource • rank the Resources in fitness order for a ‘short list’ • ? user selection from short list on basis of additional information unknown to system • Interactive • User adjusts: • description of Requirement (e.g. the search term in Google) • fitness function (e.g. the weights in the ID parade) • and retries
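The long-list, fitness and short-list steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's system: the function `short_list`, the `is_plausible` filter and the sample suspect and volunteers are all hypothetical, and a lower fitness value is taken to mean a closer match.

```python
from typing import Callable, List, Tuple, TypeVar

Q = TypeVar("Q")  # a Requirement
R = TypeVar("R")  # a Resource


def short_list(
    requirement: Q,
    resources: List[R],
    is_plausible: Callable[[Q, R], bool],
    fitness: Callable[[Q, R], float],
    size: int = 3,
) -> List[Tuple[float, R]]:
    """Return the `size` best-fitting resources (lower fitness = closer)."""
    # Step 1: long list, eliminating the obviously unsuitable.
    long_list = [r for r in resources if is_plausible(requirement, r)]
    # Step 2: compute fitness for each remaining resource.
    scored = [(fitness(requirement, r), r) for r in long_list]
    # Step 3: rank in fitness order and keep the top few.
    scored.sort(key=lambda pair: pair[0])
    return scored[:size]


# Hypothetical data: exclude a different gender, then rank by age difference.
suspect = {"age": 30, "gender": "M"}
volunteers = [
    {"name": "A", "age": 45, "gender": "M"},
    {"name": "B", "age": 31, "gender": "M"},
    {"name": "C", "age": 30, "gender": "F"},
    {"name": "D", "age": 27, "gender": "M"},
]
ranked = short_list(
    suspect,
    volunteers,
    is_plausible=lambda s, v: s["gender"] == v["gender"],
    fitness=lambda s, v: abs(s["age"] - v["age"]),
    size=2,
)
print([v["name"] for _, v in ranked])  # ['B', 'D']
```

In the interactive variant the user would change `suspect`, the filter or the fitness function and call `short_list` again.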
Simple Matching • Resource and Requirement are of the same kind • Fitness = least distance between objects • String Matching • Levenshtein distance • Soundex and Metaphone • Age difference
String Matching • How close are two strings – words, DNA sequences? • Levenshtein distance • is the number of single character edits required to change one to the other using the operations of: • inserting a letter • deleting a letter • replacing a letter • E.g. • Distance(receipt,tecept) = 2 • Distance(receipt,reciept) = 2 • Need a theory of why the strings are different • Better theory for typing would be to count transposition as 1 edit instead of 2 • Better theory for texting would be to count a replace by a letter on the same key less than a letter on a different key. • mutations in DNA matching
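A minimal sketch of the Levenshtein distance, using the usual dynamic-programming table, reproduces the two examples above. The second function is an assumed illustration of the "better theory for typing": it is the optimal string alignment variant, which also counts swapping two adjacent letters as one edit. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Single-character inserts, deletes and replacements to turn a into b."""
    # previous[j] = distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # replace (free if equal)
            ))
        previous = current
    return previous[-1]


def osa_distance(a: str, b: str) -> int:
    """As levenshtein, but an adjacent transposition costs 1 edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent letters swapped: count the swap as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


print(levenshtein("receipt", "tecept"))    # 2
print(levenshtein("receipt", "reciept"))   # 2
print(osa_distance("receipt", "reciept"))  # 1
```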
Soundex and Metaphone • Surnames in English have multiple spellings for similar sounds • Wallace and Wallis, Smith and Smythe • Errors caused by similar phonetics having different spelling • Useful where sound-text transliteration occurs in data capture • e.g. Smith and Smythe • Soundex (Odell and Russell 1922) reduces every word to a letter and 3 digits – S530 for both • Metaphone (Philips 1990) is smarter about English phonetics – SM0 for both • Double Metaphone – improved, and produces two codes – one English, one ‘foreign’ • Comparison of algorithms
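The Soundex reduction can be sketched in a few lines. This follows the commonly described American Soundex rules (keep the first letter, code the consonants, collapse adjacent equal codes, let H and W be transparent, pad to four characters); it is an illustrative sketch, and real libraries may differ in edge cases. The names `CODES` and `soundex` are hypothetical.

```python
# Consonant groups that share a Soundex digit; vowels, H, W and Y have no code.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Reduce a name to its first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != last:
            result += code
        if ch not in "HW":  # H and W do not separate equal codes; vowels do
            last = code
    return (result + "000")[:4]


for pair in (("Smith", "Smythe"), ("Wallace", "Wallis")):
    print(pair, [soundex(n) for n in pair])
```

Both spellings in each pair reduce to the same code (S530 and W420), so comparing codes rather than raw strings treats them as a match.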
Matching is subjective • How close are two ages? • Is the answer different for the identity parade and a dating agency? • [Figure: distance (from 0.0) plotted against age; labels: Suspect, Volunteer, Ideal Person, Date]
Multi-valued Matching • How to combine multiple values to create a single distance? • Age and Height are different to Build, Eye-colour, Gender and Ethnic origin. • Distance in 2-D space: Sqrt(dx^2 + dy^2) • [Figure: x and y axes showing the offsets dx and dy]
Metric space • Formally, a metric space M is a set of points with an associated distance function (also called a metric) d : M × M -> R (where R is the set of real numbers). • For all x, y, z in M, this function is required to satisfy the following conditions: • d(x, y) ≥ 0 • d(x, x) = 0 • if d(x, y) = 0 then x = y (identity of indiscernibles) • d(x, y) = d(y, x) (symmetry) • d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).
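The axioms above can be checked mechanically for a candidate distance function on a finite sample of points. The sketch below is only a spot check under that assumption: passing it is evidence, not proof, that the function is a metric on the whole space. The helper `is_metric_on` and the sample points are hypothetical.

```python
from itertools import product
from math import hypot, isclose


def is_metric_on(points, d, tol=1e-9):
    """Check non-negativity, identity, symmetry and the triangle inequality."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:
            return False  # d(x, y) >= 0
        if (x == y) != (d(x, y) <= tol):
            return False  # d(x, y) = 0 exactly when x = y
        if not isclose(d(x, y), d(y, x), abs_tol=tol):
            return False  # symmetry
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False  # triangle inequality
    return True


def euclidean(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])


def squared_diff(a, b):
    return (a - b) ** 2


print(is_metric_on([(0, 0), (3, 4), (6, 8), (1, 1)], euclidean))  # True
print(is_metric_on([0, 1, 2], squared_diff))                      # False
```

The squared difference fails because d(0, 2) = 4 exceeds d(0, 1) + d(1, 2) = 2, so a sum of squares without a square root ranks candidates but does not satisfy the triangle inequality.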
Multi-attribute matching • Extract shows a simple Excel spreadsheet containing a suspect age, height and gender, and the same attributes for 10 volunteers • Representation • Age is measured in years • Height in cm • Gender is M or F • Fitness function • Calculate difference between suspect and volunteer attributes • Normalise differences to 0…1 • Multiply by weights to express importance of each attribute • Sum of squared differences as Fitness function • Best fit volunteer has minimum value for Fitness
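The fitness calculation described above can be sketched outside the spreadsheet. Everything specific here is an assumption made for illustration: the data, the scale used to normalise each difference (a difference of 40 years or 40 cm counts as 1.0), the weights, and the choice to apply the weight before squaring. Gender, being a category, contributes 0 when it matches and 1 when it does not.

```python
def normalised_diffs(suspect, volunteer, scales):
    """Per-attribute differences mapped onto the range 0..1."""
    diffs = {}
    for attr, scale in scales.items():
        if scale is None:  # category: match or no match
            diffs[attr] = 0.0 if suspect[attr] == volunteer[attr] else 1.0
        else:  # number: divide by the assumed scale and cap at 1
            diffs[attr] = min(abs(suspect[attr] - volunteer[attr]) / scale, 1.0)
    return diffs


def fitness(suspect, volunteer, scales, weights):
    """Sum of squared weighted differences; the minimum is the best fit."""
    diffs = normalised_diffs(suspect, volunteer, scales)
    return sum((weights[a] * diffs[a]) ** 2 for a in diffs)


# Hypothetical values, not the lecture's spreadsheet.
scales = {"age": 40, "height": 40, "gender": None}
weights = {"age": 1.0, "height": 1.0, "gender": 2.0}
suspect = {"age": 30, "height": 180, "gender": "M"}
volunteers = {
    "V1": {"age": 32, "height": 178, "gender": "M"},
    "V2": {"age": 30, "height": 180, "gender": "F"},
    "V3": {"age": 50, "height": 165, "gender": "M"},
}

scores = {name: fitness(suspect, v, scales, weights) for name, v in volunteers.items()}
best_fit = min(scores, key=scores.get)
print(best_fit)  # V1
```

V2 matches the suspect exactly on age and height but scores worst, because the gender mismatch carries the largest weight.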
Scales of Measurement • Nominal – names or categories • E.g. Eye-colour, Ethnic origin, Telephone number, ISBN • Valid operations: =, not = • Partly Ordered Scales, e.g. grandparent, parent, uncle, child, cousin • Pairs are ordered but no overall ordering • Ordinal – ranks • E.g. 1, 2, 3 in the Derby; 1st, 2.1, 2.2, 3rd class; slight, medium, heavy build • Valid operations: <, =, > • Invalid operations: +, - (the gap between 1 and 2 is not the same as between 2 and 3) • Non-parametric statistics may apply • Interval – arbitrary zero value • E.g. Temperature in degrees F, date in Julian Calendar • Valid Op: - (minus) • Invalid: +, * (but differences are Ratio) • Ratio • E.g. Length, age • Valid Ops: +, *, /, standard statistical operations • Multi-dimensional scales (index numbers) • E.g. Miles/gallon, IQ • Compound of several scales of measurement
Suspect/Volunteer attributes • Nominal – names or codes • Ordinal – ranks • Interval – no true zero (arbitrary zero value) • Ratio
Transforming and Scaling • To combine different attributes, we need to transform Nominal, Ordinal and Interval values to Ratio scales • This cannot be done objectively, so judgement involved • Scaling and weights need to be adjustable to fine-tune matching • => Learning Frame (later)
Sensitivity Analysis • Arbitrary weights can be adjusted to see what effect their variation has on the final selection • ? How much would each weight have to change before the first choice is demoted? • Excel analysis
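The question of how far a weight must move before the first choice is demoted can be explored with a simple scan. This is a sketch under assumptions, not the Excel analysis itself: the normalised differences, the weights and the trial factors are hypothetical, and fitness is taken as the sum of squared weighted differences with the minimum as best fit.

```python
def best(diffs, weights):
    """Name of the candidate with the lowest fitness value."""
    def fit(name):
        return sum((weights[a] * d) ** 2 for a, d in diffs[name].items())
    return min(diffs, key=fit)


def tipping_factor(diffs, weights, attr, factors):
    """First factor applied to one weight that changes the first choice."""
    baseline = best(diffs, weights)
    for f in factors:
        trial = dict(weights, **{attr: weights[attr] * f})
        if best(diffs, trial) != baseline:
            return f
    return None  # the first choice survived every factor tried


# Hypothetical normalised differences (0..1) from the suspect.
diffs = {
    "V1": {"age": 0.1, "height": 0.6},
    "V2": {"age": 0.5, "height": 0.1},
}
weights = {"age": 1.0, "height": 1.0}

print(best(diffs, weights))                                             # V2
print(tipping_factor(diffs, weights, "age", [1.1, 1.2, 1.3, 1.5, 2.0]))  # 1.3
print(tipping_factor(diffs, weights, "height", [0.9, 0.8, 0.7]))         # 0.8
```

With these numbers a 30% increase in the age weight, or a 20% decrease in the height weight, is enough to replace V2 with V1, which suggests the selection is fairly sensitive to both.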