Data Mining, Information Theory and Image Interpretation
Sargur N. Srihari
Center of Excellence for Document Analysis and Recognition and Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY 14260 USA
Data Mining • Search for Valuable Information in Large Volumes of Data • Knowledge Discovery in Databases (KDD) • Discovery of Hidden Knowledge, Unexpected Patterns and new rules from Large Databases
Information Theory
• Definitions of Information:
  - Communication Theory: Entropy (Shannon-Weaver), Stochastic Uncertainty, Bits
  - Information Science: Data of Value in Decision Making
Image Interpretation • Use of knowledge in assigning meaning to an image • Pattern Recognition using Knowledge • Processing Atoms (Physical) as Bits (Information)
Address Interpretation Model
• [Diagram: Address image (x) is passed to Address Interpretation (AI), which outputs the Interpretation I(x); other labeled elements: Knowledge source (K), Mail stream (S), Postal address Directory (D)]
Typical American Address • Address Directory Size: 139 million records
Assignment Strategies
• [Figure: typical street address; labels: ZIP Code: 14221, Primary number: 276, Database query, Results, Word Recognizer selects (after lexicon expansion), Address encoding, Delivery point: 142213557]
Australian Address Delivery Point ID: 66568882 Postal Directory Size: 9.4 million records
Canadian Address Postal code: H1X 3B3 Postal Directory: 12.7 million records
United Kingdom Address Postcode: TN23 1EU (unique postcode) Delivery Point Suffix: 1A (default) Address Directory Size: 26 million records
Motivation for Information Theoretic Study • Understand information interaction in postal address fields to overcome uncertainty in fields • Compare the efficiency of assignment strategies • Rank processing priority for determining a component value • Select most effective component to help recover an ambiguous component
Address Fields in a US Postal Address
• Example address:
  Sargur N. Srihari
  520 Lee Entrance STE 202
  Amherst NY 14228 - 2583
• Address fields:
  - f1 city name (Amherst)
  - f2 state abbr. (NY)
  - f3 5-digit ZIP Code (14228)
  - f4 4-digit ZIP+4 add-on (2583)
  - f5 primary number (520)
  - f6 street name (Lee Entrance)
  - f7 secondary designator abbr. (STE)
  - f8 secondary number (202)
• Delivery point: 142282583
Probability Distribution of Street Name Lexicon Size | f6 |
• Given the ZIP Code: | f3 | = 41,840
  - No. of ZIPs with | f6 | = 1: 6,264 (14.97%)
  - Mean | f6 | = 95.04; Max | f6 | = 1,513
• Given the (ZIP, primary number) pair: | (f3 , f5 ) | = 49,347,888
  - No. of (ZIP, pri) pairs with | f6 | = 1: 34,102,092 (69.11%)
  - Mean | f6 | = 2.21; Max | f6 | = 542
• [Figure: two plots of size of street name lexicon against log (Number of ZIP Codes) and log (No. of (ZIP, primary) pairs); marked points (3.80, 1) and (7.53, 1)]
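The lexicon-size statistics on this slide count, for each ZIP Code or (ZIP, primary number) pair, how many distinct street names remain possible. A minimal sketch of that count follows; the records, field names and the helper `lexicon_sizes` are hypothetical and are not taken from the delivery point file.

```python
from collections import defaultdict

def lexicon_sizes(records, key_fields, target_field):
    """Map each combination of key-field values to the number of distinct target values."""
    groups = defaultdict(set)
    for record in records:
        key = tuple(record[k] for k in key_fields)
        groups[key].add(record[target_field])
    return {key: len(values) for key, values in groups.items()}

# Hypothetical toy records (assumption: one dict per delivery point).
toy_records = [
    {"zip": "14228", "primary": "520", "street": "LEE ENTRANCE"},
    {"zip": "14228", "primary": "100", "street": "MAPLE RD"},
    {"zip": "14228", "primary": "100", "street": "OAK ST"},
    {"zip": "14221", "primary": "276", "street": "MAIN ST"},
]

# |f6| given the ZIP Code alone, and given the (ZIP, primary number) pair.
by_zip = lexicon_sizes(toy_records, ["zip"], "street")
by_zip_primary = lexicon_sizes(toy_records, ["zip", "primary"], "street")

for key, size in by_zip_primary.items():
    print(key, size)
```

Knowing the primary number as well as the ZIP Code shrinks the street name lexicon, which is the effect the two distributions on the slide compare.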
Definitions
• A component c is an address field fi, a portion of fi (e.g., a digit), or a combination of components.
1. Entropy: H(x) = information provided by component x (assuming uniform distribution)
   $H(x) = \log_2 |x|$ bits, where |x| is the number of values x can take
2. Conditional entropy: Hx(y) = uncertainty of component y when component x is known
   $H_x(y) = -\sum_i \sum_j p_{ij} \log_2 \frac{p_{ij}}{\sum_k p_{ik}}$
   where $x_i$ is a value of component x, $y_j$ is a value of component y, and $p_{ij} = p(x_i, y_j)$ is their joint probability
3. Redundancy of component x to y:
   $R_x(y) = \frac{H(x) + H(y) - H(x, y)}{H(y)}$, with $0 \le R_x(y) \le 1$
   A higher value of Rx(y) indicates that more of the information in y is shared by x.
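The three measures can be estimated from a set of address records. The sketch below is one possible reading, not the author's implementation: it uses the empirical joint distribution of the records, relies on the standard identity Hx(y) = H(x, y) - H(x), and runs on made-up records with hypothetical field names.

```python
import math
from collections import Counter

def entropy_uniform(values):
    """H(x) = log2 |x|, where |x| is the number of distinct values (uniform assumption)."""
    return math.log2(len(set(values)))

def joint_entropy(records, fields):
    """Shannon entropy in bits of the empirical joint distribution over `fields`."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    n = len(records)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(records, known, target):
    """H_x(y) = H(x, y) - H(x): uncertainty left in `target` once `known` is given."""
    return joint_entropy(records, known + target) - joint_entropy(records, known)

def redundancy(records, x, y):
    """R_x(y) = (H(x) + H(y) - H(x, y)) / H(y)."""
    hx = joint_entropy(records, x)
    hy = joint_entropy(records, y)
    hxy = joint_entropy(records, x + y)
    return (hx + hy - hxy) / hy

# Hypothetical toy records: city A has two ZIP values, city B has one.
toy = [
    {"city": "A", "zip": "1"},
    {"city": "A", "zip": "2"},
    {"city": "B", "zip": "3"},
    {"city": "B", "zip": "3"},
]

h_zip_given_city = conditional_entropy(toy, ["city"], ["zip"])
r_city_zip = redundancy(toy, ["city"], ["zip"])
r_zip_city = redundancy(toy, ["zip"], ["city"])
print(h_zip_given_city, r_city_zip, r_zip_city)
```

In this toy set the ZIP value fully determines the city, so the redundancy of ZIP to city is 1, while the city only partly determines the ZIP.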
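Example of Information Measure
• [Figure: table of address records with field A, field B (sub-components B1 and B2) and field C, followed by the resulting information measures]
• Value sets: (a,b,c,d), (0,1,9), (0,1), (e,f)
• Example joint probabilities: pa10 = 1/5, pae = 2/5, etc.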
Measure of Information from National City State File, D1 (July 1997)
• Fields and value sets:
  - Field f1, City name: 39,795
  - Field f2, State abbr.: 62
  - Field f3, ZIP Code (digits f31 f32 f33 f34 f35): 42,880
• D1 = 79,343
• Measure:
  - H(x); x: any combination of f1, f2, and f3i
  - Hx(f3); x: any combination of f1, f2, and f3i
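Under the uniform-distribution definition H(x) = log2 | x |, the value-set sizes on this slide translate directly into bits. The short sketch below does that arithmetic; it assumes the three sizes map to city name, state abbreviation and ZIP Code in that order, and the dictionary keys are illustrative names only.

```python
import math

# Value-set sizes read from the slide (assumed order: city, state, ZIP Code).
value_set_sizes = {"f1_city": 39_795, "f2_state": 62, "f3_zip": 42_880}

def uniform_entropy_bits(size):
    """H(x) = log2 |x| for a component with `size` equally likely values."""
    return math.log2(size)

entropies = {name: uniform_entropy_bits(n) for name, n in value_set_sizes.items()}

for name, bits in entropies.items():
    print(f"H({name}) = {bits:.2f} bits")
```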
Measure of Information from Delivery Point Files, D2 (July 1997)
• Fields and value sets:
  - f4 (ZIP+4 add-on): 9,999
  - f5 (Primary No.): 1,155,740
  - f6 (Street name): 1,220,880
  - f7 (Secondary Abbr.): 24
  - f8 (Sec. No.): 123,829
  - f9 (Building/firm): 946,199
• D2 = 139,080,291
• Measure:
  - H(x); x: any combination of f3, f4, f5, f6, f7, f8, and f9
  - Hx(f4); x: f3 with any combination of f3 ~ f9
Measure of Information from D
• [Figure: uncertainty in ZIP Code when city, state or a digit is known; axis: uncertainty in component]
• To determine f3 (5-digit ZIP) from f1, f2 and f3i:
  - City name reduces uncertainty the most
Ranking Processing Priority for Confirming ZIP Code
• f1: City name; f2: State abbreviation; f3: ZIP Code
• [Figure: tree of the uncertainty in f3, in bits, when 1, 2, 3, 4 or 5 components are known, starting from H(f3) = 15.39; each branch is labeled with the component processed (city, state, or 1st to 5th ZIP digit). Labeled path: Hf1(f3), Hf1f35(f3), Hf1f35f34(f3), Hf1f35f34f33(f3), Hf1f35f34f33f2(f3)]
• Processing flow: city, 5th, 4th, 3rd, state
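One way to arrive at such a processing flow is a greedy search: at each step, process the component that leaves the least uncertainty in the target. The sketch below illustrates that idea only; whether the slide's ranking was produced greedily is an assumption, and the records and field names are a hypothetical toy sample.

```python
import math
from collections import Counter

def joint_entropy(records, fields):
    """Shannon entropy in bits of the empirical joint distribution over `fields`."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    n = len(records)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(records, known, target):
    """H_known(target) = H(known, target) - H(known)."""
    return joint_entropy(records, known + [target]) - joint_entropy(records, known)

def greedy_order(records, candidates, target):
    """Repeatedly pick the component that minimizes the remaining uncertainty in `target`."""
    known, order = [], []
    remaining = list(candidates)
    while remaining:
        best = min(remaining, key=lambda c: conditional_entropy(records, known + [c], target))
        known.append(best)
        remaining.remove(best)
        order.append((best, conditional_entropy(records, known, target)))
    return order

# Hypothetical toy sample of (city, state, ZIP) records.
toy = [
    {"city": "Amherst", "state": "NY", "zip": "14228"},
    {"city": "Amherst", "state": "NY", "zip": "14226"},
    {"city": "Buffalo", "state": "NY", "zip": "14260"},
    {"city": "Buffalo", "state": "NY", "zip": "14214"},
    {"city": "Amherst", "state": "MA", "zip": "01002"},
]

flow = greedy_order(toy, ["state", "city"], "zip")
for component, bits_left in flow:
    print(f"{component}: {bits_left:.3f} bits of ZIP uncertainty remain")
```

On this toy sample the city is chosen before the state, in line with the slide's observation that the city name reduces ZIP Code uncertainty the most.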
Modeling Processing Cost
• For component y:
  - Location rate = l(y), 0 <= l(y) <= 1
  - Recognition rate = r(y), 0 <= r(y) <= 1
  - Processing speed = s(y), in msec
  - Existence rate = e(y), 0 <= e(y) <= 1
  - Patron rate = p(y), 0 <= p(y) <= 1
  - Lexicon size of y, given x: $|y_x| = 2^{H(x,y) - H(x)}$
• Cost of processing component y given component x:
  $\mathrm{Cost}_x(y) = H_x(y) \cdot \frac{(1 + \log |y_x|) \cdot s(y)}{l(y) \cdot r(y) \cdot e(y) \cdot p(y)}$
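A sketch of how this cost could be evaluated follows. It assumes the formula reads as Hx(y) multiplied by (1 + log | yx |) * s(y) and divided by l(y) * r(y) * e(y) * p(y), and that the logarithm is base 2; both are assumptions about the slide's layout. The `ComponentRates` class and the numbers are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class ComponentRates:
    location: float     # l(y), between 0 and 1
    recognition: float  # r(y), between 0 and 1
    existence: float    # e(y), between 0 and 1
    patron: float       # p(y), between 0 and 1
    speed_ms: float     # s(y), processing speed in msec

def lexicon_size(h_xy, h_x):
    """|y_x| = 2 ** (H(x, y) - H(x))."""
    return 2 ** (h_xy - h_x)

def processing_cost(h_y_given_x, lex_size, rates):
    """Cost_x(y) under the assumed reading of the slide's fraction (log taken base 2)."""
    numerator = (1 + math.log2(lex_size)) * rates.speed_ms
    denominator = rates.location * rates.recognition * rates.existence * rates.patron
    return h_y_given_x * numerator / denominator

# Hypothetical numbers: a component that is always located, recognized, present and written.
ideal = ComponentRates(location=1.0, recognition=1.0, existence=1.0, patron=1.0, speed_ms=10.0)
size = lexicon_size(h_xy=3.0, h_x=1.0)
cost = processing_cost(h_y_given_x=0.5, lex_size=size, rates=ideal)
print(size, cost)
```

Under this reading, lowering any of the rates raises the cost, and a component that leaves no uncertainty has zero cost.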
Ranking Processing Priority for Confirming ZIP Code Based on Cost
• [Figure: tree of processing costs as the 1st to 6th components are processed; each branch is labeled with the component processed (city, state, or 1st to 5th ZIP digit)]
• Processing flow based on cost: 2nd, city, 5th, 4th, 3rd, 1st
• Processing flow based on Hx(y): city, 5th, 4th, 3rd, state
Recovery of 1st ZIP-Code Digit, f31, from State Abbr. (f2) and Other ZIP-Code Digits (f32-f35)
• Example: NY ?4228 (f2 = NY; f31 unknown; f32 f33 f34 f35 = 4228)
• Usage: If recognition of a component (e.g., f31) fails, it is most likely to be recovered from the known component that has the largest redundancy with it (here, f2).
• There are 62 state abbreviations. For 60 of them, the 1st ZIP-Code digit is unique. For NY and TX, there are two valid 1st ZIP-Code digits.
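The recovery step can be sketched as a directory lookup: keep every directory entry that agrees with the known state and the known digits, then read off the first digits that remain. The three-entry directory below is an illustrative sample and the function name is hypothetical.

```python
def recover_first_digit(directory, state, last_four):
    """Return the sorted first ZIP digits consistent with the state and the last four digits."""
    return sorted(
        {zip_code[0] for st, zip_code in directory
         if st == state and zip_code[1:] == last_four}
    )

# Illustrative (state, ZIP Code) sample, standing in for a postal directory.
toy_directory = [
    ("NY", "14228"),
    ("NY", "00501"),
    ("PA", "15213"),
]

candidates = recover_first_digit(toy_directory, "NY", "4228")
print(candidates)
```

A single remaining candidate means the digit is recovered; more than one means another component has to be consulted.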
Measure of Information from Mail Stream, S
• Eighteen sets of mail pieces, each from one mail processing site
• We measure
  - Information provided: H(f2), H(f3i)
  - Uncertainty of f3: Hf2(f3), Hf3i(f3)
• Each set is measured separately
• Results are reported as the average over these sets
Comparison of Results from D and S
• ZIP-Code uncertainty measured from S is lower than that measured from D
• Information from S is more effective for determining a ZIP Code
• The most effective processing flow for determining f3 from f3i and f2 is the same for S and D: f2 -> f35 -> f34 -> f33 -> f32 -> f31
UK Address Interpretation: Field Recognition & Database Query
• [Figure: address image with labeled fields: Locality, Post town/county, Outward postcode]
• Fields of interest:
  - Locality
  - Post town
  - County
  - Outward postcode
• Target: Outward postcode
• Control flow: Based on data mining
UK Address Interpretation: Last Line Parsing & Resolution
• [Flowchart. Steps: Address block image, Line segmentation, Word separation, Chaincode generation, Pre-scan digit recognition, Last line parsing (shape, syntax), Field assignment, Field recognition & Database query, Last line resolution. Decisions (Y/N): Outward postcode assigned; Other choices. Outputs: Candidate outward postcodes, Assigned outward postcode]
Discussion (Reliability of information)
• When selecting an effective processing flow for address interpretation, the prediction is accurate when the information used is representative of the current processing situation
• Using unreliable information to determine a candidate value may cause errors.
• A processing flow chosen from unreliable information is less effective.
Reliability of information
• Measure of information from D
  - Does not reflect the current processing situation
  - Full coverage of all valid values
• Measure of information from S
  - Assumes that the site-specific preceding history represents the current processing situation
  - Mail distribution could be season-specific
  - Should consider the coverage of valid samples
  - Should consider the information bias if valid samples come from the AI engine
Complexity of collecting mail information (S)
• Information from mail streams should be collected automatically, and only high-confidence information should be collected
• Address interpretation is not ideal
  - Some error cases would be collected
  - Address interpretation may always reject certain patterns of mail pieces, which biases the collected information
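The collection rule above can be sketched as a confidence filter that also reports how much mail was rejected, since a high rejection rate is one sign of the bias mentioned. The result format, the `confidence` field and the 0.95 threshold are all hypothetical; the slides do not specify them.

```python
from collections import Counter

def collect_high_confidence(results, threshold=0.95):
    """Count ZIP Codes from results at or above `threshold`; also return the rejected share."""
    accepted = [r for r in results if r["confidence"] >= threshold]
    counts = Counter(r["zip"] for r in accepted)
    reject_rate = 1 - len(accepted) / len(results) if results else 0.0
    return counts, reject_rate

# Hypothetical address interpretation outputs for four mail pieces.
toy_results = [
    {"zip": "14228", "confidence": 0.99},
    {"zip": "14228", "confidence": 0.97},
    {"zip": "14221", "confidence": 0.50},
    {"zip": "14260", "confidence": 0.96},
]

zip_counts, rejected_share = collect_high_confidence(toy_results)
print(zip_counts, rejected_share)
```

In this toy run the only piece for 14221 is rejected, so that ZIP Code is missing from the collected counts, which illustrates how systematic rejection can distort the measured distribution.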
Conclusion • Information content of postal addresses can be measured • The efficiency of assignment strategies can be compared • Redundancy of two components can be measured • An uncertain component has higher probability of recovery when another component with larger redundancy is known • Information measure can suggest most effective processing flow • Information Theory is an effective tool for Data Mining