On robustness of the supervised multiclass classifier for autocoding system

Yukako Toko, Shinya Iijima, Mika Sato-Ilic
National Statistics Center, Japan; University of Tsukuba, Japan
NTTS 2019, March 2019

Classifier for Autocoding for the Family Income and Expenditure Survey
Conventional Classifier: One Feature into One Class
Problem: Uncertainty of Training Data
* Semantic problem
* Interpretation problem
* Insufficiently detailed input information
Development of Overlapping Classifier: One Feature into Multiple Classes
* Uncertainty from data: Probability Measure
* Uncertainty from latent classification structure in data: Fuzzy Measure
* Utilized the idea of Fuzzy Partition Entropy
New Reliability Score Considering Uncertainties from Both Measures
* Utilizes the difference between the measures of uncertainty
(Y. Toko, S. Iijima, M. Sato-Ilic, 2018)

Classifier for Autocoding for the Family Income and Expenditure Survey
Uncertainty of Training Data
* Semantic problem
* Interpretation problem
* Insufficiently detailed input information
Overlapping Classifier based on Reliability Score (Y. Toko, S. Iijima, M. Sato-Ilic, 2018)
Necessity of a Guarantee based on the Algorithm
* Investigation of the Robustness of the Classifier

System structure
[Figure: system structure. Training process: training dataset -> feature extraction -> feature frequency table. Classification process (overlapping classification): input -> feature extraction -> candidates retrieval -> Reliability Score calculation -> output.]
* Consideration of Uncertainty of Frequency Table (Teacher)
* Consideration of Status of Classification Structure of Each Feature
To address the unrealistic restriction that one feature is classified to a single class (Feature A -> Class X), an algorithm was proposed that allows one feature to be classified to multiple classes (Feature A -> Class X and Class Y).

Overlapping Classifier based on Reliability Score
Step 1: Calculate the probability of the j-th feature (j = 1, …, J) to a class k (k = 1, …, K), using the number of text descriptions in class k with the j-th feature in the training dataset.
Step 2: Determine at most a fixed number of promising candidate classes for each feature based on the probabilities. Arrange the probabilities in descending order and keep the leading ones. When there are equal values, select as many different codes as possible for each feature j.
Step 3: Calculate the Reliability Score (degree of reliability) of the j-th feature to a class k.
* Probability: probability of feature j to class k
* Fuzzy: classification status of feature j over the classes, obtained by a transformation of the probabilities of feature j
* The score explains the uncertainty of the training data and utilizes the difference between the measurements of uncertainty
When the number of target text descriptions is T and each text description includes several features, the reliability score of the j-th feature included in the l-th text description to a code k is defined for each l.
Step 4: Determine the top L candidate classes.
[Equations on this slide were not preserved.]
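The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the slide's formulas are missing, so the probability is assumed to be the class count divided by the feature's total count, the "classification status" is assumed to be a partition-coefficient-style sum of squared probabilities, and scores from several features are assumed to be combined by taking the maximum per class. All function and parameter names (`train_frequency_table`, `max_candidates`, `top_l`, and so on) are hypothetical.

```python
from collections import defaultdict


def train_frequency_table(training_data):
    """Count, per feature, how many text descriptions of each class contain it."""
    freq = defaultdict(lambda: defaultdict(int))
    for features, label in training_data:
        for feature in set(features):
            freq[feature][label] += 1
    return freq


def promising_candidates(class_counts, max_candidates):
    """Steps 1-2: probabilities (assumed count / total), sorted descending, truncated."""
    total = sum(class_counts.values())
    probs = {k: n / total for k, n in class_counts.items()}
    ranked = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:max_candidates]


def reliability_scores(candidates):
    """Step 3 (ASSUMED form): probability times a sum-of-squares status."""
    status = sum(p * p for _, p in candidates)
    return {k: p * status for k, p in candidates}


def classify(features, freq, max_candidates=3, top_l=2):
    """Step 4: top L classes; per-class max over features is an assumption."""
    best = {}
    for feature in features:
        if feature not in freq:
            continue  # unseen feature: no candidates to retrieve
        candidates = promising_candidates(freq[feature], max_candidates)
        for k, score in reliability_scores(candidates).items():
            best[k] = max(best.get(k, 0.0), score)
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_l]


training = [
    (["milk", "fresh"], "dairy"),
    (["milk"], "dairy"),
    (["milk", "tea"], "beverage"),
    (["tea"], "beverage"),
]
freq = train_frequency_table(training)
print(classify(["milk"], freq))
```

In this toy data the feature "milk" appears in two classes, so it yields two ranked candidates, which is the overlapping behaviour the slide describes; a feature seen in only one class yields a single candidate with the highest possible score.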

Robustness for the Classifier
Step 1: Code assignment: extract features, retrieve candidate classes, and calculate the reliability score.
Step 2: Generate normal random numbers.
Step 3: Calculate the perturbed values using the random numbers, and determine the promising candidates for each feature based on the calculated values.
Step 4: Calculate the reliability scores based on the perturbed values.
Step 5: Determine the top classes based on the reliability scores.
Step 6: Set a different σ and repeat Steps 2 to 5.
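A minimal sketch of this perturbation experiment for a single feature follows. It assumes the noise is additive, zero-mean normal noise with standard deviation σ applied to each class probability, and that classes whose perturbed value is negative are dropped; the scoring form (value times a sum-of-squares status) is likewise an assumption, and every name here (`perturb_probabilities`, `robustness_run`, `seed`) is hypothetical.

```python
import random


def perturb_probabilities(probs, sigma, rng):
    """Steps 2-3: add N(0, sigma^2) noise (assumed); drop negative results."""
    noisy = {k: p + rng.gauss(0.0, sigma) for k, p in probs.items()}
    return {k: v for k, v in noisy.items() if v >= 0.0}


def top_classes(probs, max_candidates, top_l):
    """Steps 3-5: keep promising candidates, score them, return the top classes."""
    ranked = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))[:max_candidates]
    status = sum(v * v for _, v in ranked)  # ASSUMED classification status
    scored = sorted(((k, v * status) for k, v in ranked),
                    key=lambda kv: (-kv[1], kv[0]))
    return [k for k, _ in scored[:top_l]]


def robustness_run(probs, sigmas, max_candidates=3, top_l=1, seed=0):
    """Step 6: for each sigma, report whether the top classes are unchanged."""
    rng = random.Random(seed)
    baseline = top_classes(probs, max_candidates, top_l)
    results = {}
    for sigma in sigmas:
        noisy = perturb_probabilities(probs, sigma, rng)
        results[sigma] = top_classes(noisy, max_candidates, top_l) == baseline
    return results


print(robustness_run({"dairy": 0.7, "beverage": 0.3}, [0.0, 0.05, 0.5]))
```

With σ = 0 the assignment is unchanged by construction; larger σ values show how much noise the ranking tolerates before the top class flips.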

Result for Robustness of the Classifier
Data: Family Income and Expenditure Survey
* Only foodstuff and dining-out data were used
* We assigned 11 classification codes for this experiment
* Data size: approx. 520 million instances (approx. 450 million for training, approx. 65 million for evaluation)
* Data size: approx. 11,000 instances (approx. 10,000 for training, approx. 1,000 for evaluation)
[Figures (two): classification accuracy for each σ]
[Figure: difference of classification accuracy compared to the normal classification]
*Note: classes with a negative perturbed value were neglected among the promising classes for each feature (on average, 5.68% of the selected promising classes were neglected over all σ).
[Figure: status for each σ when the perturbed values are negative]

Result for Robustness of the Classifier
[Figure: difference of classification accuracy compared to the normal classification]
The difference of accuracy compared to the normal classification is computed from two counts:
* the number of text descriptions that match the i-th candidate class under the n-th different σ
* the number of text descriptions that match the i-th candidate class under the normal classification
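The comparison described above can be illustrated as follows. The slide's formula is not available, so this sketch assumes the difference is simply the accuracy at the i-th candidate under a given σ minus the accuracy at the i-th candidate under the normal classification, with both match counts divided by the number of text descriptions. The function names and the toy data are hypothetical.

```python
def accuracy(predictions, truths, i=1):
    """Share of text descriptions whose i-th candidate (1-based) is the true class."""
    matches = sum(
        1
        for candidates, truth in zip(predictions, truths)
        if len(candidates) >= i and candidates[i - 1] == truth
    )
    return matches / len(truths)


def accuracy_difference(perturbed, normal, truths, i=1):
    """ASSUMED definition: perturbed accuracy minus normal accuracy."""
    return accuracy(perturbed, truths, i) - accuracy(normal, truths, i)


truths = ["a", "b", "a", "c"]
normal = [["a"], ["b"], ["a"], ["c"]]     # candidates without perturbation
perturbed = [["a"], ["a"], ["a"], ["c"]]  # candidates under one value of sigma
print(accuracy_difference(perturbed, normal, truths))
```

A value near zero across all σ would correspond to the stable code assignments reported in the summary.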

Summary
* Developed an overlapping classifier based on Reliability Scores
* Assumption: the difficulty of autocoding some text descriptions is caused by uncertainty of the training data
* Consideration of both uncertainties: uncertainty from data and uncertainty obtained from the latent classification structure in data
* Robustness of the classifier based on the Reliability Scores: the classifier performs stable code assignments

Reference
[1] Y. Toko, S. Iijima, M. Sato-Ilic, Overlapping classification for autocoding system, Journal of Romanian Statistical Review, Vol. 4, (2018), 58-73.
[2] W. Hacking and L. Willenborg, Coding; interpreting short descriptions using a classification, Statistics Methods (2012), Statistics Netherlands, The Hague, Netherlands, Available at: https://www.cbs.nl/en-gb/our-services/methods/statistical-methods/throughput/throughput/coding (accessed August 2018).
[3] H. Gweon, M. Schonlau, L. Kaczmirek, M. Blohm, S. Steiner, Three methods for occupation coding based on statistical learning, Journal of Official Statistics, Vol. 33, No. 1, (2017), 101-122.
[4] Y. Toko, K. Wada, M. Kawano, A supervised multiclass classifier for an autocoding system, Journal of Romanian Statistical Review, Vol. 4, (2017), 29-39.
[5] H. Tsubaki, K. Wada, Y. Toko, An extension of Taguchi's T method and standardized misclassification rate for supervised classification with only binary inputs, presented in the poster session at the ANQ Congress, Kathmandu, Nepal (2017).
[6] T. Shimono, K. Wada, Y. Toko, A supervised multiclass classifier using machine learning algorithm for autocoding, Research Memoir of Official Statistics, Vol. 75, (2018), 41-60 (in Japanese).
[7] Y. Toko, K. Wada, S. Iijima, M. Sato-Ilic, Supervised multiclass classifier for autocoding based on partition coefficient, Intelligent Decision Technologies 2018, Smart Innovation, Systems and Technologies, Springer, Switzerland, Vol. 97, (2018), 54-64.
[8] J. C. Bezdek, J. Keller, R. Krisnapuram, N.R. Pal, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer Academic Publishers, New York (1999).
[9] T. Kudo, K. Yamamoto, Y. Matsumoto, Applying conditional random fields to Japanese morphological analysis, in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (2004), 230-237.

Thank you for your attention!
