This study examines the robustness of a supervised multiclass classifier for autocoding in the Family Income and Expenditure Survey, considering the uncertainties of training data and the classification structure of each feature. An overlapping classifier based on reliability scores is developed to address the semantic and interpretation problems caused by insufficiently detailed input information. The proposed algorithm allows for the assignment of one feature to multiple classes, resulting in stable code assignments. The results demonstrate the robustness of the classifier in handling uncertain data.
On robustness of the supervised multiclass classifier for autocoding system
Yukako Toko, Shinya Iijima, Mika Sato-Ilic
National Statistics Center, Japan; University of Tsukuba, Japan
NTTS 2019, March 2019
Classifier for Autocoding for the Family Income and Expenditure Survey
Conventional classifier: one feature into one class
Problem: uncertainty of training data
* Semantic problem
* Interpretation problem
* Insufficiently detailed input information
Development of an overlapping classifier: one feature into multiple classes
* Utilized the idea of fuzzy partition entropy
* Uncertainty from data: probability measure
* Uncertainty from latent classification structure in data: fuzzy measure
New reliability score considering uncertainties from both measures
* Utilizes the difference of measures for uncertainties
(Y. Toko, S. Iijima, M. Sato-Ilic, 2018)
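To make the idea of measuring uncertainty in a feature's class distribution concrete, the following is a minimal sketch of two standard indices from fuzzy clustering (the partition entropy and partition coefficient described in Bezdek et al. [8]), applied here to the class probabilities of a single feature. The function names and the normalization to log base K are assumptions for illustration; the slides do not give the formulas, and this is not claimed to be the authors' reliability score.

```python
import math

def partition_entropy(p):
    """Entropy of one feature's class distribution, normalized to [0, 1].

    p is a list of class probabilities summing to 1. The log base is the
    number of classes K (an assumption made here for normalization).
    """
    k = len(p)
    if k < 2:
        return 0.0
    return -sum(x * math.log(x, k) for x in p if x > 0)

def partition_coefficient(p):
    """Sum of squared probabilities: 1 when crisp, 1/K when uniform."""
    return sum(x * x for x in p)

# A feature seen almost only in one class versus one spread over classes.
crisp = [1.0, 0.0, 0.0]
spread = [1 / 3, 1 / 3, 1 / 3]
print(partition_entropy(crisp), partition_coefficient(crisp))
print(partition_entropy(spread), partition_coefficient(spread))
```

A feature concentrated in one class gives low entropy and a high coefficient; a feature spread evenly over classes gives the opposite, which is the kind of feature an overlapping classifier would assign to several classes.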
Classifier for Autocoding for the Family Income and Expenditure Survey
Uncertainty of training data
* Semantic problem
* Interpretation problem
* Insufficiently detailed input information
Overlapping classifier based on reliability score
Necessity of a guarantee based on the algorithm
* Investigation of robustness of the classifier
(Y. Toko, S. Iijima, M. Sato-Ilic, 2018)
System structure
* Training process: training dataset -> feature extraction -> feature frequency table
* Classification process (overlapping classification): input -> feature extraction -> candidates retrieval -> reliability score calc. -> output
* Consideration of uncertainty of the frequency table (teacher)
* Consideration of status of classification structure of each feature
To address the unrealistic restriction that one feature is classified to a single class, an algorithm was proposed that allows one feature to be classified to multiple classes (for example, Feature A to both Class X and Class Y).
Overlapping Classifier based on Reliability Score
Step 1: Calculate the probability of the j-th feature (j = 1, …, J) to a class k (k = 1, …, K) as […]
* […]: number of text descriptions in a class k with the j-th feature in the training dataset
Step 2: Determine at most […] promising candidate classes for each feature based on […]
* Arrange […] in descending order and create […], such as […]
* When there are equal values in […], select as many different codes as possible for each feature j
Step 3: Calculate the reliability score
* Degree of reliability: explanation of the uncertainty of the training data
* […]: reliability score of the j-th feature to a class k, utilizing the difference of measurements of uncertainty
* Transformation from […] (probability of feature j to class k) to […] (fuzzy classification status of feature j over the classes)
* When the number of target text descriptions is T, and each text description includes […] features, the corresponding […] for the l-th text description can be represented as […]: the reliability score of the j-th feature included in the l-th text description to a code k
Step 4: Determine the top L ([…]) candidate classes
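Steps 1 and 2 can be sketched as follows. This is a minimal illustration, assuming the probability of a feature to a class is the class's share of that feature's counts in the frequency table (the formula itself is not preserved in the transcript). The function names, the example features, and the class labels are hypothetical. Ties are broken here by class label, which is a simplification of the slide's rule for equal values. The sketch stops before Step 3 because the reliability-score formula is not available.

```python
def class_probabilities(freq):
    """Step 1 (assumed form): freq[feature][cls] holds the number of
    training text descriptions in class cls that contain the feature.
    Returns probs[feature][cls] = count / total count for the feature."""
    probs = {}
    for feature, counts in freq.items():
        total = sum(counts.values())
        if total == 0:
            continue  # feature never observed; no probabilities to assign
        probs[feature] = {cls: n / total for cls, n in counts.items()}
    return probs

def promising_candidates(probs_for_feature, max_candidates):
    """Step 2: arrange probabilities in descending order and keep at most
    max_candidates classes. Ties are ordered by class label (assumption)."""
    ranked = sorted(probs_for_feature.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:max_candidates]

# Hypothetical frequency table for two features over three class codes.
freq = {
    "milk": {"1A": 8, "1B": 2},
    "bread": {"1B": 5, "1C": 5},
}
probs = class_probabilities(freq)
for feature in freq:
    print(feature, promising_candidates(probs[feature], 2))
```

Keeping more than one candidate per feature is what lets a single feature contribute to several classes in the later scoring steps.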
Robustness for the Classifier
Step 1: Code assignment: extract features, retrieve candidate classes, and calculate the reliability score
Step 2: Generate normal random numbers as […]
Step 3: Calculate […], and determine the promising candidates for each feature based on the calculated […]
Step 4: Calculate the reliability scores based on […]
Step 5: Determine the top classes based on the reliability scores
Step 6: Set a different σ and repeat Steps 2 to 5
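The perturbation loop in Steps 2 to 6 might look like the sketch below. It assumes that zero-mean normal noise with standard deviation σ is added to each class probability of a feature and that classes whose perturbed value becomes negative are dropped from the promising candidates; the transcript does not preserve which quantity is actually perturbed, so this is an illustration only. All function names and the example values are hypothetical, and the perturbed values stand in for the reliability scores, whose formula is not available here.

```python
import random

def perturb(probs, sigma, rng):
    """Add N(0, sigma^2) noise to each class value (assumed form of
    Steps 2 and 3) and neglect classes whose result is negative."""
    noisy = {cls: p + rng.gauss(0.0, sigma) for cls, p in probs.items()}
    return {cls: v for cls, v in noisy.items() if v >= 0}

def top_classes(scores, n_top):
    """Return the n_top class labels with the highest values."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cls for cls, _ in ranked[:n_top]]

def robustness_run(probs, sigmas, n_top=2, seed=0):
    """Step 6: repeat the perturbation for each sigma and record the
    resulting top classes for comparison with the unperturbed result."""
    rng = random.Random(seed)
    return {s: top_classes(perturb(probs, s, rng), n_top) for s in sigmas}

# Hypothetical class probabilities for one feature.
probs = {"1A": 0.7, "1B": 0.2, "1C": 0.1}
print("baseline:", top_classes(probs, 2))
print(robustness_run(probs, [0.0, 0.05, 0.2]))
```

With σ = 0 the run reproduces the unperturbed ranking, which gives a baseline against which larger σ values can be compared.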
Result for Robustness of the Classifier
Data: Family Income and Expenditure Survey
* Only foodstuff and dining-out data were used
* 11 classification codes were assigned for this experiment
* Data size: approx. 520 million instances (approx. 450 million for training, approx. 65 million for evaluation)
* Data size: approx. 11,000 instances (approx. 10,000 for training, approx. 1,000 for evaluation)
[Figure: Classification accuracy for each […]]
[Figure: Difference of classification accuracy compared to the normal classification]
[Figure: Status of […] for each […] when […] are negative values]
* Note: classes that have a negative value of […] were neglected among the promising classes for each feature (on average, 5.68% of the selected promising classes were neglected over all […])
Result for Robustness of the Classifier
[Figure: Difference of classification accuracy compared to the normal classification]
* […]: the difference of accuracy compared to the normal classification
* […]: the number of text descriptions that match the i-th candidate class under the n-th different σ
* […]: the number of text descriptions that match the i-th candidate class under the use of […]
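The comparison described by these definitions can be sketched as below. The transcript does not preserve the formula, so this sketch assumes the difference is the match count under a given σ minus the match count under the normal classification, divided by the number of text descriptions. The function names, the 0-based candidate index, and the example codes are hypothetical.

```python
def matches_at_rank(candidates, truth, i):
    """Count text descriptions whose correct code equals the i-th
    candidate class (0-based) assigned to that description."""
    return sum(
        1
        for cands, code in zip(candidates, truth)
        if i < len(cands) and cands[i] == code
    )

def accuracy_difference(perturbed, baseline, truth, i):
    """Assumed form: (matches under sigma - matches under the normal
    classification) / number of text descriptions."""
    diff = matches_at_rank(perturbed, truth, i) - matches_at_rank(baseline, truth, i)
    return diff / len(truth)

# Hypothetical correct codes and ranked candidates for four descriptions.
truth = ["A", "B", "C", "A"]
baseline = [["A", "B"], ["B", "A"], ["C"], ["A"]]
perturbed = [["B", "A"], ["B", "A"], ["C"], ["A"]]
print(accuracy_difference(perturbed, baseline, truth, 0))
print(accuracy_difference(perturbed, baseline, truth, 1))
```

A value near zero for every candidate rank and every σ would indicate that the code assignments are stable under the perturbation.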
Summary
* Developed an overlapping classifier based on reliability scores
* Assumption: the difficulty of autocoding some text descriptions is caused by uncertainty of the training data
* Consideration of both uncertainties: uncertainty from data and uncertainty obtained from latent classification structure in data
* The classifier performs stable code assignments
* Robustness of the classifier based on the reliability scores
Reference
[1] Y. Toko, S. Iijima, M. Sato-Ilic, Overlapping classification for autocoding system, Journal of Romanian Statistical Review, Vol. 4, (2018), 58-73.
[2] W. Hacking and L. Willenborg, Coding; interpreting short descriptions using a classification, Statistics Methods (2012), Statistics Netherlands, The Hague, Netherlands, Available at: https://www.cbs.nl/en-gb/our-services/methods/statistical-methods/throughput/throughput/coding (accessed August 2018).
[3] H. Gweon, M. Schonlau, L. Kaczmirek, M. Blohm, S. Steiner, Three methods for occupation coding based on statistical learning, Journal of Official Statistics, Vol. 33, No. 1, (2017), 101-122.
[4] Y. Toko, K. Wada, M. Kawano, A supervised multiclass classifier for an autocoding system, Journal of Romanian Statistical Review, Vol. 4, (2017), 29-39.
[5] H. Tsubaki, K. Wada, Y. Toko, An extension of Taguchi's T method and standardized misclassification rate for supervised classification with only binary inputs, presented in the poster session at the ANQ Congress, Kathmandu, Nepal (2017).
[6] T. Shimono, K. Wada, Y. Toko, A supervised multiclass classifier using machine learning algorithm for autocoding, Research Memoir of Official Statistics, Vol. 75, (2018), 41-60 (in Japanese).
[7] Y. Toko, K. Wada, S. Iijima, M. Sato-Ilic, Supervised multiclass classifier for autocoding based on partition coefficient, Intelligent Decision Technologies 2018, Smart Innovation, Systems and Technologies, Springer, Switzerland, Vol. 97, (2018), 54-64.
[8] J. C. Bezdek, J. Keller, R. Krishnapuram, N. R. Pal, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer Academic Publishers, New York (1999).
[9] T. Kudo, K. Yamamoto, Y. Matsumoto, Applying conditional random fields to Japanese morphological analysis, in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (2004), 230-237.
Thank you for your attention!!