Using Word Based Features for Word Clustering
Department of Electronics and Communications, Faculty of Engineering, Cairo University
Research Team: Farhan M. A. Nashwan, Prof. Dr. Mohsen A. A. Rashwan
Presented By: Farhan M. A. Nashwan
The Thirteenth Conference on Language Engineering, 11–12 December 2013
Contribution: • Reduce the vocabulary size • Increase recognition speed
Proposed Approach: Generated word image → Preprocessing and word segmentation → Grouping → Clustering → Groups and clusters for holistic recognition
Grouping: • Extract subwords (PAWs) • Extract dots and diacritics • Use them to select the group
Grouping pipeline: Generated word image → Preprocessing and word segmentation → Secondaries separation using contour analysis → Secondaries recognition using SVM → Grouping process → Groups
Grouping examples (code = (PAWs, down secondaries, upper secondaries)):
• PAW = 1, down sec. = 2 & 1, upper sec. = 2 → grouping code (1, 21, 2)
• PAW = 3, down sec. = 0, upper sec. = 2 → grouping code (3, 0, 2)
• PAW = 4, down sec. = 1 & 1, upper sec. = 1 & 2 → grouping code (4, 11, 12)
• PAW = 3, down sec. = 2, upper sec. = 2 & 1 → grouping code (3, 2, 21)
• PAW = 2, down sec. = 0, upper sec. = 2 → grouping code (2, 0, 2)
Grouping is based on: • PAWs • Down secondaries • Upper secondaries
Challenges: • Sticking • Sensitivity to noise
Treatments: • Overlapping analysis • SVM classification
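The grouping codes in the example slide can be sketched as a small helper. The function name and the digit-concatenation of secondary counts are assumptions read off the slide's examples, not the authors' implementation:

```python
def grouping_code(paw_count, down_secs, upper_secs):
    """Hypothetical sketch: build a code such as (1, 21, 2) from the number
    of PAWs and the lists of down/upper secondary counts. Concatenating the
    counts as digits is an assumption inferred from the slide's examples."""
    down = int("".join(str(d) for d in down_secs) or "0")
    upper = int("".join(str(u) for u in upper_secs) or "0")
    return (paw_count, down, upper)

print(grouping_code(1, [2, 1], [2]))  # → (1, 21, 2)
print(grouping_code(3, [], [2]))      # → (3, 0, 2)
```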
Clustering: • Complementary to grouping • The LBG algorithm is used • Performed on groups that contain many words • Euclidean distance is used
Pipeline: Groups → Feature extraction → Clustering using LBG → Clusters & groups
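A minimal numpy sketch of LBG vector quantisation with Euclidean distance, as named above. Parameter values and the stopping rule are assumptions, not the paper's settings:

```python
import numpy as np

def lbg(data, n_clusters=4, eps=0.01, n_iter=20):
    """LBG sketch: start from the global centroid, split each codeword by a
    small perturbation, then refine with nearest-neighbour (Euclidean)
    reassignment, k-means style, until the target codebook size is reached."""
    codebook = data.mean(axis=0, keepdims=True)
    labels = np.zeros(len(data), dtype=int)
    while len(codebook) < n_clusters:
        # Split every codeword into a (1+eps) / (1-eps) pair.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for k in range(len(codebook)):
                if np.any(labels == k):
                    codebook[k] = data[labels == k].mean(axis=0)
    return codebook, labels
```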
Features:
1. ICC: Image Centroid and Cells
2. DCT: Discrete Cosine Transform
3. BDCT: Block Discrete Cosine Transform
4. DCT-4B: Discrete Cosine Transform 4-Blocks
5. BDCT+ICC: hybrid BDCT with ICC
6. ICC+DCT: hybrid DCT with ICC
7. ICZ: Image Centroid and Zone
8. DCT+ICZ: hybrid DCT and ICZ
9. DTW: Dynamic Time Warping
10. Moment invariant features
Results: TABLE 1: Clustering rate of Simplified Arabic font using different features
TABLE 2: Processing time for feature extraction and clustering of Simplified Arabic font using different features
Conclusion: Words were grouped and clustered based on their holistic features:
• Recognition speed increased
• Unnecessary entries in the vocabulary were removed
• The total average time of ICC or moments (0.29 ms) is better than that of the other methods, but their clustering rates are not the best (98.69% for ICC, 82.61% for moments)
• DCT gives the best clustering rate (99.19%), but its time is the worst (~12 ms)
• Weighing the two criteria (clustering rate and time), ICC may be a good compromise
Thanks for your attention.
ICC (Image Centroid and Cells): for each cell, count the number of black pixels, the vertical transitions from black to white, and the horizontal transitions from black to white.
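The per-cell measurements above can be sketched in a few lines of numpy, assuming a binary image where 1 marks black ink; the function name and return layout are mine:

```python
import numpy as np

def icc_cell_features(cell):
    """Per-cell ICC measurements on a binary array (1 = black): the count of
    black pixels plus the vertical and horizontal black-to-white transitions."""
    black = int(cell.sum())
    # Black-to-white transitions moving down each column ...
    v_trans = int(((cell[:-1, :] == 1) & (cell[1:, :] == 0)).sum())
    # ... and moving right along each row.
    h_trans = int(((cell[:, :-1] == 1) & (cell[:, 1:] == 0)).sum())
    return [black, v_trans, h_trans]

cell = np.array([[1, 1, 0],
                 [0, 1, 0],
                 [0, 1, 1]])
print(icc_cell_features(cell))  # → [5, 1, 2]
```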
DCT:
• Apply the DCT to the whole word image.
• The features are extracted as a vector by scanning the DCT coefficients in zigzag order.
• Usually the most significant DCT coefficients are kept (160 coefficients).
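The whole-image DCT with zigzag scanning can be sketched with numpy alone; the hand-built orthonormal DCT-II basis stands in for a library transform, and the helper names are mine:

```python
import numpy as np

def dct2(img):
    """Orthonormal 2-D DCT-II built from the cosine basis matrix."""
    def basis(n):
        k = np.arange(n)[:, None]
        m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * np.arange(n) + 1) * k / (2 * n))
        m[0] /= np.sqrt(2.0)  # DC row gets the 1/sqrt(N) normalisation
        return m
    return basis(img.shape[0]) @ img @ basis(img.shape[1]).T

def zigzag(h, w):
    """JPEG-style zigzag index order over an h×w coefficient grid."""
    idx = [(i, j) for i in range(h) for j in range(w)]
    return sorted(idx, key=lambda p: (p[0] + p[1],
                                      p[0] if (p[0] + p[1]) % 2 else p[1]))

def dct_features(img, n_coef=160):
    """Keep the first n_coef DCT coefficients in zigzag order."""
    c = dct2(img)
    return np.array([c[i, j] for i, j in zigzag(*img.shape)])[:n_coef]
```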
Block Discrete Cosine Transform (BDCT):
• Apply the DCT to each cell.
• Take the average of the differences between all the DCT coefficients.
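One way the two BDCT steps above might look in numpy. The per-cell transform is passed in as a stand-in, and averaging pairwise absolute differences is my reading of "the average of the differences", not a confirmed formula:

```python
import numpy as np

def bdct_features(img, cell=(8, 8), block_dct=None):
    """BDCT sketch: tile the image into cells, apply a DCT to each cell
    (block_dct stands in for it), then average the differences between the
    cells' coefficient vectors. Assumes the image is a multiple of the cell
    size; the pairwise-difference averaging is an assumption."""
    ch, cw = cell
    blocks = np.array([block_dct(img[r:r + ch, c:c + cw]).ravel()
                       for r in range(0, img.shape[0], ch)
                       for c in range(0, img.shape[1], cw)])
    # Mean absolute difference over every pair of cell coefficient vectors.
    diffs = [np.abs(a - b).mean() for i, a in enumerate(blocks)
             for b in blocks[i + 1:]]
    return float(np.mean(diffs))
```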
Discrete Cosine Transform 4-Blocks (DCT-4B):
1. Compute the center of gravity of the input image.
2. Divide the word image into four parts, taking the center of gravity as the origin point.
3. Apply the DCT to each part.
4. Concatenate the features taken from each part to form the feature set of the given word.
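The four steps above can be sketched as follows; the per-quadrant DCT is passed in as a stand-in, and rounding the centre of gravity to pixel indices is my assumption:

```python
import numpy as np

def dct4b_features(img, extract):
    """DCT-4B sketch: split the word image into four quadrants about the
    centre of gravity of its ink pixels, run `extract` (standing in for the
    DCT step) on each, and concatenate the results."""
    ys, xs = np.nonzero(img)
    cy, cx = int(round(ys.mean())), int(round(xs.mean()))
    quads = [img[:cy, :cx], img[:cy, cx:], img[cy:, :cx], img[cy:, cx:]]
    return np.concatenate([extract(q) for q in quads])

# Hypothetical usage with a trivial stand-in feature (per-quadrant ink count):
img = np.ones((4, 4))
print(dct4b_features(img, lambda q: np.array([q.sum()])))  # → [4. 4. 4. 4.]
```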
Image Centroid and Zone (ICZ): compute the average distance between the black pixels in a given zone and the centroid of the word image.
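A numpy sketch of the zone-averaged distance just described; the 4×4 zone grid and the 0 value for empty zones are assumptions:

```python
import numpy as np

def icz_features(img, zones=(4, 4)):
    """ICZ sketch: split the binary image into zones; for each zone, average
    the Euclidean distance from its black pixels to the centroid of the whole
    word image. Empty zones contribute 0 (an assumption)."""
    ys, xs = np.nonzero(img)
    cy, cx = ys.mean(), xs.mean()
    zh, zw = img.shape[0] // zones[0], img.shape[1] // zones[1]
    feats = []
    for i in range(zones[0]):
        for j in range(zones[1]):
            zy, zx = np.nonzero(img[i * zh:(i + 1) * zh, j * zw:(j + 1) * zw])
            if zy.size == 0:
                feats.append(0.0)
            else:
                # Shift zone-local coordinates back to image coordinates.
                feats.append(float(np.hypot(zy + i * zh - cy,
                                            zx + j * zw - cx).mean()))
    return feats
```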
DTW (Dynamic Time Warping) Features:
DTW is an algorithm for measuring similarity between two sequences. The distance between two time series x1 … xM and y1 … yN is D(M, N), calculated with a dynamic-programming approach. Three types of features are extracted from the binarized images and used in our DTW technique:
• X-axis and Y-axis histogram profiles
• Profile features (upper, down, left, and right)
• Foreground/background transitions
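The D(M, N) recurrence can be written out directly; absolute difference as the local cost is an assumption for 1-D profile sequences:

```python
import numpy as np

def dtw_distance(x, y):
    """Classic DTW dynamic programme: D(i, j) = |x_i - y_j| plus the minimum
    of the three predecessor cells; D(M, N) is the warped distance between
    the two profile sequences."""
    M, N = len(x), len(y)
    D = np.full((M + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[M, N])

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # → 0.0 (the repeated 2 warps away)
```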
DTW (Dynamic Time Warping) Features:
Figure 1: The four profile features: (A) left profile, (B) up profile, (C) down profile, (D) right profile.
The Moment Invariant Features: Hu moments. Hu defined seven values, computed from central moments through order three.
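Hu's seven invariants can be computed from normalised central moments with numpy alone. This is a generic sketch of the standard formulas, not the authors' code:

```python
import numpy as np

def hu_moments(img):
    """Hu's seven invariants from normalised central moments (orders <= 3),
    computed directly from a grey or binary pixel grid."""
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    cx, cy = (x * img).sum() / m00, (y * img).sum() / m00

    def eta(p, q):  # normalised central moment
        mu = (((x - cx) ** p) * ((y - cy) ** q) * img).sum()
        return mu / m00 ** (1 + (p + q) / 2)

    e = {(p, q): eta(p, q) for p in range(4) for q in range(4)
         if 2 <= p + q <= 3}
    h1 = e[2, 0] + e[0, 2]
    h2 = (e[2, 0] - e[0, 2]) ** 2 + 4 * e[1, 1] ** 2
    h3 = (e[3, 0] - 3 * e[1, 2]) ** 2 + (3 * e[2, 1] - e[0, 3]) ** 2
    h4 = (e[3, 0] + e[1, 2]) ** 2 + (e[2, 1] + e[0, 3]) ** 2
    h5 = ((e[3, 0] - 3 * e[1, 2]) * (e[3, 0] + e[1, 2])
          * ((e[3, 0] + e[1, 2]) ** 2 - 3 * (e[2, 1] + e[0, 3]) ** 2)
          + (3 * e[2, 1] - e[0, 3]) * (e[2, 1] + e[0, 3])
          * (3 * (e[3, 0] + e[1, 2]) ** 2 - (e[2, 1] + e[0, 3]) ** 2))
    h6 = ((e[2, 0] - e[0, 2])
          * ((e[3, 0] + e[1, 2]) ** 2 - (e[2, 1] + e[0, 3]) ** 2)
          + 4 * e[1, 1] * (e[3, 0] + e[1, 2]) * (e[2, 1] + e[0, 3]))
    h7 = ((3 * e[2, 1] - e[0, 3]) * (e[3, 0] + e[1, 2])
          * ((e[3, 0] + e[1, 2]) ** 2 - 3 * (e[2, 1] + e[0, 3]) ** 2)
          - (e[3, 0] - 3 * e[1, 2]) * (e[2, 1] + e[0, 3])
          * (3 * (e[3, 0] + e[1, 2]) ** 2 - (e[2, 1] + e[0, 3]) ** 2))
    return [h1, h2, h3, h4, h5, h6, h7]
```

Because the values are rotation invariant, rotating an image by 90 degrees leaves all seven moments unchanged, which makes a convenient sanity check.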
Moments: The moment invariant descriptors are calculated and fed to the feature vector.