School of Engineering, University of Guelph
An Adaptive Implementation of a Dynamically Reconfigurable K-Nearest Neighbor Classifier On FPGA (2012)
Hanaa M. Hussain, Khaled Benkrid
School of Engineering, Edinburgh University, Edinburgh, Scotland, U.K.
{h.hussain, k.benkrid}@ed.ac.uk
Huseyin Seker
Bio-Health Informatics Research Group, De Montfort University, Leicester, England, U.K.
hseker@dmu.ac.uk
Dunia Jamma, PhD Student
Prof. Shawki Ariebi, Course Instructor
Outline • Introduction • Background of KNN • KNN and FPGA • The proposed architectures • Dynamic Partial Reconfigurable (DPR) part • The achievements • Advantages and Disadvantages • Conclusion
Introduction • K-nearest neighbour (KNN) is a supervised classification technique • Applications of KNN include data mining and image processing of satellite and medical images, among others • KNN is known to be robust and simple to implement when dealing with data of small size • KNN performs slowly when the data are large and have high dimensions • The KNN classifier is sensitive to the parameter K, the number of nearest neighbours • The label of a new query is selected by a majority vote among those K points, as formalised below.
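In symbols (a standard formulation of the rule, not copied from the slides), the predicted label of a query x is the class receiving the most votes among its K nearest neighbours N_K(x):

$$\hat{y}(x) = \arg\max_{c \in \{1,\dots,C\}} \sum_{j \in N_K(x)} \mathbf{1}[y_j = c]$$

where $\mathbf{1}[\cdot]$ is 1 when its argument holds and 0 otherwise, and ties are typically broken arbitrarily or by distance.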
[Figure: classification examples with a 1-Nearest Neighbor and a 3-Nearest Neighbor]
KNN Distance Methods • To calculate the distance between the new query and the training vectors, the Manhattan distance is used • The Manhattan distance is chosen in this work for its simplicity and lower hardware cost compared to the Euclidean distance:

$$d(X, Y_j) = \sum_{i=1}^{M} \lvert X_i - Y_{j,i} \rvert$$

Xi: the i-th element of the new query vector X; Yj,i: the i-th element of training vector Yj; M: the number of samples (elements) per vector
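For comparison (standard definitions, added here for context), the Euclidean alternative would require a multiplier per element plus a square root, whereas Manhattan needs only subtractors, absolute values, and adders:

$$d_{\text{Man}}(X,Y) = \sum_{i=1}^{M} \lvert X_i - Y_i \rvert \qquad d_{\text{Euc}}(X,Y) = \sqrt{\sum_{i=1}^{M} (X_i - Y_i)^2}$$

Since KNN only ranks distances, the square root could be dropped without changing the result, but Euclidean still costs M multiplications per distance; Manhattan avoids multipliers entirely.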
KNN and FPGA • KNN classifiers can benefit from the parallelism offered by FPGAs • Distance computation is time consuming, so the distance computation part is parallelized • They propose two adaptive FPGA architectures (A1 and A2) of the KNN classifier, and compare each of them with an equivalent implementation running on a general purpose processor (GPP) • They propose a novel dynamic partial reconfiguration (DPR) architecture of the KNN classifier for changing the value of K at run time
Used tools • Hardware implementation: • The hardware implementation targeted the ML403 platform board, which carries a Xilinx XC4VFX12 FPGA chip • JTAG cable • Xilinx PlanAhead 12.2 tool along with the Xilinx partial reconfiguration (DPR) flow • Verilog as the hardware description language • Software implementation: • Matlab (R2009b) bioinformatics toolbox • Workstation with an Intel Pentium Dual-Core E5300 running at 2.60 GHz and 3 GB RAM
The used data • Factors: M = number of training samples (elements per vector), N = number of training vectors, L = class label • Y: the trained data, an N × M matrix of training vectors, each row tagged with its label L • X: the new query, a 1 × M vector • [Figure: the Y and X matrices shown element by element]
The proposed architectures • The KNN classifier has been divided into three modular blocks (distance computation, KNN finder, and query label finder) + FIFOs • A1 architecture: M Dist PEs and K KNN PEs, for a total PE count of M + K + 1 • A2 architecture: N Dist PEs and N KNN PEs, for a total PE count of 2N + 1 • A worked sizing example follows below
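As a quick illustration (the numbers are chosen for this example, not taken from the paper), consider M = 10 samples per vector, N = 100 training vectors, and K = 3:

$$\text{A1}: M + K + 1 = 10 + 3 + 1 = 14 \text{ PEs} \qquad \text{A2}: 2N + 1 = 2 \times 100 + 1 = 201 \text{ PEs}$$

A1's size grows with the vector length M, while A2's grows with the training-set size N, which is why A1 suits problems where N >> M and A2 suits problems where N << M (see the Conclusion).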
The functionality of PEs • [Figure: internals of the two PE types. The Dist PE adds the current |Xi − Yi| term to the previous accumulative distance. The KNN PE compares two distance/label pairs (Dist1/L1 and Dist2/L2), keeps the Min, and passes the Max on.]
Distance computation • The distance computations are performed in parallel, advancing one step every clock cycle • The latency of a Dist PE is M cycles • A1: the throughput is one distance result every clock cycle • A2: the throughput is one distance result per PE every M clock cycles, after which the distances for the complete training set are ready • A hardware sketch of one Dist PE follows below
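To make the pipeline concrete, here is a minimal Verilog sketch of one distance PE in an A1-style chain, where each PE handles one vector element and adds its |Xi − Yi| term to the partial distance arriving from the previous PE. Module, port, and parameter names are illustrative, not taken from the paper.

```verilog
// Minimal sketch of one pipelined Manhattan-distance PE (A1-style chain).
// Names and widths are illustrative, not the authors'.
module dist_pe #(
    parameter B = 8,            // sample wordlength
    parameter W = 16            // accumulated-distance wordlength
) (
    input              clk,
    input      [B-1:0] x_i,     // query element, held in a register
    input      [B-1:0] y_i,     // training element, streamed from a FIFO
    input      [W-1:0] dist_in, // partial distance from the previous PE
    output reg [W-1:0] dist_out // partial distance to the next PE
);
    // |x_i - y_i| with unsigned operands: subtract the smaller from the larger
    wire [B-1:0] absdiff = (x_i > y_i) ? (x_i - y_i) : (y_i - x_i);

    // one pipeline stage per vector element
    always @(posedge clk)
        dist_out <= dist_in + absdiff;
endmodule
```

Chaining M such stages gives the M-cycle latency noted above, with a new complete Manhattan distance emerging every cycle once the pipeline is full.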
K-Nearest Neighbour Finder • This block becomes active after M clock cycles, once the first distance result is available • The function of this block is completed after M + N clock cycles • A sketch of one KNN PE follows below
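One common way to realise this block in hardware, consistent with the Min/Max behaviour sketched in the PE figure, is a chain of K compare-and-keep PEs: each stage retains the smallest distance/label pair it has seen and forwards the larger one, so after all N distances have streamed through, the chain holds the K nearest neighbours. The sketch below assumes that structure; names and widths are illustrative.

```verilog
// Minimal sketch of one KNN-finder PE: keeps the smallest distance
// (with its label) seen so far and passes the larger pair downstream.
// A chain of K such PEs ends up holding the K nearest neighbours.
module knn_pe #(
    parameter W  = 16,          // distance wordlength
    parameter LW = 3            // label wordlength
) (
    input               clk,
    input               rst,    // reinitialise before each new query
    input               valid,  // a new distance/label pair is present
    input      [W-1:0]  dist_in,
    input      [LW-1:0] label_in,
    output reg [W-1:0]  dist_out,
    output reg [LW-1:0] label_out,
    output reg          valid_out
);
    reg [W-1:0]  min_dist;
    reg [LW-1:0] min_label;

    always @(posedge clk) begin
        if (rst) begin
            min_dist  <= {W{1'b1}};      // "infinity": no neighbour seen yet
            valid_out <= 1'b0;
        end else begin
            valid_out <= valid;
            if (valid) begin
                if (dist_in < min_dist) begin
                    min_dist  <= dist_in;   // keep the smaller pair
                    min_label <= label_in;
                    dist_out  <= min_dist;  // evict the previous minimum
                    label_out <= min_label;
                end else begin
                    dist_out  <= dist_in;   // pass the larger pair on
                    label_out <= label_in;
                end
            end
        end
    end
endmodule
```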
Dynamic Partial Reconfigurable part (DPR) • The value of the K parameter is dynamically reconfigured, while N, M, B, and C stay fixed for a given classification problem • Two cores (A1): • Distance computation core - static • KNN core (KNN PE, Label PE) - dynamic • The size of the RP is made large enough to accommodate the logic resources required by the largest K • Advantages: savings in reconfiguration time and power • Difficulties: • Resource limitations, the cost, and the verification of the interfaces between the static region and the RP for all RMs
The achievement • This DPR implementation offers a 5x speed-up in the reconfiguration time of a KNN classifier on FPGA
Advantages • Flexibility, which allows the user to select the most appropriate architecture for the targeted application (available resources, performance, cost) • Enhanced performance: • Parallelism - speed-up • DPR - reduced reconfiguration time • Efficiency in terms of KNN performance, thanks to the DPR of K • Use of the Manhattan distance (simplicity and lower cost)
Disadvantages • The amount of resources used • The modest speed-up (5x) achieved by the DPR part relative to the resources and effort it requires • Area constraints in the A2 architecture and in the DPR • The latency introduced by the pipelined manner of producing the results
Conclusion • An efficient design for different KNN classifier applications • Two architectures, A1 and A2, from which the user can choose • A1 can be used to target applications whereby N >> M, whereas A2 targets applications whereby N << M • DPR part (could be improved by using ICAP) • Achievements compared to the GPP implementation: • 76x speed-up for A1 and 68x for A2 • 5x speed-up in reconfiguration time with DPR
Memory • Each FIFO is associated with one distance PE • The query vector gets streamed to the PEs and stored in registers, because its elements are required every clock cycle • Where: B is the sample wordlength, M is the number of samples, and N is the number of training vectors
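Under these definitions, the on-chip storage for the training set works out the same in both architectures, since either M FIFOs of depth N (A1) or N FIFOs of depth M (A2, see below) hold every training element once (a back-of-envelope figure, not quoted from the paper):

$$\text{Training storage} = N \times M \times B \text{ bits}$$

For example, N = 100, M = 10, and B = 8 gives 8,000 bits of FIFO storage.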
Class Label Finder • The block consists mainly of C counters, each associated with one of the class labels • The hardware resources depend on the user-defined parameters K and C • The architecture of this block is identical in A1 and A2 • A sketch of such a voting block follows below
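A minimal Verilog sketch of one way to build this voting block, assuming one neighbour label arrives per clock cycle and a running majority is tracked as the C counters fill; module and signal names are illustrative, not the authors'.

```verilog
// Minimal sketch of the class-label finder: one counter per class
// votes over the K neighbour labels. Assumes label_in < C.
module label_finder #(
    parameter C  = 4,                 // number of classes
    parameter LW = 2,                 // label wordlength
    parameter KW = 4                  // counter width (must hold up to K)
) (
    input               clk,
    input               rst,          // clear counters before voting
    input               valid,        // one neighbour label per cycle
    input      [LW-1:0] label_in,
    output reg [LW-1:0] query_label   // majority class seen so far
);
    reg [KW-1:0] count [0:C-1];
    reg [KW-1:0] best;
    integer i;

    always @(posedge clk) begin
        if (rst) begin
            for (i = 0; i < C; i = i + 1) count[i] <= 0;
            best        <= 0;
            query_label <= 0;
        end else if (valid) begin
            count[label_in] <= count[label_in] + 1;
            // track the running majority as the votes arrive
            if (count[label_in] + 1 > best) begin
                best        <= count[label_in] + 1;
                query_label <= label_in;
            end
        end
    end
endmodule
```

After the K-th valid label, query_label holds the winning class, so no separate final compare pass over the counters is needed.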
A2 Architecture • N FIFOs are used to store the training set, each with a depth of M • The class labels get streamed and stored in registers within the distance PEs • A2 requires more CLB slices than A1 when N, M, and K are the same • The first distance result becomes ready only after all samples are processed, i.e., after M clock cycles
DPR for K • Maximum BW for JTAG is 66 Mbps • Maximum BW for ICAP is 3.2 Gbps • ICAP is roughly 48x faster than JTAG (3.2 Gbps / 66 Mbps ≈ 48.5)
Dynamic Partial Reconfigurable part (DPR) • JTAG was used (BW = 66 Mbps) • Using ICAP instead would decrease the configuration time (BW = 3.2 Gbps)
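For intuition, configuration time is roughly the bitstream size divided by the interface bandwidth. With a hypothetical 200 KB partial bitstream (an illustrative size, not reported in the paper):

$$t_{\text{JTAG}} \approx \frac{200 \times 8192 \text{ bits}}{66 \times 10^6 \text{ bps}} \approx 25 \text{ ms} \qquad t_{\text{ICAP}} \approx \frac{200 \times 8192 \text{ bits}}{3.2 \times 10^9 \text{ bps}} \approx 0.5 \text{ ms}$$

The ratio stays at roughly 48x regardless of bitstream size, which is why switching to ICAP would shrink the reconfiguration overhead so sharply.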