
HOCT: A Highly Scalable Algorithm for Training Linear CRF on Modern Hardware




Presentation Transcript


  1. HOCT: A Highly Scalable Algorithm for Training Linear CRF on Modern Hardware presented by Tianyuan Chen

  2. Introduction • Conditional random fields (CRFs) are probabilistic models for segmenting and labeling sequential data, and they offer many advantages over other models. • Despite these advantages, CRFs suffer from a major drawback: training requires very large computational resources. • Training may last for days or even weeks. • To address this inefficiency, we present our training algorithm, HOCT (Highly Optimized CRF Trainer).

  3. Main Idea • We found that existing training algorithms fail to make effective use of modern hardware. • Our main idea is to leverage features of modern hardware to accelerate CRF training. • To the best of our knowledge, this is the first study to exploit modern computer architectures to accelerate CRF training.

  4. Related Work • Several methods have been proposed to address the inefficiency of CRF training. Their motivation is to reduce computation time by approximating the result of exact inference. Unlike these methods, we improve CRF training performance without affecting the final results. • Several studies have also exploited modern computer architectures to improve algorithm performance.

  5. Our Methods • We improved the performance of CRF training through the following approaches: • We improved the cache performance of CRF training by leveraging the software prefetching support of modern processors. • We utilized SIMD technology to increase the parallelism of CRF training. • We improved training performance by letting our algorithm manage disk operations itself.

  6. Prefetching • Training a CRF model requires frequent access to large matrices whose sizes range from tens to hundreds of megabytes. The access pattern to these matrices is effectively random, so a large number of cache misses occur. • Modern processors provide software prefetch instructions that allow data to be moved into the cache before it is actually used. Used properly, prefetching can hide much of the cache miss latency by overlapping it with other computation. • In our algorithm, while performing computations on some data from the matrices, we prefetch the data that will be accessed in the near future into the cache, hiding the cache miss latency (see the sketch below).
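The following is a minimal, hypothetical C sketch of this prefetching pattern; the slides do not show HOCT's actual code. The row-major matrix layout, the index array idx, and the lookahead distance PREFETCH_DIST are illustrative assumptions, and __builtin_prefetch is the GCC/Clang software-prefetch intrinsic.

    /* Sketch: overlap memory fetches with computation on randomly
     * indexed rows of a large row-major matrix. */
    #include <stddef.h>

    #define PREFETCH_DIST 8   /* assumed lookahead distance; tuned per machine */

    double process_rows(const double *matrix, size_t row_len,
                        const size_t *idx, size_t n)
    {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n) {
                /* Hint the CPU to load a future row into cache:
                 * arg 2 = 0 (read), arg 3 = 1 (low temporal locality). */
                __builtin_prefetch(&matrix[idx[i + PREFETCH_DIST] * row_len], 0, 1);
            }
            const double *row = &matrix[idx[i] * row_len];
            for (size_t j = 0; j < row_len; j++)  /* current computation */
                acc += row[j];
        }
        return acc;
    }

The prefetch distance must be large enough that the fetch completes before the row is used, but small enough that the cache line is not evicted first; in practice it is found empirically.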

  7. Prefetching

  8. SIMD • SIMD stands for "Single Instruction, Multiple Data". CPUs with SIMD support can perform basic operations on several data elements in parallel, and most modern processors provide this feature. • CRF training involves many operations on large vectors, such as addition, subtraction, and dot products. • In our algorithm, we leveraged SIMD to accelerate these large-vector computations (a sketch follows).
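Below is a minimal sketch of one such kernel, a dot product, written with SSE2 intrinsics on x86; the function name and the use of double-precision data are assumptions, not details from the slides. Two doubles are processed per instruction here; wider SIMD units (e.g. AVX) would process four or more.

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>

    double dot_simd(const double *a, const double *b, size_t n)
    {
        __m128d acc = _mm_setzero_pd();
        size_t i = 0;
        for (; i + 2 <= n; i += 2)  /* multiply-add two lanes per iteration */
            acc = _mm_add_pd(acc, _mm_mul_pd(_mm_loadu_pd(a + i),
                                             _mm_loadu_pd(b + i)));
        double lanes[2];
        _mm_storeu_pd(lanes, acc);         /* horizontal reduction */
        double sum = lanes[0] + lanes[1];
        for (; i < n; i++)                 /* scalar tail for odd n */
            sum += a[i] * b[i];
        return sum;
    }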

  9. Memory-Disk Management • For large tasks, the memory cost of training is very high due to the huge number of features. • When the memory requirement exceeds the size of physical memory, the OS uses disk space as an extension of physical memory (swapping), which causes drastic performance degradation. • In our algorithm, HOCT manages the memory-disk traffic itself: while performing computations, we write to disk the data that will not be used in the near future and read it back into a buffer when it is needed (sketched below). • This strategy improves efficiency substantially.
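The sketch below illustrates the idea in C with standard file I/O; the scratch-file layout, the block size, and the function names are hypothetical, since the slides do not show HOCT's implementation. The trainer spills a data block it will not need soon to a scratch file and reloads it into a buffer before use, instead of relying on OS swapping.

    #include <stdio.h>

    #define BLOCK_DOUBLES (1u << 20)  /* assumed block size: 8 MiB of doubles */

    /* Write one block to its fixed slot in the scratch file. */
    int spill_block(FILE *scratch, const double *block, long block_id)
    {
        if (fseek(scratch, block_id * (long)(BLOCK_DOUBLES * sizeof(double)),
                  SEEK_SET) != 0)
            return -1;
        return fwrite(block, sizeof(double), BLOCK_DOUBLES, scratch)
               == BLOCK_DOUBLES ? 0 : -1;
    }

    /* Read one block back into an in-memory buffer before it is needed. */
    int load_block(FILE *scratch, double *buffer, long block_id)
    {
        if (fseek(scratch, block_id * (long)(BLOCK_DOUBLES * sizeof(double)),
                  SEEK_SET) != 0)
            return -1;
        return fread(buffer, sizeof(double), BLOCK_DOUBLES, scratch)
               == BLOCK_DOUBLES ? 0 : -1;
    }

Because the algorithm knows its own access pattern, it can schedule these reads and writes ahead of time, whereas the OS pager only reacts to page faults after the stall has already occurred.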

  10. Memory-Disk Management

  11. Final Experimental Results

  12. Thank you!
