Deep Learning and HPC
Adam Coates
Visiting Scholar at IU Informatics
Post-doc at Stanford CS
What do we want computers to do with our data?
• Images/video: label objects ("Motorcycle"), suggest tags, image search, …
• Audio: speech recognition, music classification, speaker identification, …
• Text: web search, anti-spam, machine translation, …
Computer vision is hard!
[Slide: many visually different images, each labeled "Motorcycle".]
What do we want computers to do with our data?
• Images/video, audio, text (as above).
• Machine learning performs well on many of these problems, but it is a lot of work.
• What is it about machine learning that makes it so hard to use?
Machine learning for image classification “Motorcycle”
Why is this hard?
You see this: [photo of a motorcycle]
But the camera sees this: [a grid of raw pixel values]
Machine learning and feature representations
Input: raw image → pixel values (pixel 1, pixel 2, …) → Learning algorithm
[Plot: motorbikes and "non"-motorbikes plotted by raw pixel values (pixel 1 vs. pixel 2).]
What we want
Input: raw image → Feature representation (e.g., does it have handlebars? wheels?) → Learning algorithm
[Plot: in feature space (handlebars vs. wheels), motorbikes and "non"-motorbikes are easy to separate.]
How is computer perception done?
• Images/video: image → vision features → detection
• Audio: audio → audio features → speaker ID
• Text: text → text features → text classification, machine translation, information retrieval, …
Coming up with features is difficult, time-consuming, and requires expert knowledge. When working on applications of learning, we spend a lot of time tuning the features.
Deep Learning • Find algorithms that can learn representations/features from data. • Deep neural networks. • "Unsupervised feature learning": learn representations without knowing the task in advance.
Deep Learning • Build multi-stage pipelines from simple pieces. • Classic system: the deep neural net. • Generally: compositions of differentiable functions (see the sketch below). • Optimize the weights inside the network to give correct answers on training data, e.g., output "Motorcycle" for a motorcycle image.
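To make "compositions of differentiable functions" concrete, here is a minimal forward-pass sketch in C++. It is illustrative only, not the Stanford code: the Affine/relu names, layer sizes, and random weights are all invented for the example.

```cpp
#include <vector>
#include <algorithm>
#include <cstdlib>

// One "simple piece": an affine map y = W*x + b, with W stored row-major (out x in).
struct Affine {
    int in, out;
    std::vector<float> W, b;   // W has out*in entries, b has out entries
    std::vector<float> forward(const std::vector<float>& x) const {
        std::vector<float> y(out, 0.0f);
        for (int i = 0; i < out; ++i) {
            float s = b[i];
            for (int j = 0; j < in; ++j) s += W[i * in + j] * x[j];
            y[i] = s;
        }
        return y;
    }
};

// Another differentiable piece: elementwise ReLU nonlinearity.
static std::vector<float> relu(std::vector<float> x) {
    for (float& v : x) v = std::max(v, 0.0f);
    return x;
}

int main() {
    // Hypothetical sizes: 4 input "pixels", 8 hidden units, 2 output classes.
    Affine l1{4, 8, std::vector<float>(8 * 4), std::vector<float>(8)};
    Affine l2{8, 2, std::vector<float>(2 * 8), std::vector<float>(2)};
    for (float& w : l1.W) w = 0.01f * (std::rand() % 100);
    for (float& w : l2.W) w = 0.01f * (std::rand() % 100);

    std::vector<float> image = {0.1f, 0.5f, 0.2f, 0.9f};
    // The whole network is just a composition: scores = l2(relu(l1(image))).
    std::vector<float> scores = l2.forward(relu(l1.forward(image)));
    return scores.size() == 2 ? 0 : 1;
}
```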
Basic algorithmic components • In a loop over the entire training set: • Evaluate the deep network, usually processing a batch of training examples (e.g., 100) at once. • Compute the gradient of the loss function w.r.t. the parameters, summing gradients over the batch of examples. • Update the trainable parameters using the gradient. • A sketch of this loop appears below.
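A hedged sketch of that loop in plain C++, using a toy logistic-regression model so it stays self-contained. The data, the batch size of 3, and the learning rate are made-up placeholders; a real deep network would replace the inner gradient computation with backpropagation through the whole pipeline.

```cpp
#include <vector>
#include <cmath>
#include <cstdio>
#include <algorithm>

int main() {
    // Toy data: 2-D inputs with 0/1 labels (illustrative only).
    std::vector<std::vector<float>> X = {{0,0},{0,1},{1,0},{1,1},{2,2},{3,1}};
    std::vector<float> y = {0, 0, 0, 1, 1, 1};

    std::vector<float> w = {0.0f, 0.0f};      // trainable parameters
    float b = 0.0f;
    const float lr = 0.1f;                    // learning rate
    const size_t batch = 3;                   // e.g., 100 in practice

    for (int epoch = 0; epoch < 200; ++epoch) {            // loop over the training set
        for (size_t start = 0; start < X.size(); start += batch) {
            std::vector<float> gw = {0.0f, 0.0f};           // gradients summed over the batch
            float gb = 0.0f;
            size_t end = std::min(start + batch, X.size());
            for (size_t i = start; i < end; ++i) {
                // Evaluate the (here trivially shallow) network on one example.
                float z = w[0]*X[i][0] + w[1]*X[i][1] + b;
                float p = 1.0f / (1.0f + std::exp(-z));
                // Gradient of the logistic loss w.r.t. the parameters.
                float err = p - y[i];
                gw[0] += err * X[i][0];
                gw[1] += err * X[i][1];
                gb    += err;
            }
            // Update trainable parameters using the batch gradient.
            float n = float(end - start);
            w[0] -= lr * gw[0] / n;
            w[1] -= lr * gw[1] / n;
            b    -= lr * gb / n;
        }
    }
    std::printf("w = (%f, %f), b = %f\n", w[0], w[1], b);
    return 0;
}
```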
Scaling Up Deep Learning at Stanford • Most DL networks are built on a few primitives. • Mostly large dense matrix/vector operations (see the cuBLAS sketch below). • A few "block" matrices for widely-used cases. • Communication hidden in distributed arrays. • Most operations are hardware-friendly. • Not far from sgemm throughput. • Relatively low communication / IO needs. • But hard to avoid doing many iterations. • Have to focus on making each loop very fast.
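For reference, the kind of primitive being leaned on: one dense single-precision GEMM on the GPU via cuBLAS. This is a generic sketch, not code from the talk; the 1024-sized square matrices are arbitrary and error checking is omitted for brevity.

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <vector>

int main() {
    // Arbitrary example sizes: C (m x n) = A (m x k) * B (k x n).
    const int m = 1024, n = 1024, k = 1024;
    std::vector<float> hA(m * k, 1.0f), hB(k * n, 1.0f), hC(m * n, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(float) * m * k);
    cudaMalloc((void**)&dB, sizeof(float) * k * n);
    cudaMalloc((void**)&dC, sizeof(float) * m * n);
    cudaMemcpy(dA, hA.data(), sizeof(float) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), sizeof(float) * k * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major GEMM: the dense workhorse behind most deep-net layers.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);

    cudaMemcpy(hC.data(), dC, sizeof(float) * m * n, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```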
Scaling Up Deep Learning at Stanford • In-house MPI+CUDA infrastructure. • Up to 11.2B-parameter networks. • Typical experiment: ~14M images (ImageNet). [Coates et al., ICML 2013]
Scaling Up Deep Learning at Stanford • Duplicated “Google Brain” with 3 machines. • Compared to 1000+ machines. • Unsupervised learning from 10M YouTube frames. • Largest artificial neural nets ever trained. • 6.5x larger than previous system. … but what should we do with it!? Surprisingly hard to find a problem big enough that such models matter! [Coates et al., ICML 2013]
Applications • Building universal representations • "One neural net to rule them all." • One shared representation feeding many tasks: object recognition, localization, tagging, depth estimation, … [E.g., Collobert et al., 2011]
Applications • Autonomous driving. • 1 year × 1 Hz ≈ 30M frames (you actually have to drive for 1 year!). • Can we train from a few hundred 1080p frames per second?
Applications: why these? • High impact. • Universal representations: many applications with diffused value. • Driving: single application with high value. • Train once, deploy everywhere. • Training is hard, expensive. • Deploying is easy, cheap. • A supercomputer can generate an artifact that gets re-used by others.
Things that work • Find common cases; tightly optimize them. • Surprisingly few core pieces, e.g., ~10. • Distributed arrays • Massive time-saver; easy to think about. • Easy to save and restore from Lustre. • Load shards and sanity-check them in Matlab. • High-level language bindings • Low-level code in C++/CUDA (JIT).
Challenges • Experiment turnaround time is still long. • Maybe 3-5 experiments running at once. • Weeks for big models / big datasets. • Productivity is still much lower than, e.g., Matlab. • Lack of strong tools at every level except the lowest. • Many DL hackers are not systems hackers. • Lots of hard-won lessons are trapped in our group.
Laundry list from Stanford infrastructure
• Job control and scripting is painful
  • Zombies
  • PBS/Torque mostly works
• JIT compilation
  • JIT compile C/C++ code; flexible enough to do many things.
  • Easier to use the CUDA runtime, templatizing, etc.
    • Avoids the Driver API, which is much less convenient.
    • Easier to link with high-level languages.
  • Needs to be thread-savvy
    • Caching of compiled modules
    • Avoiding deadlocks or locking problems in the cache(s)
  • Ideally invisible to users
    • But the first use of a kernel is really slow.
• Debugging
  • Unclear what to do here. Support for common tools? NVTX, VampirTrace…?
• Distributed arrays
  • Stanford's implementation is rough; should have pursued a more standard approach.
    • MATLAB's codistributed arrays; ScaLAPACK-style arrays.
  • Multi-dimensional array with a "distributor" that maps indices to ranks (see the sketch after this list).
  • Support to re-distribute an array.
  • Support to save/load arrays even when the process grid changes.
  • Distribution-aware implementations of most functionality.
• Execution structure
  • Imperative programming is just easier (esp. with students + scientists).
  • DAGs, etc. are static and difficult to alter. Works OK for us, but many headaches.
  • CUDA streams+events semantics is really nice.
    • Solves the same problem: hide massive parallelism from the caller.
    • But allows arbitrary scheduling on the fly; easy to understand behavior as viewed by the host.
  • If you want custom functionality, you just have to write the parallel code.
    • In CUDA, you have to write the kernel.
    • For ScaLAPACK, you had to write code on top of BLACS.
  • The single-rank case should look like the 100-rank case.
    • Students can prototype single-rank; easier to think about.
• IO tools
  • We spend a lot of time writing file loaders.
    • Application-specific, but lots of boilerplate.
  • Many common cases in ML, e.g., a list of samples, where each sample = video, image, string, vector.
  • Currently difficult to handle distributed saving/loading of large arrays of data.
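As an illustration of the "distributor" idea above, here is a minimal sketch of a 1-D block distributor that maps global row indices to ranks and local offsets. It is an assumption-laden toy, not the Stanford implementation: BlockDistributor, its block-size rule, and the 10-rows/4-ranks example are invented, and a real version would take the rank count from MPI_Comm_size and also handle re-distribution.

```cpp
#include <cstdio>
#include <algorithm>

// Hypothetical 1-D block distributor: global row index -> (rank, local index).
struct BlockDistributor {
    long nglobal;   // total number of rows in the distributed array
    int  nranks;    // number of processes holding shards

    long block() const {                 // rows per rank (the last rank may hold fewer)
        return (nglobal + nranks - 1) / nranks;
    }
    int owner(long gidx) const {         // which rank stores global row gidx
        return static_cast<int>(gidx / block());
    }
    long local_index(long gidx) const {  // offset of gidx within its shard
        return gidx % block();
    }
    long shard_size(int rank) const {    // rows actually stored on a given rank
        long begin = rank * block();
        long end = std::min(nglobal, begin + block());
        return std::max(0L, end - begin);
    }
};

int main() {
    // Example: 10 rows spread over 4 ranks -> shards of 3, 3, 3, 1 rows.
    BlockDistributor dist{10, 4};
    for (long g = 0; g < dist.nglobal; ++g)
        std::printf("row %ld -> rank %d, local %ld\n",
                    g, dist.owner(g), dist.local_index(g));
    std::printf("shard sizes: %ld %ld %ld %ld\n",
                dist.shard_size(0), dist.shard_size(1),
                dist.shard_size(2), dist.shard_size(3));
    return 0;
}
```

Save/load across a changing process grid then reduces to writing shards in global order and re-querying owner()/local_index() under the new distributor, which is roughly why the abstraction pays off.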