High Performance, Big Data, Deep Learning at Scale

Hello World!

I’m not sure the title quite nails it, but we are going to talk about solving VERY big problems as fast as we possibly can using highly sophisticated techniques.  This article is a high-level overview of what you want to set up, rather than the usual step-by-step of how to set it up.  The how-to involves a ton of steps, so I thought it best to cover only the what here.

The Problem

The problem we encounter in big data, and especially in deep learning, comes down to collaboration and money.  If I were a super scientist with unlimited money, I would just get a DGX-1 and stick it under my desk with a petabyte of attached storage.  Unfortunately I am merely mortal and am not ridiculously wealthy (yet).  So I have a team of A.I. engineers and a finite budget that must stretch across a year.  That means I need two things.  First, I need a central repository of data and labels.  Second, I need a way to spin up my very expensive cloud-based compute boxes and shut them down quickly and easily.

What about giving everybody dual GPU workstations?

Well, I could do that, at $5,500 per engineer.  With 4 engineers, that comes to $22,000.  OK, everybody can have a copy of the data, and we would probably buy a router and share the data between all workstations through a distributed data system, so we have some form of redundancy there.  Here are some issues.

  • My team is distributed, so sharing happens across normal ISP networks (our data set is currently 120 GB without processing; I expect this to grow to 200 GB and then upwards of 500 GB once we onboard customers).  Syncing that much data between workstations is obviously not feasible.
  • Two Titan X cards cranking at full power on upwards of 500 GB are relatively slow compared to other options, which are actually cheaper if done right.  Not only are those options cheaper, they are more scalable as my team grows, and more sustainable.  I really don’t want to have to upgrade my team’s GPU boxes every few years.
  • I don’t have $22,000.
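To put a number on the first issue, here is a rough back-of-the-envelope sketch.  The data set sizes come from the list above; the 15 Mbps sustained upload rate is an assumption for a typical home ISP link, and `transfer_hours` is just an illustrative helper name.

```python
def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Hours needed to move size_gb gigabytes over a link_mbps megabit/s link."""
    bits = size_gb * 1e9 * 8              # decimal gigabytes -> bits
    seconds = bits / (link_mbps * 1e6)    # megabits per second -> bits per second
    return seconds / 3600


# Sizes from the list above; 15 Mbps is an assumed sustained upload rate.
for size_gb in (120, 200, 500):
    print(f"{size_gb} GB at 15 Mbps: {transfer_hours(size_gb, 15):.1f} hours")
```

Under that assumption, pushing the current 120 GB to one other workstation takes most of a day, and 500 GB takes roughly three days, before anybody has trained anything.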

Revisiting the Data Centralization Problem

So let’s step back and ask why we want data centralization in the first place.  Bob is working on Classification, Sue is working on Detection and Billy is working on Clustering.  They all start with the same data because our objective is to build a Machine Learning Solution, not a simple algorithm.  Put data in, get a great answer out; not a crappy answer you need to chain with something else to achieve your ultimate goal.  So we decided we want to Cluster, Recognize and Detect.  There are many switches in there, so each of those primary categories might have 20 models specifically built for that exact task in the same ML Solution.  This means that from the initial data ingest we will be generating a ton of extra data, be it synthetic data, cropped variations, extracted regions and more.  I might start with 100 GB, but during the process my 3 engineers will likely generate another 100 GB each.  Each engineer should be able to share their results with the other engineers quickly and easily.  Not only that, they should be able to easily replicate each other’s experiments and solutions.

The Data Centralization Solution

Before I go too far, there are 100 different ways you can solve this problem.  I just happen to have found an answer I really like.  Everything gets stored in Azure Data Lake.  There are a few key reasons here.

  • Infinite Data Scale (even in a single file)
  • Works seamlessly with Hadoop
  • Works with CNTK, TensorFlow, Caffe, MXNet, Torch and a few others
  • Optimized for Analytics workloads
  • Python API available

Here is some simple code our team runs from within VS Code, in an interactive environment, to upload newly generated data so it can be shared.

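#%% Imports and Path Mods
import sys
from Utilities import DataLake  # our team's own helper module

#%% App
source = 'C:\\data\\suite_name\\sub_data\\'  # local folder holding the new data
dest = 'suite_name/sub_data/images/'         # target folder in the data lake

try:
    DataLake.Upload(source, dest, threads = 32, overwrite = True)
except FileExistsError as err:
    # Minimal placeholder handling (assumed): report the clash and move on
    print(err)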

That’s it!  My team can focus more on their actual task than on data ingest/egress.  Not only that, the upload was actually hitting my ISP throttle, so it runs at least as fast as 15 Mbps upload.  Once the upload is done, my entire team can access the data they are interested in quickly and easily, using the development language we have decided to standardize on.
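The `Utilities.DataLake` helper is our own module and isn’t shown in this article.  For readers who want a starting point, a wrapper of that kind could be sketched roughly as below.  This is not our actual code: it assumes the `azure-datalake-store` package (Data Lake Store Gen1), and the function name `upload_folder`, the `ADL_*` environment variables and the injectable `uploader` argument are all hypothetical choices made for illustration.

```python
import os


def _default_uploader(local_path, remote_path, threads, overwrite):
    # Assumption: azure-datalake-store is installed and service principal
    # credentials are supplied through these hypothetical environment variables.
    from azure.datalake.store import core, lib, multithread

    token = lib.auth(
        tenant_id=os.environ["ADL_TENANT_ID"],
        client_id=os.environ["ADL_CLIENT_ID"],
        client_secret=os.environ["ADL_CLIENT_SECRET"],
    )
    adl = core.AzureDLFileSystem(token, store_name=os.environ["ADL_STORE_NAME"])
    # Multi-threaded upload of the whole local folder to the remote path.
    multithread.ADLUploader(adl, lpath=local_path, rpath=remote_path,
                            nthreads=threads, overwrite=overwrite)


def upload_folder(source, dest, threads=32, overwrite=False,
                  uploader=_default_uploader):
    """Upload a local folder to the lake; `uploader` can be swapped out for testing."""
    if not os.path.isdir(source):
        raise FileNotFoundError(f"source folder not found: {source}")
    # Normalize Windows-style separators; lake paths use forward slashes.
    remote = dest.replace("\\", "/").rstrip("/")
    uploader(source, remote, threads, overwrite)
    return remote
```

Importing the Azure package lazily and passing the uploader in as an argument keeps the path handling testable on a machine that has no Azure credentials at all.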

Revisiting the Compute Problem

So we have solved the Data Centralization problem, but we have introduced a new problem.  You could download ALL of the data (which is getting into the hundreds of GB, possibly TB or larger) to every workstation, but this is really just not practical.  Luckily, Azure has NC24 machines.  The NC24 has 4 K80 GPUs, 24 cores and 224 GB RAM.  At the time of this writing they cost ~$3/hour.  If leveraged properly, what you can do in a single hour of training on just one of these is absolutely insane.  OK, so the cloud obviously has the compute nodes we need, but how do we get data onto the compute nodes, and how do we handle scheduling and so on?

Azure Batch-Shipyard

This is probably one of the best features out there.  Azure Data Lake + NC24 Batch Pool + Shipyard = infinitely scalable deep learning clusters on demand :D.  Yeah, that is for real.  Want a freaking super compute cluster?  This is how you build it.  By the way, here is the config file to get started.  The best part is that it spins up and shuts down the machines for you.  This means you only pay for the time you use, greatly reducing costs.

How it works

We need to start with an understanding first of how Azure Batch works.  Below you will find an architecture diagram.


The way it works is that your data already sits in Azure Storage (Data Lake) and you provide a task script, which is then executed on a series of Batch nodes according to the configuration file provided.  Azure Batch-Shipyard simply provides a set of Batch configuration files describing how to provision and execute tasks for various frameworks.  The data is moved from Azure Storage into the Batch pool (the pool of machines provisioned for the job), and task output is dropped in a specified storage location.

Each engineer can now schedule their jobs back to back.  Depending on the resources available, the jobs may execute in parallel or in sequence.  If you follow the paradigm of provision, run, de-provision, clusters are stood up completely on demand and each engineer can have all the power in the data center for the duration of their job.
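The provision, run, de-provision paradigm maps nicely onto a Python context manager.  The sketch below only illustrates the pattern: `batch_pool`, `fake_provision` and `fake_deprovision` are hypothetical names, and the two fake functions stand in for whatever actually creates and deletes the pool (Batch Shipyard or the Azure Batch SDK, for example).

```python
from contextlib import contextmanager


@contextmanager
def batch_pool(provision, deprovision, vm_size="STANDARD_NC24", vm_count=1):
    """Stand a pool up for the duration of a job and always tear it down after."""
    pool = provision(vm_size, vm_count)
    try:
        yield pool
    finally:
        # Runs even if the job raises, so idle machines never keep billing.
        deprovision(pool)


# Hypothetical stand-ins that only record what would have happened.
events = []


def fake_provision(vm_size, vm_count):
    events.append(f"provision {vm_count} x {vm_size}")
    return {"vm_size": vm_size, "vm_count": vm_count}


def fake_deprovision(pool):
    events.append("deprovision")


with batch_pool(fake_provision, fake_deprovision, vm_count=3) as pool:
    events.append(f"run training on {pool['vm_count']} nodes")

print(events)
```

The point of the `finally` block is the money: if a training script crashes at 2 AM, the pool still gets torn down instead of burning dollars per hour until somebody notices.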

How much does it cost?

You only pay for data storage and compute.  Data Lake up to 1 TB is $35/month.  Each NC24 machine is about $3/hour.  For object detection, I’ll run a single NC24 machine for 8 hours, which is only $24 for that job to reach convergence.  If I need it faster, I might double or triple up.  Let’s say I have a TON of data to run through and I set a super low learning rate: the same 8 hours across 3 machines is $72.  OK, so a single dual GPU workstation for one of my devs is $5,500 (btw, this isn’t apples to apples, as the Azure boxes are way superior to the $5,500 workstations).  For the price of one workstation I get 1,833.33 compute hours in the cloud ($5,500 / $3 per hour).  Because the cloud machines compute faster and my team can share data, those hours are spent actually computing rather than on all the peripheral tasks around computing, so I get results faster and cheaper.  Once my organization gets into the habit of this, it scales better as well.
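The arithmetic above is simple enough to keep in a small script when planning a budget.  This is only a sketch using the approximate prices quoted in this article; the names `job_cost`, `NC24_RATE` and `WORKSTATION_COST` are made up for illustration, and storage is billed separately.

```python
NC24_RATE = 3.00          # USD per machine-hour, the approximate figure quoted above
WORKSTATION_COST = 5500   # USD per dual GPU workstation


def job_cost(hours, machines=1, rate=NC24_RATE):
    """Compute cost of one training job; storage is billed separately."""
    return hours * machines * rate


single = job_cost(8)                # one NC24 for 8 hours
triple = job_cost(8, machines=3)    # three NC24s for the same 8 hours
hours_per_workstation = WORKSTATION_COST / NC24_RATE

print(f"single: ${single:.2f}, triple: ${triple:.2f}, "
      f"cloud hours per workstation budget: {hours_per_workstation:,.2f}")
```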


So there you have it.  If you want to do Deep Learning against Big Data sets with super performance, this is currently the fastest and cheapest way I have found to do it while keeping your team effective: Azure Data Lake + Azure Batch + Batch Shipyard + NC24 machines in your batch pool + CNTK GPU w/ 1-bit SGD (and MPI).  We dropped $22,000 down to $5,000 or so and increased the effectiveness of the team, while ensuring the organization can grow without us having to change our process.  Good luck!

