Optimizing ML Classification Algorithms


Hello World!

Today we are going to do a little exercise around optimizing an algorithm.  I was working with a customer who was using open data (and we know how that can be) to perform an initial set of predictions and show some value, while adding in some collection capabilities so they can roll out a version with more reliable data later.

The data can be collected from here: https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_pums_csv_2015&prodType=document

The sample notebook can be located here: https://github.com/drcrook1/MLPythonNotebooks

Continue reading

Understanding TensorFlow Input Pipelines Part 1


Hello World!

Alright; so this whole input pipeline thing in pretty much every framework is the most undocumented thing in the universe.  So this article is about demystifying it.  We can break down the process into a few key steps:

  1. Acquire & Label Data
  2. Process Label Files for Record Conversions
  3. Process Label Files for Training a Specific Network Interface
  4. Train the Specific Network Interface

This is part 1.  We will focus on the 2nd item in this list: processing the label files into TF Records.  Note you can find more associated code in the TensorFlow section of this Git repository: https://github.com/drcrook1/CIFAR10
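The record conversion itself is covered in the full post (in TensorFlow 1.x it means writing TFRecord files via `tf.python_io.TFRecordWriter`), but the underlying idea is framework-agnostic: serialize each labeled example as a length-prefixed binary record.  Here is a minimal stdlib-only sketch of that idea, using hypothetical in-memory byte strings in place of real image files:

```python
import struct

# Hypothetical stand-ins for what a label file points at: (label_id, raw image
# bytes) pairs.  In the real pipeline the bytes come from the image paths
# listed in the label file.
examples = [(3, b"\x00" * 8), (7, b"\xff" * 8)]

def write_records(path, examples):
    """Pack each (label, bytes) pair as: 4-byte label, 4-byte length, payload."""
    with open(path, "wb") as f:
        for label, payload in examples:
            f.write(struct.pack("<ii", label, len(payload)))
            f.write(payload)

def read_records(path):
    """Inverse of write_records; returns the (label, bytes) pairs back."""
    out = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break
            label, length = struct.unpack("<ii", header)
            out.append((label, f.read(length)))
    return out

write_records("/tmp/demo.records", examples)
```

The real TFRecord format adds CRC checksums and `tf.train.Example` protobufs on top, but the read/write symmetry is the same.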

Continue reading

Prepping Label Files for ML Training on a Specific Machine


Hello World!

So you will likely run into this at some point.  You are reading data from somewhere and it is relative-path based, but that doesn't always help when loading data in, especially if you are storing your data and your code in separate paths (which is common), if you are sharing data with a team, or if your data is just somewhere totally different.

Anyway, this article will help convert a .csv label file with actual named labels to a label file with full paths and numerical labels that can be more easily one-hot encoded during the reading process.  Note that for deep learning this is often a two-step process.  Step 1: Convert from relative pathing to specific pathing & numerical labels.  Step 2: Convert to a framework-specific storage format for the input reading pipeline (which varies from framework to framework).  Here we cover Step 1.  We will be using the CIFAR-10 dataset, which can be downloaded from here: https://www.kaggle.com/c/cifar-10/data
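Step 1 fits in a few lines of stdlib Python.  The class names, paths, and data root below are hypothetical stand-ins (the real mapping would be built from the classes in your downloaded label file):

```python
import csv
import os

# Hypothetical label-name -> numeric-id mapping for CIFAR-10-style classes.
CLASS_TO_ID = {"airplane": 0, "automobile": 1, "bird": 2}

def convert_label_file(in_csv, out_csv, data_root, class_to_id):
    """Rewrite (relative_path, class_name) rows as (full_path, numeric_label)."""
    with open(in_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        for rel_path, class_name in reader:
            writer.writerow([os.path.join(data_root, rel_path),
                             class_to_id[class_name]])

# Usage sketch with a tiny fabricated input file:
with open("/tmp/labels_in.csv", "w", newline="") as f:
    csv.writer(f).writerows([["1.png", "bird"], ["2.png", "airplane"]])
convert_label_file("/tmp/labels_in.csv", "/tmp/labels_out.csv",
                   "/data/cifar10/train", CLASS_TO_ID)
```

The numeric labels in the output file can then be one-hot encoded cheaply at read time.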

Continue reading

Writing Files to Persisted Storage from PySpark


Hello World!

So here is the big ticket item: how in the world do I write files to persisted storage from PySpark?  There are tons of docs on RDD.saveAsTextFile() or things of that nature, but those only matter if you are dealing with RDDs or .csv files.  What if you have a different set of needs?  In this case, I wanted to visualize a decision forest I had built, but there are no good bindings that I could find between PySpark's MLlib and Matplotlib (or similar) to visualize the decision forest.
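The shape of the workaround is worth sketching: MLlib's tree-ensemble models expose toDebugString(), which hands you a plain Python string on the driver, and writing a string is then just ordinary file IO.  The PySpark calls below are assumed from MLlib's API rather than executed here, and the forest text is a hypothetical stand-in:

```python
# The Spark side (assumed, not executed in this sketch):
#   from pyspark.mllib.tree import RandomForest
#   model = RandomForest.trainClassifier(...)
#   forest_text = model.toDebugString()   # plain Python str on the driver
# Hypothetical stand-in for that string so the write step runs anywhere:
forest_text = "TreeEnsembleModel classifier with 3 trees\n  Tree 0:\n    If (feature 2 <= 0.5) ..."

# Driver-side write: once the structure is an ordinary string, persisting it
# is just local file IO (or any mounted/blob path the driver can reach).
with open("/tmp/forest_structure.txt", "w") as f:
    f.write(forest_text)
```

The same pattern works for anything you can collect to the driver as a Python object: figures, JSON, pickled models, and so on.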

Continue reading

High Performance, Big Data, Deep Learning at Scale


Hello World!

I’m not sure the title really nailed it well enough, but we are going to talk about solving VERY big problems as fast as we possibly can using highly sophisticated techniques.  This blog article is a high-level overview of what you want to set up, as opposed to the usual how to set it up.  There are a ton of steps to the actual how-to; I thought it best to just provide an overview in this article of what you want to do instead of how to do it.

Continue reading

Dealing with Pesky Image Names in COCO


Hello World!


If you are not familiar with Microsoft COCO, you should be.  It’s a treasure trove of data for your learning pleasure!  There just happens to be one pesky problem with it: when attempting to find the files for training/testing, the annotation file that ships with MS COCO does not include the actual file name, but rather the image id.  This sounds fine, except the file names in the downloaded data have a bunch of extra stuff wrapped around that id!  In this article we will go through how to get it ready.
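A sketch of the fix, assuming the naming convention from the 2014 COCO release, where files are named COCO_<split>_<image id zero-padded to 12 digits>.jpg (that padding and prefix is the extra stuff the annotation's bare id doesn't show):

```python
def coco_file_name(image_id, split="train2014"):
    """Build the on-disk file name from a COCO annotation's bare image id.

    Assumes the 2014 release convention: COCO_<split>_<12-digit id>.jpg.
    """
    return "COCO_{}_{:012d}.jpg".format(split, image_id)

# e.g. annotation image id 9 in the training split:
print(coco_file_name(9))  # COCO_train2014_000000000009.jpg
```

Mapping every annotation's image id through a helper like this gives you a label file of real paths you can feed into a training pipeline.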

Continue reading

Microsoft Cognitive Toolkit + VS Code = Awesome


Hello World!

In this article I’m going to go through how to set up CNTK with Visual Studio Code and take advantage of those Pascal GPUs I know everybody has these days.  I will also give a brief overview of what CNTK and Visual Studio Code are and why they are so incredible for machine learning scientists.


Continue reading

Operationalize Deep Learning with Azure ML


Hello World!

So today we are going to do something really awesome: operationalize Keras with Azure Machine Learning.  Why in the world would we want to do this?  Well, we can configure deep neural nets and train them on GPUs.  In fact, in this article we will train a two-layer-deep neural network that outputs a linear prediction of building energy efficiency, and then operationalize that GPU-trained network on Azure Machine Learning for production API usage.
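The actual model in the full post is built with Keras; purely as an architecture sketch (NumPy, random untrained weights, hypothetical layer sizes, reading the depth of 2 as two hidden layers), the described network looks like this:

```python
import numpy as np

# Hypothetical sizes: 8 input features per building, two hidden layers,
# one linear output for the energy-efficiency prediction.
n_features, n_hidden = 8, 16
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.standard_normal((n_hidden, n_hidden)), np.zeros(n_hidden)
W3, b3 = rng.standard_normal((n_hidden, 1)), np.zeros(1)

def predict(x):
    """Forward pass: two ReLU hidden layers, then a linear (no activation) output."""
    h1 = np.maximum(0.0, x @ W1 + b1)
    h2 = np.maximum(0.0, h1 @ W2 + b2)
    return h2 @ W3 + b3  # linear output -> unbounded regression value

batch = rng.standard_normal((4, n_features))
print(predict(batch).shape)  # one prediction per building in the batch
```

The linear (activation-free) output layer is what makes this a regression network rather than a classifier.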

Continue reading

Building a Self-Racing Race Car – The Journey Begins


Hello World!

So today I have decided to actually begin a write-up on where I’m headed.  It’s taken a long while to come up with what my next adventure after food trucks was going to be, especially with all of the things you can do with embedded technologies and machine learning.  So here it is: a self-driving race car.

In this article I aim to lay out the high level plan of attack for what my team and I are building and some of the direction on where we are headed.

Continue reading