Complex Neural Network Data Modelling with CNTK

Hello World,

This article is kinda exciting for me, because once you can internalize how this works, the world really becomes your oyster in terms of what you can model with what kind of data. In this example we are going to take some sample images and some random vector features and merge them together. In a more realistic scenario you might take something like an image along with some contextual tabular data and want to merge those two data sets into a single prediction.
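To make the idea concrete, here is a minimal sketch, not the exact network from the article, of merging an image input with a vector input in CNTK's Python API; the shapes and layer sizes are placeholders.

import cntk as C

# Two input streams: a small RGB image and a vector of contextual features
image_input = C.input_variable((3, 32, 32))
vector_input = C.input_variable(10)
label = C.input_variable(2)

# Image branch: convolution + pooling, then collapse to a dense feature vector
img = C.layers.Convolution2D((3, 3), 16, activation=C.relu, pad=True)(image_input)
img = C.layers.MaxPooling((2, 2), strides=2)(img)
img = C.layers.Dense(64, activation=C.relu)(img)

# Vector branch: a small dense projection of the tabular features
vec = C.layers.Dense(16, activation=C.relu)(vector_input)

# Merge the two branches into one feature vector and predict from it
merged = C.splice(img, vec, axis=0)
hidden = C.layers.Dense(32, activation=C.relu)(merged)
z = C.layers.Dense(2, activation=None)(hidden)

loss = C.cross_entropy_with_softmax(z, label)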

Continue reading

Running Jupyter in Kubernetes with an SLA

Hello World!

So Jupyter is a great tool for experimental science. Running a Jupyter notebook, though, can be tricky, especially if you want to maintain all of the data that is stored in it. I have seen many strategies, but I have come up with one that I like best of all. It is based on my “Micro Services for Data Science” strategy. By decoupling data and compute we can literally thrash our Jupyter notebook and all of our data and notebooks still live. So why not put it in a self-healing orchestrator and deploy via Kubernetes :D.

Continue reading

Optimizing ML Classification Algorithms

Hello World!

Today we are going to do a little exercise around optimizing an algorithm. I was working with a customer who was using open data (and we know how that can be) to perform an initial set of predictions to show some value, while adding in some collection capabilities so they could roll out a version with more reliable data later.

The data can be collected from here: https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_pums_csv_2015&prodType=document

The sample notebook can be located here: https://github.com/drcrook1/MLPythonNotebooks
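As a generic illustration of what an exercise like this can look like (the linked notebook may well take a different approach), here is a sketch of a cross-validated hyperparameter search over a simple classifier; X and y below are random placeholders rather than the actual census extract.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data; in practice X and y would come from the ACS PUMS extract
X = np.random.rand(500, 10)
y = np.random.randint(0, 2, 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Search a small hyperparameter grid with 3-fold cross validation
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("held-out accuracy:", search.best_estimator_.score(X_test, y_test))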

Continue reading

Prepping Label Files for ML Training on Specific Machine

Hello World!

So you will likely run into this at some point. You are reading data from somewhere and it is relative-path based, but that doesn't always help when loading data, especially if you are storing your data and your code in separate paths (which is common), if you are sharing data with a team, or even if your data is just somewhere totally different.

Anyway, this article will help convert a .csv label file with actual named labels into a label file with full paths and numerical labels that can be more easily one-hot encoded during the reading process. Note that for deep learning this is often a two-step process. Step 1: convert from relative paths to machine-specific paths and numerical labels. Step 2: convert to the framework-specific storage format for the input reading pipeline (which varies from framework to framework). Here we cover Step 1. We will be using the CIFAR-10 data set, which can be downloaded from here: https://www.kaggle.com/c/cifar-10/data
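As a rough sketch of Step 1, assuming the Kaggle CIFAR-10 layout of a trainLabels.csv with id and label columns plus a train folder of <id>.png images (your paths will differ):

import os
import pandas as pd

DATA_ROOT = "/data/cifar10/train"                       # wherever the images actually live on this machine
labels = pd.read_csv("/data/cifar10/trainLabels.csv")   # columns: id, label

# Map each class name to a numerical label so it can be one-hot encoded later
class_to_index = {name: i for i, name in enumerate(sorted(labels["label"].unique()))}

# Build the machine-specific path and numerical label for every row
labels["path"] = labels["id"].apply(lambda i: os.path.join(DATA_ROOT, "{}.png".format(i)))
labels["class_index"] = labels["label"].map(class_to_index)

# Write the new label file: full path + numerical label, one row per image
labels[["path", "class_index"]].to_csv("/data/cifar10/train_map.csv", index=False, header=False)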

Continue reading

Writing Files to Persisted Storage from PySpark

Hello World!

So here is the big-ticket item: how in the world do I write files to persisted storage from PySpark? There are tons of docs on RDD.saveAsTextFile() or things of that nature, but that only matters if you are dealing with RDDs or .csv files. What if you have a different set of needs? In this case, I wanted to visualize a decision forest I had built, but there are no good bindings that I could find between PySpark's MLlib and matplotlib (or similar) to visualize the decision forest.
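One common workaround (not necessarily the exact route this article takes) is to bring the small result you care about back to the driver, write it with ordinary Python I/O, for example as a matplotlib figure, and then push that file into persisted storage such as HDFS. All the paths below are placeholders.

import subprocess
import matplotlib
matplotlib.use("Agg")                       # headless rendering on the driver
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-figure-demo").getOrCreate()

# Aggregate down to something small enough to collect() safely on the driver
counts = (spark.range(0, 1000)
               .selectExpr("id % 10 as bucket")
               .groupBy("bucket").count()
               .orderBy("bucket")
               .collect())

# Render and save the figure as a plain local file on the driver
plt.bar([r["bucket"] for r in counts], [r["count"] for r in counts])
plt.savefig("/tmp/buckets.png")

# Copy the local file into HDFS (or any other persisted, mounted path)
subprocess.check_call(["hadoop", "fs", "-put", "-f", "/tmp/buckets.png", "/user/me/buckets.png"])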

Continue reading

Saving those Magic Ubuntu Environment Variables

Hello World!

This one is more for me than for you. I often find a piece of software that needs some magic environment variable set to some magic path that never seems to get properly configured during installation. Below is an example of how to set that variable and then ensure it stays set every time you log on to the server from then on.

# These instructions are for bash
$ echo $SHELL
/bin/bash

# Check the current value of your envvar
$ echo $CAFFE_ROOT

# Add the envvar to ~/.profile so it will load automatically when you login
$ echo "export CAFFE_ROOT=/home/username/caffe/" >> ~/.profile

# Load the new configuration
$ source ~/.profile

# Check the new envvar value
$ echo $CAFFE_ROOT
/home/username/caffe/

Dealing with Pesky Image Names in Cocos

Hello World!

If you are not familiar with Microsoft COCO, you should be. It's a treasure trove of data for your learning pleasure! There just happens to be one pesky problem with it: when attempting to find the files for training/testing, the annotation file that ships with MS COCO does not include the actual file name, but rather the image id. This sounds fine, except that the data, when you download it, has a bunch of trailing stuff in the file names! In this article we will go through how to get it ready.
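As a small sketch of the idea, assuming the 2014 naming convention (e.g. COCO_train2014_000000391895.jpg) and a placeholder annotation path, you can rebuild the on-disk file name from the image id by zero-padding it to 12 digits.

import json

def coco_file_name(image_id, split="train2014"):
    # Turn a raw COCO image id into the padded file name used on disk
    return "COCO_{}_{:012d}.jpg".format(split, image_id)

with open("/data/coco/annotations/instances_train2014.json") as f:
    coco = json.load(f)

# Map every image id referenced by the annotations to a loadable file name
id_to_file = {img["id"]: coco_file_name(img["id"]) for img in coco["images"]}
print(id_to_file[coco["annotations"][0]["image_id"]])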

Continue reading

Microsoft Cognitive Toolkit + VS Code = Awesome

Hello World!

In this article I’m going to go through how to set up CNTK with Visual Studio Code and take advantage of those Pascal GPUs I know everybody has these days. I will also do a brief overview of what CNTK and Visual Studio Code are and why they are so incredible for machine learning scientists.
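As a quick sanity check once everything is installed (assuming CNTK 2.x), you can ask CNTK which devices it can see and pin computation to the first GPU:

from cntk.device import all_devices, gpu, try_set_default_device

print(all_devices())                    # should list the GPU(s) as well as the CPU
if try_set_default_device(gpu(0)):      # pin computation to the first GPU
    print("Running on GPU 0")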

Continue reading

Linking CuDNN

Hello World,

This is a quick and dirty post, because I run into this problem all the time and need a place to find the answer quickly.

Here is what happens:

ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory

Here is the answer:

drcrook@BigBen:/usr/local/cuda$ sudo cp include/cudnn.h /usr/include
drcrook@BigBen:/usr/local/cuda$ sudo cp lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
drcrook@BigBen:/usr/local/cuda$ sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcudnn*

If you are struggling to get your GPU initialized with Theano, TensorFlow, or really any deep learning framework, this is probably something you will want to do.