The Cognitive Revolution

Hello World!

So here is a pretty raw blog article, not unlike most of my articles.  The cognitive revolution: I’m going to coin this term today.  What the heck is this thing?  What does it mean for you?  What does it mean for me?  Where did it come from?  These are the questions I aim to answer in this blog article.

Docker, TensorFlow and Scientific Computing

Hello World,

This blog post is to get you operational with Docker, including image and volume management, with a pivot towards scientific computing and TensorFlow.  I am building a Jupyter Notebook for the local machine learning meetup to learn the ins and outs of TensorFlow, and deploying it up to Azure.  Part of getting this to work is managing not only the Docker containers but also the data on the volumes, so that when we deploy up to Azure and somebody opens the notebook, it comes pre-loaded with all the necessary tutorial data.

Decoding Woes Solved – Python

Hello World,

This is a short post.  Basically I had a data set come in with some funky characters in it.  I was getting “Can’t read this; doesn’t appear to be UTF-8”.  I looked around on Stack Overflow for a while to little avail, then came up with this, which works.

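import pandas as pd
from io import StringIO

dataPath = "C:\\data\\CompanyA\\DavidCrook\\davidData_Session1.csv"
# Read the raw text with the standard file reader instead of letting pandas decode it
with open(dataPath) as fil:
    txt = ''.join(fil.readlines())
# Wrap the text in an in-memory buffer and parse that
works = pd.read_csv(StringIO(txt), index_col = 0)
# Reading straight from the path is what gave me the UTF-8 error
doesntWork = pd.read_csv(dataPath, index_col = 0)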

Just read the sucker with the standard file open and line reader, push it into a StringIO and then read into a data frame.  Guess what I’m doing from now on.
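
My best guess at why this works (an assumption, I have not dug into it): the standard open() decodes with the platform's default encoding, while pandas assumes UTF-8 unless told otherwise.  If that is what is going on, handing read_csv an explicit encoding should get the same result in one step.  Here is a small sketch with a made-up two-row file and Latin-1 picked purely for illustration; your data may well be in a different encoding.

```python
import os
import tempfile
from io import StringIO

import pandas as pd

# Hypothetical sample data: 0xE9 is "é" in Latin-1 and is not valid UTF-8 here.
raw = "id,name\n1,Café\n2,Plain\n".encode("latin-1")

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Default read: pandas assumes UTF-8 and chokes on the 0xE9 byte.
    try:
        pd.read_csv(path, index_col=0)
        default_failed = False
    except UnicodeDecodeError:
        default_failed = True

    # The workaround from this post, with the encoding spelled out.
    with open(path, encoding="latin-1") as fil:
        txt = "".join(fil.readlines())
    via_stringio = pd.read_csv(StringIO(txt), index_col=0)

    # The same thing in one step, using read_csv's encoding parameter.
    direct = pd.read_csv(path, index_col=0, encoding="latin-1")
finally:
    os.remove(path)
```

Both routes should give the same data frame; the StringIO trick just lets open() do the decoding.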

#MicroBlogPost 🙂

Standardize Continuous Data Shape for Neural Networks

Hello World,

So this is an interesting problem.  You are collecting data from somewhere and you want to feed it into a neural network for classification.  There is one main problem with this.  The shape of the data!  Neural networks, and really just about any model, require data of a specific shape; you can’t just give them something of ambiguous size.  There are tons of papers out there on dimensionality reduction, but I found nothing on dimensionality reduction to a specified size.  This article explains my approach.
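
To make the shape problem concrete, here is one generic way to force a variable-length series to a fixed length: plain linear interpolation onto evenly spaced points.  This is a stand-in I am using for illustration, not necessarily the approach the article describes, and the function name resample_to_length is made up.

```python
def resample_to_length(values, target_len):
    """Linearly interpolate a 1-D sequence onto target_len evenly spaced points."""
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    if n == 1 or target_len == 1:
        # Nothing to interpolate between; repeat the first value.
        return [float(values[0])] * target_len

    step = (n - 1) / (target_len - 1)  # spacing of output points in input-index units
    out = []
    for i in range(target_len):
        pos = i * step
        lo = min(int(pos), n - 2)      # index of the left neighbour
        frac = pos - lo                # how far we are towards the right neighbour
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


short = resample_to_length([0, 10], 5)               # stretched up to 5 points
long = resample_to_length([0, 1, 2, 3, 4, 5, 6], 4)  # squeezed down to 4 points
```

Whatever the input length, the output always has exactly target_len values, which is what the network needs.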

Time Series Discovery with Python

Hello World,

This article is loosely based on a time series challenge from customer data.  I have fabricated 3 data files such that they represent the same challenge, and we will go through the process of discovering that data.  The primary challenge in this data set is that it is from a sleep study and the researchers left the date portion of the timestamp off.  What this means is that at midnight, the data wraps around and plots at the beginning of the x-axis again.  The second challenge is lining up the data to see if there is anything interesting with the time.  So yes, you can simply plot using the index that Python generates; however, I’m also interested in the actual time itself, as this is a study involving humans.
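
As a rough sketch of the midnight problem: if the readings are in chronological order and never more than 24 hours apart (both assumptions on my part), you can rebuild full timestamps by bumping the date every time the clock appears to go backwards.  The helper name attach_dates and the start date are made up for illustration.

```python
from datetime import date, datetime, time, timedelta


def attach_dates(times, start_date):
    """Turn time-only stamps into full datetimes, advancing a day at each rollover."""
    stamped = []
    day = start_date
    previous = None
    for t in times:
        # A time earlier than the one before it means we crossed midnight.
        if previous is not None and t < previous:
            day += timedelta(days=1)
        stamped.append(datetime.combine(day, t))
        previous = t
    return stamped


# Hypothetical readings from one night of a sleep study.
readings = [time(22, 30), time(23, 45), time(0, 15), time(1, 0)]
full = attach_dates(readings, date(2016, 1, 1))
```

With real dates attached, the series plots left to right across midnight while still keeping the actual clock time.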

Becoming a Functional Data Scientist

Hello World,

So today, I was asked to put some thought into what we should focus our entry level data scientists on in terms of tech skills.  After I put a bunch of thought into it, I ended up coming up with this.  I decided that the most important aspects came down to a few items:

  1. Don’t overload them
  2. They can deliver to production, where the target can be anything, including IoT.
  3. They will not be concerned with building front ends.

I have to say, the result greatly surprised me.

K-Means under the hood with Python

Hello World!

This article is meant to explain how the K-Means Clustering algorithm works while simultaneously learning a little Python.

What is K-Means?

K-Means Clustering is an unsupervised learning algorithm that tells you how similar observations are by putting them into groups or “clusters”.  K-Means is often used as a discovery step on new data to find out what the various categories might be.  Once you understand the centroid labels, you can then apply something such as k-nearest-neighbor as a classifier, where a centroid is the center of a “cluster” or group.
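
To give a feel for it before digging in, here is a minimal plain-Python sketch of the usual K-Means loop: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes.  It is not necessarily the code the full article builds; the function names, the sample points and the random seeding are my own assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Return (centroids, labels) after running the assign/update loop."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = []
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Two made-up, well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
```

The first three points should end up sharing one label and the last three the other; which group gets label 0 depends on the random start.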

My Production Data Science Workflow

Hello World,

So I’ve spent a while now looking at 3 competing languages and I did my best to give each one a fair shake. Those 3 languages were F#, Python and R. I have to say it was really close for a while, because each language has its strengths and weaknesses. That said, I am moving forward with 2 languages and a very specific way I use each one. I wanted to outline this because it took me a very long time to learn all of these languages well enough to discover it, and I would hate for others to go through the same exercise.

Merging Data Sets in Python

Hello World,

So this article is inspired by a customer doing financial analysis who can only grab a certain amount of data at a time from the data steward’s stores, in chunks based on time windows. As time is constantly moving, you occasionally get duplicate data across requests. If you attempt to grab exactly on the edges, you have a chance of missing something, so it’s best to have a bit of an overlap and just deal with that overlap.
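
The basic idea of dealing with the overlap can be sketched in plain Python: key every row on its timestamp and let later requests overwrite earlier ones.  This assumes the timestamp uniquely identifies a row, and the field names and sample values below are made up for illustration.

```python
def merge_chunks(chunks, key="timestamp"):
    """Merge overlapping chunks of rows, keeping one row per key, sorted by key."""
    merged = {}
    for chunk in chunks:
        for row in chunk:
            merged[row[key]] = row  # a duplicate from a later chunk replaces the earlier one
    return [merged[k] for k in sorted(merged)]


# Two hypothetical requests whose time windows overlap at 10:02.
first = [
    {"timestamp": "2016-01-04T10:00", "price": 101.5},
    {"timestamp": "2016-01-04T10:01", "price": 101.7},
    {"timestamp": "2016-01-04T10:02", "price": 101.6},
]
second = [
    {"timestamp": "2016-01-04T10:02", "price": 101.6},
    {"timestamp": "2016-01-04T10:03", "price": 101.9},
]

combined = merge_chunks([first, second])
```

ISO-style timestamp strings sort correctly as plain text, which is why a simple sorted() is enough here.
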

Getting Started with Linear Algebra in Python

Hello World!

So here I am, after trying for a long time to not learn Python, learning Python.  It just seems like I might get a hit or two more on my blog with some Python content.  Well, what’s the first thing I need to figure out, aside from getting it up and running in my environment and installing some libraries?  That’s right: find a numerical computing library and see how it ticks.

Let’s just start with my environment, because I painstakingly chose one.
