Writing Files to Persisted Storage from PySpark

Hello World!

So here is the big-ticket item: how in the world do I write files to persisted storage from PySpark?  There are tons of docs on RDD.saveAsTextFile() or things of that nature, but that only matters if you are dealing with RDDs or .csv files.  What if you have a different set of needs?  In this case, I wanted to visualize a decision forest I had built, but I could not find any good bindings between PySpark’s MLlib and Matplotlib (or similar) to visualize it.
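To make the problem concrete, here is a minimal sketch of one way to get a non-RDD artifact onto disk: build the string on the driver and write it with plain Python file I/O.  This is an illustration under assumptions, not necessarily the approach the full article takes.  The helper name `save_text_artifact`, the output path, and the sample text are all hypothetical; in a real job the text might come from a call such as the model's `toDebugString()`, and the path would need to point at storage that actually outlives the cluster.

```python
import os
import tempfile


def save_text_artifact(text, path):
    """Write a driver-side string to `path`, creating parent folders as needed.

    Hypothetical helper: this is plain Python file I/O running on the driver,
    so it only persists if `path` lives on storage that outlives the cluster.
    """
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


# Stand-in for the text description of a trained forest (assumed sample data).
forest_description = "Tree 0:\n  If (feature 0 <= 1.5)\n   Predict: 0.0\n  Else\n   Predict: 1.0\n"

# Assumed location: a temp folder here; swap in a mounted persistent path.
output_path = save_text_artifact(
    forest_description,
    os.path.join(tempfile.gettempdir(), "forest_demo", "forest.txt"),
)
```

The same pattern should work for any file a driver-side library can produce, such as a rendered image, as long as the target path is reachable from the driver.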

Continue reading

Dealing with Pesky Image Names in Cocos

Hello World!

If you are not familiar with Microsoft COCO, you should be.  It’s a treasure trove of data for your learning pleasure!  There just happens to be one pesky problem with it: when you try to find the files for training/testing, the annotation file that ships with MS COCO does not include the actual file name, only the image id.  This sounds fine, except the file names in the downloaded data have a bunch of extra stuff wrapped around that id!  In this article we will go through how to get it ready.
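As a rough sketch of the mapping involved, the snippet below converts between an image id and a file name.  It assumes the naming scheme used by the 2014 download, `COCO_<split>_<12-digit zero-padded id>.jpg`; other releases may name files differently.  Both function names are hypothetical and are not part of any COCO tooling.

```python
import os


def coco_2014_file_name(image_id, split="train2014", extension="jpg"):
    """Build a file name from an image id, assuming the 2014 naming scheme."""
    return "COCO_{}_{:012d}.{}".format(split, image_id, extension)


def image_id_from_file_name(file_name):
    """Recover the integer image id from a file name in the same scheme."""
    stem = os.path.splitext(os.path.basename(file_name))[0]
    # The id is the last underscore-separated piece; int() drops the zero padding.
    return int(stem.rsplit("_", 1)[-1])
```

For example, `coco_2014_file_name(9)` gives `COCO_train2014_000000000009.jpg`, and `image_id_from_file_name` turns that name back into `9`.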

Continue reading

Intro to Data Manipulation with R

Hello World,

Here is a recorded version of an in-person training I have been doing.  Enjoy.  I even end up coming back to this myself for reference.

This episode is all about performing data manipulation to derive raw insights from your data using the R programming language.  Data manipulation is at the core of anything and everything you do in business intelligence and machine learning.  This episode sets the base for all R-based intelligence sessions from here on out.

Part 1: Introduction to Microsoft R Open

Part 2: Introduction to R Data Structures

Part 3: Data Manipulation with R

Part 4: Beautiful Visualizations with R

Continue reading

Intro to R Data Structures

Hello World,

This article is a video tutorial introducing the very bare basics of R.  It’s a bit dry, but these are the underlying components of everything covered in the interesting stuff.  You can’t do cool stuff without understanding the basics first.

Part 1: Introduction to Microsoft R Open

Part 2: Introduction to R Data Structures

Part 3: Data Manipulation with R

Part 4: Beautiful Visualizations with R

Continue reading

Data Analytics Architectural Blueprint

Here is a video showcasing a sample architecture for doing Data Analytics on Azure.  Enjoy 🙂

SFL Emerging Tech Group – Project Teddy Talk

https://onedrive.live.com/redir?resid=BA8DC4B28555902A!3078&authkey=!AGYWrb_WYLUtkhs&ithint=file%2cpptx

Project Teddy – Embedded Systems + Big Data with David Crook

Friday, Aug 21, 2015, 6:00 PM

PricewaterhouseCoopers, LLP.
600 Silks Run Suite# 2210 Hallandale Beach, FL

18 ETs Went

Project Teddy is an IoT/Big Data project currently under development by the South Florida Developer Evangelist Team.  The goal of PT is to build a teddy bear that you can literally have a normal conversation with.  This meetup is to discuss the current progress of Project Teddy from a technology perspective, how it is built, why it is being built t…

Powering AzureML with Hadoop HBase

Hello World!

Today is a freaking cool day.  Why, you ask?  Because today I am writing an article on how to use two of the coolest freaking big data/data science tools out there together to do epic shit!  Let’s start with HBase.  HBase is a way to have a big data solution with query performance at an interactive level.  So many folks are starting to just dump data into HBase.  In the Project Teddy solution, we are dumping tweets, dialogue and dialogue annotations to power our open domain conversational API.  There really is no other easy-to-use way for us to do this.

The second part of Project Teddy is to predict, based on an incoming conversational component, what sort of response the speaker is attempting to elicit from the teddy bear.  Powering our teddy bear with predictive analytics and big data would be perfect for this.  What better platform to do this quickly and easily than AzureML?

This is a follow-up article to this one: http://indiedevspot.com/2015/06/30/writing-tweets-to-hbase-simply/

Continue reading

How to Datamine Zillow

Hello World,

As many of you may know at this point, I am relocating to South Florida.  The final location is still to be determined, but I will probably be renting around Pompano Beach or Fort Lauderdale while working out of Venture Hive and the Microsoft Fort Lauderdale offices.  So what does this have to do with Zillow?  Well, it has EVERYTHING to do with Zillow.  What I’ve found while searching for homes is that between Realtors, Zillow and Trulia, they really just don’t have a predictive analytics solution that works for me.  So I decided to give AzureML a shot and mash together a few datasets to send me notifications more to my liking than the ones currently being sent.  Step 1 in this plan is to data mine Zillow.  Luckily, Zillow has an API for that.  Or if you are feeling particularly frisky, Zillow gets their data from ArcGIS (example for Raleigh).  So let’s get cracking…

Continue reading

So you want to be an Analytics Developer

Hello World!

I get the same series of questions all the time:

  1. How do I switch careers to be a developer?
  2. How do I become a data scientist?
  3. How do I add intelligence to my code?
  4. How do I get a job in distributed computing?
  5. How do I code more analytically?

The answers to these questions are pretty much all the same.  Step 1, learn about it and build one piece of software focused on that goal.  Step 2, go for it, just do it.  That said, Microsoft has a fantastic resource, Microsoft Virtual Academy, which provides free training on various topics from entry level to advanced.  This article focuses on a learning plan with MVA to attain the goal of becoming an Analytics Developer.

Continue reading