So here is the big ticket item; How in the world do I write files to persisted storage from PySpark? There are tons of docs on RDD.toTextFile() or things of that nature; but that only matters if you are dealing with RDD’s or .csv files. What if you have a different set of needs. In this case; I wanted to visualize a decision decision forest I had built; but there are no good bindings that I could find between PySpark’s MLLIB and Matplot lib (or similiar) to visualize the decision forest.
If you are not familiar with Microsoft CoCos, you should be. Its a treasure trove of data for your learning pleasure! There just happens to be one pesky problem with it, and that is the fact that when attempting to find the files for training/testing; the Annotation file that ships with MS CoCo does not include the actual file name, but rather the image id. This sounds fine, except the data when you download it has a bunch of trailing stuff! In this article we will go through how to get it ready.
Here is a recorded version of an in-person training I have been doing. Enjoy. I end up coming back to this myself even for reference.
This episode is all about performing data manipulation to derive raw insights from your data using the R programming language. Data manipulation is the core to anything and everything you do in business intelligence and machine learning. This episode sets the base for all R based intelligence sessions from here on out.
This article is a video tutorial on introduction to the very bare basics of R. Its a bit dry, but it is the underlying components of everything covered in the interesting stuff. Can’t do cool stuff without understanding the basics first.
Project Teddy – Embedded Systems + Big Data with David Crook
Friday, Aug 21, 2015, 6:00 PM
PricewaterhouseCoopers, LLP. 600 Silks Run Suite# 2210 Hallandale Beach, FL
18 ETs Went
Project Teddy is an IoT/Big Data project currently under development by the South Florida Developer Evangelist Team. The goal of PT is to build a teddy bear that you can literally have a normal conversation with. This meetup is to discuss the current progress of project teddy from a technology perspective, how it is built, why it is being built t…
Today is a freaking cool day. Why do you ask? Because today I am writing an article on how to use two of the coolest freaking big data/data science tools out there together to do epic shit! Lets start with HBase. HBase is a way to have a big data solution with query performance at an interactive level. So many folks are starting to just dump data into HBase. In the project teddy solution, we are dumping tweets, dialogue and dialogue annotations to power our open domain conversational api. There really is no other way that is easy to use for us to do this.
The second part of project teddy is to predict based on an incoming conversational component, what sort of response the speaker is attempting to illicit from the teddy bear. If we power our teddy bear with predictive analytics and big data, this would be perfect. What better platform to do this quickly and easily than AzureML?
As many of you may know at this point, I am relocating to South Florida. Final location to be determined, but will probably be renting around Pompano Beach or Fort Lauderdale while working out of Venture Hive and the Microsoft Fort Lauderdale Offices. So what does this have to do with Zillow? Well, It has EVERYTHING to do with Zillow. What I’ve found while searching for homes is that between Realtors, Zillow and Trulia, they really just don’t have a predictive analytics solution that works for me. So I decided to give a shot at AzureML to mash together a few datasets to send me notifications more to my liking than is currently being sent. So step 1 in this plan is to data mine Zillow. Luckily, Zillow has an api for that. Or if you are feeling particularly frisky, Zillow gets their data from ArcGIS (example for Raleigh). So lets get cracking…
The answer to these questions are pretty much all the same. Step 1, learn about it and build one piece of software focused on that goal. Step 2, go for it, just do it. So that said, Microsoft has a fantastic resource, Microsoft Virtual Academy, which provides free training around various topics from entry level to advanced. This article focuses on a learning plan with MVA to attain the goal of becoming an Analytics Developer.