So today, I was asked to put some thought into what we should focus our entry-level data scientists on in terms of tech skills. After putting a good deal of thought into it, I ended up with the list below. I decided the most important criteria came down to a few items:
- Don’t overload them
- Can deliver to production where the target can be anything, including IoT.
- They will not be concerned with building front ends.
I have to say, the result greatly surprised me.
Here is a set of materials that lays a solid foundation.
I would also recommend folks do the following:
- Python on DataCamp
- Tutorial: http://dacrook.com/setting-up-python-and-virtual-environments-in-visual-studio-code-on-ubuntu/
- Read: http://dacrook.com/categories-of-analytics/
If you want to do Machine Learning, this is a must.
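The virtual-environment setup from the tutorial above boils down to a couple of commands on Ubuntu (the environment name and path here are just examples):

```shell
# create an isolated environment for a project
python3 -m venv ~/envs/ds-project

# activate it; pip installs now land inside the environment
source ~/envs/ds-project/bin/activate

# e.g. pip install numpy pandas scikit-learn
```

Keeping each project in its own environment means library upgrades for one customer never break another project's code.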
Functional Data Scientist
Can deliver projects where data science is the core and integrate into anything.
- Core Libraries to be familiar with
    - Flask (or Django)
    - TensorFlow
    - Anaconda
    - Plotly
- Core Platform Technologies to be familiar with
    - Azure Machine Learning
    - Azure Blob Storage & Azure Storage Explorer
    - Azure Data Factory
Reasons for each choice:
- Python -> The easiest language to integrate across all server-side technologies, with integration into TensorFlow and soon CNTK as well. Loads of documentation and community support. You can deploy your intelligence in micro-service architectures, allowing the easiest and lightest-touch integration into customer projects.
- C++ -> Any model can be built in Python and exported to C++ for delivery on any platform, in any language, on any hardware. Since both .Net and Java can make native calls into C++, it's perfect. This covers high-performance requirements on top of reaching any platform. Xamarin can even consume this.
- Flask or Django -> I don't know much about these; I've been delivering via Asp.Net. They are essentially the Python equivalents of it. I chose them to reduce the need to learn .Net, though .Net would be preferable.
- Anaconda -> This distribution bundles all the typical Python data science libraries: numpy, pandas, scikit-learn, etc.
- TensorFlow -> You need a backup for when you can't use Azure ML, whether due to data size limits, training complexity, data security, etc. Models can be trained on-prem with TensorFlow and then delivered into Azure.
- Plotly -> Great Python plotting library for interactive charts and plots.
- Ubuntu -> TensorFlow requires it. Ubuntu also gives you cron jobs and lots of good server-side tooling, and NVidia embedded runs on it as well. It's all about checking as many boxes with as little extra knowledge as possible. Code can also be deployed to Windows boxes.
- Azure Machine Learning -> Still the best and fastest time-to-market machine learning tool out there.
- Blob Storage and Storage Explorer -> This is home base: upload data, share data securely, munge tons of data. It's reliable, cheap, and awesome.
- Azure Data Factory -> You really need to understand data pipelining and engineering.
- Understanding of web security: SSL and OAuth.