Hello World,

So there are a ton of articles out there on the theory of Reinforcement Learning, but very few with an actual application. I watched a few lectures from Berkeley, read a few articles by NVIDIA, and thought, “Well, let's just give this a shot.” Eight hours later, this is what I had.

Herby V1 simply learns to go forward as much as possible while avoiding obstacles.

**Let's start with the hardware.**

You can see it is clearly 90% cardboard and super glue. Really, you don't need to spend a ton of money to build something smart. Just start building and see where you can get with what you have on hand. The bill of materials is:

- Arduino Uno
- 4 HC-SR04 Distance Sensors
- 2 Continuous Rotation Servos
- 1 Battery Pack
- Cardboard
- Super Glue
- Leftover robot chassis from an Arduino robotics kit I bought a year ago

Total cost: I had everything on hand, so free. The key limiting factor in this hardware build is actually not the brains, it's the mechanics. Two continuous rotation servos for drive turn out not to be optimal. Learning solves the problem of tuning for whichever one has a bit more power, but commanding a specific number of degrees per the documentation to force a specific amount of spin does not actually work, and each motor is a little different, so each wheel gets slightly more or less power. To my surprise, the Arduino is more than capable of on-the-fly machine learning; V2 will push this device even harder.

**Reinforcement Learning Refresher**

The basics of reinforcement learning: do something bad, get punished; do something good, get a reward. The cumulative effect of these rewards and punishments is what we want to optimize. NVIDIA's blog has a great reference on learning to ride a bicycle, as well as deeper, more sci-ency explanations.

So how do we take all this talk of Q-values, value functions, and state pairs and make something real out of it? At the end of the day it's quite simple. You just need to define three things:

- What actions can the robot make?
- What is the current state of the environment?
- What does the robot get rewarded/punished for?

Herby V1 can only turn left, turn right, or go full forward (actions). The state is defined as the distances from the three sensors during the previous loop, plus the action we took (states). The rewards are +1 for forward motion, +0.5 for left or right motion, and -2 for getting too close to objects (rewards).
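As a sketch, the three definitions above might look like this in Python (the real Herby runs Arduino code; the reward values mirror the article, while the function names and the distance threshold are illustrative assumptions):

```python
# Hypothetical sketch of Herby's action/state/reward definitions.
# Reward values are from the article; names and the 20 cm threshold
# are illustrative, not from the actual Herby code.

ACTIONS = ["left", "forward", "right"]

def reward(action, min_distance_cm, too_close_cm=20.0):
    """+1 for forward, +0.5 for turning, -2 when too close to an obstacle."""
    if min_distance_cm < too_close_cm:  # assumed punishment threshold
        return -2.0
    return 1.0 if action == "forward" else 0.5

def make_state(prev_distances, prev_action):
    """State = previous loop's three sensor distances plus the action
    taken, with the action one-hot encoded."""
    one_hot = [1.0 if a == prev_action else 0.0 for a in ACTIONS]
    return list(prev_distances) + one_hot
```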

So how does this actually work? It's quite simple. Given our previous state, predict the potential reward for each action, then take the action with the greatest predicted reward. The learning comes in when, after the decision is made, we read the new state, calculate the actual reward for that decision, and use it as our y-actual to update a supervised ML model. The model is updated by stochastic gradient descent in this case.
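The predict/act/update loop described above can be sketched as follows, assuming one linear model per action (a minimal illustration in Python, not Herby's actual Arduino code; the learning rate and names are assumptions):

```python
# Minimal sketch of the act/learn loop: one linear model per action,
# updated by stochastic gradient descent on the observed reward.

ACTIONS = ["left", "forward", "right"]

def predict(weights, state):
    """Linear prediction: w . x, with a bias folded in as a trailing 1."""
    return sum(w * x for w, x in zip(weights, state + [1.0]))

def choose_action(models, state):
    """Greedy policy: pick the action with the highest predicted reward."""
    return max(ACTIONS, key=lambda a: predict(models[a], state))

def sgd_update(weights, state, actual_reward, lr=0.01):
    """One SGD step on the squared error between the predicted and
    actual reward for the action we just took."""
    x = state + [1.0]
    error = predict(weights, state) - actual_reward
    return [w - lr * error * xi for w, xi in zip(weights, x)]
```

Each loop iteration would call `choose_action`, execute the motion, compute the actual reward, then call `sgd_update` on that action's weights only.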

The model I chose for Herby V1 is a simple linear regression. It predicts the value of each reward with literally the simplest form of a linear regression. The code for Herby V1 has been open sourced and can be found here. Primary work will likely happen on the Fort Lauderdale ML UG GitHub once I get added.

**Key Brain Improvements**

There are two key improvements I think will substantially increase Herby's performance. But before we dive into those, I want to discuss how Herby currently performs. Herby does more or less as expected. I initialize the weights so that Herby starts by going directly forward, which lets it learn about the maximum reward. Herby then runs into something and learns about the punishments, and can then turn left or right. The problems become obvious: Herby often gets stuck, or can occasionally decide that spinning indefinitely will yield the best cumulative reward. So let's think about this. I built Herby in less than 8 hours, so I made some very large simplifications; let's begin by improving those.

**Historical Understanding of Environment and Decisions**

Currently Herby only knows about the now and does not learn about the actions and states leading up to a particular reward. The decision policy is based purely on the immediate reading, which turns out to be poor. If you are moving directly toward a wall with no understanding of previous states, you can only develop a "threshold"-type decision value, whereas with history you could develop a "rate of change" threshold as well as an absolute one. This would let Herby make better decisions earlier and avoid getting stuck.
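One simple way to implement the "rate of change" idea is to append per-sensor deltas to the state vector, so the model can react to "closing fast" and not just "close." A minimal sketch, assuming a fixed loop period (this is one possible implementation of the idea, not Herby's actual code):

```python
# Sketch: augment the state with a rate-of-change feature per sensor.
# dt is the assumed loop period in seconds; a real build would measure it.

def with_deltas(curr_distances, prev_distances, dt=0.1):
    """Append each sensor's rate of change (cm/s) to the raw distances."""
    deltas = [(c - p) / dt for c, p in zip(curr_distances, prev_distances)]
    return list(curr_distances) + deltas
```

A negative delta means the obstacle is approaching, which the linear model can learn to punish well before the absolute-distance threshold trips.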

**Increase Model Complexity**

Currently the model is a simple first-order linear regression, basically a straight-line predictor with no feature combinations. The simplest improvement would be to move up to a second-order polynomial regression. The optimal approach is likely a neural network with a few hidden layers; more optimal still might be a convolutional neural network that treats each sensor as a time-series channel. We have to balance these wants against the fact that it's an Arduino. I think the Arduino will do fine with the second-order polynomial regression; if we bump up to neural networks, I will probably add a Raspberry Pi and communicate via a two-wire master/slave protocol. I'm a fan of start simple, see where you get, and evaluate the next highest-reward item; we will do the second-order polynomial next, as it is easier to implement and this is a hobby, not a job :D.
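The second-order upgrade amounts to expanding the input vector before the same linear/SGD machinery sees it: raw features, their squares, and pairwise products. A sketch of that expansion (illustrative; cheap enough to run on an Arduino):

```python
# Sketch of a second-order polynomial feature expansion. The linear
# SGD update then works unchanged on the expanded vector, but can now
# capture squared terms and feature interactions.

def poly2_features(x):
    """[x1..xn] -> [x1..xn, x1*x1, x1*x2, ..., xn*xn] (upper triangle)."""
    feats = list(x)
    for i in range(len(x)):
        for j in range(i, len(x)):
            feats.append(x[i] * x[j])
    return feats
```

For n raw features this produces n + n(n+1)/2 features, so three sensor distances expand to nine, which is still tiny for an Uno.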

**Sample Rate**

Currently Herby samples and updates far too frequently. This leads to over-representation, or "skewed" data, in our model. Think of it this way: a million examples of going in circles yielding 0.5 means we have converged on an understanding of that, while our model may still believe forward yields only 0.23, because we have not sampled or run those examples enough for our learning rate to converge on the answer. Also note that the prediction for forward is in part a guess at what we think our next state will be given the current one, so it forms an "invisible wall," so to speak, at the threshold set by our punishment distance.
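One fix is simply to throttle how often the model updates, so a repeated behavior can't flood the training data between meaningful state changes. A minimal sketch (interval value is illustrative, not tuned):

```python
# Sketch: rate-limit model updates. Sampling can stay fast for control,
# but learning steps fire at most once per interval, which keeps any one
# (state, action) pair from being massively over-represented.

class ThrottledUpdater:
    def __init__(self, interval_ms=250):
        self.interval_ms = interval_ms
        self.last_update_ms = -interval_ms  # allow an immediate first update

    def should_update(self, now_ms):
        """Return True (and reset the timer) if enough time has passed."""
        if now_ms - self.last_update_ms >= self.interval_ms:
            self.last_update_ms = now_ms
            return True
        return False
```

On the Arduino side the equivalent check would be driven by `millis()`.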

**Model Weight Initialization**

The next ideal item would be to initialize our weights with a partially trained model from a good training run. This helps ensure we begin with a positive balance between the various feature combinations, so the bot spends its time learning its environment rather than its immediate controls. Think of learning to ride a bicycle as a four-year-old rather than a two-year-old: you start with an already-learned basis of balance and motor control that is superior to the two-year-old's, and therefore converge on riding faster.
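In practice this just means persisting the per-action weight vectors after a good run and loading them at startup. A minimal sketch (the text file format and function names here are illustrative assumptions, not Herby's actual mechanism):

```python
# Sketch: save/restore the per-action weight vectors so a new run can
# start from a good earlier run instead of from zeros.

def save_weights(models, path):
    """Write one 'action:w1,w2,...' line per action."""
    with open(path, "w") as f:
        for action, w in models.items():
            f.write(action + ":" + ",".join(str(v) for v in w) + "\n")

def load_weights(path):
    """Read the weights back into an {action: [weights]} dict."""
    models = {}
    with open(path) as f:
        for line in f:
            action, vals = line.strip().split(":")
            models[action] = [float(v) for v in vals.split(",")]
    return models
```

On a bare Arduino the same idea would mean stashing the weights in EEPROM between runs.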

**Conclusions and Summary**

With a set of three simple distance sensors and three actions, it's quite incredible, I think, what you can do with ML and an Arduino. I think the best bet is to push this to its limits, then iterate and decide how to make it better. My prediction is that I will cap out at second/third-order polynomials and a historical window of 30 seconds. I will then bump up to either a Raspberry Pi or a Tegra TX-1. Either solution will receive a camera and some OpenCV love :D.

This article came in handy. Thanks for writing and sharing.

Hi, after having carefully read the code for a while, I still don't get why you scale distances (i.e. divide by 10) before using them? Thanks for the answer.

It is a super naive way to normalize my inputs (should have done min/max), but the one-hot encoding of the states plus the very large distance range of the sensors gives uneven learning across input features; therefore I simply divided by 10 (shoulda min/maxed) so everything is on roughly the same scale.
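For reference, the min/max normalization the reply suggests would look something like this (a sketch; the 2–400 cm range is the commonly cited HC-SR04 sensing range, assumed here rather than taken from Herby's code):

```python
# Sketch of min/max normalization for a sonar reading: map each
# distance into [0, 1] using the sensor's known range, instead of
# the divide-by-10 shortcut.

def min_max(value_cm, lo=2.0, hi=400.0):
    """Scale a distance in cm to [0, 1]; HC-SR04 roughly reads 2-400 cm."""
    value_cm = max(lo, min(hi, value_cm))  # clamp out-of-range readings
    return (value_cm - lo) / (hi - lo)
```

This puts the distances on the same 0–1 scale as the one-hot action features, so no single input dominates the gradient steps.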