Standardize Continuous Data Shape for Neural Networks

Hello World,

So this is an interesting problem.  You are collecting data from somewhere and you want to feed it into a neural network for classification.  There is one main problem: the shape of the data!  Neural networks (and really most models) require input of a specific, fixed shape; you can't just hand them something of ambiguous size.  There are tons of papers out there on dimensionality reduction, but very little that I could find on reducing data down to a specific, predetermined size.  This article explains my approach.

What is out there already?

Well, I found a ton of papers that do some form of fancy math on convolutions of the data with sliding windows, comparisons back and forth, and so on.  In fact, I started by reading a white paper (which I have since tossed out the window) that went into extreme depth on one such approach.  The main issue with these approaches is their complexity.

So what did I do?

I started with a more complex approach and ended up with this one.  Basically: slice your entire data set into even chunks based on the desired number of output points, find the peak or valley in each chunk, and that's your reduced data set.  I will later add a step that applies a moving average between the selected peaks, so the approach is really a combination of a moving average and a peak-oriented reduction.  It is fast to compute while still preserving the shape of the data.

Show me some Code!

Ok, bear with me a bit, I’m still learning this Python thing, but here we go…

import math
import numpy as np
import pandas as pd

def FindPeaksAndValleys(data, column_name, quantity):
    # Split the data into `quantity` even chunks and keep one
    # representative point (the chunk's minimum) from each chunk.
    convSize = math.floor(len(data.index) / quantity)
    time = []
    temp = []
    for i in range(0, quantity):
        dsp = i * convSize
        tmp = data.iloc[dsp : dsp + convSize].copy()
        # Positions (within the chunk) where the minimum value occurs.
        indx = np.where(tmp[column_name] == tmp[column_name].min())[0]
        # If the minimum occurs more than once, take the middle occurrence.
        indxR = indx[math.floor(indx.size / 2)]
        dIndx = tmp.index[indxR]
        time.append(dIndx)
        temp.append(tmp[column_name].loc[dIndx])
    pAv = pd.DataFrame({column_name: temp}, index=time)
    return pAv

pv0 = FindPeaksAndValleys(o_data, 'Temp', 30)    
pv1 = FindPeaksAndValleys(o_data, 'Temp', 100)
pv2 = FindPeaksAndValleys(o_data, 'Temp', 300)
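
If you want to eyeball the results yourself, a quick overlay plot does the trick.  This is just a sketch: it assumes matplotlib is installed and that o_data is a DataFrame with a 'Temp' column, as in the calls above.

import matplotlib.pyplot as plt

# Overlay the original series and the three reduced versions.
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(o_data.index, o_data['Temp'], color='lightgray', label='original (~35,000 points)')
ax.plot(pv2.index, pv2['Temp'], label='300 points')
ax.plot(pv1.index, pv1['Temp'], label='100 points')
ax.plot(pv0.index, pv0['Temp'], label='30 points')
ax.legend()
plt.show()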

There you have it: we reduce to 30, 100, and 300 data points.  I'll have to add the moving average section later, but overall this produces some nice results.  Here is a side by side going from ~35,000 data points to 300, then 100, and then 30…

[Figure: side-by-side comparison of the original ~35,000-point series and the 300-, 100-, and 30-point reductions]

Not so bad.  I'll write a follow-up article on how to blow that out into a feature set for neural networks.
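
For the curious, here is the kind of thing I have in mind for the moving-average step I keep mentioning: smooth the series before running the same chunk-and-pick reduction.  Treat this as a rough sketch rather than the final version; the SmoothThenReduce name, the window size, and the pandas rolling call are placeholder choices, not necessarily what will end up in the follow-up.

def SmoothThenReduce(data, column_name, quantity, window=5):
    # Rough sketch: apply a centered moving average first, then run the
    # same chunk-based reduction on the smoothed series.
    smoothed = data.copy()
    smoothed[column_name] = (
        data[column_name].rolling(window, center=True, min_periods=1).mean()
    )
    return FindPeaksAndValleys(smoothed, column_name, quantity)

pv_smooth = SmoothThenReduce(o_data, 'Temp', 100)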

Thanks,

~David
