Time Series Discovery with Python

Hello World,

This article is loosely based on a time series challenge I ran into with customer data.  I have fabricated 3 data files that reproduce the same challenge, and we will walk through the process of discovering that data.  The primary challenge in this data set is that it comes from a sleep study and the researchers left the date portion of the timestamp off.  This means that at midnight, the data wraps back to the beginning of the x-axis.  The second challenge is lining up the subjects to see if there is anything interesting about the time of day.  Yes, you could simply plot against the index that pandas generates, but I'm also interested in the actual time itself, since this is a study involving humans.

Import Packages

We are going to use a handful of packages.  Make sure to install these into your environment.

import numpy as np
import pandas as pd
import plotly
import plotly.plotly as py   # only needed for online plotting; in plotly 4+ this lives in chart_studio.plotly
import plotly.graph_objs as go
from datetime import datetime

NumPy is our numerical computing library, pandas is our data frame manipulator, Plotly is (in my opinion) the best charting library out there, and datetime gives us the date handling we need.

Load the Data

First, download the data from my Azure Storage: Subject1 Data, Subject2 Data, Subject3 Data.  A quick note about the data: it contains absolutely no real measurements; it is purely fabricated to demonstrate a few challenges.  The Temp field is simply a Gaussian distribution centered at 10.
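
For illustration only, here is a rough sketch of how data like this could be fabricated.  The column names (Time and Temp) match the real files; the date range, sample rate, standard deviation, and output filename are assumptions I made up for this example.

# Illustration only: roughly how data shaped like the subject files could be fabricated.
import numpy as np
import pandas as pd

rng = pd.date_range("2017-01-01 22:00:00", periods=8 * 3600, freq="s")  # ~8 hours of 1 Hz samples (assumed)
fake = pd.DataFrame({
    "Time": rng.strftime("%I:%M:%S %p"),                       # 12-hour clock with no date -- the core challenge
    "Temp": np.random.normal(loc=10, scale=1, size=len(rng))   # Gaussian centered at 10
})
fake.to_csv("subject1_fake.csv")   # hypothetical filename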

Below is the code to load the data.  There isn’t much special about this.

#Prep Notebook for offline plotting
plotly.offline.init_notebook_mode()

#Build paths
basePath = "C:\\data\\Fabricated\\"
subj1_Path = "subject1.csv"
subj2_Path = "subject2.csv"
subj3_Path = "subject3.csv"

#Load Data
subj1Data = pd.read_csv(basePath + subj1_Path, index_col = 0)
subj2Data = pd.read_csv(basePath + subj2_Path, index_col = 0)
subj3Data = pd.read_csv(basePath + subj3_Path, index_col = 0)
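
Before manipulating anything, it is worth a quick sanity check that the files loaded the way we expect.  This is just a sketch; the column names (Time, Temp) are the ones used later in this post, and the output will depend on your files.

# Quick look at what we loaded: shape, dtypes, and the first few rows
for name, df in [("subject1", subj1Data), ("subject2", subj2Data), ("subject3", subj3Data)]:
    print(name, df.shape)
    print(df.dtypes)
    print(df.head(3), "\n")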

Helper Manipulation Functions

I'm not a fan of repeating the same code over and over, so it's best to knock out all of the manipulation code in one place.  Here we use 2 helper functions.

def Day7 (x):
    # Samples before noon belong to the "morning after", so shift them to the fabricated May 7;
    # everything from noon onward stays on the fabricated May 6.
    if x.hour < 12:
        return datetime(year = 1975, month = 5, day = 7, hour = x.hour, minute = x.minute, second = x.second)
    return x

def CleanData (data):
    # Parse the 12-hour time string, pin it to a fabricated date, then push pre-noon samples a day forward.
    # Note: nothing (not even a space) may follow the "\" line continuations.
    data['Time'] = data['Time'].map(lambda x : pd.to_datetime(x, format="%I:%M:%S %p").time())  \
        .map(lambda x: datetime(year = 1975, month = 5, day = 6, hour = x.hour, minute = x.minute, second = x.second))  \
        .map(lambda x: Day7(x))
    return data.set_index(data['Time']).drop('Time', axis=1)

Let's start with the function "CleanData".  I'm a huge fan of higher-order functions (F# background), and .map lets us clean the data in a way that reads well.  Also notice that I use "\" to continue a line; one caveat is that nothing, not even a space, may follow the backslash.  Putting a new higher-order function on each line gives a style more in line with what we do in R and F#, making the manipulation code easier to read.
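
If the backslashes feel fragile, an equivalent style is to wrap the chain in parentheses, which lets you break lines freely.  A minimal sketch (CleanDataParens is just a name I made up; the logic is the same as CleanData):

def CleanDataParens(data):
    # Same logic as CleanData, but parentheses allow line breaks without "\"
    data['Time'] = (
        data['Time']
        .map(lambda x: pd.to_datetime(x, format="%I:%M:%S %p").time())
        .map(lambda x: datetime(year=1975, month=5, day=6,
                                hour=x.hour, minute=x.minute, second=x.second))
        .map(Day7)
    )
    return data.set_index(data['Time']).drop('Time', axis=1)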

The real challenge in this data set is handling the time.  The first map function converts the string into a datetime.  Notice we use %I instead of %H: %I parses the 12-hour clock, and combined with %p (the AM/PM marker) it gives us the correct 24-hour time once parsed.
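
As a quick illustration (the sample strings here are made up), this is what the format string does:

# "%I" = 12-hour clock, "%p" = AM/PM marker; the parsed result is a 24-hour time
print(pd.to_datetime("09:15:30 PM", format="%I:%M:%S %p").time())  # 21:15:30
print(pd.to_datetime("09:15:30 AM", format="%I:%M:%S %p").time())  # 09:15:30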

The second map function takes that result and builds a new datetime that includes a fabricated year, month and day.  This matters because, remember, these studies straddle the boundary between two days; fabricating a date is the first step toward keeping our data lined up.

The third map function just applies Day7.  The reason Day7 exists as a named function is that a lambda does not allow a multi-line body in Python; lame, but oh well.  The idea is that folks go to sleep at night (roughly hours 19-24), so we take any sample recorded before noon and push it one day ahead of the samples recorded after noon.  Of course there are assumptions baked in, namely that subjects wake up before noon and go to sleep after noon.  As long as you aren't dealing with teenagers you should be OK.
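
A quick demonstration of the shift (the specific times are made up):

# 11 PM stays on the fabricated May 6; 3 AM rolls forward to May 7
late = datetime(year=1975, month=5, day=6, hour=23, minute=0, second=0)
early = datetime(year=1975, month=5, day=6, hour=3, minute=0, second=0)
print(Day7(late))   # 1975-05-06 23:00:00
print(Day7(early))  # 1975-05-07 03:00:00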

Finally, we promote the time to be the index and drop the Time column altogether.  The reason is that we want to take advantage of pandas' windowing and computational tools on series.  If your time is your index, pandas manages the conversion between datetime and number and back again for you, and everything is happy.  If you don't set the date as your index, several of these operations simply raise an exception complaining that they don't know how to handle the data.
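
For example, with a DatetimeIndex in place you can do time-based selection and time-based rolling windows directly.  A sketch (the time range and 60-second window are just illustrations, and I work on a copy so the real cleaning step later is unaffected):

# Illustrative only: what a DatetimeIndex buys you
clean = CleanData(subj1Data.copy()).sort_index()   # sort so time-based slicing/rolling behaves

# Slice by clock time straight off the index (assumes the data covers this hour)
evening = clean.loc['1975-05-06 22:00:00':'1975-05-06 23:00:00']

# Rolling window specified as a time offset instead of a row count
smoothed = clean['Temp'].rolling('60s').mean()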

Clean and Window

It turns out that the data is really noisy, so I wanted to plot a few different window strategies to see which one smooths things out while maintaining the core shape of the data.

#Format Raw Data for Render
subj1Data = CleanData(subj1Data)
subj2Data = CleanData(subj2Data)
subj3Data = CleanData(subj3Data)

#Windows Rocks
window = 1000
subj1Data_W = subj1Data.rolling(window=window, center = True).mean().dropna()
subj2Data_W = subj2Data.rolling(window=window, center = True).mean().dropna()
subj3Data_W = subj3Data.rolling(window=window, center = True).mean().dropna()

Basically we are doing a rolling mean with a window size of 1,000 samples, roughly 1,000 seconds of data.  I keep center=True since we will be plotting this on top of the raw data, so the smoothed curve should not lag behind it.  We drop the NaNs because a centered rolling mean cannot be computed at the edges of the series, which inherently shrinks the data set a little, and it's just good practice to drop nulls before plotting.
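
If you want to compare a few window sizes before settling on 1,000, here is a quick sketch; the candidate sizes are arbitrary.

# Try several window sizes and keep each smoothed series for comparison
candidates = [100, 500, 1000, 5000]
smoothed = {
    w: subj1Data.rolling(window=w, center=True).mean().dropna()
    for w in candidates
}
for w, df in smoothed.items():
    print(w, len(df), df['Temp'].std())   # rows remaining and remaining variability per window size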

Prep for Charting

Alright!  Lets get this stuff ready for rendering…

def CreateRawTrace (data, c):
    # Thin dashed line for the raw, noisy data
    return go.Scatter(
        x = data.index,
        y = data['Temp'],
        line = dict(
            width = 1,
            color = c,
            dash = 'dash')
        )

def CreateWindTrace (data, c):
    # Thicker solid line for the windowed (smoothed) data
    return go.Scatter(
        x = data.index,
        y = data['Temp'],
        line = dict(
            color = c,
            width = 2)
        )

subj1_Trace_R = CreateRawTrace(subj1Data,  'rgb(205, 12, 24)')
subj2_Trace_R = CreateRawTrace(subj2Data, 'rgb(12, 205, 24)')
subj3_Trace_R = CreateRawTrace(subj3Data, 'rgb(12, 24, 205)')

subj1_Trace_W = CreateWindTrace(subj1Data_W,  'rgb(205, 12, 24)')
subj2_Trace_W = CreateWindTrace(subj2Data_W, 'rgb(12, 205, 24)')
subj3_Trace_W = CreateWindTrace(subj3Data_W, 'rgb(12, 24, 205)')

data = [subj1_Trace_R, subj2_Trace_R, subj3_Trace_R,
        subj1_Trace_W, subj2_Trace_W, subj3_Trace_W]

plotly.offline.plot(data, filename='subjects')

I created 2 functions so we can easily switch out the trace generation logic.  The first generates thin dashed lines for the raw data and the second draws a thicker solid line for the smoothed data.  We take the color in as a parameter because each subject's raw and windowed traces should share a color; if we let Plotly pick, we would end up with 6 different colors.  Create an offline plot and there you go: a nice HTML5/CSS/JavaScript based plot where you can zoom into specific curves and inspect phenomena.
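
If you want axis labels and a title on that plot, you can wrap the traces in a Figure with a Layout.  A minimal sketch; the label text and output filename are mine:

layout = go.Layout(
    title = 'Sleep Study Temperature (raw vs. windowed)',   # made-up title
    xaxis = dict(title = 'Time'),
    yaxis = dict(title = 'Temp')
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.plot(fig, filename='subjects_labeled')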

You can also switch Plotly into online mode and upload the chart to Plotly's hosted service if you want to push it to a blog or share it.
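
A rough sketch of what that looks like with the plotly.plotly module imported earlier (in plotly 4+ the same calls live in chart_studio.plotly; the username and API key are placeholders):

# Online mode: authenticates against Plotly's cloud and returns a shareable URL
py.sign_in('your_username', 'your_api_key')   # placeholders, not real credentials
url = py.plot(data, filename='subjects', auto_open=False)
print(url)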

[Figure: GaussianTimeSeriesPython, the resulting plot of raw (dashed) and windowed (solid) temperature traces for the three subjects]

Summary

We covered quite a bit here: lambda functions, more plotting, how to deal with time series in Python, and a bit of pandas.  Making some good progress.
