Merging Data Sets in Python

Hello World,

So this article is inspired by a customer doing financial analysis who can only grab a certain amount of data at a time from the data steward's stores, in chunks based on time windows. As time is constantly moving, you occasionally get duplicate data across requests. If you attempt to grab exactly on the edges, you have a chance of missing something, so it's best to have a bit of an overlap and just deal with that overlap.
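To make the overlap concrete, here is a minimal sketch with two simulated pulls whose windows share one day. The column names, dates, and prices are made up for illustration, and pandas' built-in drop_duplicates stands in here for the np.unique approach used later in this article:

import pandas as pd

# Two simulated pulls whose time windows overlap on 2015-01-05,
# standing in for two requests against the data steward's store.
pull1 = pd.DataFrame({"Date": pd.date_range("2015-01-01", "2015-01-05"),
                      "Price": [10, 11, 12, 13, 14]})
pull2 = pd.DataFrame({"Date": pd.date_range("2015-01-05", "2015-01-09"),
                      "Price": [14, 15, 16, 17, 18]})

# Stack the pulls, then drop the duplicate row the overlap produced.
combined = pd.concat([pull1, pull2]).drop_duplicates(subset="Date")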

The Back Story

Let's talk just a little more deeply about the use case. There are two end places for this data. The first is blob storage, where it is consumed by a report that is operationalized on a nightly basis; the second is training data for machine learning algorithms. I was asked at one point: why use pandas? Ok, so if you are following this blog, you know I just started coding Python this week, but here's the deal. Pandas is a data frame library with a boatload of great functionality for manipulating data and the metadata associated with that data. It is also one of the most commonly used. The first rule of dealing with customer data is to treat it like it has a plague of messed-up everything in it and requires a titanic amount of work to put into a usable format. This is especially true when the data comes from a third-party system in a convoluted manner, as is the case with this financial data.

As such, pandas is perfectly set up for this. The first drop into blob storage will be a straight .csv, which is easily consumed from a reporting perspective, but another drop will be made of two data sets: a "featurized" data set as well as a lookup table for properties of that data set. All languages and machine learning libraries have these sorts of basic components. So be it Python with pandas, .NET with Deedle, or R with out-of-the-box data frames, it's just what has to be done to get the data into a consumable format.
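Just to illustrate the shape of those two drops, here is a quick sketch. The column names, values, and file names are hypothetical, not the customer's actual schema:

import pandas as pd

# Hypothetical raw pull; column names are made up for illustration.
raw = pd.DataFrame({"Ticker": ["MSFT", "MSFT", "AAPL"],
                    "Close": [46.45, 47.01, 109.33]})

# Drop 1: the straight .csv consumed by the nightly report.
raw.to_csv("report.csv", index=False)

# Drop 2: a "featurized" set (numeric codes in place of strings) plus a
# lookup table mapping each code back to its original value.
raw["TickerId"] = raw["Ticker"].astype("category").cat.codes
lookup = raw[["TickerId", "Ticker"]].drop_duplicates()
raw[["TickerId", "Close"]].to_csv("featurized.csv", index=False)
lookup.to_csv("lookup.csv", index=False)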

The Code

import pandas as pd
import numpy as np

b = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                  'B': ['B4', 'B5', 'B6', 'B7'],
                  'C': ['C4', 'C5', 'C6', 'C7'],
                  'D': ['D4', 'D5', 'D6', 'D7']},
                 index=[4, 5, 6, 7])


c = pd.DataFrame({'A': ['A7', 'A8', 'A9', 'A10', 'A11'],
                  'B': ['B7', 'B8', 'B9', 'B10', 'B11'],
                  'C': ['C7', 'C8', 'C9', 'C10', 'C11'],
                  'D': ['D7', 'D8', 'D9', 'D10', 'D11']},
                 index=[7, 8, 9, 10, 11])

# Stack the two frames; row index 7 now appears twice.
result = pd.concat([b, c])
# Positions of the first occurrence of each unique value in column A.
idx = np.unique(result["A"], return_index=True)[1]
# Select those rows and put them back in index order.
result.iloc[idx].sort_index()

So here we have two dummy data frames, one with 4 rows and the other with 5 rows; the duplicate row is at index 7. Pandas has a great concatenation operation. Note that you MUST have data frames whose columns line up appropriately. np.unique with return_index=True returns, alongside the unique values themselves, an array of positions marking where each unique value first occurs. iloc is simply a selection function that takes numerical positions, which is perfect here. Finally, we sort by index to ensure the result is in the correct order.
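As an aside, pandas can do the same deduplication without dropping down to numpy. This one-liner, reusing b and c from above, keeps the first row for each value in column A and is equivalent to the np.unique approach (my alternative, not what the original code uses):

# Keep the first occurrence of each value in column A, then restore index order.
result_alt = pd.concat([b, c]).drop_duplicates(subset="A").sort_index()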

What about something more real?

Alright, here is some stock data anybody can get, which more closely resembles the customer’s scenario.

# Pull MSFT daily quotes as a CSV straight into a data frame.
dataUrl = "http://ichart.finance.yahoo.com/table.csv?s=MSFT"
data = pd.read_csv(dataUrl)

# Simulate the overlap: "existing" is the previous pull, and "data" is the
# new, larger pull that contains those same 10 rows again.
existing = data[:10]
data = data[:50]

# Same trick as before, keyed on the Date column this time.
r = pd.concat([data, existing])
idx = np.unique(r["Date"], return_index=True)[1]
r.iloc[idx]
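One caveat: np.unique orders its result by the sorted unique values, so if the final order matters for the report drop, sort explicitly before writing. Assuming the merged frame keeps Date as ISO-formatted text (as the Yahoo feed returns it), something like this would finish the job; the output file name is just a placeholder:

# Put the merged rows in chronological order and write the blob-ready .csv.
merged = r.iloc[idx].sort_values("Date")
merged.to_csv("msft_merged.csv", index=False)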

Summary

And there you have it. The next article in this series will operationalize this with compute services and Azure Blob storage.
