Prepping Label Files for ML Training on a Specific Machine

Hello World!

You will likely run into this at some point: you are reading data from somewhere and it is relative-path based. That doesn't always help when loading data, especially if you store your data and your code in separate locations (which is common), if you are sharing data with a team, or if your data just lives somewhere totally different.

Anyways, this article will help you convert a .csv label file with actual named labels into a label file with full paths and numerical labels that can be more easily one-hot encoded during reading. Note that for deep learning this is often a two-step process. Step 1: convert from relative paths to machine-specific paths and numerical labels. Step 2: convert to a framework-specific storage format for the input reading pipeline (which varies from framework to framework). Here we cover Step 1. We will be using the CIFAR-10 data set, which can be downloaded from here: https://www.kaggle.com/c/cifar-10/data

The Labels File

We make a big assumption here that you already have a labels file. You should totally be working with a team to get one of these; this is where your value really is. Anyways, we have a label file. Here is a picture of it.

[Image: the base labels file, with an 'id' column and a 'label' column of class names]

This file is a .csv, but label files could come as .tsv, .ssv, or who knows what; this stuff arrives in all sorts of crazy formats if you don't control it. .csv files are usually the easiest to read and understand.
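If you do get handed one of those other delimiters, pandas can still read it via the sep parameter of pd.read_csv. A minimal sketch, using made-up tab-separated data in memory rather than a real file:

```python
import io

import pandas as pd

# Hypothetical tab-separated label data, standing in for a .tsv file on disk.
tsv_data = io.StringIO("id\tlabel\n1\tfrog\n2\ttruck\n")

# sep tells pandas which delimiter to split on; everything else works the same.
df = pd.read_csv(tsv_data, sep="\t", dtype={"id": object})
print(df)
```

The same call with sep=';' or sep=' ' handles semicolon- or space-separated files.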

Reading the Labels File with Pandas

So pandas is the easiest way to do all of this.

import pandas as pd

labels_file = 'C:/data/cifar_10/trainLabels.csv'
labels = pd.read_csv(labels_file, dtype={'id': object})

Here we simply define the path and call pd.read_csv after importing pandas. Notice that we specify the dtype for the id column. We do this because the CIFAR-10 data set names each image with a number, and pandas would otherwise import that column as integers. We know we need to convert each id to a path using string manipulation, so we want it read in as a string from the get-go to make things easier.
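To see the difference the dtype argument makes, here is a small sketch using made-up in-memory data rather than the real labels file:

```python
import io

import pandas as pd

# Hypothetical label data where the ids look like numbers.
csv_data = "id,label\nfrog,frog\n"
csv_data = "id,label\n1,frog\n2,truck\n"

# Without dtype, pandas infers the id column as integers.
inferred = pd.read_csv(io.StringIO(csv_data))
print(inferred["id"].dtype)  # int64

# With dtype={'id': object}, the ids come in as strings, ready for
# string concatenation later.
as_str = pd.read_csv(io.StringIO(csv_data), dtype={"id": object})
print(as_str["id"].dtype)  # object
```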

Replace id column with full path

This is the easiest part; pandas makes it easy. Here is the code.

labels['id'] = 'C:/data/cifar_10/train/' + labels['id'] + '.png'
print(labels.loc[0:4])

This will give you the following output

[Image: the first five rows, with the id column now holding full paths like C:/data/cifar_10/train/1.png]

Wonderful. All the code is doing is using pandas' nice built-in vectorized string operations to apply the manipulation to every row in one shot. Good thing we decided to read that column in as strings. The second line simply prints the first five rows to the console so we can double-check our work.
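As a tiny self-contained illustration of that one-shot concatenation (with made-up ids instead of the real labels file):

```python
import pandas as pd

# Hypothetical label data with string ids, like the CIFAR-10 labels file.
labels = pd.DataFrame({"id": ["1", "2", "3"],
                       "label": ["frog", "truck", "deer"]})

# Vectorized string concatenation: the prefix and suffix are applied
# to every row at once, no loop required.
labels["id"] = "C:/data/cifar_10/train/" + labels["id"] + ".png"
print(labels["id"].tolist())
# ['C:/data/cifar_10/train/1.png', 'C:/data/cifar_10/train/2.png', 'C:/data/cifar_10/train/3.png']
```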

Build String Label and Numerical Label Conversion Dictionary

Most one-hot encoding functions have a hard time determining that 'frog' corresponds to a specific output node. So we go ahead and build a dictionary to push things in the right direction and give them an easier path. This also lets us know for sure how the labels will be encoded, and we can use the same dictionary to decode inferences later.

# Unique class names from the label column
classes = list(set(labels['label']))
# Integers 0 through num_classes - 1, one per class
num_class = [x for x in range(len(classes))]
# Lookup dictionary: class name -> numerical label
class_to_num_dict = dict(zip(classes, num_class))
# Replace every named label with its number
proc_labels = labels.replace({'label': class_to_num_dict})
proc_labels.columns = ['examplepath', 'label']
print(class_to_num_dict)
print(proc_labels.loc[0:4])

A set holds the unique elements of a collection. When you slice a DataFrame by a column, you get back a Series, and we perform the set operation on that Series to get the unique values in it. These are the unique classes across our 50,000 labels (whew, that's easy). This is also a quick way to spot misspellings and other label inconsistencies.

We then convert the set into a list so the classes have a fixed order we can pair with our numbers (a set is unordered; and for the record, len() does work on a set, but a stable ordering is what we really need here).

We then use a list comprehension to create a list of integers from 0 up to the number of unique classes. In this case there are 10 classes, so we generate the list 0 through 9.

Next we call the zip function to create pairs, the first element being our key and the second our value; that is why classes comes first, since the class name is the lookup value we want to use for replacement. We then wrap that in the dict constructor to get a dictionary we can pass to pandas' .replace function, which swaps out everything in the label column using the newly created conversion dictionary.
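Since the same dictionary should let us decode inferences later, here is a minimal sketch of inverting it (using a made-up three-class dictionary rather than the real ten-class one):

```python
# Hypothetical conversion dictionary, like the one built above.
class_to_num_dict = {"frog": 0, "truck": 1, "deer": 2}

# Inverting key and value gives a decode dictionary for turning model
# outputs (predicted class indices) back into human-readable labels.
num_to_class_dict = {v: k for k, v in class_to_num_dict.items()}

print(num_to_class_dict[0])  # frog
```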

I like to rename the columns so 'id' becomes 'examplepath', since it now holds the path. The last two lines simply print out our dictionary and the first five rows of the processed labels so we can check our work.

Saving the Dictionary and Labels File

Ok, now we need to save the label file and dictionary appropriately.

proc_labels_path = 'C:/data/cifar_10/proc_train_labels.csv'
conversion_dict_path = 'C:/data/cifar_10/proc_train_classes_dictionary.csv'
# Save the processed labels without the row index
proc_labels.to_csv(proc_labels_path, index=False)
# Convert the dict to a list of (class, number) tuples so pandas can build a DataFrame
conv_df = pd.DataFrame(list(class_to_num_dict.items()))
conv_df.to_csv(conversion_dict_path, index=False)

The first two lines just specify where we are going to save these things. The next line is a simple DataFrame save, no big deal there, and the same goes for the last line. The only tricky line is the creation of the conv_df object: the list constructor takes the dict_items object and converts it to an actual list of tuples, which the pandas DataFrame constructor knows how to handle (I know, it's a bit of hand-holding, but it is what it is).

Testing the Writing, by reading and printing

Now we simply read back what we wrote and make sure it looks good.

test_proc_labels_read = pd.read_csv(proc_labels_path)
test_dict_read = pd.read_csv(conversion_dict_path)
print(test_proc_labels_read.loc[0:4])
print(test_dict_read)

So there we go; it looks great!
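One gotcha worth knowing: the saved dictionary CSV has generic column headers, so when you read it back you need to zip the columns into a dict again. A minimal round-trip sketch, using a made-up dictionary and a temp directory instead of the CIFAR-10 paths:

```python
import os
import tempfile

import pandas as pd

# Hypothetical conversion dictionary, standing in for class_to_num_dict.
class_to_num_dict = {"frog": 0, "truck": 1, "deer": 2}

# Save the dictionary as a two-column CSV, as done in the article.
path = os.path.join(tempfile.mkdtemp(), "classes_dictionary.csv")
pd.DataFrame(list(class_to_num_dict.items())).to_csv(path, index=False)

# The saved columns default to the headers '0' and '1';
# zip them back together to rebuild the dictionary.
read_back = pd.read_csv(path)
rebuilt = dict(zip(read_back["0"], read_back["1"]))
print(rebuilt == class_to_num_dict)  # True
```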

Next Steps 

Ok, so we got that down. The next step is to convert this file into a framework-specific format and write that format to a remote storage system, distributed across our compute cluster for training.
