Dealing with Pesky Image Names in COCO

Hello World!

If you are not familiar with Microsoft COCO (MS COCO), you should be. It's a treasure trove of data for your learning pleasure! There is just one pesky problem: the annotation records that ship with MS COCO identify each image by its image ID rather than by its actual file name, which makes it awkward to find the files for training and testing. That sounds fine, except that the downloaded image files carry a prefix and a run of zero padding in front of the ID! In this article we will go through how to get the file names ready.

Getting the Data Set

This is a big data set, so I like to open a bash prompt (yes, a Linux bash prompt) on Windows and run wget with the download link. Alternatively, you can just follow the download link and let your browser handle it (this kept crashing my box, though): http://mscoco.org/dataset/#download

You need to download the images as well as the annotations file.

Opening the Annotations File with Python

If you haven't already, set up Visual Studio Code with CNTK by following the previous article, which covers the details. Once your environment is up and running with your .vscode folder and settings/launch files, you can execute code interactively.

The following code opens the annotations into a pandas data frame.

#%% imports
import pandas as pd
import json
import os

ann_base_dir = 'C:/data/coco/instances_train-val2014/annotations/'
train_lbls = 'instances_train2014.json'
test_lbls = 'instances_val2014.json'

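#%% read data
# Load the validation annotations; the with block closes the file afterwards.
with open(os.path.join(ann_base_dir, test_lbls), 'r') as f:
    raw_json = json.load(f)
# Flatten the annotation records into a data frame.
# pd.json_normalize is the top-level name in pandas 1.0 and later.
ann_df = pd.json_normalize(raw_json['annotations'])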

You can then add another cell and simply view the contents of the data frame.

Notice that image_id is just a numeric ID. When you look at the downloaded data, however, the files are named like COCO_val2014_000000someid.jpg. This stinks.
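
If you would rather see the relationship spelled out, the file name can be rebuilt from the ID. This is a minimal sketch, assuming the 2014 download pads the ID with zeros to 12 digits (consistent with the 000000someid pattern above); coco_file_name is a hypothetical helper name, not part of any COCO tooling.

```python
def coco_file_name(image_id, split='val2014'):
    """Build the downloaded file name for an image id.

    Assumption: names look like COCO_<split>_<12-digit zero-padded id>.jpg.
    """
    return 'COCO_{}_{:012d}.jpg'.format(split, image_id)


print(coco_file_name(42))
# prints COCO_val2014_000000000042.jpg
```

The rest of this article goes the other way and strips the prefix and padding off the files themselves.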

List Comprehension is Neat

So I did a quick test using a list comprehension before I let my code loose on the big data set. Here is the code to quickly test it.

#%% Remove Leading Zeros from Image Files
base_dir = 'C:/data/coco/val2014/val2014/'
[s.split('_')[2].lstrip("0") for s in os.listdir(base_dir)]
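
To see what that expression does to a single name, here is a small worked example. The sample file names are made up for illustration, and strip_coco_prefix is a hypothetical helper name.

```python
def strip_coco_prefix(file_name):
    """Return the file name with the COCO prefix and zero padding removed."""
    # 'COCO_val2014_000000000042.jpg' -> ['COCO', 'val2014', '000000000042.jpg']
    last_part = file_name.split('_')[2]
    # lstrip("0") drops every leading zero but leaves zeros inside the id alone.
    return last_part.lstrip("0")


samples = ['COCO_val2014_000000000042.jpg', 'COCO_val2014_000000100500.jpg']
print([strip_coco_prefix(s) for s in samples])
# prints ['42.jpg', '100500.jpg']
```

One edge case worth knowing: a name whose ID is all zeros would come out as just .jpg, so it is worth checking for if such an ID exists in your copy of the data.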

Updating the File Names

Below is the code that actually goes through every file and updates the file name, using the technique we tested with the list comprehension.

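#%%
for file_name in os.listdir(base_dir):
    parts = file_name.split('_')
    if len(parts) != 3:
        continue  # skip files that are already renamed or are not COCO images
    removed_zeros = parts[2].lstrip("0")
    os.rename(os.path.join(base_dir, file_name), os.path.join(base_dir, removed_zeros))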
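
Renaming tens of thousands of files in place is hard to undo, so a dry run can be reassuring. The following is a sketch of one way to split the job into a planning step and an apply step; plan_renames and apply_renames are hypothetical names introduced here, not something from the original code.

```python
import os


def plan_renames(directory):
    """Return (old_name, new_name) pairs without touching any files."""
    plan = []
    for name in sorted(os.listdir(directory)):
        parts = name.split('_')
        if len(parts) != 3:
            continue  # not in the COCO_<split>_<id>.jpg shape, leave it alone
        plan.append((name, parts[2].lstrip("0")))
    return plan


def apply_renames(directory, plan):
    """Rename each file in the plan, refusing to overwrite an existing file."""
    for old_name, new_name in plan:
        target = os.path.join(directory, new_name)
        if os.path.exists(target):
            raise FileExistsError(target)
        os.rename(os.path.join(directory, old_name), target)
```

You could print the first few entries of plan_renames(base_dir) to eyeball them, then pass the full plan to apply_renames(base_dir, plan) once they look right.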

Summary

Alright! That's it, folks. Now you are ready to create your label files, and you know that each image name in the directory is simply the image_id you have plus .jpg at the end. Happy deep learning!
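
Before building label files, it may be worth confirming that every image_id really does have a matching file. Here is a minimal sketch of such a check; missing_images is a hypothetical helper, and it assumes the files have already been renamed to image_id plus .jpg.

```python
import os


def missing_images(image_ids, directory):
    """Return the ids whose '<image_id>.jpg' file is not in the directory."""
    on_disk = set(os.listdir(directory))
    return sorted(i for i in set(image_ids) if '{}.jpg'.format(i) not in on_disk)
```

With the data frame and folder from earlier, missing_images(ann_df['image_id'], base_dir) should come back as an empty list if the rename went through cleanly.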
