How to use any Kaggle dataset in Google Colab

Machine Learning and Deep Learning are about studying and trying. Practising and experimenting are the best ways to understand and digest all the theoretical concepts. When practising, it is essential to build our datasets properly. Instead of going entirely from scratch, it is wiser to use already available data and focus more on data processing and algorithms.

Photo by Peter Bravo de los Rios on Unsplash

Let’s try to show a practical case where we want to experiment with Convolutional Neural Networks (CNNs) and try a transfer learning approach using ResNet50. This post will not show the training phase but only the data preparation. In our case, we need to find a dataset containing not one of the fundamental object types (e.g., car, animal) but something else, so that we can use transfer learning for the first part of the network and add our own custom layers on top to identify this custom class. We decide that the custom class is pizza, and the purpose of our CNN is to classify something as pizza or not pizza. Good, now it’s time to find a dataset with images of pizzas and !pizzas (oops, this was a joke, I mean not pizzas). The standard place to search for datasets is (no surprise) Kaggle.
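To make the plan a bit more concrete before we go hunting for data, here is a minimal sketch of what such a transfer learning model could look like in Keras. Treat it only as an illustration of the idea: the 224 by 224 input size, the frozen base and the small custom head are my assumptions, and the actual training stays out of scope for this post.

import tensorflow as tf

# Sketch of the intended setup (assumptions: 224x224 RGB inputs,
# a frozen ResNet50 base, and a small custom head for pizza / not pizza).
base = tf.keras.applications.ResNet50(
  include_top=False,        # drop the original ImageNet classifier
  weights="imagenet",
  input_shape=(224, 224, 3),
  pooling="avg",
)
base.trainable = False      # freeze the pretrained part

model = tf.keras.Sequential([
  base,
  tf.keras.layers.Dense(128, activation="relu"),   # our custom layers
  tf.keras.layers.Dense(1, activation="sigmoid"),  # pizza or not pizza
])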

Before going into action, we could also mention that an ideal way to start practising is using Google Colab. There is no need to buy a laptop with a GPU, subscribe to AWS or start installing libraries on our Ubuntu machine. Our mission is to learn Deep Learning; everything else should be ready for us. Colab is a free service, a publicly hosted Jupyter Notebook, provided by Google.

So, how do we download a dataset and use it in Colab? Is it a complicated procedure? Fear not, my friends!

First, we need to create a new token in kaggle.com. This is nothing more than a small text file with our API credentials, which allows 3rd party applications to connect to Kaggle on our behalf. We sign in to Kaggle and go to the “Settings” page. In the section of the page with the header “API”, there is a button “Create New Token”. By pushing the button, Kaggle creates the token and lets us download it to our local hard disk. So far, so good!
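If you are curious about what is inside, kaggle.json is just a one-line JSON file along these lines (the values below are placeholders, of course):

{"username": "your_kaggle_username", "key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}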

Now we need somehow to upload the token to Colab. This is easy:

from google.colab import files
files.upload()  # opens a file picker so we can choose kaggle.json

The execution of this block will show a pop-up window where we can select the kaggle.json file that we downloaded as a token. We select the file, which most of the time will be located in the Downloads folder, and upload it to Colab. The file lands in the /content folder, but we need to move it somewhere else in order to use it. Let’s run the following commands:

! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

Well, we create a hidden folder in our home directory (where? in the cloud Colab environment, of course), we copy the token into it, and we chmod it to 600, which means that we can read and write the file while other users have no access to it. Neat…
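If we want to double-check that the token landed where it should, a quick listing does the job:

! ls -la ~/.kaggle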

I have found you a good dataset with photos of pizzas and !pizzas. The location is https://www.kaggle.com/datasets/carlosrunner/pizza-not-pizza. No, downloading it to your laptop and then uploading it to Colab is not an elegant idea; I would reject it. On the other hand, there is a nice Python library called Kaggle (lol) that can do the trick for you. Let’s install it in Google Colab:

! pip install -q kaggle

Now, if we take the username and the dataset name from the last part of the URL, we can use the kaggle command to do the magic for us. And yes, since we have set up the token, the Google Colab virtual environment has access to Kaggle. Then we uncompress the contents.

! kaggle datasets download -d carlosrunner/pizza-not-pizza
! unzip /content/pizza-not-pizza.zip

Is everything okay so far? Are we working blindly? I need to get a hint about our data. With the following snippet, we go through the images and check their shapes.

import os

import numpy as np
from PIL import Image

for _, _, filenames in os.walk('/content/pizza_not_pizza/pizza'):
  for image in filenames:
    a = Image.open(f'/content/pizza_not_pizza/pizza/{image}')
    print(np.asarray(a).shape)

Output:

...
(512, 512, 3)
(384, 512, 3)
(512, 512, 3)
(512, 384, 3)
(512, 512, 3)
(512, 512, 3)

Okay, I can see many images, in RGB colour (3 channels), in different dimensions, and no labels anywhere. Now, we need to dive deeper to create the appropriate dataset.

Let’s first access the folder using Python and see the contents:

import pathlib

data_dir = pathlib.Path("/content/pizza_not_pizza/")
list(data_dir.iterdir())

Output:

[PosixPath('/content/pizza_not_pizza/not_pizza'),
 PosixPath('/content/pizza_not_pizza/pizza'),
 PosixPath('/content/pizza_not_pizza/food101_subset.py')]

Hmm, there are no labels, but the data are organised nicely into two categories in different folders. There is also a Python script that I need to get rid of. Easy:

! rm /content/pizza_not_pizza/food101_subset.py

Let’s count how many images we have in total.

# How many images do we have in total?
image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)

Hmm, it seems that we have 1996 images. Not bad.
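If we also want to be sure the two classes are balanced, a small variation of the same glob does it (reusing the data_dir and folder names from above):

# Count images per class to check the balance between the two folders
for class_dir in ['pizza', 'not_pizza']:
  count = len(list(data_dir.glob(f'{class_dir}/*.jpg')))
  print(class_dir, count)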

Let’s see what they look like:

pizza = list(data_dir.glob('pizza/*'))
Image.open(str(pizza[0]))

It seems delicious, don’t you agree?

Based on our research, it is preferable to have the same size for all images. In our case, ResNet50 likes images of 224 by 224 pixels. Let’s also define the batch size.

batch_size = 32
image_height = 224
image_width = 224

Now it’s time for the big step: creating the training and the validation datasets (note that we also import TensorFlow here, since we have not done so yet):

import tensorflow as tf

training_dataset = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(image_height, image_width),
  batch_size=batch_size)

validation_dataset = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(image_height, image_width),
  batch_size=batch_size)

Mmm, let me check that the classes are there:

class_names = training_dataset.class_names
print(class_names)

Output:

['not_pizza', 'pizza']

All good.

You need to see it to believe it, correct?

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for images, labels in training_dataset.take(1):
  for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(images[i].numpy().astype("uint8"))
    plt.title(class_names[labels[i]])
    plt.axis("off")

Yep, all good. Now I think we are ready to proceed with the training. That’s it!
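Well, almost it. Training itself is out of scope for this post, but one last preparation step usually pays off: caching and prefetching the batches so the GPU does not sit idle waiting for the disk. A minimal sketch, with train_ds and val_ds being names of my own choosing:

AUTOTUNE = tf.data.AUTOTUNE

# Keep decoded images cached after the first epoch and prefetch the next
# batch while the current one is being consumed.
train_ds = training_dataset.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = validation_dataset.cache().prefetch(buffer_size=AUTOTUNE)

From here, these two datasets can go straight into the training loop of a ResNet50-based model like the one sketched at the beginning.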