Running Jupyter in Kubernetes with an SLA

Hello World!

So Jupyter is a great tool for experimental science.  Running a Jupyter notebook server can be tricky though, especially if you want to preserve all of the data and notebooks stored in it.  I have seen many strategies, but the one I like best is based on my “Micro Services for Data Science” strategy: decouple data from compute.  With decoupled data and compute we can literally trash our Jupyter container and all of our data and notebooks still live.  So why not run it in a self-healing orchestrator and deploy via Kubernetes :D

Step 1: Get a container registry

I use Azure’s Container Registry.  It’s a private container registry that comes with Azure and gets billed through my Azure subscription; I just like keeping everything in one place.  I’ll go through how to use a private container registry here as well, but if you don’t want to worry about that you can simply create a Docker Hub repository to push your containers to.  (A sketch of the CLI commands follows the list below.)

  1. https://azure.microsoft.com/en-us/services/container-registry/
  2. https://hub.docker.com/ 
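If you go the Azure route, creating the registry is just a couple of CLI commands.  Here is a minimal sketch, assuming you have the Azure CLI installed; the resource group and registry names (myResourceGroup, myJupyterRegistry) are placeholders for your own:

# Create a private Azure Container Registry (names are placeholders)
az acr create --resource-group myResourceGroup --name myJupyterRegistry --sku Basic

# Log in so docker push can authenticate against the registry
az acr login --name myJupyterRegistry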

Step 2: Create a light weight container with Jupyter in it.

Here is the docker file:

# Lightweight base image with Python 3 and Jupyter
FROM ubuntu:16.04

# System updates plus build tools, Python 3, and pip
RUN apt-get update -y && apt-get upgrade -y
RUN apt-get install -y -qq build-essential libssl-dev libffi-dev python3-dev curl python3-pip
RUN pip3 install --upgrade pip

# Install the Python packages we need
COPY requirements.txt /app/
RUN pip3 install -r /app/requirements.txt

# Jupyter's default port
EXPOSE 8888

We start from an Ubuntu 16.04 image, run some upgrades, install Python 3, upgrade pip, install our requirements, and expose port 8888 (Jupyter’s default port).
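With the Dockerfile and requirements.txt in the same directory, building and pushing the image is standard Docker.  A quick sketch, where the login server and image name (myjupyterregistry.azurecr.io, jupyter-persisted:v1) are placeholders for your own registry from Step 1:

# Build the image and tag it with the registry's login server
docker build -t myjupyterregistry.azurecr.io/jupyter-persisted:v1 .

# Push it so the Kubernetes cluster can pull it later
docker push myjupyterregistry.azurecr.io/jupyter-persisted:v1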

Here is our requirements.txt file:

numpy
pandas
scipy
jupyter
azure_common
azure-storage
scikit-learn
nltk
plotly

Notice that Jupyter is in there.  I also added a few other things I use all the time, including numpy, pandas, plotly, scikit-learn, and some Azure packages.

Step 3: Create a .yaml file

apiVersion: v1
kind: Service
metadata:
  labels:
    app: jupyter-persisted
  name: jupyter-persisted
spec:
  ports:
  - port: 80
    targetPort: 8888
  selector:
    app: jupyter-persisted
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: jupyter-persisted
  name: jupyter-persisted
spec:
  selector:
    matchLabels:
      app: jupyter-persisted
  template:
    metadata:
      labels:
        app: jupyter-persisted
    spec:
      volumes:
      - name: dumpzone
        azureFile:
            secretName: storagesecret
            shareName: dumpzone
            readOnly: false   
      - name: rvlcdip
        azureFile:
            secretName: storagesecret
            shareName: rvlcdip
            readOnly: false
      - name: jupyter
        azureFile:
            secretName: storagesecret
            shareName: jupyter
            readOnly: false
      - name: models
        azureFile:
            secretName: storagesecret
            shareName: models
            readOnly: false     
      - name: tensorboard
        azureFile:
            secretName: storagesecret
            shareName: tensorboard
            readOnly: false              
      containers:
      - name: jupyter
        image: YOUR_REGISTRY_REPO_OR_LOGIN_SERVER/YOURIMAGE:YOURTAG
        imagePullPolicy: Always
        command: ["bash", "-c"]
        args: ["jupyter notebook --no-browser --port=8888 --ip=0.0.0.0 --notebook-dir=/jupyter --allow-root --NotebookApp.password='YOUR_HASHED_NOTEBOOK_PASSWORD'"]
        ports:
        - containerPort: 8888
        volumeMounts:
        - mountPath: "/dumpzone"
          name: dumpzone
        - mountPath: "/rvlcdip"
          name: rvlcdip
        - mountPath: "/jupyter"
          name: jupyter
        - mountPath: "/models"
          name: models
        - mountPath: "/tensorboard"
          name: tensorboard                              
      imagePullSecrets:
      - name: regsecret

This is where most of the magic happens.  Notice the volume spec and the volume mounts in particular; these are Azure Files shares.  I mount a variety of file shares that hold data so my teams can operate against them.  Most importantly, notice that we mount an Azure Files share at /jupyter, which is exactly where we run Jupyter from (--notebook-dir=/jupyter).  This means my notebooks are persisted to geo-redundant storage that is separate from my container and even my VM.  So my container and my VM can go down, Kubernetes will simply reschedule the pod with the same mount, and my notebook will be back online without me touching a single thing… WONDERFUL!
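One thing the .yaml takes for granted: the storagesecret that the azureFile volumes reference has to exist in the cluster before the pod can mount anything.  A minimal sketch of creating it, where the account name and key are placeholders for your own storage account; the azureFile driver expects exactly these two key names:

# Storage account credentials consumed by the azureFile volume driver
kubectl create secret generic storagesecret \
  --from-literal=azurestorageaccountname=YOUR_STORAGE_ACCOUNT_NAME \
  --from-literal=azurestorageaccountkey=YOUR_STORAGE_ACCOUNT_KEY

Each shareName in the volume list (dumpzone, rvlcdip, jupyter, models, tensorboard) must also already exist as a file share in that storage account.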

[Image: file_share — the mounted Azure Files shares shown in the Jupyter file browser]

You can also see in the image above that I have mounted all of my other file shares so they can be managed from Jupyter: exploratory analysis and so on.

Generating the notebook password

It’s just some simple code:

# passwd() prompts for a password twice and returns a salted hash
from notebook.auth import passwd
print(passwd())

If you run that code, you enter a password twice and it prints back a hash (something like sha1:…) that you paste into the --NotebookApp.password argument in the .yaml file.
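If you would rather not open an interactive Python session, the same thing works as a shell one-liner; a sketch, with YOUR_PASSWORD as a placeholder (keep in mind the password will land in your shell history this way):

# Hash a password non-interactively by passing it as an argument
python3 -c "from notebook.auth import passwd; print(passwd('YOUR_PASSWORD'))"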

Using a Private Container Registry

You simply need to provide a registry secret in Kubernetes.  Follow the steps in the link below to authenticate against your registry, then create the secret as sketched after it…

https://docs.microsoft.com/en-us/azure/container-registry/container-registry-get-started-docker-cli 
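The regsecret referenced under imagePullSecrets in the .yaml is a standard docker-registry secret.  A sketch of creating it, where the server, username, and password are placeholders for your registry’s credentials (for ACR, the admin user credentials from the portal work):

# Registry credentials Kubernetes uses to pull the private image
kubectl create secret docker-registry regsecret \
  --docker-server=myjupyterregistry.azurecr.io \
  --docker-username=YOUR_REGISTRY_USERNAME \
  --docker-password=YOUR_REGISTRY_PASSWORD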

Deploy to Kubernetes and get your endpoint

Just some simple commands:

kubectl create -f YOURYAMLFILE.yaml
kubectl get svc

The first command will schedule the service and deployment on the cluster, and the second will list all services on the cluster along with each one’s public IP.  Navigate to that public IP address, enter your password, and off you go.
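Once Azure finishes provisioning the load balancer, the service listing looks roughly like this (illustrative output only; EXTERNAL-IP shows <pending> until the IP is assigned, and your names and addresses will differ):

NAME                TYPE           CLUSTER-IP   EXTERNAL-IP    PORT(S)        AGE
jupyter-persisted   LoadBalancer   10.0.12.34   52.170.11.22   80:31234/TCP   5m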

Summary

This Kubernetes thing is pretty freaking awesome.  I’m taking a big liking to this type of approach for my workloads.  I can now run workloads from anywhere in the world on whatever compute I want, know they are going to be reliable, and give access to other folks in a secure way.  Good luck!