Creating a development cluster using Docker on a single machine

Let’s start testing Spark 3 with Docker, including shared volumes and different flavours of Jupyter notebooks.

To set up our cluster we’ll use the following Docker images: Docker Spark (the bde2020 images from the big-data-europe/docker-spark project).

We’ll create one master and three Spark workers; feel free to adjust the settings to your machine’s specs.

We’ll also include a Jupyter notebook service in our Docker Compose file. My final docker-compose.yml looks like this:

version: '3'
services:
  spark-master:
    image: bde2020/spark-master:3.1.1-hadoop3.2
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
      - "4040:4040"
    volumes:
      - spark_data:/opt/spark/data
    environment:
      - INIT_DAEMON_STEP=setup_spark
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 2G
  spark-worker-1:
    image: bde2020/spark-worker:3.1.1-hadoop3.2
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    volumes:
      - spark_data:/opt/spark/data
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
    deploy:
      resources:
        limits:
          cpus: '3'
          memory: 5G
  spark-worker-2:
    image: bde2020/spark-worker:3.1.1-hadoop3.2
    container_name: spark-worker-2
    depends_on:
      - spark-master
    ports:
      - "8082:8081"
    volumes:
      - spark_data:/opt/spark/data
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
    deploy:
      resources:
        limits:
          cpus: '3'
          memory: 5G
  spark-worker-3:
    image: bde2020/spark-worker:3.1.1-hadoop3.2
    container_name: spark-worker-3
    depends_on:
      - spark-master
    volumes:
      - spark_data:/opt/spark/data
    ports:
      - "8083:8081"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
    deploy:
      resources:
        limits:
          cpus: '3'
          memory: 5G
  all-spark-notebook:
    image: jupyter/all-spark-notebook
    container_name: all-spark-notebook
    volumes:
      - /opt/spark/conf:/conf
      - /opt/spark/:/home/jovyan/work
    ports:
      - "8888:8888"
      - "4050-4059:4040-4049"
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - GRANT_SUDO=yes
      - JUPYTER_ENABLE_LAB=1
    command: "start-notebook.sh --NotebookApp.token=''"
volumes:
  spark_data:

I’ve remapped some host ports to make sure there are no conflicts between the shell and the notebooks when we monitor jobs through their Spark UIs.

You can find this docker-compose.yml on my GitHub.

Let’s download the images:
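One way to pre-fetch every image referenced above (assuming the file is saved as docker-compose.yml in the current directory) is:

docker-compose pull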

Access from the shell

Let’s start the cluster and check from the shell that it is working. I’m using Ubuntu 20.04 on WSL2 for this; make sure Docker integration is enabled for your WSL distribution.
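Bringing the whole stack up in the background and confirming that the master, the three workers and the notebook container are running looks like this:

docker-compose up -d
docker-compose ps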

Download and untar Spark on the client:

curl -O https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xzf spark-3.1.2-bin-hadoop3.2.tgz

Make sure that JAVA_HOME, SPARK_HOME and PATH are set up in your .bashrc file and that you’ve sourced it.
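For example, something along these lines; the exact paths are placeholders for wherever your JDK lives and wherever you extracted Spark:

# in ~/.bashrc; adjust both paths to your own installation locations
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME="$HOME/spark-3.1.2-bin-hadoop3.2"
export PATH="$SPARK_HOME/bin:$JAVA_HOME/bin:$PATH"

Then reload it with source ~/.bashrc.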

Start the Spark Shell:

spark-shell --master spark://localhost:7077
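To have a job worth watching, run a tiny sanity check inside the shell; this is just a generic distributed count, nothing specific to this setup:

// parallelize a range across the workers and count it
val rdd = spark.sparkContext.parallelize(1 to 1000000)
rdd.count()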

Monitor the job: the driver UI is at http://localhost:4040, and the master UI, which lists the registered workers and running applications, is at http://localhost:8080.

Access from Scala and Python notebooks

Now that we know the cluster works, let’s test our notebooks in Scala:
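Here is a minimal sketch, assuming the spylon-kernel Scala kernel that ships with jupyter/all-spark-notebook (the magic and defaults may vary between image versions): the first cell points the kernel at the standalone master from the compose file, and the next cell runs a small query.

%%init_spark
# first cell: configure the spylon launcher to use the compose master
launcher.master = "spark://spark-master:7077"

// second cell: `spark` is now a SparkSession backed by the cluster
val df = spark.range(0, 1000000)
df.selectExpr("count(id) as rows", "sum(id) as total").show()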

Can we monitor the jobs? Remember that I mapped container ports 4040-4049 to host ports 4050-4059, so the notebook’s driver UI shows up at http://localhost:4050.

Everything is in order; we are ready to Rock and Roll!!!