Creating a development cluster with Docker on a single machine
Let’s start testing Spark 3 with Docker, including shared volumes and different flavours of Jupyter notebooks.
To set up our cluster we’ll use the following Docker images: Docker Spark
We’ll create three Spark workers and one master; feel free to adjust the settings to your machine’s specs.
We’ll also include a Jupyter notebook service (jupyter/all-spark-notebook) in our Docker Compose file; my final docker-compose.yml looks like this:
version: '3'
services:
  spark-master:
    image: bde2020/spark-master:3.1.1-hadoop3.2
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
      - "4040:4040"
    volumes:
      - spark_data:/opt/spark/data
    environment:
      - INIT_DAEMON_STEP=setup_spark
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 2G
  spark-worker-1:
    image: bde2020/spark-worker:3.1.1-hadoop3.2
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    volumes:
      - spark_data:/data
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
    deploy:
      resources:
        limits:
          cpus: '3'
          memory: 5G
  spark-worker-2:
    image: bde2020/spark-worker:3.1.1-hadoop3.2
    container_name: spark-worker-2
    depends_on:
      - spark-master
    ports:
      - "8082:8081"
    volumes:
      - spark_data:/opt/spark/data
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
    deploy:
      resources:
        limits:
          cpus: '3'
          memory: 5G
  spark-worker-3:
    image: bde2020/spark-worker:3.1.1-hadoop3.2
    container_name: spark-worker-3
    depends_on:
      - spark-master
    volumes:
      - spark_data:/opt/spark/data
    ports:
      - "8083:8081"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
    deploy:
      resources:
        limits:
          cpus: '3'
          memory: 5G
  all-spark-notebook:
    image: jupyter/all-spark-notebook
    container_name: all-spark-notebook
    volumes:
      - /opt/spark/conf:/conf
      - /opt/spark/:/home/jovyan/work
    ports:
      - "8888:8888"
      - "4050-4059:4040-4049"
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - GRANT_SUDO=yes
      - JUPYTER_ENABLE_LAB=1
    command: "start-notebook.sh --NotebookApp.token=''"

volumes:
  spark_data:
I’ve remapped some of the ports so that the shell and the notebooks don’t clash while we monitor jobs.
You can find this docker-compose.yml on my GitHub.
Let’s download the images:
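From the directory that holds the docker-compose.yml, a single command pulls every image referenced in the file:
docker-compose pull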
Access from the shell
Let’s start the cluster and check from the shell that it is working. I’m using Ubuntu 20.04 on WSL2 for this; make sure Docker integration is enabled for your WSL distribution.
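From the same directory as the compose file, bring everything up in the background:
docker-compose up -d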
Download and untar Spark on the client machine:
curl -O https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xzf spark-3.1.2-bin-hadoop3.2.tgz
Make sure that the variables for Java and Spark are set up in your .bashrc file and that you’ve sourced it.
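For reference, the exports could look like the sketch below; the paths are assumptions, so adjust JAVA_HOME to wherever your JDK lives and SPARK_HOME to wherever you extracted the tarball:
# Adjust both paths to your own installation
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME="$HOME/spark-3.1.2-bin-hadoop3.2"
export PATH="$PATH:$SPARK_HOME/bin"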
Start the Spark Shell:
spark-shell --master spark://localhost:7077
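To have something to watch in the UI, run a small job from the shell first; any quick action works (the range and count here are arbitrary, and sc is the SparkContext the shell provides):
sc.parallelize(1 to 100000).count()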
Monitor the job from the Spark master UI at http://localhost:8080, where the shell shows up as a running application alongside the three workers:
Access from Scala and Python notebooks
Now that we know the cluster works, let’s test the notebooks, starting with Scala:
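A minimal smoke test could look like the sketch below; the app name is arbitrary, the exact kernel setup depends on the all-spark-notebook image (it may already provide a session), and the hostname spark-master resolves because the notebook container shares the Compose network:

import org.apache.spark.sql.SparkSession

// Build (or reuse) a session pointing at the standalone master.
val spark = SparkSession.builder()
  .appName("notebook-smoke-test")
  .master("spark://spark-master:7077")
  .getOrCreate()

// Distribute a small computation across the workers and bring back the result.
val total = spark.sparkContext.parallelize(1L to 1000000L).reduce(_ + _)
println(s"Sum computed on the cluster: $total")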
Can we monitor the jobs? (Remember that the notebook container’s ports 4040-4049 are exposed on the host as 4050-4059, so the first notebook application UI is at http://localhost:4050.)
Everything is in order; we are ready to rock and roll!