Step-by-Step Guide to Installing Apache Spark on Ubuntu 20.04

Apache Spark is a robust open-source framework for general-purpose cluster computing. It offers high-level APIs in Java, Scala, Python, and R, along with an optimized engine that supports general execution graphs. With built-in modules for streaming, SQL, machine learning, and graph processing, Spark analyzes large datasets efficiently by distributing the work across a cluster for parallel processing.

In this guide, we will walk you through installing the Apache Spark cluster computing stack on Ubuntu 20.04.

Prerequisites

  • A server running Ubuntu 20.04.
  • Root access to the server.

Getting Started

First, ensure all system packages are up-to-date. Execute the following command:

apt-get update -y
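
If you also want to apply any pending package upgrades before continuing, you can optionally run:

apt-get upgrade -y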

Once the packages are updated, proceed to the next steps.

Install Java

Apache Spark requires Java to function. Install it using the following command:

apt-get install default-jdk -y

Verify the Java installation with:

java --version

You should see output similar to:

openjdk 11.0.8 2020-07-14
OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu120.04)
OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)
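
Spark picks up Java from the PATH, so no further configuration is strictly required. If you prefer to set JAVA_HOME explicitly (some tools expect it), the default-jdk package provides a stable symlink you can point it at, for example by adding this line to ~/.bashrc:

export JAVA_HOME=/usr/lib/jvm/default-java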

Install Scala

Since Spark is developed using Scala, install it with:

apt-get install scala -y

To confirm Scala’s version, use:

scala -version

You should see the following:

Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
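
Optionally, you can also start the Scala REPL for a quick sanity check; a println at the prompt should echo the message back, and :quit exits the shell:

scala
scala> println("Scala is working")
scala> :quit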

Install Apache Spark

Download Apache Spark from the official download archive. At the time of writing, version 2.4.6 is the latest release:

cd /opt
wget https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz

Extract the downloaded file:

tar -xvzf spark-2.4.6-bin-hadoop2.7.tgz

Rename the directory for convenience:

mv spark-2.4.6-bin-hadoop2.7 spark
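
A quick listing confirms the layout; the bin directory holds the client tools (spark-shell, pyspark, spark-submit) and sbin holds the cluster management scripts used below:

ls /opt/spark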

Configure Spark Environment

Edit the .bashrc file to configure the Spark environment:

nano ~/.bashrc

Append the following lines:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save and close the file, then reload it so the new variables take effect:

source ~/.bashrc
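
To verify that the new variables are in effect, print SPARK_HOME and ask Spark for its build information (spark-submit is one of the tools in the bin directory added to the PATH above):

echo $SPARK_HOME
spark-submit --version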

Start Spark Master Server

With Spark installed, start the master server:

start-master.sh

This command will output something similar to:

starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu2004.out

The master's web UI listens on port 8080 by default, while the master itself accepts worker and application connections on port 7077. Confirm that the UI port is open using:

ss -tpln | grep 8080

Expected output:

LISTEN   0        1                               *:8080                *:*      users:(("java",pid=4930,fd=249))   

Access the Spark web interface at http://your-server-ip:8080:

Apache Spark Web UI
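
If port 8080 is already in use, or you want the master bound to a specific address, start-master.sh accepts --host and --webui-port options; the values below are only examples:

start-master.sh --host your-server-ip --webui-port 8085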

Start Spark Worker Process

Start a worker process and point it at the master URL shown at the top of the web interface:

start-slave.sh spark://your-server-ip:7077

Expected startup message:

starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu2004.out

Refresh the Spark dashboard to see the worker process:

Apache Spark Worker
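
By default the worker offers all CPU cores and most of the machine's memory to Spark applications. To cap what it advertises, pass --cores and --memory to the same script; the limits below are purely illustrative:

start-slave.sh spark://your-server-ip:7077 --cores 2 --memory 2G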

Working with Spark Shell

Connect to the Spark server using the command line:

spark-shell

Upon connection, you’ll see output like:

...
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://ubuntu2004:4040
Spark context available as 'sc' (master = local[*], app id = local-1598711719335).
Spark session available as 'spark'.
...
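
Note that the shell above runs in local mode (master = local[*]). To attach it to the standalone master started earlier, pass the master URL explicitly:

spark-shell --master spark://your-server-ip:7077

Once inside, you can run a small job against the SparkContext exposed as sc; summing the numbers 1 to 1000 should return 500500.0, and :quit leaves the shell:

scala> sc.parallelize(1 to 1000).sum()
scala> :quit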

Using Python with Spark

PySpark needs a python interpreter available on the PATH. Install Python with:

apt-get install python -y

If apt reports that the python package has no installation candidate on Ubuntu 20.04, install the python-is-python2 package instead; it pulls in Python 2 and provides the python command that PySpark looks for.

Connect Spark with Python:

pyspark

You should see output similar to the following:

...
Using Python version 2.7.18rc1 (default, Apr  7 2020 12:05:55)
SparkSession available as 'spark'.
>>> 
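
As in the Scala shell, the SparkSession is ready to use; a quick count over a generated range is enough to confirm that everything works, and exit() leaves the interpreter:

>>> spark.range(1000).count()
1000
>>> exit()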

Stopping Spark Services

To stop both Master and Worker services, use:

stop-slave.sh
stop-master.sh
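
Both scripts live in Spark's sbin directory, which is already on the PATH. Alternatively, stop-all.sh shuts down the master together with the workers started on the hosts listed in Spark's conf/slaves file (localhost by default):

stop-all.sh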

Conclusion

Congratulations on your successful installation of Apache Spark on Ubuntu 20.04. You’re now set to perform basic tests or start configuring a Spark cluster. Feel free to reach out with any questions.

Frequently Asked Questions (FAQ)

What are the key components required for installing Apache Spark?

You need a Linux system (Ubuntu 20.04 in this guide), Java, and Scala to run Spark. You also need to set the SPARK_HOME and PATH environment variables so that the Spark scripts can be found.

What is the default port for Apache Spark’s web interface?

The Spark master's web interface listens on port 8080 by default. Each worker serves its own UI on port 8081, and a running application exposes one on port 4040.

Can I use Python with Apache Spark?

Yes, you can use the PySpark command-line utility to interact with Spark using Python.

How do I stop Spark services?

Use the commands stop-slave.sh and stop-master.sh to halt the worker and master services, respectively.