By: Christopher Waldeck
This guided installation will take you from a bare instance of Ubuntu 18.04.3 LTS to a functioning installation of Spark, R, and RStudio Desktop. The guide finishes by establishing a connection to Spark with sparklyr.
Sparklyr ships with a function to install Spark, but it has only led me to heartbreak in the past. You can try your luck, but leave this tab open.
STEP ONE: DEPENDENCIES
First, we'll tackle R's "silent" system dependencies - these cause problems when you install popular R packages like those contained in the tidyverse.
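A typical set for a fresh Ubuntu 18.04 box covers the system libraries that the curl, openssl, and xml2 R packages (all pulled in by the tidyverse) compile against; add others as install errors point you to them:

```shell
# System libraries commonly required to build tidyverse dependencies
sudo apt-get update
sudo apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev
```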
STEP TWO: INSTALLING R & RSTUDIO
Save yourself a little heartache and don't jump right into apt-get install. If you don't modify your sources.list file first, you'll end up with a version of R that's nearly two years old, and you'll be right back here.
There's no reason for both of us to do that experiment. To get the latest version of R supported by CRAN for your platform, follow the steps below to add the PPA recommended by CRAN along with the appropriate signing key.
Running sudo nano /etc/apt/sources.list will open the nano editor, which ships with Ubuntu. Add the PPA entry (deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/) at the bottom of the file:
Press CTRL-O to write the modified file and Enter to confirm the file name. Do not run apt-get update yet!
The CRAN archives for Ubuntu are signed with a key, and we need to add it to our system before APT will allow us to pull from the PPA.
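At the time of writing, CRAN's Ubuntu instructions list the key below; fetch it from the Ubuntu keyserver:

```shell
# Add the signing key used for the CRAN Ubuntu repositories
sudo apt-key adv --keyserver keyserver.ubuntu.com \
    --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
```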
For more background on how Linux handles package distribution, check out this article.
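With the key in place, refresh APT and install R. The r-base-dev package is what pulls in the utilities for package development:

```shell
sudo apt-get update
sudo apt-get install -y r-base r-base-dev
```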
At this point, we have R fully installed along with some utilities for package development.
Now, head over to the RStudio download page, snag the appropriate installer for Ubuntu, and use the installer utility to install the program.
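If you grabbed the .deb package, gdebi will resolve its dependencies for you. The filename below is illustrative; substitute whatever version you downloaded:

```shell
sudo apt-get install -y gdebi-core
sudo gdebi rstudio-1.1.463-amd64.deb  # hypothetical filename
```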
APPLY RENDERING ENGINE WORKAROUND (OPTIONAL...MAYBE)
This is one wrinkle I hit every time I use RStudio on an Ubuntu system with an NVIDIA graphics card. The (default) OpenGL rendering causes RStudio to crash, and it can only be reopened after a reboot. To get around this, set the rendering engine option to "Software" in Tools -> Global Options -> General -> Advanced.
Your mileage will vary depending on your hardware and drivers.
STEP THREE: INSTALLING SPARK
This is one of many, many ways to get Spark on your system. There are valid reasons to Dockerize Spark, and there are even built-in functions to handle the installation in sparklyr and other APIs.
Why go this route? It works.
I've never successfully installed and connected to Spark using the built-in sparklyr functionality, and not for lack of trying. The assumptions spark_install() makes about Python, Pip, and your environment variables are a rabbit hole, and I prefer to avoid it.
In a new terminal, download the Spark tarball, unpack & move the files, and open up .bashrc:
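The commands below sketch that sequence. The Spark version and the /usr/local/spark install path are choices, not requirements; check the Spark download page for a current release:

```shell
# Download and unpack a Spark release (version shown is an example)
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
sudo mv spark-2.4.0-bin-hadoop2.7 /usr/local/spark
nano ~/.bashrc
```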
Page down to the bottom of .bashrc and add the following two lines:
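Assuming Spark lives at /usr/local/spark (adjust the path to wherever you unpacked it), the two lines are:

```shell
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
```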
Now, reload .bashrc to apply the changes and start up the Spark standalone master server. Once the service has started, open up http://127.0.0.1:8080/ in your browser and copy the Spark master URL shown at the top of the page (it takes the form spark://hostname:7077).
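Concretely, that is:

```shell
# Apply the new environment variables, then launch the standalone master
source ~/.bashrc
$SPARK_HOME/sbin/start-master.sh
```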
FINAL STEP - INSTALL SPARKLYR & CONNECT TO SPARK
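A minimal connection script looks like the sketch below. The master URL, spark_home path, and scratch-directory path are placeholders; substitute the master URL you copied from the web UI:

```r
library(sparklyr)  # install.packages("sparklyr") if needed

spark_conf <- spark_config()
# Directory Spark uses for intermediate results (placeholder path)
spark_conf$`spark.local.dir` <- "/home/you/spark-scratch"

sc <- spark_connect(
  master     = "spark://your-hostname:7077",  # URL copied from http://127.0.0.1:8080/
  spark_home = "/usr/local/spark",
  config     = spark_conf
)
```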
The R code above will establish your spark connection and set a directory for Spark to use for storing intermediate results/operations. The settings in spark_conf can be tweaked and others can be added before passing the configuration to spark_connect().
If you prefer using SparkR, you can follow the instructions here to establish a connection. The sparkR.session() function is almost identical to spark_connect().
To help you get comfortable in Spark, here are a few resources for different audiences / use cases:
Copyright © 2018