REAL BUSINESS ANALYTICS
1/2/2020

Install & Connect: Spark & Sparklyr

By: Christopher Waldeck
This guided installation will take you from a bare instance of Ubuntu 18.04.3 LTS to a functioning installation of Spark, R, and RStudio Desktop. The guide finishes by establishing a connection to Spark with sparklyr.

Sparklyr ships with a function to install Spark (spark_install()), but it has only led me to heartbreak in the past. You can try your luck, but leave this tab open.

STEP ONE: DEPENDENCIES

More than a few hours of my life have been claimed by R's system dependencies, but I'm starting to turn this franchise around.

While I was upgrading my computer, I figured I'd take the opportunity to write down every step I usually lose/forget along the way to getting my preferred data science setup working.

For reference, this guide was written on Ubuntu 18.04.3 LTS.
First, we'll tackle R's "silent" system dependencies - these cause problems when you install popular R packages like those contained in the tidyverse.
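A minimal sketch of the dependency install, assuming the three libraries that most commonly trip up tidyverse installs (curl, openssl, and xml2); your packages may pull in others:

```shell
# System libraries needed to compile popular R packages such as
# curl, openssl, and xml2 (all tidyverse dependencies). Package
# names are for Ubuntu 18.04; other releases may differ.
sudo apt-get update
sudo apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev
```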

STEP TWO: INSTALLING R & RSTUDIO

Save yourself a little heartache and don't jump right into apt-get install. If you don't modify your sources.list file first, you'll end up with a version of R that's nearly two years old, and you'll be right back here. 

There's no reason for both of us to do that experiment. To get the latest version of R supported by CRAN for your platform, follow the steps below to add the PPA recommended by CRAN along with the appropriate signing key.
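The edit happens in the system sources list; opening it looks like this (nano is assumed as the editor, matching the screenshots that follow):

```shell
# Open the APT sources list for editing (requires root)
sudo nano /etc/apt/sources.list
```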
This will open up the nano editor, which ships with Ubuntu. We add the PPA entry (deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/) at the bottom of /etc/apt/sources.list as shown below:
Press CTRL-O to write the modified file, Enter to confirm the file name, and CTRL-X to exit nano. Do not run apt-get update yet!

The CRAN archives for Ubuntu are signed with a key, and we need to add it to our system before APT will allow us to pull from the PPA.

For more background on how Linux handles package distribution, check out this article.
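A sketch of the remaining commands, assuming the key fingerprint CRAN publishes for its Ubuntu archives:

```shell
# Add the signing key for the CRAN Ubuntu archives
sudo apt-key adv --keyserver keyserver.ubuntu.com \
  --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
# Refresh the package indexes (now including the new PPA) and
# install R along with the package-development utilities
sudo apt-get update
sudo apt-get install -y r-base r-base-dev
```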
At this point, we have R fully installed along with some utilities for package development.

Now, head over to the RStudio download page, snag the appropriate installer for Ubuntu, and use the installer utility to install the program. 
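If you'd rather stay in the terminal, one option is gdebi, which resolves a .deb file's dependencies automatically; the filename below is an example, so substitute whatever version you downloaded:

```shell
# Command-line alternative to the graphical installer utility.
# The .deb filename is an example -- use your downloaded file.
sudo apt-get install -y gdebi-core
sudo gdebi rstudio-1.2.5033-amd64.deb
```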

APPLY RENDERING ENGINE WORKAROUND (OPTIONAL...MAYBE)

This is one wrinkle I hit every time I use RStudio on an Ubuntu system with an NVIDIA graphics card. The (default) OpenGL rendering causes RStudio to crash, and it can only be reopened after a reboot. To get around this, set the rendering engine option to "Software" in Tools -> General -> Advanced.
Your mileage will vary depending on your hardware and drivers.

STEP THREE: INSTALLING SPARK

This is one of many, many ways to get Spark on your system. There are valid reasons to Dockerize Spark, and there are even built-in functions to handle the installation in sparklyr and other APIs. 

Why go this route? It works. 

I've never successfully installed and connected to Spark using the built-in sparklyr functionality, and not for lack of trying. The assumptions spark_install() makes about Python, Pip, and your environment variables are a rabbit hole, and I prefer to avoid it.

In a new terminal, download the Spark tarball, unpack & move the files, and open up .bashrc:
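A sketch of those steps, assuming Spark 2.4.4 with the Hadoop 2.7 build (substitute whichever release you want from the Apache archive):

```shell
# Download the Spark tarball (2.4.4 is an example version),
# unpack it, move it to /opt/spark, and open .bashrc for editing
wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz
sudo mv spark-2.4.4-bin-hadoop2.7 /opt/spark
nano ~/.bashrc
```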
Page down to the bottom of .bashrc and add the following two lines:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Now, reload .bashrc to apply the changes and start up the Spark standalone master server. Once the service has started, open http://127.0.0.1:8080/ in your browser and copy the Spark master URL (it begins with spark://) shown at the top of the page.
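In commands, that reload and startup look like this (start-master.sh is on the PATH because $SPARK_HOME/sbin was added above):

```shell
# Apply the new environment variables in the current shell,
# then launch the standalone master server
source ~/.bashrc
start-master.sh
```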

FINAL STEP: INSTALL SPARKLYR & CONNECT TO SPARK
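A minimal sketch of the connection code; the scratch directory and master URL below are placeholders, so paste the URL you copied from the web UI:

```r
# Install sparklyr from CRAN if you don't already have it
install.packages("sparklyr")
library(sparklyr)

# Build a configuration; spark.local.dir is the directory Spark
# uses for intermediate results. The path is an example -- pick
# any directory your user can write to.
spark_conf <- spark_config()
spark_conf$`spark.local.dir` <- "/tmp/spark-scratch"

# Use the master URL copied from http://127.0.0.1:8080/
sc <- spark_connect(master = "spark://your-hostname:7077",
                    config = spark_conf)
```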

The R code above will establish your Spark connection and set a directory for Spark to use for storing intermediate results. The settings in spark_conf can be tweaked, and others added, before passing the configuration to spark_connect().

If you prefer using SparkR, you can follow the instructions here to establish a connection. The sparkR.session() function is almost identical to spark_connect().


To help you get comfortable in Spark, here are a few resources for different audiences / use cases:
  • Mastering Spark With R: Users new to Spark who want fully-worked examples and more narrative. Recommended for users with either R or Spark experience.
  • Sparklyr Tutorial: Experienced R users who want to write their first lines in Spark using sparklyr
  • Spark Machine Learning Guide: ML practitioners looking for algorithm implementation details
  • SparkR Documentation: Experienced R users who want to get started with SparkR
  • Spark Standalone Mode Documentation: Users who want to learn more about Spark deployment options