By: Christopher Waldeck
This guided installation will take you from a bare instance of Ubuntu 18.04.3 LTS to a functioning installation of Spark, R, and RStudio Desktop. The guide finishes by establishing a connection to Spark with sparklyr.
Sparklyr ships with a function to install Spark, but it has only led me to heartbreak in the past. You can try your luck, but leave this tab open.
STEP ONE: DEPENDENCIES
First, we'll tackle R's "silent" system dependencies - these cause problems when you install popular R packages like those contained in the tidyverse.
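The exact package list depends on which R packages you plan to install, but the usual offenders for a tidyverse install are the curl, SSL, and XML development headers. A sketch of the install (package names as they appear on Ubuntu 18.04):

```
sudo apt-get update
sudo apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev
```

If install.packages() later complains about a missing system library, the error message almost always names the -dev package you need.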
STEP TWO: INSTALLING R & RSTUDIO
Save yourself a little heartache and don't jump right into apt-get install. If you don't modify your sources.list file first, you'll end up with a version of R that's nearly two years old, and you'll be right back here.
There's no reason for both of us to do that experiment. To get the latest version of R supported by CRAN for your platform, follow the step below to add the PPA as recommended by CRAN and add the appropriate signature key.
Running sudo nano /etc/apt/sources.list will open the nano editor, which ships with Ubuntu. We add the PPA entry (deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/) at the bottom of /etc/apt/sources.list as shown below:
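The line to append is the one quoted above, on its own line at the bottom of the file:

```
deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/
```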
Press CTRL-O to write the modified file and Enter to confirm the file name. Do not run apt-get update yet!
The CRAN archives for Ubuntu are signed with a key, and we need to add it to our system before APT will allow us to pull from the PPA.
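A sketch of the key import and install (the key ID below is the one the CRAN Ubuntu archives have long been signed with; verify it against CRAN's Ubuntu instructions before trusting it):

```
sudo apt-key adv --keyserver keyserver.ubuntu.com \
    --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo apt-get update
sudo apt-get install -y r-base r-base-dev
```

The r-base-dev package is what pulls in the package-development utilities mentioned below.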
For more background on how Linux handles package distribution, check out this article.
At this point, we have R fully installed along with some utilities for package development.
Now, head over to the RStudio download page, snag the appropriate installer for Ubuntu, and use the installer utility to install the program.
APPLY RENDERING ENGINE WORKAROUND (OPTIONAL...MAYBE)
This is one wrinkle I hit every time I use RStudio on an Ubuntu system with an Nvidia graphics card. The (default) OpenGL rendering causes RStudio to crash, and it can only be reopened after a reboot. To get around this, set the rendering engine option to "Software" in Tools -> General -> Advanced.
Your mileage will vary depending on your hardware and drivers.
STEP THREE: INSTALLING SPARK
This is one of many, many ways to get Spark on your system. There are valid reasons to Dockerize Spark, and there are even built-in functions to handle the installation in sparklyr and other APIs.
Why go this route? It works.
I've never successfully installed and connected to Spark using the built-in sparklyr functionality, and not for lack of trying. The assumptions spark_install() makes about Python, Pip, and your environment variables are a rabbit hole, and I prefer to avoid it.
In a new terminal, download the Spark tarball, unpack & move the files, and open up .bashrc:
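A sketch of those steps (the version shown is an assumption; check spark.apache.org/downloads for the current release and matching Hadoop build, and adjust the file names to match):

```
wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
tar -xzf spark-2.4.4-bin-hadoop2.7.tgz
sudo mv spark-2.4.4-bin-hadoop2.7 /opt/spark
nano ~/.bashrc
```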
Page down to the bottom of .bashrc, add the following two lines:
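Assuming you moved the unpacked files to /opt/spark (adjust the path if you put them elsewhere), the pair looks like this:

```shell
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```

Adding sbin to the PATH is what makes the standalone master's start scripts available in the next step.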
Now, reload .bashrc to apply the changes and start up the Spark standalone master server. Once the service has started, open up http://127.0.0.1:8080/ in your browser and copy the URL.
FINAL STEP - INSTALL SPARKLYR & CONNECT TO SPARK
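A minimal sketch of the connection code (the master URL is a placeholder - paste the spark:// URL you copied from the web UI - and the memory setting and scratch directory are example values, not requirements):

```r
library(sparklyr)

spark_conf <- spark_config()
spark_conf$`sparklyr.shell.driver-memory` <- "4G"   # example tweak
spark_conf$spark.local.dir <- "/tmp/spark-scratch"  # where Spark stores intermediate results

sc <- spark_connect(
  master     = "spark://your-hostname:7077",  # URL copied from http://127.0.0.1:8080/
  spark_home = Sys.getenv("SPARK_HOME"),
  config     = spark_conf
)
```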
The R code above will establish your spark connection and set a directory for Spark to use for storing intermediate results/operations. The settings in spark_conf can be tweaked and others can be added before passing the configuration to spark_connect().
If you prefer using SparkR, you can follow the instructions here to establish a connection. The sparkR.session() function is almost identical to spark_connect().
To help you get comfortable in Spark, here are a few resources for different audiences / use cases:
In order to use the code in this post, download this file, import it into your geography database (or wherever), and execute the following code to create a geography column on the table. You can throw a spatial index on there if you like, but the code below isn't designed to utilize it.
I will name the table [dbo].[USZipCodes] and assume you have SSMS pointed to the relevant database so we can avoid three-part identifiers.
While there are undoubtedly better platforms than SQL Server for k-means clustering, scenarios exist where it makes a lot of sense to do these calculations in a database. Procedures running overnight that could benefit from a clustering heuristic for cutting down calculation times are probably the most common cases.
The traditional k-means algorithm, however, relies on features of Euclidean geometry. Because SQL Server’s geography calculations are happening on an ellipse, this is technically performing K-Means on a warped surface. However, since the points are all "on top" of the ellipse and all the calculations are constrained to be the shortest arcs over the surface (thanks, Microsoft!), we won’t fall into any nonlinearity traps.
However, this solution is not intended for data that cross the 180th meridian, because it averages latitude and longitude to ascertain density-weighted centroids during the update phase. If your data cross this line, I recommend using a reference bounding box as a basis for calculating centrality.
The code here uses the Forgy initialization, which means that k points are chosen at random as initial points. However, instead of picking random points on the Earth (the analog for normal Forgy), it picks points from the data at hand.
It’s worth calling out that this method does not give a deterministic result. K-Means is sensitive to the random initialization positions in each run. It is imperative to test any heuristic method on your data to ensure reliability before including it in a workflow.
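The algorithm is easiest to sanity-check against a small reference implementation outside the database. Here is a minimal sketch in Python (not the post's code: it uses a haversine distance on a sphere rather than SQL Server's ellipsoid, but it has the same shape - Forgy initialization drawn from the data, averaged lat-long centroid updates, and a mileage tolerance for convergence):

```python
import math
import random

def haversine_miles(p, q):
    # Great-circle distance between two (lat, lon) points, in miles.
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 3959.0 * 2 * math.asin(math.sqrt(a))

def kmeans_geo(points, k, tol=0.01, seed=None, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Forgy init: k points drawn from the data itself
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment phase: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: haversine_miles(p, centroids[i]))
            clusters[j].append(p)
        # Update phase: density-weighted centroid = mean lat / mean lon.
        new = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        shift = max(haversine_miles(a, b) for a, b in zip(centroids, new))
        centroids = new
        if shift < tol:  # stop once no centroid moved more than tol miles
            break
    return centroids, clusters
```

With a fixed seed the run is reproducible, which makes the advice above - test the heuristic on your own data before trusting it - practical to follow.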
k-Means Clustering: Step One
You'll notice a lot of window functions in this code. This is because the geography data type is not comparable: it cannot be used for grouping in aggregate functions.
Also, I want to call out ORDER BY ROW_NUMBER() OVER(ORDER BY NEWID()), which is one of my favorite constructions in T-SQL. It lets you randomly order any data by first assigning a random GUID to each row and then ordering by it. The best part is that the GUID is gone by the end of the query since it isn't selected - an efficient way of handling a problem SQL isn't really built for.
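For instance, drawing the k random Forgy seeds from the zip code table looks something like this (the column names are assumptions about the imported file):

```sql
DECLARE @k INT = 5;

SELECT TOP (@k) ZipCode, Lat, Long
FROM [dbo].[USZipCodes]
ORDER BY ROW_NUMBER() OVER (ORDER BY NEWID());
```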
Step Two: Loop and Select Results
There it is: You're left with the final assignment of each point to a cluster, the final set of clusters, and a record that shows you how the clusters' centroids changed through each assignment step. While speed is obviously sensitive to the choice of tolerance and the size of the data initially selected, I've found this formulation to converge quickly and reliably as I've tested it with even larger sets of data (~50,000 points).
Like I said before, I really wanted to keep this in the book since the code took a little while to write, but the use case just wasn't general enough for me to charge people for it. However, I imagine the audience on the Real Business Analytics blogs will get a kick out of this kind of problem.
12/18/2016
Assume you've wisely taken my suggestion about setting up SQL Server to scale up your analysis capabilities and value proposition. Despite its multitude of intuitive geography functions, however, you'll notice that SQL Server doesn't actually know where anything is. If, say, you need to know whether the facilities in one of your tables are in Maryland, you can easily look up the conversion from standard lat-longs to geography data types, but turning "Maryland" into something that makes sense to your database is harder to figure out.
If you Google around for a while, the conventional wisdom seems to involve getting your mitts on some shapefiles and attempting to load them directly into your database with Shape2SQL, which was built for SQL Server 2008. However, the conversion tool is not professionally maintained, and I've had it fail for me a few times in SQL Server 2014 and 2016.
What's worse, even though political boundaries are incredibly useful and relevant, they are not considered a standard feature (hence, the lack of documentation), which means your DBAs aren't going to have a ready-made solution for you.
Because of this obvious capability gap, I figured I'd help out my fellow data scientists with a simple tutorial to get a couple of extremely useful tables up and running. The code that follows utilizes data from Natural Earth, which is an awesome project that provides shapefiles for the boundary data we need. I've used their 50m resolution maps, but they provide more and less granular data. You can download what you need from their website, or you can snag the files I used here:
Finally, note that I only use the spatial data from the files, but you could pull through all the other data that is included in them by the people at Natural Earth. Check out what's available after you read these files into R.
Step One: R, Maptools, & RODBC
Of course, R doesn't have a native way of talking to shapefiles, but with maptools and RODBC, you're able to read in, manipulate, and export these files in a way that will agree with SQL Server. Pay special attention to the way "crswgs84" is declared and used. In my research, this was a common hangup for a lot of people using shapefiles in R.
As a bonus, the Code-to-Results ratio is pretty awesome.
R Code: Country Shapefile to SQL Server
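The embedded gist is missing from this archive; here is a condensed sketch of the read-and-export step. The Natural Earth attribute name, ODBC DSN, and output table name are assumptions - readShapePoly() and sqlSave() are the maptools/RODBC calls the post relies on:

```r
library(maptools)
library(RODBC)

# The WGS84 declaration -- the common hangup mentioned above
crswgs84 <- CRS("+proj=longlat +datum=WGS84 +no_defs")

countries <- readShapePoly("ne_50m_admin_0_countries.shp",
                           proj4string = crswgs84)

# Flatten each country's outer ring into an ordered lat-long table
# (the "name" attribute column is an assumption -- inspect countries@data)
coords <- do.call(rbind, lapply(seq_len(nrow(countries@data)), function(i) {
  xy <- countries@polygons[[i]]@Polygons[[1]]@coords
  data.frame(Country    = as.character(countries@data$name[i]),
             Long       = xy[, 1],
             Lat        = xy[, 2],
             PointOrder = seq_len(nrow(xy)))
}))

ch  <- odbcConnect("YourDSN")
var <- try(sqlDrop(ch, "CountryBorders"))   # stored to a variable, per the note below
sqlSave(ch, coords, tablename = "CountryBorders", rownames = FALSE)
odbcClose(ch)
```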
I can hear it now: "But why don't you use sqlDrop() to try and drop the table? And what's with the ANSI_WARNINGS bit?"
That, my friend, is what experience (and more than a few minutes spent with my head on my desk) looks like. Also, if you're trying to run this from the CMD console, store the "try" statements to a variable ("var <- try(...)") to prevent a code-stopping warning (it took a lot of profanity to figure that out). The "ANSI_WARNINGS" and "NOCOUNT" settings control things that can confuse your RODBC connection. Sometimes, when a SQL statement issues an "X rows affected" or other non-error message, R thinks execution failed. It doesn't happen every time, but it does happen...especially when you're showing your code to someone else.
Now, we'll rinse and repeat for the province file:
R Code: Province Shapefile to SQL Server
Boom. Now you've got two tables in your database populated with countries and provinces. Currently, the borders are a series of lat-longs that we will use to trace the polygons. If you go rogue after this tutorial, you'll need to familiarize yourself with some spatial data rules, the first of which throws many noobs off: the lat-long points need to go in a certain order. In fact, that's why the SQL functions above have ORDER BY clauses.
Right-hand or left-hand (clockwise or counterclockwise) rules differ by database type (SQL Server, Oracle, etc.) AND geography type. In SQL Server, for instance, if you want a shape with interior shapes (a MultiPolygon), the exterior shape needs to have its points listed clockwise, while the interior shape's points are listed counterclockwise, as described here.
There are tons of other things to worry about - degree vs. decimal lat-longs, various projection/grid types, and SRIDs - but we'll stay on the yellow brick road for the time being.
Step Two: Lat-Long To Polygon In SQL Server
First, we're going to make a couple of functions that are capable of converting the lat-longs that come from R into geography polygons. And don't sweat the red highlighting - SSMS is just letting you know that it doesn't see the tables you're pointing at with the functions. They'll be there by the time we call the functions.
Declare Functions for Polygon Parsing
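The embedded function code is missing from this archive; here is a sketch of the shape such a function takes (all names are hypothetical, and STRING_AGG requires SQL Server 2017+ - on the 2014/2016 versions discussed above you'd build the string with the FOR XML PATH trick instead):

```sql
CREATE FUNCTION dbo.CountryToPolygon (@Country NVARCHAR(100))
RETURNS GEOGRAPHY
AS
BEGIN
    DECLARE @wkt NVARCHAR(MAX);

    -- Stitch the ordered lat-longs into WKT; WITHIN GROUP enforces
    -- the point ordering the polygon rules demand
    SELECT @wkt = 'POLYGON((' +
           STRING_AGG(CAST([Long] AS NVARCHAR(32)) + ' ' +
                      CAST([Lat]  AS NVARCHAR(32)), ', ')
               WITHIN GROUP (ORDER BY PointOrder) + '))'
    FROM dbo.CountryBorders
    WHERE Country = @Country;

    RETURN geography::STGeomFromText(@wkt, 4326);  -- 4326 = WGS84
END;
```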
Functions accomplished. Now, let's make some tables, apply the functions, and clean up the dbo schema.
Step Three: Create, Populate, And Drop Tables
Apply Polygon Functions & Clean Things Up
Just like that, we've got what we came for: two tables that can be used for meaningful spatial analysis with all the goodies T-SQL provides. Also, if you didn't know how CROSS APPLY worked before, I think this gives you a good example with an intuitive function. Apply is extremely difficult to explain in the abstract, but it's easy to understand when you see it.
A little trick I embedded to eliminate duplicate entries is the ROW_NUMBER() window function. If you run the subqueries by themselves, they return a row for every country and lat-long pair, which isn't useful. However, the geography data type is not comparable - it can't be used with DISTINCT. Relying on the uniqueness (via PARTITION) of other columns, the ROW_NUMBER() function gives us a way to eliminate the duplicates. The ORDER BY is required by the syntax, but it doesn't really matter in this context.
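In miniature, the pattern looks like this (names hypothetical):

```sql
-- One shape per country: geography can't pass through DISTINCT,
-- so number the duplicates and keep the first of each partition
SELECT Country, Shape
FROM (
    SELECT Country, Shape,
           ROW_NUMBER() OVER (PARTITION BY Country
                              ORDER BY (SELECT 1)) AS rn
    FROM dbo.CountryStaging
) AS numbered
WHERE rn = 1;
```

The ORDER BY (SELECT 1) is the "required but irrelevant" ordering mentioned above.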
Finally, you'll notice that I put a spatial index on each table to speed up searching. It works by gridding out the entire world on several layers, which allows the Query Optimizer to disregard large portions of the world immediately, without having to search them. However, the optimal grid setting depends on what the data will be used for. I have used the default setting here, but if you're doing some hardcore spatial querying that can't be cached, you'll have to check out the specifics.
Word of Warning: Don't try to read the new WKT (Well Known Text) geography types from SQL Server back into R. They are too large for R to handle, and this will cause an error.
Spatial Results from tables created above
Spatial statistics is infinitely easier to deploy when you can take the pain out of dealing with the spatial data in your analysis language. Offloading that work to SQL can save you a ton of time and memory that can be better used elsewhere.
Hopefully, getting these tables doesn't fully slake your thirst for better spatial analysis. The folks over at Natural Earth have way more data, and you'll be able to easily deploy projects that would have been extremely difficult in the past if you continue to learn SQL Server's geography features.
Fair Warning: This article assumes a working knowledge of Linear Programming and SQL
Maybe you're just jumping into the business world, and you've found that no one in your organization has heard of GAMS or LINDO. Maybe you're finding that Excel can't handle the matrices you're trying to build, or that a custom-made approach to building your matrices will be a large project that won't scale well to different applications.
An entry-level approach to building transportation networks involves plugging distances between supply, demand, and transshipment nodes into a matrix, the costs of traveling those arcs into an objective function, and then some constraint equations that restrict the values of certain variables.
Generally, the better you are at formulating these problems, the simpler the constraints can be, and the easier they are to solve. However, a few textbook examples show how, even in simple cases, the matrices grow as the networks' complexity increases. Scaling that up to, say, an enterprise-grade logistics network with tens of thousands of supply, transshipment, and demand nodes, it is clear that even setting up the distance matrix could become an insurmountable task.
If you try to offload spatial problems onto an analysis platform like R or Python, you'll likely find that all the tools you want to use were created in different packages or libraries. Keeping spatial data usable and comparable across different packages can mean building translations to different map projections, data structures (e.g. spatial data frames), and more to utilize the functions you want to apply.
However, a little celebrity appearance by SQL Server can take a huge bite out of the work that often proves most tedious when deploying this kind of work on an enterprise scale.
the sql server solution
As I've said before, I love SQL Server because it has a flavor for every budget, including free. SQL Server's unique distance functions offer an easy way to dynamically fill out the distance matrix. Replacing the tedious process of having data entry interns estimating distances and filling out huge matrices (or, the more common approach: Just not solving the problem at all) with this automated implementation is a classic case of making computers work for your organization rather than the other way around.
step one: Finding Lat-Longs
Google Places API makes it very easy to tag locations with lat-long coordinates. Obviously, the API can be used with traditional languages, and the API is flexible enough to accommodate a few use scenarios (searching by address, searching by place name, etc.). R users should check out this post to implement it. Consider running this tagging for your data set in chunks each night to get around the API's 2500 query limit.
R Example: Google API via ggmap
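The embedded example is missing from this archive; a minimal sketch of the idea (the addresses are made up, and geocode() is ggmap's wrapper around the Google API):

```r
library(ggmap)

# Hypothetical node addresses; the API key / daily limit caveats above apply
nodes  <- c("Port of Baltimore, MD", "Harrisburg, PA", "Columbus, OH")
coords <- geocode(nodes, source = "google")  # returns lon/lat columns

cbind(nodes, coords)
```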
Step two: sql server geography
Once you've gotten lat-longs from Google, use the T-SQL below to convert them to Geography datatypes and calculate distances between all the points.
TSQL: Set up geography column & Retrieve point-to-point distances
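The embedded T-SQL is missing from this archive; a sketch of the two steps (table and column names are hypothetical; 4326 is the WGS84 SRID, which matches Google's lat-longs):

```sql
-- Step A: convert the raw lat-longs to a geography column
ALTER TABLE dbo.Nodes ADD GeoPoint GEOGRAPHY;

UPDATE dbo.Nodes
SET GeoPoint = geography::Point(Lat, Long, 4326);

-- Step B: all pairwise distances; STDistance returns meters,
-- and the decimal divisor converts to miles
SELECT a.NodeID AS FromNode,
       b.NodeID AS ToNode,
       a.GeoPoint.STDistance(b.GeoPoint) / 1609.344 AS DistanceMiles
INTO #temp
FROM dbo.Nodes AS a
JOIN dbo.Nodes AS b
  ON a.NodeID <> b.NodeID;
```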
Note that I divided by a decimal when converting my distances from meters to miles. It is good SQL practice to always multiply and divide by a decimal, even if it's [Some Number].0, to avoid an accidental integer casting. This is especially dangerous since it can mess with calculations and will not throw an error or warning.
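You can see the trap in two lines:

```sql
SELECT 5 / 2   AS IntegerResult,  -- 2   (silent truncation)
       5 / 2.0 AS DecimalResult;  -- 2.5
```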
The results are saved to #temp and referenced below.
step three: Select the Solution
At this point, you have a couple of options. You can pivot the solution to create a traditional matrix, or you can just directly apply your constraints and costs to the results above, aggregate them, and order the answers to give you the lowest-cost options. This should almost always be the easiest solution, since all the metadata is still attached to the nodes.
If you're after a traditional matrix, I suggest the following approach, which borrows the dynamic SQL code you can find here.
Dynamic Matrix Selection
There you have it! A quick way to populate the distance matrix that can easily be incorporated with the rest of many network problems for a direct solution.
With just a few extensions (and some heuristics), it's an easy task to take the weight of setting up and formulating your problems off of your shoulders and put it on SQL Server's Query Optimizer!
In 2016, there are many options when it comes to platforms for conducting analytical projects. There are Python, R, SAS, and SPSS for statistical work. You can use a multitude of SQL flavors for databases (mySQL, SQLite, SQL Server, etc.), and to distribute the analysis to the masses, the number and types of reporting systems are too many to list and too varied to represent with any examples. However, the plethora of flexible solutions available shouldn't make you believe that there are no wrong answers. Many analytics professionals, especially folks who are focused on pure mathematics or statistics, believe platform agnosticism is next to godliness.
"Many analytics professionals ... believe platform agnosticism is next to godliness."
While it isn't fair to say that choosing one of these platforms over another will actually stop you from conducting one or more kinds of analysis, the choice will have a profound impact on how easy certain types of analysis will be to conduct, how scalable your analytical projects will be over time, and how the results can be communicated through your organization.
For the sake of transparency (if, by chance, you missed the title), I am a huge supporter of R and Microsoft SQL Server. There are a few reasons I prefer this duo:
mro and r: a sequel that doesn't suck
A few years ago, Microsoft released their own version of R, which enables parallel execution and adds enhanced matrix algebra utilities from Intel, but, most importantly, they left everything else alone. MRO (Microsoft R Open) and the Intel Math Kernel Library (MKL) greatly enhance performance of matrix algebra on a single core, and add to this advantage when more cores are available.
Shiny: Get ready to look like a hero
Unlike other analysis-centric programming languages, R has Shiny, which lets you deploy applications to people who need customized results. Python has a few users working on something sort of similar, but it is in a fledgling stage by comparison.
This is the death of BI tools. Tableau is down on its knees and just heard the hammer click back. I'll be dedicating a post to exactly why BI is doomed anyway, but Shiny is a significant nail in the coffin. By allowing non-technical users to parameterize complex problems and retrieve their own solutions, an analytics group is able to deliver more general solutions instead of targeted reports (which must always be tweaked if they are not entirely ignored).
Shiny application development feels markedly different from standard R programming, but if you can get a meaningful project and a little time, you'll be functionally proficient in a week or two.
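To make that concrete, a hello-world app is only a few lines (a toy sketch, not from any production project):

```r
library(shiny)

# A slider the end user drags to re-parameterize the "analysis" themselves
ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot(plot(rnorm(input$n), rnorm(input$n)))
}

shinyApp(ui, server)
```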
This is huge for businesses. There are people who build Tableau reports for a living making $90k/year just to show the results from analytical work. That is an insane use of money when you can just modify the original scripts in the original environment and get a perfectly functional, professional business solution.
You get to steal from smart people
R lets you harness the work of leading academics. Thanks to its popularity in scientific and academic spheres, R has packages that let you employ even the most obscure research techniques simply and effectively. Need a GMM solution to Weighted Least Squares? You're covered in one line of code.
The sqldf package is a trump card if you don't have the time or interest to master the data shuffling idiosyncrasies of a new language. This package allows you to use SQL syntax directly on data frames (the workhorse of object types in R), which greatly decreases learning time while increasing code transparency throughout your analytics group.
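A quick sketch of what that looks like (toy data):

```r
library(sqldf)

df <- data.frame(region = c("east", "east", "west"),
                 sales  = c(100, 250, 300))

# Plain SQL against an ordinary data frame
sqldf("SELECT region, SUM(sales) AS total FROM df GROUP BY region")
```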
One sql to rule them all...
First, I want to say that I am not being (and have never been) paid by Microsoft to endorse their products. Second, I want to say that I am super, super open to that. If anyone from Microsoft is reading, I would genuinely say anything you want for money...Bill Gates fights crime. Apple plastics are made from puppies.
Hit me up.
SQL Server, which uses Microsoft's T-SQL (Transact SQL) language, has several advantages for analytical work (besides being an extremely slick piece of technical goodness under the hood for computer-sciencey reasons). Chief among them is a sweet new set of features called R Services, which lets you run R inside your database environment.
Could you mess around with C# and make something similar appear to happen with Python or SAS? If one of your chief concerns on a daily basis is the rapidly declining amount of wall space available for hanging your honorary degrees while your shed fills up with patent approvals, maybe. However, if your main focus is on the analysis and not on systems engineering (or you're a mere mortal), go for the off-the-shelf solution.
Besides R Services, SQL Server has a few other advantages that come up on a daily basis.
without geography, you're nowhere
SQL Server implements Geography Functions and Geography Indexing in a user-friendly way, meaning spatial calculations become infinitely easier. Gone are the days of using 15 libraries to grid out your x-y points while dealing with map projections, 2D/3D conversion, etc.: It's all handled. Spatial statistics and optimization problems that used to be untenable and beautiful maps with analysis layers become manageable when you combine SQL Server's intuitive geography features with R's functionality.
Semantic Search: Another day at the text mine
SQL Server's Semantic Search capabilities are absolutely awesome. Without too much trouble, you'll be writing queries that are Terminator-esque. By utilizing thesauruses (yes, that's the plural - I Googled) and a host of other nifty text-mining assistance tools, you can write your own code to do feature extraction, score documents on how similar they are, match & extract key phrases, and more. This is way beyond regex.
Though not truly text mining on its own, these features provide you a jumping-off point way ahead of where you would start if you were only using your analytical language.
it's (sort of) free
Using SQL Server Express is a great way to bridge the gap between budget and high-end functionality, and it isn't a bait-and-switch. This isn't one of those "This is totally free...I just need your credit card info" deals. There are a couple of restrictions on Express vs. the full distribution, but they really are minor and related to DB size. It's free. Forever. Period.
Also - this usually goes unmentioned - a consequence of this "free" distribution is that people can install it on their personal computers at home, making it easier for people to pick up and practice if they're new to databases, database management systems, or just new to SQL.
Perfection and other things you won't find in this world
Are there weaknesses to this approach? Of course. Depending on how you divide up the work between SQL Server and R, you may run into the fact that R's flow control is hideously slow (the new parallel execution accelerates calculations, not flow control). And those convenient/awesome/magical geography types from SQL Server? They'll cause R to fail if you try to read them in (it doesn't take many records to exceed R's memory allotment).
Developing packages for R can also be frustrating, as it requires an arcane (Rcane?) series of component installations that, thanks to reliance on the specified PATH and the fact that every installed component seems to pick a different default installation destination, tend to fail in unhelpful ways.
Java is also finicky on R (this comes up more rarely now, but it does come up). Let me save you some heartache: You need to go find the Java distribution that matches your R architecture (64- or 32-bit). The default Java install won't work.
The dynamic duo I recommend has served me incredibly well over the years. I have not, to this day, run into an analytical problem that could not be solved and solved well with these tools, and they both keep getting better. Picking up the ability to employ these solutions in tandem makes you more valuable, your group far more productive, and your work incredible.
Copyright © 2018