12/18/2016
Let's assume you've wisely taken my suggestion about setting up SQL Server to scale up your analysis capabilities and value proposition. Despite having a multitude of intuitive geography functions, however, you'll notice that SQL Server doesn't actually know where anything is! If, say, you need to know whether the facilities in one of your tables are in Maryland, you can easily look up the conversion from standard lat-longs to geography data types, but turning "Maryland" into something that makes sense to your database is much harder to figure out.
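To make the gap concrete: the point-conversion half of the problem really is a one-liner (the sample coordinates are just an illustration; note that `geography::Point` takes latitude first, then longitude):

```sql
-- Converting a lat-long pair into a geography point is well documented.
-- SRID 4326 is the standard WGS84 coordinate system.
DECLARE @Facility geography = geography::Point(39.2904, -76.6122, 4326);
SELECT @Facility.STAsText();

-- ...but "is this point in Maryland?" has no built-in answer, because
-- SQL Server ships with no boundary data. That's what we'll fix below.
```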
If you Google around for a while, the conventional wisdom seems to involve getting your mitts on some shapefiles and attempting to load them directly into your database with Shape2SQL, which was built for SQL Server 2008. However, the conversion tool is not professionally maintained, and I've had it fail on me a few times in SQL Server 2014 and 2016.
What's worse, even though political boundaries are incredibly useful and relevant, they are not considered a standard feature (hence, the lack of documentation), which means your DBAs aren't going to have a ready-made solution for you.
Because of this obvious capability gap, I figured I'd help out my fellow data scientists with a simple tutorial to get a couple of extremely useful tables up and running. The code that follows utilizes data from Natural Earth, which is an awesome project that provides shapefiles for the boundary data we need. I've used their 50m resolution maps, but they provide more and less granular data. You can download what you need from their website, or you can snag the files I used here:
Finally, note that I only use the spatial data from the files, but you could pull through all the other data that is included in them by the people at Natural Earth. Check out what's available after you read these files into R.
Step One: R, Maptools, & RODBC
Of course, R doesn't have a native way of talking to shapefiles, but with maptools and RODBC, you're able to read in, manipulate, and export these files in a way that will agree with SQL Server. Pay special attention to the way "crswgs84" is declared and used. In my research, this was a common hangup for a lot of people using shapefiles in R.
As a bonus, the Code-to-Results ratio is pretty awesome.
R Code: Country Shapefile to SQL Server
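The original post shared this code as an image; here's a minimal sketch of the idea, assuming the unzipped Natural Earth 50m countries shapefile is in the working directory (the file path, connection string, attribute field, and table/column names are my placeholders, not the post's originals):

```r
library(maptools)  # readShapePoly (loads sp as a dependency)
library(RODBC)

# This declaration trips a lot of people up: the shapefile carries no
# projection info that maptools will use, so we assert WGS84 ourselves.
crswgs84 <- CRS("+proj=longlat +datum=WGS84 +no_defs")

countries <- readShapePoly("ne_50m_admin_0_countries.shp",
                           proj4string = crswgs84)

# Flatten every polygon ring into one long data frame of
# (country, polygon id, point order, longitude, latitude).
# NB: the attribute field may be "name" or "NAME" depending on the
# Natural Earth version you downloaded -- check names(countries@data).
rows <- do.call(rbind, lapply(seq_along(countries), function(i) {
  polys <- countries@polygons[[i]]@Polygons
  do.call(rbind, lapply(seq_along(polys), function(j) {
    crds <- polys[[j]]@coords
    data.frame(CountryName = countries@data$name[i],
               PolygonId   = j,
               PointOrder  = seq_len(nrow(crds)),
               Longitude   = crds[, 1],
               Latitude    = crds[, 2])
  }))
}))

# Push the flattened points to SQL Server
ch <- odbcDriverConnect(
  "driver={SQL Server};server=localhost;database=Spatial;trusted_connection=true")
try(sqlDrop(ch, "CountryPoints"))   # harmless if the table doesn't exist yet
sqlSave(ch, rows, tablename = "CountryPoints", rownames = FALSE)
odbcClose(ch)
```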
I can hear it now: "But why don't you use SqlDrop() to try and drop the table? And what's with the ANSI_WARNINGS bit?"
That, my friend, is what experience (and more than a few minutes spent with my head on my desk) looks like. Also, if you're trying to run this from the CMD console, store the "try" statements to a variable ("var <- try(...)") to prevent a code-stopping warning (it took a lot of profanity to figure that out). The "ANSI_WARNINGS" and "NOCOUNT" settings control things that can confuse your RODBC connection. Sometimes, when a SQL statement issues an "X rows affected" or other non-error message, R thinks execution failed. It doesn't happen every time, but it does happen... especially when you're showing your code to someone else.
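For what it's worth, those two tricks look roughly like this (the connection handle `ch` and the table name are placeholders):

```r
# Assigning try() output to a variable keeps a failed drop from
# halting a script launched from the CMD console (Rscript):
var <- try(sqlDrop(ch, "CountryPoints"), silent = TRUE)

# Quiet the "rows affected"-style chatter that can make RODBC
# think a perfectly good statement failed:
sqlQuery(ch, "SET ANSI_WARNINGS OFF;")
sqlQuery(ch, "SET NOCOUNT ON;")
```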
Now, we'll rinse and repeat for the province file:
R Code: Province Shapefile to SQL Server
Boom. Now you've got two tables in your database populated with countries and provinces. Currently, the borders are a series of lat-longs that we will use to trace the polygon. If you go rogue after this tutorial, you'll need to familiarize yourself with some spatial data rules, the first of which throws many noobs off: the lat-long points need to go in a certain order. In fact, that's why the SQL functions above have ORDER BY clauses.
Right-hand/left-hand (i.e., clockwise/counterclockwise) rules differ by database type (SQL Server, Oracle, etc.) AND by spatial type. SQL Server's geography type, for instance, follows the left-hand rule: as you walk a ring in point order, the interior must be on your left. For a shape with holes in it, that means the exterior ring's points are listed counterclockwise while each interior ring's points are listed clockwise, as described here.
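You can see the rule bite with a toy ring (coordinates are arbitrary; `ReorientObject()` requires SQL Server 2012+):

```sql
-- WKT lists longitude first. The first ring walks counterclockwise;
-- the second is the same ring reversed (clockwise).
DECLARE @ok  geography = geography::STGeomFromText(
    'POLYGON((-77 38, -76 38, -76 39, -77 39, -77 38))', 4326);
DECLARE @bad geography = geography::STGeomFromText(
    'POLYGON((-77 38, -77 39, -76 39, -76 38, -77 38))', 4326);

-- A backwards ring describes the complement of what you meant:
-- everything on Earth EXCEPT your polygon. EnvelopeAngle() exposes
-- this (an inside-out object reports an angle of about 180).
SELECT @ok.EnvelopeAngle() AS ok_angle, @bad.EnvelopeAngle() AS bad_angle;

-- ReorientObject() flips a ring that arrived wound the wrong way:
SELECT @bad.ReorientObject().STEquals(@ok) AS now_equal;
```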
There are tons of other things to worry about, like degrees-minutes-seconds vs. decimal-degree lat-longs, various projection/grid types, and SRIDs, but we'll stay on the yellow brick road for the time being.
Step Two: Lat-Long To Polygon In SQL Server
First, we're going to make a couple of functions that are capable of converting the lat-longs that come from R into geography polygons. And don't sweat the red highlighting - SSMS is just letting you know that it doesn't see the tables you're pointing at with the functions. They'll be there by the time we call the functions.
Declare Functions for Polygon Parsing
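The original functions appeared as an image; here's a sketch of one of them under my own naming (the function, table, and column names are mine, and it assumes Latitude/Longitude are stored as exact decimals, e.g. decimal(9,6)). FOR XML PATH does the string aggregation so it works before STRING_AGG arrived in SQL Server 2017:

```sql
CREATE FUNCTION dbo.fn_CountryPolygon (@Country nvarchar(100), @PolygonId int)
RETURNS geography
AS
BEGIN
    DECLARE @wkt nvarchar(max);

    -- Stitch the ordered points into "lon lat, lon lat, ..." form.
    -- Shapefile rings arrive closed (first point repeated last),
    -- which is exactly what WKT expects.
    SELECT @wkt = STUFF((
        SELECT ', ' + CONVERT(varchar(20), Longitude) + ' '
                    + CONVERT(varchar(20), Latitude)
        FROM dbo.CountryPoints
        WHERE CountryName = @Country AND PolygonId = @PolygonId
        ORDER BY PointOrder              -- point order matters!
        FOR XML PATH('')), 1, 2, '');

    -- MakeValid() (2012+) papers over minor ring-order sins.
    RETURN geography::STGeomFromText(
        'POLYGON((' + @wkt + '))', 4326).MakeValid();
END;
```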
Functions accomplished. Now, let's make some tables, apply the functions, and clean up the dbo schema.
Step Three: Create, Populate, And Drop Tables
Apply Polygon Functions & Clean Things Up
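A sketch of this step for the country side (all object names are my placeholders; it assumes a staging table of ordered points and a polygon-parsing function like the one declared in Step Two):

```sql
CREATE TABLE dbo.Countries (
    CountryName nvarchar(100) NOT NULL,
    PolygonId   int           NOT NULL,
    Border      geography     NOT NULL,
    CONSTRAINT PK_Countries PRIMARY KEY (CountryName, PolygonId)
);

INSERT INTO dbo.Countries (CountryName, PolygonId, Border)
SELECT CountryName, PolygonId, Border
FROM (
    SELECT p.CountryName, p.PolygonId, f.Border,
           -- geography isn't DISTINCT-able, so dedup on other columns;
           -- the ORDER BY is required syntax but otherwise irrelevant here
           ROW_NUMBER() OVER (PARTITION BY p.CountryName, p.PolygonId
                              ORDER BY (SELECT NULL)) AS rn
    FROM dbo.CountryPoints AS p
    CROSS APPLY (SELECT dbo.fn_CountryPolygon(p.CountryName, p.PolygonId)
                 AS Border) AS f
) AS d
WHERE d.rn = 1;

-- The staging points have served their purpose
DROP TABLE dbo.CountryPoints;
```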
Just like that, we've got what we came for: two tables that can be used for meaningful spatial analysis with all the goodies T-SQL provides. Also, if you didn't know how CROSS APPLY worked before, I think this gives you a good example with an intuitive function. APPLY is extremely difficult to explain in the abstract, but it's easy to understand when you see it.
A little trick I embedded to eliminate duplicate entries is the ROW_NUMBER() window function. If you run the subqueries by themselves, they return a row for every country and lat-long pair, which isn't useful. However, the geography data type is not comparable - it can't be used with DISTINCT. Relying on the uniqueness (via PARTITION) of other columns, the ROW_NUMBER() function gives us a way to eliminate the duplicates. The ORDER BY is required by the syntax, but it doesn't really matter in this context.
Finally, you'll notice that I put a spatial index on each table to speed up searching. It works by gridding out the entire world on several layers, which allows the Query Optimizer to disregard large portions of the world immediately, without having to search them. However, the optimal grid setting depends on what the data will be used for. I have used the default setting here, but if you're doing some hardcore spatial querying that can't be cached, you'll have to check out the specifics.
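The index itself is a one-liner with the default grid (table and column names follow my earlier placeholders; a spatial index also requires the table to have a clustered primary key):

```sql
-- GEOGRAPHY_AUTO_GRID is the default scheme on SQL Server 2012+;
-- tune GRIDS/CELLS_PER_OBJECT only if profiling says you should.
CREATE SPATIAL INDEX SIdx_Countries_Border
ON dbo.Countries (Border);
```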
Word of Warning: Don't try to read the new geography columns (or their WKT, Well Known Text, representations) from SQL Server back into R. The values are too large for RODBC to handle, and the read will fail with an error.
Spatial Results from tables created above
Spatial statistics is infinitely easier to deploy when you can take the pain out of dealing with the spatial data in your analysis language. Offloading that work to SQL can save you a ton of time and memory that can be better used elsewhere.
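With the tables in place, the question from the top of the post becomes a one-liner in SQL rather than an R memory problem (the Facilities table and column names are hypothetical):

```sql
SELECT f.FacilityName
FROM dbo.Facilities AS f
JOIN dbo.Provinces  AS p
  -- latitude first, then longitude; the spatial index makes this fast
  ON p.Border.STIntersects(geography::Point(f.Latitude, f.Longitude, 4326)) = 1
WHERE p.ProvinceName = 'Maryland';
```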
Hopefully, getting these tables doesn't fully slake your thirst for better spatial analysis. The folks over at Natural Earth have way more data, and you'll be able to easily deploy projects that would have been extremely difficult in the past if you continue to learn SQL Server's geography features.
Copyright © 2018