Fair Warning: This article assumes a working knowledge of Linear Programming and SQL
Maybe you're just jumping into the business world, and you've found that no one in your organization has heard of GAMS or LINDO. Maybe you're finding that Excel can't handle the matrices you're trying to build, or that a custom-made approach to building your matrices will be a large project that won't scale well to different applications.
An entry-level approach to modeling transportation networks involves plugging the distances between supply, demand, and transshipment nodes into a matrix, folding the costs of traveling those arcs into an objective function, and adding constraint equations that restrict the values of certain variables.
Generally, the better you are at formulating these problems, the simpler the constraints can be and the easier they are to solve. Even so, a few textbook examples show how quickly the matrices grow as a network's complexity increases. Scale that up to, say, an enterprise-grade logistics network with tens of thousands of supply, transshipment, and demand nodes, and it becomes clear that merely setting up the distance matrix could be an insurmountable task.
If you try to offload spatial problems onto an analysis platform like R or Python, you'll likely find that the tools you want are scattered across different packages or libraries. Keeping spatial data usable and comparable across packages can mean translating between map projections, data structures (e.g., spatial data frames), and more just to apply the functions you want.
However, a little celebrity appearance by SQL Server can take a huge bite out of the work that often proves most tedious when deploying this kind of work on an enterprise scale.
the sql server solution
As I've said before, I love SQL Server because it has a flavor for every budget, including free. SQL Server's built-in geography data type and distance functions offer an easy way to fill out the distance matrix dynamically. Replacing the tedious process of having data-entry interns estimate distances and fill out huge matrices (or, the more common approach: just not solving the problem at all) with this automated implementation is a classic case of making computers work for your organization rather than the other way around.
step one: Finding Lat-Longs
The Google Places API makes it very easy to tag locations with lat-long coordinates. The API can be called from most mainstream languages and is flexible enough to accommodate several use cases (searching by address, searching by place name, etc.). R users should check out this post to implement it. Consider running the tagging for your data set in chunks each night to stay under the API's 2,500-queries-per-day limit.
R Example: Google API via ggmap
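The original snippet isn't reproduced here, so below is a minimal sketch of the idea using ggmap's geocode() function. The addresses are placeholders, and newer versions of ggmap require registering your own key via register_google() first.

```r
library(ggmap)

# Newer ggmap versions require a registered key; "YOUR_API_KEY" is a placeholder
register_google(key = "YOUR_API_KEY")

# Placeholder locations -- swap in your own node addresses
addresses <- c("1600 Pennsylvania Ave NW, Washington, DC",
               "350 5th Ave, New York, NY")

# geocode() returns one lon/lat row per address
coords <- geocode(addresses, output = "latlon", source = "google")
coords
```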
step two: SQL Server Geography
Once you've gotten lat-longs from Google, use the T-SQL below to convert them to the geography data type and calculate distances between all the points.
TSQL: Set up geography column & Retrieve point-to-point distances
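Since the original snippet isn't shown here, the following is a minimal sketch. It assumes a hypothetical dbo.Nodes table holding NodeID, Latitude, and Longitude; geography::Point() and STDistance() do the real work, with STDistance() returning meters for the geography type.

```sql
-- Add a geography column (SRID 4326 = WGS 84, the lat-long system Google returns)
ALTER TABLE dbo.Nodes ADD GeoPoint geography;
GO

UPDATE dbo.Nodes
SET GeoPoint = geography::Point(Latitude, Longitude, 4326);
GO

-- Cross join every origin-destination pair; convert meters to miles
-- with a decimal divisor (see the note below)
SELECT o.NodeID AS OriginID,
       d.NodeID AS DestID,
       o.GeoPoint.STDistance(d.GeoPoint) / 1609.344 AS DistanceMiles
INTO #temp
FROM dbo.Nodes AS o
CROSS JOIN dbo.Nodes AS d
WHERE o.NodeID <> d.NodeID;
```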
Note that I divided by a decimal when converting the distances from meters to miles. It is good SQL practice to always multiply and divide by a decimal, even if it's [Some Number].0, to avoid accidental integer casting. Integer division is especially dangerous because it silently skews calculations without throwing an error or warning.
The results are saved to #temp and referenced below.
step three: Select the Solution
At this point, you have a couple of options. You can pivot the solution to create a traditional matrix, or you can directly apply your constraints and costs to the results above, aggregate them, and order the answers to surface the lowest-cost options. The direct approach is almost always the easiest, since all the metadata is still attached to the nodes; a sketch of it follows.
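In this sketch, the $1.85 cost per mile is invented; substitute whatever cost data actually hangs off your nodes.

```sql
-- Rank origins by distance for each destination and keep the cheapest
WITH ranked AS (
    SELECT OriginID,
           DestID,
           DistanceMiles * 1.85 AS RouteCost,  -- 1.85 = assumed cost per mile
           ROW_NUMBER() OVER (PARTITION BY DestID
                              ORDER BY DistanceMiles) AS rn
    FROM #temp
)
SELECT OriginID, DestID, RouteCost
FROM ranked
WHERE rn = 1
ORDER BY DestID;
```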
If you're after a traditional matrix, I suggest the following approach, which borrows the dynamic SQL code you can find here.
Dynamic Matrix Selection
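The linked code isn't reproduced here, but the sketch below shows the general shape: build the column list dynamically from whatever destination IDs exist in #temp, then pivot.

```sql
DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- Build the pivot column list from the destination IDs in #temp
SELECT @cols = STUFF((
        SELECT DISTINCT ',' + QUOTENAME(CAST(DestID AS nvarchar(20)))
        FROM #temp
        FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'), 1, 1, '');

SET @sql = N'
SELECT OriginID, ' + @cols + N'
FROM (SELECT OriginID, DestID, DistanceMiles FROM #temp) AS src
PIVOT (MIN(DistanceMiles) FOR DestID IN (' + @cols + N')) AS p
ORDER BY OriginID;';

EXEC sp_executesql @sql;
```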
There you have it! A quick way to populate the distance matrix that can easily be folded into the rest of a network problem for a direct solution.
With just a few extensions (and some heuristics), it's an easy task to take the weight of setting up and formulating your problems off of your shoulders and put it on SQL Server's Query Optimizer!
In 2016, there are many options when it comes to platforms for conducting analytical projects. There are Python, R, SAS, and SPSS for statistical work; a multitude of SQL flavors for databases (MySQL, SQLite, SQL Server, etc.); and, for distributing the analysis to the masses, more reporting systems than can be listed or fairly represented by examples. However, the plethora and flexibility of solutions available shouldn't make you believe that there are no wrong answers. Many analytics professionals, especially folks who are focused on pure mathematics or statistics, believe platform agnosticism is next to godliness.
"Many analytics professionals ... believe platform agnosticism is next to godliness."
While it isn't fair to say that choosing one of these platforms over another will actually stop you from conducting one or more kinds of analysis, the choice will have a profound impact on how easy certain types of analysis will be to conduct, how scalable your analytical projects will be over time, and how the results can be communicated through your organization.
For the sake of transparency (if, by chance, you missed the title), I am a huge supporter of R and Microsoft SQL Server. There are a few reasons I prefer this duo:
mro and r: a sequel that doesn't suck
A few years ago, Microsoft released their own distribution of R that enables parallel execution and adds Intel's enhanced matrix algebra utilities, but, most importantly, they left everything else alone. MRO (Microsoft R Open) with the Intel Math Kernel Library (MKL) greatly improves the performance of matrix algebra on a single core, and the advantage grows as more cores become available.
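You don't need MRO-specific code to see this; MKL hooks in underneath base R's linear algebra, so timing the same matrix multiply on stock R and on MRO shows the gap.

```r
# Time a 2000 x 2000 matrix multiply -- run on stock R, then on MRO
set.seed(42)
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(m %*% m)
```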
Shiny: Get ready to look like a hero
Unlike other analysis-centric programming languages, R has Shiny, which lets you deploy applications to people who need customized results. Python has a few users working on something sort of similar, but it is in a fledgling stage by comparison.
This is the death of BI tools. Tableau is down on its knees and just heard the hammer click back. I'll be dedicating a post to exactly why BI is doomed anyway, but Shiny is a significant nail in the coffin. By allowing non-technical users to parameterize complex problems and retrieve their own solutions, an analytics group is able to deliver more general solutions instead of targeted reports (which must always be tweaked if they are not entirely ignored).
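To make that concrete, here is a toy sketch; the route_costs data frame is fake, standing in for real solver output, but the shape is the whole story: a slider the business user drags, and a table that re-ranks itself.

```r
library(shiny)

# Fake route costs standing in for real solver output
set.seed(1)
route_costs <- data.frame(
  origin = sample(LETTERS[1:5], 50, replace = TRUE),
  dest   = sample(LETTERS[6:10], 50, replace = TRUE),
  cost   = round(runif(50, 100, 1000), 2)
)

ui <- fluidPage(
  titlePanel("Lowest-Cost Routes"),
  sliderInput("n", "Routes to show:", min = 1, max = 50, value = 10),
  tableOutput("routes")
)

server <- function(input, output) {
  output$routes <- renderTable({
    head(route_costs[order(route_costs$cost), ], input$n)
  })
}

shinyApp(ui, server)
```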
Shiny application development feels markedly different from standard R programming, but given a meaningful project and a little time, you'll be functionally proficient in a week or two.
This is huge for businesses. There are people who build Tableau reports for a living, making $90k/year, just to show the results of analytical work. That is an insane use of money when you can just modify the original scripts in the original environment and get a perfectly functional, professional business solution.
You get to steal from smart people
R lets you harness the work of leading academics. Thanks to its popularity in scientific and academic spheres, R has packages that let you employ even the most obscure research techniques simply and effectively. Need a GMM solution to Weighted Least Squares? You're covered in one line of code.
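As a hedged illustration with the gmm package (the data here is simulated, with z instrumenting x):

```r
library(gmm)  # CRAN implementation of the generalized method of moments

set.seed(1)
z <- rnorm(200)            # instrument
x <- z + rnorm(200)        # endogenous regressor
y <- 2 * x + rnorm(200)    # outcome

# The promised one line: GMM estimation of y ~ x with z as the instrument
fit <- gmm(y ~ x, x = z)
summary(fit)
```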
The sqldf package is a trump card if you don't have the time or interest to master the data shuffling idiosyncrasies of a new language. This package allows you to use SQL syntax directly on data frames (the workhorse of object types in R), which greatly decreases learning time while increasing code transparency throughout your analytics group.
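A quick sketch, with an invented routes data frame:

```r
library(sqldf)

routes <- data.frame(origin = c("A", "A", "B", "B"),
                     dest   = c("X", "Y", "X", "Y"),
                     miles  = c(410, 520, 300, 610))

# Plain SQL executed directly against an ordinary R data frame
sqldf("SELECT origin, MIN(miles) AS shortest
       FROM routes
       GROUP BY origin")
```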
One sql to rule them all...
First, I want to say that I am not being (and have never been) paid by Microsoft to endorse their products. Second, I want to say that I am super, super open to that. If anyone from Microsoft is reading, I would genuinely say anything you want for money...Bill Gates fights crime. Apple plastics are made from puppies.
Hit me up.
SQL Server, which uses Microsoft's T-SQL (Transact-SQL) language, has several advantages for analytical work (besides being an extremely slick piece of technical goodness under the hood for computer-sciencey reasons). Chief among them is a sweet new set of features called R Services, which lets you run R inside your database environment.
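The entry point is the sp_execute_external_script stored procedure. The sketch below assumes the feature is installed and enabled, plus a hypothetical dbo.Sales table:

```sql
-- Run R inside the database: compute a mean and return it as a result set
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(avg_rev = mean(InputDataSet$Revenue))',
    @input_data_1 = N'SELECT Revenue FROM dbo.Sales'
WITH RESULT SETS ((avg_rev float));
```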
Could you mess around with C# and make something similar appear to happen with Python or SAS? If one of your chief concerns on a daily basis is the rapidly declining amount of wall space available for hanging your honorary degrees, and your shed is filling up with patent approvals, then maybe. However, if your main focus is on the analysis and not on systems engineering (or you're a mere mortal), maybe go for the off-the-shelf solution.
Besides R Services, SQL Server has a few other advantages that come up on a daily basis.
without geography, you're nowhere
SQL Server implements geography functions and spatial indexing in a user-friendly way, meaning spatial calculations become infinitely easier. Gone are the days of using 15 libraries to grid out your x-y points while dealing with map projections, 2D/3D conversion, etc.: it's all handled. Spatial statistics, optimization problems that used to be untenable, and beautiful maps with analysis layers all become manageable when you combine SQL Server's intuitive geography features with R's functionality.
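As a taste, nearest-neighbor lookups collapse to an ORDER BY. This sketch reuses the hypothetical dbo.Nodes table and GeoPoint column from the first half of this post:

```sql
-- Five closest nodes to a given point, in miles
DECLARE @here geography = geography::Point(40.7128, -74.0060, 4326);

SELECT TOP (5)
       NodeID,
       GeoPoint.STDistance(@here) / 1609.344 AS Miles
FROM dbo.Nodes
ORDER BY GeoPoint.STDistance(@here);
```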
Semantic Search: Another day at the text mine
SQL Server's Semantic Search capabilities are absolutely awesome. Without too much trouble, you'll be writing queries that are Terminator-esque. By utilizing thesauruses (yes, that's the plural; I Googled) and a host of other nifty text-mining tools, you can write your own code to do feature extraction, score documents on how similar they are, match and extract key phrases, and more. This is way beyond regex.
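For instance, assuming a hypothetical dbo.Documents table whose Content column has full-text and semantic indexing enabled, pulling the top key phrases is one query:

```sql
-- Top 20 key phrases across all semantically indexed documents
SELECT TOP (20)
       keyphrase,
       SUM(score) AS total_score
FROM SEMANTICKEYPHRASETABLE(dbo.Documents, Content)
GROUP BY keyphrase
ORDER BY total_score DESC;
```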
Though not truly text mining on their own, these features give you a jumping-off point well ahead of where you would start if you were only using your analytical language.
it's (sort of) free
Using SQL Server Express is a great way to bridge the gap between budget and high-end functionality, and it isn't a bait-and-switch. This isn't one of those "This is totally free...I just need your credit card info" deals. There are a couple of restrictions on Express vs. the full distribution, but they really are minor and related to DB size. It's free. Forever. Period.
Also - this usually goes unmentioned - a consequence of this "free" distribution is that people can install it on their personal computers at home, making it easier for people to pick up and practice if they're new to databases, database management systems, or just new to SQL.
Perfection and other things you won't find in this world
Are there weaknesses to this approach? Of course. Depending on how you divide the work between SQL Server and R, you may run into the fact that R's flow control is hideously slow (the new parallel execution accelerates calculations, not flow control). And those convenient/awesome/magical geography types from SQL Server? They'll cause R to fail if you try to read them in directly (it doesn't take many records to exceed R's memory allotment).
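The workaround is to never pull the raw geography blob across the wire: cast to text or plain floats on the SQL side first (again assuming the dbo.Nodes table from earlier).

```sql
-- Ship R-friendly columns instead of the geography blob
SELECT NodeID,
       GeoPoint.Lat        AS Latitude,
       GeoPoint.Long       AS Longitude,
       GeoPoint.STAsText() AS WKT  -- well-known text, if you need the shape
FROM dbo.Nodes;
```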
Developing packages for R can also be frustrating: it requires an arcane (Rcane?) series of component installations that is easy to get wrong, both because everything relies on the specified PATH and because every installed component seems to have a different default installation destination.
Java is also finicky with R (this comes up more rarely now, but it does come up). Let me save you some heartache: you need to go find the Java distribution that matches your R architecture (64- or 32-bit). The default Java install won't work.
The dynamic duo I recommend has served me incredibly well over the years. I have not, to this day, run into an analytical problem that could not be solved and solved well with these tools, and they both keep getting better. Picking up the ability to employ these solutions in tandem makes you more valuable, your group far more productive, and your work incredible.