To use the code in this post, download this file, import it into your geography database (or wherever), and execute the following code to create a geography column on the table. You can throw a spatial index on there if you like, but the code below isn't designed to take advantage of it.
I will name the table [dbo].[USZipCodes] and assume you have SSMS pointed to the relevant database so we can avoid three-part identifiers.
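The setup might look something like the following. This is a sketch under assumed names: I'm guessing the imported file gives you Latitude and Longitude columns in decimal degrees, and I'm calling the new column GeoPoint. SRID 4326 is WGS 84.

```sql
-- Assumes the imported table has Latitude and Longitude columns in decimal
-- degrees; adjust the names to match your import. SRID 4326 is WGS 84.
ALTER TABLE dbo.USZipCodes
    ADD GeoPoint GEOGRAPHY;
GO

-- Note the argument order: GEOGRAPHY::Point takes (latitude, longitude, SRID).
UPDATE dbo.USZipCodes
SET GeoPoint = GEOGRAPHY::Point(Latitude, Longitude, 4326);
```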
While there are undoubtedly better platforms than SQL Server for k-means clustering, scenarios exist where it makes a lot of sense to do these calculations in a database. Procedures running overnight that could benefit from a clustering heuristic for cutting down calculation times are probably the most common cases.
The traditional k-means algorithm, however, relies on features of Euclidean geometry. Because SQL Server's geography calculations happen on an ellipsoid, this is technically performing k-means on a warped surface. However, since the points all sit "on top" of the ellipsoid and every calculation is constrained to the shortest arc over the surface (thanks, Microsoft!), we won't fall into any nonlinearity traps.
However, this solution is not intended for data that cross the 180th meridian, because it averages latitude and longitude to find density-weighted centroids during the update phase. If your data cross this line, I would recommend using a reference bounding box as the basis for calculating centrality.
The code here uses the Forgy initialization, which means that k points are chosen at random as initial points. However, instead of picking random points on the Earth (the analog for normal Forgy), it picks points from the data at hand.
It’s worth calling out that this method does not give a deterministic result: k-means is sensitive to the random initialization positions, so each run can produce different clusters. It is imperative to test any heuristic method on your data to ensure reliability before including it in a workflow.
k-Means Clustering: Step One
You'll notice a lot of window functions in this code. That's because the geography data type is not comparable: it can't appear in a GROUP BY clause or be fed to ordinary aggregate functions, so window functions have to do the work that grouping normally would.
Also, I want to call out ORDER BY ROW_NUMBER() OVER (ORDER BY NEWID()), which is one of my favorite constructions in T-SQL. It allows you to randomly order any data by first assigning a random GUID to each row and then ordering by it. The best part is that the GUID is gone by the end of the query since it's never selected: an efficient way of handling a problem SQL isn't really built for.
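As a sketch of how the Forgy initialization might use that construction (the @k value and the #Centroids temp table name are illustrative, and GeoPoint stands in for whatever you named the geography column during setup):

```sql
-- Forgy initialization sketch: draw k points at random from the data and
-- use them as the starting centroids. @k and #Centroids are illustrative.
DECLARE @k INT = 5;

SELECT ClusterID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       Centroid
INTO #Centroids
FROM (
    -- TOP plus a random ordering picks k rows uniformly at random;
    -- the GUIDs never appear in the output.
    SELECT TOP (@k) GeoPoint AS Centroid
    FROM dbo.USZipCodes
    ORDER BY ROW_NUMBER() OVER (ORDER BY NEWID())
) AS seeds;
```

The outer ROW_NUMBER() just relabels the chosen seeds 1 through k so each centroid gets a stable cluster ID.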
Step Two: Loop and Select Results
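The skeleton of this loop might look like the sketch below. All the names here are assumptions on my part: #Centroids(ClusterID, Centroid) holds the current centroids, #Assignments holds the point-to-cluster mapping, GeoPoint is the geography column created during setup, and the tolerance value is purely illustrative.

```sql
-- Convergence-loop sketch. STDistance on geography returns meters, so the
-- tolerance is a maximum allowed centroid movement in meters.
DECLARE @tolerance FLOAT = 1000.0;
DECLARE @maxMove   FLOAT = @tolerance + 1;  -- force at least one pass

WHILE @maxMove > @tolerance
BEGIN
    -- Assignment step: nearest centroid per point, ranked with a window
    -- function since GEOGRAPHY values can't be grouped or MIN'd directly.
    IF OBJECT_ID('tempdb..#Assignments') IS NOT NULL DROP TABLE #Assignments;

    SELECT ZipCode, ClusterID
    INTO #Assignments
    FROM (
        SELECT z.ZipCode, c.ClusterID,
               ROW_NUMBER() OVER (PARTITION BY z.ZipCode
                                  ORDER BY z.GeoPoint.STDistance(c.Centroid)) AS rn
        FROM dbo.USZipCodes AS z
        CROSS JOIN #Centroids AS c
    ) AS d
    WHERE rn = 1;

    -- Update step: each new centroid is the average latitude/longitude of
    -- its members (the density-weighted centroid described earlier).
    IF OBJECT_ID('tempdb..#NewCentroids') IS NOT NULL DROP TABLE #NewCentroids;

    SELECT a.ClusterID,
           GEOGRAPHY::Point(AVG(z.Latitude), AVG(z.Longitude), 4326) AS Centroid
    INTO #NewCentroids
    FROM #Assignments AS a
    JOIN dbo.USZipCodes AS z ON z.ZipCode = a.ZipCode
    GROUP BY a.ClusterID;

    -- Convergence test: stop once no centroid moved farther than @tolerance.
    SELECT @maxMove = MAX(c.Centroid.STDistance(n.Centroid))
    FROM #Centroids AS c
    JOIN #NewCentroids AS n ON n.ClusterID = c.ClusterID;

    UPDATE c
    SET c.Centroid = n.Centroid
    FROM #Centroids AS c
    JOIN #NewCentroids AS n ON n.ClusterID = c.ClusterID;
END;
```

Logging each iteration's #NewCentroids to a history table is what produces the record of centroid movement mentioned below.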
There it is: you're left with the final assignment of each point to a cluster, the final set of clusters, and a record of how the clusters' centroids changed through each assignment step. While speed is obviously sensitive to the choice of tolerance and the size of the data initially selected, I've found this formulation to converge quickly and reliably, even on larger data sets (~50,000 points).
Like I said before, I really wanted to keep this in the book since the code took a little while to write, but the use case just wasn't general enough for me to charge people for it. However, I imagine the audience on the Real Business Analytics blogs will get a kick out of this kind of problem.
Copyright © 2018