REAL BUSINESS ANALYTICS


1/29/2017

K-Means Clustering Using T-SQL Geography

By: Christopher Waldeck

This article started its life as a section of my upcoming guide on working with Geography in SQL Server:

T-SQL Geography: Getting Off The Ground

Since that guide is intended, in part, as a code bank and reference for people new to the platform's geography functionality, it will be released on Amazon's Kindle platform to make the code easy to use day to day.

However, since the point of that guide is to cover the essentials of becoming an effective problem-solver with T-SQL Geography, I decided to pull this section from the book and put it here... as a completely shameless plug.

To use the code in this post, download the file below, import it into your geography database (or wherever), and execute the following code to create a geography column on the table. You can throw a spatial index on there if you like, but the code below isn't designed to utilize it.

I will name the table [dbo].[USZipCodes] and assume you have SSMS pointed to the relevant database so we can avoid three-part identifiers.
Download: zip.xlsx (1,959 KB)

Create GeographyPoint
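The code block embedded here didn't survive the export of this page, so the following is only a minimal sketch of what it might look like. It assumes the imported table exposes FLOAT columns named Latitude and Longitude (adjust the names to match your import); geography::Point takes latitude first, then longitude, and 4326 is the SRID for WGS 84.

-- Minimal sketch (not the original code): add and populate a geography column,
-- assuming FLOAT Latitude and Longitude columns from the zip.xlsx import.
ALTER TABLE dbo.USZipCodes
    ADD GeographyPoint geography NULL;
GO

UPDATE dbo.USZipCodes
SET GeographyPoint = geography::Point(Latitude, Longitude, 4326);  -- 4326 = WGS 84
GO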

    


While there are undoubtedly better platforms than SQL Server for k-means clustering, scenarios exist where it makes a lot of sense to do these calculations in a database. The most common case is probably an overnight procedure that could benefit from a clustering heuristic to cut down calculation times.

The traditional k-means algorithm, however, relies on features of Euclidean geometry. Because SQL Server's geography calculations happen on an ellipsoid, this is technically k-means on a curved surface. Since the points all sit on that surface and every distance is constrained to be the shortest arc across it (thanks, Microsoft!), we won't fall into any nonlinearity traps.

However, this solution is not intended for data that cross the 180th meridian, because it averages latitude and longitude to compute density-weighted centroids during the update phase (averaging longitudes of 179° and -179°, for example, gives 0° instead of ±180°). If your data cross that line, I recommend using a reference bounding box as the basis for calculating centrality.


The code here uses Forgy initialization: k points are chosen at random from the data at hand and used as the initial centroids.

It’s worth calling out that this method does not give a deterministic result. K-Means is sensitive to the random initialization positions in each run. It is imperative to test any heuristic method on your data to ensure reliability before including it in a workflow.  
k-Means Clustering: Step One
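The embedded code block is missing from this export as well, so here is only a sketch of how step one could look: shuffle the rows with random GUIDs and keep the first k as the initial centroids. The @k variable and the #Centroids temp table are placeholder names of my own.

-- Sketch only (not the original code). Assumes dbo.USZipCodes has Latitude,
-- Longitude, and the GeographyPoint column created above.
DECLARE @k INT = 10;  -- number of clusters to seed (placeholder value)

IF OBJECT_ID('tempdb..#Centroids') IS NOT NULL DROP TABLE #Centroids;

-- Forgy-style seeding: rank the rows in a random order and keep the first @k.
SELECT ClusterID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       seed.Latitude,
       seed.Longitude,
       Centroid = seed.GeographyPoint
INTO #Centroids
FROM (
    SELECT TOP (@k) Latitude, Longitude, GeographyPoint
    FROM dbo.USZipCodes
    ORDER BY ROW_NUMBER() OVER (ORDER BY NEWID())
) AS seed;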

    
You'll notice a lot of window functions in this code. This is because the geography data type is not comparable: it can't be used in a GROUP BY clause or tested for equality, so aggregation has to be handled with window functions instead.

Also, I want to call out ORDER BY ROW_NUMBER() OVER (ORDER BY NEWID()), which is one of my favorite constructions in T-SQL. It lets you randomly order any data by first assigning a random GUID to each row and then ordering by it. The best part is that the GUID is gone by the end of the query since it isn't selected - an efficient way of handling a problem SQL isn't really built to handle.
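As an isolated illustration, the same construction can pull a random sample of rows (the @SampleSize variable and the ZipCode column name below are placeholders of mine):

-- Hypothetical example: grab a random sample of rows from the zip table.
DECLARE @SampleSize INT = 5;

SELECT TOP (@SampleSize) ZipCode, Latitude, Longitude
FROM dbo.USZipCodes
ORDER BY ROW_NUMBER() OVER (ORDER BY NEWID());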
Step Two: Loop and Select Results
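The original block is missing here too, so what follows is only a sketch of the loop described in the text: assign each point to its nearest centroid with STDistance, log the centroid positions, recompute each centroid from the mean latitude and longitude of its members, and stop once the total distance stops improving by more than a tolerance. The names #Clusters, @PreviousScore, and the miles conversion echo the comments below, but the implementation is mine; #Centroids carries over from the step-one sketch, and ZipCode is assumed to be a unique key in dbo.USZipCodes.

-- Sketch only (not the original code).
DECLARE @ToleranceInMiles FLOAT = 1.0;         -- stop when the score improves by less than this
DECLARE @MetersToMiles    FLOAT = 0.000621371; -- STDistance returns meters
DECLARE @PreviousScore    FLOAT = NULL;
DECLARE @CurrentScore     FLOAT;
DECLARE @Iteration        INT   = 0;

IF OBJECT_ID('tempdb..#Clusters') IS NOT NULL DROP TABLE #Clusters;
CREATE TABLE #Clusters (ZipCode VARCHAR(10), ClusterID BIGINT, DistanceMiles FLOAT);

IF OBJECT_ID('tempdb..#History') IS NOT NULL DROP TABLE #History;
CREATE TABLE #History (Iteration INT, ClusterID BIGINT, Latitude FLOAT, Longitude FLOAT);

WHILE 1 = 1
BEGIN
    SET @Iteration += 1;

    -- Assignment step: each point joins its nearest centroid.
    TRUNCATE TABLE #Clusters;
    INSERT INTO #Clusters (ZipCode, ClusterID, DistanceMiles)
    SELECT z.ZipCode, nearest.ClusterID, nearest.DistanceMiles
    FROM dbo.USZipCodes AS z
    CROSS APPLY (
        SELECT TOP (1) c.ClusterID,
               DistanceMiles = z.GeographyPoint.STDistance(c.Centroid) * @MetersToMiles
        FROM #Centroids AS c
        ORDER BY z.GeographyPoint.STDistance(c.Centroid)
    ) AS nearest;

    -- Convert to miles only once, here.
    SELECT @CurrentScore = SUM(DistanceMiles) FROM #Clusters;

    -- Record where the centroids sat during this iteration.
    INSERT INTO #History (Iteration, ClusterID, Latitude, Longitude)
    SELECT @Iteration, ClusterID, Latitude, Longitude FROM #Centroids;

    -- Stop when the score stops improving by more than the tolerance
    -- (absolute value, so a small increase also ends the loop), or after a safety cap.
    IF (@PreviousScore IS NOT NULL AND ABS(@PreviousScore - @CurrentScore) < @ToleranceInMiles)
       OR @Iteration >= 100
        BREAK;
    SET @PreviousScore = @CurrentScore;

    -- Update step: move each centroid to the mean position of its members.
    UPDATE c
    SET c.Latitude  = m.MeanLat,
        c.Longitude = m.MeanLong,
        c.Centroid  = geography::Point(m.MeanLat, m.MeanLong, 4326)
    FROM #Centroids AS c
    JOIN (
        SELECT cl.ClusterID,
               MeanLat  = AVG(z.Latitude),
               MeanLong = AVG(z.Longitude)
        FROM #Clusters AS cl
        JOIN dbo.USZipCodes AS z ON z.ZipCode = cl.ZipCode
        GROUP BY cl.ClusterID
    ) AS m ON m.ClusterID = c.ClusterID;
END;

-- Final assignments, final centroids, and the centroid history.
SELECT * FROM #Clusters;
SELECT * FROM #Centroids;
SELECT * FROM #History ORDER BY Iteration, ClusterID;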

    

Wrap Up

There it is: you're left with the final assignment of each point to a cluster, the final set of clusters, and a record showing how the clusters' centroids changed through each assignment step. While speed is obviously sensitive to the choice of tolerance and the size of the data initially selected, I've found this formulation converges quickly and reliably even on larger sets of data (~50,000 points).

Like I said before, I really wanted to keep this in the book since the code took a little while to write, but the use case just wasn't general enough for me to charge people for it. However, I imagine the audience of the Real Business Analytics blog will get a kick out of this kind of problem.



4 Comments
pbord
7/20/2019 03:40:43 pm

Seems as though there is a bug with the iterationScore update: if you want to do iterationScore_t+1 = sumDistance_t+1 - iterationScore_t, you should use the absolute value. For example, if iterationScore is 500 and the next sum is 400, the new iterationScore is -100, which would break the while loop even though the improvement is not in (0, toleranceInMiles). Also, you multiply twice by the meters-to-miles conversion factor: first when selecting Distance to insert into #Clusters, and then in the SELECT SUM(Distance) when getting the new iterationScore.

Christopher Waldeck
7/21/2019 01:27:08 pm

Nice catches, pbord! The code has been corrected.

brod
9/25/2019 01:15:10 pm

I tried running this example as is, and it seems to be stuck in an infinite loop. The variable @PreviousScore is never used; could that be the cause?

tokisada
11/4/2020 02:49:54 pm

Same here, the example seems to be stuck in an infinite loop. :/

Can anybody help?




