By: Christopher Waldeck
An Example From Biology
Kingdom - Phylum - Class - Order - Family - Genus - Species
I don't know the technical definitions of these terms, but I know my 7th grade science teacher made us memorize them in this order, and I know that today is the first time this knowledge has been of any use.
Consider biological life as a generic data generation process with individual beings measured as a cross-section, each with many facets. Life (uh, finds a way) naturally occurs in nested subsets. Given sets A and B with subsets {a1, a2} and {b1, b2}, we should expect that facets of elements in a1 and a2 are more similar to each other than they are to those in b1 or b2. As in the biological classification scheme, these sets may be nested for several levels.
Hierarchies are common in data; in fact, this is how the first databases were organized. There are two general algorithmic strategies used to create these groups:
- Agglomerative: Start with individual elements, and group the closest two together. Repeat until all elements are contained within a single superset.
- Divisive: Start with one big superset. At each update step, divide a cluster into two based on the proximity of its members. Repeat until each element is in its own "cluster".
While agglomerative methods are more common, it should be noted that divisive algorithms more naturally lend themselves to parallel processing: A single thread or worker could continue to group Reptiles into subsets without needing to constantly measure their distance from subsets within the Mammal class.
Embarrassingly, after an entire post on the topic of distances between observations, I'm throwing around a notion of cluster "distance" that lacks clarity. In this context, the principal concern is how to compute distances between clusters as they are being agglomerated. In the illustration above, the blue circle shows two clusters that have just been merged. A few options exist for how to calculate this new cluster's distance from the remaining cluster:
- Maximum (Complete Linkage) - The distance between the farthest-apart elements of the new cluster and the remaining cluster, which equals the distance from the remaining cluster to whichever of the two merged clusters was previously farther
- Minimum (Single Linkage) - The distance between the closest elements of the new cluster and the remaining cluster, which equals the distance from the remaining cluster to whichever of the two merged clusters was previously closer
- Midpoint - The distance between the average, median, or other representative elements (commonly called "medoids" or "prototypes") of the new and remaining clusters
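Generally, a midpoint strategy provides the best trade-off between the chaining that single linkage tends to produce and complete linkage's sensitivity to outliers.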
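To make the three options concrete, here is a minimal sketch in Python (not the R used later in this post). The function names and the toy coordinates are my own assumptions for illustration, and the midpoint variant shown uses the plain centroid, which is only one of several possible "representative elements".

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_linkage(a, b):
    """Smallest pairwise distance between members of clusters a and b."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Largest pairwise distance between members of clusters a and b."""
    return max(dist(p, q) for p, q in product(a, b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def centroid_linkage(a, b):
    """Distance between the centroids of clusters a and b."""
    return dist(centroid(a), centroid(b))


# Hypothetical data: a freshly merged cluster and one remaining cluster.
merged = [(0.0, 0.0), (0.0, 2.0)]
remaining = [(3.0, 0.0), (5.0, 0.0)]

d_single = single_linkage(merged, remaining)
d_complete = complete_linkage(merged, remaining)
d_centroid = centroid_linkage(merged, remaining)
print(d_single, d_centroid, d_complete)
```

On this toy input the centroid distance falls between the single and complete linkage distances, which is the sense in which it acts as a compromise.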
For example: Imagine you are tasked with prioritizing houses for remediation after an environmental accident (call it a "spill") that affected a few points nearby. You start with spill points to initialize clustering. If you are using a minimum distance strategy, after the first update step (each spill is now grouped with the nearest house), the algorithm would prioritize cleaning a second house that is closer to the first house, even if another one is closer to the spill (but farther than the first house is from the second)!
General Algorithmic Strategy
Before Diving In
Under the hood, clustering algorithms are almost always "greedy algorithms", which means that they are making lots of locally optimal choices in the hopes of coming up with a good approximation of the globally optimal solution. If you're thinking that this language implies there is an objective function to optimize, you'd be right. It is possible, though not always empirically equivalent, to describe the goals of a given algorithmic strategy as an optimization problem. Implementation decisions like the choice of objective function and solution heuristics also drive developers' choices of default measurement and solution strategies. Before overriding the defaults, be sure to read and understand the documentation that accompanies your statistical software/package.
After the initial distance or dissimilarity calculations (strategies for this were covered in the previous post, but are briefly revisited here), agglomerative algorithms generally apply the Lance-Williams formula recursively to calculate distances between new clusters and points:
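In its standard textbook form (with symbols assumed to match the notation used in the discussion that follows), the update reads:

$$
d(ij, k) = \alpha_i \, d(i, k) + \alpha_j \, d(j, k) + \beta \, d(i, j) + \gamma \, \lvert d(i, k) - d(j, k) \rvert
$$

Here d(ij, k) is the distance from the newly merged cluster to any other cluster k, and the coefficients α, β, and γ are what distinguish one linkage strategy from another.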
The interpretation is straightforward: At each step, the optimal pair of clusters must be identified for merging according to some rule, and the distance between all the clusters must be recalculated for the next update step. Using the notation of the second formula, the distance is calculated for the newly combined cluster (i,j) (from the previously disjoint clusters, i and j) from each other disjoint cluster, k.
However, we have some choices to make as modelers:
- Is there something about, say, the cluster i that makes distance from it worse than another cluster?
- If the newly created ij cluster is spread out (d(i,j) is large), do we think we're in danger of creating a sink by accident? After all, large, sparse clusters will likely appear closer to many points, risking a chain reaction.
- Finally, what if the distance between former cluster i and current cluster k is significantly different from that of former cluster j and current cluster k? Should we punish or reward that heterogeneity?
As you may have guessed, the first bullet addresses the α parameters, the second addresses β, and the third can be used to address γ (I promise, that's a gamma, not a "y") or to aid some other distance calculation, as we'll see in the next section. It may seem you're doomed to waste many hours tweaking these parameters in vain (and with dubious results), but we're lucky enough to be able to stand on the shoulders of giants in 2017. There are several ready-made strategies to employ here:
Extreme Point Strategies: Referred to as "Single-Linkage" and "Complete-Linkage" for no discernible reason, these strategies apply the following weighting schemes:
Thus, each strategy halves and adds the distances between a third cluster and the two former clusters, then subtracts (single-linkage) or adds (complete-linkage) half the absolute difference between those two distances. If you're having trouble seeing the equivalence between this operation and simply taking the distance from the closest (or farthest) point of each cluster, see the illustration below:
The Lance-Williams formula was designed with generality in mind, and as a result, it is capable of expressing many strategies.
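As a quick sanity check on that generality, here is a small Python sketch (again, not the post's R code) of the update with the usual textbook coefficients for single and complete linkage: both α values at 1/2, β at 0, and γ at -1/2 or +1/2. The function name and the three input distances are hypothetical.

```python
def lance_williams(d_ik, d_jk, d_ij, alpha_i, alpha_j, beta, gamma):
    """Distance from the merged cluster (i, j) to another cluster k."""
    return (alpha_i * d_ik + alpha_j * d_jk
            + beta * d_ij + gamma * abs(d_ik - d_jk))


# Textbook coefficient sets for the two extreme point strategies.
SINGLE = dict(alpha_i=0.5, alpha_j=0.5, beta=0.0, gamma=-0.5)
COMPLETE = dict(alpha_i=0.5, alpha_j=0.5, beta=0.0, gamma=0.5)

# Hypothetical distances: i to k, j to k, and i to j.
d_ik, d_jk, d_ij = 3.0, 7.0, 2.0

single = lance_williams(d_ik, d_jk, d_ij, **SINGLE)
complete = lance_williams(d_ik, d_jk, d_ij, **COMPLETE)

# Half the sum minus (or plus) half the absolute difference is just
# the smaller (or larger) of the two distances.
print(single == min(d_ik, d_jk), complete == max(d_ik, d_jk))
```

The identity behind it is that (a + b)/2 - |a - b|/2 = min(a, b) and (a + b)/2 + |a - b|/2 = max(a, b), so no pairwise point distances need to be revisited after a merge.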
Midpoint Strategies: Not surprisingly, there have been many efforts to come up with midpoint measurements that result in stable, useful clusters. I've chosen the most useful and common ones to review here. For disambiguation, these strategies are broken into two additional groups: Implicit Centers and Explicit Centers. All averages imply some central tendency, but some strategies of merging two clusters will result in a new center that is directly derived from the old centers. I think these are worth calling out.
Implicit Centers: WPGMA (Weighted Pair Group Method with Arithmetic mean, obviously) and Simple Average may appear relatively naive strategies because they do not consider the size of the former cluster when computing a new midpoint; however, this feature can help an algorithm avoid "snowballing" due to local optima. Accordingly, these make good benchmarks and reality checks for clustering results.
Explicit Centers: It can (but won't) be shown that the Median method (also, confusingly, known as Gower's WPGMC) results in combining two clusters whose centers' dissimilarity is their squared absolute dissimilarity.
Ward's Method, on the other hand, has an interesting property: At each merge, it picks the pair of clusters whose union yields the smallest increase in total within-cluster variance. By extension, Ward's Method minimizes the sum of squared errors due to lack of fit at each step. While this is a useful property, it has been noted exhaustively on threads, boards, etc. that Ward's Method is exclusively valid over Euclidean space. Specifically, for the desirable properties of Ward's method to hold, the distances between the initial points must be proportional to the squared Euclidean distances between them.
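The squared Euclidean requirement can be seen in a tiny example. The Python sketch below (not from the original post) uses the commonly cited Lance-Williams coefficients for Ward's method, which depend on the cluster sizes, and compares the result with the centroid-based quantity it is supposed to track. The one-dimensional points and all names are assumptions made for illustration.

```python
from math import isclose


def ward_update(d_ik, d_jk, d_ij, n_i, n_j, n_k):
    """Lance-Williams update with Ward's size-dependent coefficients."""
    total = n_i + n_j + n_k
    return ((n_i + n_k) * d_ik + (n_j + n_k) * d_jk - n_k * d_ij) / total


def squared_distance(a, b):
    """Squared Euclidean distance between two 1-D points."""
    return (a - b) ** 2


# Three hypothetical singleton clusters on a line.
x_i, x_j, x_k = 0.0, 2.0, 5.0

# Route 1: feed squared Euclidean distances through the update.
via_update = ward_update(
    squared_distance(x_i, x_k),
    squared_distance(x_j, x_k),
    squared_distance(x_i, x_j),
    n_i=1, n_j=1, n_k=1,
)

# Route 2: compute the size-weighted squared distance between the
# centroid of the merged cluster (i, j) and the centroid of k.
centroid_ij = (x_i + x_j) / 2
n_ij, n_k = 2, 1
direct = 2 * n_ij * n_k / (n_ij + n_k) * squared_distance(centroid_ij, x_k)

print(isclose(via_update, direct))
```

The two routes agree only because the inputs are squared Euclidean distances; starting from some other dissimilarity, the update still produces numbers, but they no longer correspond to a within-cluster variance.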
Obviously, many dissimilarity metrics would create distances that violate this assumption. However, even deprived of Euclidean space, we don't need to quit. If you're interested in the math behind the conclusion in the Gower case, you can see page 6 of this paper. The Cliff's Notes version is this: It is possible to derive meaningful, metric distances from Gower dissimilarities. In R's cluster package, the distance calculations are the same as they are described here. Somewhat embarrassingly, the third party website (for different software) is much clearer in the steps actually performed.
Application
With R's built-in dataset "swiss", which has observations on several socioeconomic factors from different Swiss provinces, we can explore a hypothetical workflow.
Let's get the data set up (and satisfy my personal requirement that variable names not contain periods) by adding a little complexity: Below, I've discretized the variable Catholic and preserved the ranking hierarchy by making it an ordered factor. Lastly, because the row names contain the labels for the provinces, I ensure they are preserved (this code would strip the row names without the last line).
As you can tell, I'm using the cluster package to which I alluded earlier. The daisy function provides a friendly interface to several methods of calculating distances and dissimilarities. Its output can be consumed directly by actual clustering functions, such as agnes (agglomerative nesting), which performs hierarchical clustering according to the parameters supplied. The output confirms the equivalence I claimed earlier:
Supplying parameters to the Lance-Williams equation is the same as directly specifying how to group the clusters.
Let's take a peek at the Agglomerative Coefficient, a metric that shows the average similarity of the objects that were merged throughout the update steps. As you can see, the agglomerative coefficient is returned as part of the object created by a call to agnes. If you run this code, you'll find that AC = 0.999, which isn't as easy to interpret as some may make it appear. See, the AC formula is the average of all (1 - [D(First) / D(Last)]), where D(First) is the dissimilarity at which an element is first merged into a cluster and D(Last) is the dissimilarity of the last merge the algorithm does. As many have noted (because they are copying the documentation), the dissimilarity D(Last) tends to grow with the number of observations, which decreases the ratio and increases the coefficient.
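To see that growth in isolation, here is a rough Python sketch of the coefficient as described above. It is not the agnes implementation: it assumes one-dimensional points, uses a brute-force complete-linkage loop, and every name in it is hypothetical.

```python
from itertools import combinations


def complete_linkage(a, b, points):
    """Largest distance between members of clusters a and b (index tuples)."""
    return max(abs(points[p] - points[q]) for p in a for q in b)


def agglomerate(points):
    """Return the merges as a list of (height, cluster_a, cluster_b)."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: complete_linkage(pair[0], pair[1], points),
        )
        merges.append((complete_linkage(a, b, points), a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


def agglomerative_coefficient(points):
    """Average of 1 - D(First) / D(Last) over all observations."""
    merges = agglomerate(points)
    final_height = merges[-1][0]
    first_height = {}
    for height, a, b in merges:
        for cluster in (a, b):
            if len(cluster) == 1:  # an observation's first merge
                first_height[cluster[0]] = height
    return sum(1 - h / final_height for h in first_height.values()) / len(points)


paired = [0.0, 1.0, 10.0, 11.0]                # two tight pairs
evenly_spaced = [float(i) for i in range(32)]  # no cluster structure

ac_paired = agglomerative_coefficient(paired)
ac_even = agglomerative_coefficient(evenly_spaced)
print(round(ac_paired, 3), round(ac_even, 3))
```

On these toy inputs the two tight pairs score about 0.91, while the 32 evenly spaced points, which have no cluster structure at all, score about 0.97 simply because the final merge spans a longer range.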
With groups of heterogeneous size and spread, the AC can prove to be quite misleading, with quite high values even for fairly well-clustered data. Also, because AC is an average of non-normalized dissimilarities, even a single point can severely distort it. Caveat emptor.
Now, let's check out an underappreciated feature of the agnes output object, the merge matrix:
For the sake of any future code readers, I'm leaving in my own inline comments. The merge matrix output allows users to see the order in which observations and elements were merged. This can be extremely instructive for a postmortem analysis, and also serves to highlight how differently various clustering decisions affect the behavior of the algorithm.
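For readers who have not met this structure before, the Python sketch below builds a merge matrix for a handful of made-up one-dimensional points. It assumes the encoding documented for agnes and hclust objects in R (a negative entry is a single observation, numbered from 1, and a positive entry refers to the cluster formed at that earlier step), and it uses a brute-force complete-linkage loop rather than anything from the cluster package.

```python
from itertools import combinations


def merge_matrix(points):
    """Complete-linkage merges of 1-D points, encoded R-style."""

    def linkage(a, b):
        return max(abs(points[p] - points[q]) for p in a for q in b)

    # Each live cluster (a tuple of 0-based indices) maps to its label:
    # -(i + 1) for the singleton i, or the step at which it was formed.
    labels = {(i,): -(i + 1) for i in range(len(points))}
    rows = []
    while len(labels) > 1:
        a, b = min(combinations(labels, 2), key=lambda pair: linkage(*pair))
        rows.append((labels.pop(a), labels.pop(b)))
        labels[a + b] = len(rows)  # step numbers start at 1
    return rows


# Hypothetical observations.
example = [0.0, 1.0, 10.0, 11.0, 4.0]
for step, row in enumerate(merge_matrix(example), start=1):
    print(step, row)
```

Reading the rows in order replays the algorithm: which observations paired off first, when a straggler joined an existing cluster, and which two clusters were left for the final merge.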
Finally, some visualizations:
For my money, plotting the clustering against the principal components is the best way to get an intuitive understanding of how the clustering worked: Group heterogeneity, appropriate cluster number, and cluster distances can all be assessed from here.
I also include the code for generating banner plots and dendrograms (with highlighting!), but I find these to be not only less useful but downright misleading. At first blush, dendrograms appear to be a way to judge clustering performance, but this is not the case. The sizes of clusters when cutting the dendrogram at various points and the shape of the dendrogram might seem like logical things to check, but they are not valid ways to gauge algorithm performance or appropriateness of clustering. This is just the inexperienced researcher indulging confirmation bias, not testing a hypothesis.
For something closer to hypothesis testing for hierarchical clustering, see the tools in the pvclust package, which was designed to assess cluster stability and uncertainty.