Do The Math
An Example From Biology

Kingdom - Phylum - Class - Order - Family - Genus - Species

I don't know the technical definitions of these terms, but I know my 7th grade science teacher made us memorize them in this order, and I know that today is the first time this knowledge has been of any use. Consider biological life as a generic data generation process with individual beings measured as a cross-section, each with many facets. Life (uh, finds a way) naturally occurs in nested subsets. Given sets A and B with subsets {a1, a2} and {b1, b2}, we should expect that facets of elements in a1 and a2 are more similar to each other than they are to those in b1 or b2. As in the biological classification scheme, these sets may be nested for several levels. Hierarchies are common in data; in fact, this is how the first databases were organized. There are two general algorithmic strategies used to create these groups:

- Agglomerative (bottom-up): Each observation starts as its own cluster, and the closest clusters are merged step by step until everything belongs to a single hierarchy.
- Divisive (top-down): All observations start in one cluster, which is recursively split into smaller and smaller subsets.
While agglomerative methods are more common, it should be noted that divisive algorithms more naturally lend themselves to parallel processing: a single thread or worker could continue to split Reptiles into subsets without needing to constantly measure their distance from subsets within the Mammal class. Embarrassingly, after an entire post on the topic of distances between observations, I'm throwing around a notion of cluster "distance" that lacks clarity. In this context, the principal concern is how to compute distances between clusters as they are being agglomerated. In the illustration above, the blue circle shows two clusters that have just been merged. A few options exist for how to calculate this new cluster's distance from the remaining clusters:

- Minimum distance: the distance between the closest pair of points, one drawn from each cluster
- Maximum distance: the distance between the farthest pair of points, one drawn from each cluster
- Midpoint distance: the distance between the clusters' centers (or an average over all pairs of points)
Generally, a midpoint strategy provides the best trade-off. For example: Imagine you are tasked with prioritizing houses for remediation after an environmental accident (call it a "spill") that affected a few points nearby. You start with the spill points to initialize clustering. If you are using a minimum distance strategy, after the first update step (each spill is now grouped with the nearest house), the algorithm would prioritize cleaning a second house that is closest to the first house, even if another house is closer to the spill itself (but farther from the first house)!

General Algorithmic Strategy

Before Diving In

Under the hood, clustering algorithms are almost always "greedy algorithms", which means that they are making lots of locally optimal choices in the hopes of coming up with a good approximation of the globally optimal solution. If you're thinking that this language implies there is an objective function to optimize, you'd be right. It is possible, though the greedy result is not always equivalent to the exact solution, to describe the goals of a given algorithmic strategy as an optimization problem. Implementation decisions like the choice of objective function and solution heuristics also drive developers' choices of default measurement and solution strategies. Before overriding the defaults, be sure to read and understand the documentation that accompanies your statistical software/package.

After the initial distance or dissimilarity calculations (strategies for this are covered in the previous post, but briefly revisited here), agglomerative algorithms generally use the Lance-Williams formula recursively to calculate distances between newly formed clusters and the remaining clusters:

d((i,j), k) = α_i d(i, k) + α_j d(j, k) + β d(i, j) + γ |d(i, k) - d(j, k)|

The interpretation is straightforward: At each step, the optimal pair of clusters must be identified for merging according to some rule, and the distances between all the clusters must be recalculated for the next update step. Using this notation, the distance is calculated from the newly combined cluster (i,j) (formed from the previously disjoint clusters i and j) to each other disjoint cluster, k. However, we have some choices to make as modelers:

- How much weight should each of the former clusters' distances to the other clusters carry?
- Should the distance between the two clusters that were just merged play a role?
- Should the difference between the former clusters' distances to the other clusters be taken into account?
As you may have guessed, the first bullet addresses the α parameters, the second addresses the β, and the third can be used to address γ (I promise, that's a gamma, not a "y") or to aid some other distance calculation, as we'll see in the next section. It may seem you're doomed to waste many hours tweaking these parameters in vain (and with dubious results), but we're lucky enough to be able to stand on the shoulders of giants in 2017 - there are several ready-made strategies to employ here:

Extreme Point Strategies: Referred to as "Single-Linkage" and "Complete-Linkage" for no discernible reason, these strategies apply the weighting scheme α_i = α_j = 1/2 and β = 0, with γ = -1/2 for single-linkage and γ = +1/2 for complete-linkage. Thus, the new distance to a third cluster is half the sum of the two former clusters' distances to it, minus (single-linkage) or plus (complete-linkage) half the absolute difference between those two distances - which works out to the minimum or maximum of the two, respectively. If you're having trouble seeing the equivalence between this operation and simply taking the distance from the closest (or farthest) point of each cluster, see the illustration below. The Lance-Williams formula was designed with generality in mind, and as a result, it is capable of expressing many strategies.

Midpoint Strategies: Not surprisingly, there have been many efforts to come up with midpoint measurements that result in stable, useful clusters. I've chosen the most useful and common ones to review here. For disambiguation, these strategies are broken into two additional groups: Implicit Centers and Explicit Centers. All averages imply some central tendency, but some strategies of merging two clusters will result in a new center that is directly derived from the old centers. I think these are worth calling out.

Implicit Centers: WPGMA (Weighted Pair Group Method with Arithmetic mean, obviously) and Simple Average may appear to be relatively naive strategies because they do not consider the sizes of the former clusters when computing a new midpoint; however, this feature can help an algorithm avoid "snowballing" due to local optima. Accordingly, these make good benchmarks and reality checks for clustering results.

Explicit Centers: It can (but won't) be shown that the Median method (also, confusingly, known as Gower's WPGMC) places the new cluster's center at the midpoint of the two former centers, provided the input dissimilarities are squared Euclidean distances. Ward's Method, on the other hand, has an interesting property: It minimizes the total within-cluster variance. By extension, Ward's Method minimizes the sum of squared errors due to lack of fit. While this is a useful property, it has been noted exhaustively on threads, boards, etc. that Ward's Method is only valid over Euclidean space. Specifically, for the desirable properties of Ward's method to hold, the distances between the initial points must be proportional to the squared Euclidean distances between them. Obviously, many dissimilarity metrics would create distances that violate this assumption. However, even deprived of Euclidean space, we don't need to quit. If you're interested in the math behind the conclusion in the Gower case, you can see page 6 of this paper. The Cliff's Notes version is this: It is possible to derive meaningful, metric distances from Gower dissimilarities. In R's cluster package, the distance calculations are the same as they are described here. Somewhat embarrassingly, the third party website (for different software) is much clearer in the steps actually performed.
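To see how much the linkage choice matters in practice, here is a small, self-contained sketch. The toy data, the two-cluster cut, and the use of base R's hclust (rather than the cluster-package functions that appear later in this post) are all just for illustration:

```r
# Toy data: two blobs in 2-D, purely for illustration
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 4), ncol = 2)
)

d <- dist(toy)  # plain Euclidean distances

# The same dissimilarities run through four inter-cluster distance rules
fits <- list(
  single   = hclust(d, method = "single"),    # extreme point: closest pair
  complete = hclust(d, method = "complete"),  # extreme point: farthest pair
  average  = hclust(d, method = "average"),   # midpoint via averaging
  ward     = hclust(d, method = "ward.D2")    # Ward: minimizes within-cluster variance
)

# Compare the two-cluster solutions each linkage rule produces
sapply(fits, function(fit) table(cutree(fit, k = 2)))
```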
Application

With R's built-in dataset "swiss", which has observations on several socioeconomic factors from different towns, we can explore a hypothetical workflow. Let's get the data set up (and satisfy my personal requirement that variable names not contain periods) by adding a little complexity: Below, I've discretized the variable Catholic and preserved the rankings hierarchy by making it an ordered factor. Lastly, because the row names contain the labels for the cities, I ensure they are preserved (this code would strip the row names without the last line).

Getting Started
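Here is a minimal sketch of the setup described above. The cut points and labels used to discretize Catholic, and the "average" linkage passed to agnes, are assumptions made for illustration rather than the original analysis choices:

```r
library(cluster)

data(swiss)
swiss_clean <- swiss

# Enforce the no-periods-in-names rule (Infant.Mortality -> InfantMortality)
names(swiss_clean) <- gsub(".", "", names(swiss_clean), fixed = TRUE)

# Discretize Catholic and preserve the ranking by making it an ordered factor
# (the cut points and labels here are illustrative assumptions)
swiss_clean$Catholic <- cut(swiss_clean$Catholic,
                            breaks = c(0, 25, 50, 75, 100),
                            labels = c("Low", "Medium", "High", "VeryHigh"),
                            include.lowest = TRUE,
                            ordered_result = TRUE)

# The town labels live in the row names; make sure they survive
rownames(swiss_clean) <- rownames(swiss)

# Gower dissimilarities handle the mixed numeric / ordered-factor columns
swiss_diss <- daisy(swiss_clean, metric = "gower")

# Agglomerative nesting; the "average" linkage here is an assumption
swiss_agnes <- agnes(swiss_diss, diss = TRUE, method = "average")
```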
As you can tell, I'm using the cluster package to which I alluded earlier. The daisy function provides a friendly interface to several methods of calculating distances and dissimilarities. Its output can be consumed directly by actual clustering functions, such as agnes (agglomerative nesting), which performs hierarchical clustering according to the parameters supplied.

Let's take a peek at the Agglomerative Coefficient (AC), a metric that summarizes the similarity of the objects that were merged throughout the update steps. As you can see, the agglomerative coefficient is returned as part of the object created by a call to agnes. If you run this code, you'll find that this AC = .999, which isn't as easy to interpret as some may make it appear. See, the AC formula is the average of all (1 - [D(First) / D(Last)]), where D(First) is the dissimilarity of the first cluster an element is merged with and D(Last) is the dissimilarity of the final merge the algorithm performs. As many have noted (because they are copying the documentation), the dissimilarity D(Last) tends to grow with the number of observations, which decreases the ratio and increases the coefficient. With groups of heterogeneous size and spread, the AC can prove to be quite misleading, with quite high values even for data with little real cluster structure. Also, because the AC is an average of non-normalized dissimilarities, even a single point can severely distort it. Caveat emptor.

Now, let's check out an underappreciated feature of the agnes output object - the merge matrix. For the sake of any future code readers, I'm leaving in my own inline comments. The merge matrix allows users to see the order in which observations and clusters were merged. This can be extremely instructive for a postmortem analysis, and it also serves to highlight how differently various clustering decisions affect the behavior of the algorithm.

Finally, some visualizations: For my money, plotting the clustering against the principal variance components is the best way to get an intuitive understanding of how the clustering worked: Group heterogeneity, appropriate cluster number, and cluster distances can all be assessed from here. I also include the code for generating banner plots and dendrograms (with highlighting!), but I find these to be not only less useful but downright misleading. At first blush, dendrograms appear to be a way to judge clustering performance, but this is not the case. The sizes of clusters when cutting the dendrogram at various points and the shape of the dendrogram might seem like logical things to check, but they are not valid ways to gauge algorithm performance or appropriateness of clustering. This is just the inexperienced researcher hunting for confirmation, not testing hypotheses. For something closer to hypothesis testing for hierarchical clustering, see the tools in the pvclust package, which was designed to assess cluster stability and uncertainty.
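Continuing from the sketch above, the pieces walked through in this section might look roughly like this; the three-cluster cut used for the plots is an arbitrary, illustrative choice:

```r
# Agglomerative coefficient (interpret with the caveats above)
swiss_agnes$ac

# The merge matrix: row i describes the i-th merge.
# Negative entries are single observations (their row numbers in the data);
# positive entries point back to the cluster formed at that earlier merge step.
head(swiss_agnes$merge)

# Cut the tree into three groups (an arbitrary, illustrative choice)
swiss_hc <- as.hclust(swiss_agnes)
groups   <- cutree(swiss_hc, k = 3)

# Plot the assignments against the first two principal variance components
pcs <- prcomp(swiss, scale. = TRUE)$x[, 1:2]
plot(pcs, col = groups, pch = 19,
     main = "Clusters vs. principal components")
text(pcs, labels = rownames(swiss), pos = 3, cex = 0.6)

# Banner plot and dendrogram with a highlighted cut (see the caveats above)
plot(swiss_agnes, which.plots = 1)            # banner plot
plot(swiss_hc)                                # dendrogram
rect.hclust(swiss_hc, k = 3, border = "red")  # highlight the three-cluster cut
```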
A Brief Comment on Clustering

If you were the jargon-oriented type, you might call clustering an unsupervised machine learning technique. It's especially tempting to do so since clustering, in contrast to classification, isn't trying to produce a result that is readily interpretable to humans. Rather than identifying and extending patterns in a training set to assign meaningful labels to new records (classification), clustering algorithms iteratively group observations, updating the groupings until some learning and similarity criteria are satisfied. At first blush, any statistics major would tell you that the results obtained via clustering are not only random but dangerous - most clustering algorithms are complete, meaning they will assign every observation to a group, regardless of the absolute effect this has on the groups' integrity (the algorithms minimize the relative harm done by a poor assignment). Your friendly neighborhood stats major is not entirely wrong. These algorithms usually need to pick random points to begin updating (improving). As a result, the final states of clustering can change from run to run. In some pathological cases (or, more likely, when an error has been made), they can be entirely unstable from one run to the next. However, armed with cross-validation, a few metrics, and some empirical care, reliable clustering results should be achievable when natural clusters exist in the data.

What Does "Close" Mean?

Fair question. In all but the most textbook problems, there are usually big wrinkles in figuring out exactly how to measure distance. Got categorical variables? Ordinal variables? Did you discretize a continuous variable? In most applied work, the answer is "Yes, and I also have a bunch of NA's." This is far from being a show-stopper, but it should be pretty intuitive that it throws a wrench in the works if you're thinking in Euclidean (continuous) space. How far is the point {Green, Cat, 3} from {Steel, Man, 14}? We can't just arrange the values of categorical variables at random (not even in alphabetical order) across an axis and be off to the races; the chosen order determines the distance that will be used by the clustering algorithm. And what about the interval length? Is the interval [Cat, Dog] shorter than [1, 4]?

The answer to this question usually comes in the form of a distance or dissimilarity matrix. When an algorithm is given a set of points (a dataset, where a row is an observation of k columns or facets), the default behavior is generally to cluster with the goal of minimizing the sum of squared Euclidean distances between the points. However, [Cat, Dog], as you may have noticed, is not a point in continuous space. In these more general cases, we need to build dissimilarity matrices that capture how far apart these observations are from one another. We then need to alert the algorithm (like the ones in R's "cluster" package) that it should minimize the dissimilarity between observations rather than outright distance. These are conceptually similar, but the algorithm needs to treat the inputs differently.
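To make the {Green, Cat, 3} example concrete, here is a tiny sketch (entirely made-up data) of how coercing categories to arbitrary integer codes manufactures distances, while a dissimilarity matrix - here Gower, via the daisy function discussed below - treats every category mismatch the same way:

```r
library(cluster)

# Two made-up observations with categorical and numeric facets
obs <- data.frame(
  color = factor(c("Green", "Steel")),
  kind  = factor(c("Cat", "Man")),
  size  = c(3, 14)
)

# Naive approach: coerce the factors to their integer codes and take the
# Euclidean distance - the result now depends entirely on the arbitrary
# ordering of the levels
dist(data.frame(lapply(obs, as.numeric)))

# Better: build a dissimilarity matrix (Gower) in which a category mismatch
# contributes the same amount no matter which two levels are involved
daisy(obs, metric = "gower")
```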
This article touches on the following distance metrics: Gower, Manhattan, Mahalanobis, Euclidean, and Cosine similarity. There are, of course, many others, and the intuition provided here should give you what you need to be comfortable enough to dive into, say, the Jaccard-based family of metrics, to which Cosine Similarity is closely related. I chose this collection to give a blend of stuff you can use right now and things that will help you build up your knowledge later. Almost all metrics you'll find in applied work are derivatives or family members of the types discussed here. Finally, this article will not go into depth (showing actual equations) on the calculations of most of these distance metrics because modern statistical software packages will do the math after the user specifies a preference as a function argument or method parameter. For clustering purposes, it's sufficient to get a conceptual hold on what these distances mean.

Gower - Generalized Manhattan Distances For Mixed Variable Types

Note: Gower distance, Gower similarity, Gower dissimilarity, and Gower coefficient can all be used to refer to the same metric. For other metrics, the relation Dissimilarity = (1 - Similarity) holds, but not in the Gower context. The Gower coefficient was specifically created to be a metric of similarity that forms a positive semi-definite similarity matrix.

Usually, sophisticated distance measures would be at the end of the list. However, because most analyses work with mixed variable types, I'm putting this up front. The Gower dissimilarity matrix is a square matrix of Gower distances between observations. Standard software packages will do these calculations for you, which is convenient and ensures consistency across applications and analysts on one hand, but comes at the risk of creating "grey boxes" within the analysis packages created by your group or company. We'll get into exactly what a Gower dissimilarity matrix contains, but leave implementation-specific points alone. Recall that the intuitive linear distance between points is of limited use in this context. In this case, the "line" is passing through discrete space - it's bouncing between observations (or the probability thereof, in the case of categorical variables). Accordingly, a modified version of Manhattan distance is what we need.
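In symbols, the Gower dissimilarity between observations i and j is a weighted average of per-variable contributions; a sketch of the usual form (the exact treatment of binary variables varies across references) is: d(i, j) = Σ_k w_k d_k(i, j) / Σ_k w_k, where d_k(i, j) = |x_ik - x_jk| / R_k for a numeric variable k with range R_k (a range-normalized Manhattan term), and d_k(i, j) is 0 or 1 for a categorical variable according to whether the two categories match - the matching piece that connects Gower to the Dice and Jaccard coefficients discussed next.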
"Dice coefficient?" Fear not - this is just the Jaccard similarity coefficient, re-weighted to reflect the fact that a change from any category to any other is a move of the same "distance". Centrality across the categorical variables is determined by how often their levels coincide across the individuals (in fact, the Dice coefficient can be calculated from a simple cross-tab). Also, an inherent issue with ordinal variables is repeated ranks across many individuals. While ties can be resolved with some algorithmic trickery, you should be aware that the Spearman rank correlation coefficient (frequently used as the measure of similarity in this context) breaks down when many ties exist. This is a real issue if many of your variables are categorical, and the advantage of using Kendall's Tau is explained in this page of Janos Podani's text. In R's "cluster" package, ranking is avoided altogether - the levels of ordinal variables are simply given integer codes (this is called "standard scoring"). Two final notes: 1) Even though Gower distances are unit-agnostic, the presence of outliers will have a large impact on distance calculations since they will effectively "push" other points together when the variables are re-scaled. Outliers must to be dealt with or removed in preprocessing. 2) By default, a Gower distance matrix weights differences across variables evenly - this may be inappropriate if, say, your dataset has important binary variables. Euclidean Distance - So, You're Here For Homework?Good old-fashioned, finite-dimensional, continuous space. In this case, no distance matrix is required - the observations themselves contain valid information about how far apart they are, and a clustering algorithm can take the data directly. However, even when working with model data, there are usually some adjustments to be made. Clustering algorithms are easily sidetracked by outliers - just like the more general case, these need to be processed out. Also, variables must be on the same scale for most clustering algorithms, especially the simple ones. K-Means, for instance, requires that clusters be "globular" - if the surface is stretched out by multiple scales, the clusters identified will be unreliable. In fact, these very drawbacks are the reason for the invention of Mahalanobis Distance. You may see a lot of resources and discussions on StackExchange mention that Mahalanobis Distance is actually about handling correlated data, and they are correct (technically correct - the best kind of correct). The matrix algebra used in the calculation of Mahalanobis distance effectively "divides out" the correlation among the variables. However, it is mathematically equivalent to simply say we are correcting stretched out (elliptical) clusters because linear trends in the space of any variables is the definition of correlation. Cosine Distance - Build your own Distance Matrix!This can be a head-scratcher if it isn't natural for you to think of variables as vectors in space. However, cosine similarity is related to the Jaccard similarity coefficient, the foundation for most classic document distance metrics, and it deserves to be mentioned here. Cosine similarity is, in fact, a special case of correlation. Recall the following from matrix algebra & geometry:
Cosine similarity has a useful geometric interpretation: If we treat two sets of word counts as vectors, then we can use the angle formed between them to determine their similarity. However, we won't need a bunch of trigonometry trickery to find the coefficient (we don't even need the cosine function, so you can stop trying to remember the SOH-CAH-TOA rules). In fact, the coefficient is found by dividing each vector by its length and then multiplying them together. To make this concrete, check out the R sketch at the end of this section. Cosine similarity lends itself well to document data because it normalizes by vector length rather than comparing raw frequency counts, and 0-0 "matches" in the arrays are ignored. This is desirable because, if phrases or paragraphs have exactly 0 words in common, that isn't a positive match, and it certainly shouldn't inflate their computed commonality. In fact, you can confirm this by dropping the middle 0 from each vector in the sketch - the answer will be the same. Cosine similarities between word-count vectors naturally fall on the interval [0, 1] (they are, in effect, uncentered correlations of nonnegative data), and outliers shouldn't be an issue because the scaling isn't a column-wise operation. There are many more modern techniques for capturing document distances that utilize syntax, semantics, and even access thesauruses to assess how similar the meanings of the words are. However, in every data mining class, cosine similarity will be the starting point for the topic of text mining.

Next - Hierarchical & Partitioning Methods

The following two parts of this article will cover Hierarchical and Partitioning clustering methods in turn, with coded examples.
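A minimal sketch of the cosine calculation described above; the word-count vectors are made up, with a 0 shared in the middle position:

```r
# Two made-up word-count vectors; note the 0 shared in the middle position
doc_a <- c(2, 1, 0, 3, 4)
doc_b <- c(1, 3, 0, 1, 2)

# Divide each vector by its length, then multiply them together (dot product)
cosine_sim <- sum((doc_a / sqrt(sum(doc_a^2))) * (doc_b / sqrt(sum(doc_b^2))))
cosine_sim

# Dropping the shared-zero position leaves the answer unchanged
sum((doc_a[-3] / sqrt(sum(doc_a[-3]^2))) * (doc_b[-3] / sqrt(sum(doc_b[-3]^2))))
```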
Author

Christopher Waldeck believes he can eradicate the misuse of the term "Big Data". Others are not so sure.