A Brief Comment on Clustering

If you are the jargon-oriented type, you might call clustering an unsupervised machine learning technique. It's especially tempting to do so since clustering, in contrast to classification, isn't trying to produce a result that is readily interpretable to humans. Rather than identifying and extending patterns in a training set to assign meaningful labels to new records (classification), clustering algorithms iteratively group observations, updating the groupings until some learning and similarity criteria are satisfied.

At first blush, any statistics major would tell you that the results obtained via clustering are not only random but dangerous: most clustering algorithms are complete, meaning they will assign every observation to a group regardless of the absolute effect this has on the groups' integrity (the algorithms minimize the relative harm done by a poor assignment). Your friendly neighborhood stats major is not entirely wrong. These algorithms usually need to pick random starting points before they begin updating (improving), so the final clustering can change from run to run. In some pathological cases (or, more likely, when an error has been made), the results can be entirely unstable from one run to the next. However, armed with cross-validation, a few metrics, and some empirical care, reliable clustering results should be achievable whenever natural clusters exist in the data.

What Does "Close" Mean?

Fair question. In all but the most textbook problems, there are big wrinkles in figuring out exactly how to measure distance. Got categorical variables? Ordinal variables? Did you discretize a continuous variable? In most applied work, the answer is "Yes, and I also have a bunch of NA's." This is far from a show-stopper, but it should be intuitive that it throws a wrench in the works if you're thinking in Euclidean (continuous) space. How far is the point {Green, Cat, 3} from {Steel, Man, 14}? We can't just arrange the values of categorical variables at random (not even in alphabetical order) along an axis and be off to the races; the chosen order would determine the distances the clustering algorithm uses. And what about interval length? Is the interval [Cat, Dog] shorter than [1, 4]?

The answer to these questions usually comes in the form of a distance or dissimilarity matrix. When an algorithm is given a set of points (a dataset, where a row is an observation of k columns or facets), the default behavior is generally to cluster with the goal of minimizing the sum of squared Euclidean distances between the points. However, [Cat, Dog], as you may have noticed, is not a point in continuous space. In these more general cases, we need to build dissimilarity matrices that capture how far apart observations are from one another, and we then need to alert the algorithm (like the ones in R's "cluster" package) that it should minimize the dissimilarity between observations rather than outright distance. The two are conceptually similar, but the algorithm needs to treat the inputs differently.
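To make that last distinction concrete, here is a minimal sketch (assuming the cluster package and a made-up numeric matrix; the data are illustrative, not from the original post) of handing an algorithm raw observations versus a precomputed dissimilarity object:

```r
library(cluster)

# Made-up numeric data: pam() can take the raw observations directly
# and will compute Euclidean distances internally.
x <- matrix(rnorm(40), ncol = 2)
fit_raw <- pam(x, k = 2)

# Alternatively, hand the algorithm a precomputed dissimilarity object
# and flag it as such; pam() then minimizes dissimilarity directly.
d <- dist(x, method = "manhattan")
fit_diss <- pam(d, k = 2, diss = TRUE)
```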
This article touches on the following distance metrics: Gower, Manhattan, Mahalanobis, Euclidean, and cosine similarity. There are, of course, many others, and the intuition provided here should give you what you need to dive comfortably into, say, the Jaccard-based family of metrics, of which cosine similarity is a member. I chose this collection to give a blend of things you can use right now and things that will help you build up your knowledge later. Almost all metrics you'll find in applied work are derivatives or family members of the types discussed here. Finally, this article will not go into depth (showing actual equations) on the calculations of most of these distance metrics, because modern statistical software packages will do the math once the user specifies a preference as a function argument or method parameter. For clustering purposes, it's sufficient to get a conceptual hold on what these distances mean.

Gower - A Generalized Manhattan Distance for Mixed Variable Types

Note: Gower distance, Gower similarity, Gower dissimilarity, and Gower coefficient can all be used to refer to the same metric. For other metrics, the relation Dissimilarity = (1 - Similarity) holds, but not in the Gower context: the Gower coefficient was specifically created to be a similarity metric that forms a positive semi-definite similarity matrix.

Usually, sophisticated distance measures would come at the end of the list, but because most analyses work with mixed variable types, I'm putting this one up front. The Gower dissimilarity matrix is a square matrix of Gower distances between observations. Standard software packages will do these calculations for you, which is convenient and ensures consistency across applications and analysts, but it comes at the risk of creating "grey boxes" inside the analysis packages built by your group or company. We'll get into exactly what a Gower dissimilarity matrix contains, but leave implementation-specific points alone.

Recall that the intuitive straight-line distance between points is of limited use in this context. Here, the "line" is passing through discrete space - it's bouncing between observations (or the probability thereof, in the case of categorical variables). Accordingly, a modified version of Manhattan distance is what we need.
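In practice, the per-variable pieces (range-scaled absolute differences for numeric columns, match/mismatch comparisons for categorical ones) are averaged for each pair of observations, and the software handles the bookkeeping. Here is a minimal sketch with R's cluster package, using a made-up mixed-type data frame (column names and values are illustrative):

```r
library(cluster)

# Made-up mixed-type data echoing the {Green, Cat, 3} example:
# a nominal color, a nominal species, and a numeric count with an NA.
obs <- data.frame(
  color   = factor(c("Green", "Steel", "Green")),
  species = factor(c("Cat", "Man", "Dog")),
  count   = c(3, 14, NA)
)

# daisy() builds the Gower dissimilarity matrix: numeric columns contribute
# range-scaled absolute differences (the Manhattan flavor), factor columns
# contribute match/mismatch comparisons, and missing values are dropped
# from the per-pair average.
gower_d <- daisy(obs, metric = "gower")
as.matrix(gower_d)
```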
"Dice coefficient?" Fear not - this is just the Jaccard similarity coefficient, re-weighted to reflect the fact that a change from any category to any other is a move of the same "distance". Centrality across the categorical variables is determined by how often their levels coincide across the individuals (in fact, the Dice coefficient can be calculated from a simple cross-tab). Also, an inherent issue with ordinal variables is repeated ranks across many individuals. While ties can be resolved with some algorithmic trickery, you should be aware that the Spearman rank correlation coefficient (frequently used as the measure of similarity in this context) breaks down when many ties exist. This is a real issue if many of your variables are categorical, and the advantage of using Kendall's Tau is explained in this page of Janos Podani's text. In R's "cluster" package, ranking is avoided altogether - the levels of ordinal variables are simply given integer codes (this is called "standard scoring"). Two final notes: 1) Even though Gower distances are unit-agnostic, the presence of outliers will have a large impact on distance calculations since they will effectively "push" other points together when the variables are re-scaled. Outliers must to be dealt with or removed in preprocessing. 2) By default, a Gower distance matrix weights differences across variables evenly - this may be inappropriate if, say, your dataset has important binary variables. Euclidean Distance - So, You're Here For Homework?Good old-fashioned, finite-dimensional, continuous space. In this case, no distance matrix is required - the observations themselves contain valid information about how far apart they are, and a clustering algorithm can take the data directly. However, even when working with model data, there are usually some adjustments to be made. Clustering algorithms are easily sidetracked by outliers - just like the more general case, these need to be processed out. Also, variables must be on the same scale for most clustering algorithms, especially the simple ones. K-Means, for instance, requires that clusters be "globular" - if the surface is stretched out by multiple scales, the clusters identified will be unreliable. In fact, these very drawbacks are the reason for the invention of Mahalanobis Distance. You may see a lot of resources and discussions on StackExchange mention that Mahalanobis Distance is actually about handling correlated data, and they are correct (technically correct - the best kind of correct). The matrix algebra used in the calculation of Mahalanobis distance effectively "divides out" the correlation among the variables. However, it is mathematically equivalent to simply say we are correcting stretched out (elliptical) clusters because linear trends in the space of any variables is the definition of correlation. Cosine Distance - Build your own Distance Matrix!This can be a head-scratcher if it isn't natural for you to think of variables as vectors in space. However, cosine similarity is related to the Jaccard similarity coefficient, the foundation for most classic document distance metrics, and it deserves to be mentioned here. Cosine similarity is, in fact, a special case of correlation. Recall the following from matrix algebra & geometry:
Cosine similarity has a useful geometric interpretation: if we treat two sets of word counts as vectors, we can use the angle formed between them to determine their similarity. However, we won't need any trigonometric trickery to find the coefficient (we don't even need the cosine function, so you can stop trying to remember the SOH-CAH-TOA rules). In fact, the coefficient is found by dividing each vector by its length and then multiplying the results together, which is exactly what the R sketch above does.

Cosine similarity lends itself well to document data because it works with length-normalized vectors rather than raw frequency counts, and 0-0 "matches" in the arrays contribute nothing, so they are effectively ignored. This is desirable: if two phrases or paragraphs have exactly 0 occurrences of a word in common, that isn't a positive match, and it certainly shouldn't inflate their computed commonality. In fact, you can confirm this by dropping the middle 0 from each array in the code above - the answer will be the same. Cosine similarities naturally fall on the interval [0, 1] for non-negative word counts, and outliers shouldn't be an issue because the scaling isn't a column-wise operation. There are many more modern techniques for capturing document distances that use syntax, semantics, and even thesauruses to assess how similar the meanings of words are; still, in every data mining class, cosine similarity will be the starting point for the topic of text mining.

Next - Hierarchical & Partitioning Methods

The following two parts of this article will cover hierarchical and partitioning clustering methods in turn, with coded examples.