## Doing The Math |

## An Example From BiologyKingdom - Phylum - Class - Order - Family - Genus - Species I don't know the technical definitions of these terms, but I know my 7th grade science teacher made us memorize them in this order, and I know that today is the first time this knowledge has been of any use. Consider biological life as a generic data generation process with individual beings measured as a cross-section, each with many facets. Life (uh, finds a way) naturally occurs in nested subsets. Given sets A and B with subsets {a1, a2} and {b1, b2}, we should expect that facets of elements in a1 and a2 are more similar than they are to those in b1 or b2. As in the biological classification scheme, these sets may be nested for several levels. Hierarchies are common in data, in fact, this is how the first databases were organized. There are two general algorithmic strategies used to create these groups: - Agglomerative: Start with individual elements, and group the closest two together. Repeat until all elements are contained within a single superset.
- Divisive: Start with one big superset. At each update step, divide a cluster into two based on the proximity of its members. Repeat until each element is in its own "cluster".
While agglomerative methods are more common, it should be noted that divisive algorithms more naturally lend themselves to parallel processing: A single thread or worker could continue to group Reptiles into subsets without needing to constantly measure their distance from subsets within the Mammal kingdom. Embarrassingly, after an entire post on the topic of distances between observations, I'm throwing around a notion of cluster "distance" that lacks clarity. In this context, the principle concern is how to compute distances between clusters as they are being agglomerated. In the illustration above, the blue circle shows two clusters that have just been merged. A few options exist for how to calculate this new cluster's distance from the remaining cluster:- Maximum (Complete Linkage) - The distance between the closest elements of the previously farther cluster and the remaining cluster
- Minimum (Single Linkage) - The distance between the closest elements of the previously closer cluster and the remaining cluster
- Midpoint - The distance between the average, median, or other representative elements (commonly called "medoids" or "prototypes") of the new and remaining cluster
Generally, a midpoint strategy provides the best trade-off. For example: Imagine you are tasked with prioritizing houses for remediation after an environmental accident (call it a "spill") that effected a few points nearby. You start with spill points to initialize clustering. If you are using a minimum distance strategy, after the first update step (each spill is now grouped with the nearest house), the algorithm would prioritize cleaning a second house that is closer to the first house, even if another one is closer to the spill (but farther than the first house is from the second)! |
## Author
## Archives
December 2017
## Categories |

Copyright © 2018