REAL BUSINESS ANALYTICS

12/30/2017

Clustering Methods: Part Two - Hierarchical Clustering

By: Christopher Waldeck

An Example From Biology

Kingdom - Phylum - Class - Order - Family - Genus - Species
I don't know the technical definitions of these terms, but I know my 7th grade science teacher made us memorize them in this order, and I know that today is the first time this knowledge has been of any use.

​Consider biological life as a generic data generation process with individual beings measured as a cross-section, each with many facets. Life (uh, finds a way) naturally occurs in nested subsets. Given sets A and B with subsets {a1, a2} and {b1, b2}, we should expect that facets of elements in a1 and a2 are more similar than they are to those in b1 or b2. As in the biological classification scheme, these sets may be nested for several levels.
Hierarchies are common in data; in fact, this is how the first databases were organized. There are two general algorithmic strategies used to create these groups:
  • Agglomerative: Start with individual elements, and group the closest two together. Repeat until all elements are contained within a single superset.
  • Divisive: Start with one big superset. At each update step, divide a cluster into two based on the proximity of its members. Repeat until each element is in its own "cluster".​
While agglomerative methods are more common, it should be noted that divisive algorithms more naturally lend themselves to parallel processing: a single thread or worker could continue to split Reptiles into subsets without needing to constantly measure their distance from subsets of Mammals.
[Illustration: two clusters that have just been merged (circled in blue) alongside a remaining cluster]
Embarrassingly, after an entire post on the topic of distances between observations, I'm throwing around a notion of cluster "distance" that lacks clarity. In this context, the principal concern is how to compute distances between clusters as they are being agglomerated. In the illustration above, the blue circle shows two clusters that have just been merged. A few options exist for how to calculate this new cluster's distance from the remaining cluster:
  • Maximum (Complete Linkage) - The distance between the farthest pair of elements, one from the newly merged cluster and one from the remaining cluster
  • Minimum (Single Linkage) - The distance between the closest pair of elements, one from the newly merged cluster and one from the remaining cluster
  • Midpoint - The distance between the average, median, or other representative elements (commonly called "medoids" or "prototypes") of the new and remaining clusters
Generally, a midpoint strategy provides the best trade-off between the chaining that single linkage allows and the outlier sensitivity of complete linkage.

For example: Imagine you are tasked with prioritizing houses for remediation after an environmental accident (call it a "spill") that affected a few points nearby. You start with the spill points to initialize clustering. Under a minimum distance (single linkage) strategy, after the first update step (each spill is now grouped with the nearest house), the algorithm would prioritize cleaning a second house simply because it is close to the first house, even if a different house is closer to the spill itself (just not as close to anything as the second house is to the first)!
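If you want to see this chaining behavior for yourself, a toy example with base R's hclust is enough. The coordinates below are made up purely for illustration: the spill sits at the origin, house1 is nearest to it, house2 is close to house1 but far from the spill, and house3 is closer to the spill than house2 is.

    Sites <- rbind(spill  = c(0.0, 0.0),
                   house1 = c(0.9, 0.0),
                   house2 = c(1.6, 0.8),
                   house3 = c(0.0, 1.4))
    D <- dist(Sites)   # plain Euclidean distances between the four locations

    # Single linkage chains spill -> house1 -> house2; complete linkage
    # brings in house3 (the one actually closer to the spill) at step two.
    hclust(D, method = "single")$merge
    hclust(D, method = "complete")$merge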

General Algorithmic Strategy


Before Diving In
Under the hood, clustering algorithms are almost always "greedy algorithms", which means they make lots of locally optimal choices in the hopes of arriving at a good approximation of the globally optimal solution. If you're thinking that this language implies there is an objective function to optimize, you'd be right. It is possible to describe the goal of a given algorithmic strategy as an optimization problem, though the greedy procedure will not always land on the same solution in practice. Implementation decisions like the choice of objective function and solution heuristics also drive developers' choices of default measurement and solution strategies. Before overriding the defaults, be sure to read and understand the documentation that accompanies your statistical software or package.

After the initial distance or dissimilarity calculations (strategies for this are covered in the previous post, but briefly revisited here), agglomerative algorithms generally apply the Lance-Williams formula recursively to calculate distances between new clusters and the remaining points:
d((i,j), k) = α_i d(i,k) + α_j d(j,k) + β d(i,j) + γ |d(i,k) − d(j,k)|
The interpretation is straightforward: At each step, the optimal pair of clusters must be identified for merging according to some rule, and the distances between all the clusters must be recalculated for the next update step. Using this notation, the distance is calculated from the newly combined cluster (i,j) (formed from the previously disjoint clusters i and j) to each other disjoint cluster, k.
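As a sketch (this helper is illustrative, not from any package), the update is simple enough to write out by hand; plugging in the single linkage weights discussed below shows it collapsing to the minimum of the two old distances:

    # One Lance-Williams update: the dissimilarity between the newly merged
    # cluster (i,j) and an untouched cluster k, given the pre-merge dissimilarities
    LanceWilliams <- function(d_ik, d_jk, d_ij, alpha_i, alpha_j, beta, gamma) {
      alpha_i * d_ik + alpha_j * d_jk + beta * d_ij + gamma * abs(d_ik - d_jk)
    }

    # Single linkage weights: alpha_i = alpha_j = 0.5, beta = 0, gamma = -0.5
    LanceWilliams(d_ik = 4, d_jk = 7, d_ij = 3,
                  alpha_i = 0.5, alpha_j = 0.5, beta = 0, gamma = -0.5)   # 4, i.e. min(4, 7)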

However, we have some choices to make as modelers: 
  • Is there something about, say, cluster i that makes distance from it count for more (or less) than distance from cluster j?
  • If the newly created ij cluster is spread out (d(i,j) is large), do we think we're in danger of creating a sink by accident? After all, large, sparse clusters will likely appear closer to many points, risking a chain reaction.
  • Finally, what if the distance between former cluster i and current cluster k is significantly different from that of former cluster j and current cluster k? Should we punish or reward that heterogeneity?
As you may have guessed, the first bullet addresses the α parameters, the second addresses the β, and the third can be used to address γ (I promise, that's a gamma, not a "y") or to aid some other distance calculation, as we'll see in the next section. It may seem you're doomed to waste many hours tweaking these parameters in vain (and with dubious results), but we're lucky enough to be able to stand on the shoulders of giants in 2017 - there are several ready-made strategies to employ here:
Extreme Point Strategies:  Referred to as "Single-Linkage" and "Complete-Linkage" for no discernible reason, these strategies apply the following weighting schemes:
Single linkage: α_i = α_j = 1/2, β = 0, γ = −1/2. Complete linkage: α_i = α_j = 1/2, β = 0, γ = +1/2.
Thus, each scheme takes half of the third cluster's distance to each of the two former clusters and then subtracts (single-linkage) or adds (complete-linkage) half of the absolute difference between those two distances. If you're having trouble seeing the equivalence between this operation and simply taking the minimum or maximum of the two distances, work it through with two numbers: their average minus half their absolute difference is the smaller of the two, and their average plus half the difference is the larger.
The Lance-Williams formula was designed with generality in mind, and as a result, it is capable of expressing many strategies.
Midpoint Strategies: Not surprisingly, there have been many efforts to come up with midpoint measurements that result in stable, useful clusters. I've chosen the most useful and common ones to review here. For disambiguation, these strategies are broken into two additional groups: Implicit Centers and Explicit Centers. All averages imply some central tendency, but some strategies of merging two clusters will result in a new center that is directly derived from the old centers. I think these are worth calling out.
Implicit Centers: WPGMA (Weighted Pair Group Method with Arithmetic mean, obviously) and Simple Average may appear to be relatively naive strategies because they do not consider the size of the former clusters when computing a new midpoint; however, this feature can help an algorithm avoid "snowballing" due to local optima. Accordingly, these make good benchmarks and reality checks for clustering results.
Explicit Centers: It can (but won't) be shown that the Median method (also, confusingly, known as Gower's WPGMC) places the center of a newly merged cluster at the midpoint of the two former centers, with the dissimilarity between clusters behaving like the squared Euclidean distance between those centers.
Median (WPGMC): α_i = α_j = 1/2, β = −1/4, γ = 0.
Ward's Method, on the other hand, has an interesting property: It minimizes the total within-cluster variance. By extension, Ward's Method minimizes the sum of squared errors due to lack of fit. While this is a useful property, it has been noted exhaustively on threads, boards, etc. that Ward's Method is exclusively valid over Euclidean space. Specifically, for the desirable properties of Ward's method to hold, the distances between the initial points must be proportional to the squared Euclidean distances between them.

Obviously, many dissimilarity metrics would create distances that violate this assumption. However, even deprived of Euclidean space, we don't need to quit. If you're interested in the math behind the conclusion in the Gower case, you can see page 6 of this paper. The Cliff's Notes version is this: It is possible to derive meaningful, metric distances from Gower dissimilarities. In R's cluster package, the distance calculations are the same as they are described here. Somewhat embarrassingly, the third-party website (for different software) is much clearer about the steps actually performed.

Application

With R's built-in dataset "swiss", which has observations on several socioeconomic factors from different towns, we can explore a hypothetical workflow.

Let's get the data set up (and satisfy my personal requirement that variable names not contain periods) by adding a little complexity: Below, I've discretized the variable Catholic and preserved the rankings hierarchy by making it an ordered factor. Lastly, because the row names contain the labels for the cities, I ensure they are preserved (this code would strip the row names without the last line).
Getting Started

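A sketch of that setup might look like the following. Object names (SwissData) and the exact breakpoints used to discretize Catholic are my own choices for illustration:

    library(cluster)   # daisy() and agnes() used below

    SwissData <- swiss                                       # built-in dataset
    names(SwissData) <- gsub("\\.", "", names(SwissData))    # no periods in variable names

    # Discretize Catholic, keeping the ranking as an ordered factor
    SwissData$Catholic <- cut(SwissData$Catholic,
                              breaks = c(0, 25, 50, 75, 100),
                              labels = c("Low", "MedLow", "MedHigh", "High"),
                              include.lowest = TRUE,
                              ordered_result = TRUE)

    # The row names carry the town labels - make sure they survive
    rownames(SwissData) <- rownames(swiss)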
As you can tell, I'm using the cluster package to which I alluded earlier. The daisy function provides a friendly interface to several methods of calculating distances and dissimilarities. Its output can be consumed directly by actual clustering functions, such as agnes (agglomerative nesting), which performs hierarchical clustering according to the parameters supplied.

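The calls might look something like this (a sketch; single linkage is chosen here only to make the Lance-Williams comparison concrete):

    # Gower dissimilarities handle the mixed variable types created above
    SwissDiss <- daisy(SwissData, metric = "gower")

    # Two ways of asking for the same thing:
    AgnesSingle   <- agnes(SwissDiss, diss = TRUE, method = "single")
    AgnesFlexible <- agnes(SwissDiss, diss = TRUE, method = "flexible",
                           par.method = c(0.5, 0.5, 0, -0.5))   # Lance-Williams weights for single linkage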

​The output confirms the equivalence I claimed earlier:

Supplying parameters to the Lance-Williams equation is the same as directly specifying how to group the clusters.
Let's take a peek at the Agglomerative Coefficient, a metric that shows the average similarity of the objects that were merged throughout the update steps. As you can see, the agglomerative coefficient is returned as part of the object created by a call to agnes.

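Retrieving it is a one-liner against the object from the sketch above (the exact value will depend on the distance and linkage choices made earlier):

    AgnesSingle$ac   # the agglomerative coefficient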
In the run behind this post, the AC comes out at .999, which isn't as easy to interpret as some may make it appear. See, the AC formula is the average of all (1 - [D(First) / D(Last)]), where D(First) is the dissimilarity of the first cluster an element is merged with and D(Last) is the dissimilarity at the final merge the algorithm performs. As many have noted (because they are copying the documentation), the dissimilarity D(Last) tends to grow with the number of observations, which decreases the ratio and increases the coefficient.

With groups of heterogeneous size and spread, the AC can prove to be quite misleading, with quite high values even for fairly well-clustered data. Also, because AC is an average of non-normalized dissimilarities, even a single point can severely distort it. Caveat emptor.
Now, let's check out an underappreciated feature of the agnes output object - The merge matrix:

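Using the object from the sketch above, the matrix can be pulled out directly:

    # Each row of $merge is one update step:
    #   a negative entry is a single observation (its row number in the data)
    #   a positive entry is the cluster formed at that earlier step
    head(AgnesSingle$merge, 10)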
For the sake of any future code readers, I'm leaving in my own inline comments. The merge matrix output allows users to see the order in which observations and elements were merged. This can be extremely instructive for a postmortem analysis, and also serves to highlight how differently various clustering decisions affect the behavior of the algorithm.
Finally, some visualizations:

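A sketch of those plots, again using the objects defined above; the choice of four clusters is arbitrary and only there to make the grouping and highlighting visible:

    # Clusters drawn over the first two principal coordinates of the dissimilarity matrix
    GroupLabels <- cutree(as.hclust(AgnesSingle), k = 4)
    clusplot(SwissDiss, GroupLabels, diss = TRUE, lines = 0, labels = 2,
             main = "Swiss towns in four clusters")

    # Banner plot from the agnes object
    plot(AgnesSingle, which.plots = 1)

    # Dendrogram with the same four clusters highlighted
    SwissTree <- as.hclust(AgnesSingle)
    plot(SwissTree)
    rect.hclust(SwissTree, k = 4, border = "red")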
[Plot: clusters drawn over the first two principal components of the swiss data]
For my money, plotting the clustering against the principal variance components is the best way to get an intuitive understanding of how the clustering worked: Group heterogeneity, appropriate cluster number, and cluster distances can all be assessed from here.

I also include the code for generating banner plots and dendrograms (with highlighting!), but I find these to be not only less useful but downright misleading. At first blush, dendrograms appear to be a way to judge clustering performance, but this is not the case. The sizes of clusters when cutting the dendrogram at various points and the shape of the dendrogram might seem like logical things to check, but they are not valid ways to gauge algorithm performance or appropriateness of clustering. This is just the inexperienced researcher hunting for confirmation bias, not hypothesis testing.

For something closer to hypothesis testing for hierarchical clustering, see the tools in the pvclust package, which was designed to assess cluster stability and uncertainty.
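A minimal sketch with pvclust, which bootstraps the dendrogram and attaches approximately unbiased p-values to each split. Note that pvclust clusters the columns of whatever it is given and expects numeric data, so the untouched swiss columns are transposed here; the method choices are just reasonable defaults, not a recommendation:

    library(pvclust)

    SwissBoot <- pvclust(t(swiss), method.hclust = "ward.D2",
                         method.dist = "euclidean", nboot = 1000)

    plot(SwissBoot)                   # dendrogram annotated with AU / BP values
    pvrect(SwissBoot, alpha = 0.95)   # box the clusters supported at the 95% level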

​

12/27/2017

Clustering Methods - Part One: All About Distance

By: Christopher Waldeck

A Brief Comment on Clustering

If you were the jargon-oriented type, you might call clustering an unsupervised machine learning technique. It's especially tempting to do so since clustering, in contrast to classification, isn't trying to produce a result that is readily interpretable to humans. Rather than identifying and extending patterns in a training set to assign meaningful labels (classification) to new records, clustering algorithms iteratively group observations, updating the groupings until some learning and similarity criteria are satisfied.
At first blush, any statistics major would tell you that the results obtained via clustering are not only random but dangerous - most clustering algorithms are complete, meaning they will assign every observation to a group, regardless of the absolute effect this has on groups' integrity (the algorithms minimize the relative harm done by a poor assignment).
Your friendly, neighborhood stats major is not entirely wrong. These algorithms usually need to pick random points to begin updating (improving). As a result, the final states of clustering can change from run to run. In some pathological cases (or, more likely, when an error has been made), they can be entirely unstable from one run to the next. However, armed with cross validation, a few metrics, and some empirical care, reliable clustering results should be achievable when natural clusters exist in the data.

What Does "Close" Mean?

Fair question. In all but the most textbook problems, there are usually big wrinkles in figuring out exactly how to measure distance. Got categorical variables? Ordinal variables? Did you discretize a continuous variable?

In most applied work, the answer is "Yes, and I also have a bunch of NA's." This is far from being a show-stopper, but it should be pretty intuitive that it throws a wrench in the works if you're thinking in Euclidean (continuous) space. How far is the point {Green, Cat, 3} from {Steel, Man, 14}? We can't just arrange the values of categorical variables at random (not even alphabetical order) across an axis and go off to the races; the chosen order determines the distance that will be used by the clustering algorithm.

And what about the interval length? Is the interval [Cat, Dog] shorter than [1, 4]?

The answer to this question usually comes in the form of a distance or dissimilarity matrix. When an algorithm is given a set of points (a dataset, where a row is an observation of k columns or facets), the default behavior is generally to cluster with the goal of minimizing the sum of squared Euclidean distances between the points.

However, [Cat, Dog], as you may have noticed, is not a point in continuous space. In these more general cases, we need to build dissimilarity matrices that capture how far apart these observations are from one another. We then will need to alert the algorithm (like the ones in R's "cluster" package) that it should minimize the dissimilarity between observations rather than outright distance. These are conceptually similar, but the algorithm needs to treat the inputs differently.
Danger
The algorithms to come will run regardless of whether the variables have been appropriately transformed. This leads to especially disastrous results when working with mixed variable types (continuous, categorical, etc.). The algorithms will "succeed" insofar as they will assign observations to groups, but the results will be meaningless.

As common as clustering is, it may seem odd that the first step is hot-wiring the problem, but that's exactly what we should do in many cases. Finally, when you choose an algorithm, know what it's doing under the hood. Is it expecting similarities or dissimilarities? Does it think a high value is good or bad? I emphatically encourage the use of a framework like the "cluster" package, which is designed to keep you safe while you're using these algorithms.

​Always practice safe clustering.
This article touches on the following distance metrics: Gower, Manhattan, Mahalanobis, Euclidean, and Cosine similarity. There are, of course, many others, and the intuition provided here should give you what you need to be comfortable enough to dive into, say, the Jaccard-based family of metrics, of which Cosine Similarity is a member. I chose this collection to give a blend of stuff you can use right now and things that will help you build up your knowledge later. Almost all metrics you'll find in applied work are derivatives or family members of the types discussed here.

Finally, this article will not go into depth (showing actual equations) on the calculations of most of these distance metrics because modern statistical software packages will do the math after the user specifies his preference as a function argument or method parameter. For clustering purposes, it's sufficient to get a conceptual hold on what these distances mean.

Gower - Generalized Manhattan Distances For Mixed Variable Types

Note: Gower distance, Gower similarity, Gower dissimilarity, and Gower coefficient can all be used to refer to the same metric. For other metrics, the relation Dissimilarity = (1 - Similarity) holds, but not in the Gower context. The Gower coefficient was specifically created to be a metric of similarity that formed a positive semi-definite similarity matrix. 
​Usually, sophisticated distance measures would be at the end of the list. However, because most analyses work with mixed variable types, I'm putting this up front. The Gower dissimilarity matrix is a square matrix of Gower distances between observations. Standard software packages will do these calculations for you, which is convenient and ensures consistency across applications and analysts on one hand, but comes at the risk of creating "grey boxes" within the analysis packages created by your group or company. We'll get into exactly what a Gower dissimilarity matrix contains, but leave implementation-specific points alone.

Recall that the intuitive linear distance between points is of limited use in this context. In this case, the "line" is passing through discrete space - it's bouncing between observations (or the probability thereof, in the case of categorical variables). Accordingly, a modified version of Manhattan distance is what we need.
[Illustration: three equal-length Manhattan paths (red, blue, and yellow) between two points]

[Illustration: three hollow points, each equidistant in Manhattan distance from the two original points]
​In the top illustration, you can see the red, blue, and yellow distances are equivalent: each is the Manhattan distance between the two points. In this sense, the Manhattan distance is more "flexible" than the Euclidean distance because the latter is a uniquely defined path for any two points. 
To make the difference tangible, note that the three hollow points in the second illustration are equidistant from the two original points. This is an important characteristic for our purposes because it is entirely conceivable that two observations with categorical variables could each be just as different from a third record while not being identical to one another (which a distance metric with a unique midpoint would not allow).
Gower Distance Calculations
  • Continuous Variables are scaled to the interval [0,1]
  • Ordinal Variables are ranked and scaled [0,1]
  • Categorical variable distances are given by the Dice coefficient, naturally bounded [0,1]
"Dice coefficient?" Fear not - this is just the Jaccard similarity coefficient, re-weighted to reflect the fact that a change from any category to any other is a move of the same "distance". Centrality across the categorical variables is determined by how often their levels coincide across the individuals (in fact, the Dice coefficient can be calculated from a simple cross-tab).

Also, an inherent issue with ordinal variables is repeated ranks across many individuals. While ties can be resolved with some algorithmic trickery, you should be aware that the Spearman rank correlation coefficient (frequently used as the measure of similarity in this context) breaks down when many ties exist. This is a real issue if many of your variables are categorical, and the advantage of using Kendall's Tau is explained in this page of Janos Podani's text. In R's "cluster" package, ranking is avoided altogether - the levels of ordinal variables are simply given integer codes (this is called "standard scoring").

Two final notes: 1) Even though Gower distances are unit-agnostic, the presence of outliers will have a large impact on distance calculations since they will effectively "push" other points together when the variables are re-scaled. Outliers must be dealt with or removed in preprocessing. 2) By default, a Gower distance matrix weights differences across variables evenly - this may be inappropriate if, say, your dataset has important binary variables.

Euclidean Distance - So, You're Here For Homework?

Good old-fashioned, finite-dimensional, continuous space.  In this case, no distance matrix is required - the observations themselves contain valid information about how far apart they are, and a clustering algorithm can take the data directly. However, even when working with model data, there are usually some adjustments to be made.

Clustering algorithms are easily sidetracked by outliers - just like the more general case, these need to be processed out. Also, variables must be on the same scale for most clustering algorithms, especially the simple ones. K-Means, for instance, requires that clusters be "globular" - if the surface is stretched out by multiple scales, the clusters identified will be unreliable.

In fact, these very drawbacks are the reason for the invention of Mahalanobis Distance. You may see a lot of resources and discussions on StackExchange mention that Mahalanobis Distance is actually about handling correlated data, and they are correct (technically correct - the best kind of correct). The matrix algebra used in the calculation of Mahalanobis distance effectively "divides out" the correlation among the variables. However, it is mathematically equivalent to simply say we are correcting stretched-out (elliptical) clusters, because a linear trend in the space of any pair of variables is the definition of correlation.
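If you want to convince yourself, stats::mahalanobis makes the "dividing out" concrete on a simulated elliptical cloud (the data below is generated only for illustration):

    set.seed(1)

    # Two strongly correlated variables: a stretched-out, elliptical cloud
    x <- rnorm(200)
    Cloud <- cbind(x1 = x, x2 = 2 * x + rnorm(200, sd = 0.3))

    # Squared Mahalanobis distances from the center: Euclidean distances
    # computed after the covariance structure has been divided out
    MahalDist <- mahalanobis(Cloud, center = colMeans(Cloud), cov = cov(Cloud))
    head(MahalDist)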

Cosine Distance - Build your own Distance Matrix!

This can be a head-scratcher if it isn't natural for you to think of variables as vectors in space. However, cosine similarity is related to the Jaccard similarity coefficient, the foundation for most classic document distance metrics, and it deserves to be mentioned here.

Cosine similarity is, in fact, a special case of correlation. Recall the following from matrix algebra & geometry:
  • The dot product of any two variables (as vectors), once each is centered and scaled to unit length, is the correlation between them
  • If two vectors are at right angles to one another, they are "orthogonal" or uncorrelated
  • The square root is the relationship between a square's area and the length of one of its sides
Cosine similarity has a useful geometric interpretation: If we treat two sets of word counts as vectors, then we can use the angle formed between them to determine their similarity. However, we won't need a bunch of trigonometry trickery to find the coefficient (we don't even need the cosine function, so you can stop trying to remember the SOH-CAH-TOA rules). In fact, the coefficient is found by dividing each vector by its length and then multiplying them together. To make this concrete, check out the R code below:
Cosine Similarity

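Something along these lines captures the calculation. The word counts are made up; note the 0 sitting in the middle of each vector, which is the 0-0 "match" discussed below:

    # Word counts for two short documents (the middle entry is a shared 0)
    DocA <- c(2, 5, 0, 3, 1)
    DocB <- c(1, 3, 0, 0, 4)

    # Divide each vector by its length, then multiply and sum
    CosineSim <- function(a, b) {
      sum((a / sqrt(sum(a^2))) * (b / sqrt(sum(b^2))))
    }

    CosineSim(DocA, DocB)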
Cosine similarity lends itself well to document data because joint absences contribute nothing to either the dot product or the vector lengths. As a result, 0-0 "matches" in the arrays are ignored. This is desirable because two phrases or paragraphs both lacking a given word isn't a positive match, and it certainly shouldn't inflate their computed commonality.

In fact, you can confirm this by dropping the middle 0 from each array in the code above - the answer will be the same.

Cosine similarities of word-count vectors naturally fall on the interval [0, 1], since the counts are non-negative, and outliers shouldn't be an issue because the scaling isn't a column-wise operation. There are many more modern techniques for capturing document distances that utilize syntax and semantics, and even access thesauruses to assess how similar the meanings of the words are. However, in every data mining class, cosine similarity will be the starting point for the topic of text mining.

Next - Hierarchical & Partitioning Methods

The following two parts of this article will cover Hierarchical and Partitioning clustering methods in turn, with coded examples. 
Part Two - Hierarchical Clustering
​

1/10/2017

Regression: The Manager's Guide, Part Two

By: Christopher Waldeck
THE MODEL
​What Successful Managers, Researchers, and Analysts Know Before Trusting Regression Analysis
Everyone will be relieved to hear that this article will not attempt to cover the proper way to specify regression models. That topic is both too broad and too deep for this forum. Instead, I'll focus on the real-world pitfalls and opportunities that come up most often when looking at, analyzing, and understanding regression studies.

This article is not designed to have the reader come out the other side a regression wizard. In any case, the last thing most quantitative projects need is another cook in the kitchen. Rather, its purpose is to give managers insight into issues that most often make it through the modeling process unchecked so they can be addressed by the analysts themselves. 


Note: Depending on the flavor of regression being used, estimated coefficients can take on a wide variety of meanings. I'll do my best to stick to general terms like "effect" for the right side of the equation.

A Model Is A (Set Of) Distributional Claim(s)

Consistent Distributions
Right now, someone is saying "Sure, the claim is that the right side can predict or explain the left." While true, this is woefully incomplete.

​This phrasing is far more instructive: A model specifies that a set of appropriately transformed and weighted variables has a joint distribution that is the same or tolerably similar to the distribution of the response variable.

​Put this way, something jumps out that isn't as apparent with the more slapdash formulation: 
The distributions on the left and right had better be the same type. ​​
[Image: some normal distributions for your enjoyment]
Danger: When the independent variables' joint distribution doesn't match the response distribution, estimates become extremely sensitive outside (or even within) the sampled domain & significance tests are unreliable
Solution:  Always check & publish the Q-Q Plot 
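In R this check is a couple of lines once a model object exists; lm_fit below is a stand-in fitted on a built-in dataset, not a claim about any particular model:

    lm_fit <- lm(Fertility ~ Education + Agriculture, data = swiss)   # placeholder model

    # Residual Q-Q plot: points should hug the line if the normality story holds
    qqnorm(residuals(lm_fit))
    qqline(residuals(lm_fit))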
[Image: each variable's distribution makes up a "slice" of a joint (bivariate) distribution]
Well-Sampled Distributions
Each term in a model is a "slice" of a potentially high-dimensional joint distribution, and while these slices are weighted differently, regression assumes your sample provides enough information about each slice.

That is, each slice's distribution must be sufficiently populated to accurately calculate moments (you remember: Mean, Variance, Skewness, Kurtosis, etc.).

​To make the danger here clear, take the case of an analyst looking at wage differences among groups. Let's say he had specified a model that controlled for occupation type, race, and sex. 
Having estimated the model, he speculates that there may be an additional "kicker" effect of being a black woman that is not captured by the impacts of being either a woman or black. To test this idea, he adds an interaction term to estimate the effect of being both black and a woman on wages.

Hold the phone.

With occupation type already in the model, the interaction term is interpreted as the average effect of being a black woman across occupational fields.

This opens the model up to two possible problem scenarios:
  1. The sample only has black women from a couple of occupational fields, and, in those fields, being black and a woman has a measurable impact on wages
  2. The sample has only a few black women from each field, but they all fall near the top or bottom of the wage distribution 

In the first case, the p-value would show the variable is significant because of the lack of observations outside the occupations wherein being a black woman has an impact. Thus, our model would improperly extend the effect to unsampled fields.

The second case is simply a traditional small sample problem: We shouldn't feel comfortable extending the results across all people and fields. In either case, cross-validation would fail to call our results into question because the samples lack counterexamples that likely exist in the larger population.

Danger: Adding complex terms can rapidly reduce the sample size used to estimate effects
Solution: Analysts should explicitly list sample sizes used to estimate each term in the model
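A cheap way to honor that rule for the interaction above is a cell count before the term ever enters the model. The wages data frame here is synthetic, built only so the snippet runs:

    set.seed(42)
    wages <- data.frame(race       = sample(c("Black", "White"), 200, replace = TRUE),
                        sex        = sample(c("Female", "Male"), 200, replace = TRUE),
                        occupation = sample(c("Clerical", "Service", "Technical"), 200, replace = TRUE))

    # How many observations identify each race x sex x occupation cell?
    # Tiny or empty cells mean the interaction effect rests on a handful of people.
    with(wages, table(race, sex, occupation))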
​

Errors Tell A Story

Don't Minimize Errors (They're Important)
​Quants know to look at error distributions and magnitudes. They know to check for autocorrelation and heteroskedasticity. However, rarely will an analyst actually map the errors back to the data to see how they evolve over all the data's dimensions throughout the model specification process. 

The story of how errors evolve as the model is reformulated and tuned is often summarized by aggregate metrics (Information Criteria; Predicted vs. Actuals; etc.), but mapping the errors to the data allows analysts to be specific about the way in which the model develops ("After adding the term CollegeDegree, the error for predicted wages decreased across the sample except for people who attended trade school"). Narratives like this motivate the progression from one specification to the next.
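In practice this just means carrying residuals along with the data from one specification to the next. The sketch below uses a built-in dataset and arbitrary model terms as stand-ins for the wage example:

    fit_base     <- lm(Fertility ~ Agriculture, data = swiss)
    fit_expanded <- lm(Fertility ~ Agriculture + Education, data = swiss)

    # Map both sets of residuals back onto the original rows
    ResidPath <- data.frame(swiss,
                            resid_base     = residuals(fit_base),
                            resid_expanded = residuals(fit_expanded))

    # Where did the new term help, and where didn't it?
    EducationBand <- cut(ResidPath$Education, breaks = 3)
    tapply(abs(ResidPath$resid_base),     EducationBand, mean)
    tapply(abs(ResidPath$resid_expanded), EducationBand, mean)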

Using words anyone can understand to intuitively summarize the system as it is being modeled pays dividends when it's time to use the results in anger by providing deep insight with low mental overhead.

I hear you: "Okay...How do I actually do this?"

Use A Storytelling Platform
​Personally, I recommend R Markdown, which eliminates the need to deliver results and reporting separately (read: No Slides).
​This approach swaps the traditional model-and-a-slide-deck deliverable for institutional knowledge and reproducible research your organization can apply effectively going forward. It scales your group's work, increases your impact, and decreases the long-run cost of analytical development.
​

Regression Is As Good As Your Counterfactual

"Everything should be made as simple as possible, but not simpler" - Probably Not Einstein
"Dad, I heard this story about a gorilla-"
"We don't speak of 2016, son."
In the real world, simplifying and condensing quantitative work is crucial if our results are to be trusted and relied upon. However, counterfactuals should not be left on the cutting room floor. Those hypotheses that, if true, rebut our models help us understand how sensitive the claims are.

Nate Silver's FiveThirtyEight famously forecast Hillary's 2016 election win for a long time. However, unlike cases of an upset in sports, people were extremely frustrated with the model, and most immediately concluded that the model needed to be fixed: that it was not specified properly. This knee-jerk reaction springs from a failure to understand the counterfactual. To illustrate this point, I'll refer to Silver's model as if the output were the sum of 50 (one for each state) logistic regression models.

The Electoral College sets up a situation in which candidates should generate just enough votes in each state to win (except Maine and Nebraska) and then concentrate efforts elsewhere. As it happened, Hillary generated votes above and beyond what she needed in states she had already won, while Trump won lots of states by a low margin. 

Turning to the simplified Silver model, we see that it was very nearly correct, failing to predict the critical result only slightly - but repeatedly. Complex situations (Electoral College vs. Popular Vote) confer valuable information to our counterfactuals. This model has some multiple of 50 counterfactuals: Each state having several variables that may need a different formulation to properly predict its outcome. One such alternative could posit that the outcome in each state is dependent on the concentration of efforts outside the state. Since the total amount of effort to expend is fixed, resources spent outside each state are considered lost. 

At the end of the day, after a review of the many model specifications made possible by the sensitivities the Electoral College introduces, one comes to the valuable conclusion that <bad joke> claims should be kept conservative </bad joke>. In Silver's own post mortem of the election forecasts, he says the outcome models available at the time, trained on identical data and coming to wildly different conclusions, should have been indicative of precisely this sensitivity.

​In business, we must take care to understand precisely how sensitive to misspecification our models are, especially when we are comfortable with the projections. The solution is being meticulous about testing competing formulations and fostering this discipline in others. 

Wrap Up

Managers working with analytics professionals may often and suddenly find themselves feeling technically outclassed. Using this guide, I hope you will feel empowered to ask the questions that analysts and researchers, caught up in their work, often fail to ask themselves.
​
  1. Check the distributions
  2. Make sure you know the story behind the work
  3. ​And, to paraphrase Angrist, always make a point to review your parallel worlds

10/30/2016

Regression: The Manager's Guide                                

By: Christopher Waldeck
The word "Regression" evokes something different in every mind at any table. The weary manager staring at a line drawn through 12 points, the intern halfway through crashing his Excel session, and the scientist crawling through 19 ggplot calls to figure out why his trendline code failed all have the same problem when they go to Google their issues:

Everyone uses the same toolbox, but we all speak different languages.

If we stepped into that scientist's shoes, we would likely conclude the various understandings of regression came from a common ancestor whose identity has been lost to time. The origin of this divergence is the same Bruce Lee saw in martial arts styles when he said "Any [style], however worthy and desirable, becomes a disease when the mind is obsessed with it." Groups develop dogmas over time. Since problems are approached and solved wildly differently across fields (say, between Economics and Biology), it's natural for certain points to be emphasized in one and not in another. Today, we don't even use the same textbooks.

So, what really matters?

Obviously, I'm not putting all the rules and assumptions of regression into this post. Instead, I want to give people who don't think about regression every day (or week) a useful guide for critiquing and interpreting what they're seeing. Think of it as a cheat sheet - all questions a researcher or analyst should answer and all the concerns you should have when you're presented with regression results.

It is instructive to cover regression in two parts: Understanding the Data and Interpreting the Model. 

UNDERSTANDING THE DATA
​What Successful Managers, Researchers, and Analysts Know Before Trusting Regression Analysis
The specter that haunts the dreams of any analytics manager is Revision. New information, data problems, and model errors all may mean you need to change reported numbers, and even when it's completely unavoidable, that will always take a toll on credibility.

We can't always control the need to revise, but there are some unforced errors that come up consistently. Here are the questions you need to have answered so your group can sidestep some classic slip-ups.

​QUESTION ONE: HOW DID THIS DATA GET HERE?

Are the new poll numbers in?
          I hope the pollster didn't use a landline - my grandparents are the only people I know who own a phone with a cord.

What do the new SAT scores say about inner city performance?​
          Check the fine print for how they handled students who dropped out before they could take it. 

Reading the Times piece on Millennials caring more about money than previously believed?
          It turns out people who will participate in a study for $7 think about money quite a bit.
Understanding how your data got from the world to your computer is critical for detecting serious problems. And these aren't the "I noticed a .21 autocorrelation in your errors" problems that only concern you in theory; these are the "There's a box in your office, and your security badge won't open the doors on your way out" kind of issues:
Pseudoreplication
Censoring
Selection Bias
​​Pseudoreplication, one of many big words for simple problems, means there is something the same (or similar) about all (or many) of your data points that you couldn't see. This error throws all the advantages of a random sample right out the window. In the case above, a telephone survey is misleading because the respondents had something in common: they were all three-quarters fossilized.
Censoring, besides being an annoying detail of sub-premium TV, happens when you wind up looking at what's left of your data by accident. For example, if students drop out before they can take the test that shows how poorly they are doing, the results will appear artificially high.   ​
Selection Bias has a few different flavors, but the case above is pervasive in Behavioral Economics. If you entice a bunch of people to complete a survey in the pursuit of knowledge on a particular group (say, Millennials), you will wind up with information on a subset of that group: Millennials who will take 30 minutes to complete a survey for $7, which shouldn't be used to represent the broader group. Always ask if you or others are inadvertently selecting data.

Question Two: Can this data answer your question?

​When you understand how the data came to be in the model and what systems generated that data, it's time to evaluate it in the context of the question posed by your researcher. The point is this: You can have a great model with perfect data and ruin it by asking a bad question.
"The Average Person Has Fewer Than Two Feet"
Always be aware that regressions are complex averages. Parameter estimates are average derivatives, and "forecasts" are average responses. Things like outliers need to be worried about (exactly how worried and why is a topic for another post) for exactly the same reason as in any other context: the average is not a reasonable stand-in for the expected value. Accordingly, regression (in its vanilla form) is not suitable for forecasting unlikely events, though it can show how unlikely past events were. OLS assumes (asymptotically) normally distributed errors, implying they have not only the bell-curve shape, but span an infinite range. If you're looking at an inherently non-negative variable with a bunch of observations at or near 0, that has consequences for the methodology.

Be wary of questions that can't be answered well by averages at all. While it may be arithmetically true that the average number of feet per person is less than two, that data has all useful information averaged out of it.​
Does the data capture the system or just the hypothesis? ​
If you looked at the relationship between changes in the minimum wage and social cohesion, you'd likely conclude that it could be done using data from either 19th century France or modern-day South Africa. However, if you used both, you would likely find nothing at all.

Why? When the minimum wage went up in France, it was about the beginning of democracy: An impoverished population overcoming aristocracy. In South Africa, the minimum wage is used to price blacks out of the labor market. Taken as a panel, the offsetting "effects" of the minimum wage on social cohesion (France's wage being positively correlated with social cohesion while South Africa's is negatively correlated) would yield a wrong answer for both France and South Africa. (This case is an example of Omitted Variable Bias: Getting useless results by leaving something out.)

This is why I'm paranoid about my data capturing details of the mechanism being studied, not just my question.  In any system, (economies, environments, etc.) the thing you're interested in probably interacts with that system in many ways. When people rush to isolate the variables of interest (as in the example above), they're bound to miss important details.

Question Three: How (if at all) should this data be extrapolated?

​This is the question that likely matters most to you and almost definitely the one the researcher spent the least time thinking about. There are two dimensions of extrapolation: Intensive & Extensive.​ You need to understand exactly how far the data lets you go in both dimensions.                                                                                                                                                                    
  • Extensive Extrapolation means using regression estimates from the studied group on different parts of a population. A naive approach says extrapolation is justified if the groups are similar, but the proper approach also asks if a new population has the same ability to respond as the studied group did. Returning to the South Africa example, what if social cohesion is already so nonexistent that a marginal increase in discrimination has no impact at all? Should that result be applied elsewhere?
  • Intensive Extrapolation is applying regression coefficients to values outside the range (de-/increasing the intensity) observed in the sample. While we may estimate that toddlers grow 1 inch taller per 10 pounds of food eaten, that clearly doesn't imply we can make a toddler grow 10 inches by cramming 100 pounds of food into him.

A subtle version of this exists when regression coefficients for control variables are estimated on typical data and extreme values are then applied to the variables of interest. Consider this: For most people, car maintenance expenses are unrelated to weight, but the morbidly obese may have costly problems with alignment and tire wear, especially if they are more likely to own older, less robust cars.

​If I were to study the impact of weight on annual expenses, but control for vehicle maintenance, which is a function of weight at extreme values, the estimated impact of weight on total expenses would be too low, since I accidentally attributed some of weight's impact to car maintenance.    
                       

Wrap Up

Finally, it's important to understand that, by asking these questions, no one is hoping to "catch" the researcher in an error. This is just due diligence. In the real world, the data always has issues, and it's no one's job to create the perfect dataset - we're after results. When giving the results, though, managers and executives should be able to articulate both the insight AND the scenarios when it should be applied.

Put simply: We need to know both what we believe and why we believe it. Understanding the data takes us much closer to being confident in our results and clear about how they should be incorporated into strategy.

