The word "Regression" evokes something different in every mind at any table. The weary manager staring at a line drawn through 12 points, the intern halfway through crashing his Excel session, and the scientist crawling through 19 ggplot calls to figure out why his trendline code failed all have the same problem when they go to Google their issues:
Everyone uses the same toolbox, but we all speak different languages.
If we stepped into that scientist's shoes, we would likely conclude the various understandings of regression came from a common ancestor whose identity has been lost to time. The origin of this divergence is the same one Bruce Lee saw in martial arts styles when he said "Any [style], however worthy and desirable, becomes a disease when the mind is obsessed with it." Groups develop dogmas over time. Since problems are approached and solved wildly differently across fields (say, between Economics and Biology), it's natural for certain points to be emphasized in one and not in another. Today, we don't even use the same textbooks.
So, what really matters?
Obviously, I'm not putting all the rules and assumptions of regression into this post. Instead, I want to give people who don't think about regression every day (or week) a useful guide for critiquing and interpreting what they're seeing. Think of it as a cheat sheet - all questions a researcher or analyst should answer and all the concerns you should have when you're presented with regression results.
It is instructive to cover regression in two parts: Understanding the Data and Interpreting the Model.
UNDERSTANDING THE DATA
What Successful Managers, Researchers, and Analysts Know Before Trusting Regression Analysis
The specter that haunts the dreams of any analytics manager is Revision. New information, data problems, and model errors all may mean you need to change reported numbers, and even when it's completely unavoidable, that will always take a toll on credibility.
We can't always control the need to revise, but there are some unforced errors that come up consistently. Here are the questions you need to have answered so your group can sidestep some classic slip-ups.
Question One: How did this data get here?
Are the new poll numbers in?
I hope the pollster didn't use a landline - my grandparents are the only people I know who own a phone with a cord.
What do the new SAT scores say about inner city performance?
Check the fine print for how they handled students who dropped out before they could take it.
Reading the Times piece on Millennials caring more about money than previously believed?
It turns out people who will participate in a study for $7 think about money quite a bit.
Understanding how your data got from the world to your computer is critical for detecting serious problems. And these aren't the "I noticed a .21 autocorrelation in your errors." problems that only concern you in theory; these are the "There's a box in your office, and your security badge won't open the doors on your way out." kind of issues:
Pseudoreplication, one of many big words for simple problems, means there is something the same (or similar) about all (or many) of your data points that you couldn't see. This error throws all the advantages of a random sample right out the window. In the case above, a telephone survey is misleading because the respondents had something in common: they were all three-quarters fossilized.
Censoring, besides being an annoying detail of sub-premium TV, happens when you wind up looking at what's left of your data by accident. For example, if students drop out before they can take the test that shows how poorly they are doing, the results will appear artificially high.
Selection Bias has a few different flavors, but the case above is pervasive in Behavioral Economics. If you entice a bunch of people to complete a survey in the pursuit of knowledge on a particular group (say, Millennials), you will wind up with information on a subset of that group: Millennials who will take 30 minutes to complete a survey for $7, which shouldn't be used to represent the broader group. Always ask if you or others are inadvertently selecting data.
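To make the censoring case concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the score distribution, the 400-point dropout cutoff, and the assumption that only low scorers drop out. It is not a model of any real test, just a demonstration of how losing the bottom of a distribution lifts the average of what's left.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 10,000 students, each with a latent test score.
scores = [random.gauss(500, 100) for _ in range(10_000)]

# Assumption for illustration only: students whose latent score is below 400
# drop out before test day, so their scores never reach the dataset.
observed = [s for s in scores if s >= 400]

true_mean = statistics.mean(scores)
observed_mean = statistics.mean(observed)
share_missing = 1 - len(observed) / len(scores)

print(f"mean of all students:        {true_mean:.1f}")
print(f"mean of students who tested: {observed_mean:.1f}")
print(f"share of students missing:   {share_missing:.1%}")
```

The reported average comes out higher than the true one even though no individual score was altered. The only thing that changed is who made it into the data.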
Question Two: Can this data answer your question?
When you understand how the data came to be in the model and what systems generated that data, it's time to evaluate it in the context of the question posed by your researcher. The point is this: You can have a great model with perfect data and ruin it by asking a bad question.
"The Average Person Has Fewer Than Two Feet"
Always be aware that regressions are complex averages. Parameter estimates are average derivatives, and "forecasts" are average responses. Things like outliers need to be worried about (exactly how worried and why is a topic for another post) for exactly the same reason as in any other context: an average dragged around by a few extreme points is not a reasonable stand-in for the typical value. Accordingly, regression (in its vanilla form) is not suitable for forecasting unlikely events, though it can show how unlikely past events were. The usual OLS inference treats errors as (asymptotically) normally distributed, which implies not only a bell-curve shape but also an unbounded range. If you're looking at an inherently non-negative variable with a bunch of observations at or near 0, that has consequences for the methodology.
Be wary of questions that can't be answered well by averages at all. While it may be arithmetically true that the average number of feet per person is less than two, that data has all useful information averaged out of it.
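Since a regression slope behaves like an average, one extreme point can drag it around the same way it would drag a mean. The sketch below uses invented data (ten points sitting exactly on a line, plus one made-up outlier) and a small hand-written least-squares helper, `ols_fit`, which is my own illustration rather than any particular library's function.

```python
def ols_fit(xs, ys):
    """Return (slope, intercept) of a one-variable least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data: ten points lying exactly on y = 2x + 1.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
clean_slope, _ = ols_fit(xs, ys)

# Add a single made-up extreme observation and refit.
outlier_slope, _ = ols_fit(xs + [10], ys + [100])

print(f"slope without the outlier: {clean_slope:.2f}")
print(f"slope with one outlier:    {outlier_slope:.2f}")
```

Ten of the eleven points still agree perfectly on a slope of 2, yet the fitted line no longer does. When someone shows you a trendline, ask what it looks like with the most extreme observations set aside.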
Does the data capture the system or just the hypothesis?
If you set out to study the relationship between changes in the minimum wage and social cohesion, you'd likely conclude that it could be done using data from either 19th century France or modern-day South Africa. However, if you used both, you would likely find nothing at all.
Why? When the minimum wage went up in France, it was about the beginning of democracy: An impoverished population overcoming aristocracy. In South Africa, the minimum wage is used to price blacks out of the labor market. Taken as a panel, the offsetting "effects" of the minimum wage on social cohesion (France's wage being positively correlated with social cohesion while South Africa's is negatively correlated) would yield a wrong answer for both France and South Africa. (This case is an example of Omitted Variable Bias: Getting useless results by leaving something out.)
This is why I'm paranoid about my data capturing details of the mechanism being studied, not just my question. In any system (economies, environments, etc.), the thing you're interested in probably interacts with that system in many ways. When people rush to isolate the variables of interest (as in the example above), they're bound to miss important details.
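The France and South Africa story can be sketched with simulated numbers. To be clear, nothing here is real data: the two "countries" are hypothetical groups I generate so that the outcome rises with the policy variable in one and falls in the other. The point is only to show what pooling does when the thing that distinguishes the groups is left out.

```python
import random

random.seed(1)

def ols_slope(xs, ys):
    """Slope of a one-variable least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical group A: the outcome rises one-for-one with x.
x_a = [random.uniform(0, 10) for _ in range(500)]
y_a = [1.0 * x + random.gauss(0, 1) for x in x_a]

# Hypothetical group B: the outcome falls one-for-one with x.
x_b = [random.uniform(0, 10) for _ in range(500)]
y_b = [10.0 - 1.0 * x + random.gauss(0, 1) for x in x_b]

slope_a = ols_slope(x_a, y_a)
slope_b = ols_slope(x_b, y_b)

# Pool both groups and ignore which group each row came from.
pooled_slope = ols_slope(x_a + x_b, y_a + y_b)

print(f"group A slope: {slope_a:+.2f}")
print(f"group B slope: {slope_b:+.2f}")
print(f"pooled slope:  {pooled_slope:+.2f}")
```

Each group has a strong relationship on its own, but the pooled estimate lands near zero, which describes neither. Recording the group and letting the slope differ by group is one way to recover the two stories, though which variable needs to be added depends on the mechanism you're studying.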
Question Three: How (if at all) should this data be extrapolated?
This is the question that likely matters most to you and almost definitely the one the researcher spent the least time thinking about. There are two dimensions of extrapolation: Intensive & Extensive. You need to understand exactly how far the data lets you go in both dimensions.
A subtle version of this exists when using regression coefficients for control variables and applying extreme values to variables of interest. Consider this: For many, car maintenance expenses are unrelated to weight, but the morbidly obese may have costly problems with alignment and tire wear, especially if they are more likely to own older, less robust cars.
If I were to study the impact of weight on annual expenses, but control for vehicle maintenance, which is a function of weight at extreme values, the estimated impact of weight on total expenses would be too low, since I accidentally attributed some of weight's impact to car maintenance.
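Here is a rough simulation of that weight and car maintenance scenario. All of the numbers are invented: weight is in made-up standardized units, maintenance only responds to weight above an arbitrary threshold, and the coefficients are chosen for readability. To control for maintenance without pulling in a statistics library, the sketch uses the residual-on-residual trick (the Frisch-Waugh-Lovell result), which gives the same weight coefficient as a two-variable regression.

```python
import random

random.seed(2)

def mean(values):
    return sum(values) / len(values)

def slope(xs, ys):
    """Slope of a one-variable least-squares line."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def residuals(xs, ys):
    """What is left of ys after fitting a least-squares line on xs."""
    b = slope(xs, ys)
    a = mean(ys) - b * mean(xs)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

n = 5000
# Hypothetical standardized weight (0 = average person).
weight = [random.gauss(0, 1) for _ in range(n)]
# Assumption: maintenance costs only respond to weight at extreme values.
maintenance = [2.0 * max(0.0, w - 1.0) + random.gauss(0, 0.5) for w in weight]
# Total expenses include maintenance plus a direct effect of weight.
expenses = [1.0 * w + m + random.gauss(0, 1) for w, m in zip(weight, maintenance)]

# Weight's impact on total expenses, maintenance channel included.
total_effect = slope(weight, expenses)

# Weight's coefficient after controlling for maintenance.
controlled_effect = slope(residuals(maintenance, weight),
                          residuals(maintenance, expenses))

print(f"weight coefficient, no control:            {total_effect:.2f}")
print(f"weight coefficient, maintenance controlled: {controlled_effect:.2f}")
```

The controlled coefficient comes out smaller because the part of weight's impact that runs through maintenance has been credited to maintenance instead. Neither number is wrong by itself; they answer different questions, so it matters which one gets reported as "the impact of weight."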
Finally, it's important to understand that, by asking these questions, no one is hoping to "catch" the researcher in an error. This is just due diligence. In the real world, the data always has issues, and it's no one's job to create the perfect dataset - we're after results. When giving the results, though, managers and executives should be able to articulate both the insight AND the scenarios when it should be applied.
Put simply: We need to know both what we believe and why we believe it. Understanding the data takes us much closer to being confident in our results and clear about how they should be incorporated into strategy.
Copyright © 2018