REAL BUSINESS ANALYTICS
Do The Math

1/10/2017

Regression: The Manager's Guide, Part Two

By: Christopher Waldeck
THE MODEL
What Successful Managers, Researchers, and Analysts Know Before Trusting Regression Analysis
Everyone will be relieved to hear that this article will not attempt to cover the proper way to specify regression models. That topic is both too broad and too deep for this forum. Instead, I'll focus on the real-world pitfalls and opportunities that come up most often when looking at, analyzing, and understanding regression studies.

This article is not designed to have the reader come out the other side a regression wizard. In any case, the last thing most quantitative projects need is another cook in the kitchen. Rather, its purpose is to give managers insight into issues that most often make it through the modeling process unchecked so they can be addressed by the analysts themselves. 


Note: Depending on the flavor of regression being used, estimated coefficients can take on a wide variety of meanings. I'll do my best to stick to general terms like "effect" for the right side of the equation.

A Model Is A (Set Of) Distributional Claim(s)

Consistent Distributions
Right now, someone is saying "Sure, the claim is that the right side can predict or explain the left." While true, this is woefully incomplete.

​This phrasing is far more instructive: A model specifies that a set of appropriately transformed and weighted variables has a joint distribution that is the same or tolerably similar to the distribution of the response variable.

Put this way, something jumps out that isn't as apparent with the more slapdash formulation: The distributions on the left and right had better be the same type.
[Figure: Some normal distributions for your enjoyment]
Danger: When the independent variables' joint distribution doesn't match the response distribution, estimates become extremely sensitive outside (or even within) the sampled domain & significance tests are unreliable
Solution:  Always check & publish the Q-Q Plot 
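
To make the Q-Q check concrete, here is a minimal sketch. It assumes the plot compares model residuals against a normal reference (the article does not say which distribution is being checked), and the data, the `qq_points` and `max_tail_gap` helpers, and the 5% tail cutoff are all invented for illustration. Rather than drawing the plot, it computes the points a Q-Q plot would show and measures how far the tails stray from the 45-degree line.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Pair each sorted, standardized residual with the normal quantile it should match."""
    n = len(residuals)
    mu, sigma = mean(residuals), stdev(residuals)
    observed = sorted((r - mu) / sigma for r in residuals)
    # Plotting positions (i + 0.5) / n avoid the infinite quantiles at 0 and 1.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))

def max_tail_gap(points, tail=0.05):
    """Largest |observed - theoretical| gap among the most extreme points on each side."""
    k = max(1, int(len(points) * tail))
    tails = points[:k] + points[-k:]
    return max(abs(obs - theo) for theo, obs in tails)

rng = random.Random(0)  # synthetic data, for illustration only
normal_resid = [rng.gauss(0, 1) for _ in range(2000)]
skewed_resid = [rng.expovariate(1.0) for _ in range(2000)]  # heavily right-skewed

normal_gap = max_tail_gap(qq_points(normal_resid))
skewed_gap = max_tail_gap(qq_points(skewed_resid))
print(f"normal residuals: max tail gap = {normal_gap:.2f}")
print(f"skewed residuals: max tail gap = {skewed_gap:.2f}")
```

On a real Q-Q plot the skewed case shows up as points peeling away from the diagonal at the ends. That is the visual cue a manager can ask to see.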
[Figure: Each variable's distribution makes up a "slice" of a joint (bivariate) distribution]
Well-Sampled Distributions
Each term in a model is a "slice" of a potentially high-dimensional joint distribution, and while these slices are weighted differently, regression assumes your sample provides enough information about each slice.

That is, each slice's distribution must be sufficiently populated to accurately calculate moments (you remember: Mean, Variance, Skewness, Kurtosis, etc.).

To make the danger here clear, take the case of an analyst looking at wage differences among groups. Let's say he has specified a model that controls for occupation type, race, and sex. Having estimated the model, he speculates that there may be an additional "kicker" effect of being a black woman that is not captured by the impacts of being either a woman or black. To test this idea, he adds an interaction term to estimate the effect of being both black and a woman on wages.

Hold the phone.

With occupation type already in the model, the interaction term is interpreted as the average effect of being a black woman across occupational fields.

This opens the model up to two possible problem scenarios:
  1. The sample only has black women from a couple of occupational fields, and, in those fields, being black and a woman has a measurable impact on wages
  2. The sample has only a few black women from each field, but they all fall near the top or bottom of the wage distribution 

In the first case, the p-value would show the interaction term is significant only because the sample contains no black women outside the occupations in which the effect exists. Thus, our model would improperly extend the effect to unsampled fields.

The second case is simply a traditional small sample problem: We shouldn't feel comfortable extending the results across all people and fields. In either case, cross-validation would fail to call our results into question because the samples lack counterexamples that likely exist in the larger population.

Danger: Adding complex terms can rapidly reduce the sample size used to estimate effects
Solution: Analysts should explicitly list sample sizes used to estimate each term in the model
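
As a rough sketch of what such a listing could look like for the wage example, the snippet below counts the observations that actually support the black-and-female interaction term within each occupation. The records, the `interaction_support` helper, and the `min_n=30` threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical records: (occupation, race, sex). Synthetic, for illustration only.
sample = (
    [("nursing", "black", "female")] * 40
    + [("nursing", "white", "female")] * 60
    + [("nursing", "white", "male")] * 20
    + [("engineering", "white", "male")] * 90
    + [("engineering", "black", "male")] * 25
    + [("engineering", "white", "female")] * 30
    + [("law", "white", "male")] * 50
    + [("law", "white", "female")] * 45
    + [("law", "black", "female")] * 2
)

def interaction_support(rows, min_n=30):
    """Count observations behind the black x female interaction, by occupation."""
    cells = Counter(
        occ for occ, race, sex in rows if race == "black" and sex == "female"
    )
    occupations = sorted({occ for occ, _, _ in rows})
    # Report every occupation, including those with zero supporting rows.
    return {occ: (cells[occ], cells[occ] >= min_n) for occ in occupations}

support = interaction_support(sample)
for occ, (n, ok) in support.items():
    flag = "ok" if ok else "THIN"
    print(f"{occ:<12} n={n:>3}  {flag}")
```

A table like this makes both problem scenarios visible before anyone reads a p-value: occupations with zero supporting rows (the first scenario) and occupations with only a handful (the second).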

Errors Tell A Story

Don't Minimize Errors (They're Important)
​Quants know to look at error distributions and magnitudes. They know to check for autocorrelation and heteroskedasticity. However, rarely will an analyst actually map the errors back to the data to see how they evolve over all the data's dimensions throughout the model specification process. 

The story of how errors evolve as the model is reformulated and tuned is often summarized by aggregate metrics (Information Criteria; Predicted vs. Actuals; etc.), but mapping the errors to the data allows analysts to be specific about the way in which the model develops ("After adding the term CollegeDegree, the error for predicted wages decreased across the sample except for people who attended trade school"). Narratives like this motivate the progression from one specification to the next.
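
One hedged way to sketch this error mapping: fit two specifications, attach each observation's absolute error back to a data dimension, and compare the per-group averages before and after the new term. The wage figures, the schooling labels, and the helper names below are invented for illustration. The "model" is the simplest possible case (an intercept plus dummy variables, whose least-squares fit is just a mean per level).

```python
from collections import defaultdict
from statistics import mean

# Synthetic rows: (schooling, has_college_degree, wage in $k). Illustration only.
rows = (
    [("college", 1, w) for w in (70, 72, 68, 74, 66)]
    + [("trade school", 0, w) for w in (60, 62, 58, 61, 59)]
    + [("neither", 0, w) for w in (40, 42, 38, 41, 39)]
)

def fit_group_means(rows, key):
    """Least-squares fit for an intercept plus dummies on `key`: one mean per level."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[key(row)].append(row[2])
    means = {level: mean(wages) for level, wages in buckets.items()}
    return lambda row: means[key(row)]

def mean_abs_error_by(rows, predict, dimension):
    """Map each row's absolute error back to a data dimension and average it."""
    errors = defaultdict(list)
    for row in rows:
        errors[dimension(row)].append(abs(row[2] - predict(row)))
    return {level: mean(errs) for level, errs in errors.items()}

intercept_only = fit_group_means(rows, key=lambda r: "all")
with_degree = fit_group_means(rows, key=lambda r: r[1])  # adds the CollegeDegree term

before = mean_abs_error_by(rows, intercept_only, dimension=lambda r: r[0])
after = mean_abs_error_by(rows, with_degree, dimension=lambda r: r[0])

for level in before:
    change = after[level] - before[level]
    print(f"{level:<13} before={before[level]:6.2f} after={after[level]:6.2f} change={change:+.2f}")
```

In this toy data the error falls for college graduates and for people with neither credential, but rises for trade-school attendees, which is exactly the kind of sentence an aggregate fit metric would hide.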

Summarizing the system in words anyone can understand, as it is being modeled, pays dividends when it's time to use the results in anger: it provides deep insight with low mental overhead.

I hear you: "Okay...How do I actually do this?"

Use A Storytelling Platform
​Personally, I recommend R Markdown, which eliminates the need to deliver results and reporting separately (read: No Slides).
​This approach swaps the traditional model-and-a-slide-deck deliverable for institutional knowledge and reproducible research your organization can apply effectively going forward. It scales your group's work, increases your impact, and decreases the long-run cost of analytical development.

Regression Is As Good As Your Counterfactual

"Everything should be made as simple as possible, but not simpler" - Probably Not Einstein
[Figure: "Dad, I heard this story about a gorilla-" "We don't speak of 2016, son."]
In the real world, simplifying and condensing quantitative work is crucial if our results are to be trusted and relied upon. However, counterfactuals should not be left on the cutting room floor. Those hypotheses that, if true, rebut our models help us understand how sensitive the claims are.

Nate Silver's FiveThirtyEight famously forecast Hillary Clinton's 2016 election win for a long time. However, unlike cases of an upset in sports, people were extremely frustrated with the model, and most immediately concluded that the model must be fixed: that it was not specified properly. This knee-jerk reaction springs from a failure to understand the counterfactual. To illustrate this point, I'll refer to Silver's model as if the output were the sum of 50 (one for each state) logistic regression models.

The Electoral College sets up a situation in which candidates should generate just enough votes in each state to win (except Maine and Nebraska) and then concentrate efforts elsewhere. As it happened, Hillary generated votes above and beyond what she needed in states she had already won, while Trump won lots of states by a low margin. 

Turning to the simplified Silver model, we see that it was very nearly correct, failing to predict the critical result only slightly - but repeatedly. Complex situations (Electoral College vs. Popular Vote) add valuable information to our counterfactuals. This model has some multiple of 50 counterfactuals: each state has several variables, any of which may need a different formulation to properly predict its outcome. One such alternative could posit that the outcome in each state depends on the concentration of efforts outside the state. Since the total amount of effort to expend is fixed, resources spent outside each state are considered lost to that state.
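
The "slightly, but repeatedly" point can be illustrated with a toy version of the simplified picture above. To be clear, this is not FiveThirtyEight's model: the 50 states, their equal electoral weights, the forecast margins, the logistic `scale`, and the size of the shared error are all made up for the sketch.

```python
import math

def win_probability(margin, scale=2.0):
    """Logistic link: forecast margin (points) -> probability candidate A wins the state."""
    return 1.0 / (1.0 + math.exp(-margin / scale))

# 50 hypothetical states, 10 electoral votes each. Forecast margins for candidate A
# are synthetic: 20 safe wins, 10 narrow leads, 20 safe losses.
forecast_margins = [12.0] * 20 + [1.5] * 10 + [-12.0] * 20
VOTES_PER_STATE = 10

def electoral_votes(margins, shared_error=0.0):
    """Votes A is called to win if every state's margin is off by the same shared_error."""
    return sum(
        VOTES_PER_STATE
        for m in margins
        if win_probability(m + shared_error) > 0.5
    )

total = VOTES_PER_STATE * len(forecast_margins)
as_forecast = electoral_votes(forecast_margins)
small_miss = electoral_votes(forecast_margins, shared_error=-2.0)

print(f"narrow-state win probability: {win_probability(1.5):.2f}")
print(f"as forecast:       {as_forecast} of {total} votes")
print(f"2-point shared miss: {small_miss} of {total} votes")
```

Each state-level call is only a little wrong, but because the same small error repeats across every narrow state, the summed outcome flips. A counterfactual that lets errors move together across states would have flagged that fragility.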

At the end of the day, after reviewing the many model specifications made possible by the sensitivities the Electoral College introduces, one comes to the valuable conclusion that <bad joke> claims should be kept conservative </bad joke>. In his own post-mortem of the election forecasts, Silver says that the outcome models available at the time, trained on identical data yet coming to wildly different conclusions, should have been indicative of precisely this sensitivity.

​In business, we must take care to understand precisely how sensitive to misspecification our models are, especially when we are comfortable with the projections. The solution is being meticulous about testing competing formulations and fostering this discipline in others. 
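
A minimal sketch of testing competing formulations, under invented data: two specifications (linear and logarithmic in `x`) are fit to the same observations, then compared on in-sample error and on a projection outside the sampled range. The observations, the choice of specifications, and the projection point `x=20` are assumptions made for the example.

```python
import math
from statistics import mean

def ols_1d(xs, ys):
    """Closed-form least squares for y = a + b * x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - b * x_bar, b

def rmse(preds, ys):
    """Root mean squared error between predictions and observations."""
    return math.sqrt(mean((p - y) ** 2 for p, y in zip(preds, ys)))

# Synthetic observations over a narrow sampled domain (x from 4 to 8).
xs = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]
ys = [13.7, 15.2, 16.0, 17.1, 17.8, 18.9, 19.3, 20.3, 20.7]

# Two competing formulations of the same relationship.
specs = {
    "linear": lambda x: x,
    "log": lambda x: math.log(x),
}

report = {}
for name, transform in specs.items():
    a, b = ols_1d([transform(x) for x in xs], ys)
    # Bind a, b, and transform as defaults so each spec keeps its own fit.
    predict = lambda x, a=a, b=b, t=transform: a + b * t(x)
    report[name] = {
        "in_sample_rmse": rmse([predict(x) for x in xs], ys),
        "projection_at_20": predict(20.0),
    }

for name, r in report.items():
    print(f"{name:<7} rmse={r['in_sample_rmse']:.2f}  projection at x=20: {r['projection_at_20']:.1f}")

spread = abs(report["linear"]["projection_at_20"] - report["log"]["projection_at_20"])
print(f"spread between projections: {spread:.1f}")
```

Both formulations fit the sampled range closely, yet their projections diverge once they leave it. Reporting that spread alongside the preferred model is one simple way to show how sensitive a projection is to misspecification.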

Wrap Up

Managers working with analytics professionals may often and suddenly find themselves feeling technically outclassed. Using this guide, I hope you will feel empowered to ask the questions that analysts and researchers, caught up in their work, often fail to ask themselves.
  1. Check the distributions
  2. Make sure you know the story behind the work
  3. ​And, to paraphrase Angrist, always make a point to review your parallel worlds

    Author

    Christopher Waldeck believes he can eradicate the misuse of the term "Big Data". Others are not so sure.

Copyright © 2018