An Example From Biology

Kingdom - Phylum - Class - Order - Family - Genus - Species

I don't know the technical definitions of these terms, but I know my 7th grade science teacher made us memorize them in this order, and I know that today is the first time this knowledge has been of any use. Consider biological life as a generic data generation process with individual beings measured as a cross-section, each with many facets. Life (uh, finds a way) naturally occurs in nested subsets. Given sets A and B with subsets {a1, a2} and {b1, b2}, we should expect that the facets of elements in a1 and a2 are more similar to each other than they are to those in b1 or b2. As in the biological classification scheme, these sets may be nested for several levels. Hierarchies are common in data; in fact, this is how the first databases were organized. There are two general algorithmic strategies used to create these groups: agglomerative ("bottom-up") methods, which start with every observation in its own cluster and repeatedly merge the closest pair, and divisive ("top-down") methods, which start with one all-encompassing cluster and repeatedly split it.
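Both strategies are available in R's cluster package (the same package the application section below uses). Here is a minimal sketch; the dataset, metric, and method choices are purely illustrative:

```r
library(cluster)

# Numeric columns of the built-in iris data; any numeric matrix would do here
X <- iris[, 1:4]

# Agglomerative (bottom-up): every point starts in its own cluster,
# and the closest pair of clusters is merged at each step
agglo <- agnes(X, metric = "euclidean", method = "average")

# Divisive (top-down): all points start in one cluster,
# which is repeatedly split
divis <- diana(X, metric = "euclidean")

# Both return tree objects that can be cut into k groups like any hclust tree
cutree(as.hclust(agglo), k = 3)
cutree(as.hclust(divis), k = 3)
```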
While agglomerative methods are more common, it should be noted that divisive algorithms lend themselves more naturally to parallel processing: a single thread or worker could continue to divide Reptiles into subsets without needing to constantly measure their distance from subsets within the Mammal class. Embarrassingly, after an entire post on the topic of distances between observations, I'm throwing around a notion of cluster "distance" that lacks clarity. In this context, the principal concern is how to compute distances between clusters as they are being agglomerated. In the illustration above, the blue circle shows two clusters that have just been merged. A few options exist for how to calculate this new cluster's distance from the remaining cluster: the minimum distance between their members (the closest pair of points), the maximum distance between their members (the farthest pair), or some midpoint or average between the clusters.
Generally, a midpoint strategy provides the best trade-off. For example: imagine you are tasked with prioritizing houses for remediation after an environmental accident (call it a "spill") that affected a few points nearby. You start with the spill points to initialize clustering. If you are using a minimum-distance strategy, then after the first update step (each spill is now grouped with the nearest house), the algorithm would prioritize cleaning a second house that is close to the first house, even if another house is closer to the spill itself (but farther from the first house than the second is)!

General Algorithmic Strategy

Before Diving In

Under the hood, clustering algorithms are almost always "greedy algorithms," which means they make lots of locally optimal choices in the hope of arriving at a good approximation of the globally optimal solution. If you're thinking that this language implies there is an objective function to optimize, you'd be right. It is possible, though not always empirically equivalent, to describe the goals of a given algorithmic strategy as an optimization problem. Implementation decisions like the choice of objective function and solution heuristics also drive developers' choices of default measurement and solution strategies. Before overriding the defaults, be sure to read and understand the documentation that accompanies your statistical software/package.

After the initial distance or dissimilarity calculations (strategies for this were covered in the previous post, but are briefly revisited here), agglomerative algorithms generally apply the Lance-Williams formula recursively to calculate distances between new clusters and the remaining clusters and points:

d((i,j), k) = α_i·d(i,k) + α_j·d(j,k) + β·d(i,j) + γ·|d(i,k) − d(j,k)|

The interpretation is straightforward: at each step, the optimal pair of clusters must be identified for merging according to some rule, and the distances between all the clusters must be recalculated for the next update step. In this notation, the distance is calculated from the newly combined cluster (i,j) (formed from the previously disjoint clusters i and j) to each other disjoint cluster, k. However, we have some choices to make as modelers:

- How much weight to give each of the two merged clusters when computing the new cluster's distance to each remaining cluster
- Whether (and how much) the distance between the two merged clusters themselves should influence the result
- Whether to make an additional adjustment based on how different the two original clusters' distances to the remaining cluster were
As you may have guessed, the first bullet addresses the α parameters, the second addresses the β, and the third can be used to address γ (I promise, that's a gamma, not a "y") or to aid some other distance calculation, as we'll see in the next section. It may seem you're doomed to waste many hours tweaking these parameters in vain (and with dubious results), but we're lucky enough to be able to stand on the shoulders of giants in 2017 - there are several ready-made strategies to employ here:

Extreme Point Strategies: Referred to as "Single-Linkage" and "Complete-Linkage" for no discernible reason, these strategies apply the weighting scheme α_i = α_j = 1/2 and β = 0, with γ = -1/2 for single-linkage or γ = +1/2 for complete-linkage. Thus, each takes half of the distance from the third cluster to each of the two former clusters, and then subtracts (single-linkage) or adds (complete-linkage) half the absolute difference between those two distances - which works out to the minimum or maximum of the two, respectively. If you're having trouble seeing the equivalence between this operation and simply taking the distance from the closest (or farthest) point of each cluster, see the illustration below. The Lance-Williams formula was designed with generality in mind, and as a result, it is capable of expressing many strategies.

Midpoint Strategies: Not surprisingly, there have been many efforts to come up with midpoint measurements that result in stable, useful clusters. I've chosen the most useful and common ones to review here. For disambiguation, these strategies are broken into two additional groups: Implicit Centers and Explicit Centers. All averages imply some central tendency, but some strategies for merging two clusters result in a new center that is directly derived from the old centers. I think these are worth calling out.

Implicit Centers: WPGMA (Weighted Pair Group Method with Arithmetic mean, obviously) and Simple Average may appear to be relatively naive strategies because they do not consider the sizes of the former clusters when computing a new midpoint; however, this feature can help an algorithm avoid "snowballing" due to local optima. Accordingly, these make good benchmarks and reality checks for clustering results.

Explicit Centers: It can (but won't) be shown that the Median method (also, confusingly, known as Gower's WPGMC) places the new cluster's center at the midpoint of the two former centers, with inter-cluster dissimilarities interpreted as squared Euclidean distances between centers. Ward's Method, on the other hand, has an interesting property: it minimizes the total within-cluster variance. By extension, Ward's Method minimizes the sum of squared errors due to lack of fit. While this is a useful property, it has been noted exhaustively on threads, boards, etc. that Ward's Method is exclusively valid over Euclidean space. Specifically, for the desirable properties of Ward's Method to hold, the distances between the initial points must be proportional to the squared Euclidean distances between them. Obviously, many dissimilarity metrics would create distances that violate this assumption. However, even deprived of Euclidean space, we don't need to quit. If you're interested in the math behind the conclusion in the Gower case, you can see page 6 of this paper. The CliffsNotes version is this: it is possible to derive meaningful, metric distances from Gower dissimilarities. In R's cluster package, the distance calculations are the same as described here. Somewhat embarrassingly, the third-party website (for different software) is much clearer about the steps actually performed.
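To see how a linkage strategy is actually selected in practice, here is a small sketch using base R's hclust, which implements the Lance-Williams update internally; the built-in USArrests data is just a stand-in:

```r
# Compare linkage strategies on the same distance matrix; the "method"
# argument selects the Lance-Williams weighting scheme
d <- dist(scale(USArrests))  # scale first so no single variable dominates

hc_single   <- hclust(d, method = "single")    # min distance between clusters
hc_complete <- hclust(d, method = "complete")  # max distance between clusters
hc_wpgma    <- hclust(d, method = "mcquitty")  # WPGMA (weighted average)
hc_ward     <- hclust(d, method = "ward.D2")   # Ward: assumes Euclidean distances

# The sequence of merge heights shows how quickly between-cluster
# dissimilarity grows under each strategy
sapply(list(single = hc_single, complete = hc_complete,
            wpgma = hc_wpgma, ward = hc_ward),
       function(h) tail(h$height, 3))
```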
Application

With R's built-in dataset "swiss", which has observations on several socioeconomic factors from different towns, we can explore a hypothetical workflow. Let's get the data set up (and satisfy my personal requirement that variable names not contain periods) by adding a little complexity: below, I've discretized the variable Catholic and preserved its ranking hierarchy by making it an ordered factor. Lastly, because the row names contain the labels for the towns, I make sure they are preserved (depending on how you reshape the data, the row names can get stripped without that last step).

Getting Started
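The original code block isn't preserved here, so the following is a minimal sketch of the setup described above; the cut points and level names for Catholic are my own illustrative choices:

```r
# A sketch of the setup described above
swiss_df <- swiss
names(swiss_df) <- gsub("\\.", "", names(swiss_df))  # e.g. Infant.Mortality -> InfantMortality

# Discretize Catholic while keeping the ordering of its levels
swiss_df$Catholic <- cut(swiss_df$Catholic,
                         breaks = c(0, 25, 75, 100),
                         labels = c("Low", "Mixed", "High"),
                         include.lowest = TRUE,
                         ordered_result = TRUE)

# Base R keeps the row names here, but restoring them explicitly guards
# against later reshaping steps that would drop the town labels
rownames(swiss_df) <- rownames(swiss)
```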
As you can tell, I'm using the cluster package to which I alluded earlier. The daisy function provides a friendly interface to several methods of calculating distances and dissimilarities. Its output can be consumed directly by the actual clustering functions, such as agnes (agglomerative nesting), which performs hierarchical clustering according to the parameters supplied. Let's take a peek at the Agglomerative Coefficient (AC), a metric that summarizes the average similarity of the objects merged throughout the update steps. The AC is returned as part of the object created by a call to agnes (see the sketch at the end of this section). When I ran this, the AC came out to .999, which isn't as easy to interpret as some may make it appear. The AC formula is the average of all (1 - [D(First) / D(Last)]), where D(First) is the dissimilarity at which an element is first merged into a cluster and D(Last) is the dissimilarity of the final merge the algorithm makes. As many have noted (because they are copying the documentation), D(Last) tends to grow with the number of observations, which shrinks the ratio and inflates the coefficient. With groups of heterogeneous size and spread, the AC can prove quite misleading, taking on high values even when the data are not particularly well clustered. Also, because the AC is an average of non-normalized dissimilarities, even a single outlying point can severely distort it. Caveat emptor.

Now, let's check out an underappreciated feature of the agnes output object - the merge matrix. For the sake of any future code readers, I leave inline comments in the code. The merge matrix lets users see the order in which observations and clusters were merged. This can be extremely instructive for a postmortem analysis, and it also highlights how different clustering decisions affect the behavior of the algorithm.

Finally, some visualizations. For my money, plotting the clustering against the principal components is the best way to get an intuitive understanding of how the clustering worked: group heterogeneity, appropriate cluster number, and cluster distances can all be assessed from there. I also include code for generating banner plots and dendrograms (with highlighting!), but I find these to be not only less useful but downright misleading. At first blush, dendrograms appear to be a way to judge clustering performance, but this is not the case. The sizes of clusters when cutting the dendrogram at various points and the shape of the dendrogram might seem like logical things to check, but they are not valid ways to gauge algorithm performance or the appropriateness of the clustering. That is just the inexperienced researcher hunting for confirmation bias, not hypothesis testing. For something closer to hypothesis testing for hierarchical clustering, see the tools in the pvclust package, which was designed to assess cluster stability and uncertainty.
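The code referenced throughout this section wasn't preserved with the text, so here is a minimal sketch of the workflow described, picking up from the swiss_df setup above; the linkage method and the choice of k are illustrative, not necessarily what the original used:

```r
library(cluster)

# Gower dissimilarities handle the mixed variable types created above
swiss_diss <- daisy(swiss_df, metric = "gower")

# Agglomerative nesting on the dissimilarity object
swiss_agnes <- agnes(swiss_diss, diss = TRUE, method = "average")

# Agglomerative coefficient (see the caveats above)
swiss_agnes$ac

# The merge matrix: row i describes the i-th merge; negative entries are
# single observations, positive entries refer back to earlier merges
head(swiss_agnes$merge)

# Project the observations into two dimensions and color by cluster;
# given a dissimilarity object, clusplot uses classical MDS rather than PCA
groups <- cutree(as.hclust(swiss_agnes), k = 3)
clusplot(swiss_diss, groups, diss = TRUE, color = TRUE, lines = 0)

# Dendrogram (which.plots = 1 gives the banner plot instead)
plot(swiss_agnes, which.plots = 2)
```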
A Brief Comment on Clustering

If you were the jargon-oriented type, you might call clustering an unsupervised machine learning technique. It's especially tempting to do so since clustering, in contrast to classification, isn't trying to produce a result that is readily interpretable to humans. Rather than identifying and extending patterns in a training set to assign meaningful labels to new records (classification), clustering algorithms iteratively group observations, updating the groupings until some learning and similarity criteria are satisfied. At first blush, any statistics major would tell you that the results obtained via clustering are not only random but dangerous - most clustering algorithms are complete, meaning they will assign every observation to a group, regardless of the absolute effect this has on the groups' integrity (the algorithms only minimize the relative harm done by a poor assignment). Your friendly, neighborhood stats major is not entirely wrong. These algorithms usually need to pick random points to begin updating (improving). As a result, the final state of the clustering can change from run to run. In some pathological cases (or, more likely, when an error has been made), the results can be entirely unstable from one run to the next. However, armed with cross-validation, a few metrics, and some empirical care, reliable clustering results should be achievable when natural clusters exist in the data.

What Does "Close" Mean?

Fair question. In all but the most textbook problems, there are usually big wrinkles in figuring out exactly how to measure distance. Got categorical variables? Ordinal variables? Did you discretize a continuous variable? In most applied work, the answer is "Yes, and I also have a bunch of NAs." This is far from being a show-stopper, but it should be pretty intuitive that it throws a wrench in the works if you're thinking in Euclidean (continuous) space. How far is the point {Green, Cat, 3} from {Steel, Man, 14}? We can't just arrange the values of categorical variables at random (or even in alphabetical order) along an axis and be off to the races; the chosen order would determine the distances used by the clustering algorithm. And what about the interval lengths? Is the interval [Cat, Dog] shorter than [1, 4]?

The answer to this question usually comes in the form of a distance or dissimilarity matrix. When an algorithm is given a set of points (a dataset, where each row is an observation of k columns or facets), the default behavior is generally to cluster with the goal of minimizing the sum of squared Euclidean distances between the points. However, [Cat, Dog], as you may have noticed, is not a point in continuous space. In these more general cases, we need to build dissimilarity matrices that capture how far apart the observations are from one another. We then need to alert the algorithm (like the ones in R's "cluster" package) that it should minimize the dissimilarity between observations rather than outright distance. These are conceptually similar, but the algorithm needs to treat the inputs differently.
This article touches on the following distance metrics: Gower, Manhattan, Mahalanobis, Euclidean, and Cosine similarity. There are, of course, many others, and the intuition provided here should give you what you need to be comfortable diving into, say, the Jaccard-based family of metrics, of which Cosine similarity is a member. I chose this collection to give a blend of stuff you can use right now and things that will help you build up your knowledge later. Almost all metrics you'll find in applied work are derivatives or family members of the types discussed here. Finally, this article will not go into depth (showing actual equations) on the calculations of most of these distance metrics, because modern statistical software packages will do the math after the user specifies a preference as a function argument or method parameter. For clustering purposes, it's sufficient to get a conceptual hold on what these distances mean.

Gower - Generalized Manhattan Distances for Mixed Variable Types

Note: Gower distance, Gower similarity, Gower dissimilarity, and Gower coefficient can all be used to refer to the same metric. For other metrics, the relation Dissimilarity = (1 - Similarity) holds, but not in the Gower context. The Gower coefficient was specifically created to be a similarity metric that forms a positive semi-definite similarity matrix. Usually, sophisticated distance measures would be at the end of the list; however, because most analyses work with mixed variable types, I'm putting this one up front. The Gower dissimilarity matrix is a square matrix of Gower distances between observations. Standard software packages will do these calculations for you, which is convenient and ensures consistency across applications and analysts, but comes at the risk of creating "grey boxes" within the analysis packages created by your group or company. We'll get into exactly what a Gower dissimilarity matrix contains, but leave implementation-specific points alone. Recall that the intuitive linear distance between points is of limited use in this context. Here, the "line" is passing through discrete space - it's bouncing between observations (or the probability thereof, in the case of categorical variables). Accordingly, a modified version of Manhattan distance is what we need.
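The formula the post presumably displayed at this point is Gower's general dissimilarity coefficient; for reference, its standard form (as implemented by functions like cluster::daisy) is:

$$d_{\mathrm{Gower}}(i,j) \;=\; \frac{\sum_{k=1}^{p} w_k\,\delta_{ijk}\,d_{ijk}}{\sum_{k=1}^{p} w_k\,\delta_{ijk}}$$

where, for a numeric variable k, d_ijk = |x_ik - x_jk| / R_k (a Manhattan-style difference scaled by that variable's range R_k); for a nominal variable, d_ijk is 0 on a match and 1 otherwise; δ_ijk is 0 whenever either value is missing, so NAs simply drop out of the average; and the w_k are per-variable weights, equal by default. The figure the post refers to apparently also brought in the Dice coefficient for the categorical pieces, which the next paragraph unpacks.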
"Dice coefficient?" Fear not - this is just the Jaccard similarity coefficient, re-weighted to reflect the fact that a change from any category to any other is a move of the same "distance". Centrality across the categorical variables is determined by how often their levels coincide across the individuals (in fact, the Dice coefficient can be calculated from a simple cross-tab). Also, an inherent issue with ordinal variables is repeated ranks across many individuals. While ties can be resolved with some algorithmic trickery, you should be aware that the Spearman rank correlation coefficient (frequently used as the measure of similarity in this context) breaks down when many ties exist. This is a real issue if many of your variables are categorical, and the advantage of using Kendall's Tau is explained in this page of Janos Podani's text. In R's "cluster" package, ranking is avoided altogether - the levels of ordinal variables are simply given integer codes (this is called "standard scoring"). Two final notes: 1) Even though Gower distances are unit-agnostic, the presence of outliers will have a large impact on distance calculations since they will effectively "push" other points together when the variables are re-scaled. Outliers must to be dealt with or removed in preprocessing. 2) By default, a Gower distance matrix weights differences across variables evenly - this may be inappropriate if, say, your dataset has important binary variables. Euclidean Distance - So, You're Here For Homework?Good old-fashioned, finite-dimensional, continuous space. In this case, no distance matrix is required - the observations themselves contain valid information about how far apart they are, and a clustering algorithm can take the data directly. However, even when working with model data, there are usually some adjustments to be made. Clustering algorithms are easily sidetracked by outliers - just like the more general case, these need to be processed out. Also, variables must be on the same scale for most clustering algorithms, especially the simple ones. K-Means, for instance, requires that clusters be "globular" - if the surface is stretched out by multiple scales, the clusters identified will be unreliable. In fact, these very drawbacks are the reason for the invention of Mahalanobis Distance. You may see a lot of resources and discussions on StackExchange mention that Mahalanobis Distance is actually about handling correlated data, and they are correct (technically correct - the best kind of correct). The matrix algebra used in the calculation of Mahalanobis distance effectively "divides out" the correlation among the variables. However, it is mathematically equivalent to simply say we are correcting stretched out (elliptical) clusters because linear trends in the space of any variables is the definition of correlation. Cosine Distance - Build your own Distance Matrix!This can be a head-scratcher if it isn't natural for you to think of variables as vectors in space. However, cosine similarity is related to the Jaccard similarity coefficient, the foundation for most classic document distance metrics, and it deserves to be mentioned here. Cosine similarity is, in fact, a special case of correlation. Recall the following from matrix algebra & geometry:
Cosine similarity has a useful geometric interpretation: if we treat two sets of word counts as vectors, then we can use the angle formed between them to determine their similarity. However, we won't need a bunch of trigonometric trickery to find the coefficient (we don't even need the cosine function itself, so you can stop trying to remember the SOH-CAH-TOA rules). In fact, the coefficient is found by dividing each vector by its length and then taking their dot product. To make this concrete, check out the R sketch at the end of this section. Cosine similarity lends itself well to document data because it normalizes by the vector lengths instead of comparing raw frequency counts. As a result, 0-0 "matches" in the arrays are effectively ignored. This is desirable because, if phrases or paragraphs have exactly 0 words in common, that isn't a positive match, and it certainly shouldn't inflate their computed commonality. You can confirm this by dropping the middle 0 from each vector in the sketch - the answer will be the same. Cosine distances naturally fall on the interval [0, 1] for non-negative data like word counts, and outliers shouldn't be an issue because the scaling isn't a column-wise operation. However, there are many more modern techniques for capturing document distances that utilize syntax, semantics, and even thesauruses to assess how similar the meanings of the words are. Still, in every data mining class, cosine similarity will be the starting point for the topic of text mining.

Next - Hierarchical & Partitioning Methods

The following two parts of this article will cover Hierarchical and Partitioning clustering methods in turn, with coded examples.
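Since the original code and its printed result aren't preserved, the following is a small stand-in with made-up word-count vectors:

```r
# Two made-up word-count vectors with a 0-0 "match" in the second position
a <- c(2, 0, 1, 3)
b <- c(1, 0, 0, 2)

cosine_similarity <- function(x, y) {
  # Divide each vector by its length, then take the dot product
  sum((x / sqrt(sum(x^2))) * (y / sqrt(sum(y^2))))
}

cosine_similarity(a, b)
# Dropping the shared zero leaves the result unchanged (~0.956 in both cases)
cosine_similarity(a[-2], b[-2])
```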
THE MODEL

What Successful Managers, Researchers, and Analysts Know Before Trusting Regression Analysis

Everyone will be relieved to hear that this article will not attempt to cover the proper way to specify regression models. That topic is both too broad and too deep for this forum. Instead, I'll focus on the real-world pitfalls and opportunities that come up most often when looking at, analyzing, and understanding regression studies. This article is not designed to have the reader come out the other side a regression wizard. In any case, the last thing most quantitative projects need is another cook in the kitchen. Rather, its purpose is to give managers insight into the issues that most often make it through the modeling process unchecked so they can be addressed by the analysts themselves. Note: Depending on the flavor of regression being used, estimated coefficients can take on a wide variety of meanings. I'll do my best to stick to general terms like "effect" for the right side of the equation.

A Model Is A (Set Of) Distributional Claim(s)
Danger: When the independent variables' joint distribution doesn't match the response distribution, estimates become extremely sensitive outside (or even within) the sampled domain, and significance tests are unreliable.
Solution: Always check & publish the Q-Q plot.
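The post doesn't reproduce the plot itself, and it isn't clear which flavor of Q-Q plot the author had in mind, so here is a sketch of two common versions; the model below is purely illustrative:

```r
# Fit any linear model (this one is just an example on a built-in dataset)
fit <- lm(Fertility ~ Education + Agriculture, data = swiss)

# Residual Q-Q plot: are the errors behaving like the model assumes?
qqnorm(residuals(fit), main = "Q-Q plot of model residuals")
qqline(residuals(fit))

# Two-sample Q-Q plot: does the fitted (predictor-driven) distribution
# cover the same range as the observed response?
qqplot(fitted(fit), swiss$Fertility,
       xlab = "Fitted values", ylab = "Observed Fertility")
```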
Having estimated the model, he speculates that there may be an additional "kicker" effect of being a black woman that is not captured by the separate impacts of being a woman or being black. To test this idea, he adds an interaction term to estimate the effect of being both black and a woman on wages (sketched in code below). Hold the phone. With occupation type already in the model, the interaction term is interpreted as the average effect of being a black woman across occupational fields. This opens the model up to two possible problem scenarios.
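To make the setup concrete, here is a hedged sketch of that kind of specification; the data are simulated, and every variable name and coefficient is made up purely so the code runs:

```r
set.seed(1)

# Hypothetical data, simulated only to make the sketch runnable
n <- 500
wages <- data.frame(
  black      = rbinom(n, 1, 0.2),
  woman      = rbinom(n, 1, 0.5),
  occupation = sample(c("Clerical", "Service", "Professional"), n, replace = TRUE),
  experience = rpois(n, 10)
)
wages$log_wage <- 2.5 + 0.03 * wages$experience - 0.10 * wages$woman -
  0.10 * wages$black - 0.08 * wages$black * wages$woman + rnorm(n, sd = 0.3)

# The black:woman interaction is the "kicker" effect, averaged across fields
fit <- lm(log_wage ~ black * woman + occupation + experience, data = wages)
summary(fit)

# Worth checking before trusting that estimate: how many observations
# actually identify the interaction within each field?
with(wages, table(black, woman, occupation))
```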
In the first scenario, the sample lacks observations outside the occupations where being a black woman actually has an impact, so the p-value shows the interaction as significant and the model improperly extends the effect to unsampled fields. The second scenario is a traditional small-sample problem: there are too few black women in the sample, and we shouldn't feel comfortable extending the results across all people and fields. In either case, cross-validation would fail to call our results into question, because the samples lack counterexamples that likely exist in the larger population.

Danger: Adding complex terms can rapidly reduce the sample size used to estimate effects.
Solution: Analysts should explicitly list the sample sizes used to estimate each term in the model.

Errors Tell A Story

Don't Minimize Errors (They're Important)

Quants know to look at error distributions and magnitudes. They know to check for autocorrelation and heteroskedasticity. However, rarely will an analyst actually map the errors back to the data to see how they evolve over all the data's dimensions throughout the model specification process. The story of how errors evolve as the model is reformulated and tuned is usually summarized by aggregate metrics (information criteria, predicted vs. actuals, etc.), but mapping the errors to the data allows analysts to be specific about the way the model develops ("After adding the term CollegeDegree, the error for predicted wages decreased across the sample except for people who attended trade school"). Narratives like this motivate the progression from one specification to the next. Using words anyone can understand to intuitively summarize the system as it is being modeled pays dividends when it's time to use the results in anger, by providing deep insight with low mental overhead. I hear you: "Okay... how do I actually do this?"

Use A Storytelling Platform

Personally, I recommend R Markdown, which eliminates the need to deliver results and reporting separately (read: no slides). This approach swaps the traditional model-and-a-slide-deck deliverable for institutional knowledge and reproducible research your organization can apply effectively going forward. It scales your group's work, increases your impact, and decreases the long-run cost of analytical development.

Regression Is As Good As Your Counterfactual

"Everything should be made as simple as possible, but not simpler" - Probably Not Einstein

"Dad, I heard this story about a gorilla-" "We don't speak of 2016, son."

In the real world, simplifying and condensing quantitative work is crucial if our results are to be trusted and relied upon. However, counterfactuals should not be left on the cutting room floor. Those hypotheses that, if true, would rebut our models help us understand how sensitive our claims are. Nate Silver's FiveThirtyEight famously forecast a Hillary Clinton win for most of the 2016 campaign. However, unlike an upset in sports, people were extremely frustrated with the model, and most immediately concluded that the model must be fixed: that it was not specified properly. This knee-jerk reaction springs from a failure to understand the counterfactual. To illustrate the point, I'll refer to Silver's model as if its output were the sum of 50 logistic regression models (one for each state). The Electoral College sets up a situation in which candidates should generate just enough votes in each state to win it (except in Maine and Nebraska) and then concentrate their efforts elsewhere.
As it happened, Hillary generated votes above and beyond what she needed in states she had already won, while Trump won many states by low margins. Turning to the simplified Silver model, we see that it was very nearly correct, failing to predict the critical result only slightly - but repeatedly. Complex situations (Electoral College vs. popular vote) give our counterfactuals valuable information. This model has some multiple of 50 counterfactuals: each state has several variables that may need a different formulation to properly predict its outcome. One such alternative could posit that the outcome in each state depends on the concentration of efforts outside that state; since the total amount of effort to expend is fixed, resources spent outside each state are considered lost. At the end of the day, after reviewing the many model specifications made possible by the sensitivities the Electoral College introduces, one comes to the valuable conclusion that <bad joke> claims should be kept conservative </bad joke>. In Silver's own post-mortem of the election forecasts, he notes that the outcome models available at the time, trained on identical data yet coming to wildly different conclusions, should have been indicative of precisely this sensitivity. In business, we must take care to understand precisely how sensitive to misspecification our models are, especially when we are comfortable with the projections. The solution is being meticulous about testing competing formulations and fostering this discipline in others.

Wrap Up

Managers working with analytics professionals may often and suddenly find themselves feeling technically outclassed. Using this guide, I hope you will feel empowered to ask the questions that analysts and researchers, caught up in their work, often fail to ask themselves.
10/30/2016

Regression: The Manager's Guide

The word "Regression" evokes something different in every mind at any table. The weary manager staring at a line drawn through 12 points, the intern halfway through crashing his Excel session, and the scientist crawling through 19 ggplot calls to figure out why his trendline code failed all have the same problem when they go to Google their issues: everyone uses the same toolbox, but we all speak different languages. If we stepped into that scientist's shoes, we would likely conclude that the various understandings of regression came from a common ancestor whose identity has been lost to time. The origin of this divergence is the same one Bruce Lee saw in martial arts styles when he said, "Any [style], however worthy and desirable, becomes a disease when the mind is obsessed with it." Groups develop dogmas over time. Since problems are approached and solved wildly differently across fields (say, between Economics and Biology), it's natural for certain points to be emphasized in one and not in another. Today, we don't even use the same textbooks. So, what really matters? Obviously, I'm not putting all the rules and assumptions of regression into this post. Instead, I want to give people who don't think about regression every day (or week) a useful guide for critiquing and interpreting what they're seeing. Think of it as a cheat sheet - all the questions a researcher or analyst should answer and all the concerns you should have when you're presented with regression results. It is instructive to cover regression in two parts: Understanding the Data and Interpreting the Model.

UNDERSTANDING THE DATA

What Successful Managers, Researchers, and Analysts Know Before Trusting Regression Analysis

The specter that haunts the dreams of any analytics manager is Revision. New information, data problems, and model errors all may mean you need to change reported numbers, and even when it's completely unavoidable, that will always take a toll on credibility. We can't always control the need to revise, but there are some unforced errors that come up consistently. Here are the questions you need to have answered so your group can sidestep some classic slip-ups.

QUESTION ONE: HOW DID THIS DATA GET HERE?

Are the new poll numbers in? I hope the pollster didn't use a landline - my grandparents are the only people I know who own a phone with a cord. What do the new SAT scores say about inner-city performance? Check the fine print for how they handled students who dropped out before they could take it. Reading the Times piece on Millennials caring more about money than previously believed? It turns out people who will participate in a study for $7 think about money quite a bit. Understanding how your data got from the world to your computer is critical for detecting serious problems. And these aren't the "I noticed a .21 autocorrelation in your errors" problems that only concern you in theory; these are the "There's a box in your office, and your security badge won't open the doors on your way out" kind of issues.

Pseudoreplication, one of many big words for simple problems, means there is something the same (or similar) about all (or many) of your data points that you couldn't see. This error throws all the advantages of a random sample right out the window. In the polling case above, a telephone survey is misleading because the respondents had something in common: they were all three-quarters fossilized.
Censoring, besides being an annoying feature of sub-premium TV, happens when you wind up looking at what's left of your data by accident. For example, if students drop out before they can take the test that shows how poorly they are doing, the results will appear artificially high.

Selection Bias has a few different flavors, but the case above is pervasive in Behavioral Economics. If you entice a bunch of people to complete a survey in pursuit of knowledge about a particular group (say, Millennials), you will wind up with information on a subset of that group - Millennials who will take 30 minutes to complete a survey for $7 - which shouldn't be used to represent the broader group. Always ask whether you or others are inadvertently selecting the data.

Question Two: Can this data answer your question?

When you understand how the data came to be in the model and what systems generated that data, it's time to evaluate it in the context of the question posed by your researcher. The point is this: you can have a great model with perfect data and ruin it by asking a bad question.

"The Average Person Has Fewer Than Two Feet"

Always be aware that regressions are complex averages. Parameter estimates are average derivatives, and "forecasts" are average responses. Things like outliers need to be worried about (exactly how worried and why is a topic for another post) for exactly the same reason as in any other context: the average can stop being a reasonable stand-in for the expected value. Accordingly, regression (in its vanilla form) is not suitable for forecasting unlikely events, though it can show how unlikely past events were. OLS assumes (asymptotically) normally distributed errors, implying they have not only the bell-curve shape but an infinite range. If you're looking at an inherently non-negative variable with a bunch of observations at or near 0, that has consequences for the methodology. Be wary of questions that can't be answered well by averages at all. While it may be arithmetically true that the average number of feet per person is less than two, that data has had all the useful information averaged out of it.

Does the data capture the system or just the hypothesis?

If you looked at the relationship between changes in the minimum wage and social cohesion, you'd likely conclude that it could be studied using data from either 19th-century France or modern-day South Africa. However, if you used both, you would likely find nothing at all. Why? When the minimum wage went up in France, it was about the beginning of democracy: an impoverished population overcoming aristocracy. In South Africa, the minimum wage is used to price black workers out of the labor market. Taken as a panel, the offsetting "effects" of the minimum wage on social cohesion (France's wage being positively correlated with social cohesion while South Africa's is negatively correlated) would yield a wrong answer for both France and South Africa. (This is an example of Omitted Variable Bias: getting useless results by leaving something out.) This is why I'm paranoid about my data capturing the details of the mechanism being studied, not just my question. In any system (economies, environments, etc.), the thing you're interested in probably interacts with that system in many ways. When people rush to isolate the variables of interest (as in the example above), they're bound to miss important details.
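To make the offsetting-effects point concrete, here is a toy simulation (entirely invented numbers, not real French or South African data): each group has a genuine, opposite-signed relationship, and the pooled regression finds essentially nothing.

```r
set.seed(2)
n <- 200

# Group A: a real positive relationship between the two variables
france <- data.frame(min_wage = runif(n), group = "France")
france$cohesion <-  1.0 * france$min_wage + rnorm(n, sd = 0.3)

# Group B: an equally real negative relationship
s_africa <- data.frame(min_wage = runif(n), group = "SouthAfrica")
s_africa$cohesion <- -1.0 * s_africa$min_wage + rnorm(n, sd = 0.3)

panel <- rbind(france, s_africa)

coef(lm(cohesion ~ min_wage, data = panel))          # pooled: the effect washes out
coef(lm(cohesion ~ min_wage * group, data = panel))  # context restored
```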
Question Three: How (if at all) should this data be extrapolated?

This is the question that likely matters most to you and almost definitely the one the researcher spent the least time thinking about. There are two dimensions of extrapolation: Intensive & Extensive. You need to understand exactly how far the data lets you go in both dimensions.
A subtle version of this exists when using regression coefficients for control variables while applying extreme values to the variables of interest. Consider this: for many people, car maintenance expenses are unrelated to body weight, but the morbidly obese may have costly problems with alignment and tire wear, especially if they are more likely to own older, less robust cars. If I were to study the impact of weight on annual expenses, but control for vehicle maintenance - which is itself a function of weight at extreme values - the estimated impact of weight on total expenses would be too low, since I would have accidentally attributed some of weight's impact to car maintenance.

Wrap Up

Finally, it's important to understand that, by asking these questions, no one is hoping to "catch" the researcher in an error. This is just due diligence. In the real world, the data always has issues, and it's no one's job to create the perfect dataset - we're after results. When giving the results, though, managers and executives should be able to articulate both the insight AND the scenarios in which it should be applied. Put simply: we need to know both what we believe and why we believe it. Understanding the data takes us much closer to being confident in our results and clear about how they should be incorporated into strategy.