Why This Function (WTF)? A diatribe on parametric model specification

Author: Jeffrey S. Racine

Published: July 27, 2018

It is not uncommon to encounter practitioners who scribble down a parametric regression model and then proceed as if their model had generated the data.¹ No model selection exercise is attempted, model uncertainty is ignored, and statements are made about the underlying process that generated the data based solely on the model they scribbled down. That is, they proceed as if their ad hoc model is the one true model that faithfully mimics the unknown data generating process (DGP) and therefore could plausibly have generated the data. Any bridge that might otherwise connect the DGP to their model remains uncrossed. This is a conceit that betrays an excessively favourable opinion of the practitioner’s ability to divine the true nature of unknown relationships.

Before proceeding further, a serious scientist engages in the process of model building and establishes that their parametric model survives the gauntlet of model diagnostics and is not likely to be at odds with the DGP. Serious scientists confront data because they are interested in learning about some aspect of the underlying DGP, not because they are entranced by the coefficients of some ad hoc model they scribbled down. Model uncertainty is acknowledged from the outset, and a proper treatment of its existence is recognized to be the foundation upon which sound statistical inference rests. Analysis is based on data that is acknowledged to be generated by some unknown DGP, and not by some simplistic ad hoc parametric function; apparently, the two are easily confused.

Suppose that Mother Nature took a random draw from the real number line \(\mathbb{R}\) (call this draw \(\theta\)). Now suppose that we took a random draw from \(\mathbb{R}\) (call this \(\theta^*\)). Under any continuous distribution on \(\mathbb{R}\), a single point carries no probability mass, so the likelihood that the two draws are equal is zero (i.e., \(\operatorname{Pr}(\theta^*=\theta)=0\)). We teach students in introductory statistics that \(\{\theta^*=\theta\}\) is a measure zero event, which is precisely why we assign it zero probability. It would be a conceit for us to presume or assert that \(\theta^*=\theta\), and we would be wrong 100% of the time.
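The point is easily illustrated by simulation. Below is a minimal Python sketch; the standard normal standing in for “a random draw from \(\mathbb{R}\)” is my own illustrative assumption (any continuous distribution would do):

```python
import numpy as np

rng = np.random.default_rng(42)
n_pairs = 1_000_000

# Mother Nature's draws and our draws, both from a continuous
# distribution on the real line (standard normal, purely for illustration)
theta = rng.standard_normal(n_pairs)
theta_star = rng.standard_normal(n_pairs)

# {theta_star == theta} is a measure zero event: in a million
# independent pairs, no two draws ever coincide exactly
print((theta == theta_star).sum())  # 0
```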

Suppose that Mother Nature drew a function at random from the space of, say, Lipschitz continuous functions \(\mathcal{F}\) (call this \(\theta(x)\)). Now suppose that we drew a function at random from \(\mathcal{F}\) (call this \(\theta^*(x)\)). The space \(\mathcal{F}\) is infinite-dimensional, richer still than \(\mathbb{R}\), and by the same logic \(\operatorname{Pr}\left(\theta^*(x)=\theta(x)\ \text{almost everywhere (a.e.)}\right)=0\). It would be a conceit for us to presume or assert that \(\theta^*(x)=\theta(x)\) a.e., and we would be wrong 100% of the time.

We conduct regression analysis because we are interested in learning about some unknown conditional mean function \(\operatorname{E}(Y|X=x)\), denoted by \(\theta(x)\). Presuming additive errors and strict exogeneity, the unknown relationship of interest can be written as \(y=\theta(x)+\epsilon\), where \(\epsilon\) is a stochastic error term, \(y\) is some outcome of interest, and \(x\) is a vector of predictors. It is a conceit to presume or assert that, for any function we fancy writing down (call this \(\theta^*(x)\)), the function just so happens to coincide with that for the underlying DGP, i.e., to assert that \(\theta^*(x)=\theta(x)\) a.e. After all, the space of conditional mean functions is every bit as vast as \(\mathcal{F}\) above.

The statement that a fitted ad hoc parametric model (call this \(\hat\theta^*(x)\)) such as the popular linear additive specification is statistically consistent is therefore most curious. For instance, people often state that the OLS/ML estimator \(\hat\beta=(X'X)^{-1}X'Y\) is consistent for \(\beta\) in the linear model \(y=X'\beta+\epsilon\), i.e., that \(X'\hat\beta=\hat\theta^*(x)\) is consistent for \(X'\beta=\theta^*(x)\). But the claim “a straight line fitted by the method of least squares is a consistent unbiased estimator of a straight line” is a far cry from the claim “a straight line fitted by the method of least squares is a consistent unbiased estimator of any unknown conditional mean function.” Since we are interested in the underlying DGP, what we in fact require is that \(X'\hat\beta=\hat\theta^*(x)\) is consistent for \(\theta(x)=\operatorname{E}(Y|X=x)\); this will hold only if \(\theta^*(x)=\theta(x)\) a.e. which, as noted above, is a rather unlikely event.
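A minimal Python sketch makes the distinction concrete. The DGP \(\theta(x)=\sin(2\pi x)\), the uniform design, and the sample sizes are my own illustrative assumptions; the takeaway is that the least squares line converges to the best linear approximation of \(\theta(x)\), not to \(\theta(x)\) itself:

```python
import numpy as np

rng = np.random.default_rng(42)

def theta(x):
    # the unknown conditional mean E(Y|X=x); here, an illustrative sine
    return np.sin(2 * np.pi * x)

x0 = 0.25  # evaluation point, where theta(x0) = 1.0

for n in (100, 10_000, 1_000_000):
    x = rng.uniform(0, 1, n)
    y = theta(x) + rng.standard_normal(n)   # y = theta(x) + eps
    X = np.column_stack((np.ones(n), x))    # intercept and slope
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # The fitted line stabilizes at the pseudo-true (best linear)
    # coefficients, so the fitted value at x0 converges to roughly
    # 0.48, not to theta(x0) = 1: the bias does not vanish with n.
    print(n, beta_hat[0] + beta_hat[1] * x0)
```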

There is no getting around the fact that any ad hoc underspecified model \(\hat\theta^*(x)\) will be biased and inconsistent for \(\theta(x)\). Furthermore, the linear (in parameters and predictors) additive model devoid of interaction terms is the simplest and most underspecified model possible—it is a corner solution, a limiting case; the only model more underspecified is a constant function, which would deliver the unconditional mean rather than the conditional mean. Notwithstanding, it is the go-to model for a not inconsequential number of practitioners. To claim consistency for every unknown DGP encountered betrays an alarming lack of sophistication. Moreover, since the practitioner has assumed a fixed model that does not depend on the sample size, the reported measures of precision ignore model uncertainty and typically suggest that the estimate is more precise (i.e., less variable) than it really is. Consequently, conclusions supported by such models may in fact be unsupportable, which is particularly worrisome; the production of biased and inconsistent estimates that appear more precise than they actually are should not be given a free pass.
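The false-precision point can also be made concrete. In the sketch below (the same illustrative sine DGP as above, my assumption rather than anything in the text), the textbook 95% confidence interval for \(\operatorname{E}(Y|X=x_0)\) shrinks with \(n\) around the wrong value, so its actual coverage of \(\theta(x_0)\) collapses:

```python
import numpy as np

rng = np.random.default_rng(42)
theta0 = 1.0                    # theta(x0) = sin(2*pi*0.25) = 1
x0_vec = np.array([1.0, 0.25])  # regressor vector at x0 = 0.25
reps = 1_000

for n in (50, 500, 5_000):
    covered = 0
    for _ in range(reps):
        x = rng.uniform(0, 1, n)
        y = np.sin(2 * np.pi * x) + rng.standard_normal(n)
        X = np.column_stack((np.ones(n), x))
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ (X.T @ y)
        resid = y - X @ beta_hat
        sigma2 = resid @ resid / (n - 2)
        fit = x0_vec @ beta_hat
        se = np.sqrt(sigma2 * (x0_vec @ XtX_inv @ x0_vec))
        # textbook interval, valid only if the model were correct
        covered += (fit - 1.96 * se <= theta0 <= fit + 1.96 * se)
    # coverage of theta(x0) falls toward zero as n grows: the
    # interval tightens around the pseudo-true value, not the truth
    print(n, covered / reps)
```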

The not uncommon practice of writing down a linear additive model and sidestepping the important and non-trivial process of model selection is difficult to justify. The linear (in predictors and parameters) additive model lacking in interaction terms is a limiting and extreme case, a fiction that should be used solely for the purpose of teaching the principles of estimation and inference. Many have had it drilled into their heads that the presumption of linearity and additivity is for pedagogical purposes only and should not be used to describe actual relationships. Some apparently missed that lecture and use this naïve specification for all of their applied analysis regardless of the provenance of the data.
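Model diagnostics need not be exotic. As one minimal example of running that gauntlet, here is Ramsey’s RESET test implemented by hand in Python (the sine DGP is again my illustrative assumption, and RESET is merely one of many available specification tests, not a procedure prescribed here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.standard_normal(n)

# Restricted model: the ad hoc linear additive specification
X_r = np.column_stack((np.ones(n), x))
b_r, *_ = np.linalg.lstsq(X_r, y, rcond=None)
yhat = X_r @ b_r
rss_r = np.sum((y - yhat) ** 2)

# Unrestricted model: augment with powers of the fitted values
X_u = np.column_stack((X_r, yhat**2, yhat**3))
b_u, *_ = np.linalg.lstsq(X_u, y, rcond=None)
rss_u = np.sum((y - X_u @ b_u) ** 2)

# F-test of the q = 2 added terms; a small p-value flags the
# linear specification as inconsistent with the data
q, df_u = 2, n - X_u.shape[1]
F = ((rss_r - rss_u) / q) / (rss_u / df_u)
print(F, stats.f.sf(F, q, df_u))
```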

If a practitioner behaves as if any parametric model they scribble down is correctly specified, then they are asserting they know the one true functional form of some underlying unknown relationship. But if they know the true functional form, why do they not know the true parameter values as well? Both require the same degree of self-deception to my way of thinking (though I presume that they would be less likely to get away with the latter). So, the next time you are presented with results from a simple additive, linear (in parameters and predictors), no interactions regression model, it would behoove you to pose the following question to the presenter—“WTF?”


Footnotes

  1. An easy tell is a parametric model that is additive, linear in parameters and predictors, and devoid of interaction terms.