Kernel Primer
This page is a website-first rewrite of the old np vignette material. The emphasis here is practical: what makes the np approach distinctive, how the workflow is organized, and what to keep in mind when working with mixed data.
Why kernel methods?
Kernel methods are attractive because they can reveal structure that a rigid parametric model may miss. The cost, of course, is that they are often computationally more demanding than the parametric alternative.
For np, the important point is not just that the methods are nonparametric, but that they are built to handle the mix of continuous, unordered, and ordered data that often appears in applied work.
Why np is different from a basic kernel smoother
Many users first meet kernel methods through routines aimed at continuous data only. In applied work, however, one often has a mixture of:
- continuous regressors,
- unordered factors,
- ordered factors.
Traditional sample-splitting or cell-based approaches for categorical variables can be very wasteful. The np package instead uses generalized product kernels so that smoothing can proceed directly with mixed data rather than by breaking the sample into many separate cells.
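To see how quickly cell-based approaches fragment a sample, consider a small simulation (illustrative base R, not np code): crossing just three five-level factors already produces 125 cells.

```r
# Hypothetical illustration: crossing a few categorical variables
# fragments the sample into many cells, leaving few observations per cell.
set.seed(42)
n <- 500
f1 <- sample(letters[1:5], n, replace = TRUE)
f2 <- sample(letters[1:5], n, replace = TRUE)
f3 <- sample(letters[1:5], n, replace = TRUE)
cells <- table(f1, f2, f3)
length(cells)  # 125 cells from only three factors
mean(cells)    # on average just 4 observations per cell
```

With so few observations per cell, a within-cell nonparametric fit is hopeless; the generalized product kernel approach avoids the split entirely by smoothing across cells.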
The primacy of the bandwidth
This is the single most important practical idea in np.
In parametric work, one typically thinks first about the functional form. In np, one should think first about the bandwidth. The bandwidth object determines the effective smoothness of the estimator and carries the key details of the method with it.
The usual workflow is:
- compute a bandwidth object,
- estimate the model using that bandwidth object,
- inspect summaries, fitted values, predictions, plots, and intervals.
A simple example:
library(np)
data(cps71, package = "np")
bw <- npregbw(logwage ~ age, data = cps71)
summary(bw)
fit <- npreg(bws = bw)
summary(fit)

You can let npreg() perform the bandwidth selection implicitly, but for serious work it is often better to be explicit so that you can inspect the resulting bandwidth object.
Why inspect the bandwidth object?
Bandwidth selection is automatic, but not infallible. Outliers, rounding, and discretization can all affect the result.
A good habit is therefore:
summary(bw)

That lets you see the chosen bandwidths and decide whether they look plausible for the application at hand.
Mixed-data workflow
The package relies on the class of each variable to determine which weighting rule is appropriate. This means that correct classing is not incidental; it is part of the estimator definition.
For example:
mydat <- data.frame(
y = rnorm(200),
x_cont = runif(200),
x_unordered = factor(sample(c("a", "b", "c"), 200, replace = TRUE)),
x_ordered = ordered(sample(1:4, 200, replace = TRUE))
)
bw <- npregbw(y ~ x_cont + x_unordered + x_ordered, data = mydat)
fit <- npreg(bws = bw)

The same formula syntax is used, but the estimator is not fitting a linear-additive model in the parametric sense. The formula is simply the interface for specifying which variable is the response and which are the covariates.
Formula interface versus data-frame interface
np supports both. The formula interface is often the easiest to read:
bw <- npregbw(y ~ x1 + x2 + x3, data = mydat)

The data-frame interface is also available and can feel more natural in some workflows:
bw <- npudensbw(dat = mydat)

The important point is that the variable classes in the data frame should be correct. If a categorical variable is not classed as a factor or ordered factor when appropriate, the wrong kernel treatment will follow.
Generalized product kernels in one sentence
If the data contain continuous and categorical variables, np builds a product kernel using the appropriate component for each variable type, then lets the bandwidths differ across variables.
That is the core device that lets the package handle mixed data in a coherent way.
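The device can be made concrete with a small sketch. The code below computes a single product-kernel weight by hand, using textbook component kernels: a Gaussian kernel for the continuous variable, the Aitchison-Aitken kernel for the unordered variable, and a geometric (Li-Racine style) kernel for the ordered variable. This is an illustration of the idea, not np's internal code; np's default kernels and normalizations may differ, and all names and parameter values here are hypothetical.

```r
# Component kernels (standard textbook forms, not np internals)
gaussian_k <- function(x, x0, h) dnorm((x - x0) / h) / h
aitchison_aitken <- function(x, x0, lam, c) {
  # weight 1 - lam on a match, lam / (c - 1) spread over the other levels
  ifelse(x == x0, 1 - lam, lam / (c - 1))
}
ordered_k <- function(x, x0, lam) lam^abs(x - x0)  # decays with distance in rank

# Product kernel: one bandwidth per variable, one component per variable type
product_weight <- function(xc, xu, xo, xc0, xu0, xo0, h, lam_u, lam_o, c_u) {
  gaussian_k(xc, xc0, h) *
    aitchison_aitken(xu, xu0, lam_u, c_u) *
    ordered_k(xo, xo0, lam_o)
}

# Weight of one observation relative to an evaluation point
product_weight(xc = 0.2, xu = "a", xo = 2,
               xc0 = 0.0, xu0 = "a", xo0 = 3,
               h = 0.5, lam_u = 0.1, lam_o = 0.3, c_u = 3)
```

Note that each lam lies between 0 and its upper bound; at one extreme the categorical variable is smoothed out entirely, at the other the estimator reduces to the cell-splitting approach.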
Computational reality
Automatic bandwidth selection is one of the strengths of the package, but it can be expensive.
Practical advice:
- start with a smaller problem while you are learning,
- use explicit bandwidth objects in serious work,
- inspect the bandwidth summary,
- expect some plots or bootstrap-based intervals to take time,
- move to npRmpi when the workflow is right but the runtime becomes the bottleneck.
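The prototyping advice can be made concrete with a sketch: time bandwidth selection on a subsample first, then rerun on the full data once the specification looks right. (The subsample size of 50 is arbitrary, chosen only for illustration.)

```r
library(np)
data(cps71, package = "np")

# Prototype on a small subsample: cross-validated bandwidth selection
# grows expensive with n, so debug the specification cheaply first.
set.seed(1)
small <- cps71[sample(nrow(cps71), 50), ]
system.time(bw_small <- npregbw(logwage ~ age, data = small))
summary(bw_small)

# When the specification looks right, rerun on the full data
# (and consider npRmpi if that is still too slow).
```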
A small regression example
library(np)
data(cps71, package = "np")
fit_lc <- npreg(logwage ~ age, data = cps71)
fit_ll <- npreg(logwage ~ age, data = cps71, regtype = "ll")
plot(cps71$age, cps71$logwage, cex = 0.25, col = "grey")
lines(cps71$age, fitted(fit_lc), col = 1, lty = 1)
lines(cps71$age, fitted(fit_ll), col = 2, lty = 2)

What to read next
- Kernel Methods for the main landing page
- MPI and Large Data if the same workflow needs MPI
- Code Catalog for the current script library
- vignette("np", package = "np") for the full package vignette
Historical note
The original Sweave vignette went much deeper into estimator families, function inventories, and a long series of applications. For a website, the cleaner first move is a shorter primer focused on mixed-data kernels, bandwidth workflow, and practical use. The more specialized material can then be split into separate pages rather than preserved as one large article.