Semiparametric Models

Practical route to partially linear, single-index, and varying-coefficient models in np.

Keywords

npplreg, npindex, npscoef, semiparametric models, partially linear, single index

The np package is not limited to fully nonparametric regression and density estimation. It also includes a useful set of semiparametric models that are often a good compromise when a fully parametric model feels too restrictive but a fully nonparametric model feels too expensive or too unconstrained.

This page highlights three important families:

partially linear models via npplreg,
single-index models via npindex,
varying coefficient models via npscoef.

If you want a minimal downloadable script for the semiparametric route, start with np_semiparametric_quickstart.R.

For the semiparametric families that support local-polynomial (LP) estimation, there is now also a convenience route via nomad = "auto". The simplest example is partially linear regression:

## Simulate a small partially linear example, then use the automatic LP shortcut.
## nomad = "auto" uses exhaustive search for p = 1 and NOMAD otherwise.
library(np)

set.seed(42)
n <- 200
x1 <- rnorm(n)
z <- runif(n)
y <- 1 + 2 * x1 + sin(2 * pi * z) + rnorm(n, sd = 0.2)
dat <- data.frame(y, x1, z)

fit_pl_lp <- npplreg(y ~ x1 | z, data = dat, nomad = "auto", degree.max = 2L, nmulti = 1)

The same shortcut is available for npindex and npscoef on their supported LP-capable routes. As with the regression and conditional-density families, it requires the crs package because NOMAD degree search is provided there.

The current release line also uses the native crs NOMAD runtime directly, so the semiparametric nomad = "auto" route runs on the same search engine as the rest of the LP-capable conditional family.

Why use a semiparametric model?

A semiparametric model is often attractive when:

you want some structure for interpretability,
one part of the relationship is clearly nonlinear,
a full nonparametric model is more flexible than you need,
or the computational burden of the full model is too high for the problem at hand.

Partially linear models: `npplreg`

The partially linear model is useful when some regressors can reasonably enter linearly while another part of the relationship is left nonparametric.

In the old wage1 example, the idea was to treat some regressors parametrically while allowing experience to enter nonparametrically.

## Keep some regressors linear while letting experience vary nonparametrically
library(np)
data(wage1, package = "np")

model_pl <- npplreg(lwage ~ female + married + educ + tenure | exper, data = wage1)

summary(model_pl)
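A small simulation can make the split between the linear and nonparametric parts concrete. The sketch below assumes simulated data in which x1 has a true slope of 2 and z enters through a sine curve; the names sim, fit_sim, and beta_hat are illustrative only. Under those assumptions the estimated linear coefficient should land near 2.

```r
## Minimal sketch on simulated data (not one of the packaged examples)
library(np)

set.seed(1)
n  <- 200
x1 <- rnorm(n)   # enters linearly, true slope 2
z  <- runif(n)   # enters through an unknown smooth function
y  <- 1 + 2 * x1 + sin(2 * pi * z) + rnorm(n, sd = 0.2)
sim <- data.frame(y, x1, z)

## x1 goes on the linear side of the bar, z on the nonparametric side
fit_sim <- npplreg(y ~ x1 | z, data = sim, nmulti = 1)

## Extract the coefficient of the linear part
beta_hat <- as.numeric(coef(fit_sim))
print(beta_hat)
```

The nonparametric component is not summarised by a coefficient, so it is usually examined through fitted values or plots instead.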

Practical warning

Bandwidth selection for partially linear models can be much more expensive than users expect. The model looks simpler than a fully nonparametric regression, but the bandwidth selection problem can be substantially more burdensome because several cross-validated regressions run under the hood: one for the response on the nonparametric regressors, and one for each linear regressor on those same regressors.

So the practical rule is: do not assume that “partially linear” means “cheap”.
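One way to act on this warning is to separate bandwidth selection from fitting and to time the search on a subsample first. The sketch below assumes a random subsample of 150 rows and a single restart (nmulti = 1); both are illustrative choices rather than recommendations, and wage1_small, bw_small, and fit_small are hypothetical names.

```r
## Sketch: time the bandwidth search on a subsample before the full run
library(np)
data(wage1, package = "np")

set.seed(123)
wage1_small <- wage1[sample(nrow(wage1), 150), ]

## Bandwidth selection is the expensive step, so time it on its own
bw_time <- system.time(
  bw_small <- npplregbw(lwage ~ female + married + educ + tenure | exper,
                        data = wage1_small, nmulti = 1)
)

## Fitting from a stored bandwidth object skips the search entirely
fit_small <- npplreg(bws = bw_small)

print(bw_time["elapsed"])
summary(fit_small)
```

Keeping the bandwidth object around also means the fit can be repeated without paying for the search again.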

Single-index models: `npindex`

Single-index models reduce the dimension of the problem by letting the response depend on the regressors only through a single scalar linear index, with the link function left unspecified.

Two important routes are available in np:

Klein-Spady style models for binary outcomes,
Ichimura style models for continuous outcomes.

Binary outcomes

## Fit a binary single-index model on the mixed-data birthwt example
library(np)
data(birthwt, package = "MASS")

birthwt$low <- factor(birthwt$low)
birthwt$smoke <- factor(birthwt$smoke)
birthwt$race <- factor(birthwt$race)
birthwt$ht <- factor(birthwt$ht)
birthwt$ui <- factor(birthwt$ui)
birthwt$ftv <- factor(birthwt$ftv)

model_index_binary <- npindex(low ~ smoke + race + ht + ui + ftv + age + lwt,
  method = "kleinspady",
  gradients = TRUE,
  data = birthwt)

summary(model_index_binary)

Continuous outcomes

## Use the continuous-outcome single-index route on wage data
library(np)
data(wage1, package = "np")

model_index_continuous <- npindex(lwage ~ female + married + educ + exper + tenure,
  data = wage1,
  nmulti = 1)

summary(model_index_continuous)

The single-index route can be appealing when you want more flexibility than a parametric model but less curse-of-dimensionality pressure than a fully nonparametric multivariate fit.

Interpreting and tuning the single index

The 0.70-5 release substantially reworks the Ichimura and Klein-Spady objectives so they reuse the established npreg leave-one-out backend where appropriate. Bounded continuous-kernel choices are also carried consistently through fitting, evaluation, variance, and bootstrap routes. In npRmpi, the post-CV fit, evaluation, and gradient work is distributed across evaluation rows in an active session.

The coefficient vector is identified only up to scale, so npindexbw() normalizes its first coefficient to one. Read the remaining coefficients as relative movements in the index, not as standalone linear-model slopes, and choose the first continuous regressor deliberately when coefficient interpretation matters.
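The effect of the normalization can be seen by fitting the same model with the regressors in a different order. The sketch below assumes simulated data with a true index of x1 + 2 * x2 and an exponential link chosen purely for illustration; sim, fit_a, and fit_b are hypothetical names.

```r
## Sketch: the regressor listed first receives the unit coefficient
library(np)

set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
index <- x1 + 2 * x2
y  <- exp(index / 3) + rnorm(n, sd = 0.1)
sim <- data.frame(y, x1, x2)

## x1 first: the x2 coefficient is read relative to x1
fit_a <- npindex(y ~ x1 + x2, data = sim, nmulti = 1)

## x2 first: the x1 coefficient is now read relative to x2
fit_b <- npindex(y ~ x2 + x1, data = sim, nmulti = 1)

print(coef(fit_a))
print(coef(fit_b))
```

The two fits describe the same index direction up to scale, so the reported coefficients differ only because a different regressor carries the unit coefficient.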

The default optim.method = "Nelder-Mead" remains a robust derivative-free starting point for a low-dimensional search. For a higher-dimensional index, optim.method = "BFGS" can be a useful performance and sensitivity comparison. Because the objective can be nonconvex, compare objective values, convergence behavior, and fitted objects rather than selecting an optimizer by runtime alone.
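A side-by-side comparison of the two optimizers might look like the sketch below. It assumes simulated data and a single restart per optimizer, and it calls npindexbw() directly so that the objective value (fval), index coefficients (beta), and bandwidth (bw) stored in the bandwidth object can be compared; sim, bw_nm, bw_bfgs, and comparison are hypothetical names.

```r
## Sketch: compare optimizers by objective value, not only by runtime
library(np)

set.seed(11)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- exp((x1 + 2 * x2) / 3) + rnorm(n, sd = 0.1)
sim <- data.frame(y, x1, x2)

time_nm <- system.time(
  bw_nm <- npindexbw(y ~ x1 + x2, data = sim, nmulti = 1,
                     optim.method = "Nelder-Mead")
)
time_bfgs <- system.time(
  bw_bfgs <- npindexbw(y ~ x1 + x2, data = sim, nmulti = 1,
                       optim.method = "BFGS")
)

comparison <- data.frame(
  method    = c("Nelder-Mead", "BFGS"),
  objective = c(bw_nm$fval, bw_bfgs$fval),
  beta_x2   = c(bw_nm$beta[2], bw_bfgs$beta[2]),
  bandwidth = c(bw_nm$bw, bw_bfgs$bw),
  seconds   = c(time_nm["elapsed"], time_bfgs["elapsed"])
)
print(comparison)
```

If the two rows disagree noticeably in objective value or coefficients, that is a signal to raise nmulti or to inspect the fits more closely before settling on either.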

Varying coefficient models: `npscoef`

A varying coefficient model lets the effective coefficients change with another variable. This is useful when you believe the regression effect itself varies systematically across a conditioning variable.

In the old wage1 illustration, coefficients were allowed to vary with sex.

## Create numeric coefficient variables, then let them vary with sex
library(np)
data(wage1, package = "np")

wage1_augmented <- wage1
## Code each dummy so that its name matches the level it flags
wage1_augmented$dfemale <- as.integer(wage1$female == "Female")
wage1_augmented$dmarried <- as.integer(wage1$married == "Married")

model_scoef <- npscoef(lwage ~ dfemale + dmarried + educ + exper + tenure | female,
  betas = TRUE,
  data = wage1_augmented)

summary(model_scoef)

One practical detail from the older vignette still matters: the X variables in the varying-coefficient part need to be numeric, so creating explicit 0/1 versions of factors can be appropriate there.
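Once betas = TRUE is set, the observation-level coefficients can be pulled out and summarised by the conditioning variable. The sketch below assumes a reduced model with only numeric regressors so that no dummy coding is needed; fit_vc, b, and by_sex are hypothetical names.

```r
## Sketch: average the varying coefficients within each level of the smoother
library(np)
data(wage1, package = "np")

fit_vc <- npscoef(lwage ~ educ + exper + tenure | female,
                  betas = TRUE,
                  data = wage1,
                  nmulti = 1)

## One row of coefficients per observation
b <- coef(fit_vc)

## Group means show how each coefficient differs across the two sexes
by_sex <- aggregate(as.data.frame(b), by = list(sex = wage1$female), FUN = mean)
print(by_sex)
```

With a continuous conditioning variable the same coefficient matrix can be plotted against that variable instead of averaged by group.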

Hat and operator helpers

The public helper family now follows the estimator family directly: npreghat, npindexhat, npplreghat, and npscoefhat cover regression and semiparametric fits, while npudenshat, npudisthat, npcdenshat, and npcdisthat cover density and distribution objects. These helpers can return an explicit matrix, apply the operator without asking users to reconstruct it, or produce the transposed response-weighted output = "constraint" form used in constrained estimation.

In 0.70-5, npreghat() also tightens generalized-nearest-neighbor and local-constant derivative ownership. For mixed polynomial degrees, available derivative components are routed through the correct operator and unavailable components are reported explicitly rather than silently applying a different derivative.

The constrained local-polynomial scripts in the Code Catalog show the practical npreghat() workflow.

How should I choose among these?

If you want…	Start with
A mostly linear model with one clearly nonlinear part	`npplreg`
Dimension reduction through an unknown index	`npindex`
Coefficients that vary with another variable	`npscoef`
Maximum flexibility with fewer structural assumptions	`npreg`

Practical advice

Start with a smaller version of the problem first.
Be explicit about bandwidth selection when the run is expensive.
Inspect summaries carefully, including the selected bandwidths.
Compare in-sample fit and, where possible, hold-out predictive performance.
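The last point can be sketched with a simple train and hold-out split. The example below assumes simulated data with a nonlinear effect of z, a 200/100 split, and a linear model as the parametric benchmark; sim, train, test, rmse, and holdout are hypothetical names, and the split sizes are illustrative only.

```r
## Sketch: hold-out comparison of a linear model and a partially linear model
library(np)

set.seed(2024)
n  <- 300
x1 <- rnorm(n)
z  <- runif(n)
y  <- 1 + 2 * x1 + sin(2 * pi * z) + rnorm(n, sd = 0.2)
sim <- data.frame(y, x1, z)

train_id <- sample(n, 200)
train <- sim[train_id, ]
test  <- sim[-train_id, ]

fit_lm <- lm(y ~ x1 + z, data = train)
fit_pl <- npplreg(y ~ x1 | z, data = train, nmulti = 1)

## Root mean squared prediction error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

holdout <- c(
  lm      = rmse(test$y, predict(fit_lm, newdata = test)),
  npplreg = rmse(test$y, predict(fit_pl, newdata = test))
)
print(holdout)
```

A single split is noisy, so repeating the exercise over several random splits gives a steadier picture when the sample is small.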