Runtime, Memory, and Scaling

Practical advice on cross-validation runtime, memory, search size, and when to move from np to npRmpi or narrow crs search.
Keywords

runtime, memory, scaling, cross-validation, npRmpi, crs, NOMAD, bootstrap

This page collects the practical advice that matters once the method is right but the run becomes slow, memory-hungry, or otherwise awkward. The goal is not to promise that every job will be cheap. The goal is to help you decide what to simplify, what to tune, and when to change execution mode.

Why can these methods take time?

Because many of the key routines are doing real search, repeated fitting, or resampling rather than evaluating a closed-form expression once.

Common reasons a run is slow:

  • bandwidth selection by cross-validation,
  • multistart optimization,
  • bootstrap intervals,
  • multivariate fits,
  • conditional quantile fits,
  • large categorical search spaces,
  • spline search over degree and knot structure.

That is normal behavior, not necessarily a sign that something is broken.

A good default workflow

For np and npRmpi, a conservative sequence is:

  1. get the method right on a small serial run,
  2. inspect the fitted object and bandwidth object,
  3. simplify plotting or interval requests if needed,
  4. only then move to npRmpi if the serial workflow is right but too slow.

That sequence avoids introducing MPI complexity before the basic model is settled.
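The first two steps can be sketched as a pilot run on a random subsample. This is a minimal illustration, not a prescribed recipe: the simulated data, the variable names, the subsample size of 200, and the choice of nmulti = 1 are all assumptions made for the example.

```r
library(np)

## Simulated stand-in for the real data (hypothetical variable names)
set.seed(42)
n <- 2000
mydat <- data.frame(x1 = runif(n), x2 = rnorm(n))
mydat$y <- sin(2 * pi * mydat$x1) + 0.5 * mydat$x2 + rnorm(n, sd = 0.25)

## Step 1: pilot on a small random subsample with a single optimizer start
pilot <- mydat[sample(nrow(mydat), 200), ]
bw.pilot <- npregbw(y ~ x1 + x2, data = pilot, nmulti = 1)

## Step 2: inspect the bandwidth object and the fitted object before scaling up
summary(bw.pilot)
fit.pilot <- npreg(bws = bw.pilot)
summary(fit.pilot)
```

Once the pilot output looks sensible, the same calls can be rerun on the full data with the default number of restarts.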

np: bandwidth selection and large jobs

Bandwidth selection is often the expensive part. A practical habit is to make the bandwidth object explicit:

## Separate bandwidth selection from fitting when the run may be expensive
bw <- npregbw(y ~ x1 + x2, data = mydat)
fit <- npreg(bws = bw, data = mydat)

That makes it easier to:

  • inspect the selected bandwidths,
  • reuse them across later fits,
  • avoid recomputing the same object repeatedly.
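One way to get those three benefits is to save the bandwidth object to disk and reload it later. The sketch below assumes simulated data and a temporary file path chosen only for illustration; saveRDS() and readRDS() are base R.

```r
library(np)

## Simulated stand-in for the real data (hypothetical variable names)
set.seed(1)
mydat <- data.frame(x1 = runif(300), x2 = runif(300))
mydat$y <- mydat$x1^2 + cos(4 * mydat$x2) + rnorm(300, sd = 0.2)

## Pay for bandwidth selection once
bw <- npregbw(y ~ x1 + x2, data = mydat, nmulti = 1)

## Inspect the selected bandwidths
bw$bw

## Store the object so later sessions do not repeat the search
bw.file <- file.path(tempdir(), "bw.rds")  # illustrative location
saveRDS(bw, bw.file)

## Later: reload and reuse the same bandwidths for more than one fit.
## The data named in the original call (mydat) must still be available.
bw.saved <- readRDS(bw.file)
fit <- npreg(bws = bw.saved)
fit.grad <- npreg(bws = bw.saved, gradients = TRUE)
```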

np: quantile runtime is driven more by the fit than by tau

In current versions of np and npRmpi, the older rule of thumb that extreme values of tau are materially slower no longer holds.

The quantile route now uses a one-dimensional numerical refinement step controlled by tol, small, and itmax, so the main runtime driver is usually the underlying conditional-distribution fit and evaluation problem, not whether tau is central or extreme.

Practical takeaway:

  • choose tau for the scientific question rather than for expected speed,
  • expect runtime to depend more on sample size, covariate dimension, and the conditional-distribution fit than on the specific quantile being requested.
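A quick way to check this on your own problem is to time a central and an extreme quantile against the same conditional-distribution bandwidth object. The sketch below uses simulated data, and the values passed to tol, small, and itmax are illustrative assumptions rather than recommended settings.

```r
library(np)

## Simulated heteroskedastic data (hypothetical example)
set.seed(7)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + (0.5 + x) * rnorm(n)

## The conditional-distribution bandwidth search is usually the costly step
bw <- npcdistbw(y ~ x, nmulti = 1)

## Time a central and an extreme quantile with identical refinement controls
t.mid <- system.time(
  q50 <- npqreg(bws = bw, tau = 0.50, tol = 1e-4, small = 1e-5, itmax = 10000)
)
t.ext <- system.time(
  q99 <- npqreg(bws = bw, tau = 0.99, tol = 1e-4, small = 1e-5, itmax = 10000)
)

## Elapsed seconds for each request
c(median = t.mid[["elapsed"]], extreme = t.ext[["elapsed"]])
```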

np: many variables and long formulas

If you have a large number of variables and the formula interface starts failing with an “improper formula” style message, the practical workaround is simple: use the data-frame interface, passing the regressors and response directly (for example, through xdat and ydat), instead of pushing a very long formula string.
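As a minimal sketch of the data-frame route, the regressors go in as a data frame and the response as a vector. The data and column names below are simulated for illustration; with many regressors, X would simply have more columns.

```r
library(np)

## Simulated regressors and response (hypothetical names)
set.seed(3)
n <- 250
X <- data.frame(
  x1 = runif(n),
  x2 = rnorm(n),
  x3 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
y <- X$x1 + 0.5 * X$x2 + (X$x3 == "b") + rnorm(n, sd = 0.3)

## No formula is built: pass the regressor frame and the response directly
bw <- npregbw(xdat = X, ydat = y, nmulti = 1)
fit <- npreg(bws = bw, txdat = X, tydat = y)
summary(fit)
```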

np: repeated interruptions and memory

If you repeatedly interrupt large jobs, R may hold on to memory that would otherwise have been released at normal completion. When that starts to bite, the practical fix is often just to restart R and begin a fresh session.

np: turn off status messages in batch work

For quiet runs:

## Silence routine np status messages in a controlled batch run
options(np.messages = FALSE)

If you also want to silence warnings for a controlled batch run, wrap the call in suppressWarnings(...).
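Both settings can be combined in a small wrapper that restores the previous option value when it finishes. The helper name run_quietly is hypothetical and not part of np; the body passed to it here is a placeholder standing in for a real estimation call.

```r
## Hypothetical helper: evaluate an expression with np status messages off
## and warnings suppressed, then restore the previous option value.
run_quietly <- function(expr) {
  old <- options(np.messages = FALSE)
  on.exit(options(old), add = TRUE)
  suppressWarnings(expr)
}

## Placeholder workload; in practice this would be an npregbw() or npreg() call
result <- run_quietly({
  warning("routine warning that the batch log does not need")
  sqrt(16)
})
result
```

Suppressing warnings hides real problems as well as routine ones, so this is best kept for runs whose behavior is already understood.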

np: sparse categorical designs

In some categorical settings, the design can be very sparse in the sense that there are far fewer unique support points than observations. That can create opportunities for custom speedups, but that is an advanced route rather than the normal first stop.

The practical recommendation is:

  • first get the model working with the standard high-level interface,
  • then only consider specialized sparse-design logic if the structure is genuinely repetitive and the runtime justifies the extra coding.
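Before investing in custom code, it is worth measuring how repetitive the design actually is. The following base R sketch counts unique support points relative to observations; the variables are simulated and their names are illustrative.

```r
## Simulated categorical design (hypothetical variables)
set.seed(11)
n <- 5000
xdat <- data.frame(
  region = factor(sample(c("north", "south", "east", "west"), n, replace = TRUE)),
  grade  = ordered(sample(1:3, n, replace = TRUE))
)

## Unique support points versus observations
support <- unique(xdat)
n.support <- nrow(support)
ratio <- n.support / nrow(xdat)
ratio

## Number of observations in each occupied support cell
cell.counts <- as.data.frame(table(xdat), responseName = "count")
cell.counts <- cell.counts[cell.counts$count > 0, ]
cell.counts
```

A ratio near zero means the same few support points recur many times, which is the situation where specialized logic might pay off.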

macOS: BLAS/LAPACK choice can matter

For some macOS users, a practical speed lever is the BLAS/LAPACK library that R is using, and the choice is up to the user.

The official CRAN macOS FAQ notes that current CRAN R binaries can provide both:

  • the reference BLAS shipped with R,
  • and Apple Accelerate’s vecLib BLAS.

It also notes that the active choice is controlled by the libRblas.dylib symlink and that sessionInfo() reports the BLAS in use.

Practical takeaway:

  • vecLib can speed up some linear-algebra-heavy workloads on Apple hardware,
  • on Apple Silicon, the gain can be dramatic because Accelerate may use Apple’s AMX matrix hardware for BLAS operations,
  • but gains are workflow-dependent rather than universal,
  • and for np, npRmpi, and crs, many expensive routes are driven as much by search, resampling, kernels, or MPI orchestration as by BLAS alone.

There is also a real caution. Accelerate/vecLib is not fork-safe. A January 2026 R-SIG-Mac report describes parallel failures in some mclapply() workloads that call BLAS operations such as crossprod(). That is why the CRAN macOS default reverted to the reference BLAS in R 4.5.3, even though Accelerate can be much faster on some Apple Silicon jobs.

So the best advice is:

  1. check the current BLAS with sessionInfo(),
  2. benchmark a representative script before and after changing it,
  3. be especially cautious if your workflow relies on fork-based parallelism.
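The first two steps can be sketched in base R. The crossprod() timing below is only a stand-in for a BLAS-heavy step and assumes arbitrary matrix dimensions; a representative np or crs script is the benchmark that actually matters.

```r
## Step 1: report the BLAS and LAPACK libraries in use
si <- sessionInfo()
si$BLAS
si$LAPACK
extSoftVersion()[["BLAS"]]

## Step 2: time a small BLAS-heavy operation; rerun after switching libraries
set.seed(123)
A <- matrix(rnorm(1000 * 500), nrow = 1000)
blas.time <- system.time(
  for (i in 1:3) XtX <- crossprod(A)
)[["elapsed"]]
blas.time
```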

When to move to npRmpi

Move to npRmpi when:

  • the serial np workflow is the right workflow,
  • the job is large enough that runtime has become the real bottleneck,
  • or you already know the workload belongs on an MPI-capable host.

For most users:

  • session / spawn is the cleanest first move on macOS and Linux,
  • attach is the right first move when the MPI world is already launched,
  • profile is the more explicit advanced route, especially on heterogeneous clusters.

See MPI and Large Data for the current mode map and quickstart scripts.

crs: search can be expensive too

With crs, the expensive part is often the search over degree, knots, basis structure, and categorical handling rather than the final fitted model alone.

If search feels too large:

  • restrict the basis dimension,
  • reduce degree.max or segments.max,
  • search over a smaller complexity class first,
  • or temporarily use additive structure when that is scientifically reasonable.

crs: if it seems to just sit there

Sometimes the right move is not to abort, but to ask the optimizer to tell you what it is doing.

## Ask the spline optimizer to display more of the search process
opts <- list("DISPLAY_DEGREE" = 3)
model <- crs(y ~ x1 + x2, opts = opts)

If that reveals that the search space is too large, then reduce the problem:

  • lower degree.max,
  • lower segments.max,
  • use complexity = "degree" or another narrower search,
  • or use basis = "additive" if that is a defensible modeling restriction.
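A reduced search might look like the sketch below. The data are simulated, and the caps of 3 on degree.max and segments.max are arbitrary illustrative values, not recommendations.

```r
library(crs)

## Simulated data (hypothetical example)
set.seed(5)
n <- 300
x1 <- runif(n)
x2 <- runif(n)
y <- sin(2 * pi * x1) + x2^2 + rnorm(n, sd = 0.2)

## Narrow the search: cap degree and segments, impose an additive basis,
## and use a single optimizer start
model <- crs(y ~ x1 + x2,
             degree.max = 3,
             segments.max = 3,
             basis = "additive",
             nmulti = 1)
summary(model)
```

Setting complexity = "degree" would narrow the search further by searching over degree only. Whether the additive restriction is acceptable is a modeling decision, not only a runtime one.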

crs: quiet runs

For quiet runs:

## Silence routine crs status messages in a controlled batch run
options(crs.messages = FALSE)

If you are working directly with snomadr, use an options list with DISPLAY_DEGREE = 0.

For the more package-specific version of this advice, see Spline Search and Tuning.

Practical triage

If a run is too slow or too heavy, work down this list:

  1. confirm the model on a small problem,
  2. make the bandwidth or tuning object explicit if possible,
  3. remove avoidable plotting or bootstrap overhead,
  4. simplify the search space,
  5. change execution mode only after the statistical workflow itself is settled.
