Runtime, Memory, and Scaling
This page collects the practical advice that matters once the method is right but the run becomes slow, memory-hungry, or otherwise awkward. The goal is not to promise that every job will be cheap. The goal is to help you decide what to simplify, what to tune, and when to change execution mode.
Why can these methods take time?
Because many of the key routines are doing real search, repeated fitting, or resampling rather than evaluating a closed-form expression once.
Common reasons a run is slow:
- bandwidth selection by cross-validation,
- multistart optimization,
- bootstrap intervals,
- multivariate fits,
- extreme quantiles,
- large categorical search spaces,
- spline search over degree and knot structure.
That is normal behavior, not necessarily a sign that something is broken.
A good default workflow
For np and npRmpi, a conservative sequence is:
- get the method right on a small serial run,
- inspect the fitted object and bandwidth object,
- simplify plotting or interval requests if needed,
- only then move to npRmpi if the serial workflow is right but too slow.
That sequence avoids introducing MPI complexity before the basic model is settled.
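The first two steps of that sequence can be sketched as a pilot run on a random subsample. This is an illustration under stated assumptions rather than a prescribed recipe: the data are simulated, and the names mydat, pilot, and n_pilot are hypothetical.

```r
library(np)

# Simulated data standing in for a real problem (assumption for illustration)
set.seed(42)
n <- 2000
mydat <- data.frame(x1 = runif(n), x2 = runif(n))
mydat$y <- sin(2 * pi * mydat$x1) + mydat$x2^2 + rnorm(n, sd = 0.2)

# Pilot: a random subsample keeps cross-validation cheap while the
# model specification is still being settled
n_pilot <- 200
pilot <- mydat[sample(n, n_pilot), ]

# Time the expensive step (bandwidth selection) on the pilot sample
pilot_time <- system.time(
  bw_pilot <- npregbw(y ~ x1 + x2, data = pilot)
)
fit_pilot <- npreg(bws = bw_pilot, data = pilot)

summary(bw_pilot)    # inspect the bandwidth object
summary(fit_pilot)   # inspect the fitted object
pilot_time[["elapsed"]]
```

The pilot bandwidths are useful for checking that the specification behaves sensibly. They are generally not a substitute for bandwidths selected on the full sample, because data-driven bandwidths depend on sample size.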
np: bandwidth selection and large jobs
Bandwidth selection is often the expensive part. A practical habit is to make the bandwidth object explicit:
## Separate bandwidth selection from fitting when the run may be expensive
bw <- npregbw(y ~ x1 + x2, data = mydat)
fit <- npreg(bws = bw, data = mydat)
That makes it easier to:
- inspect the selected bandwidths,
- reuse them across later fits,
- avoid recomputing the same object repeatedly.
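One way to get that reuse in practice is to save the bandwidth object to disk with base R's saveRDS() and read it back before fitting. The sketch below uses simulated data, and bw_file is a hypothetical temporary path; it assumes the data frame is still available under the same name when the saved object is used.

```r
library(np)

# Simulated data (assumption for illustration)
set.seed(1)
n <- 150
mydat <- data.frame(x1 = runif(n), x2 = runif(n))
mydat$y <- mydat$x1 + cos(3 * mydat$x2) + rnorm(n, sd = 0.1)

# Expensive step: run cross-validation once
bw <- npregbw(y ~ x1 + x2, data = mydat)
bw$bw   # the selected bandwidths, one per regressor

# Persist the bandwidth object so later runs can skip the search
bw_file <- tempfile(fileext = ".rds")   # hypothetical location
saveRDS(bw, bw_file)

# Later: reload and fit without repeating bandwidth selection
bw_saved <- readRDS(bw_file)
fit <- npreg(bws = bw_saved, data = mydat)
```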
np: quantile runtime is driven more by the fit than by tau
On the current np and npRmpi packages, the older rule of thumb that extreme values of tau are materially slower is no longer the right mental model.
The quantile route now uses a one-dimensional numerical refinement step controlled by tol, small, and itmax, so the main runtime driver is usually the underlying conditional-distribution fit and evaluation problem, not whether tau is central or extreme.
Practical takeaway:
- choose tau for the scientific question rather than for expected speed,
- expect runtime to depend more on sample size, covariate dimension, and the conditional-distribution fit than on the specific quantile being requested.
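A quick way to check this on your own problem is to select the conditional-distribution bandwidths once and then time the quantile fit at several values of tau. The sketch below uses simulated data; qdat, taus, and timings are hypothetical names, and the timings it produces will vary by machine and package version.

```r
library(np)

# Simulated heteroskedastic data (assumption for illustration)
set.seed(7)
n <- 100
qdat <- data.frame(x = runif(n))
qdat$y <- 1 + 2 * qdat$x + rnorm(n, sd = 0.5 + qdat$x)

# The conditional-distribution bandwidths are selected once and shared
bw <- npcdistbw(y ~ x, data = qdat)

# Time the quantile fit at a central and two more extreme values of tau
taus <- c(0.05, 0.5, 0.95)
timings <- sapply(taus, function(tau) {
  system.time(npqreg(bws = bw, tau = tau))[["elapsed"]]
})
names(timings) <- paste0("tau=", taus)
timings
```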
np: many variables and long formulas
If you have a large number of variables and the formula interface starts failing with an “improper formula” style message, the practical workaround is simple: use the data-frame interface instead of pushing a very long formula string.
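A minimal sketch of the data-frame interface follows, using the xdat and ydat arguments of npregbw() and the txdat and tydat arguments of npreg(). The data are simulated and deliberately small; X and y are hypothetical names, and with many regressors X would simply have more columns.

```r
library(np)

# Simulated regressors and response (assumption for illustration)
set.seed(3)
n <- 150
X <- data.frame(
  x1 = runif(n),
  x2 = runif(n),
  x3 = factor(sample(c("a", "b"), n, replace = TRUE))
)
y <- X$x1 - X$x2 + (X$x3 == "b") + rnorm(n, sd = 0.2)

# Pass the regressors as a data frame and the response as a vector,
# so no formula has to be built or parsed
bw <- npregbw(xdat = X, ydat = y)
fit <- npreg(bws = bw, txdat = X, tydat = y)

summary(fit)
```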
np: repeated interruptions and memory
If you repeatedly interrupt large jobs, R may hold on to memory that would otherwise have been released at normal completion. When that starts to bite, the practical fix is often just to restart R and begin a fresh session.
np: turn off status messages in batch work
For quiet runs:
## Silence routine np status messages in a controlled batch run
options(np.messages = FALSE)
If you also want to silence warnings for a controlled batch run, wrap the call in suppressWarnings(...).
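For batch scripts it can be convenient to scope both settings to a single call and restore the previous option afterwards. The helper below is a hypothetical convenience function, not part of np; it relies only on base R and is demonstrated with a stand-in expression rather than a real fit.

```r
# Hypothetical helper: evaluate an expression with np status messages off
# and warnings suppressed, then restore the previous option value
run_quietly <- function(expr) {
  old <- options(np.messages = FALSE)
  on.exit(options(old), add = TRUE)
  suppressWarnings(expr)
}

# Stand-in expression; in real use this would be an np call
result <- run_quietly({
  warning("this warning is suppressed")
  sqrt(16)
})
result
```

Suppressing warnings wholesale hides real problems as well as noise, so this is best reserved for runs whose behavior has already been checked interactively.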
np: sparse categorical designs
In some categorical settings, the design can be very sparse in the sense that there are far fewer unique support points than observations. That can create opportunities for custom speedups, but that is an advanced route rather than the normal first stop.
The practical recommendation is:
- first get the model working with the standard high-level interface,
- then only consider specialized sparse-design logic if the structure is genuinely repetitive and the runtime justifies the extra coding.
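Before investing in custom logic, it helps to measure how repetitive the design actually is. The base R sketch below counts unique support points against observations for simulated categorical data; catdat, support, and cell_counts are hypothetical names.

```r
# Simulated categorical design (assumption for illustration)
set.seed(11)
n <- 5000
catdat <- data.frame(
  region = factor(sample(c("north", "south", "east", "west"), n, replace = TRUE)),
  grade  = ordered(sample(1:3, n, replace = TRUE))
)

# Unique support points versus number of observations
support <- unique(catdat)
n_support <- nrow(support)
c(observations = n, support_points = n_support, ratio = n / n_support)

# Number of observations falling on each occupied support point
cell_counts <- as.data.frame(table(catdat), responseName = "count")
cell_counts <- cell_counts[cell_counts$count > 0, ]
head(cell_counts)
```

A large ratio of observations to support points suggests that evaluating on the unique cells and weighting by their counts could pay off. A ratio near one suggests it will not.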
macOS: BLAS/LAPACK choice can matter
For some macOS users, a practical speed lever is the BLAS/LAPACK library that R is using, and the choice is up to the user.
The official CRAN macOS FAQ notes that current CRAN R binaries can provide both:
- the reference BLAS shipped with R,
- and Apple Accelerate’s vecLib BLAS.
It also notes that the active choice is controlled by the libRblas.dylib symlink and that sessionInfo() reports the BLAS in use.
Practical takeaway:
- vecLib can speed up some linear-algebra-heavy workloads on Apple hardware,
- on Apple Silicon, the gain can be dramatic because Accelerate may use Apple’s AMX matrix hardware for BLAS operations,
- but gains are workflow-dependent rather than universal,
- and for np, npRmpi, and crs, many expensive routes are driven as much by search, resampling, kernels, or MPI orchestration as by BLAS alone.
There is also a real caution. Accelerate/vecLib is not fork-safe. A January 2026 R-SIG-Mac report describes parallel failures in some mclapply() workloads that call BLAS operations such as crossprod(). That is why the CRAN macOS default reverted to the reference BLAS in R 4.5.3, even though Accelerate can be much faster on some Apple Silicon jobs.
So the best advice is:
- check the current BLAS with sessionInfo(),
- benchmark a representative script before and after changing it,
- be especially cautious if your workflow relies on fork-based parallelism.
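The first two checks can be sketched in base R as follows. The crossprod() timing is only a stand-in for a BLAS-heavy step, and the matrix size is an arbitrary assumption; a representative script from your own workflow is the better benchmark.

```r
# Report which BLAS and LAPACK libraries this R session is using
si <- sessionInfo()
si$BLAS
si$LAPACK

# Stand-in BLAS-heavy workload: repeated cross-products of a dense matrix
set.seed(123)
X <- matrix(rnorm(1000 * 300), nrow = 1000)
blas_time <- system.time(
  for (i in 1:5) XtX <- crossprod(X)
)

# Record this value, switch BLAS, restart R, and rerun to compare
blas_time[["elapsed"]]
```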
Official references:
- CRAN macOS FAQ: Which BLAS is used and how can it be changed?
- R-SIG-Mac discussion: Parallel errors with default BLAS (vecLib) in R 4.5.2 for arm64
When to move to npRmpi
Move to npRmpi when:
- the serial np workflow is the right workflow,
- the job is large enough that runtime has become the real bottleneck,
- or you already know the workload belongs on an MPI-capable host.
For most users:
- session/spawn is the cleanest first move on macOS and Linux,
- attach is the right first move when the MPI world is already launched,
- profile is the more explicit advanced route, especially on heterogeneous clusters.
See MPI and Large Data for the current mode map and quickstart scripts.
crs: search can be expensive too
With crs, the expensive part is often the search over degree, knots, basis structure, and categorical handling rather than the final fitted model alone.
If search feels too large:
- restrict the basis dimension,
- reduce degree.max or segments.max,
- search over a smaller complexity class first,
- or temporarily use additive structure when that is scientifically reasonable.
crs: watch memory during search
If the basis can become very large, restricting the minimum degrees of freedom can help keep the search from wandering into impractical models.
A common practical control is cv.df.min, which lets you stop the search from considering bases that are simply too large to be sensible for the problem.
crs: if it seems to just sit there
Sometimes the right move is not to abort, but to ask the optimizer to tell you what it is doing.
## Ask the spline optimizer to display more of the search process
opts <- list("DISPLAY_DEGREE" = 3)
model <- crs(y ~ x1 + x2, opts = opts)
If that reveals that the search space is too large, then reduce the problem:
- lower degree.max,
- lower segments.max,
- use complexity = "degree" or another narrower search,
- or use basis = "additive" if that is a defensible modeling restriction.
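A reduced search along those lines might look like the sketch below. The data are simulated, sdat and model_small are hypothetical names, and the specific limits are arbitrary. The restrictions are combined here for brevity; in practice each can be applied on its own.

```r
library(crs)

# Simulated data (assumption for illustration)
set.seed(5)
n <- 200
sdat <- data.frame(x1 = runif(n), x2 = runif(n))
sdat$y <- sin(2 * pi * sdat$x1) + sdat$x2 + rnorm(n, sd = 0.2)

# Narrower search: cap the spline degree, search over degree only,
# and restrict the basis to an additive structure
model_small <- crs(
  y ~ x1 + x2,
  data = sdat,
  degree.max = 4,
  complexity = "degree",
  basis = "additive"
)

summary(model_small)
```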
crs: quiet runs
For quiet runs:
## Silence routine crs status messages in a controlled batch run
options(crs.messages = FALSE)
If you are working directly with snomadr, use an options list with DISPLAY_DEGREE = 0.
For the more package-specific version of this advice, see Spline Search and Tuning.
Practical triage
If a run is too slow or too heavy, work down this list:
- confirm the model on a small problem,
- make the bandwidth or tuning object explicit if possible,
- remove avoidable plotting or bootstrap overhead,
- simplify the search space,
- change execution mode only after the statistical workflow itself is settled.