Boundary Corrected Adaptive Conditional Kernel Density Estimation

Jeffrey S. Racine
Department of Economics, Graduate Program in Statistics, McMaster University

Angda Li
Department of Economics, Texas A&M University

Qi Li
Department of Economics, Texas A&M University

Updated March 12, 2026

Slide Pro-Tips

  • View full screen in A2L by clicking on the slide and then pressing the F key (press the Esc key to revert)

  • Access navigation menu by pressing the M key (click X in navigation menu to close)

  • Advance using arrow keys

  • Zoom in by holding down the Alt key in MS Windows, Opt key in macOS or Ctrl key in Linux, and clicking on any screen element (Alt/Opt/Ctrl click again to zoom out)

  • Use copy to clipboard button for R code blocks (upper right in block) to copy and paste into R/RStudio

  • Export to a PDF by pressing the E key (wait a few seconds, then print [or print using system dialog], enable landscape layout, then save as PDF - press the E key to revert)

  • Enable drawing tools - chalk board by pressing the B key (B to revert), notes canvas by pressing the C key (C to revert), press the Del key to erase, press the D key to download drawings

Overview of Talk

  • This talk is about a proposed boundary-adaptive, polynomial-order-adaptive conditional kernel density estimator and its finite-sample performance

  • I am aiming for an interactive seminar format, which will require either your participation or your sitting next to someone who is participating

  • Hopefully you came with your laptop and have the Firefox browser, R/RStudio, and the suggested libraries installed (if you have your laptop but didn’t get the memo, you can install them now)

  • If you have R/RStudio and the libraries installed, you can work with the methods and data interactively as the talk proceeds

  • My hope is that you will follow along and reproduce the results on your own computer

  • Let’s begin by opening the following link in Firefox:

    JeffreyRacine.github.io/bk

Overview of Talk

  • We will cover the following topics:

    1. Unconditional kernel density estimation with/without bounded data
    2. Conditional kernel density estimation
    3. Conditional local polynomial kernel density estimation
    4. Conditional density estimation, bandwidth and polynomial order selection
    5. Conditional density estimation, bandwidth and polynomial order selection with/without bounded data
    6. Theoretical underpinnings, uniform consistency results, pointwise, Bonferroni, and simultaneous confidence intervals
  • An R package is available at:

    github.com/JeffreyRacine/bkcde

Unconditional Density Estimation with Bounded Data

KDE with Bounded Data

  • Kernel density estimation has a rich history (Rosenblatt 1956; Parzen 1962)

  • Assuming smoothness, an unknown density \(f(x)\) can be estimated using a local average of weight (“kernel”) functions, \[\begin{equation*} \hat f(x)=\frac{1}{n}\sum_{i=1}^n \frac{1}{h_x}K\left(\frac{x-X_i}{h_x}\right), \end{equation*}\] where \(K(\cdot)\) is a kernel function, \(h_x\) is a bandwidth, and \(X_i\) are the data

  • However, in the presence of finite bounds (e.g., \(X\in[a,b]\), \(a<b\), \(|a|<\infty\), \(|b|<\infty\)) “classical” kernels (e.g., Gaussian, Epanechnikov) can behave poorly at the boundary of support; \(\hat f(x)\) may converge very slowly at or near boundaries, or even be inconsistent

  • This artifact is known as the “edge” or “boundary” effect

  • Let’s examine this feature via some interactive illustrations
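Before turning to the illustrations, the estimator on the previous slide can be computed directly from its definition. The sketch below (the function name `kde` and the rule-of-thumb bandwidth `bw.nrd0` are illustrative choices, not part of any package) implements \(\hat f(x)\) with a Gaussian kernel:

```r
## A minimal direct implementation of f.hat(x) = (1/n) sum_i (1/h) K((x - X_i)/h)
## with a Gaussian kernel K = dnorm; the bandwidth uses Silverman's rule of
## thumb (bw.nrd0) purely for illustration
kde <- function(x, X, h) {
  sapply(x, function(x0) mean(dnorm((x0 - X) / h) / h))
}
set.seed(42)
X <- rnorm(500)
h <- bw.nrd0(X)
kde(0, X, h)   # roughly dnorm(0) = 0.399 for N(0,1) data
```

For interior points of an unbounded density this simple average behaves well; the boundary problems discussed next arise when the support is bounded.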

Example of KDE with Bounded Data - Boundary Bias

We simulate two datasets, one \(N(0,1)\) and one \(U[0,1]\) (a “worst case” scenario for classical kernel density estimation), and use data-driven methods of bandwidth selection with a “classical” Gaussian kernel to estimate unknown smooth densities, \(f(x)\)

Example (\(N(0,1)\) (left figure), \(U[0,1]\) (right figure), R function density())

set.seed(42)
## Simulate n=1000 observations from N(0,1) and U[0,1]
n <- 1000
par(mfrow=c(1,2))
x <- sort(rnorm(n))
## Plot the density using R's density() for N(0,1)
?density
plot(density(x),main="Gaussian Data")
lines(x,dnorm(x),col=2,lwd=2,lty=2)
legend("topleft",c("DGP","density()"),col=c(2,1),lty=c(2,1),bty="n")
rug(x)
x <- sort(runif(n))
## Plot the density using R's density() for U[0,1]
plot(density(x),main="Uniform Data")
lines(x,dunif(x),col=2,lwd=2,lty=2)
segments(0,0,0,1,lty=2,col=2,lwd=2)
segments(1,0,1,1,lty=2,col=2,lwd=2)
legend("topleft",c("DGP","density()"),col=c(2,1),lty=c(2,1),bty="n")
rug(x)

Example of KDE with Bounded Data - Boundary Bias

Figure 1: Simulated data from \(N(0,1)\) and \(U[0,1]\) with classical KDE

KDE with Bounded Data - Reflection, Transformation

  • Boundary bias arises mostly when there is substantial probability mass at a support boundary

  • Boundary bias when (finite) bounds \([a,b]\) are known a priori has been widely studied

  • A number of procedures exist, including data-reflection, data-transformation and the use of kernel carpentry (i.e., so-called “boundary kernels”)

  • Data-reflection involves duplicating symmetrically (i.e., reflecting) data around its boundary, running standard bandwidth selection and kernel estimation, then adjusting the resulting estimate to ensure it is proper

  • Data-transformation involves some mathematical transform of the data that, when rescaled, has the desired effect
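The data-reflection idea above can be sketched in a few lines. This is a minimal illustration for a known lower bound \(a=0\) using Exp(1) data (the function names `f.reflected` and `f.naive` are ad hoc, not package functions):

```r
## Sketch of data-reflection at a known lower bound a = 0: reflect the sample
## about the boundary, run standard estimation on the augmented sample, then
## double the estimate on [0, Inf) so that it integrates to one
set.seed(42)
x <- rexp(1000)                     # data bounded below at 0, f(0) = 1
x.reflected <- c(x, -x)             # duplicate the data symmetrically
d <- density(x.reflected)
f.reflected <- function(x0) 2 * approx(d$x, d$y, x0)$y
## The uncorrected estimate for comparison
f.naive <- function(x0) approx(density(x)$x, density(x)$y, x0)$y
f.reflected(0)                      # much closer to the true value 1
f.naive(0)                          # roughly half the true value at the boundary
```

The naive estimate "leaks" probability mass past the boundary and is badly biased at \(x=0\); reflection restores most of the lost mass.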

KDE with Bounded Data - Carpentry

  • Kernel carpentry constructs kernels that adjust to the presence of boundaries

  • A range of left- and right-bounded kernels have been proposed for three cases:

    • \([a,b]=[a_0,\infty]\), \(|a_0|<\infty\)
    • \([a,b]=[-\infty,b_0]\), \(|b_0|<\infty\)
    • \([a,b]=[a_0,b_0]\), \(|a_0|<\infty\) and \(|b_0|<\infty\), \(a_0<b_0\)
  • Some proposed estimators even require the use of different kernel types near the boundary versus in the interior of support, and also different degrees of smoothing in boundary regions (“floating-boundary”, Scott (1992))

  • To some degree, all of these methods (reflection, transformation, carpentry) can reduce the bias near a boundary to the same order as that in the interior of the support, which is free from boundary effects

KDE with Bounded Data - Carpentry

  • A variety of boundary kernel functions have been proposed in the literature

  • Regarding the use of boundary kernels, Scott (1992), p 149, notes:

    While boundary kernels can be very useful, there are potentially serious problems with real data. There are an infinite number of boundary kernels reflecting the spectrum of possible design constraints, and these kernels are not interchangeable. Severe artifacts can be introduced by any one of them in inappropriate situations. Very careful examination is required to avoid being victimized by the particular boundary kernel chosen. Artifacts can unfortunately be introduced by the choice of the support interval for the boundary kernel.

KDE with Bounded Data - A Priori Information

  • It bears noting that virtually all existing boundary kernels require the bounds to be known a priori

  • Furthermore, they typically involve weight functions bounded on finite intervals \([a,b]\) (e.g., the Beta (\(X\in[a,b]\)) and Gamma kernels (\(X\in[a,\infty]\)) where \(a\) and \(b\) are finite)

  • A more recent addition is the Gaussian kernel (\(X\in[a,b]\), \(|a|\le\infty\), \(|b|\le\infty\), \(a<b\)) of Racine, Li, and Wang (2024)

  • Unlike its peers, however, the kernel proposed by Racine, Li, and Wang (2024),

    • is applicable whether or not bounds are known a priori
    • allows \(a\) and/or \(b\) to be infinite

KDE with Bounded Data - Shrinkage

  • If bounds are known, naturally one should use that information

  • But in the presence of unknown bounds, Racine, Li, and Wang (2024) considered a simple kernel function that can automatically adapt (more shortly) and nests the classical Gaussian kernel with \(X\in[-\infty,\infty]\) as a special case

  • They considered a classical method of bandwidth selection and demonstrated that it can deliver consistent estimates even when \(h_x\to\infty\) as \(n\to\infty\) is appropriate, which violates classical conditions for consistency

  • This result is the unconditional counterpart to the conditional result provided by Hall, Racine, and Li (2004) who showed that cross-validated bandwidth selection “automatically determines which components are relevant and which are not, through assigning large smoothing parameters to the latter and consequently shrinking them toward the uniform distribution on the respective marginals”

  • Here we observe shrinkage in the Bayesian sense (Kiefer and Racine 2009, 2015)

KDE with Bounded Data - Bandwidths and Shrinkage

  • For \(X\in [a,b]\) and kernel \(K(z)\) (e.g., Gaussian or Epanechnikov), a simple boundary kernel \(K_{[a,b]}(z)\) is constructed as follows (Racine, Li, and Wang 2024): \[\begin{equation*} K_{[a,b]}(z)=\left\{ \begin{array}{cl} \frac{K(z)}{G(z_b)-G(z_a)}&\text{ if } z_a\le z\le z_b,\\ 0 & \text{ otherwise} \end{array} \right. \end{equation*}\]

  • Here, \(z=(X_i-x)/h_x\), \(z_b=(b-x)/h_x\), \(z_a=(a-x)/h_x\), \(a<b\), and \(G(z)=\int_{-\infty}^z K(t)\,dt\) (for a Gaussian kernel this is the “doubly truncated Gaussian”); note that \(z_a\le z\le z_b\) is equivalent to \(X_i\in[a,b]\)

  • The key to this kernel formulation lies in its adaptability:

    • If \(K(z)\) has infinite support then \(K_{[a,b]}(z)\) seamlessly admits infinite support situations without modification (i.e., \([a,b]=[-\infty,\infty]\), \([a,b]=[-\infty,b_0]\), \([a,b]=[a_0,\infty]\), \([a,b]=[a_0,b_0]\) for \(|a_0|<\infty\) and \(|b_0|<\infty\), \(a_0<b_0\))
    • If the bounds are unknown, the kernel can automatically adapt to the range of the data
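A minimal sketch of this kernel with a Gaussian base kernel, using `pnorm` for \(G\) (the function name `K.ab` is illustrative, not the package's implementation):

```r
## Sketch of the boundary-adaptive kernel K_[a,b](z): the base kernel K = dnorm
## is renormalized by G(z_b) - G(z_a) on [z_a, z_b] and is zero otherwise,
## where G = pnorm is the Gaussian CDF
K.ab <- function(z, z.a, z.b) {
  ifelse(z >= z.a & z <= z.b, dnorm(z) / (pnorm(z.b) - pnorm(z.a)), 0)
}
## With [a, b] = [-Inf, Inf] the classical Gaussian kernel is recovered
K.ab(0, -Inf, Inf)                                              # dnorm(0)
## Near a boundary the renormalization restores unit mass on [z.a, z.b]
integrate(K.ab, lower = 0, upper = 3, z.a = 0, z.b = 3)$value   # 1
```

The two checks illustrate the adaptability described above: no modification is needed for infinite supports, and truncation plus renormalization handles finite bounds.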

Unconditional KDE in R - CPS Age Distribution

Let’s consider an illustration where we estimate the age distribution for working-age males from a random sample taken from the 1971 Canadian Census Public Use Tapes (the cps71 data in the np package) using the classical kernel and that of Racine, Li, and Wang (2024) when bounds are not assumed to be known a priori (we use the empirical support \([\min(X),\max(X)]\))

Example (Canadian High School Graduate Age Distribution 21-65)

library(np)
## Display help for npuniden.boundary
?npuniden.boundary
data(cps71)
attach(cps71)
## Use npuniden.boundary() with unknown bounds
model <- npuniden.boundary(age)
par(mfrow=c(1,1))
hist(age,prob=TRUE,main="")
## Plot the boundary-corrected KDE
lines(age,model$f)
## Plot the classical density using R's density()
lines(density(age),col=2)
legend("topright",c("Boundary Kernel (unknown bounds)","Classical density()"),lty=c(1,1),lwd=c(1,1),col=1:2,bty="n")

Unconditional KDE in R - CPS Age Distribution

Figure 2: Age distribution of Canadian high school graduates

Unconditional KDE in R - Simulated Data

Now let’s consider some simulated data drawn from the \(U[0,1]\) (a “worst case” scenario for classical kernel density estimation) and assess the shrinkage phenomenon

Example (\(N(0,1)\) (left figure), \(U[0,1]\) (right figure), R function density())

## Consider the previous simulated illustration and add the boundary-corrected
## KDE
library(np)
set.seed(42)
## Simulate n=1000 observations from N(0,1) and U[0,1]
n <- 1000
par(mfrow=c(1,2),cex=.8)
x <- sort(rnorm(n))
model <- npuniden.boundary(x)
## Plot the density using R's density() for N(0,1)
plot(density(x),main="Gaussian Data")
lines(x,dnorm(x),col=2,lwd=2,lty=2)
lines(x,model$f,col=3,lwd=3,lty=2)
legend("topleft",c("DGP","density()","Boundary Kernel (unknown bounds)"),col=c(2,1,3),lwd=c(1,2,3),lty=c(2,1,2),bty="n")
rug(x)
x <- sort(runif(n))
model <- npuniden.boundary(x)
## Plot the density using R's density() for U[0,1]
plot(density(x),main="Uniform Data")
lines(x,dunif(x),col=2,lwd=2,lty=2)
segments(0,0,0,1,lty=2,col=2,lwd=2)
segments(1,0,1,1,lty=2,col=2,lwd=2)
lines(x,model$f,col=3,lwd=3,lty=2)
legend("topleft",c("DGP","density()","Boundary Kernel (unknown bounds)"),lwd=c(1,2,3),col=c(2,1,3),lty=c(2,1,2),bty="n")
rug(x)

Unconditional KDE in R - Simulated Data

Figure 3: Simulated data from \(N(0,1)\) and \(U[0,1]\) with boundary-corrected KDE

Conditional Density Estimation

Conditional KDE

  • The conditional PDF \(f(y|x)=f(y,x)/f(x)\) is a fundamental object in statistics

  • Related objects include

    • The conditional CDF \(F(y|x)=\int_{-\infty}^y f(t|x)dt=\int_{-\infty}^y f(t,x)dt/f(x)\)
    • The conditional expectation \(E(Y|X=x)=\int_{-\infty}^{\infty} yf(y|x)dy\)
    • The conditional quantile \(Q_{\tau}(x)=\inf\{y:F(y|x)\ge \tau\}\), etc.
  • Also, derivatives are often required for constructing marginal effects, e.g. \(d E(Y|X=x)/d x\), along with counterparts for the conditional distribution function and quantile function

  • Sound nonparametric estimation must address a range of issues including boundary bias, bandwidth selection, choice of kernel function, and the choice of the order of the local polynomial approximation

Conditional KDE

  • We consider the estimation of the conditional density \(f(y|x)=f(y,x)/f(x)\) when \(Y\in[a,b]\)

  • The classical case is when \(X\) is continuous and \(Y\) is continuous lying in \([-\infty,\infty]\), and we are interested in estimating \(f(y|x)\)

  • The classical estimator replaces \(f(y,x)\) and \(f(x)\) with their kernel estimates and divides the former by the latter, keeping the same bandwidth \(h_x\) in the numerator and denominator; it can also be derived as the solution to a local polynomial approximation problem

  • The resulting kernel density estimator of the conditional PDF with \(Y\in\mathbb{R}^1\) and \(X\in\mathbb{R}^1\) is given by \[\begin{equation*} \hat f(y|x) = \frac{\hat f(y,x)}{\hat f(x)}=\frac{\sum_{i=1}^nh_y^{-1}h_x^{-1}K\left(\frac{y-Y_i}{h_y}\right)K\left(\frac{x-X_i}{h_x}\right)}{\sum_{i=1}^nh_x^{-1}K\left(\frac{x-X_i}{h_x}\right)} \end{equation*}\]
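The estimator above can be computed directly from its definition. A minimal sketch with Gaussian kernels (the function name `cde` and the rule-of-thumb bandwidths are illustrative assumptions, not package code):

```r
## Direct implementation of the conditional density estimator
## f.hat(y|x) = sum_i K_hy(y - Y_i) K_hx(x - X_i) / sum_i K_hx(x - X_i)
## with Gaussian kernels
cde <- function(y, x, Y, X, h.y, h.x) {
  w <- dnorm((x - X) / h.x) / h.x
  sum(dnorm((y - Y) / h.y) / h.y * w) / sum(w)
}
set.seed(42)
n <- 1000
X <- runif(n)
Y <- rnorm(n, mean = X, sd = 0.25)
cde(0.5, 0.5, Y, X, h.y = bw.nrd0(Y), h.x = bw.nrd0(X))
## compare with the true value dnorm(0.5, mean = 0.5, sd = 0.25), roughly 1.6
```

Note that, by construction, the estimate integrates to one over \(y\) for each fixed \(x\), since it is a weighted average of kernel densities in \(y\).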

Local Polynomial Conditional KDE

  • This estimator is also the solution to a locally weighted least squares regression of \(h_y^{-1}K\left(\frac{y-Y_i}{h_y}\right)\) on a constant \(\alpha\) using local weights \(h_x^{-1}K\left(\frac{x-X_i}{h_x}\right)\), i.e., \[\begin{equation*} \hat \alpha = \operatorname{argmin}_{\alpha}\sum_{i=1}^n\left(h_y^{-1}K\left(\frac{y-Y_i}{h_y}\right)-\alpha\right)^2h_x^{-1}K\left(\frac{x-X_i}{h_x}\right), \end{equation*}\] which delivers \(\hat\alpha=\hat f(y|x)\) identical to that from the previous approach

  • The local polynomial extension of this estimator (Fan, Yao, and Tong 1996) is given by \[\begin{equation*} \hat \alpha = \operatorname{argmin}_{\alpha}\sum_{i=1}^n\left(h_y^{-1}K\left(\frac{y-Y_i}{h_y}\right)-\sum_{j=0}^p\alpha_j(x-X_i)^j\right)^2h_x^{-1}K\left(\frac{x-X_i}{h_x}\right), \end{equation*}\] which delivers \(\hat\alpha_0=\hat f(y|x)\), a \(p\)-th order local polynomial estimator
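The minimization above is a weighted least squares problem, so it can be sketched with `lm()` (the function name `lp.cde` is illustrative; regressing on \((X_i-x)^j\) rather than \((x-X_i)^j\) flips the signs of odd-order coefficients but leaves the intercept \(\hat\alpha_0\) unchanged):

```r
## Sketch of the p-th order local polynomial conditional density estimator:
## regress the kernel-smoothed response K_hy(y - Y_i)/h_y on powers of
## (X_i - x) with kernel weights in x; the fitted intercept is f.hat(y|x)
lp.cde <- function(y, x, Y, X, h.y, h.x, p = 1) {
  r <- dnorm((y - Y) / h.y) / h.y        # kernel-smoothed response in y
  w <- dnorm((x - X) / h.x) / h.x        # local kernel weights in x
  if (p == 0) return(weighted.mean(r, w))
  fit <- lm(r ~ poly(X - x, degree = p, raw = TRUE), weights = w)
  unname(coef(fit)[1])                   # alpha.0 = f.hat(y|x)
}
```

For \(p=0\) this reduces to the local constant estimator on the previous slide.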

Local Polynomial Conditional KDE

  • The local polynomial estimator is often adopted for its boundary properties with odd-order polynomials \(p\ge 1\)

  • Specifically, the bias at the boundary is of the same order as that in the interior of the support, unlike the classical estimator with \(p=0\)

  • But, the local polynomial estimator is not without its own problems, in particular the estimated density can be negative and fail to integrate to unity (i.e., is not guaranteed to be proper)

  • Then there is the conventional wisdom about polynomial order selection (Fan and Gijbels 1995, p 59; Fan and Gijbels 1996):

    “Another issue in local polynomial fitting is the choice of the order of the local polynomial. Since the modelling bias is primarily controlled by the bandwidth, this issue is less crucial however. […] Since the bandwidth is used to control the modelling complexity, we recommend the use of the lowest odd order, i.e. \(p=\nu+1\), or occasionally \(p=\nu+3\). [\(\nu\) is the order of the derivative required].”

Conditional Density Estimation:
Bandwidth and Polynomial Order Selection

Conditional KDE - Bandwidth Selection

  • Classical least squares cross-validation is a fully automatic and data-driven method of selecting the bandwidth with \(Y\in[-\infty,\infty]\) and fixed polynomial order (Rudemo 1982; Bowman 1984; Hall, Racine, and Li 2004)

  • This method is based on the principle of selecting bandwidths that minimize the integrated square error, ISE(\(h_x\),\(h_y\)), of the resulting estimate

  • The integrated squared difference between \(\hat f(y|x)\) and \(f(y|x)\) is \[\begin{equation*} \int \left(\hat f(y|x) - f(y|x)\right)^2\, dy = \int \hat f(y|x)^2\, dy - 2\int \hat f(y|x) f(y|x)\, dy + \int f(y|x)^2\, dy \end{equation*}\]

  • The third term can be ignored since it does not depend on \(h_x\) or \(h_y\)

  • Replacing the first two terms with their sample counterparts and adjusting for bias, and fixing the polynomial order, we can minimize ISE(\(h_x\),\(h_y\)) with respect to \(h_x\) and \(h_y\)

Conditional KDE - Bandwidth and Order Selection

  • Hall and Racine (2015) consider local polynomial kernel regression where, jointly, the polynomial order and bandwidths are selected via cross-validation

  • They argue that the polynomial order cannot be fixed at an ad hoc value since the optimal order depends on the unknown DGP, as do the bandwidths, contrary to the conventional wisdom (Fan and Gijbels 1995, p 59; Fan and Gijbels 1996)

  • This joint optimization over the polynomial order vector (integers) and the bandwidth vector (positive real scalars) is a mixed-integer problem that can be solved using the NOMAD algorithm (Le Digabel 2011; Abramson et al. 2022)

  • We extend this to the conditional density estimation case, where the choice of polynomial order is also crucial

  • We minimize ISE(\(p\), \(h_x\),\(h_y\)), while the classical approach minimizes ISE(\(h_x\),\(h_y\)) for ad hoc \(p\)

Conditional Density Estimation - Bandwidth and Polynomial Order Selection

Conditional KDE - Bandwidth and Order Selection with Bounded \(Y\)

  • We consider the case where \(Y\in[a,b]\) and \(X\in\mathbb{R}^1\), and allow for the bounds to be known or unknown a priori

  • The local polynomial estimator using the boundary adaptive kernel \(K_{[a,b]}(z_y)\) is \(\hat\alpha_0=\hat f(y|x)\) from the solution of \[\begin{equation*} \hat \alpha = \operatorname{argmin}_{\alpha}\sum_{i=1}^n\left(h_y^{-1}K_{[a,b]}\left(\frac{y-Y_i}{h_y}\right)-\sum_{j=0}^p\alpha_j(x-X_i)^j\right)^2h_x^{-1}K\left(\frac{x-X_i}{h_x}\right) \end{equation*}\]

  • We minimize ISE(\(p\), \(h_x\),\(h_y\)) with known bounds \(a\) and \(b\) or unknown bounds using the empirical support \([\hat a,\hat b]=[\min(Y),\max(Y)]\)

  • We provide uniform consistency results for the resulting estimator

  • This supports the construction of uniform or “simultaneous” confidence intervals

  • Let’s consider some illustrations of the fully data-driven method of bandwidth and polynomial order selection

Example - Conditional KDE with Bounded \(Y\) and Irrelevant \(X\)

We consider what is known to be the “worst case” scenario for classical kernel density estimation, namely, when the unknown conditional density has a uniform distribution

Example (Uniform, Irrelevant \(X\))

library(bkcde)
set.seed(42)
n <- 350
x <- runif(n)
y <- rbeta(n,1,1)
f <- bkcde(x=x,y=y,proper=TRUE)
f.classic <- bkcde(x=x,y=y,proper=TRUE,
                   y.ub=Inf,
                   y.lb=-Inf,
                   x.ub=Inf,
                   x.lb=-Inf,
                   degree.min=0,
                   degree.max=0)
dgp <- dbeta(f$y.eval,1,1)
par(mfrow=c(1,3),cex=.75,mar=c(2,2,2,1))
zlim <- range(f$f,f.classic$f,dgp,0,2)
plot(f.classic,main=paste("Estimated Density (Classic, p = 0), n = ",n),zlim=zlim)
plot(f,main=paste("Estimated Density (Bounded, p = ",f$degree,"), n = ",n,sep=""),zlim=zlim)
persp(unique(f$x.eval),unique(f$y.eval),
      matrix(dgp,nrow=length(unique(f$x.eval))),
      theta=120,phi=45,ticktype="detailed",xlab="x",ylab="y",zlab="f(y|x)",zlim=zlim,main="True density")

Example - Conditional KDE with Bounded \(Y\) and Irrelevant \(X\)

Figure 4: Conditional KDE for \(U[0,1]\): classical estimator with ad hoc bounds and polynomial order (\(Y\in[-\infty,\infty]\), \(p=0\)) versus the proposed estimator with empirical bounds and adaptive polynomial order (\(Y\in[\min(Y),\max(Y)]\))

Example - Conditional KDE with Unbounded \(Y\)

Below we consider the case where the distribution is unbounded (\(Y\in(-\infty,\infty)\)) but empirical bounds are used and the bandwidths and polynomial order are data-driven

Example (Gaussian, Empirical Bounds)

library(bkcde)
set.seed(42)
n <- 250
a <- 0
b <- 1
sd.unif <- sqrt((b-a)^2/12)
sn <- 1
x <- runif(n,a,b)
y <- rnorm(n,mean=x,sd=sn*sd.unif)
f <- bkcde(x=x,y=y,proper=TRUE)
dgp <- dnorm(f$y.eval,mean=f$x.eval,sd=sn*sd.unif)
zlim <- range(f$f,dgp)
par(mfrow=c(1,2),cex=.75,mar=c(2,2,2,1))
plot(f,main=paste("Estimated Density, n = ",n),zlim=zlim)
persp(unique(f$x.eval),unique(f$y.eval),
      matrix(dgp,nrow=length(unique(f$x.eval))),
      theta=120,phi=45,ticktype="detailed",xlab="x",ylab="y",zlab="f(y|x)",zlim=zlim,main="True density")

Example - Conditional KDE with Unbounded \(Y\)

Figure 5: Conditional KDE for \(N(x,1/12)\) with empirical bounds

Example - Conditional KDE with Bounded \(Y\)

Below we consider the case where the distribution is bounded (\(Y\in[0,1]\)) but empirical bounds are used and the bandwidths and polynomial order are data-driven

Example (Beta(1+x,2-x), Empirical Bounds)

library(bkcde)
set.seed(42)
n <- 250
x <- runif(n)
y <- rbeta(n,1+x,2-x)
f <- bkcde(x=x,y=y,proper=TRUE)
dgp <- dbeta(f$y.eval,1+f$x.eval,2-f$x.eval)
par(mfrow=c(1,2),cex=.75,mar=c(2,2,2,1))
zlim <- range(f$f,dgp)
plot(f,main=paste("Estimated Density, n = ",n),zlim=zlim)
persp(unique(f$x.eval),unique(f$y.eval),
      matrix(dgp,nrow=length(unique(f$x.eval))),
      theta=120,phi=45,ticktype="detailed",xlab="x",ylab="y",zlab="f(y|x)",zlim=zlim,main="True density")

Example - Conditional KDE with Bounded \(Y\)

Figure 6: Conditional KDE for \(\textrm{Beta}(1+x,2-x)\) with empirical bounds

Example - Simultaneous Confidence Intervals

Simultaneous or “uniform” confidence intervals present a multiple comparison problem, requiring both bias adjustment and control of joint coverage across all evaluation points, a problem known to be difficult. The R package you are using can generate pointwise, simultaneous, and Bonferroni bounds that are theoretically supported and trivial to render, as the following example highlights (this may take a bit of time as a large number of bootstrap replications are drawn, \(B=3,999\))

Example (Beta, Empirical Bounds, Simultaneous Confidence Intervals)

library(bkcde)
set.seed(42)
n <- 250
x <- runif(n)
y <- rbeta(n,1+x,2-x)
f <- bkcde(x=x,y=y,proper=TRUE)
par(mfrow=c(1,2),cex=.75,mar=c(2,2,2,1))
foo <- plot(f,ci=TRUE,ci.method="all",ci.preplot=FALSE,plot.behavior="plot-data",main=paste("Estimated Density, n = ",n))
persp(unique(f$x.eval),unique(f$y.eval),
      matrix(dbeta(f$y.eval,1+f$x.eval,2-f$x.eval),nrow=length(unique(f$x.eval))),
      theta=120,phi=45,ticktype="detailed",xlab="x",ylab="y",zlab="f(y|x)",zlim=foo$zlim,main="True density")

Example - Simultaneous Confidence Intervals

Figure 7: Conditional KDE for \(\textrm{Beta}(1+x,2-x)\) with empirical bounds and simultaneous confidence intervals

Example - Patent Application Data

As a nod to being in the patent capital of the world, consider estimating the density of patent counts using a panel of 181 firms from 1983 to 1991 (again, this may take a bit of time)

Example (Annual numbers of European patent applications)

library(bkcde)
library(pglm)
## Annual observations of 181 firms from 1983 to 1991, number of European patent
## applications (count), 1629 observations. 
data(PatentsRD)
attach(PatentsRD)
## Patent data often has a preponderance of zeros and "over-dispersion". Since
## the data has a boundary at zero, a zero-inflated negative binomial model is
## often used. Instead we consider a kernel density estimation approach that
## adapts to the boundary automatically and also chooses the polynomial order in
## a data-driven fashion.
x <- as.integer(as.character(year))
y <- patent
## To most quickly run this example, we take results from a prior run so that in
## real-time computation proceeds as quickly as possible for illustrative
## purposes (you can comment out/re-enable to confirm they produce identical
## results)
# f.bkcde <- bkcde(x=x,y=y,proper=TRUE)
# f.bkcde$h [1] 5.022254 5.233370
# f.bkcde$degree [1] 0
f.bkcde <- bkcde(h=c(5.022254,5.233370),degree=0,x=x,y=y,proper=TRUE)
## Estimator of Hall, Racine & Li (polynomial order 0 conditional density estimate)
## with bandwidth selection via likelihood cross-validation
# f.hrl <- bkcde(degree.max=0,x=x,y=y,proper=TRUE,y.ub=Inf,y.lb=-Inf,x.ub=Inf,x.lb=-Inf)
# f.hrl$h [1] 4.166873 5.442247
# f.hrl$degree [1] 0
f.hrl <- bkcde(h=c(4.166873,5.442247),degree=0,x=x,y=y,proper=TRUE,y.ub=Inf,y.lb=-Inf,x.ub=Inf,x.lb=-Inf)
## In the following plots note that the data has a very long tail (right skewed)
## so we home in on firms having 100 or fewer patents/year. We first take
## univariate histogram and density() estimators for the year 1987, then overlay
## the conditional kernel density estimates from npcdens() and bkcde() for the
## same year (these use data for all years). We also plot the bkcde() 2D
## estimates for the year 1987 without, then with simultaneous confidence
## intervals. Finally, we plot the bkcde() estimate in 3D with and without
## confidence intervals (all years). We override the default number of bootstrap
## replications to 399.
B <- 399
par(mfrow=c(2,2),cex=.65,mar=c(4,4,4,2))
plot(f.bkcde,persp=FALSE,x.eval=1987,y.eval=seq(min(y),100,length=100),xlim=c(0,100))
hist(y[x==1987],xlab="Patent counts",main="",ylim=c(0,0.06),xlim=c(0,100),prob=TRUE,breaks=250)
lines(density(y[x==1987],from=0,to=100),lty=1,col=1)
lines(seq(min(y),100,length=100),predict(f.hrl,newdata=data.frame(x=rep(1987,100),y=seq(min(y),100,length=100))),
      col=2,lty=2,lwd=2)
lines(seq(min(y),100,length=100),predict(f.bkcde,newdata=data.frame(x=rep(1987,100),y=seq(min(y),100,length=100))),
      col=3,lty=3,lwd=3)
legend("topright",legend=c("density() classical kernel [1987 only]","HRL classical kernel","BKPA boundary kernel"),
       col=1:3,lty=1:3,lwd=c(1,2,3),bty="n")
plot(f.bkcde,persp=FALSE,x.eval=1987,y.eval=seq(min(y),100,length=100),xlim=c(0,100),
     ci=TRUE,ci.method="Simultaneous",ci.preplot=FALSE,B=B)
plot(f.bkcde,x.eval=seq(min(x),max(x),length=25),y.eval=seq(min(y),100,length=25),
     ci=TRUE,ci.method="Simultaneous",ci.preplot=FALSE,B=B)

Example - Patent Application Data

Figure 8: Conditional KDE for patent data with simultaneous confidence intervals, 2D plot for year 1987, 3D plot for all years

Finite-Sample Performance:
Monte Carlo Simulation Results

Simulation Details

  • We provide theoretical underpinnings for joint selection of \(p\), \(h_y\), and \(h_x\) via least squares cross-validation (LS-CV) minimization of \[\mathrm{ISE}(p,h_y,h_x)=\int \{\widehat{g}(y | \mathbf{x})-{g}(y | \mathbf{x})\}^2dy\]

  • However, no data-driven method is guaranteed to always deliver reliable results

  • We therefore compare the method with maximum likelihood cross-validation (ML-CV) which maximizes \[\mathcal{L}(p,h_y,h_x)=\sum_{i=1}^n \log \widehat{g}_{-i}(y_i | \mathbf{x}_i)\]

  • Both ML-CV and LS-CV perform comparably and appear to deliver consistent estimates
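As a concrete illustration of the ML-CV criterion, the sketch below implements the leave-one-out log-likelihood for the local constant (\(p=0\)) estimator with Gaussian kernels and maximizes it over \((h_y,h_x)\) via `optim` (the function name `mlcv`, the starting values, and the box constraints are illustrative assumptions, not the package's implementation):

```r
## Leave-one-out log-likelihood of the p = 0 conditional density estimate
## as a function of h = (h.y, h.x)
mlcv <- function(h, Y, X) {
  ll <- 0
  for (i in seq_along(Y)) {
    w <- dnorm((X[i] - X[-i]) / h[2]) / h[2]
    f.i <- sum(dnorm((Y[i] - Y[-i]) / h[1]) / h[1] * w) / sum(w)
    ll <- ll + log(f.i)
  }
  ll
}
set.seed(42)
n <- 200
X <- runif(n)
Y <- rnorm(n, mean = X, sd = 0.25)
## Maximize the criterion (minimize its negative) over the bandwidths
opt <- optim(c(0.5, 0.5), function(h) -mlcv(h, Y, X), method = "L-BFGS-B",
             lower = c(0.05, 0.05))
opt$par   # ML-CV bandwidths (h.y, h.x)
```

The joint search over \(p\) in addition to \((h_y,h_x)\) is the mixed-integer extension handled by NOMAD in the package.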

Simulation Details

  • We present a range of simulations for the beta and Gaussian conditional densities \(f(y|x)\), with \(x\) either irrelevant, i.e., \(f(y|x)=f(y)\), or with the mean of \(f(y|x)\) quadratic in \(x\)

  • We consider sample sizes of \(n=400,800,1600\)

  • We conduct \(M\) Monte Carlo replications for each scenario and compare

    • BKPA: The proposed approach using a polynomial adaptive method and empirical boundary kernel
    • HRL: Hall, Racine, and Li (2004) which is a local constant method (\(p=0\))
    • FYT: Fan, Yao, and Tong (1996) which is a local polynomial method (\(p=1\))
  • Neither Hall, Racine, and Li (2004) nor Fan, Yao, and Tong (1996) make use of boundary information

  • For BKPA we conduct search jointly over candidate models of order \(p=0,1,2,3,4,5\) and \(h_y\) and \(h_x\)

Simulation Details

  • The proposed approach uses the empirical support of \(Y\), \([\min(Y),\max(Y)]\), for the boundary kernel (we don’t assume the bounds \([a,b]\) are known a priori)

  • For the Beta distribution, the true bounds are \([0,1]\)

  • For the Gaussian distribution, the true bounds are \([-\infty,\infty]\)

  • HRL: Hall, Racine, and Li (2004) (\(p=0\)) uses the classical bounds, \([-\infty,\infty]\)

  • FYT: Fan, Yao, and Tong (1996) (\(p=1\)) uses the classical bounds, \([-\infty,\infty]\)

  • The \(\mathrm{Beta}(s_1,s_2)\) random variate has mean \(s_1/(s_1+s_2)\) and variance \(s_1s_2/((s_1+s_2)^2(s_1+s_2+1))\)

  • \(X\) is drawn from a \(U[0,1]\) distribution which has mean \(1/2\) and variance \(1/12\)
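The moment formulas above are easy to verify numerically (the helper names `beta.mean` and `beta.var` are ad hoc, for illustration only):

```r
## Quick check of the Beta(s1, s2) moment formulas quoted above
beta.mean <- function(s1, s2) s1 / (s1 + s2)
beta.var  <- function(s1, s2) s1 * s2 / ((s1 + s2)^2 * (s1 + s2 + 1))
## Beta(1, 1) is U[0,1]: mean 1/2 and variance 1/12
c(beta.mean(1, 1), beta.var(1, 1))   # 0.5 and 0.0833...
```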

Simulation DGPs

Figure 9: DGPs for the conditional density simulations (top Beta, bottom Gaussian, figures on the left \(X\) irrelevant, figures on the right mean quadratic in \(X\))

Candidate Model Comparison

Candidate Model Comparison: Beta DGP

  • We conduct Monte Carlo simulations and report the median relative RMSE of fixed polynomial estimators and the polynomial adaptive approach (PA)

  • Numbers \(<1\) indicate that the fixed-order estimator is more efficient than the PA method

  • If you knew ex ante which polynomial order to select, you would use that

    Table 1: Relative RMSE Irrelevant ML
    n p=0 p=1 p=2 p=3 p=4 p=5 BKPA
    400 0.94 1.13 1.30 1.41 1.53 1.63 1.00
    800 0.96 1.19 1.36 1.49 1.62 1.72 1.00
    1600 0.98 1.24 1.42 1.57 1.71 1.84 1.00
    Table 2: Relative RMSE Irrelevant LS
    n p=0 p=1 p=2 p=3 p=4 p=5 BKPA
    400 0.90 1.08 1.24 1.35 1.45 1.54 1.00
    800 0.94 1.14 1.31 1.44 1.57 1.67 1.00
    1600 0.97 1.19 1.36 1.50 1.64 1.77 1.00
    Table 3: Relative RMSE Quadratic ML
    n p=0 p=1 p=2 p=3 p=4 p=5 BKPA
    400 1.05 1.03 0.97 1.00 1.00 1.05 1.00
    800 1.05 1.03 1.02 1.00 0.99 1.03 1.00
    1600 1.06 1.04 1.05 1.00 0.98 1.02 1.00
    Table 4: Relative RMSE Quadratic LS
    n p=0 p=1 p=2 p=3 p=4 p=5 BKPA
    100 0.93 0.93 0.88 0.95 0.98 1.05 1.00
    200 1.02 1.02 0.96 1.01 1.03 1.09 1.00
    400 1.05 1.03 0.98 1.01 1.01 1.07 1.00
    800 1.04 1.02 1.00 0.99 0.98 1.03 1.00
    1600 1.06 1.02 1.02 0.99 0.96 1.01 1.00

Candidate Model Comparison: Gaussian DGP

  • We conduct Monte Carlo simulations and report the median relative RMSE of fixed polynomial estimators and the polynomial adaptive approach (PA)

  • Numbers \(<1\) indicate that the fixed-order estimator is more efficient than the PA method

  • If you knew ex ante which polynomial order to select, you would use that

    Table 5: Relative RMSE Irrelevant ML
    n p=0 p=1 p=2 p=3 p=4 p=5 BKPA
    400 0.96 1.21 1.39 1.56 1.70 1.82 1.00
    800 0.98 1.24 1.44 1.62 1.78 1.94 1.00
    1600 0.99 1.28 1.50 1.71 1.88 2.04 1.00
    Table 6: Relative RMSE Irrelevant LS
    n p=0 p=1 p=2 p=3 p=4 p=5 BKPA
    400 0.97 1.22 1.41 1.57 1.72 1.84 1.00
    800 0.97 1.23 1.43 1.60 1.74 1.89 1.00
    1600 0.98 1.21 1.42 1.59 1.75 1.90 1.00
    Table 7: Relative RMSE Quadratic ML
    n p=0 p=1 p=2 p=3 p=4 p=5 BKPA
    400 1.04 1.02 1.01 1.00 1.01 1.02 1.00
    800 1.08 1.02 1.00 1.02 1.02 1.01 1.00
    1600 1.11 1.02 1.01 1.04 1.04 1.01 1.00
    Table 8: Relative RMSE Quadratic LS
    n p=0 p=1 p=2 p=3 p=4 p=5 BKPA
    100 0.92 0.96 0.97 0.97 0.98 1.03 1.00
    200 0.97 0.99 0.98 0.98 0.98 1.00 1.00
    400 1.05 1.02 1.00 0.99 1.00 1.00 1.00
    800 1.08 1.02 0.99 0.99 1.01 0.99 1.00
    1600 1.16 1.05 1.01 1.01 1.02 1.00 1.00

Scale Factor Summaries

Scale Factor Summaries: Beta DGP

The “scale factor” \(s\) is a unit-free constant that relates the bandwidth \(h\) to the sample size and the scale of the data, i.e., \(h_x=s_x\sigma_x/n^{1/6}\) and \(h_y=s_y\sigma_y/n^{1/6}\), where \(n\) is the sample size. The scale factor is estimated by optimizing the cross-validation score. The summaries below report the median scale factor for the three methods (BKPA, HRL, and FYT) for the irrelevant \(x\) and quadratic \(x\) DGPs
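The mapping between bandwidths and scale factors is a simple rescaling; a sketch (the helper names `h.to.sf` and `sf.to.h` are illustrative, not package functions):

```r
## The scale-factor parameterization: h = s * sigma / n^(1/6), so a
## cross-validated bandwidth h maps to the unit-free scale factor
## s = h * n^(1/6) / sigma
h.to.sf <- function(h, sigma, n) h * n^(1/6) / sigma
sf.to.h <- function(s, sigma, n) s * sigma / n^(1/6)
## Round-trip check
h <- sf.to.h(1.05, sigma = 0.29, n = 400)
h.to.sf(h, sigma = 0.29, n = 400)   # 1.05
```

Very large scale factors in a table (e.g., for an irrelevant \(x\)) correspond to bandwidths that diverge, i.e., the shrinkage phenomenon discussed earlier.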

Table 9: Scale Factor Summary Irrelevant ML

         BKPA              HRL               FYT
n      sf-y    sf-x     sf-y    sf-x     sf-y    sf-x
400    1.05   185.19    0.5   3260.66    0.5   1092.02
800    0.91   207.33    0.5   3032.45    0.5   2218.23
1600   0.78   232.80    0.5   2325.28    0.5   1587.36

Table 10: Scale Factor Summary Irrelevant LS

         BKPA              HRL               FYT
n      sf-y    sf-x     sf-y    sf-x     sf-y    sf-x
400    1.17   195.20    0.5   1020.72    0.5    253.88
800    1.13   197.04    0.5   1302.18    0.5   1054.54
1600   1.05   175.11    0.5   1142.90    0.5   1340.14

Table 11: Scale Factor Summary Quadratic ML

         BKPA              HRL               FYT
n      sf-y    sf-x     sf-y    sf-x     sf-y    sf-x
400    1.63     7.46    0.5      0.95    0.5      1.42
800    1.61     2.57    0.5      0.89    0.5      1.28
1600   1.62     2.30    0.5      0.84    0.5      1.17

Table 12: Scale Factor Summary Quadratic LS

         BKPA              HRL               FYT
n      sf-y    sf-x     sf-y    sf-x     sf-y    sf-x
100    1.61     7.57    0.91     0.89    0.97     1.12
200    1.52     8.81    0.82     0.85    0.86     1.07
400    1.40    97.77    0.71     0.84    0.74     1.13
800    1.31    96.11    0.63     0.85    0.63     1.15
1600   1.18    27.92    0.55     0.84    0.54     1.16

Scale Factor Summaries: Gaussian DGP

The “scale factor” \(s\) is an unknown constant that, after adjusting for the sample size and the scale of the data, is related to the “bandwidth” \(h\) via \(h_x=s_x\sigma_x/n^{1/6}\) and \(h_y=s_y\sigma_y/n^{1/6}\), where \(n\) is the sample size and \(\sigma_x,\sigma_y\) are standard deviations. The scale factor is estimated by optimizing the cross-validation score. The summaries below report the median scale factor for the three methods (BKPA, HRL, and FYT) under the irrelevant-\(x\) and quadratic-\(x\) DGPs.

Table 13: Scale Factor Summary Irrelevant ML

         BKPA              HRL               FYT
n      sf-y    sf-x     sf-y    sf-x     sf-y    sf-x
400    1.12   213.48    0.93   169.25    1.07    13.21
800    1.02   246.56    0.90   175.62    1.02    47.94
1600   0.96   314.23    0.88   255.67    1.01    14.35

Table 14: Scale Factor Summary Irrelevant LS

         BKPA              HRL               FYT
n      sf-y    sf-x     sf-y    sf-x     sf-y    sf-x
400    0.94   166.16    0.96    80.29    1.10    86.54
800    0.91   253.72    0.91   140.47    1.05   112.00
1600   0.87   204.60    0.83   107.26    0.99    75.46

Table 15: Scale Factor Summary Quadratic ML

         BKPA              HRL               FYT
n      sf-y    sf-x     sf-y    sf-x     sf-y    sf-x
400    0.96     1.58    0.84     0.5     0.86     0.5
800    0.97     1.46    0.82     0.5     0.86     0.5
1600   0.99     1.26    0.81     0.5     0.85     0.5

Table 16: Scale Factor Summary Quadratic LS

         BKPA              HRL               FYT
n      sf-y    sf-x     sf-y    sf-x     sf-y    sf-x
100    0.99     1.83    0.99     0.5     1.00     0.53
200    0.94     1.78    0.93     0.5     0.95     0.50
400    0.92     1.86    0.89     0.5     0.91     0.50
800    0.92     1.84    0.86     0.5     0.89     0.50
1600   0.90     1.88    0.83     0.5     0.85     0.50

Comparison with HRL and FYT

Comparison with HRL and FYT: Beta DGP

  • We compare the performance of the proposed method using empirical bounds with the methods of Hall, Racine, and Li (2004) and Fan, Yao, and Tong (1996), both of which use a classical kernel that presumes infinite support (inappropriate for the Beta DGP)

  • Numbers \(<1\) indicate greater efficiency than the BKPA method

    Table 17: Median Relative RMSE Irrelevant ML
    n      BKPA   HRL    FYT
    400    1.00   1.99   2.22
    800    1.00   2.51   2.70
    1600   1.00   3.24   3.39
    Table 18: Median Relative RMSE Irrelevant LS
    n      BKPA   HRL    FYT
    400    1.00   1.89   2.11
    800    1.00   2.46   2.64
    1600   1.00   3.14   3.29
    Table 19: Median Relative RMSE Quadratic ML
    n      BKPA   HRL    FYT
    400    1.00   1.29   1.27
    800    1.00   1.25   1.22
    1600   1.00   1.21   1.15
    Table 20: Median Relative RMSE Quadratic LS
    n      BKPA   HRL    FYT
    100    1.00   1.06   1.06
    200    1.00   1.20   1.19
    400    1.00   1.23   1.21
    800    1.00   1.22   1.17
    1600   1.00   1.24   1.16
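To see why a kernel presuming infinite support is problematic for the Beta DGP, the base-R sketch below (using `stats::density`, not the paper's estimator) measures the probability mass a classical Gaussian-kernel estimate places outside the support \([0,1]\):

```r
# Sketch (base R, not the paper's estimator): a classical Gaussian kernel
# leaks probability mass outside the Beta support [0, 1].
set.seed(42)
x  <- rbeta(1000, 0.5, 0.5)              # boundary-heavy Beta(0.5, 0.5) draw
f  <- density(x, kernel = "gaussian")    # fixed-kernel density estimate
dx <- diff(f$x)[1]                       # grid spacing for Riemann sum
outside <- sum(f$y[f$x < 0 | f$x > 1]) * dx
outside                                  # strictly positive: boundary leakage
```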

Comparison with HRL and FYT: Gaussian DGP

  • We compare the performance of the proposed method using empirical bounds with the methods of Hall, Racine, and Li (2004) and Fan, Yao, and Tong (1996), both of which use a classical kernel that presumes infinite support (appropriate for the Gaussian DGP)

  • Numbers \(<1\) indicate greater efficiency than the BKPA method

    Table 21: Median Relative RMSE Irrelevant ML
    n      BKPA   HRL    FYT
    400    1.00   0.84   1.09
    800    1.00   0.93   1.19
    1600   1.00   0.96   1.26
    Table 22: Median Relative RMSE Irrelevant LS
    n      BKPA   HRL    FYT
    400    1.00   0.94   1.17
    800    1.00   0.96   1.21
    1600   1.00   1.01   1.27
    Table 23: Median Relative RMSE Quadratic ML
    n      BKPA   HRL    FYT
    400    1.00   1.02   0.97
    800    1.00   1.07   0.99
    1600   1.00   1.11   1.01
    Table 24: Median Relative RMSE Quadratic LS
    n      BKPA   HRL    FYT
    100    1.00   0.84   0.84
    200    1.00   0.93   0.90
    400    1.00   1.02   0.97
    800    1.00   1.07   0.99
    1600   1.00   1.16   1.03

References (scrollable)

Abramson, M. A., C. Audet, G. Couture, J. E. Dennis Jr., and S. Le Digabel. 2022. “The NOMAD Project.” GERAD (Groupe D’Études et de Recherche en Analyse des Décisions). https://www.gerad.ca/en/software/nomad.
Bouezmarni, Taoufik, and Jean-Marie Rolin. 2003. “Consistency of the Beta Kernel Density Function Estimator.” The Canadian Journal of Statistics / La Revue Canadienne de Statistique 31 (1): 89–98.
Bowman, A. W. 1984. “An Alternative Method of Cross-Validation for the Smoothing of Density Estimates.” Biometrika 71: 353–60.
Chen, Song Xi. 1999. “Beta Kernel Estimators for Density Functions.” Computational Statistics & Data Analysis 31 (2): 131–45.
———. 2000. “Probability Density Function Estimation Using Gamma Kernels.” Annals of the Institute of Statistical Mathematics 52 (3): 471–80.
Fan, J., and I. Gijbels. 1995. “Adaptive Order Polynomial Fitting: Bandwidth Robustification and Bias Reduction.” Journal of Computational and Graphical Statistics 4: 213–27.
———. 1996. Local Polynomial Modelling and Its Applications. London: Chapman and Hall.
Fan, J., Q. Yao, and H. Tong. 1996. “Estimation of Conditional Densities and Sensitivity Measures in Nonlinear Dynamical Systems.” Biometrika 83 (1): 189–206. https://doi.org/10.1093/biomet/83.1.189.
Hall, P., and J. S. Racine. 2015. “Infinite Order Cross-Validated Local Polynomial Regression.” Journal of Econometrics 185 (2): 510–25.
Hall, P., J. S. Racine, and Q. Li. 2004. “Cross-Validation and the Estimation of Conditional Probability Densities.” Journal of the American Statistical Association 99 (468): 1015–26.
Igarashi, Gaku. 2016. “Bias Reductions for Beta Kernel Estimation.” Journal of Nonparametric Statistics 28 (1): 1–30.
Igarashi, Gaku, and Yoshihide Kakizawa. 2014. “Re-Formulation of Inverse Gaussian, Reciprocal Inverse Gaussian, and Birnbaum–Saunders Kernel Estimators.” Statistics & Probability Letters 84: 235–46.
Kiefer, N. M., and J. S. Racine. 2009. “The Smooth Colonel Meets the Reverend.” Journal of Nonparametric Statistics 21: 521–33.
———. 2015. “The Smooth Colonel and the Reverend Find Common Ground.” Econometric Reviews (forthcoming).
Le Digabel, S. 2011. “Algorithm 909: NOMAD: Nonlinear Optimization with the MADS Algorithm.” ACM Transactions on Mathematical Software 37 (4): 44:1–15.
Malec, Peter, and Melanie Schienle. 2014. “Nonparametric Kernel Density Estimation Near the Boundary.” Computational Statistics & Data Analysis 72: 57–76. https://doi.org/10.1016/j.csda.2013.10.023.
Parzen, E. 1962. “On Estimation of a Probability Density Function and Mode.” The Annals of Mathematical Statistics 33: 1065–76.
Racine, J. S., Q. Li, and Q. Wang. 2024. “Boundary-Adaptive Kernel Density Estimation: The Case of (Near) Uniform Density.” Journal of Nonparametric Statistics 36 (1): 146–64. https://doi.org/10.1080/10485252.2023.2250011.
Rosenblatt, M. 1956. “Remarks on Some Nonparametric Estimates of a Density Function.” The Annals of Mathematical Statistics 27: 832–37.
Rudemo, M. 1982. “Empirical Choice of Histograms and Kernel Density Estimators.” Scandinavian Journal of Statistics 9: 65–78.
Scaillet, O. 2004. “Density Estimation Using Inverse and Reciprocal Inverse Gaussian Kernels.” Journal of Nonparametric Statistics 16 (1-2): 217–26.
Scott, David W. 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley.
Zhang, Shunpu, and Rohana J. Karunamuni. 2010. “Boundary Performance of the Beta Kernel Estimators.” Journal of Nonparametric Statistics 22 (1): 81–104.