Labs

1 How to proceed

The goal of these labs is to get you up to speed with certain aspects of working with R, presuming of course that you are not already familiar with R, RStudio, and applied nonparametric estimation in general. It is also my sincere hope that these labs will enable you to work with your own data in R as early as the first lab session, if you choose.

Note that each chunk of code below can be cut-and-paste into RStudio (or elsewhere) by moving your cursor to the upper right portion of the code chunk then clicking on the clipboard icon (this copies the code to your clipboard and you can then copy it into RStudio using whichever method you are accustomed to).

These labs complement material presented in the lectures rather than simply repeating what you have already seen. They present an under-the-hood view of capabilities that may not be obvious unless you have spent hours devouring vignettes1 and assorted help files. Note that once you have installed the np package then you can view various vignettes as follows:

## The vignette below presents an overview of the np package
vignette("np",package="np")
## The vignette below provides answers to frequently asked questions
vignette("np_faq",package="np")
## The vignette below presents an overview of entropy-based inference
## in the np package
vignette("entropy_np",package="np")

I expect you to work through these labs at your own pace over the course of this five-day workshop. Also, pull up the lecture slides and try the various examples presented there as well as you see fit. Follow your curiosity and poke around! Read through the various help files and vignettes, and of course use me as a resource while I am physically present! Please do not hesitate to ask me anything whatsoever, and I will do my utmost to try to assist.

2 Preliminaries - reading datasets created by other software programs into R

One of the nice features of R lies in its ability to work with data stored in binary formats that have been generated by non-R (foreign) software. Furthermore, one can at the same time read data directly from a URL, thereby avoiding having to download and store the file locally, which in some cases is quite helpful.

  • The functionality for reading data in a variety of formats is contained in the foreign library.2 Load the foreign package and see the help file for read.dta (i.e. type ?read.dta after loading the foreign library). You load the foreign library as follows:

    library(foreign)

    Having accomplished this, try typing ?read.dta.

  • Let’s do two things at once, namely a) read a binary Stata file into R and b) do so from a URL:

    mydat <- read.dta(file="http://www.principlesofeconometrics.com/stata/mroz.dta")
  • You can see the names of the variables in the object mydat along with a summary of one of the variables as follows:

    names(mydat)
    summary(wage)
  • Oops - you have encountered one of the most common mistakes in R. Scope matters: in order to access the variables in the object mydat by name, you need to attach it first, so try the following:

    attach(mydat)
    summary(wage)
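
    As an aside, you can also access a column of mydat without attaching it, via the $ operator:

    summary(mydat$wage)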
  • If you currently work with some data stored in the binary format of a non-R software platform (e.g. Stata, SPSS, SAS, Minitab), kindly try to read and attach the data in R right away. Doing so is perhaps most beneficial now, since you can then get up to speed by replicating in the R environment something you are already doing on another platform.

So, at this stage you ought to be able to read and analyze data in R, an open, free, and extensible platform. And you too are now free, i.e. emancipated, because you no longer have to limit yourself to a closed, proprietary system that you must pay for the privilege of using. You can also, if the need arises, write data in the same (closed) format (see ?write.dta). Note from ?read.dta and ?write.dta that you may only be able to read/write some versions of Stata binary files.3
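
For example, a minimal sketch of writing the mroz data back out in Stata format (the file name mroz_copy.dta is purely illustrative):

## Write the data frame mydat to a Stata binary file in the working directory
write.dta(mydat, file="mroz_copy.dta")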

3 Preliminaries - nonparametric estimation of density functions using base R

The questions below make use of a chi-squared random variable with \(\nu=5\) degrees of freedom, which we may express as \[\begin{align*} X\sim \chi^2_{5} \end{align*}\]

  • What is the mean \(E(X)\) and variance \(V(X)\) of a \(\chi^2\) random variable with \(\nu=5\) degrees of freedom?

  • Simulate a sample \(X\) of length \(10^6\), and compute the mean and variance of this sample using the following code chunk:

    set.seed(42)
    x <- rchisq(10^6,df=5)
    mean(x)
    var(x)

How do the sample mean and variance match up with the theoretical mean \(E(X)\) and variance \(V(X)\)? Why do they differ, if they do?
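
Recall that for a \(\chi^2_\nu\) random variable \(E(X)=\nu\) and \(V(X)=2\nu\), so a quick check of the theoretical values in R is simply

## Theoretical mean and variance of a chi-squared random variable with df = 5
nu <- 5
c(mean = nu, variance = 2*nu)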

  • Plot the parametric population density function \(f(x)\) by generating a sequence x.seq from 0 to 25 and then evaluating the density function using R’s complement of parametric density functions (here we use dchisq()).

    x.seq <- seq(0,25,length=100)
    myden <- dchisq(x.seq,df=5)
    plot(x.seq,myden,type="l")
  • Using R’s density function, superimpose the nonparametric Rosenblatt-Parzen kernel density on the parametric density above via something like

    plot(x.seq,myden,xlab="x",ylab="f(x)",type="l")
    lines(density(x),col=2,lty=2)
    legend("topright",c("Parametric","Nonparametric"),col=1:2,lty=1:2,bty="n")
  • The R function density is an amazing function for Rosenblatt-Parzen kernel density estimation. For one, it is blazingly fast (it relies on a Fast Fourier Transform and on plug-in bandwidth selectors, both of which are computationally cheap). But it is also limited to scalar \(x\), does not support categorical kernels, and so forth.

    So, let’s replicate the above using the nonparametric unconditional density function npudens in the R package np. First you must install the np package via install.packages("np") or via the RStudio install pane, and then load the library via library(np). Note that if you don’t want screen I/O to be produced by functions in this library, you can disable it with options(np.messages=FALSE), which will prevent output such as Multistart 1 of 1 | from appearing while routines are executing.

    ## Note that you have to load the np package in order to execute e.g.,
    ## npudens(), npudensbw(), npreg() etc. (I may presume you will have loaded it
    ## already for some of the code chunks that follow).
    library(np)
    f <- npudens(tdat=x,edat=x.seq,bws=bw.nrd0(x))
    plot(x.seq,myden,xlab="x",ylab="f(x)",type="l")
    lines(x.seq,fitted(f),col=2,lty=2)
    legend("topright",c("Parametric","Nonparametric"),col=1:2,lty=1:2,bty="n")

    (the options tdat and edat are for training and evaluation data, respectively - typically we invoke the function via the formula interface, which is discussed below - note that bw.nrd0 is the default plug-in bandwidth selector used by the R function density).
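
    As a quick sanity check (an aside, not part of the original exercise), density() uses bw.nrd0() as its default bandwidth rule, so the two bandwidths should agree:

    ## The default plug-in bandwidth and the bandwidth used by density()
    bw.nrd0(x)
    density(x)$bw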

4 The R package np and working with npudens

Many functions in R support what is known as the formula interface, and npudens is no exception. Furthermore, sometimes it is desirable to separate data-driven methods of bandwidth selection from estimation of the density itself. In the np package, this is accomplished by functions such as npudensbw (which handles bandwidth selection) and npudens (which handles density estimation). The authors of the np package have overloaded these functions (i.e. embedded lots of functionality in them) and have tried to correctly guess intended usage where appropriate.

  • Here we consider the formula interface and least-squares cross-validation for bandwidth selection, and then feed the bandwidth object bw to the function that computes the density, among other things (i.e. we first invoke npudensbw and then invoke npudens).

    set.seed(42)
    x <- rchisq(1000,df=10)
    bw <- npudensbw(~x,bwmethod="cv.ls")
    summary(bw)
    f <- npudens(bw)
    summary(f)
    plot(f)

    (the tilde ~ is part of the R formula interface). In general, variables to the left of the tilde are left hand side variables, while those to the right are right hand side variables. For unconditional density estimation there are no left hand side variables, hence the ~x formula. For (linear) regression, though, you might have lm(y~x) or lm(y~x1+x2), while for nonparametric regression you might have npreg(y~x1+x2). Take note, however, that while lm(y~x1+x2) imposes a linear additive structure, npreg(y~x1+x2) does not - the use of the formula interface here is simply and solely to list the predictor variables x1 and x2.
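
    To see the distinction in action, here is a minimal sketch (the data generating process below is purely illustrative and not part of the original lab): the DGP contains an interaction between x1 and x2, which lm(y~x1+x2) cannot capture but which npreg(y~x1+x2) does not rule out.

    ## Illustrative DGP with an interaction between x1 and x2
    set.seed(42)
    n <- 250
    x1 <- runif(n)
    x2 <- runif(n)
    y <- x1*x2 + rnorm(n,sd=0.1)
    ## The linear additive specification misses the interaction...
    summary(lm(y~x1+x2))$r.squared
    ## ...while the nonparametric regression imposes no additive structure
    summary(npreg(y~x1+x2))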

  • Above we invoked two functions, npudensbw and npudens. We could perform both of these actions in one step (if you don’t provide a bandwidth but do provide the method, npudensbw will be called in the background, and any arguments provided to npudens will be passed along to npudensbw; note the bandwidth object will be stored as f$bws).

    f <- npudens(~x,bwmethod="cv.ls")
    summary(f$bws)
    summary(f)
    plot(f)
  • We might add asymptotic confidence bounds to the plot (plot calls npplot - see ?npplot for details).

    plot(f,plot.errors.method="asymptotic",plot.errors.style="band")
  • We might also change the kernel function from the default (ckertype="gaussian") to, say, the Epanechnikov kernel (ckertype="epanechnikov").

    f <- npudens(~x,ckertype="epanechnikov",bwmethod="cv.ls")
    summary(f$bws)
    summary(f)
    plot(f)
  • We might wish to change the order of the kernel function from the default (order 2) to, say, order 4 (ckerorder=4).

    f <- npudens(~x,ckerorder=4,bwmethod="cv.ls")
    summary(f$bws)
    summary(f)
    plot(f)
  • We might even try a non-fixed bandwidth, such as the adaptive nearest-neighbour approach (bwtype="adaptive_nn").

    f <- npudens(~x,bwtype="adaptive_nn",bwmethod="cv.ls")
    summary(f$bws)
    summary(f)
    plot(f)
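
    As a further variation (not required for the lab), a generalized nearest-neighbour bandwidth is also available - see ?npudensbw for the full list of bwtype values.

    f <- npudens(~x,bwtype="generalized_nn",bwmethod="cv.ls")
    summary(f$bws)
    plot(f)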

So, as you can see there are many options we might wish to modify if the occasion arises. See ?npudensbw and ?npudens for further details, and this is probably a good time to become familiar with the flow of R’s help system.4

5 The R package np and working with the npksum function

Sometimes you may need to compute kernel weighted sums of various objects. The function npksum in the R package np exists to compute kernel sums of various types. It makes calls to compiled C code and hence can be fairly fast. Many functions in the np package make calls to this function (or directly to the C code underlying this function). Becoming familiar with this function might be of value if you wish to implement a novel technique that does not exist in any package. By making calls to npksum you can generate efficient estimator prototypes very quickly indeed.

  • You could use npksum to compute the Rosenblatt-Parzen density estimate above. Let’s do so for a simple illustration, recalling that the density estimate \(\hat f(x)\) is given by \[\begin{equation*} \hat f(x)=\frac{1}{nh}\sum_{i=1}^n K((x-X_i)/h). \end{equation*}\]

    set.seed(42)
    n <- 1000
    x <- sort(rchisq(n,df=10))
    h <- bw.nrd0(x)
    f.hat <- npksum(~x,bws=h)$ksum/(n*h)
    plot(x,f.hat,type="l")

    (we sort the data simply so that we can plot the resulting density with a line).

  • You could superimpose the estimate given by the R function density for comparison purposes (they ought to be virtually identical, since both use a Gaussian kernel and the bw.nrd0 bandwidth).

    set.seed(42)
    n <- 1000
    x <- sort(rchisq(n,df=10))
    h <- bw.nrd0(x)
    f.hat <- npksum(~x,bws=h)$ksum/(n*h)
    plot(x,f.hat,type="l")
    lines(density(x),col=2,lty=2)
  • You could compute and plot the kernel function if you wished using npksum (here we consider the Epanechnikov kernel of orders 2 [default], 4, 6, and 8).

    Z <- seq(-sqrt(5),sqrt(5),length=100)
    par(mfrow=c(2,2))
    plot(Z,ylab="kernel",npksum(txdat=0,exdat=Z,bws=1,ckertype="epanechnikov",
                                ckerorder=2)$ksum,type="l",main="Epanechnikov [order = 2]")
    plot(Z,ylab="kernel",npksum(txdat=0,exdat=Z,bws=1,ckertype="epanechnikov",
                                ckerorder=4)$ksum,type="l",main="Epanechnikov [order = 4]")
    plot(Z,ylab="kernel",npksum(txdat=0,exdat=Z,bws=1,ckertype="epanechnikov",
                                ckerorder=6)$ksum,type="l",main="Epanechnikov [order = 6]")
    plot(Z,ylab="kernel",npksum(txdat=0,exdat=Z,bws=1,ckertype="epanechnikov",
                                ckerorder=8)$ksum,type="l",main="Epanechnikov [order = 8]")
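    ## Reset the plotting layout to a single panel when finished so that
    ## subsequent plots are not drawn in a 2x2 grid (added as a convenience)
    par(mfrow=c(1,1))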

So, there is really no limit to what you might do with npksum. There are many options that could be fed to this function. Check out ?npksum for details, and scroll through the numerous examples for illustrations of its capabilities (it can save you tons of programming if you wish to prototype a new kernel-based approach).

6 The R package np and applied nonparametric density estimation

We consider a classic dataset (Pagan and Ullah 1999) consisting of a random sample (\(n=205\)) taken from the 1971 Canadian Census Public Use Tapes for male individuals having common education (grade 13). You can load the data and attach the variables logwage and age as follows:

library(np)
data(cps71)
attach(cps71)
## See what is in the dataset
names(cps71)
## For help try
?cps71
  • Estimate the density for logwage using the Gaussian parametric distribution, describe the Shapiro-Wilk test for normality (Shapiro and Wilk 1965; i.e. consult the reference at the end of this assignment, and also see ?shapiro.test), then apply the test to logwage and plot the resulting density estimate. Is the data consistent with this parametric specification?

    library(np)
    options(np.messages=FALSE)
    data(cps71)
    attach(cps71)
    plot(sort(logwage),dnorm(sort(logwage),mean=mean(logwage),sd=sd(logwage)),type="l")
    shapiro.test(logwage)
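
    As an additional (optional) visual check of normality, you might also inspect a normal quantile-quantile plot using base R:

    qqnorm(logwage)
    qqline(logwage)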
  • Compute and plot the kernel density estimator for logwage using an Epanechnikov kernel, first with bandwidth 0.1 and again with bandwidth 1.0. The following shows how to specify a bandwidth manually (0.5 as an example, followed by 0.1 and 1.0), then feed the bandwidth object to the density estimation routine (we keep bandwidth selection and estimation separate for a number of reasons).

    bw <- npudensbw(~logwage,ckertype="epanechnikov",bws=0.5,bandwidth.compute=FALSE)
    fhat <- npudens(bws=bw)
    plot(fhat)
    bw <- npudensbw(~logwage,ckertype="epanechnikov",bws=0.1,bandwidth.compute=FALSE)
    fhat <- npudens(bws=bw)
    plot(fhat)
    bw <- npudensbw(~logwage,ckertype="epanechnikov",bws=1.0,bandwidth.compute=FALSE)
    fhat <- npudens(bws=bw)
    plot(fhat)
  • Repeat the above (i.e. compute and plot) using likelihood cross-validation and least-squares cross-validation. What are the differences between the cross-validated estimates and that based on the ad-hoc bandwidths (i.e. 0.1 and 1.0)?

    bw <- npudensbw(~logwage,ckertype="epanechnikov",bwmethod="cv.ml")
    fhat <- npudens(bws=bw)
    plot(fhat)
    bw <- npudensbw(~logwage,ckertype="epanechnikov",bwmethod="cv.ls")
    fhat <- npudens(bws=bw)
    plot(fhat)
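
    To compare the most recently computed cross-validated bandwidth with the ad-hoc values 0.1 and 1.0 directly (an aside, not part of the original instructions), you can extract the numeric value from the bandwidth object:

    bw$bw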
  • Construct and plot the likelihood cross-validated estimate with asymptotic and then bootstrap error bars (see ?npplot for help, but you will use plot(foo,plot.errors.method=...) where foo is your model (i.e. fhat above)). How do these error bars differ?

    ## Asymptotic
    bw <- npudensbw(~logwage,ckertype="epanechnikov",bwmethod="cv.ml")
    fhat <- npudens(bws=bw)
    plot(fhat,plot.errors.method="asymptotic",plot.errors.style="band")
    ## Bootstrap
    bw <- npudensbw(~logwage,ckertype="epanechnikov",bwmethod="cv.ml")
    fhat <- npudens(bws=bw)
    plot(fhat,plot.errors.method="bootstrap",plot.errors.style="band")

7 The R package np and applied nonparametric regression

  • Estimate a linear and quadratic parametric regression model via

    model.linear <- lm(logwage~age)

    and

    model.quadratic <- lm(logwage~age+I(age^2))
    model.linear <- lm(logwage~age)
    summary(model.linear)
    model.quadratic <- lm(logwage~age+I(age^2))
    summary(model.quadratic)
  • Describe the RESET test (Ramsey 1969) for functional form (i.e. consult the reference at the end of this assignment), install the lmtest library in R, and then test each model for correct parametric specification using resettest() and clearly report the outcome from this procedure. Based on this result, would you be comfortable using either of these models for applied work?

    library(lmtest)
    resettest(model.linear)
    resettest(model.quadratic)
  • Construct a local linear regression estimator for the regression of logwage on age (see ?npreg and ?npregbw for examples). Plot the resulting estimate and the asymptotic standard errors.

    bw <- npregbw(logwage~age,regtype="ll")
    model.ll <- npreg(bws=bw)
    plot(model.ll,plot.errors.method="asymptotic",plot.errors.style="band")
  • Plot the resulting gradient estimate and its asymptotic standard errors.

    bw <- npregbw(logwage~age,regtype="ll")
    model.ll <- npreg(bws=bw)
    plot(model.ll,plot.errors.method="asymptotic",plot.errors.style="band",gradients=TRUE)
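
    If you want the gradient values themselves rather than just the plot, one approach (a sketch; see ?npreg and ?gradients) is to re-estimate the model with gradients=TRUE and use the np generic gradients():

    model.ll.grad <- npreg(bws=bw,gradients=TRUE)
    head(gradients(model.ll.grad))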
  • Compare your estimates with those from the quadratic parametric model in a plot (the generic R function fitted() extracts fitted values from a model).

    plot(age,logwage,main="Quadratic Earnings Profile",xlab="Age",ylab="log(Wage)")
    lines(age,fitted(model.quadratic),col=1)
    lines(age,fitted(model.ll),col=2)
    legend(min(age),max(logwage),c("Quadratic","Kernel"),col=1:2,lty=1)
  • What is the in-sample fit (\(R^2\)) of the parametric and nonparametric models? On the basis of this criterion function, which model would you be most comfortable using? What are the drawbacks of using \(R^2\) as a guide to model selection?

    summary(model.quadratic)
    summary(model.ll)
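
    To put the two models on a common footing you might also (this is an aside on my part) compute \(R^2\) as the squared correlation between the observed and fitted values for each model:

    cor(logwage,fitted(model.quadratic))^2
    cor(logwage,fitted(model.ll))^2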
  • Often we wish to compute predictions or construct counter-factuals during the course of applied nonparametric regression. In R there is a generic function predict() that allows one to do this. You create an evaluation dataset then use the option predict(...,newdata=...) to generate the predictions. Let’s load Wooldridge’s wage1 dataset, then consider local constant estimation with two predictors, one continuous and one categorical. Note that one must exercise caution when creating the evaluation data, particularly when categorical predictors are involved.

    See also the generic functions residuals() and fitted().

    data(wage1)
    attach(wage1)
    ## Construct the parametric model
    model.lm <- lm(lwage~exper+female)
    ## Construct the nonparametric model
    model.lc <- npreg(lwage~exper+female)
    ## Create an evaluation dataset (counter-factual) for a female
    ## with 5 years of experience
    evaldata.female <- data.frame(exper=5,
                                  female=factor("Female",levels=levels(female)))
    
    ## Use the generic R function predict(...,newdata=...)
    predict(model.lm,newdata=evaldata.female)
    predict(model.lc,newdata=evaldata.female)
    ## Create an evaluation dataset (counter-factual) for a male
    ## with 5 years of experience
    evaldata.male <- data.frame(exper=5,
                                female=factor("Male",levels=levels(female)))
    ## Use the generic R function predict(...,newdata=...)
    predict(model.lm,newdata=evaldata.male)
    predict(model.lc,newdata=evaldata.male)
    
    ## We could compute Oaxaca-Blinder estimates of wage differentials
    ## Parametric wage differential (for Oaxaca-Blinder you would use
    ## the e.g. mean values of the female predictors which you could
    ## implement on your own)
    predict(model.lm,newdata=evaldata.female)-predict(model.lm,newdata=evaldata.male)
    ## Nonparametric wage differential
    predict(model.lc,newdata=evaldata.female)-predict(model.lc,newdata=evaldata.male)
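
    The generic functions residuals() and fitted() mentioned above also work directly on these model objects; for instance (depending on how the model was estimated you may need to pass residuals=TRUE to npreg() for the former):

    head(fitted(model.lc))
    head(residuals(model.lc))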

8 The R package np and advanced use of the npksum function

As noted in Section 5, the function npksum in the R package np computes kernel weighted sums of various types by making calls to compiled C code, and many functions in the np package rely on it (or on the underlying C code directly). Becoming familiar with npksum is particularly valuable if you wish to implement a novel technique that does not exist in any package, since it lets you generate efficient estimator prototypes very quickly indeed. Here we consider more advanced use.

  • You could use npksum to compute the Nadaraya-Watson regression estimator used above, \[\begin{equation*} \hat g(x) =\frac{\sum_{i=1}^n Y_i K_\gamma(X_i,x)} {\sum_{i=1}^n K_\gamma(X_i,x)}. \end{equation*}\] Compute the local constant estimator using the npksum function for the cps71 data.

    data(cps71)
    attach(cps71)
    ## Compute the bandwidths using npregbw()
    bw <- npregbw(xdat=age, ydat=logwage)
    
    ## Compute the fit using npksum
    fit.lc <- npksum(txdat=age, tydat=logwage, bws=bw$bw)$ksum/
              npksum(txdat=age, bws=bw$bw)$ksum
    
    ## Plot the results
    plot(age, logwage, xlab="Age", ylab="log(wage)")
    lines(age, fit.lc)
    
    ## Compare results with npreg() for the first 10 observations
    cbind(fit.lc,fitted(npreg(bws=bw)))[1:10,]
  • Often we also need to implement some form of data-driven bandwidth selection, such as for the local constant estimator defined using the npksum function outlined above. If we wanted to implement least-squares cross-validation for the Nadaraya-Watson estimator, this can be accomplished via npksum using the option leave.one.out=TRUE. Below we consider a simulated example with \(q=3\) continuous predictors. Minimization will rely on the R function nlm().

    ## We conduct least-squares cross-validation for the local-constant
    ## regression estimator. We first write an R function `ss' that
    ## computes the leave-one-out sum of squares using the npksum()
    ## function, and then feed this function, along with random starting
    ## values for the bandwidth vector, to the nlm() routine in R (nlm =
    ## Non-Linear Minimization). Finally, we compare results with the
    ## function npregbw() that is written solely in C and calls a tightly
    ## coupled C-level search routine.  Note that one could make repeated
    ## calls to nlm() using different starting values for h (highly
    ## recommended in general).
    
    ## Increase the number of digits printed out by default in R and avoid
    ## using scientific notation for this example (we wish to compare
    ## objective function minima)
    options(scipen=100, digits=12)
    
    ## Generate 100 observations from a simple DGP where one explanatory
    ## variable is irrelevant.
    n <- 100
    set.seed(42)
    x1 <- runif(n)
    x2 <- rnorm(n)
    x3 <- runif(n)
    ## Create a data-frame for the predictors
    txdat <- data.frame(x1, x2, x3)
    ## Note - x3 is irrelevant
    tydat <- x1 + sin(x2) + rnorm(n)
    
    ## Write an R function that returns the average leave-one-out sum of
    ## squared residuals for the local constant estimator based upon
    ## npksum(). This function accepts one argument and presumes that
    ## txdat and tydat have been defined already.
    ss <- function(h) {
    ## Test for valid (non-negative) bandwidths - return infinite penalty
    ## when this occurs
    
      if(min(h)<=0) {
    
        return(.Machine$double.xmax)
    
      } else {
    
          mean.loo <-  npksum(txdat,
                              tydat,
                              leave.one.out=TRUE,
                              bws=h)$ksum/
                       npksum(txdat,
                              leave.one.out=TRUE,
                              bws=h)$ksum
    
        return(mean((tydat-mean.loo)^2))
    
      }
    
    }
    
    ## Now pass this function to R's nlm() routine along with random starting
    ## values and place results in `nlm.return'.
    nlm.return <- nlm(ss, runif(NCOL(txdat)))
    
    ## Now compute the bandwidths using the np function
    ## npregbw.
    bw <- npregbw(xdat=txdat, ydat=tydat)
    
    ## Bandwidths from nlm()
    nlm.return$estimate
    
    ## Bandwidths from npregbw()
    bw$bw
    
    ## Least-squares objective function value (minimum) from nlm()
    nlm.return$minimum
    
    ## Least-squares objective function value (minimum) from npregbw()
    bw$fval

References

Pagan, A., and A. Ullah. 1999. Nonparametric Econometrics. Cambridge University Press.
Ramsey, J. B. 1969. “Tests for Specification Error in Classical Linear Least Squares Regression Analysis.” Journal of the Royal Statistical Society, Series B 31: 350–71.
Shapiro, S. S., and M. B. Wilk. 1965. “An Analysis of Variance Test for Normality (Complete Samples).” Biometrika 52 (3/4): 591–611.

Footnotes

  1. Function documentation is great if you know the name of the function you need, but it’s useless otherwise. An R vignette is a long-form guide to an R package written to provide a more complete picture of the package and its use.↩︎

  2. The foreign package ships with the standard R distribution (it is one of the “recommended” packages), so it does not require separate installation, unlike other packages we might wish to use.↩︎

  3. For some bizarre reason, certain closed proprietary systems seem to continually change the structure of their binary file formats - perhaps this is to encourage (read: force) users to update? But you ought to be able to save your Stata file in an earlier format within Stata and get around such planned obsolescence.↩︎

  4. The flow of these help pages is Description/Usage/Arguments/Details/Value/Usage Issues/Examples. So, you can always find Examples at the end of a help page, can find out what values a function returns in Value, etc.↩︎