Labs
1 How to proceed
The goal of these labs is to get you up to speed with certain aspects of working with R, presuming of course that you are not already familiar with R, RStudio, and applied nonparametric estimation in general. It is also my sincere hope that these labs will enable you to work with your own data in R as early as even the first lab session, if you choose.
Note that each chunk of code below can be cut and pasted into RStudio (or elsewhere) by moving your cursor to the upper right portion of the code chunk and then clicking on the clipboard icon (this copies the code to your clipboard, and you can then paste it into RStudio using whichever method you are accustomed to).
These labs complement material presented in the lectures rather than simply repeating what you have already seen. They present an under-the-hood view of capabilities that may not be obvious unless you have spent hours devouring vignettes1 and assorted help files. Note that once you have installed the np package you can view the various vignettes as follows:
## The vignette below presents an overview of the np package
vignette("np",package="np")
## The vignette below provides answers to frequently asked questions
vignette("np_faq",package="np")
## The vignette below presents an overview of entropy-based inference
## in the np package
vignette("entropy_np",package="np")
I expect you to work through these labs at your own pace over the course of this five-day workshop. Also, pull up the lecture slides and try the various examples presented there as well, as you see fit. Follow your curiosity and poke around! Read through the various help files and vignettes, and of course use me as a resource while I am physically present! Please do not hesitate to ask me anything whatsoever, and I will do my utmost to try to assist.
2 Preliminaries - reading datasets created by other software programs into R
One of the nice features of R lies in its ability to work with data stored in binary formats that have been generated by non-R (foreign) software. Furthermore, one can at the same time read data from a URL, thereby avoiding having to physically download and store the file locally, which in some cases is quite helpful.
The functionality for reading data in a variety of formats is contained in the foreign library.2 Load the foreign package and see the help file for read.dta (i.e. type ?read.dta after loading the foreign library). You load the foreign library as follows:
library(foreign)
Having accomplished this, try typing ?read.dta.
Let's do two things at once, namely a) read a binary Stata file into R and b) do so from a URL:
mydat <- read.dta(file="http://www.principlesofeconometrics.com/stata/mroz.dta")
You can see the names of the variables in the object mydat along with a summary of one of the variables as follows:
names(mydat)
summary(wage)
Oops - you have encountered one of the most common mistakes in R - scope is important, so in order to access data in the object mydat, you need to attach it first, so try the following:
attach(mydat)
summary(wage)
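As an aside, here is a minimal sketch (my addition) of two equivalent alternatives - many R users prefer these to attach() since attached variables can silently mask other objects:
## Access the column directly without attaching
summary(mydat$wage)
## Or evaluate an expression in the scope of the data frame
with(mydat, summary(wage))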
If you currently work with some data stored in the binary format of a non-R software platform (e.g. Stata, SPSS, SAS, Minitab), kindly try to read and attach the data in R right away. It is perhaps most beneficial for you to do this right away since you can then get up to speed by replicating something you are doing on another platform in the R environment.
So, at this stage you ought to be able to read and analyze data in R, an open, free, and extensible platform. And you too are now free, i.e. emancipated, because you no longer have to limit yourself to using a closed, proprietary system for which you are forced to pay for the privilege. You can also, if the need arises, write data in the same (closed) format (see ?write.dta). Note from ?read.dta and ?write.dta that you may only be able to read/write some versions of Stata binary files.3
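For instance, a minimal sketch (the file name here is hypothetical; see ?write.dta for the Stata versions supported via its version argument):
## Write the data frame back out in Stata's binary format
write.dta(mydat, file="mroz-copy.dta")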
3 Preliminaries - nonparametric estimation of density functions using base R
The questions below make use of a chi-squared random variable with \(\nu=5\) degrees of freedom, which we may express as \[\begin{align*} X\sim \chi^2_{5} \end{align*}\]
What is the mean \(E(X)\) and variance \(V(X)\) of a \(\chi^2\) random variable with \(\nu=5\) degrees of freedom?
Simulate a sample \(X\) of length \(10^6\), and compute the mean and variance of this sample using the following code chunk:
set.seed(42)
x <- rchisq(10^6,df=5)
mean(x)
var(x)
How do the mean and variance of the sample match up with your theoretical mean \(E(X)\) and variance \(V(X)\)? Why do they differ, if they do?
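For reference (my addition): if \(X\sim\chi^2_\nu\) then \(E(X)=\nu\) and \(V(X)=2\nu\), so here \(E(X)=5\) and \(V(X)=10\); the simulated moments differ from these only by sampling error, which vanishes as the sample size grows.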
Plot the parametric population density function \(f(x)\) by generating a sequence x.seq from 0 to 25 and then generating the density function using R's complement of parametric density functions (here we use dchisq()).
x.seq <- seq(0,25,length=100)
myden <- dchisq(x.seq,df=5)
plot(x.seq,myden,type="l")
Using R's density function, superimpose the nonparametric Rosenblatt-Parzen kernel density estimate on the parametric density above via something like
plot(x.seq,myden,xlab="x",ylab="f(x)",type="l")
lines(density(x),col=2,lty=2)
legend("topright",c("Parametric","Nonparametric"),col=1:2,lty=1:2,bty="n")
The R function density is an amazing function for Rosenblatt-Parzen kernel density estimation. For one, it is blazingly fast (it relies on a Fast Fourier Transform and uses plug-in bandwidth selectors, both of which are incredibly fast). But it is also limited to scalar \(x\), does not support categorical kernels, and so forth.
So, let's replicate the above using the nonparametric unconditional density function npudens in the R package np. First you must install the np package via install.packages("np") or via the RStudio install pane, and then load the library via library(np). Note that if you don't want screen I/O to be produced by functions in this library, you can disable it with options(np.messages=FALSE), which will prevent I/O such as Multistart 1 of 1 | from appearing while routines are executing.
## Note that you have to load the np package in order to execute e.g.,
## npudens(), npudensbw(), npreg() etc. (I may presume you will have loaded it
## already for some of the code chunks that follow).
library(np)
f <- npudens(tdat=x,edat=x.seq,bws=bw.nrd0(x))
plot(x.seq,myden,xlab="x",ylab="f(x)",type="l")
lines(x.seq,fitted(f),col=2,lty=2)
legend("topright",c("Parametric","Nonparametric"),col=1:2,lty=1:2,bty="n")
(The options tdat and edat are for training and evaluation data - typically we invoke the function via the formula interface, which is discussed below - note that bw.nrd0 is the default plug-in bandwidth selector used by the R function density.)
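As a quick illustration of the training/evaluation distinction, here is a sketch of my own (it assumes the objects x and x.seq and the plot from the chunk above are still in memory) in which the same fit is evaluated at whatever points you like simply by changing edat:
## Evaluate the density estimate on a coarser (hypothetical) grid and add
## the evaluation points to the existing plot
x.coarse <- seq(0,25,length=25)
f.coarse <- npudens(tdat=x,edat=x.coarse,bws=bw.nrd0(x))
points(x.coarse,fitted(f.coarse),col=3)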
4 The R package np and working with npudens
Many functions in R support what is known as the formula interface, and npudens is no exception. Furthermore, sometimes it is desirable to separate data-driven bandwidth selection from estimation of the density itself. In the np package this is accomplished by functions such as npudensbw (which handles bandwidth selection) and npudens (which handles density estimation). The authors of the np package have overloaded these functions (i.e. embedded lots of functionality) and tried to correctly guess intended usage where appropriate.
Here we consider the formula interface and least-squares cross-validation for bandwidth selection, and then feed the bandwidth object bw to the function that computes the density, among other things (i.e. we first invoke npudensbw and then invoke npudens).
set.seed(42)
x <- rchisq(1000,df=10)
bw <- npudensbw(~x,bwmethod="cv.ls")
summary(bw)
f <- npudens(bw)
summary(f)
plot(f)
(The tilde ~ is part of the R formula interface. In general, variables to the left of the tilde are left hand side variables and those to the right are right hand side ones. For unconditional density estimation there are no left hand side variables, hence the ~x formula. For (linear) regression you might have lm(y~x) or lm(y~x1+x2), while for nonparametric regression you might have npreg(y~x1+x2). Take note, however, that while lm(y~x1+x2) imposes a linear additive structure, npreg(y~x1+x2) does not - the use of the formula interface here is simply and solely to list the predictor variables x1 and x2.)
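To see the point about additive structure concretely, here is a small sketch of my own (the DGP and object names are illustrative, and it assumes the np package is loaded) in which the conditional mean contains an interaction that lm(y~x1+x2) cannot capture but npreg(y~x1+x2) can:
set.seed(42)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
## The DGP contains an interaction, so it is additive in neither predictor
y <- x1*x2 + rnorm(n,sd=0.1)
## Compare the linear additive fit with the unrestricted nonparametric fit
model.lm <- lm(y~x1+x2)
model.np <- npreg(y~x1+x2)
summary(model.lm)
summary(model.np)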
Above we invoked two functions, npudensbw and npudens. We could perform both of these actions in one step (if you don't provide a bandwidth but do provide the method, npudensbw will be called in the background, and any arguments provided to npudens will be passed along to npudensbw; note the bandwidth object will be stored as f$bws).
f <- npudens(~x,bwmethod="cv.ls")
summary(f$bws)
summary(f)
plot(f)
We might add asymptotic confidence bounds to the plot (plot calls npplot - see ?npplot for details).
plot(f,plot.errors.method="asymptotic",plot.errors.style="band")
We might also change the kernel function from the default (ckertype="gaussian") to, say, the Epanechnikov kernel (ckertype="epanechnikov").
f <- npudens(~x,ckertype="epanechnikov",bwmethod="cv.ls")
summary(f$bws)
summary(f)
plot(f)
We might wish to change the order of the kernel function from the default (order 2) to, say, order 4 (ckerorder=4).
f <- npudens(~x,ckerorder=4,bwmethod="cv.ls")
summary(f$bws)
summary(f)
plot(f)
We might even try a non-fixed bandwidth such as the adaptive nearest-neighbour approach.
f <- npudens(~x,bwtype="adaptive_nn",bwmethod="cv.ls")
summary(f$bws)
summary(f)
plot(f)
So, as you can see, there are many options we might wish to modify if the occasion arises. See ?npudensbw and ?npudens for further details, and this is probably a good time to become familiar with the flow of R's help system.4
5 The R package np and working with the npksum function
Sometimes you may need to compute kernel weighted sums of various objects. The function npksum in the R package np exists to compute kernel sums of various types. It makes calls to compiled C code, hence can be fairly fast. Many functions in the np package make calls to this function (or directly to the C code underlying this function). Becoming familiar with this function might be of value if you wish to implement a novel technique that does not exist in any package. By making calls to npksum you can generate efficient estimator prototypes very quickly indeed.
You could use npksum to compute the Rosenblatt-Parzen density estimate above. Let's do so for a simple illustration, recalling that the density estimate \(\hat f(x)\) is given by \[\begin{equation*} \hat f(x)=\frac{1}{nh}\sum_{i=1}^n K((x-X_i)/h). \end{equation*}\]
set.seed(42)
n <- 1000
x <- sort(rchisq(n,df=10))
h <- bw.nrd0(x)
f.hat <- npksum(~x,bws=h)$ksum/(n*h)
plot(x,f.hat,type="l")
(we sort the data simply so that we can plot the resulting density with a line).
You could superimpose the estimate given by the R function density for comparison purposes (they ought to be identical).
set.seed(42)
n <- 1000
x <- sort(rchisq(n,df=10))
h <- bw.nrd0(x)
f.hat <- npksum(~x,bws=h)$ksum/(n*h)
plot(x,f.hat,type="l")
lines(density(x),col=2,lty=2)
You could compute and plot the kernel function itself if you wished using npksum (here we consider the Epanechnikov kernel of orders 2 [the default], 4, 6, and 8).
Z <- seq(-sqrt(5),sqrt(5),length=100)
par(mfrow=c(2,2))
plot(Z,npksum(txdat=0,exdat=Z,bws=1,ckertype="epanechnikov",ckerorder=2)$ksum,
     ylab="kernel",type="l",main="Epanechnikov [order = 2]")
plot(Z,npksum(txdat=0,exdat=Z,bws=1,ckertype="epanechnikov",ckerorder=4)$ksum,
     ylab="kernel",type="l",main="Epanechnikov [order = 4]")
plot(Z,npksum(txdat=0,exdat=Z,bws=1,ckertype="epanechnikov",ckerorder=6)$ksum,
     ylab="kernel",type="l",main="Epanechnikov [order = 6]")
plot(Z,npksum(txdat=0,exdat=Z,bws=1,ckertype="epanechnikov",ckerorder=8)$ksum,
     ylab="kernel",type="l",main="Epanechnikov [order = 8]")
So, there is really no limit to what you might do with npksum. There are many options that can be fed to this function. Check out ?npksum for details, and scroll through the numerous examples for illustrations of its capabilities (it can save you tons of programming if you wish to prototype a new kernel-based approach).
6 The R package np and applied nonparametric density estimation
We consider a classic dataset (Pagan and Ullah 1999) consisting of a random sample (\(n=205\)) taken from the 1971 Canadian Census Public Use Tapes for male individuals having common education (grade 13). You can load the data and attach the variables logwage and age as follows:
library(np)
data(cps71)
attach(cps71)
## See what is in the dataset
names(cps71)
## For help try
?cps71
Estimate the density for logwage using the Gaussian parametric distribution, describe the Shapiro-Wilk test for normality (Shapiro and Wilk 1965) (i.e. consult the reference at the end of this assignment; also see ?shapiro.test), then apply the test to logwage and plot the resulting density estimate. Is the data consistent with this parametric specification?
library(np)
options(np.messages=FALSE)
data(cps71)
attach(cps71)
plot(sort(logwage),dnorm(sort(logwage),mean=mean(logwage),sd=sd(logwage)),type="l")
shapiro.test(logwage)
Compute and plot the kernel density estimator for logwage using an Epanechnikov kernel with the bandwidth 0.1 and again with the bandwidth 1.0. The following allows you to specify a bandwidth of 0.5, then feed the bandwidth object to the density estimation routine (we keep them separate for a number of reasons):
bw <- npudensbw(~logwage,ckertype="epanechnikov",bws=0.5,bandwidth.compute=FALSE)
fhat <- npudens(bws=bw)
plot(fhat)
bw <- npudensbw(~logwage,ckertype="epanechnikov",bws=0.1,bandwidth.compute=FALSE)
fhat <- npudens(bws=bw)
plot(fhat)
bw <- npudensbw(~logwage,ckertype="epanechnikov",bws=1.0,bandwidth.compute=FALSE)
fhat <- npudens(bws=bw)
plot(fhat)
Repeat the above (i.e. compute and plot) using likelihood cross-validation and least-squares cross-validation. What are the differences between the cross-validated estimates and those based on the ad hoc bandwidths (i.e. 0.1 and 1.0)?
bw <- npudensbw(~logwage,ckertype="epanechnikov",bwmethod="cv.ml")
fhat <- npudens(bws=bw)
plot(fhat)
bw <- npudensbw(~logwage,ckertype="epanechnikov",bwmethod="cv.ls")
fhat <- npudens(bws=bw)
plot(fhat)
Construct and plot the likelihood cross-validated estimate with asymptotic and then bootstrap error bars (see ?npplot for help, but you will use plot(foo,plot.errors.method=...) where foo is your model (i.e. fhat above)). How do these error bars differ?
## Asymptotic
bw <- npudensbw(~logwage,ckertype="epanechnikov",bwmethod="cv.ml")
fhat <- npudens(bws=bw)
plot(fhat,plot.errors.method="asymptotic",plot.errors.style="band")
## Bootstrap
bw <- npudensbw(~logwage,ckertype="epanechnikov",bwmethod="cv.ml")
fhat <- npudens(bws=bw)
plot(fhat,plot.errors.method="bootstrap",plot.errors.style="band")
7 The R package np and applied nonparametric regression
Estimate a linear and a quadratic parametric regression model via model.linear <- lm(logwage~age) and model.quadratic <- lm(logwage~age+I(age^2)):
model.linear <- lm(logwage~age)
summary(model.linear)
model.quadratic <- lm(logwage~age+I(age^2))
summary(model.quadratic)
Describe the RESET test (Ramsey 1969) for functional form (i.e. consult the reference at the end of this assignment), install the lmtest library in R, and then test each model for correct parametric specification using resettest(), clearly reporting the outcome from this procedure. Based on this result, would you be comfortable using either of these models for applied work?
library(lmtest)
resettest(model.linear)
resettest(model.quadratic)
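(A brief gloss of my own to complement the reference: the RESET test augments the fitted regression with powers of the fitted values and tests their joint significance; resettest() uses the second and third powers of the fitted values by default, which you can make explicit via
resettest(model.linear, power=2:3, type="fitted")
though you should confirm the details in ?resettest and in Ramsey (1969).)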
Construct a local linear regression estimator for the regression of logwage on age (see ?npreg and ?npregbw for examples). Plot the resulting estimate and the asymptotic standard errors.
bw <- npregbw(logwage~age,regtype="ll")
model.ll <- npreg(bws=bw)
plot(model.ll,plot.errors.method="asymptotic",plot.errors.style="band")
Plot the resulting gradient estimate and its asymptotic standard errors.
bw <- npregbw(logwage~age,regtype="ll")
model.ll <- npreg(bws=bw)
plot(model.ll,plot.errors.method="asymptotic",plot.errors.style="band",gradients=TRUE)
Compare your estimates with those from the quadratic parametric model in a plot (the generic R function fitted() extracts fitted values from a model).
plot(age,logwage,main="Quadratic Earnings Profile",xlab="Age",ylab="log(Wage)")
lines(age,fitted(model.quadratic),col=1)
lines(age,fitted(model.ll),col=2)
legend(min(age),max(logwage),c("Quadratic","Kernel"),col=1:2,lty=1)
What is the in-sample fit (\(R^2\)) of the parametric and nonparametric models? On the basis of this criterion function, which model would you be most comfortable using? What are the drawbacks of using \(R^2\) as a guide to model selection?
summary(model.quadratic)
summary(model.ll)
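If you prefer to see where the reported \(R^2\) comes from, here is a minimal sketch (my addition): one common definition, and one that applies equally to parametric and nonparametric fits, is the squared correlation between the response and the fitted values.
## In-sample R-squared computed by hand for both models
cor(logwage,fitted(model.quadratic))^2
cor(logwage,fitted(model.ll))^2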
Often we wish to compute predictions or construct counter-factuals during the course of applied nonparametric regression. In R there is a generic function predict() that allows one to do this. You create an evaluation dataset then use the option predict(...,newdata=...) to generate the predictions. Let's load Wooldridge's wage1 dataset then consider local constant estimation with two predictors, one continuous and one categorical. Note that one must exercise caution when creating the evaluation data, particularly when categorical predictors are involved. See also the generic functions residuals() and fitted().
data(wage1)
attach(wage1)
## Construct the parametric model
model.lm <- lm(lwage~exper+female)
## Construct the nonparametric model
model.lc <- npreg(lwage~exper+female)
## Create an evaluation dataset (counter-factual) for a female
## with 5 years of experience
evaldata.female <- data.frame(exper=5,
                              female=factor("Female",levels=levels(female)))
## Use the generic R function predict(...,newdata=...)
predict(model.lm,newdata=evaldata.female)
predict(model.lc,newdata=evaldata.female)
## Create an evaluation dataset (counter-factual) for a male
## with 5 years of experience
evaldata.male <- data.frame(exper=5,
                            female=factor("Male",levels=levels(female)))
## Use the generic R function predict(...,newdata=...)
predict(model.lm,newdata=evaldata.male)
predict(model.lc,newdata=evaldata.male)
## We could compute Oaxaca-Blinder estimates of wage differentials
## Parametric wage differential (for Oaxaca-Blinder you would use
## e.g. the mean values of the female predictors, which you could
## implement on your own)
predict(model.lm,newdata=evaldata.female)-predict(model.lm,newdata=evaldata.male)
## Nonparametric wage differential
predict(model.lc,newdata=evaldata.female)-predict(model.lc,newdata=evaldata.male)
8 The R package np and advanced use of the npksum function
As outlined in Section 5, the function npksum in the R package np computes kernel weighted sums of various types, and by making calls to npksum you can generate efficient estimator prototypes very quickly indeed.
You could use npksum to compute the Nadaraya-Watson regression estimator used above, \[\begin{equation*} \hat g(x) =\frac{\sum_{i=1}^n Y_i K_\gamma(X_i,x)} {\sum_{i=1}^n K_\gamma(X_i,x)}. \end{equation*}\] Compute the local constant estimator using the npksum function for the cps71 data.
data(cps71)
attach(cps71)
## Compute the bandwidths using npregbw()
bw <- npregbw(xdat=age, ydat=logwage)
## Compute the fit using npksum()
fit.lc <- npksum(txdat=age, tydat=logwage, bws=bw$bw)$ksum/
          npksum(txdat=age, bws=bw$bw)$ksum
## Plot the results
plot(age, logwage, xlab="Age", ylab="log(wage)")
lines(age, fit.lc)
## Compare results with npreg() for the first 10 observations
cbind(fit.lc,fitted(npreg(bws=bw)))[1:10,]
Often we also need to implement some form of data-driven bandwidth selection, such as for the local constant estimator defined using the npksum function outlined above. If we wanted to implement least-squares cross-validation for the Nadaraya-Watson estimator, this can be accomplished via npksum using the option leave.one.out=TRUE. Below we consider a simulated example with \(q=3\) continuous predictors. Minimization will rely on the R function nlm().
## We conduct least-squares cross-validation for the local-constant
## regression estimator. We first write an R function `ss' that
## computes the leave-one-out sum of squares using the npksum()
## function, and then feed this function, along with random starting
## values for the bandwidth vector, to the nlm() routine in R (nlm =
## Non-Linear Minimization). Finally, we compare results with the
## function npregbw() that is written solely in C and calls a tightly
## coupled C-level search routine. Note that one could make repeated
## calls to nlm() using different starting values for h (highly
## recommended in general).
## Increase the number of digits printed out by default in R and avoid
## using scientific notation for this example (we wish to compare
## objective function minima)
options(scipen=100, digits=12)
## Generate 100 observations from a simple DGP where one explanatory
## variable is irrelevant.
n <- 100
set.seed(42)
x1 <- runif(n)
x2 <- rnorm(n)
x3 <- runif(n)
## Create a data frame for the predictors
txdat <- data.frame(x1, x2, x3)
## Note - x3 is irrelevant
tydat <- x1 + sin(x2) + rnorm(n)
## Write an R function that returns the average leave-one-out sum of
## squared residuals for the local constant estimator based upon
## npksum(). This function accepts one argument and presumes that
## txdat and tydat have been defined already.
ss <- function(h) {
  ## Test for valid (non-negative) bandwidths - return an infinite penalty
  ## when this occurs
  if(min(h)<=0) {
    return(.Machine$double.xmax)
  } else {
    mean.loo <- npksum(txdat,
                       tydat,
                       leave.one.out=TRUE,
                       bws=h)$ksum/
                npksum(txdat,
                       leave.one.out=TRUE,
                       bws=h)$ksum
    return(mean((tydat-mean.loo)^2))
  }
}
## Now pass this function to R's nlm() routine along with random starting
## values and place results in `nlm.return'.
nlm.return <- nlm(ss, runif(NCOL(txdat)))
## Now compute the bandwidths using the np function npregbw().
bw <- npregbw(xdat=txdat, ydat=tydat)
## Bandwidths from nlm()
nlm.return$estimate
## Bandwidths from npregbw()
bw$bw
## Least-squares objective function value (minimum) from nlm()
nlm.return$minimum
## Least-squares objective function value (minimum) from npregbw()
bw$fval
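Since the comments above recommend repeated calls to nlm() from different starting values, here is a minimal multistart sketch (my addition, assuming ss and txdat from the chunk above are in memory):
## Keep the best of several nlm() runs from random bandwidth starting values
best <- list(minimum=Inf)
for(i in 1:5) {
  nlm.try <- nlm(ss, runif(NCOL(txdat)))
  if(nlm.try$minimum < best$minimum) best <- nlm.try
}
best$estimate
best$minimum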
References
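Pagan, A. and A. Ullah (1999), Nonparametric Econometrics, Cambridge: Cambridge University Press.
Ramsey, J. B. (1969), "Tests for Specification Errors in Classical Linear Least-Squares Regression Analysis," Journal of the Royal Statistical Society, Series B, 31, 350-371.
Shapiro, S. S. and M. B. Wilk (1965), "An Analysis of Variance Test for Normality (Complete Samples)," Biometrika, 52, 591-611.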
Footnotes
1. Function documentation is great if you know the name of the function you need, but it's useless otherwise. An R vignette is a long-form guide to an R package written to provide a more complete picture of the package and its use.
2. The foreign library is part of base R so does not require separate installation, unlike other packages we might wish to use.
3. For some bizarre reason, certain closed proprietary systems seem to continually change the structure of their binary read/write calls - perhaps this is to encourage (read: force) users to update? But you ought to be able to save your Stata file in an earlier format within Stata and get around such planned obsolescence.
4. The flow of these help pages is Description/Usage/Arguments/Details/Value/Usage Issues/Examples. So, you can always find Examples at the end of a help page, can find out what values a function returns in Value, etc.