HurdleDMR.jl is a Julia implementation of the Hurdle Distributed Multiple Regression (HDMR), as described in:
Kelly, Bryan, Asaf Manela, and Alan Moreira (2018). Text Selection. Working paper.
It includes a Julia implementation of the Distributed Multinomial Regression (DMR) model of Taddy (2015).
This tutorial explains how to use this package from R via the JuliaCall package that is available on CRAN.
First, install Julia itself. The easiest way to do that is from the download site https://julialang.org/downloads/. An alternative is to install JuliaPro from https://juliacomputing.com
Once installed, open start julia in a terminal and add the following packages:
Pkg.add("RCall")
Pkg.clone("https://github.com/AsafManela/Lasso.jl")
Pkg.clone("https://github.com/AsafManela/HurdleDMR.jl")
Now, back to R
install.packages("JuliaCall")
Load the JuliaCall library and setup julia
library(JuliaCall)
j <- julia_setup()
We can now use j$xx
to call julia as in
j$eval("1+2")
We will use for this example data that ships with the fantastic textir
package for R.
#install.packages("textir")
library(textir)
data(we8there)
covars <- we8thereRatings[,'Overall',drop=FALSE]
counts <- we8thereCounts
To compare the two, we first fit a dmr in R using textir
(a wrapper for distrom
).
Make a cluster of nprocs
processors
nprocs <- as.integer(detectCores()-2)
cl <- makeCluster(nprocs,type=ifelse(.Platform$OS.type=="unix","FORK","PSOCK"))
system.time(fits <- dmr(cl, covars, counts, verb=1))
Good. Now stop the cluster to clean up.
stopCluster(cl)
Get AICc optimal coefficients
BR <- coef(fits)
dim(BR)
Get SR projection onto factors
zR <- srproj(BR,counts)
dim(zR)
The fitted object can be used to it a forward regression to predict a covariate using the low dimensional SRproj of the counts
X <- cbind(covars,zR)
colnames(X) <- c("Overall","zOverall","m")
fmR <- lm(Overall ~ zOverall + m, X)
summary(fmR)
We can now predict Overall with new counts data
newX = as.data.frame(srproj(BR,counts[1:10,]))
colnames(newX) <-c("zOverall","m")
yhatdmrR <- predict(fmR, newX)
as.vector(yhatdmrR)
j$command("import SparseArrays")
j$assign("covars",covars)
## there are probably more efficient ways to pass the sparse matrix, but JuliaCall doesn't do this automatically at the moment
j$assign("counts",as.matrix(counts))
j$command("counts = SparseArrays.sparse(counts)")
add parallel workers
j$command("using Distributed")
j$call("addprocs", nprocs)
make our package available to all workers
j$command("import HurdleDMR; @everywhere using HurdleDMR")
system.time(j$command("m = fit(DMR,@model(c ~ 1 + Overall),covars,counts);"))
The above returns a lightweight wrapper with basically just the coefficients. To get the entire regularization paths, try the following
system.time(j$command("m = fit(DMRPaths,@model(c ~ 1 + Overall),covars,counts);"))
Julia compiles each function on its first call, which may be slower for one-off applications, but faster when the function is called many times. So to get a sense of runtime without that fixed cost, you may wish to run it again.
system.time(j$command("m = fit(DMR,@model(c ~ 1 + Overall),covars,counts);"))
On our machine, julia fits dmr in about half the time as R (see 'elapsed' entries above). The speed improvment is mostly due to sharing of memory across parallel workers.
Get AICc optimal coefficients
Bjulia <- j$eval("coef(m)")
dim(Bjulia)
Get SR projection onto factors
zjulia <- j$eval("srproj(m,counts)")
dim(zjulia)
Comparing zR to zjulia we see that the estimates are about the same.
all.equal(zR, zjulia, check.attributes = FALSE)
The differences are mostly due to default regularization paths differences.
The HurdleDMR package provides a general method to fit Counts Inverse Regressions (CIR), fit(CIR...)
that can fit both backward and forward parts of the MNIR. For example:
j$command("using GLM")
system.time(j$command("mnir = fit(CIR{DMR,LinearModel},@model(c ~ 1 + Overall),covars,counts,:Overall);"))
j$eval("mnir.model.fwdm")
The fitted model can be used for prediction as follows:
yhatdmrJ <- j$eval("predict(mnir,covars[1:10,:],counts[1:10,:])")
yhatdmrJ
Comparing the R and julia versions of the predicted values, they appear to be quite similar:
all.equal(yhatdmrR, yhatdmrJ, check.attributes = FALSE)
Another advantage of the julia package is allowing for text selection via HDMR. Here we specify the two parts of the model via two formulas:
system.time(j$command("m = fit(HDMR,@model(c ~ 1 + Overall, h ~ 1 + Overall),covars,counts);"))
Fitted HDMR involves two coefficient arrays, one for the model for positives c ~ ...
and one for the model for hurdle crossing or zeros h ~ ...
Cjulia <- j$eval("coef(m)")
The projection onto factors now gives [zpos, zzero, m] instead of [z, m] as before
Zjulia <- j$eval("srproj(m,counts)")
If we wish to run a CIR with HDMR instead of DMR, all we need to do is specify it in a very similar call to fit(CIR...)
:
system.time(j$command("cir = fit(CIR{HDMR,LinearModel},@model(c ~ 1 + Overall, h ~ 1 + Overall),covars,counts,:Overall);"))
j$eval("cir.model.fwdm")
The HDMR-based CIR model can be used to predict with new data
yhathdmr <- j$eval("predict(cir,covars[1:10,:],counts[1:10,:])")
yhathdmr
Comparing the three predicted values shows only minor differences in this toy example.
cbind(yhatdmrR,yhatdmrJ,yhathdmr)
Kelly, Manela, and Moreira (2018) show, however, that the differences between DMR and HDMR can be substantial in some cases, especially when the counts data is highly sparse.
Please reference the paper for additional details and example applications.