HurdleDMR from R

HurdleDMR.jl is a Julia implementation of the Hurdle Distributed Multiple Regression (HDMR), as described in:

Kelly, Bryan, Asaf Manela, and Alan Moreira (2018). Text Selection. Working paper.

It includes a Julia implementation of the Distributed Multinomial Regression (DMR) model of Taddy (2015).

This tutorial explains how to use this package from R via the JuliaCall package that is available on CRAN.

Setup

Install Julia

First, install Julia itself. The easiest way is to download it from https://julialang.org/downloads/. Alternatively, install JuliaPro from https://juliacomputing.com

Once installed, start Julia in a terminal and add the following packages:

Pkg.add("RCall")
Pkg.clone("https://github.com/AsafManela/Lasso.jl")
Pkg.clone("https://github.com/AsafManela/HurdleDMR.jl")

The JuliaCall package for R

Now, back to R

In [1]:
install.packages("JuliaCall")

Load the JuliaCall library and set up Julia:

In [2]:
library(JuliaCall)
j <- julia_setup()
Julia version 1.0.3 at location C:\Users\user\AppData\Local\JULIA-~1.3\bin will be used.
Loading setup script for JuliaCall...
Finish loading setup script for JuliaCall.

We can now use methods like j$eval to call Julia, as in:

In [3]:
j$eval("1+2")
3
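
Besides j$eval, the steps below use a few other JuliaCall entry points. A minimal illustration (x and y are throwaway names):

j$assign("x", 1:3)       # copy an R vector to the Julia variable x
j$command("y = sum(x)")  # run Julia code for its side effect
j$eval("y")              # evaluate Julia code and return the result to R
j$call("sum", 1:3)       # call a Julia function directly on R arguments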

Example data

For this example we will use data that ships with the fantastic textir package for R.

In [4]:
#install.packages("textir") 
library(textir)

data(we8there)

covars <- we8thereRatings[,'Overall',drop=FALSE]
counts <- we8thereCounts
Loading required package: distrom
Loading required package: Matrix
Loading required package: gamlr
Loading required package: parallel

Benchmark in R

To compare the two implementations, we first fit a DMR in R using textir (a wrapper for distrom).

Make a cluster of nprocs processors

In [5]:
nprocs <- as.integer(detectCores()-2)
cl <- makeCluster(nprocs,type=ifelse(.Platform$OS.type=="unix","FORK","PSOCK"))

Fit distributed multinomial regression (DMR) in parallel

In [6]:
system.time(fits <- dmr(cl, covars, counts, verb=1))
fitting 6166 observations on 2640 categories, 1 covariates.
converting counts matrix to column list...
distributed run.
socket cluster with 10 nodes on host 'localhost'
   user  system elapsed 
   0.33    0.08   26.83 

Good. Now stop the cluster to clean up.

In [7]:
stopCluster(cl)

Get AICc optimal coefficients

In [8]:
BR <- coef(fits)
dim(BR)
  1. 2
  2. 2640

Get SR projection onto factors

In [9]:
zR <- srproj(BR,counts) 
dim(zR)
  1. 6166
  2. 2
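
Conceptually, srproj collapses each document's counts onto the fitted loadings and normalizes by document length. A rough sketch of the idea, not textir's exact implementation (and assuming the coefficient row for the covariate is named "Overall"):

m <- rowSums(counts)                              # total counts per document
z <- as.matrix(counts %*% BR["Overall", ]) / m    # projection onto the Overall loadings
## zR is essentially cbind(z, m)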

Multinomial inverse regression (MNIR)

The fitted object can be used to fit a forward regression that predicts a covariate from the low-dimensional SR projection of the counts.

In [10]:
X <- cbind(covars,zR)
colnames(X) <- c("Overall","zOverall","m")
fmR <- lm(Overall ~ zOverall + m, X)
summary(fmR)
Call:
lm(formula = Overall ~ zOverall + m, data = X)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5142 -0.5608  0.1370  0.6838  4.0842 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.402149   0.019292 176.348  < 2e-16 ***
zOverall    3.181332   0.041696  76.298  < 2e-16 ***
m           0.006737   0.001096   6.146 8.42e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9654 on 6163 degrees of freedom
Multiple R-squared:  0.4896,	Adjusted R-squared:  0.4894 
F-statistic:  2956 on 2 and 6163 DF,  p-value: < 2.2e-16

We can now predict Overall with new counts data

In [11]:
newX = as.data.frame(srproj(BR,counts[1:10,]))
colnames(newX) <-c("zOverall","m")
yhatdmrR <- predict(fmR, newX)
as.vector(yhatdmrR)
  1. 4.86419983554951
  2. 2.24484606535357
  3. 5.65307467025703
  4. 4.568446245483
  5. 4.66848593149624
  6. 5.06290073040162
  7. 3.66278776217998
  8. 4.47468035358368
  9. 4.00369363965046
  10. 7.35463703886027

Same model but in Julia

Now let's try that in Julia.

First, we need to pass the data to Julia.

In [12]:
j$command("import SparseArrays")
j$assign("covars",covars)
## there are probably more efficient ways to pass the sparse matrix, but JuliaCall doesn't do this automatically at the moment
j$assign("counts",as.matrix(counts))
j$command("counts = SparseArrays.sparse(counts)")
6166×2640 SparseArrays.SparseMatrixCSC{Float64,Int64} with 66459 stored entries:
  [11  ,    1]  =  1.0
  [20  ,    1]  =  1.0
  [43  ,    1]  =  1.0
  [63  ,    1]  =  1.0
  [80  ,    1]  =  1.0
  [87  ,    1]  =  1.0
  [88  ,    1]  =  1.0
  [97  ,    1]  =  1.0
  [141 ,    1]  =  1.0
  ⋮
  [1273, 2640]  =  1.0
  [1955, 2640]  =  1.0
  [2509, 2640]  =  1.0
  [2842, 2640]  =  1.0
  [3929, 2640]  =  1.0
  [4314, 2640]  =  1.0
  [4862, 2640]  =  1.0
  [5702, 2640]  =  1.0
  [6007, 2640]  =  1.0
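
A more memory-friendly route is to pass the triplet representation and rebuild the sparse matrix in Julia. A sketch, assuming counts is a Matrix-package sparse matrix (the names trip, I, J, V, nr, nc are ours):

trip <- summary(counts)                 # i, j, x triplets of the sparse matrix
j$assign("I", as.integer(trip$i))
j$assign("J", as.integer(trip$j))
j$assign("V", as.numeric(trip$x))
j$assign("nr", nrow(counts))
j$assign("nc", ncol(counts))
j$command("counts = SparseArrays.sparse(I, J, V, nr, nc)")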

Add parallel workers

In [13]:
j$command("using Distributed")
j$call("addprocs", nprocs)
  1. 2
  2. 3
  3. 4
  4. 5
  5. 6
  6. 7
  7. 8
  8. 9
  9. 10
  10. 11

Make our package available to all workers

In [14]:
j$command("import HurdleDMR; @everywhere using HurdleDMR")

Fit distributed multinomial regression (DMR) in Julia

In [15]:
system.time(j$command("m = fit(DMR,@model(c ~ 1 + Overall),covars,counts);"))
   user  system elapsed 
  16.54    0.14   59.09 

The above returns a lightweight wrapper holding essentially just the coefficients. To keep the entire regularization paths, fit a DMRPaths model instead:

In [16]:
system.time(j$command("m = fit(DMRPaths,@model(c ~ 1 + Overall),covars,counts);"))
   user  system elapsed 
   5.47    1.45   11.50 

Julia compiles each function the first time it is called, which makes one-off calls slower but repeated calls faster. To get a sense of the runtime without that fixed cost, run the fit again:

In [17]:
system.time(j$command("m = fit(DMR,@model(c ~ 1 + Overall),covars,counts);"))
   user  system elapsed 
   0.02    0.04    4.52 

On our machine, Julia fits the DMR in about half the time R takes (see the 'elapsed' entries above). The speed improvement is mostly due to memory being shared across the parallel workers.

Get AICc optimal coefficients

In [18]:
Bjulia <- j$eval("coef(m)")
dim(Bjulia)
  1. 2
  2. 2640

Get SR projection onto factors

In [19]:
zjulia <- j$eval("srproj(m,counts)")
dim(zjulia)
  1. 6166
  2. 2

Comparing zR to zjulia, we see that the estimates are about the same.

In [20]:
all.equal(zR, zjulia, check.attributes = FALSE)
'Mean relative difference: 0.0250051'

The differences are mostly due to differences in the default regularization paths.
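
The same check can be applied to the coefficient matrices themselves (as.matrix guards against a sparse return type from coef):

all.equal(as.matrix(BR), Bjulia, check.attributes = FALSE)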

Multinomial inverse regression (MNIR)

The HurdleDMR package provides a general method for fitting counts inverse regressions (CIR), fit(CIR...), which estimates both the backward and forward parts of the MNIR. For example:

In [21]:
j$command("using GLM")
system.time(j$command("mnir = fit(CIR{DMR,LinearModel},@model(c ~ 1 + Overall),covars,counts,:Overall);"))
   user  system elapsed 
   1.45    0.05    5.86 
In [22]:
j$eval("mnir.model.fwdm")
Julia Object of type LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}.
LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
       Estimate  Std.Error t value Pr(>|t|)
x1      3.42207  0.0194634 175.821   <1e-99
x2      3.12546  0.0420257 74.3701   <1e-99
x3   0.00619096 0.00110878 5.58357    <1e-7

In the table above, x1, x2, and x3 correspond to the intercept, zOverall, and m, respectively. The fitted model can be used for prediction as follows:

In [23]:
yhatdmrJ <- j$eval("predict(mnir,covars[1:10,:],counts[1:10,:])")
yhatdmrJ
  1. 4.82560813985272
  2. 2.29188447231926
  3. 5.64079872447367
  4. 4.54721361660461
  5. 4.64900142586795
  6. 5.01050775607373
  7. 3.67264187569116
  8. 4.44109783266488
  9. 3.97904110850575
  10. 7.29252606765281

The R and Julia versions of the predicted values appear to be quite similar:

In [24]:
all.equal(yhatdmrR, yhatdmrJ, check.attributes = FALSE)
'Mean relative difference: 0.006899308'

Hurdle Distributed Multiple Regression (HDMR)

Another advantage of the Julia package is that it allows for text selection via HDMR. Here we specify the two parts of the model via two formulas:

In [25]:
system.time(j$command("m = fit(HDMR,@model(c ~ 1 + Overall, h ~ 1 + Overall),covars,counts);"))
   user  system elapsed 
   2.34    0.10   21.34 

A fitted HDMR involves two coefficient arrays: one for the positive-counts model c ~ ... and one for the hurdle-crossing (zeros) model h ~ ...

In [26]:
Cjulia <- j$eval("coef(m)")
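
In R this pair arrives as a list of two matrices (assuming JuliaCall's default conversion of Julia tuples), so the parts can be unpacked as:

Cpos  <- Cjulia[[1]]   # coefficients of the positive-counts model
Czero <- Cjulia[[2]]   # coefficients of the hurdle-crossing (zeros) model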

The projection onto factors now gives [zpos, zzero, m] instead of [z, m] as before

In [27]:
Zjulia <- j$eval("srproj(m,counts)")
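
A quick sanity check on the shape; with a single covariate we expect three columns:

dim(Zjulia)   # 6166 x 3, columns [zpos, zzero, m]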

If we wish to run a CIR with HDMR instead of DMR, all we need to do is specify it in a very similar call to fit(CIR...):

In [28]:
system.time(j$command("cir = fit(CIR{HDMR,LinearModel},@model(c ~ 1 + Overall, h ~ 1 + Overall),covars,counts,:Overall);"))
   user  system elapsed 
   0.82    0.04   10.67 
In [29]:
j$eval("cir.model.fwdm")
Julia Object of type LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}.
LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
       Estimate  Std.Error t value Pr(>|t|)
x1      3.45647   0.019695   175.5   <1e-99
x2     0.325451   0.116531 2.79282   0.0052
x3      3.21951  0.0443889 72.5297   <1e-99
x4   0.00557669 0.00111955  4.9812    <1e-6

The HDMR-based CIR model can be used to predict with new data

In [30]:
yhathdmr <- j$eval("predict(cir,covars[1:10,:],counts[1:10,:])")
yhathdmr
  1. 4.89765647648426
  2. 2.1886062371493
  3. 5.74273601134927
  4. 4.59929418883643
  5. 4.70140333079505
  6. 4.87705036869567
  7. 3.6771148111424
  8. 4.35870860551018
  9. 4.03657422748465
  10. 7.44228280574888

Comparing the three predicted values shows only minor differences in this toy example.

In [31]:
cbind(yhatdmrR,yhatdmrJ,yhathdmr)
   yhatdmrR yhatdmrJ yhathdmr
1  4.864200 4.825608 4.897656
2  2.244846 2.291884 2.188606
5  5.653075 5.640799 5.742736
11 4.568446 4.547214 4.599294
12 4.668486 4.649001 4.701403
13 5.062901 5.010508 4.877050
14 3.662788 3.672642 3.677115
15 4.474680 4.441098 4.358709
17 4.003694 3.979041 4.036574
18 7.354637 7.292526 7.442283

Kelly, Manela, and Moreira (2018) show, however, that the differences between DMR and HDMR can be substantial in some cases, especially when the counts data is highly sparse.

Please refer to the paper for additional details and example applications.