HurdleDMR

HurdleDMR.jl is a Julia implementation of the Hurdle Distributed Multinomial Regression (HDMR), as described in:

Kelly, Bryan, Asaf Manela, and Alan Moreira (2018). Text Selection. Working paper.

It includes a Julia implementation of the Distributed Multinomial Regression (DMR) model of Taddy (2015).

This tutorial explains how to use this package.

Setup

Install Julia

First, install Julia itself. The easiest way is to download it from https://julialang.org/downloads/. An alternative is to install JuliaPro from https://juliacomputing.com

Once installed, open julia in a terminal (or in Juno), press ] to activate the package manager, and add the following package:

pkg> add HurdleDMR

Add parallel workers and make the package available to the workers

In [1]:
using Distributed
addprocs(4)
import HurdleDMR; @everywhere using HurdleDMR

Example Data

Set up your data as an n-by-p covars matrix and a (sparse) n-by-d counts matrix. Here we generate some random data.

If not already installed, install the following packages with

pkg> add CSV GLM Lasso DataFrames Distributions
In [2]:
using CSV, GLM, Lasso, DataFrames, Distributions, Random, LinearAlgebra, SparseArrays
n = 100   # number of observations
p = 3     # number of covariates
d = 4     # number of categories

Random.seed!(13)
m = 1 .+ rand(Poisson(5),n)                  # total counts per observation
covars = rand(n,p)                           # n-by-p covariates matrix
ηfn(vi) = exp.([0 + i*sum(vi) for i=1:d])    # unnormalized category intensities
q = [ηfn(covars[i,:]) for i=1:n]             # one intensity vector per observation
rmul!.(q,ones(n)./sum.(q))                   # normalize each q[i] to sum to one
counts = convert(SparseMatrixCSC{Float64,Int},hcat(broadcast((qi,mi)->rand(Multinomial(mi, qi)),q,m)...)')  # sparse n-by-d counts
covarsdf = DataFrame(covars,[:vy, :v1, :v2]) # same covariates as a DataFrame
Out[2]:

100 rows × 3 columns (showing the first 30)

│ Row │ vy         │ v1       │ v2         │
│     │ Float64    │ Float64  │ Float64    │
├─────┼────────────┼──────────┼────────────┤
│ 1   │ 0.693073   │ 0.877116 │ 0.401554   │
│ 2   │ 0.938163   │ 0.737491 │ 0.997271   │
│ 3   │ 0.755878   │ 0.743268 │ 0.595892   │
│ 4   │ 0.191058   │ 0.296443 │ 0.30533    │
│ 5   │ 0.00753542 │ 0.360474 │ 0.335553   │
│ 6   │ 0.410974   │ 0.773871 │ 0.657641   │
│ 7   │ 0.279942   │ 0.154284 │ 0.321258   │
│ 8   │ 0.208454   │ 0.849653 │ 0.22147    │
│ 9   │ 0.639872   │ 0.926706 │ 0.444675   │
│ 10  │ 0.269132   │ 0.83785  │ 0.0137366  │
│ 11  │ 0.704959   │ 0.120137 │ 0.401541   │
│ 12  │ 0.820248   │ 0.379542 │ 0.704862   │
│ 13  │ 0.752849   │ 0.745383 │ 0.907775   │
│ 14  │ 0.634401   │ 0.383528 │ 0.276991   │
│ 15  │ 0.370604   │ 0.595542 │ 0.0999965  │
│ 16  │ 0.400454   │ 0.596132 │ 0.00424357 │
│ 17  │ 0.331037   │ 0.777271 │ 0.963936   │
│ 18  │ 0.109256   │ 0.45404  │ 0.873842   │
│ 19  │ 0.0384627  │ 0.358023 │ 0.369017   │
│ 20  │ 0.655115   │ 0.984853 │ 0.284056   │
│ 21  │ 0.961148   │ 0.425115 │ 0.836061   │
│ 22  │ 0.313693   │ 0.631212 │ 0.33691    │
│ 23  │ 0.468663   │ 0.203475 │ 0.0895971  │
│ 24  │ 0.130995   │ 0.416474 │ 0.25323    │
│ 25  │ 0.510355   │ 0.418123 │ 0.542134   │
│ 26  │ 0.763879   │ 0.501635 │ 0.257008   │
│ 27  │ 0.139297   │ 0.337539 │ 0.143543   │
│ 28  │ 0.315356   │ 0.838806 │ 0.0502037  │
│ 29  │ 0.359166   │ 0.992776 │ 0.882517   │
│ 30  │ 0.511267   │ 0.983282 │ 0.976795   │
│  ⋮  │     ⋮      │    ⋮     │     ⋮      │

Distributed Multinomial Regression (DMR)

The Distributed Multinomial Regression (DMR) model of Taddy (2015) is a highly scalable approximation to the Multinomial using distributed (independent, parallel) Poisson regressions, one for each of the d categories (columns) of a large counts matrix, on the covars.
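
To build intuition, here is a rough conceptual sketch of that approximation, assuming plain unregularized Poisson GLMs for clarity (HurdleDMR itself fits regularized GammaLassoPath regressions in parallel): each category's counts are regressed on the covariates, with the log of each observation's total count as an offset standing in for the multinomial normalization.

# Conceptual sketch only, not the package's internal code.
using GLM
mtot = vec(sum(counts, dims=2))      # total count per observation
off = log.(mtot)                     # plug-in for the multinomial normalization
B = zeros(p + 1, d)                  # intercept + covariates, one column per category
for j in 1:d
    yj = Vector(counts[:, j])        # category j counts, densified
    B[:, j] = coef(glm([ones(n) covars], yj, Poisson(); offset=off))
end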

To fit a DMR:

In [3]:
m = dmr(covars, counts)
┌ Info: fitting 100 observations on 4 categories, 3 covariates 
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/dmr.jl:302
┌ Info: distributed poisson run on local cluster with 4 nodes
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/dmr.jl:314
Out[3]:
DMRCoefs{Array{Float64,2},MinAICc}([-1.84481 -0.478703 -0.631308 -0.760733; -1.88824 -2.50035 -0.898109 0.368329; -0.91088 -2.47381 -0.318847 0.201772; -2.24309 -1.38126 -1.22846 0.37388], true, 100, 4, 3, MinAICc(2))

or with a dataframe and formula

In [4]:
mf = @model(c ~ vy + v1 + v2)
m = fit(DMR, mf, covarsdf, counts)
┌ Info: fitting 100 observations on 4 categories, 3 covariates 
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/dmr.jl:302
┌ Info: distributed poisson run on local cluster with 4 nodes
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/dmr.jl:314
Out[4]:
DMRCoefs{Array{Float64,2},MinAICc}([-1.84481 -0.478703 -0.631308 -0.760733; -1.88824 -2.50035 -0.898109 0.368329; -0.91088 -2.47381 -0.318847 0.201772; -2.24309 -1.38126 -1.22846 0.37388], true, 100, 4, 3, MinAICc(2))

in either case we can get the coefficients matrix for each variable + intercept as usual with

In [5]:
coef(m)
Out[5]:
4×4 Array{Float64,2}:
 -1.84481  -0.478703  -0.631308  -0.760733
 -1.88824  -2.50035   -0.898109   0.368329
 -0.91088  -2.47381   -0.318847   0.201772
 -2.24309  -1.38126   -1.22846    0.37388 

By default we only return the AICc-minimizing coefficients. To also get back the entire regularization paths, run

In [6]:
paths = fit(DMRPaths, mf, covarsdf, counts)
┌ Info: fitting 100 observations on 4 categories, 3 covariates 
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/dmr.jl:362
┌ Info: distributed poisson run on remote cluster with 4 nodes
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/dmr.jl:384
Out[6]:
DMRPaths(Union{Missing, GammaLassoPath}[Poisson GammaLassoPath (50) solutions for 4 predictors in 257 iterations):
──────────────────────────────────────
                λ      pct_dev  ncoefs
──────────────────────────────────────
 [1]  0.0761628    4.77396e-15       0
 [2]  0.0693967    0.0160918         1
 [3]  0.0632317    0.0296043         1
 [4]  0.0576143    0.0452719         2
 [5]  0.052496     0.0614706         2
 [6]  0.0478324    0.0751983         2
 [7]  0.0435831    0.0868424         2
 [8]  0.0397113    0.0967254         2
 [9]  0.0361835    0.105115          2
[10]  0.032969     0.11224           2
[11]  0.0300402    0.118282          2
[12]  0.0273715    0.124314          3
[13]  0.0249399    0.130633          3
[14]  0.0227243    0.135993          3
[15]  0.0207055    0.140525          3
[16]  0.0188661    0.144358          3
[17]  0.0171901    0.147597          3
[18]  0.015663     0.150333          3
[19]  0.0142715    0.152641          3
[20]  0.0130037    0.154586          3
[21]  0.0118485    0.156225          3
[22]  0.0107959    0.157604          3
[23]  0.0098368    0.158764          3
[24]  0.00896292   0.159738          3
[25]  0.00816668   0.160556          3
[26]  0.00744118   0.161242          3
[27]  0.00678012   0.161818          3
[28]  0.0061778    0.162299          3
[29]  0.00562898   0.162703          3
[30]  0.00512891   0.16304           3
[31]  0.00467328   0.163322          3
[32]  0.00425812   0.163558          3
[33]  0.00387984   0.163755          3
[34]  0.00353516   0.163919          3
[35]  0.00322111   0.164056          3
[36]  0.00293495   0.164171          3
[37]  0.00267422   0.164266          3
[38]  0.00243665   0.164346          3
[39]  0.00222018   0.164412          3
[40]  0.00202295   0.164467          3
[41]  0.00184324   0.164513          3
[42]  0.00167949   0.164551          3
[43]  0.00153029   0.164583          3
[44]  0.00139434   0.16461           3
[45]  0.00127047   0.164632          3
[46]  0.00115761   0.16465           3
[47]  0.00105477   0.164665          3
[48]  0.000961065  0.164678          3
[49]  0.000875687  0.164689          3
[50]  0.000797893  0.164698          3
──────────────────────────────────────, Poisson GammaLassoPath (51) solutions for 4 predictors in 282 iterations):
─────────────────────────────────────
               λ      pct_dev  ncoefs
─────────────────────────────────────
 [1]  0.169502    1.77636e-14       0
 [2]  0.154444    0.0291542         2
 [3]  0.140723    0.0623827         2
 [4]  0.128222    0.090534          2
 [5]  0.116831    0.116502          3
 [6]  0.106452    0.141835          3
 [7]  0.0969951   0.163402          3
 [8]  0.0883783   0.181775          3
 [9]  0.080527    0.197429          3
[10]  0.0733732   0.210765          3
[11]  0.066855    0.222122          3
[12]  0.0609158   0.23179           3
[13]  0.0555042   0.240012          3
[14]  0.0505733   0.247             3
[15]  0.0460805   0.252934          3
[16]  0.0419869   0.257966          3
[17]  0.0382569   0.262231          3
[18]  0.0348582   0.265841          3
[19]  0.0317615   0.268893          3
[20]  0.0289399   0.271471          3
[21]  0.026369    0.273647          3
[22]  0.0240264   0.27548           3
[23]  0.021892    0.277024          3
[24]  0.0199472   0.278323          3
[25]  0.0181751   0.279415          3
[26]  0.0165605   0.280332          3
[27]  0.0150893   0.281101          3
[28]  0.0137488   0.281746          3
[29]  0.0125274   0.282286          3
[30]  0.0114145   0.282739          3
[31]  0.0104005   0.283117          3
[32]  0.00947652  0.283434          3
[33]  0.00863465  0.283698          3
[34]  0.00786757  0.283919          3
[35]  0.00716864  0.284104          3
[36]  0.0065318   0.284258          3
[37]  0.00595153  0.284386          3
[38]  0.00542281  0.284493          3
[39]  0.00494107  0.284582          3
[40]  0.00450211  0.284657          3
[41]  0.00410216  0.284719          3
[42]  0.00373773  0.28477           3
[43]  0.00340568  0.284813          3
[44]  0.00310313  0.284849          3
[45]  0.00282746  0.284879          3
[46]  0.00257628  0.284903          3
[47]  0.00234741  0.284924          3
[48]  0.00213887  0.284941          3
[49]  0.00194886  0.284955          3
[50]  0.00177573  0.284967          3
[51]  0.00161798  0.284977          3
─────────────────────────────────────, Poisson GammaLassoPath (51) solutions for 4 predictors in 248 iterations):
───────────────────────────────────
               λ    pct_dev  ncoefs
───────────────────────────────────
 [1]  0.420148    0.0             0
 [2]  0.382823    0.0275775       1
 [3]  0.348814    0.050612        1
 [4]  0.317827    0.0698654       1
 [5]  0.289592    0.0859654       1
 [6]  0.263865    0.108082        2
 [7]  0.240424    0.127106        2
 [8]  0.219066    0.143043        2
 [9]  0.199604    0.156396        2
[10]  0.181872    0.167582        2
[11]  0.165715    0.17695         2
[12]  0.150993    0.184793        2
[13]  0.13758     0.191356        2
[14]  0.125357    0.196847        2
[15]  0.114221    0.201437        2
[16]  0.104074    0.205275        2
[17]  0.0948282   0.209376        3
[18]  0.086404    0.213349        3
[19]  0.0787281   0.216666        3
[20]  0.0717341   0.219435        3
[21]  0.0653614   0.221744        3
[22]  0.0595549   0.223671        3
[23]  0.0542642   0.225277        3
[24]  0.0494435   0.226615        3
[25]  0.0450511   0.22773         3
[26]  0.0410489   0.22866         3
[27]  0.0374022   0.229433        3
[28]  0.0340795   0.230077        3
[29]  0.031052    0.230614        3
[30]  0.0282934   0.23106         3
[31]  0.0257799   0.231431        3
[32]  0.0234897   0.23174         3
[33]  0.0214029   0.231997        3
[34]  0.0195015   0.232211        3
[35]  0.0177691   0.232388        3
[36]  0.0161905   0.232536        3
[37]  0.0147522   0.232659        3
[38]  0.0134417   0.232761        3
[39]  0.0122475   0.232846        3
[40]  0.0111595   0.232916        3
[41]  0.0101681   0.232975        3
[42]  0.00926481  0.233024        3
[43]  0.00844175  0.233064        3
[44]  0.00769181  0.233098        3
[45]  0.00700849  0.233126        3
[46]  0.00638588  0.233149        3
[47]  0.00581857  0.233168        3
[48]  0.00530167  0.233184        3
[49]  0.00483068  0.233197        3
[50]  0.00440154  0.233208        3
[51]  0.00401052  0.233218        3
───────────────────────────────────, Poisson GammaLassoPath (52) solutions for 4 predictors in 246 iterations):
─────────────────────────────────────
               λ      pct_dev  ncoefs
─────────────────────────────────────
 [1]  0.630683    3.49498e-13       1
 [2]  0.574655    0.0368374         1
 [3]  0.523604    0.068304          2
 [4]  0.477088    0.111223          2
 [5]  0.434705    0.146782          2
 [6]  0.396087    0.176252          2
 [7]  0.3609      0.200683          2
 [8]  0.328839    0.220941          2
 [9]  0.299625    0.237741          2
[10]  0.273008    0.255802          3
[11]  0.248754    0.273247          3
[12]  0.226656    0.287713          3
[13]  0.20652     0.29971           3
[14]  0.188174    0.309662          3
[15]  0.171457    0.317918          3
[16]  0.156225    0.324767          3
[17]  0.142346    0.33045           3
[18]  0.129701    0.335166          3
[19]  0.118178    0.339079          3
[20]  0.10768     0.342326          3
[21]  0.0981138   0.345021          3
[22]  0.0893977   0.347258          3
[23]  0.0814558   0.349114          3
[24]  0.0742195   0.350655          3
[25]  0.0676261   0.351934          3
[26]  0.0616183   0.352995          3
[27]  0.0561443   0.353876          3
[28]  0.0511566   0.354608          3
[29]  0.046612    0.355215          3
[30]  0.0424711   0.355719          3
[31]  0.0386981   0.356137          3
[32]  0.0352603   0.356485          3
[33]  0.0321279   0.356773          3
[34]  0.0292737   0.357012          3
[35]  0.0266731   0.357211          3
[36]  0.0243035   0.357376          3
[37]  0.0221445   0.357513          3
[38]  0.0201772   0.357626          3
[39]  0.0183847   0.357721          3
[40]  0.0167515   0.357799          3
[41]  0.0152633   0.357864          3
[42]  0.0139074   0.357918          3
[43]  0.0126719   0.357963          3
[44]  0.0115462   0.358             3
[45]  0.0105204   0.358031          3
[46]  0.00958582  0.358057          3
[47]  0.00873424  0.358078          3
[48]  0.00795832  0.358096          3
[49]  0.00725132  0.35811           3
[50]  0.00660713  0.358123          3
[51]  0.00602017  0.358133          3
[52]  0.00548536  0.358141          3
─────────────────────────────────────], true, 100, 4, 3)

we can now select, for example, the coefficients that minimize 10-fold cross-validation MSE (this takes a while)

In [7]:
coef(paths, MinCVKfold{MinCVmse}(10))
Out[7]:
4×4 Array{Float64,2}:
 -2.15703  -0.670327  -0.631308  -0.756541
 -1.60526  -2.31103   -0.898109   0.365359
 -0.64499  -2.25979   -0.318847   0.198874
 -1.94172  -1.22852   -1.22846    0.371721
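
Other segment selectors can be passed in the same way. For example, a brief sketch assuming Lasso.jl's MinBIC and MinCV1se selectors apply here as well:

coef(paths, MinBIC())                  # per-category BIC-minimizing segment
coef(paths, MinCVKfold{MinCV1se}(10))  # most regularized segment within one std. error of the CV-MSE minimizer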

Hurdle Distributed Multinomial Regression (HDMR)

For highly sparse counts, as is often the case with text that is selected for various reasons, the Hurdle Distributed Multinomial Regression (HDMR) model of Kelly, Manela, and Moreira (2018) may be superior to the DMR. It approximates a higher-dispersion Multinomial using distributed (independent, parallel) Hurdle regressions, one for each of the d categories (columns) of a large counts matrix, on the covars. It allows potentially different sets of covariates to explain category inclusion ($h = 1\{c > 0\}$) and repetition ($c > 0$).

Both the model for zeros and the model for positive counts are regularized by default, using a GammaLassoPath and picking the AICc-optimal segment of the regularization path.
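
To make the two parts concrete, here is a minimal sketch (our notation, not package internals) of the decomposition the InclusionRepetition hurdle works with:

# inclusion: did the category appear at all? repetition: how many extra times?
h = counts .> 0    # n-by-d inclusion indicators, h = 1{c > 0}
rep = counts .- h  # repetition counts beyond the first occurrence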

To fit an HDMR:

In [8]:
m = hdmr(covars, counts; inpos=1:2, inzero=1:3)
┌ Info: fitting 100 observations on 4 categories 
│ 2 covariates for positive and 3 for zero counts
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/hdmr.jl:468
┌ Info: distributed InclusionRepetition run on local cluster with 4 nodes
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/hdmr.jl:481
Out[8]:
HDMRCoefs{InclusionRepetition,Array{Float64,2},MinAICc,UnitRange{Int64}}([0.0 -2.48491 -0.505547 -0.23561; 0.0 0.0 -2.65019 0.203207; 0.0 0.0 -0.491465 0.0], [-2.59356 1.23117 1.73801 -13.9022; 0.0 -2.17746 0.0 38.1537; 0.0 -2.89271 0.0 423.955; 0.0 -1.23601 -1.85064 -19.9601], true, 100, 4, 1:2, 1:3, MinAICc(2))

or with a dataframe and formula

In [9]:
mf = @model(h ~ vy + v1 + v2, c ~ vy + v1)
m = fit(HDMR, mf, covarsdf, counts)
┌ Info: fitting 100 observations on 4 categories 
│ 2 covariates for positive and 3 for zero counts
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/hdmr.jl:468
┌ Info: distributed InclusionRepetition run on local cluster with 4 nodes
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/hdmr.jl:481
Out[9]:
HDMRCoefs{InclusionRepetition,Array{Float64,2},MinAICc,Array{Int64,1}}([0.0 -2.48491 -0.505547 -0.23561; 0.0 0.0 -2.65019 0.203207; 0.0 0.0 -0.491465 0.0], [-2.59356 1.23117 1.73801 -13.9022; 0.0 -2.17746 0.0 38.1537; 0.0 -2.89271 0.0 423.955; 0.0 -1.23601 -1.85064 -19.9601], true, 100, 4, [1, 2], [1, 2, 3], MinAICc(2))

where the h ~ equation is the model for zeros (hurdle crossing) and the c ~ equation is the model for positive counts.

in either case we can get the coefficients matrix for each variable + intercept as usual with

In [10]:
coefspos, coefszero = coef(m)
Out[10]:
([0.0 -2.48491 -0.505547 -0.23561; 0.0 0.0 -2.65019 0.203207; 0.0 0.0 -0.491465 0.0], [-2.59356 1.23117 1.73801 -13.9022; 0.0 -2.17746 0.0 38.1537; 0.0 -2.89271 0.0 423.955; 0.0 -1.23601 -1.85064 -19.9601])

By default we only return the AICc-minimizing coefficients. To also get back the entire regularization paths, run

In [11]:
paths = fit(HDMRPaths, mf, covarsdf, counts)

coef(paths, AllSeg())
┌ Info: fitting 100 observations on 4 categories 
│ 2 covariates for positive and 3 for zero counts
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/hdmr.jl:281
┌ Info: distributed InclusionRepetition run on remote cluster with 4 nodes
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/hdmr.jl:312
Out[11]:
([0.0 0.0 0.0; 0.0 0.0 0.0; … ; 0.0 0.0 0.0; 0.0 0.0 0.0]

[-2.48491 0.0 0.0; -2.41917 -0.199534 0.0; … ; 0.0 0.0 0.0; 0.0 0.0 0.0]

[-1.78805 0.0 0.0; -1.69583 -0.197051 0.0; … ; 0.0 0.0 0.0; 0.0 0.0 0.0]

[-0.133883 3.92431e-7 0.0; -0.149695 0.0319783 0.0; … ; 0.0 0.0 0.0; 0.0 0.0 0.0], [-2.59356 0.0 0.0 0.0; -2.54089 0.0 0.0 -0.121014; … ; 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0]

[-1.5251 0.0 0.0 0.0; -1.43208 0.0 -0.192178 0.0; … ; 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0]

[0.670484 0.0 0.0 -4.29912e-8; 0.762441 0.0 0.0 -0.163359; … ; 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0]

[4.94744 0.0 0.0 0.0; 4.64738 0.0 0.592183 0.0; … ; -13.6801 37.63 417.646 -19.6945; -13.9022 38.1537 423.955 -19.9601])

Sufficient reduction projection

A sufficient reduction projection summarizes the counts, much like a sufficient statistic, and is useful for reducing the d-dimensional counts to a potentially much lower-dimensional matrix z.
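
As a sketch of what the projection computes in the simpler DMR case (following the normalized form in Taddy (2015); for the HDMR fit below, srproj applies the same idea separately to the positive-count and zero parts):

# Illustrative only: refit a plain DMR and project normalized counts
# onto the coefficient direction for vy.
dmrm = dmr(covars, counts)       # plain DMR just for this sketch
phi = coef(dmrm)[2, :]           # loadings on vy (row 1 is the intercept)
mtot = vec(sum(counts, dims=2))  # total count per observation
z_vy = (counts * phi) ./ mtot    # one SR score per observation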

To get a sufficient reduction projection in the direction of vy for the above example:

In [12]:
z = srproj(m,counts,1,1)
Out[12]:
100×4 Array{Float64,2}:
 -0.367472   19.0768   10.0  2.0
 -0.0821325  19.0768   10.0  2.0
  0.203207   38.1537    3.0  1.0
 -0.158994   11.9921    9.0  3.0
  0.135472   11.9921    6.0  3.0
  0.203207   38.1537    7.0  1.0
 -0.670108   11.9921    7.0  3.0
 -0.204421   19.0768    7.0  2.0
 -0.153467   19.0768    8.0  2.0
 -0.430881   19.0768    9.0  2.0
 -0.113837   19.0768    9.0  2.0
  0.203207   38.1537    4.0  1.0
  0.203207   38.1537    4.0  1.0
  ⋮                             
 -1.01943     8.99406   5.0  4.0
  0.203207   38.1537    4.0  1.0
 -1.22349    19.0768    4.0  2.0
 -0.747925   19.0768    6.0  2.0
 -0.367472   19.0768    5.0  2.0
 -0.510142   19.0768    4.0  2.0
 -0.315592   19.0768   11.0  2.0
 -0.233451   11.9921    7.0  3.0
 -1.24889    12.7179    8.0  3.0
 -1.22349    19.0768    2.0  2.0
 -1.22349    19.0768    6.0  2.0
 -0.612049   19.0768    7.0  2.0

Here, the first column is the SR projection from the model for positive counts, the second is the SR projection from the model for hurdle crossing (zeros), and the remaining columns carry the total counts for each observation used by each part of the model.

Counts Inverse Regression (CIR)

Counts inverse regression allows us to predict a covariate with the counts and other covariates. Here we use hdmr for the backward regression and another model for the forward regression. This can be accomplished with a single command, by fitting a CIR{HDMR,FM} where the forward model is FM <: RegressionModel.
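
Conceptually, the single fit command below chains two stages. A rough sketch of the idea (our reconstruction, not the package internals; the design-matrix column ordering may differ), reusing the HDMR fit m and the projection z = srproj(m, counts, 1, 1) computed above:

# Backward stage: HDMR of counts on covariates, summarized by z.
# Forward stage: linear regression of the target vy on z plus the
# remaining covariates.
X = [ones(n) z covarsdf.v1 covarsdf.v2]  # intercept, 4 SR columns, other covariates
fwd = lm(X, covarsdf.vy)                 # forward linear model for vy

This lines up with the seven forward-model coefficients (x1 through x7) reported below.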

In [13]:
cir = fit(CIR{HDMR,LinearModel},mf,covarsdf,counts,:vy; nocounts=true)
┌ Info: fitting 100 observations on 4 categories 
│ 2 covariates for positive and 3 for zero counts
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/hdmr.jl:468
┌ Info: distributed InclusionRepetition run on local cluster with 4 nodes
└ @ HurdleDMR /home/root/.julia/dev/HurdleDMR/src/hdmr.jl:481
Out[13]:
CIR{HDMR,LinearModel}(1, [1, 2], HDMRCoefs{InclusionRepetition,Array{Float64,2},MinAICc,Array{Int64,1}}([0.0 -2.48491 -0.505547 -0.23561; 0.0 0.0 -2.65019 0.203207; 0.0 0.0 -0.491465 0.0], [-2.59356 1.23117 1.73801 -13.9022; 0.0 -2.17746 0.0 38.1537; 0.0 -2.89271 0.0 423.955; 0.0 -1.23601 -1.85064 -19.9601], true, 100, 4, [1, 2], [1, 2, 3], MinAICc(2)), LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
─────────────────────────────────────────────────────────────────────────
       Estimate  Std. Error    t value  Pr(>|t|)    Lower 95%   Upper 95%
─────────────────────────────────────────────────────────────────────────
x1   0.806879    0.254456     3.171       0.0021   0.30158     1.31218   
x2  -0.103535    0.096488    -1.07304     0.2860  -0.295142    0.0880707 
x3  -0.115557    0.103506    -1.11642     0.2671  -0.321099    0.0899861 
x4   0.154634    0.0814074    1.89951     0.0606  -0.00702506  0.316293  
x5  -0.00108295  0.00500668  -0.216302    0.8292  -0.0110252   0.00885932
x6   0.00588134  0.012264     0.479563    0.6327  -0.0184725   0.0302351 
x7  -0.0905476   0.0652975   -1.38669     0.1689  -0.220216    0.0391203 
─────────────────────────────────────────────────────────────────────────
, LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
─────────────────────────────────────────────────────────────────────
      Estimate  Std. Error    t value  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────
x1   0.456718    0.0713401   6.40198     <1e-8    0.315127   0.598308
x2  -0.0414989   0.0975448  -0.425434    0.6715  -0.235098   0.152101
x3   0.0921372   0.0909831   1.01269     0.3137  -0.088439   0.272713
─────────────────────────────────────────────────────────────────────
)

where nocounts=true means we also fit a benchmark model without the counts.

we can get the forward and backward model coefficients with

In [14]:
coefbwd(cir)
Out[14]:
([0.0 -2.48491 -0.505547 -0.23561; 0.0 0.0 -2.65019 0.203207; 0.0 0.0 -0.491465 0.0], [-2.59356 1.23117 1.73801 -13.9022; 0.0 -2.17746 0.0 38.1537; 0.0 -2.89271 0.0 423.955; 0.0 -1.23601 -1.85064 -19.9601])
In [15]:
coeffwd(cir)
Out[15]:
7-element Array{Float64,1}:
  0.8068791910919334   
 -0.10353546049021634  
 -0.11555664427731226  
  0.1546339871132811   
 -0.0010829538468236821
  0.005881340629811088 
 -0.0905476479689421   

The fitted model can be used to predict vy with new data

In [16]:
yhat = predict(cir, covarsdf[1:10,:], counts[1:10,:])
Out[16]:
10-element Array{Union{Missing, Float64},1}:
 0.4698993353328307 
 0.45963979129838384
 0.5782657496603597 
 0.484620220504942  
 0.5023887465911144 
 0.5914871966842338 
 0.40669982045527   
 0.5011220105167319 
 0.4811118616060238 
 0.5030932551174164 

We can also predict with only the other covariates, which in this case amounts to a simple linear regression

In [17]:
yhat_nocounts = predict(cir, covarsdf[1:10,:], counts[1:10,:]; nocounts=true)
Out[17]:
10-element Array{Union{Missing, Float64},1}:
 0.45731660084749226
 0.5179985510761541 
 0.4807768729723196 
 0.47254808567420575
 0.47267552235044974
 0.48519627697522144
 0.4799151241219515 
 0.44186379731084846
 0.4592317510593359 
 0.4232136851484049 

Kelly, Manela, and Moreira (2018) show that the differences between DMR and HDMR can be substantial in some cases, especially when the counts are highly sparse.

Please reference the paper for additional details and example applications.