# Fit an Additive-Multiplicative Intensity Model for SIR Data

`twinSIR.Rd`

`twinSIR`

is used to fit additive-multiplicative intensity models for
epidemics as described in Höhle (2009). Estimation is driven
by (penalized) maximum likelihood in the point process frame work. Optimization
(maximization) of the (penalized) likelihood function is performed by means of
`optim`

.
The implementation is illustrated in Meyer et al. (2017, Section 4),
see `vignette("twinSIR")`

.

## Usage

```
twinSIR(formula, data, weights, subset,
knots = NULL, nIntervals = 1, lambda.smooth = 0, penalty = 1,
optim.args = list(), model = TRUE, keep.data = FALSE)
```

## Arguments

- formula
an object of class

`"formula"`

(or one that can be coerced to that class): a symbolic description of the intensity model to be estimated. The details of the model specification are given below.- data
an object inheriting from class

`"epidata"`

.- weights
an optional vector of weights to be used in the fitting process. Should be

`NULL`

(the default, i.e. all observations have unit weight) or a numeric vector.- subset
an optional vector specifying a subset of observations to be used in the fitting process. The subset

`atRiskY == 1`

is automatically chosen, because the likelihood only depends on those observations.- knots
numeric vector or

`NULL`

(the default). Specification of the knots, where we suppose a step of the log-baseline. With the current implementation, these must be existing`"stop"`

time points in the selected`subset`

of the`data`

, which is always restricted to`atRiskY == 1`

rows. The intervals of constant log-baseline hazard rate then are \((minTime;knots_1]\), \((knots_1;knots_2]\), ..., \((knots_K;maxTime]\). By default, the`knots`

are automatically chosen at the quantiles of the infection time points such that`nIntervals`

intervals result. Non-NULL`knots`

take precedence over`nIntervals`

.- nIntervals
the number of intervals of constant log-baseline hazard. Defaults to 1, which means an overall constant log-baseline hazard will be fitted.

- lambda.smooth
numeric, the smoothing parameter \(\lambda\). By default it is 0 which leads to unpenalized likelihood inference. In case

`lambda.smooth=-1`

, the automatic smoothing parameter selection based on a mixed model approach is used (cf. Höhle, 2009).- penalty
either a single number denoting the order of the difference used to penalize the log-baseline coefficients (defaults to 1), or a more specific penalty matrix \(K\) for the parameter sub-vector \(\beta\). In case of non-equidistant knots – usually the case when using quantile based knot locations – only a 1st order differences penalty matrix as in Fahrmeir and Lang (2001) is implemented.

- optim.args
a list with arguments passed to the

`optim`

function. Especially useful are the following ones:`par`

:to specify initial parameter values. Those must be in the order

`c(alpha, h0, beta)`

, i.e. first the coefficients of the epidemic covariates in the same order as they appear in the`formula`

, then the log-baseline levels in chronological order and finally the coefficients of the endemic covariates in the same order as they appear in the`cox`

terms of the`formula`

. The default is to start with 1's for`alpha`

and 0's for`h0`

and`beta`

.`control`

:for more detailed

`trace`

-ing (default: 1), another`REPORT`

-ing frequency if`trace`

is positive (default: 10), higher`maxit`

(maximum number of iterations, default: 300) or another`factr`

value (default: 1e7, a lower value means higher precision).`method`

:the optimization algorithm defaults to

`"L-BFGS-B"`

(for box-constrained optimization), if there are any epidemic (non-`cox`

) variables in the model, and to`"BFGS"`

otherwise.`lower`

:if

`method = "L-BFGS-B"`

this defines the lower bounds for the model coefficients. By default, all effects \(\alpha\) of epidemic variables are restricted to be non-negative. Normally, this is exactly what one would like to have, but there might be reasons for other lower bounds, see the Note below.`hessian`

:An estimation of the Expected Fisher Information matrix is always part of the return value of the function. It might be interesting to see the Observed Fisher Information (= negative Hessian at the maximum), too. This will be additionally returned if

`hessian = TRUE`

.

- model
logical indicating if the model frame, the

`weights`

,`lambda.smooth`

, the penalty matrix \(K\) and the list of used distance functions`f`

(from`attributes(data)`

) should be returned for further computation. This defaults to`TRUE`

as this information is necessary e.g. in the`profile`

and`plot`

methods.- keep.data
logical indicating if the

`"epidata"`

object (`data`

) should be part of the return value. This is only necessary for use of the`simulate`

-method for`"twinSIR"`

objects. The reason is that the`twinSIR`

function only uses and stores the rows with`atRiskY == 1`

in the`model`

component, but for the simulation of new epidemic data one needs the whole data set with all individuals in every time block. The default value is`FALSE`

, so if you intent to use`simulate.twinSIR`

, you have to set this to`TRUE`

.

## Details

A model is specified through the `formula`

, which has the form

```
~ epidemicTerm1 + epidemicTerm2 + cox(endemicVar1) *
cox(endemicVar2)
```

,

i.e. the right hand side has the usual form as in `lm`

with
some variables marked as being endemic by the special function
`cox`

. The left hand side of the formula is empty and will be
set internally to `cbind(start, stop, event)`

, which is similar to
`Surv(start, stop, event, type="counting")`

in package survival.

Basically, the additive-multiplicative model for the infection intensity \(\lambda_i(t)\) for individual \(i\) is $$\lambda_i(t) = Y_i(t) * (e_i(t) + h_i(t))$$ where

- Y_i(t)
is the at-risk indicator, indicating if individual \(i\) is “at risk” of becoming infected at time point \(t\). This variable is part of the event history

`data`

.- e_i(t)
is the epidemic component of the infection intensity, defined as $$e_i(t) = \sum_{j \in I(t)} f(||s_i - s_j||)$$ where \(I(t)\) is the set of infectious individuals just before time point \(t\), \(s_i\) is the coordinate vector of individual \(i\) and the function \(f\) is defined as $$f(u) = \sum_{m=1}^p \alpha_m B_m(u)$$ with unknown transmission parameters \(\alpha\) and known distance functions \(B_m\). This set of distance functions results in the set of epidemic variables normally calculated by the converter function

`as.epidata`

, considering the equality $$e_i(t) = \sum_{m=1}^p \alpha_m x_{im}(t)$$ with \(x_{im}(t) = \sum_{j \in I(t)} B_m(||s_i - s_j||)\) being the \(m\)'th epidemic variable for individual \(i\).- h_i(t)
is the endemic (

`cox`

) component of the infection intensity, defined as $$h_i(t) = \exp(h_0(t) + z_i(t)' \beta)$$ where \(h_0(t)\) is the log-baseline hazard function, \(z_i(t)\) is the vector of endemic covariates of individual \(i\) and \(\beta\) is the vector of unknown coefficients. To fit the model, the log-baseline hazard function is approximated by a piecewise constant function with known knots, but unknown levels, which will be estimated. The approximation is specified by the arguments`knots`

or`nIntervals`

.

If a big number of `knots`

(or `nIntervals`

) is chosen, the
corresponding log-baseline parameters can be rendered identifiable by
the use of penalized likelihood inference. At present, it is the job
of the user to choose an adequate value of the smoothing parameter
`lambda.smooth`

. Alternatively, a data driven
`lambda.smooth`

smoothing parameter selection based on a mixed
model representation of an equivalent truncated power spline is offered (see
reference for further details). The following two steps are iterated
until convergence:

Given fixed smoothing parameter, the penalized likelihood is optimized for the regression components using a L-BFGS-B approach

Given fixed regression parameters, a Laplace approximation of the marginal likelihood for the smoothing parameter is numerically optimized.

Depending on the data, convergence might take a couple of iterations.

Note also that it is unwise to include endemic covariates with huge values,
as they affect the intensities on the exponential scale (after
multiplication by the parameter vector \(\beta\)).
With large covariate values, the
`optim`

method "L-BFGS-B" will likely terminate due to an infinite
log-likelihood or score function in some iteration.

## Value

`twinSIR`

returns an object of class
`"twinSIR"`

, which is a list containing the following components:

- coefficients
a named vector of coefficients.

- loglik
the maximum of the (penalized) log-likelihood function.

- counts
the number of log-likelihood and score function evaluations.

- converged
logical indicating convergence of the optimization algorithm.

- fisherinfo.observed
if requested, the negative Hessian from

`optim`

.- fisherinfo
an estimation of the Expected Fisher Information matrix.

- method
the optimization algorithm used.

- intervals
a numeric vector (

`c(minTime, knots, maxTime)`

) representing the consecutive intervals of constant log-baseline.- nEvents
a numeric vector containing the number of infections in each of the above

`intervals`

.- model
if requested, the model information used. This is a list with components

`"survs"`

(data.frame with the id, start, stop and event columns),`"X"`

(matrix of the epidemic variables),`"Z"`

(matrix of the endemic variables),`"weights"`

(the specified`weights`

),`"lambda.smooth"`

(the specified`lambda.smooth`

),`"K"`

(the penalty matrix used), and`"f"`

and`"w"`

(the functions to generate the used epidemic covariates). Be aware that the model only contains those rows with`atRiskY == 1`

!- data
if requested, the supplied

`"epidata"`

`data`

.- call
the matched call.

- formula
the specified

`formula`

.- terms
the

`terms`

object used.

## References

Höhle, M. (2009),
Additive-multiplicative regression models for spatio-temporal
epidemics, *Biometrical Journal*, **51** (6), 961-978.

Meyer, S., Held, L. and Höhle, M. (2017):
Spatio-temporal analysis of epidemic phenomena using the R package
surveillance.
*Journal of Statistical Software*, **77** (11), 1-55.
doi:10.18637/jss.v077.i11

## Note

There are some restrictions to modelling the infection intensity
without a baseline hazard rate, i.e. without an intercept in the
`formula`

.
Reason: At some point, the optimization algorithm L-BFGS-B tries to set all
transmission parameters \(\alpha\) to the boundary value 0 and to calculate
the (penalized) score function with this set of parameters (all 0). The problem
then is that the values of the infection intensities \(lambda_i(t)\) are 0
for all \(i\) and \(t\) and especially at observed event times, which is
impossible. Without a baseline, it is not allowed to have all alpha's set to 0,
because then we would not observe any infections. Unfortunately, L-BFGS-B can
not consider this restriction. Thus, if one wants to fit a model without
baseline hazard, the control parameter `lower`

must be specified in
`optim.args`

so that some alpha is strictly positive, e.g.
`optim.args = list(lower = c(0,0.001,0.001,0))`

and the initial parameter
vector `par`

must not be the zero vector.

## See also

`as.epidata`

for the necessary data input structure,
`plot.twinSIR`

for plotting the path of the infection intensity,
`profile.twinSIR`

for profile likelihood estimation.
and `simulate.twinSIR`

for the simulation of epidemics following
the fitted model.

Furthermore, the standard extraction methods
`vcov`

, `logLik`

,
`AIC`

and
`extractAIC`

are implemented for
objects of class `"twinSIR"`

.

## Examples

```
data("hagelloch")
summary(hagelloch)
# simple model with an overall constant baseline hazard rate
fit1 <- twinSIR(~ household + cox(AGE), data = hagelloch)
fit1
summary(fit1) # see also help("summary.twinSIR")
plot(fit1) # see also help("plot.twinSIR")
checkResidualProcess(fit1) # could be better
# fit a piecewise constant baseline hazard rate with 3 intervals using
# _un_penalized ML and estimated coefs from fit1 as starting values
fit2 <- twinSIR(~ household, data = hagelloch, nIntervals = 3,
optim.args = list(par = coef(fit1)[c(1,2,2,2)]))
summary(fit2)
# fit a piecewise constant baseline hazard rate with 7 intervals
# using _penalized_ ML
fit3 <- twinSIR(~ household, data = hagelloch, nIntervals = 7,
lambda.smooth = 0.1, penalty = 1)
summary(fit3)
checkResidualProcess(fit3)
# plot the estimated log-baseline levels
plot(x=fit2$intervals, y=coef(fit2)[c(2,2:4)], type="S", ylim=c(-6, -1))
lines(x=fit3$intervals, y=coef(fit3)[c(2,2:8)], type="S", col=2)
legend("right", legend=c("unpenalized 3", "penalized 7"), lty=1, col=1:2, bty="n")
## special use case: fit the model to a subset of the events only,
## while preserving epidemic contributions from the remainder
## (maybe some buffer area nodes)
fit_subset <- twinSIR(~ household, data = hagelloch, subset = CL=="preschool")
summary(fit_subset)
```