Fit data to a spatial Gaussian SAR (spatial error) model, or model a vector of spatially-autocorrelated parameters using a SAR prior model.

stan_sar(
  formula,
  slx,
  re,
  data,
  C,
  sar_parts = prep_sar_data(C),
  family = auto_gaussian(),
  prior = NULL,
  ME = NULL,
  centerx = FALSE,
  prior_only = FALSE,
  censor_point,
  chains = 4,
  iter = 2000,
  refresh = 500,
  keep_all = FALSE,
  pars = NULL,
  control = NULL,
  ...
)

Source

Cliff, A D and Ord, J K (1981). Spatial Processes: Models and Applications. Pion.

Cressie, Noel (2015 (1993)). Statistics for Spatial Data. Wiley Classics, Revised Edition.

Cressie, Noel and Wikle, Christopher (2011). Statistics for Spatio-Temporal Data. Wiley.

LeSage, James (2014). What Regional Scientists Need to Know about Spatial Econometrics. The Review of Regional Science 44: 13-32 (2014 Southern Regional Science Association Fellows Address).

LeSage, James, & Pace, Robert Kelley (2009). Introduction to Spatial Econometrics. Chapman and Hall/CRC.

Arguments

formula

A model formula, following the R formula syntax. Binomial models can be specified by setting the left-hand side of the equation to a two-column matrix of successes and failures, as in cbind(successes, failures) ~ x.

slx

Formula to specify any spatially-lagged covariates. As in, ~ x1 + x2 (the intercept term will be removed internally). When setting priors for beta, remember to include priors for any SLX terms.

re

To include a varying intercept (or "random effects") term, alpha_re, specify the grouping variable here using formula syntax, as in ~ ID. Then, alpha_re is a vector of parameters added to the linear predictor of the model, and:

alpha_re ~ N(0, alpha_tau)
alpha_tau ~ Student_t(d.f., location, scale).

With the SAR model, any alpha_re term should be at a different level or scale than the observations; that is, at a different scale than the autocorrelation structure of the SAR model itself.

data

A data.frame or an object coercible to a data frame by as.data.frame containing the model data.

C

Spatial weights matrix (conventionally referred to as \(W\) in the SAR model). Typically, this will be created using geostan::shape2mat(shape, style = "W"). This will be passed internally to prep_sar_data, and will also be used to calculate residual spatial autocorrelation as well as any user specified slx terms; it will automatically be row-standardized before calculating slx terms. See shape2mat.

sar_parts

Optional. If not provided, then prep_sar_data will be used automatically to create sar_parts using the user-provided spatial weights matrix.

family

The likelihood function for the outcome variable. Current options are auto_gaussian(), binomial(link = "logit"), and poisson(link = "log"); if family = gaussian() is provided, it will automatically be converted to auto_gaussian().

prior

A named list of parameters for prior distributions (see priors):

intercept

The intercept is assigned a Gaussian prior distribution (see normal).

beta

Regression coefficients are assigned Gaussian prior distributions. Variables must follow their order of appearance in the model formula. Note that if you also use slx terms (spatially lagged covariates), and you use custom priors for beta, then you have to provide priors for the slx terms. Since slx terms are prepended to the design matrix, the prior for the slx term will be listed first.

sar_scale

Scale parameter for the SAR model, sar_scale. The scale is assigned a Student's t prior model (constrained to be positive).

sar_rho

The spatial autocorrelation parameter in the SAR model, rho, is assigned a uniform prior distribution. By default, the prior will be uniform over all permissible values as determined by the eigenvalues of the spatial weights matrix. The range of permissible values for rho is printed to the console by prep_sar_data.

tau

The scale parameter for any varying intercept terms (a.k.a. exchangeable random effects, or partial pooling). This scale parameter, tau, is assigned a Student's t prior (constrained to be positive).
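
As a rough illustration of how the eigenvalues of the weights matrix bound rho, the following base R sketch assumes the conventional rule that rho must lie between the reciprocals of the smallest and largest eigenvalues of W. The four-area adjacency matrix is made up for the example, and prep_sar_data remains the authority on the range actually used when fitting.

```r
# Hypothetical binary adjacency matrix for four areas arranged in a line: 1-2-3-4
C <- matrix(0, 4, 4)
C[cbind(1:3, 2:4)] <- 1
C <- C + t(C)

d <- rowSums(C)
W <- C / d  # row-standardized weights: each row sums to one

# W = D^{-1} C has the same eigenvalues as the symmetric matrix
# D^{-1/2} C D^{-1/2}, which is numerically safer to decompose.
S <- C / sqrt(outer(d, d))
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values

# Assumed rule: 1 / min(eigenvalue) < rho < 1 / max(eigenvalue)
rho_range <- c(lower = 1 / min(ev), upper = 1 / max(ev))
rho_range
```

For a row-standardized matrix the largest eigenvalue is one, so the upper bound is one.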

ME

To model observational uncertainty (i.e. measurement or sampling error) in any or all of the covariates, provide a list of data as constructed by the prep_me_data function.

centerx

To center predictors on their mean values, use centerx = TRUE. If the ME argument is used, the modeled covariate (i.e., latent variable), rather than the raw observations, will be centered. When using the ME argument, this is the recommended method for centering the covariates.

prior_only

Logical value; if TRUE, draw samples only from the prior distributions of parameters.

censor_point

Integer value indicating the maximum censored value; this argument is for modeling censored (suppressed) outcome data, typically disease case counts or deaths.

chains

Number of MCMC chains to use.

iter

Number of samples per chain.

refresh

Stan will print the progress of the sampler every refresh number of samples. Set refresh=0 to silence this.

keep_all

If keep_all = TRUE then samples for all parameters in the Stan model will be kept; this is necessary if you want to do model comparison with Bayes factors and the bridgesampling package.

pars

Optional; specify any additional parameters you'd like stored from the Stan model.

control

A named list of parameters to control the sampler's behavior. See stan for details.

...

Other arguments passed to sampling. For multi-core processing, you can use cores = parallel::detectCores(), or run options(mc.cores = parallel::detectCores()) first.

Value

An object of class geostan_fit (a list) containing:

summary

Summaries of the main parameters of interest; a data frame.

diagnostic

Widely Applicable Information Criteria (WAIC) with a measure of effective number of parameters (eff_pars) and mean log pointwise predictive density (lpd), and mean residual spatial autocorrelation as measured by the Moran coefficient.

stanfit

an object of class stanfit returned by rstan::stan

data

a data frame containing the model data

family

the user-provided or default family argument used to fit the model

formula

The model formula provided by the user (not including the SAR component)

slx

The slx formula

re

A list containing re, the varying intercepts (re) formula if provided, and Data, a data frame with columns id, the grouping variable, and idx, the index values assigned to each group.

priors

Prior specifications.

x_center

If covariates are centered internally (centerx = TRUE), then x_center is a numeric vector of the values on which covariates were centered.

spatial

A data frame with the name of the spatial component parameter (either "phi" or, for auto Gaussian models, "trend") and method ("SAR")

ME

A list indicating if the object contains an ME model; if so, the user-provided ME list is also stored here.

C

Spatial weights matrix (in sparse matrix format).

Details

Discussions of SAR models may be found in Cliff and Ord (1981), Cressie (2015, Ch. 6), LeSage and Pace (2009), and LeSage (2014).

The general scheme of the SAR model for numeric vector \(y\) is $$ y = \mu + ( I - \rho W)^{-1} \epsilon \\ \epsilon \sim Gauss(0, \sigma^2 I) $$ where \(W\) is the spatial weights matrix, \(I\) is the n-by-n identity matrix, and \(\rho\) is a spatial autocorrelation parameter. In words, the errors of the regression equation are spatially autocorrelated.

Re-arranging terms, the model can also be written as follows: $$ y = \mu + \rho W (y - \mu) + \epsilon $$ which perhaps shows more intuitively the implicit spatial trend component, \(\rho W (y - \mu)\).
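
The equivalence of the two forms can be checked numerically. The base R sketch below simulates from the first form on a made-up ring of six areas and confirms that the second form reproduces the same values; the values of rho, sigma, and mu are illustrative assumptions only.

```r
set.seed(1)

# Hypothetical ring of n areas: each area neighbors the one before and after it
n <- 6
C <- matrix(0, n, n)
for (i in 1:n) {
  C[i, i %% n + 1] <- 1        # next area on the ring
  C[i, (i - 2) %% n + 1] <- 1  # previous area on the ring
}
W <- C / rowSums(C)            # row-standardized weights

rho <- 0.6                     # illustrative values, not estimates
sigma <- 1
mu <- rep(2, n)                # constant mean, for simplicity
I_n <- diag(n)

# First form: y = mu + (I - rho W)^{-1} epsilon
eps <- rnorm(n, mean = 0, sd = sigma)
y <- mu + as.numeric(solve(I_n - rho * W, eps))

# Second form: y = mu + rho W (y - mu) + epsilon
trend <- as.numeric(rho * W %*% (y - mu))
y_check <- mu + trend + eps

all.equal(y, y_check)
```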

Most often, this model is applied directly to observations (referred to below as the auto-Gaussian model). The SAR model can also be applied to a vector of parameters inside a hierarchical model. The latter enables spatial autocorrelation to be modeled when the observations are discrete counts (e.g., disease incidence data).

A note on terminology: the spatial statistics literature conceptualizes the simultaneously-specified spatial autoregressive model (SAR) in relation to the conditionally-specified spatial autoregressive model (CAR) (see stan_car and Cliff and Ord 1981). The spatial econometrics literature, by contrast, refers to the simultaneously-specified spatial autoregressive (SAR) model as the spatial error model (SEM), and contrasts the SEM with the spatial lag model (which contains a spatially-lagged dependent variable on the right-hand side of the regression equation) (see LeSage 2014).

Auto-Gaussian

When family = auto_gaussian(), the SAR model is specified as follows: $$ y \sim Gauss(\mu, \Sigma) \\ \Sigma = \sigma^2 (I - \rho W)^{-1}(I - \rho W')^{-1} $$ where \(\mu\) is the mean vector (with intercept, covariates, etc.), \(W\) is a spatial weights matrix (usually row-standardized), and \(\sigma\) is a scale parameter.
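
To make the covariance structure concrete, this base R sketch builds \(\Sigma\) for a made-up five-area weights matrix and checks that its inverse (the precision matrix) is \((I - \rho W')(I - \rho W) / \sigma^2\). The values of rho and sigma are arbitrary, and this is not a description of how the package's Stan code evaluates the likelihood.

```r
# Hypothetical adjacency for five areas arranged in a line
n <- 5
C <- matrix(0, n, n)
C[cbind(1:(n - 1), 2:n)] <- 1
C <- C + t(C)
W <- C / rowSums(C)  # row-standardized weights

rho <- 0.5           # illustrative values only
sigma <- 2
A <- diag(n) - rho * W

# Sigma = sigma^2 (I - rho W)^{-1} (I - rho W')^{-1}
Sigma <- sigma^2 * solve(A) %*% solve(t(A))

# Precision matrix: (I - rho W')(I - rho W) / sigma^2
Q <- crossprod(A) / sigma^2

all.equal(Sigma %*% Q, diag(n))  # Q is the inverse of Sigma
all.equal(Sigma, t(Sigma))       # Sigma is symmetric even though W is not
```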

The SAR model contains an implicit spatial trend (i.e., spatial autocorrelation) component \(\phi\) which is calculated as follows: $$ \phi = \rho W (y - \mu) $$

This term can be extracted from a fitted auto-Gaussian model using the spatial method.

When applied to a fitted auto-Gaussian model, the residuals.geostan_fit method returns 'de-trended' residuals \(R\) by default. That is, $$ R = y - \mu - \rho W (y - \mu). $$ To obtain "raw" residuals (\(y - \mu\)), use residuals(fit, detrend = FALSE). Similarly, the fitted values obtained from the fitted.geostan_fit method will include the spatial trend term by default.

Poisson

For family = poisson(), the model is specified as:

$$ y \sim Poisson(e^{O + \lambda}) \\ \lambda \sim Gauss(\mu, \Sigma) \\ \Sigma = \sigma^2 (I - \rho W)^{-1}(I - \rho W')^{-1}. $$ If the raw outcome consists of a rate \(\frac{y}{p}\) with observed counts \(y\) and denominator p (often this will be the size of the population at risk), then the offset term \(O=log(p)\) is the log of the denominator.

This is often written (equivalently) as: $$ y \sim Poisson(e^{O + \mu + \phi}) \\ \phi \sim Gauss(0, \Sigma) \\ \Sigma = \sigma^2 (I - \rho W)^{-1}(I - \rho W')^{-1} $$

For Poisson models, the spatial method returns the parameter vector \(\phi\).

In the Poisson SAR model, \(\phi\) contains a latent spatial trend as well as additional variation around it. If you would like to extract the latent/implicit spatial trend from \(\phi\), you can do so by calculating: $$ \rho W \phi. $$
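
The hierarchical Poisson specification can be sketched as a simulation in base R. Everything here is an assumption made for illustration: the ring of six areas, the populations, the baseline rate, and the values of rho and sigma. The last lines split \(\phi\) into the implicit trend \(\rho W \phi\) and the remaining variation.

```r
set.seed(42)

# Hypothetical ring of six areas with row-standardized weights
n <- 6
C <- matrix(0, n, n)
for (i in 1:n) {
  C[i, i %% n + 1] <- 1
  C[i, (i - 2) %% n + 1] <- 1
}
W <- C / rowSums(C)

rho <- 0.7                                    # illustrative values only
sigma <- 0.3
pop <- c(1200, 850, 4300, 2100, 990, 3100)    # made-up populations at risk
O <- log(pop)                                 # offset: log of the denominator
mu <- log(0.01)                               # log of a made-up baseline rate

# phi ~ Gauss(0, Sigma), drawn as phi = (I - rho W)^{-1} eps
eps <- rnorm(n, mean = 0, sd = sigma)
phi <- as.numeric(solve(diag(n) - rho * W, eps))

# y ~ Poisson(exp(O + mu + phi))
y <- rpois(n, lambda = exp(O + mu + phi))

# Latent spatial trend contained in phi, and the variation around it
trend <- as.numeric(rho * W %*% phi)
remainder <- phi - trend
all.equal(remainder, eps)
```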

Binomial

For family = binomial(), the model is specified as: $$ y \sim Binomial(N, \lambda) \\ logit(\lambda) \sim Gauss(\mu, \Sigma) \\ \Sigma = \sigma^2 (I - \rho W)^{-1}(I - \rho W')^{-1} $$ where outcome data \(y\) are counts, \(N\) is the number of trials, and \(\lambda\) is the rate of 'success'. Note that the model formula should be structured as: cbind(successes, failures) ~ 1 (for an intercept-only model), such that trials = successes + failures.

For fitted Binomial models, the spatial method will return the parameter vector phi, equivalent to: $$ \phi = logit(\lambda) - \mu. $$ As is also the case for the Poisson model, \(\phi\) contains a latent spatial trend as well as additional variation around it. If you would like to extract the latent/implicit spatial trend from \(\phi\), you can do so by calculating: $$ \rho W \phi. $$
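
The following base R sketch shows what the cbind(successes, failures) response looks like once the formula is evaluated, and how \(\phi\) relates to \(\lambda\) and \(\mu\) on the logit scale. The data frame and the values of lambda and mu are invented for the example; only base R functions are used, so this does not show geostan's internal handling.

```r
# Made-up data: three areas with counts of successes and failures
df <- data.frame(
  successes = c(3, 10, 0),
  failures  = c(7, 15, 5),
  x         = c(0.1, 0.5, -0.2)
)

# The left-hand side evaluates to a two-column matrix
mf <- model.frame(cbind(successes, failures) ~ x, data = df)
y <- model.response(mf)
is.matrix(y)

# Number of trials per area: successes + failures
trials <- rowSums(y)
trials

# phi = logit(lambda) - mu, using illustrative values for lambda and mu
lambda <- c(0.20, 0.45, 0.10)
mu <- rep(qlogis(0.25), 3)
phi <- qlogis(lambda) - mu

# Adding phi back to mu and inverting the logit recovers lambda
all.equal(plogis(mu + phi), lambda)
```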

Spatially lagged covariates (SLX)

The slx argument is a convenience for including SLX terms. For example, $$ y = W X \gamma + X \beta + \epsilon $$ where \(W\) is a row-standardized spatial weights matrix (see shape2mat), \(WX\) is the mean neighboring value of \(X\), and \(\gamma\) is a coefficient vector. This specifies a regression with spatially lagged covariates. SLX terms can be specified by providing a formula to the slx argument:

stan_glm(y ~ x1 + x2, slx = ~ x1 + x2, ...),

which is a shortcut for

stan_glm(y ~ I(W %*% x1) + I(W %*% x2) + x1 + x2, ...)

SLX terms will always be prepended to the design matrix, as above, which is important to know when setting prior distributions for regression coefficients.
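
A small base R sketch of what a spatially lagged covariate is: with a row-standardized \(W\), each element of \(W x\) is the mean of \(x\) over that area's neighbors. The four-area adjacency matrix, the covariate values, and the column names are made up for the example.

```r
# Hypothetical adjacency for four areas arranged in a line: 1-2-3-4
C <- matrix(0, 4, 4)
C[cbind(1:3, 2:4)] <- 1
C <- C + t(C)
W <- C / rowSums(C)  # row-standardized weights

x <- c(10, 20, 30, 40)

# Spatially lagged covariate: mean of x among each area's neighbors
Wx <- as.numeric(W %*% x)
Wx
# 20 20 30 30

# SLX column placed before the original covariate, mirroring the ordering
# described above (column names here are illustrative)
X <- cbind(W.x = Wx, x = x)
X
```

Area 2, for instance, neighbors areas 1 and 3, so its lagged value is (10 + 30) / 2 = 20.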

For measurement error (ME) models, the SLX argument is the only way to include spatially lagged covariates since the SLX term needs to be re-calculated on each iteration of the MCMC algorithm.

Measurement error (ME) models

The ME models are designed for surveys with spatial sampling designs, such as the American Community Survey (ACS) estimates. Given estimates \(x\), their standard errors \(s\), and the target quantity of interest (i.e., the unknown true value) \(z\), the ME models have one of the following two specifications, depending on the user input. If a spatial CAR model is specified, then: $$ x \sim Gauss(z, s^2) \\ z \sim Gauss(\mu_z, \Sigma_z) \\ \Sigma_z = (I - \rho C)^{-1} M \\ \mu_z \sim Gauss(0, 100) \\ \tau_z \sim Student-t(10, 0, 40), \tau > 0 \\ \rho_z \sim uniform(l, u) $$ where \(\Sigma\) specifies a spatial conditional autoregressive model with scale parameter \(\tau\) (on the diagonal of \(M\)), and \(l\), \(u\) are the lower and upper bounds that \(\rho\) is permitted to take (which is determined by the extreme eigenvalues of the spatial connectivity matrix \(C\)).

For non-spatial ME models, the following is used instead: $$ x \sim Gauss(z, s^2) \\ z \sim student-t(\nu_z, \mu_z, \sigma_z) \\ \nu_z \sim gamma(3, 0.2) \\ \mu_z \sim Gauss(0, 100) \\ \sigma_z \sim student-t(10, 0, 40). $$

For strongly skewed variables, such as census tract poverty rates, it can be advantageous to apply a logit transformation to \(z\) before applying the CAR or Student-t prior model. When the logit argument is used, the model becomes: $$ x \sim Gauss(z, s^2) \\ logit(z) \sim Gauss(\mu_z, \Sigma_z) ... $$ and similarly for the Student t model: $$ x \sim Gauss(z, s^2) \\ logit(z) \sim student-t(\nu_z, \mu_z, \sigma_z) \\ ... $$

Censored counts

Vital statistics systems and disease surveillance programs typically suppress case counts when they are smaller than a specific threshold value. In such cases, the observation of a censored count is not the same as a missing value; instead, you are informed that the value is an integer somewhere between zero and the threshold value. For Poisson models (family = poisson()), you can use the censor_point argument to encode this information into your model.

Internally, geostan will keep the index values of each censored observation, and the index value of each of the fully observed outcome values. For all observed counts, the likelihood statement will be: $$ p(y_i | data, model) = poisson(y_i | \mu_i), $$ as usual, where \(\mu_i\) may include whatever spatial terms are present in the model.

For each censored count, the likelihood statement will equal the cumulative Poisson distribution function for values zero through the censor point: $$ p(y_i | data, model) = \sum_{m=0}^{M} Poisson( m | \mu_i), $$ where \(M\) is the censor point and \(\mu_i\) again is the fitted value for the \(i^{th}\) observation.

For example, the US Centers for Disease Control and Prevention's CDC WONDER database censors all death counts between 0 and 9. To model CDC WONDER mortality data, you could provide censor_point = 9 and then the likelihood statement for censored counts would equal the summation of the Poisson probability mass function over each integer ranging from zero through 9 (inclusive), conditional on the fitted values (i.e., all model parameters). See Donegan (2021) for additional discussion, references, and Stan code.
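
The censored-count likelihood can be reproduced in a few lines of base R. This sketch uses made-up fitted values and, as an assumption for the example only, marks censored observations with NA. It checks that summing the Poisson probability mass function from zero through the censor point equals the Poisson cumulative distribution function, then computes a log-likelihood contribution for each observation.

```r
censor_point <- 9

# For one censored observation with a made-up fitted value mu_i:
mu_i <- 4.2
lik_sum <- sum(dpois(0:censor_point, lambda = mu_i))  # sum of the pmf, 0 through 9
lik_cdf <- ppois(censor_point, lambda = mu_i)         # Poisson CDF at 9
all.equal(lik_sum, lik_cdf)

# Mixed data: NA marks a censored count (an assumption for this sketch)
y  <- c(12, NA, 25, NA)
mu <- c(10, 3, 22, 6)   # made-up fitted values
censored <- is.na(y)

log_lik <- numeric(length(y))
# Fully observed counts: ordinary Poisson log-probability
log_lik[!censored] <- dpois(y[!censored], lambda = mu[!censored], log = TRUE)
# Censored counts: log of the cumulative probability up to the censor point
log_lik[censored] <- ppois(censor_point, lambda = mu[censored], log.p = TRUE)
log_lik
```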

Author

Connor Donegan, connor.donegan@gmail.com

Examples

# model mortality risk
library(geostan)
data(georgia)
W <- shape2mat(georgia, style = "W")

fit <- stan_sar(log(rate.male) ~ 1,
                C = W,
                data = georgia,
                chains = 1, # for ex. speed only
                iter = 700 
                )

rstan::stan_rhat(fit$stanfit)
rstan::stan_mcse(fit$stanfit)
print(fit)
plot(fit)
sp_diag(fit, georgia)

# a more appropriate model for count data:
fit2 <- stan_sar(deaths.male ~ offset(log(pop.at.risk.male)),
                 C = W,
                 data = georgia,
                 family = poisson(),
                 chains = 1, # for ex. speed only
                 iter = 700
                 )
sp_diag(fit2, georgia)