This notebook presents the simulation of a Gaussian vector under equality/inequality constraints using the Gibbs sampler. Gibbs sampling in **gstlearn** is illustrated with:

the conditional simulation of an excursion set (e.g.Â a possible model for binary variable valued in \(\{0,1\}\)),

the computation of the conditional expectation when the observations have been censored (e.g., to take into account the limit of quantification, or LOQ, on the sample analysis).

The description of the Gibbs sampler given in Wikipedia is:

" Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for sampling from a specified multivariate probability distribution when direct sampling from the joint distribution is difficult, but sampling from the conditional distribution is more practical. This sequence can be used

- to approximate the joint distribution (e.g., to generate a histogram of the distribution);
- to approximate the marginal distribution of one of the variables, or some subset of the variables (for example, the unknown parameters or latent variables); or
- to compute an integral (such as the expected value of one of the variables).

Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.

Gibbs sampling is commonly used as a means of statistical inference, especially Bayesian inference. It is a randomized algorithm (i.e.Â an algorithm that makes use of random numbers), and is an alternative to deterministic algorithms for statistical inference such as the expectation-maximization algorithm (EM).

As with other MCMC algorithms, Gibbs sampling generates a Markov chain of samples, each of which is correlated with nearby samples. As a result, care must be taken if independent samples are desired. Generally, samples from the beginning of the chain (the burn-in period) may not accurately represent the desired distribution and are usually discarded. "

We consider the Gaussian vector \(\mathbf{Y}\) defined by the observations on a finite set \(A = \{\alpha_1, \dots, \alpha_n \}\): \(\mathbf{Y} = [Y(\alpha_i)]_{i \in 1:n}\) of a Gaussian random function. This random vector defined by

- its mean vector \(\mathbf{m} = [\mathbb{E}\{Y(\alpha_i)\}]_{i \in 1:n}\) and
- its covariance matrix \(\mathbf{\Sigma} = [\text{Cov}\{Y(\alpha_i), Y(\alpha_j)\}]_{i,j \in 1:n}\),

follows the Gaussian probability density function: \[ g_{\mathbf{m, \Sigma}}(y) = (\pi)^{-n/2} \, |\mathbf{\Sigma}|^{-1/2} \exp{(-\frac{1}{2} y^{t} \mathbf{\Sigma}^{-1} y)}. \]

The Gibbs sampler defines a Markov chain of random vectors \(\mathbf{Y^{(0)}, Y^{(1)}, \dots, Y^{(t)}}\) whose limiting distribution is the Gaussian distribution \(\mathcal{N}(\mathbf{m, \Sigma})\). An iteration consists in a sequential update of all the components drawn from its univariate conditional distribution: \[ Y_i^{(t)} \leftarrow y_i^{SK}(Y^{(t)}_{k < i }, Y^{(t-1)}_{i < k }) + \sigma^{SK}_i\mathcal{N}(0,1) \] where \(y_i^{SK}(Y^{(t)}_{k < i }, Y^{(t-1)}_{i < k })\) is the simple kriging of the component \(Y_i\) by the other components, those already updated \(Y^{(t)}_{k < i }\) in the current iteration and the values of the previous iteration \(Y^{(t-1)}_{k > i }\) for the others.

If univariate inequality constraints are added on the vector: \(\bigcap_{i \in 1:n} \{ l_i \leq Y_i \leq u_i\} = \bigcap_{i \in 1:n} \{ Y_i \in B_i\}\), the conditional distribution is the restriction of the Gaussian distribution to the n-dimensional polyhedron \(B = B_1 \times \dots \times B_n\):

\[ Y | Y \in B \sim \frac{g(y) 1_{y \in B}}{\int_B g(u)du} \] The Gibbs sampler updates sequentially each component \(Y_i\) using: \[ Y_i^{(t)} \leftarrow y_i^{SK} + \mathcal{N}_{C_i^{(t)}}(0,1). \] The residual is drawn from the restriction of the univariate Gaussian distribution to the interval \[ C_i^{(t)} = [\frac{l_i-y_i^{SK}(Y^{(t)}_{k < i }, Y^{(t-1)}_{i < k })}{\sigma_i}, \frac{u_i-y_i^{SK}(Y^{(t)}_{k < i }, Y^{(t-1)}_{i < k })}{\sigma_i}]. \]

- A conditional simulation of the excursion set \(I_{\tau} = (1_{Y(s) \geq \tau })_{s \in \mathcal{G}}\) given the observed values \((1_{Y(\alpha_i) \geq \tau })_{i \in 1:n}\) is achieved by:

- a conditional simulation of the Gaussian vector \(\mathbf{Y}|\mathbf{Y} \in B\) using the Gibbs sampler,
- a conditional simulation of the Gaussian vector \([Y(s)]_{s \in \mathcal{G}}\) given the constrained values, simulated using the turning band method and the conditioning simple kriging, and
- a coding of the vector by the indicator \([1_{Y(s) \geq \tau}]_{s \in \mathcal{G}}\).

- The conditional expectation of a Gaussian transform \(\phi(Y(s))\) with partially censored data \((\text{Max}(\text{LOQ}, \phi (Y(\alpha_i)))_{i \in 1:n}\), where LOQ is the limit of quantification of the measurement of \(\phi(Y)\), is achieved by:

- a coding of the observations into equality constraints, \(Y(\alpha_i) = \phi^{-1}(z_i)\) when \(z_i \geq \text{LOQ}\), and inequality constraints, \(Y(\alpha_i) \leq \phi^{-1}(\text{LOQ})\) when \(z_i = \text{LOQ}\),
- \(N\) conditional simulations of the constrained Gaussian vector, \((Y(\alpha_i)^{(k)})_{k \in 1:N, i \in 1:n}\), using the Gibbs sampler,
- the averaging of the conditional expectations \[ E\{\phi(Y(s))|data\} \approx \frac{1}{N} \sum_{k \in 1:N} \int \phi(y^{SK}_{(k)}(s)+ \sigma_{SK} \, u) g(u) du \] where \(y^{SK}_{(k)}(s)\) is the simple kriging computed using the values of the k-th simulations produced by the Gibbs sampler.

We can envisage more elaborate constraints using likelihood function. For example, if the observation model is defined by a likelihood function: \(f(z|y)\) and if the observations \((Z_i)_{i \in 1:n}\) given the value of the latent field \((Y(\alpha_i))_{i \in 1:n}\), the posterior distribution of the latent field \(y\) given the observed values \(z\) is: \[g(y|z) \propto \Pi_{\alpha \in 1:n} f(z_i|y_i) g(y_1, \dots, y_n).\]

This case corresponds to the conditional simulation of the intensity of a Cox point process given observed counting measures. This case is not currently available in **gstlearn**.

The function **gibbs_sampler** generates simulations of a Gaussian vector using the iterative Gibbs algorithm.

The random vector to be simulated is defined by:

- the point locations provided by the input
*Db*and - the
*model*which is a Second Order model (i.e., its mean and its covariance function are known).

We have, \((Y(\alpha))_{\alpha \in A}\) whose covariance matrix is \(\mathbf{C} = [\text{Cov}(Y(\alpha), Y(\beta)]\).

**Notes:**

If the model is not stationary, \(Y(s) = m(s) + X(s)\) with \(m(s) = \sum_{l \in 0:L} a_l f^{l}(s)\), the value of the coefficients \(a_l\) of the drift should be known and the drift functions \(f^{l}\) specified in the input database using the locators

**ELoc_F()**.The model can be multivariate and the covariance matrix is a block matrix in order to handle the different variables. In this case, the coefficients of the mean \(a_l\) are vectors.

The argument *percent* and *flag_norm* can be used to alter the covariance matrix \(\mathbf{C}\):

if

*percent*\(= \tau\), the covariance matrix is replaced by \(\mathbf{C \leftarrow \tau I + C}\).if

*flag_norm*is TRUE, \(\mathbf{C} \leftarrow \mathbf{diag}(\mathbf{diag}(\mathbf{C})^{-1/2}) \times \mathbf{C} \times \mathbf{diag}(\mathbf{diag}(\mathbf{C})^{-1/2})\).

The equality and inequality constraints are specified on the components of the vector to be simulated using the locators:

**ELoc_L**i for a lower bound on the i-th variables of the model,**ELoc_U**i for a upper bound on the i-th variables of the model.

Equality constraints are defined setting the lower bounds equal to upper bounds.

The Gibbs algorithm can be used to simulate a Gaussian vector without condition.

The following arguments control the outputs:

*nbsimu*is the number of independent simulations to be stored in the input*Db*using the*NamingConvention*.*flag_ce*and*flag_cstd*allow to compute the*conditional*expectation and the*conditional*standard deviation using the Monte-Carlo method, \[Y(\alpha)^{CE}= \frac{1}{n} \sum_{i \in 1:n} Y_{(k)}(\alpha)\] and \[Y(\alpha)^{STD} = \sqrt{\frac{1}{n} \sum_{i \in 1:n} (Y_{(k)}(\alpha))^2 - (Y(\alpha)^{CE})^2}\]

where \(Y_{(k)}\) is the k-th simulations. If *flag_ce OR flag_cstd* is TRUE, the simulations are not stored.

*seed*is the integer value used to initialize the random number generator.The argument

*verbose*controls the messages reported during the computation (in order to monitor the convergence ?).The argument

*gibbs_optstats*provides a control on the output of computed statistics with 0: No stats - 1: Print - 2: Save in Neutral file.

The parameters to control the Gibbs sampler are:

- The Initial number of iterations for bootstrapping:
*gibbs_nburn* - The Maximum number of iterations:
*gibbs_niter*

In both cases, an iteration is an update of all the components of the working vector (using a sequential or a random path).

flag_moving (Boolean). The Gibbs sampler works when a unique neighbourhood is used. The update of the value of a component takes into account the values of the other components. In some cases (for large vectors or for a Markov Random Field), only the values of the neighbours are taken into account to compute the conditional marginal distribution from which the univariate is drawn. If the Boolean flag is equal to

*TRUE*, the covariance/precision is simplified to define the*neighborhood*of each component.flag_multi_mono (Boolean). If this Boolean flag is equal to

*TRUE*, the components are grouped into conditionally independent blocks and a block update is performed.flag_propagation (Boolean). If this Boolean flag is equal to

*TRUE*, the propagation algorithm is used. However, it can be used only if no constraints is defined.

**Notes**:

*flag_sym_neigh*a Boolean is a deprecated argument.

The function returns an error code.

```
# ------------------------------------------------------------
# QC of simulations
# ------------------------------------------------------------
#' @param db a Db data base with the simulations to evaluate
#' @param names a list of strings with the names of the simulation in db
#' @param vario_param defines the variogram parameters
#' @param vario_model defines the simulated model
#' @param title a string to be used as the title of the figures
#' @param stat_keys a list of strings for statistics to be computed on the simulation
#' @values no returned value (invisible)
qc_simulations <- function(db, names, vario_param, vario_model, title, stat_keys) {
nsim = length(names)
# statistics
# tab = as.matrix(dbStatisticsMono(db, names = names,
# opers = EStatOption_fromKeys(stat_keys))$toTL())
#
# knitr::kable(tab, caption = title, digits = 4)
# histogram of simulation #1
i = 1
p = ggDefault() +
plot.hist(db = db, name = names[i], useSel = TRUE, bins = 25,
col = "orange", bg = "gray") +
plot.decoration(xlab = names[i], title = title)
ggPrint(p)
# base map
p = ggDefaultGeographic()
if (db$isGrid()) {
p = p + plot.grid(db, name_raster = names[i], show.legend.raster = TRUE,
legend.name.raster = "Y", palette = "Spectral")
} else {
p = p + plot.point(db, nameColor = names[i], flagLegendColor = TRUE,
palette = "Spectral")
}
p = p + plot.decoration(xlab = "Easting", ylab = "Northing", title = title)
ggPrint(p)
# variogram computations
vario_list = list()
for (i in 1:nsim) {
vario = Vario(vario_param)
err = db$setLocator(name = names[i], locatorType = ELoc_Z(),
cleanSameLocator = TRUE)
err = vario$compute(db)
vario_list[[1 + length(vario_list)]] <- vario
}
# plot of mean variogram per direction
err = qc_variogram(vario_list = vario_list, vario_param = vario_param,
vario_model = vario_model, title = title)
invisible()
}
# ------------------------------------------------------------
# plot of mean variogram per direction
# ------------------------------------------------------------
#' @param vario_list a list of variogram structure
#' @param vario_param defines the parameters used to compute the variograms
#' @param vario_model defines the variogram model used for the simulations
#' @param title a string to be used as the title of the figures
#' @values no returned value (invisible)
qc_variogram <- function(vario_list, vario_param, vario_model, title) {
nsim = length(vario_list)
ndir = vario_param$getDirectionNumber()
# variogram matrices
res_gg = list()
for (idir in 1:ndir) {
nlags = vario_param$getDirParam(idir-1)$getLagNumber()
gg_sim = matrix(NaN, nrow = nlags, ncol = nsim)
for (i in 1:nsim) {
gg_sim[,i] = vario_list[[i]]$getGgVec(idir = idir-1, ivar = 0, jvar = 0)
}
res_gg[[1 + length(res_gg)]] <- gg_sim
}
# plot for each direction
for (idir in 1:ndir) {
nlags = vario_param$getDirParam(idir-1)$getLagNumber()
# hh = dx[idir]*(1:(nlags-1))
hh = vario_list[[i]]$getHhVec(idir = idir-1, ivar = 0, jvar = 0)
v_mean = apply(X = res_gg[[idir]], MARGIN = 1, FUN = mean)
v_sd = apply(X = res_gg[[idir]], MARGIN = 1, FUN = sd)
# initial plot
dir = vario_param$getDirParam(idir - 1)$getCodirs()
sdir = sprintf("Direction = [%1d, %1d]", dir[1], dir[2])
plot(NULL, NULL, xaxs="i", yaxs="i",
xlim = c(0, max(hh)),
ylim = 1.2*c(0, vario_model$getTotalSill(ivar = 0, jvar = 0)),
xlab = "Distance", ylab = "Variogram", main = paste0(title, "\n", sdir))
abline(h = 0, lty = 1, col = "black")
abline(v = 0, lty = 1, col = "black")
# experimental variogram
lines(hh, v_mean, col = "orange", lw = 2)
lines(hh, v_mean + 2 * v_sd, col = "orange", lty = 2)
lines(hh, v_mean - 2 * v_sd, col = "orange", lty = 2)
for (s in 1:nsim) {
lines(hh, res_gg[[idir]][,s] , col = "gray", lty = 3)
}
# variogram model
c00 = vario_model$getTotalSill()
abline(h = c00, lty = 2, col = "skyblue")
lines(hh, c00 -
vario_model$sample(h = hh, codir = vario_param$getDirParam(idir - 1)$getCodirs()),
lty = 1, lw = 2, col = "skyblue")
# legend
legend("bottomright",
legend = c("model", "mean variogram", "+/- 2 x Std.",
"empirical variograms"),
lty = c(1, 1, 2, 3), lw = c(2, 2, 1, 1),
col = c("skyblue", "orange", "orange", "gray")
)
} # loop over the directions
invisible()
}
```

```
# ---------------------------------------------
# parameters
# ---------------------------------------------
seed = 1235
set.seed(seed)
OptDbg_setReference(0)
```

`## NULL`

```
# Defining a set of statistics
stat_keys = c("NUM","MINI","MAXI","MEAN","STDV")
opers = EStatOption_fromKeys(stat_keys)
# defining the variogram parameters
varioParam = VarioParam_createOmniDirection(npas=10, dpas=0.05)
np = 1000 # total number of points
n = 50 # number of equality constraints
p = np - n # number of inequality constraints
# ---------------------------------------------
# Simulation of the data set
# ---------------------------------------------
data = Db_createFillRandom(ndat = np, ndim = 2, nvar = 0)
model = Model_createFromParam(type = ECov_EXPONENTIAL(), range = 0.5, sill = 1)
err = model$setMeans(0)
err = model$setDriftIRF(order = -1)
err = simtub(dbin = NULL, dbout = data, model = model, nbsimu = 1, seed = seed)
# plot
p = ggDefaultGeographic() +
plot.point(data, color = "red") +
plot.decoration(title = "Defined point constraints")
ggPrint(p)
```

```
nburn = 10
niter = 300
nbsimu = 10
flag.moving = TRUE
flag.mono = FALSE
flag.prop = FALSE
```

Test the standard vs.Â propagative algorithm.

```
prefix = "g1"
err = data$deleteColumns(paste(prefix, "*", sep = "."))
# simulation
err = data$clearLocators(ELoc_Z())
err = data$clearLocators(ELoc_U())
err = data$clearLocators(ELoc_L())
err = gibbs_sampler(dbin = data, model = model,
nbsimu = nbsimu, flag_ce = FALSE, flag_cstd = FALSE,
namconv = NamingConvention(prefix),
gibbs_nburn = nburn, gibbs_niter = niter,
flag_moving = flag.moving, flag_multi_mono = flag.mono, flag_propagation = flag.prop,
flag_norm = FALSE, percent = 0,
seed = seed, verbose = TRUE, gibbs_optstats = 0,
flag_sym_neigh = FALSE
)
```

```
##
## Gibbs using Moving Neighborhood
## -------------------------------
## Building Covariance Sparse Matrix (Dimension = 1000)
## Cholesky Decomposition of Covariance Matrix
## Calculating and storing the weights
## Weights are stored internally
## - Total core needs = 1000000
## - Number of zero weights = 936730
## - Percentage = 6.33
```

```
# variograms
qc_simulations(db = data, names = paste(prefix, 1:nbsimu, sep = "."),
vario_param = varioParam, vario_model = model,
title = paste0("Gibbs sampler: np = ", np), stat_keys = stat_keys)
```