This notebook presents the simulation of a Gaussian vector under equality/inequality constraints using the Gibbs sampler. Gibbs sampling in gstlearn is illustrated with:
the conditional simulation of an excursion set (e.g. a possible model for binary variable valued in \(\{0,1\}\)),
the computation of the conditional expectation when the observations have been censored (e.g., to take into account the limit of quantification, or LOQ, on the sample analysis).
The description of the Gibbs sampler given in Wikipedia is:
" Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for sampling from a specified multivariate probability distribution when direct sampling from the joint distribution is difficult, but sampling from the conditional distribution is more practical. This sequence can be used
Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.
Gibbs sampling is commonly used as a means of statistical inference, especially Bayesian inference. It is a randomized algorithm (i.e. an algorithm that makes use of random numbers), and is an alternative to deterministic algorithms for statistical inference such as the expectation-maximization algorithm (EM).
As with other MCMC algorithms, Gibbs sampling generates a Markov chain of samples, each of which is correlated with nearby samples. As a result, care must be taken if independent samples are desired. Generally, samples from the beginning of the chain (the burn-in period) may not accurately represent the desired distribution and are usually discarded. "
We consider the Gaussian vector \(\mathbf{Y}\) defined by the observations on a finite set \(A = \{\alpha_1, \dots, \alpha_n \}\): \(\mathbf{Y} = [Y(\alpha_i)]_{i \in 1:n}\) of a Gaussian random function. This random vector defined by
follows the Gaussian probability density function: \[ g_{\mathbf{m, \Sigma}}(y) = (\pi)^{-n/2} \, |\mathbf{\Sigma}|^{-1/2} \exp{(-\frac{1}{2} y^{t} \mathbf{\Sigma}^{-1} y)}. \]
The Gibbs sampler defines a Markov chain of random vectors \(\mathbf{Y^{(0)}, Y^{(1)}, \dots, Y^{(t)}}\) whose limiting distribution is the Gaussian distribution \(\mathcal{N}(\mathbf{m, \Sigma})\). An iteration consists in a sequential update of all the components drawn from its univariate conditional distribution: \[ Y_i^{(t)} \leftarrow y_i^{SK}(Y^{(t)}_{k < i }, Y^{(t-1)}_{i < k }) + \sigma^{SK}_i\mathcal{N}(0,1) \] where \(y_i^{SK}(Y^{(t)}_{k < i }, Y^{(t-1)}_{i < k })\) is the simple kriging of the component \(Y_i\) by the other components, those already updated \(Y^{(t)}_{k < i }\) in the current iteration and the values of the previous iteration \(Y^{(t-1)}_{k > i }\) for the others.
If univariate inequality constraints are added on the vector: \(\bigcap_{i \in 1:n} \{ l_i \leq Y_i \leq u_i\} = \bigcap_{i \in 1:n} \{ Y_i \in B_i\}\), the conditional distribution is the restriction of the Gaussian distribution to the n-dimensional polyhedron \(B = B_1 \times \dots \times B_n\):
\[ Y | Y \in B \sim \frac{g(y) 1_{y \in B}}{\int_B g(u)du} \] The Gibbs sampler updates sequentially each component \(Y_i\) using: \[ Y_i^{(t)} \leftarrow y_i^{SK} + \mathcal{N}_{C_i^{(t)}}(0,1). \] The residual is drawn from the restriction of the univariate Gaussian distribution to the interval \[ C_i^{(t)} = [\frac{l_i-y_i^{SK}(Y^{(t)}_{k < i }, Y^{(t-1)}_{i < k })}{\sigma_i}, \frac{u_i-y_i^{SK}(Y^{(t)}_{k < i }, Y^{(t-1)}_{i < k })}{\sigma_i}]. \]
We can envisage more elaborate constraints using likelihood function. For example, if the observation model is defined by a likelihood function: \(f(z|y)\) and if the observations \((Z_i)_{i \in 1:n}\) given the value of the latent field \((Y(\alpha_i))_{i \in 1:n}\), the posterior distribution of the latent field \(y\) given the observed values \(z\) is: \[g(y|z) \propto \Pi_{\alpha \in 1:n} f(z_i|y_i) g(y_1, \dots, y_n).\]
This case corresponds to the conditional simulation of the intensity of a Cox point process given observed counting measures. This case is not currently available in gstlearn.
The function gibbs_sampler generates simulations of a Gaussian vector using the iterative Gibbs algorithm.
The random vector to be simulated is defined by:
We have, \((Y(\alpha))_{\alpha \in A}\) whose covariance matrix is \(\mathbf{C} = [\text{Cov}(Y(\alpha), Y(\beta)]\).
Notes:
If the model is not stationary, \(Y(s) = m(s) + X(s)\) with \(m(s) = \sum_{l \in 0:L} a_l f^{l}(s)\), the value of the coefficients \(a_l\) of the drift should be known and the drift functions \(f^{l}\) specified in the input database using the locators ELoc_F().
The model can be multivariate and the covariance matrix is a block matrix in order to handle the different variables. In this case, the coefficients of the mean \(a_l\) are vectors.
The argument percent and flag_norm can be used to alter the covariance matrix \(\mathbf{C}\):
if percent \(= \tau\), the covariance matrix is replaced by \(\mathbf{C \leftarrow \tau I + C}\).
if flag_norm is TRUE, \(\mathbf{C} \leftarrow \mathbf{diag}(\mathbf{diag}(\mathbf{C})^{-1/2}) \times \mathbf{C} \times \mathbf{diag}(\mathbf{diag}(\mathbf{C})^{-1/2})\).
The equality and inequality constraints are specified on the components of the vector to be simulated using the locators:
ELoc_Li for a lower bound on the i-th variables of the model,
ELoc_Ui for a upper bound on the i-th variables of the model.
Equality constraints are defined setting the lower bounds equal to upper bounds.
The Gibbs algorithm can be used to simulate a Gaussian vector without condition.
The following arguments control the outputs:
nbsimu is the number of independent simulations to be stored in the input Db using the NamingConvention.
flag_ce and flag_cstd allow to compute the conditional expectation and the conditional standard deviation using the Monte-Carlo method, \[Y(\alpha)^{CE}= \frac{1}{n} \sum_{i \in 1:n} Y_{(k)}(\alpha)\] and \[Y(\alpha)^{STD} = \sqrt{\frac{1}{n} \sum_{i \in 1:n} (Y_{(k)}(\alpha))^2 - (Y(\alpha)^{CE})^2}\]
where \(Y_{(k)}\) is the k-th simulations. If flag_ce OR flag_cstd is TRUE, the simulations are not stored.
seed is the integer value used to initialize the random number generator.
The argument verbose controls the messages reported during the computation (in order to monitor the convergence ?).
The argument gibbs_optstats provides a control on the output of computed statistics with 0: No stats - 1: Print - 2: Save in Neutral file.
The parameters to control the Gibbs sampler are:
In both cases, an iteration is an update of all the components of the working vector (using a sequential or a random path).
flag_moving (Boolean). The Gibbs sampler works when a unique neighbourhood is used. The update of the value of a component takes into account the values of the other components. In some cases (for large vectors or for a Markov Random Field), only the values of the neighbours are taken into account to compute the conditional marginal distribution from which the univariate is drawn. If the Boolean flag is equal to TRUE, the covariance/precision is simplified to define the neighborhood of each component.
flag_multi_mono (Boolean). If this Boolean flag is equal to TRUE, the components are grouped into conditionally independent blocks and a block update is performed.
flag_propagation (Boolean). If this Boolean flag is equal to TRUE, the propagation algorithm is used. However, it can be used only if no constraints is defined.
Notes:
The function returns an error code.
# ------------------------------------------------------------
# QC of simulations
# ------------------------------------------------------------
#' @param db a Db data base with the simulations to evaluate
#' @param names a list of strings with the names of the simulation in db
#' @param vario_param defines the variogram parameters
#' @param vario_model defines the simulated model
#' @param title a string to be used as the title of the figures
#' @param stat_keys a list of strings for statistics to be computed on the simulation
#' @values no returned value (invisible)
qc_simulations <- function(db, names, vario_param, vario_model, title, stat_keys) {
nsim = length(names)
# statistics
# tab = as.matrix(dbStatisticsMono(db, names = names,
# opers = EStatOption_fromKeys(stat_keys))$toTL())
#
# knitr::kable(tab, caption = title, digits = 4)
# histogram of simulation #1
i = 1
p = ggDefault() +
plot.hist(db = db, name = names[i], useSel = TRUE, bins = 25,
col = "orange", bg = "gray") +
plot.decoration(xlab = names[i], title = title)
ggPrint(p)
# base map
p = ggDefaultGeographic()
if (db$isGrid()) {
p = p + plot.grid(db, name_raster = names[i], show.legend.raster = TRUE,
legend.name.raster = "Y", palette = "Spectral")
} else {
p = p + plot.point(db, nameColor = names[i], flagLegendColor = TRUE,
palette = "Spectral")
}
p = p + plot.decoration(xlab = "Easting", ylab = "Northing", title = title)
ggPrint(p)
# variogram computations
vario_list = list()
for (i in 1:nsim) {
vario = Vario(vario_param)
err = db$setLocator(name = names[i], locatorType = ELoc_Z(),
cleanSameLocator = TRUE)
err = vario$compute(db)
vario_list[[1 + length(vario_list)]] <- vario
}
# plot of mean variogram per direction
err = qc_variogram(vario_list = vario_list, vario_param = vario_param,
vario_model = vario_model, title = title)
invisible()
}
# ------------------------------------------------------------
# plot of mean variogram per direction
# ------------------------------------------------------------
#' @param vario_list a list of variogram structure
#' @param vario_param defines the parameters used to compute the variograms
#' @param vario_model defines the variogram model used for the simulations
#' @param title a string to be used as the title of the figures
#' @values no returned value (invisible)
qc_variogram <- function(vario_list, vario_param, vario_model, title) {
nsim = length(vario_list)
ndir = vario_param$getDirectionNumber()
# variogram matrices
res_gg = list()
for (idir in 1:ndir) {
nlags = vario_param$getDirParam(idir-1)$getLagNumber()
gg_sim = matrix(NaN, nrow = nlags, ncol = nsim)
for (i in 1:nsim) {
gg_sim[,i] = vario_list[[i]]$getGgVec(idir = idir-1, ivar = 0, jvar = 0)
}
res_gg[[1 + length(res_gg)]] <- gg_sim
}
# plot for each direction
for (idir in 1:ndir) {
nlags = vario_param$getDirParam(idir-1)$getLagNumber()
# hh = dx[idir]*(1:(nlags-1))
hh = vario_list[[i]]$getHhVec(idir = idir-1, ivar = 0, jvar = 0)
v_mean = apply(X = res_gg[[idir]], MARGIN = 1, FUN = mean)
v_sd = apply(X = res_gg[[idir]], MARGIN = 1, FUN = sd)
# initial plot
dir = vario_param$getDirParam(idir - 1)$getCodirs()
sdir = sprintf("Direction = [%1d, %1d]", dir[1], dir[2])
plot(NULL, NULL, xaxs="i", yaxs="i",
xlim = c(0, max(hh)),
ylim = 1.2*c(0, vario_model$getTotalSill(ivar = 0, jvar = 0)),
xlab = "Distance", ylab = "Variogram", main = paste0(title, "\n", sdir))
abline(h = 0, lty = 1, col = "black")
abline(v = 0, lty = 1, col = "black")
# experimental variogram
lines(hh, v_mean, col = "orange", lw = 2)
lines(hh, v_mean + 2 * v_sd, col = "orange", lty = 2)
lines(hh, v_mean - 2 * v_sd, col = "orange", lty = 2)
for (s in 1:nsim) {
lines(hh, res_gg[[idir]][,s] , col = "gray", lty = 3)
}
# variogram model
c00 = vario_model$getTotalSill()
abline(h = c00, lty = 2, col = "skyblue")
lines(hh, c00 -
vario_model$sample(h = hh, codir = vario_param$getDirParam(idir - 1)$getCodirs()),
lty = 1, lw = 2, col = "skyblue")
# legend
legend("bottomright",
legend = c("model", "mean variogram", "+/- 2 x Std.",
"empirical variograms"),
lty = c(1, 1, 2, 3), lw = c(2, 2, 1, 1),
col = c("skyblue", "orange", "orange", "gray")
)
} # loop over the directions
invisible()
}
# ---------------------------------------------
# parameters
# ---------------------------------------------
seed = 1235
set.seed(seed)
OptDbg_setReference(0)
## NULL
# Defining a set of statistics
stat_keys = c("NUM","MINI","MAXI","MEAN","STDV")
opers = EStatOption_fromKeys(stat_keys)
# defining the variogram parameters
varioParam = VarioParam_createOmniDirection(npas=10, dpas=0.05)
np = 1000 # total number of points
n = 50 # number of equality constraints
p = np - n # number of inequality constraints
# ---------------------------------------------
# Simulation of the data set
# ---------------------------------------------
data = Db_createFillRandom(ndat = np, ndim = 2, nvar = 0)
model = Model_createFromParam(type = ECov_EXPONENTIAL(), range = 0.5, sill = 1)
err = model$setMeans(0)
err = model$setDriftIRF(order = -1)
err = simtub(dbin = NULL, dbout = data, model = model, nbsimu = 1, seed = seed)
# plot
p = ggDefaultGeographic() +
plot.point(data, color = "red") +
plot.decoration(title = "Defined point constraints")
ggPrint(p)
nburn = 10
niter = 300
nbsimu = 10
flag.moving = TRUE
flag.mono = FALSE
flag.prop = FALSE
Test the standard vs. propagative algorithm.
prefix = "g1"
err = data$deleteColumns(paste(prefix, "*", sep = "."))
# simulation
err = data$clearLocators(ELoc_Z())
err = data$clearLocators(ELoc_U())
err = data$clearLocators(ELoc_L())
err = gibbs_sampler(dbin = data, model = model,
nbsimu = nbsimu, flag_ce = FALSE, flag_cstd = FALSE,
namconv = NamingConvention(prefix),
gibbs_nburn = nburn, gibbs_niter = niter,
flag_moving = flag.moving, flag_multi_mono = flag.mono, flag_propagation = flag.prop,
flag_norm = FALSE, percent = 0,
seed = seed, verbose = TRUE, gibbs_optstats = 0,
flag_sym_neigh = FALSE
)
##
## Gibbs using Moving Neighborhood
## -------------------------------
## Building Covariance Sparse Matrix (Dimension = 1000)
## Cholesky Decomposition of Covariance Matrix
## Calculating and storing the weights
## Weights are stored internally
## - Total core needs = 1000000
## - Number of zero weights = 936730
## - Percentage = 6.33
# variograms
qc_simulations(db = data, names = paste(prefix, 1:nbsimu, sep = "."),
vario_param = varioParam, vario_model = model,
title = paste0("Gibbs sampler: np = ", np), stat_keys = stat_keys)