# Multivariate¶

## Preamble¶

In this preamble, we load the **gstlearn** library, as well as the other Python packages used throughout this course.

```
import gstlearn as gl
import gstlearn.plot as gp
import gstlearn.document as gdoc
import matplotlib.pyplot as plt
import numpy as np
import os
gdoc.setNoScroll()
ndim = 2
gl.defineDefaultSpace(gl.ESpaceType.RN, ndim)
gl.OptCst.define(gl.ECst.NTCAR, 15)
gp.setDefaultGeographic(dims=[8,8])
```

We load two data bases:

- a data base `dat` containing point observations of two variables across Scotland: the elevation (`Elevation`) and the temperature (`January_temp`, later renamed `Temperature`);
- a data base `target` containing a grid of points covering Scotland, with a selection variable (`inshore`) selecting the points that are on land, and a variable (`Elevation`) giving the elevation at every point on land.

```
# Load observations
temp_nf = gdoc.loadData("Scotland", "Scotland_Temperatures.NF")
dat = gl.Db.createFromNF(temp_nf)
# Change variable name
dat.setName("*temp", "Temperature")
# Load grid
elev_nf = gdoc.loadData("Scotland", "Scotland_Elevations.NF")
target = gl.DbGrid.createFromNF(elev_nf)
```

### Exploratory data analysis¶

The `target` data base contains a (grid) map of the elevation across Scotland.

```
target
```

```
Data Base Grid Characteristics
==============================

Data Base Summary
-----------------
File is organized as a regular grid
Space dimension          = 2
Number of Columns        = 4
Total number of samples  = 11097
Number of active samples = 3092

Grid characteristics:
---------------------
Origin :     65.000   535.000
Mesh   :      4.938     4.963
Number :         81       137

Variables
---------
Column = 0 - Name = Longitude - Locator = x1
Column = 1 - Name = Latitude - Locator = x2
Column = 2 - Name = Elevation - Locator = f1
Column = 3 - Name = inshore - Locator = sel
```

Let us plot its content.

```
ax = target.plot("Elevation", flagLegendRaster=True)
```

The `dat` data base contains 236 (point) samples of two variables across Scotland: elevation and temperature.

```
dat
```

```
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension         = 2
Number of Columns       = 5
Total number of samples = 236

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = Longitude - Locator = x1
Column = 2 - Name = Latitude - Locator = x2
Column = 3 - Name = Elevation - Locator = NA
Column = 4 - Name = Temperature - Locator = z1
```

We can use the `dbStatisticsPrint` function to compute statistics on variables of a `Db`. We specify:

- the `Db` of interest (argument `db`),
- a vector containing the names of the variables of interest (argument `names`),
- the statistics we want to compute (argument `opers`); this last argument is set through the `EStatOption.fromKeys` function, which we call with a vector containing the (abbreviated) names of the statistics (run `EStatOption.printAll()` for the full list),
- a flag `flagIso` specifying whether the statistics should be computed for each variable separately (`flagIso=False`), or in a "filtered" way (`flagIso=True`); in the latter case, the statistics are computed using only the points where all the variables are defined.

For instance, let us count the number of observations of each variable using the `dbStatisticsPrint` function.

```
gl.dbStatisticsPrint(db=dat, names=["Elevation", "Temperature"],
                     opers=gl.EStatOption.fromKeys(["NUM"]), flagIso=False,
                     title="Number of observations")
```

```
Number of observations
----------------------
            Number
Elevation      236
Temperature    151
```

Since the data base `dat` contains 236 samples, we can conclude that the `Elevation` variable is defined at every point, but not the `Temperature` variable. Similarly, we can compute the mean and the extreme values of each variable in the observation data base.

```
gl.dbStatisticsPrint(db=dat, names=["Elevation", "Temperature"],
                     opers=gl.EStatOption.fromKeys(["MEAN","MINI","MAXI"]), flagIso=False,
                     title="Statistics of observations")
```

```
Statistics of observations
--------------------------
            Minimum  Maximum     Mean
Elevation     2.000  800.000  146.441
Temperature   0.600    5.200    2.815
```

Finally, we compute the mean and the extreme values of the elevation in the target data base.

```
gl.dbStatisticsPrint(db=target, names=["Elevation"],
                     opers=gl.EStatOption.fromKeys(["MEAN","MINI","MAXI"]), flagIso=False,
                     title="Statistics of target")
```

```
Statistics of target
--------------------
           Minimum   Maximum     Mean
Elevation    0.000  1270.000  241.152
```

We can then compute the filtered statistics (mean and extreme values) for the `Elevation` and `Temperature` variables as follows:

```
gl.dbStatisticsPrint(db=dat, names=["Elevation", "Temperature"],
                     opers=gl.EStatOption.fromKeys(["MEAN","MINI","MAXI"]), flagIso=True,
                     title="Filtered statistics of observations")
```

```
Filtered statistics of observations
-----------------------------------
            Minimum  Maximum    Mean
Elevation     3.000  387.000  87.974
Temperature   0.600    5.200   2.815
```

As explained above, the first row of the table contains the minimum, the maximum and the mean of the observations of the `Elevation` variable over all the locations where both the `Elevation` and `Temperature` variables are defined (i.e., in our case, all the points where the temperature is defined). Hence, we see that the points where the temperature is observed are located at relatively low altitudes within the Scottish landscape.
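To make the `flagIso` convention concrete, here is a plain-NumPy sketch of the two computations on hypothetical data (this only illustrates the definition, not gstlearn's implementation):

```python
import numpy as np

# Hypothetical mini data set with missing values (np.nan), mimicking `dat`:
# elevation is known everywhere, temperature only at some points.
elevation = np.array([10.0, 50.0, 200.0, 350.0])
temperature = np.array([3.0, np.nan, 2.0, np.nan])

# flagIso=False: each variable uses its own defined samples
mean_elev_separate = np.nanmean(elevation)           # all 4 samples

# flagIso=True ("isotopic"): restrict to samples where ALL variables are defined
iso = ~np.isnan(elevation) & ~np.isnan(temperature)  # rows 0 and 2 only
mean_elev_filtered = elevation[iso].mean()           # only 2 samples
```

On the real data, this is exactly why the filtered mean of `Elevation` (87.974) differs from its plain mean (146.441): only the 151 samples where the temperature is also defined contribute to the former.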

To confirm that, we plot the points where the temperature is sampled on top of the elevation map of Scotland.

```
ax = target.plot(nameRaster="Elevation")
ax = dat.plot(nameSize="Temperature", color="yellow", sizmin=0.1, sizmax=20)
```

From this last plot, it seems that the bigger points (corresponding to higher temperatures) are located where the elevation is lower: this hints that the temperature is negatively correlated with the elevation. To corroborate this, we produce a correlation plot between the two variables.

```
ax = gp.correlation(dat, namex="Elevation", namey="Temperature", asPoint=True, regrLine=True)
ax.decoration(title="Correlation between Temperature and Elevation")
```

### Baseline univariate model¶

In this course, we introduce four methods for dealing with multivariate data: cokriging, residual modeling, models with polynomial trends, and models with an external drift. As a baseline, we will compare the results obtained with each of these models to the results obtained with a univariate model.

#### Model fitting¶

We start by computing an experimental directional variogram `vario_raw2dir` of the "raw" temperature observations, along two directions ($0^{\circ}$ and $90^{\circ}$).

```
varioparam = gl.VarioParam.createMultiple(ndir=2, npas=30, dpas=10)
vario_raw2dir = gl.Vario(varioparam)
err = vario_raw2dir.compute(dat)
```
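For reference, the experimental semi-variogram of a lag class is half the mean squared increment over all pairs of points whose separation distance falls in that class. A minimal omnidirectional sketch in plain NumPy (illustrative only; gstlearn additionally handles directions, angular tolerances and missing values — the `npas` and `dpas` arguments above play the same role as here):

```python
import numpy as np

def experimental_variogram(coords, z, dpas, npas):
    """Omnidirectional experimental semi-variogram with `npas` lag classes
    of width `dpas`: gamma(h) = 0.5 * mean[(z_i - z_j)^2] over pairs in class h."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    dz2 = (z[:, None] - z[None, :]) ** 2
    upper = np.triu(np.ones(d.shape, dtype=bool), k=1)  # count each pair once
    gamma = np.full(npas, np.nan)
    for k in range(npas):
        in_class = upper & (d >= k * dpas) & (d < (k + 1) * dpas)
        if in_class.any():
            gamma[k] = 0.5 * dz2[in_class].mean()
    return gamma

# Tiny example: three aligned points
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
z = np.array([0.0, 1.0, 0.0])
gamma = experimental_variogram(coords, z, dpas=1.5, npas=2)
```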

We then fit a model on the experimental variogram of raw temperature observations. Since we want to use ordinary kriging, we add a constant drift to the model.

```
fitmod_raw = gl.Model()
err = fitmod_raw.fit(vario_raw2dir,
                     types=[gl.ECov.NUGGET, gl.ECov.EXPONENTIAL, gl.ECov.CUBIC, gl.ECov.LINEAR])
fitmod_raw.setDriftIRF(0, 0)
```

```
ax = gp.varmod(vario_raw2dir, fitmod_raw)
ax.decoration(title="Experimental and fitted variogram models - Raw temperature observations",
              xlabel="Distance (km)", ylabel="Variogram")
```

#### Ordinary kriging¶

We create a "neighborhood" object specifying a unique neighborhood, which we will use throughout the course.

```
uniqueNeigh = gl.NeighUnique.create()
```

We now perform an ordinary kriging prediction of the temperature on the `target` grid using the model fitted on the raw observations, and compute statistics on the predicted values.

```
err = gl.kriging(dbin=dat, dbout=target, model=fitmod_raw,
                 neigh=uniqueNeigh,
                 namconv=gl.NamingConvention.create(prefix="OK"))
```

```
ax = target.plot(nameRaster="OK*estim")
ax = dat.plot(flagCst=True, color="black")
ax.decoration(title="Temperature - Ordinary Kriging")
```

```
opers = gl.EStatOption.fromKeys(["NUM","MINI","MAXI","MEAN","STDV"])
gl.dbStatisticsPrint(target, names=["OK.T*"], opers=opers,
                     title="Statistics on the Ordinary Kriging:")
```

```
Statistics on the Ordinary Kriging:
-----------------------------------
                      Number  Minimum  Maximum   Mean  St. Dev.
OK.Temperature.estim    3092    0.588    5.153  2.826     0.976
OK.Temperature.stdev    3092    0.036    0.990  0.407     0.187
```

#### Cross-validation¶

We perform a cross-validation of the fitted model using Ordinary Kriging, and calculate the Mean Squared cross-validation and standardized errors.

```
# Compute cross-validation
err = gl.xvalid(dat, model=fitmod_raw,
                neigh=uniqueNeigh,
                namconv=gl.NamingConvention.create(prefix="CV_OK", flag_locator=False))
```

```
mse=np.nanmean(np.square(dat.getColumn("CV_OK*esterr*")))
print("Mean squared cross-validation error:", round(mse,3))
mse=np.nanmean(np.square(dat.getColumn("CV_OK*stderr*")))
print("Mean squared standardized error:", round(mse,3))
```

```
Mean squared cross-validation error: 0.281
Mean squared standardized error: 2.635
```
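For reference (these are standard cross-validation scores, not specific to gstlearn), the two quantities computed above are

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^n \big(Z^\star_{(i)}(x_i) - Z(x_i)\big)^2, \qquad \mathrm{MSSE} = \frac{1}{n}\sum_{i=1}^n \left(\frac{Z^\star_{(i)}(x_i) - Z(x_i)}{\sigma_{(i)}(x_i)}\right)^2,$$

where $Z^\star_{(i)}$ and $\sigma_{(i)}$ denote the kriging prediction and standard deviation computed without the $i$-th observation. A well-calibrated model yields an MSSE close to 1; the value 2.635 obtained here suggests that the kriging standard deviations of the raw model underestimate the actual prediction errors.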

## Multivariate Models and Cokriging¶

To create and work with multivariate models, we simply need to work with `Db` objects containing more than one variable with a `z` locator. All the variables with a `z` locator will be considered as part of the multivariate model. Then, the same commands as in the univariate case can be used to create and fit experimental variograms, and to perform (co)kriging predictions.

Let us illustrate this with our running example: we now consider a bivariate model of the temperature and the elevation. To do so, we simply allocate, in the observation data base `dat`, a `z` locator to both variables using the `setLocators` method.

```
dat.setLocators(names=["Temperature", "Elevation"], locatorType=gl.ELoc.Z)
dat
```

```
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension         = 2
Number of Columns       = 7
Total number of samples = 236

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = Longitude - Locator = x1
Column = 2 - Name = Latitude - Locator = x2
Column = 3 - Name = Elevation - Locator = z2
Column = 4 - Name = Temperature - Locator = z1
Column = 5 - Name = CV_OK.Temperature.esterr - Locator = NA
Column = 6 - Name = CV_OK.Temperature.stderr - Locator = NA
```

### Fitting a bivariate model¶

To create experimental (directional) variograms and cross-variograms, we use the same commands as in the univariate case: since the data base `dat` now contains two variables with a `z` locator, the `compute` method automatically computes both the variograms and the cross-variograms for these variables.

```
varioexp2var = gl.Vario.create(varioparam)
err = varioexp2var.compute(dat)
```

We can then plot the experimental variograms and cross-variograms with a simple command: the plot in the i-th row and j-th column corresponds to the cross-variogram between the variables with locators `zi` and `zj` (hence the diagonal plots correspond to the variograms of each variable).

```
ax=gp.varmod(varioexp2var)
```
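For reference, the off-diagonal plots estimate the cross-variogram between two variables $Z_i$ and $Z_j$,

$$\gamma_{ij}(h) = \frac{1}{2}\,\mathrm{E}\big[\big(Z_i(x+h) - Z_i(x)\big)\big(Z_j(x+h) - Z_j(x)\big)\big],$$

which reduces to the usual variogram when $i = j$. Unlike a variogram, a cross-variogram can take negative values, as expected here given the negative correlation between temperature and elevation.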

To fit a model on the experimental variograms and cross-variograms, we use the same commands as in the univariate case.

```
fitmod2var = gl.Model()
err = fitmod2var.fit(varioexp2var,
                     types=[gl.ECov.NUGGET, gl.ECov.EXPONENTIAL, gl.ECov.CUBIC, gl.ECov.LINEAR])
fitmod2var.setDriftIRF(0, 0)
```

```
ax = gp.varmod(varioexp2var, fitmod2var, lw=2)
gp.decoration(ax,title="Temperature (°C) and Elevation")
```
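As background (a standard construction in multivariate geostatistics, stated here in general rather than as a description of gstlearn internals), a valid set of variograms and cross-variograms is typically built as a linear model of coregionalization: all entries share the same basic structures $g_k$,

$$\gamma_{ij}(h) = \sum_k b^{(k)}_{ij}\, g_k(h),$$

where each matrix of sills $B^{(k)} = \big(b^{(k)}_{ij}\big)$ must be positive semi-definite for the resulting model to be valid.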

### Cokriging predictions¶

To compute cokriging predictions on the grid, we use the same syntax as in the univariate case: a predictor is produced for each variable in the multivariate model. (Note: we again use the unique neighborhood, so as to compare with the predictors previously introduced.)

```
err = gl.kriging(dbin=dat, dbout=target, model=fitmod2var,
                 neigh=uniqueNeigh,
                 namconv=gl.NamingConvention.create(prefix="COK"))
```

We can then represent the cokriging predictor for the temperature.

```
ax = target.plot(nameRaster="COK.Temp*estim")
ax = dat.plot(flagCst=True, color="black")
ax.decoration(title="Temperature - CoKriging")
```

For this predictor, we get the following statistics:

```
opers = gl.EStatOption.fromKeys(["NUM","MINI","MAXI","MEAN","STDV"])
gl.dbStatisticsPrint(target, names=["COK.T*"], opers=opers,
                     title="Statistics on the CoKriging predictions")
```

```
Statistics on the CoKriging predictions
---------------------------------------
                       Number  Minimum  Maximum   Mean  St. Dev.
COK.Temperature.estim    3092    0.200    5.094  2.671     0.970
COK.Temperature.stdev    3092    0.231    0.948  0.448     0.109
```

Finally, we compare the cokriging predictor to the ordinary kriging predictor.

```
ax = gp.correlation(target, namex="OK.T*estim", namey="COK.T*estim", bissLine=True, bins=100, cmin=1)
ax.decoration(xlabel="Ordinary Kriging",ylabel="Ordinary CoKriging")
```

```
opers = gl.EStatOption.fromKeys(["NUM","MINI","MAXI","MEAN","STDV"])
gl.dbStatisticsPrint(target, names=["OK.T*estim", "COK.T*estim"], opers=opers,
                     title="Comparison between Ordinary Kriging and CoKriging predictions")
```

```
Comparison between Ordinary Kriging and CoKriging predictions
-------------------------------------------------------------
                       Number  Minimum  Maximum   Mean  St. Dev.
OK.Temperature.estim     3092    0.588    5.153  2.826     0.976
COK.Temperature.estim    3092    0.200    5.094  2.671     0.970
```

### Cross-validation¶

Since cokriging can be time-consuming with a unique neighborhood, we create a small moving neighborhood for this demonstration.

```
movingNeigh = gl.NeighMoving.create(radius = 1000, nmaxi = 10)
```

To perform a cross-validation of the bivariate model using cokriging, we use the same commands as in the univariate case. Cross-validation errors are then computed for each variable of the multivariate model (hence, in our case, for both the temperature and the elevation).

```
err = gl.xvalid(dat, model=fitmod2var,
                neigh=movingNeigh,
                namconv=gl.NamingConvention.create(prefix="CV_COK", flag_locator=False))
```

We obtain the following Mean Squared Errors for the temperature.

```
mse=np.nanmean(np.square(dat.getColumn("CV_COK.Temperature*esterr*")))
print("Mean squared cross-validation error:", round(mse,3))
mse=np.nanmean(np.square(dat.getColumn("CV_COK.Temperature*stderr*")))
print("Mean squared standardized error:", round(mse,3))
```

```
Mean squared cross-validation error: 0.279
Mean squared standardized error: 1.227
```

We obtain the following Mean Squared Errors for the elevation.

```
mse=np.nanmean(np.square(dat.getColumn("CV_COK.Elevation*esterr*")))
print("Mean squared cross-validation error:", round(mse,3))
mse=np.nanmean(np.square(dat.getColumn("CV_COK.Elevation*stderr*")))
print("Mean squared standardized error:", round(mse,3))
```

```
Mean squared cross-validation error: 17849.434
Mean squared standardized error: 1.206
```

## Working with residuals¶

In this section, we assume that the variable of interest $Z$ is modeled (at each location $x$) as $$ Z(x) = b + a Y(x) + \varepsilon(x)$$ where $Y$ is an auxiliary variable known at every location, $a$ and $b$ are some (unknown) regression coefficients, and $\varepsilon$ denotes stationary residuals. Our goal is to model and work directly with the residuals $\varepsilon$ (since they are the ones assumed to be stationary).

### Computing and fitting the residuals¶

In our running example, the variable of interest $Z$ will be the temperature, and the auxiliary variable $Y$ will be the elevation. In the observation data base, we start by assigning the locator `z` to the `Temperature` variable (this is our variable of interest), and ensure that it is the only variable with a `z` locator.

```
# Set `z` locator to temperature
dat.setLocator("Temperature", gl.ELoc.Z, cleanSameLocator=True)
```

To compute the coefficients $a$ and $b$ of the linear regression between the temperature and the elevation, we can use the `regression` function. We specify the name of the response variable (argument `nameResp`) and the names of the auxiliary variables (argument `names`), and set the argument `mode=0` to specify that we want a regression defined from the variables with the specified names. We also set the argument `flagCst=True` to specify that we are looking for an affine regression model between the variables (i.e. one that includes the intercept coefficient `b`).

```
regr = gl.regression(dat, "Temperature", ["Elevation"], mode=0, flagCst=True)
regr.display()
b = regr.getCoeff(0)
a = regr.getCoeff(1)
```

```
Linear Regression
-----------------
- Calculated on 151 active values
- Constant term = 3.61197
- Explanatory Variable #1 = -0.0090641
- Initial variance = 1.01979
- Variance of residuals = 0.363298
```
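The affine fit and its residuals can be sketched with ordinary least squares in plain NumPy (hypothetical data, for illustration only; the actual fit above is done by `gl.regression`):

```python
import numpy as np

# Hypothetical observations: covariate y (elevation) and response z (temperature)
y = np.array([100.0, 300.0, 50.0, 400.0])
z = np.array([3.0, 1.5, 3.5, 1.0])

# Least-squares fit of z = b + a*y
A = np.column_stack([np.ones_like(y), y])
(b, a), *_ = np.linalg.lstsq(A, z, rcond=None)

# Residuals epsilon(x) = z(x) - (b + a*y(x)); they average to zero by construction
eps = z - (b + a * y)
```

As in the real data, the fitted slope `a` is negative: higher elevations go with lower temperatures.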

From the regression coefficients obtained above, we can compute the residuals explicitly as $\varepsilon(x) = Z(x) - (b + aY(x))$. An alternative is to use the `dbRegression` function: this function fits a regression model, computes the residuals, and adds them directly to the data base containing the data. The `dbRegression` function is called in a similar way as the `regression` function.

In the next example, we compute the residuals of the linear regression between the temperature and the elevation and add them to the observation data base (with a name starting with "RegRes", and without changing the locators in the data base).

```
err = gl.dbRegression(dat, "Temperature", ["Elevation"], flagCst=True,
                      namconv=gl.NamingConvention.create("RegRes", flag_locator=False))
```

We then compute some statistics about these residuals.

```
opers = gl.EStatOption.fromKeys(["NUM","MINI","MAXI","MEAN","STDV"])
gl.dbStatisticsPrint(dat, names=["RegRes*"], opers=opers,
                     title="Statistics on the residuals")
```

```
Statistics on the residuals
---------------------------
                    Number  Minimum  Maximum   Mean  St. Dev.
RegRes.Temperature     151   -1.359    1.795  0.000     0.603
```

Finally, we produce a correlation plot between the residuals and the regressor variable (i.e. the elevation).

```
ax = gp.correlation(dat, namex="Elevation", namey="RegRes*",regrLine=True,asPoint=True)
```