Cross-validation¶

Import packages¶

In [1]:
import numpy as np
import pandas as pd
import sys
import os
import matplotlib.pyplot as plt
import gstlearn as gl
import gstlearn.plot as gp
import gstlearn.document as gdoc

gdoc.setNoScroll()

Introduction¶

This is a small script which illustrates the capabilities of the cross-validation feature within gstlearn.

We generate a fictitious data set (by sampling a given simulation). Then we will use this fictitious data set to demonstrate the cross-validation tool.

We generate a data set composed of a series of few samples located randomly within a 100 by 100 square.

The number of data can be set to a small nuùber (10 for example) small in order to make the results more legible. However, you can turn it to a more reasonable value (say 100) to see better results (note that you should avoid large numbers as a test is performed in Unique Neighborhood). In the latter case, you may then switch OFF the variable 'turnPrintON' to remove tedious printouts.

In [2]:
nech = 10
turnPrintON = True

data = gl.Db.createFromBox(nech, [0,0], [100,100])
ax = data.plot()
ax.decoration(title="Measurement location")

We define a model (spherical structure with range 30 and sill 4) with the Universality condition, and perform a non conditional simulation at the data locations. These values, renamed as data will now become the data set.

In [3]:
model = gl.Model.createFromParam(type=gl.ECov.SPHERICAL,range=30,sill=4)
model.setDriftIRF(0)
err = gl.simtub(None,data,model)
data.setName("Simu","data")
data
ax = data.plot()
ax.decoration(title="Measurement values")

Now we perform the cross-validation step. This requires the definition of a neighborhood (called neigh) that we consider as unique, due to the small neumber of data. Obviously this could be turned into a moving neighborhood if necessary.

In [4]:
neighU = gl.NeighUnique()
err = gl.xvalid(data,model,neighU,flag_xvalid_est=1,flag_xvalid_std=1,
                namconv=gl.NamingConvention("Xvalid",True,True,False))

The cross-validation feature offers several types of outputs, according to the flags:

  • flag_xvalid_est tells if the function must return the estimation error Z-Z (flag.est=1) or the estimation Z (flag.est=-1)

  • flag_xvalid_std tells if the function must return the normalized error (Z*-Z)/S (flag.std=1) or the standard deviation S (flag.std=-1)

For a complete demonstration, all options are used. Note the use of NamingConvention which explicitely leaves the Z-locator on the input variable (i.e. data).

We perform the Cross-validation step once more but change the storing option (as wellas the radix given to the output variables).

In [5]:
err = gl.xvalid(data,model,neighU,flag_xvalid_est=-1, flag_xvalid_std=-1,
                namconv=gl.NamingConvention("Xvalid2",True,True,False))
In [6]:
data
Out[6]:
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 8
Total number of samples      = 10

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = data - Locator = z1
Column = 4 - Name = Xvalid.data.esterr - Locator = NA
Column = 5 - Name = Xvalid.data.stderr - Locator = NA
Column = 6 - Name = Xvalid2.data.estim - Locator = NA
Column = 7 - Name = Xvalid2.data.stdev - Locator = NA

We know check all the results gathered on the first sample.

In [7]:
data[0,0:8]
Out[7]:
array([ 1.        , 22.70134452, 83.64117505,  2.50180018, -1.95167723,
       -1.09498517,  0.55012296,  1.78237778])

The printed values correspond to the following information:

  • the sample rank: 1
  • the sample abscissae $X$: 22.7
  • the sample coordinate $Y$: 83.64
  • the data value $Z$: 2.502
  • the cross-validation error $Z^* - Z$: -1.952
  • the cross-validation standardized error $\frac{Z^* - Z} {S}$: -1.095
  • the cross-validation estimated value $Z^*$: 0.550
  • the standard deviation of the cross-validation error $S$: 1.781

We can also double-check these results by asking a full dump of all information when processing the first sample. The next chunk does not store any result: it is just there in order to produce some output on the terminal to better understand the process.

In [8]:
if turnPrintON:
    gl.OptDbg.setReference(1)
    err = gl.xvalid(data,model,neighU,flag_xvalid_est=1, flag_xvalid_std=1,
                    namconv=gl.NamingConvention("Xvalid3",True,True,False))
 Target location
 ---------------
 Sample #1 (from 10)
 Coordinate #1 = 22.701345
 Coordinate #2 = 83.641175
 
 Data selected in neighborhood
 -----------------------------
        Rank      Sample          x1          x2 
           1           1      22.701      83.641 
           2           2      82.323      43.955 
           3           3      15.270       3.385 
           4           4      55.428      19.935 
           5           5      93.142      79.864 
           6           6      85.694      97.845 
           7           7      73.690      37.455 
           8           8      32.792      43.164 
           9           9      32.178      78.710 
          10          10      64.591      82.042 
 
 LHS of Kriging matrix (compressed)
 ==================================
 Number of active samples    = 10
 Total number of equations   = 11
 Reduced number of equations = 11
 
        Rank                       1           2           3           4           5 
                    Flag           1           2           3           4           5 
           1           1       4.000       0.000       0.000       0.000       0.000 
           2           2       0.000       4.000       0.000       0.000       0.000 
           3           3       0.000       0.000       4.000       0.000       0.000 
           4           4       0.000       0.000       0.000       4.000       0.000 
           5           5       0.000       0.000       0.000       0.000       4.000 
           6           6       0.000       0.000       0.000       0.000       0.654 
           7           7       0.000       1.932       0.000       0.139       0.000 
           8           8       0.000       0.000       0.000       0.000       0.000 
           9           9       1.954       0.000       0.000       0.000       0.000 
          10          10       0.000       0.000       0.000       0.000       0.012 
          11          11       1.000       1.000       1.000       1.000       1.000 
 
        Rank                       6           7           8           9          10 
                    Flag           6           7           8           9          10 
           1           1       0.000       0.000       0.000       1.954       0.000 
           2           2       0.000       1.932       0.000       0.000       0.000 
           3           3       0.000       0.000       0.000       0.000       0.000 
           4           4       0.000       0.139       0.000       0.000       0.000 
           5           5       0.654       0.000       0.000       0.000       0.012 
           6           6       4.000       0.000       0.000       0.000       0.085 
           7           7       0.000       4.000       0.000       0.000       0.000 
           8           8       0.000       0.000       4.000       0.000       0.000 
           9           9       0.000       0.000       0.000       4.000       0.000 
          10          10       0.085       0.000       0.000       0.000       4.000 
          11          11       1.000       1.000       1.000       1.000       1.000 
 
        Rank                      11 
                    Flag          11 
           1           1       1.000 
           2           2       1.000 
           3           3       1.000 
           4           4       1.000 
           5           5       1.000 
           6           6       1.000 
           7           7       1.000 
           8           8       1.000 
           9           9       1.000 
          10          10       1.000 
          11          11       0.000 
 
 Cross-validation results
 ========================
 Target Sample = 1
 Variable Z1 
  - True value        =        2.502 
  - Estimated value   =        0.550 
  - Estimation Error  =       -1.952 
  - Std. deviation    =        1.782 
  - Normalized Error  =       -1.095 
 

We can check that:

  • the cross-validation system is dimensionned to 40. As a matter of fact, in Unique Neighborhood, the kriging system is established with the whole data set. The one suppressing one datum in turn is derived using the Schur theorem.

  • the results recall the values for the true data value (-0.326), the estimated value (-2.084), the standard deviation of the estimation error (1.623) and finally the cross-validation estimation error (-1.758).

We can also double-check these calculations with a Moving Neighborhood which has been tuned to cover a pseudo-Unique Neighborhood.

In [9]:
neighM = gl.NeighMoving.create()
In [10]:
if turnPrintON:
    gl.OptDbg.setReference(1)
    err = gl.xvalid(data,model,neighM,flag_xvalid_est=1, flag_xvalid_std=1,
                    namconv=gl.NamingConvention("Xvalid4",True,True,False))
 Target location
 ---------------
 Sample #1 (from 10)
 Coordinate #1 = 22.701345
 Coordinate #2 = 83.641175
 
 Data selected in neighborhood
 -----------------------------
        Rank      Sample          x1          x2      Sector 
           1           2      82.323      43.955           1 
           2           3      15.270       3.385           1 
           3           4      55.428      19.935           1 
           4           5      93.142      79.864           1 
           5           6      85.694      97.845           1 
           6           7      73.690      37.455           1 
           7           8      32.792      43.164           1 
           8           9      32.178      78.710           1 
           9          10      64.591      82.042           1 
 
 LHS of Kriging matrix (compressed)
 ==================================
 Number of active samples    = 9
 Total number of equations   = 10
 Reduced number of equations = 10
 
        Rank                       1           2           3           4           5 
                    Flag           1           2           3           4           5 
           1           1       4.000       0.000       0.000       0.000       0.000 
           2           2       0.000       4.000       0.000       0.000       0.000 
           3           3       0.000       0.000       4.000       0.000       0.000 
           4           4       0.000       0.000       0.000       4.000       0.654 
           5           5       0.000       0.000       0.000       0.654       4.000 
           6           6       1.932       0.000       0.139       0.000       0.000 
           7           7       0.000       0.000       0.000       0.000       0.000 
           8           8       0.000       0.000       0.000       0.000       0.000 
           9           9       0.000       0.000       0.000       0.012       0.085 
          10          10       1.000       1.000       1.000       1.000       1.000 
 
        Rank                       6           7           8           9          10 
                    Flag           6           7           8           9          10 
           1           1       1.932       0.000       0.000       0.000       1.000 
           2           2       0.000       0.000       0.000       0.000       1.000 
           3           3       0.139       0.000       0.000       0.000       1.000 
           4           4       0.000       0.000       0.000       0.012       1.000 
           5           5       0.000       0.000       0.000       0.085       1.000 
           6           6       4.000       0.000       0.000       0.000       1.000 
           7           7       0.000       4.000       0.000       0.000       1.000 
           8           8       0.000       0.000       4.000       0.000       1.000 
           9           9       0.000       0.000       0.000       4.000       1.000 
          10          10       1.000       1.000       1.000       1.000       0.000 
 
 RHS of Kriging matrix (compressed)
 ==================================
 Number of active samples    = 9
 Total number of equations   = 10
 Reduced number of equations = 10
 Number of right-hand sides  = 1
 Punctual Estimation
 
        Rank        Flag           1 
           1           1       0.000 
           2           2       0.000 
           3           3       0.000 
           4           4       0.000 
           5           5       0.000 
           6           6       0.000 
           7           7       0.000 
           8           8       1.954 
           9           9       0.000 
          10          10       1.000 
 
 (Co-) Kriging weights
 =====================
        Rank          x1          x2        Data         Z1* 
           1      82.323      43.955       1.266       0.045 
           2      15.270       3.385       2.184       0.064 
           3      55.428      19.935      -2.917       0.063 
           4      93.142      79.864       0.870       0.055 
           5      85.694      97.845      -0.730       0.054 
           6      73.690      37.455       2.955       0.040 
           7      32.792      43.164      -0.573       0.064 
           8      32.178      78.710       0.824       0.553 
           9      64.591      82.042      -0.157       0.063 
 Sum of weights                                     1.000 
 
 Drift coefficients
 ==================
        Rank    Lagrange       Coeff 
           1      -0.256       0.289 
 
 Cross-validation results
 ========================
 Target Sample = 1
 Variable Z1 
  - True value        =        2.502 
  - Estimated value   =        0.550 
  - Estimation Error  =       -1.952 
  - Std. deviation    =        1.782 
  - Normalized Error  =       -1.095 
 

In the next paragraph, we perform the different graphic outputs that are expected after a cross-validation step. They are provided by the function draw.xvalid which produces:

  • the base map of the absolute value of the cross-validation standardized error
  • the histogram of the cross-validation standardized error
  • the scatter plot of the standardized error as a function of the estimation
  • the scatter plot of the true value as a function of the estimation
In [11]:
fig, axs = plt.subplots(2,2, figsize=(10,10))
axs[0,0].gstpoint(data,nameSize="Xvalid.data.stderr", flagAbsSize=True)
axs[0,0].decoration(title="Standardized Errors (absolute value)")
axs[0,1].histogram(data, name="Xvalid.data.stderr", bins=20)
axs[0,1].decoration(title="Histogram of Standardized Errors")
axs[1,0].correlation(data, namey="Xvalid.data.stderr", namex="Xvalid2.data.estim", 
                   asPoint=True)
axs[1,0].decoration(xlabel="Estimation", ylabel="Standardized Error")
axs[1,1].correlation(data, namey="data", namex="Xvalid2.data.estim", 
                   asPoint=True, regrLine=True, flagSameAxes=True)
axs[1,1].decoration(xlabel="Estimation", ylabel="True Value")