Cross-validationΒΆ

Import packagesΒΆ

In [1]:
import numpy as np
import pandas as pd
import sys
import os
import gstlearn as gl
import gstlearn.plot as gp
import gstlearn.document as gdoc

gdoc.setNoScroll()

IntroductionΒΆ

This is a small script which illustrates the capabilities of the cross-validation feature within gstlearn.

We generate a fictitious data set (by sampling a given simulation). Then we will use this fictitious data set to demonstrate the cross-validation tool.

We generate a data set composed of a series of few samples located randomly within a 100 by 100 square.

The number of data can be set to a small nuΓΉber (10 for example) small in order to make the results more legible. However, you can turn it to a more reasonable value (say 100) to see better results (note that you should avoid large numbers as a test is performed in Unique Neighborhood). In the latter case, you may then switch OFF the variable 'turnPrintON' to remove tedious printouts.

In [2]:
nech = 10
turnPrintON = True

data = gl.Db.createFromBox(nech, [0,0], [100,100])
gp.plot(data)
gp.decoration(title="Measurement location")
No description has been provided for this image

We define a model (spherical structure with range 30 and sill 4) with the Universality condition, and perform a non conditional simulation at the data locations. These values, renamed as data will now become the data set.

In [3]:
model = gl.Model.createFromParam(type=gl.ECov.SPHERICAL,range=30,sill=4)
model.setDriftIRF(0)
err = gl.simtub(None,data,model)
data.setName("Simu","data")
gp.plot(data, nameSize="data")
gp.decoration(title="Measurement values")
No description has been provided for this image

Now we perform the cross-validation step. This requires the definition of a neighborhood (called neigh) that we consider as unique, due to the small neumber of data. Obviously this could be turned into a moving neighborhood if necessary.

In [4]:
neighU = gl.NeighUnique()
err = gl.xvalid(data,model,neighU,flag_xvalid_est=1,flag_xvalid_std=1,
                namconv=gl.NamingConvention("Xvalid",True,True,False))

The cross-validation feature offers several types of outputs, according to the flags:

  • flag_xvalid_est tells if the function must return the estimation error Z-Z (flag.est=1) or the estimation Z (flag.est=-1)

  • flag_xvalid_std tells if the function must return the normalized error (Z*-Z)/S (flag.std=1) or the standard deviation S (flag.std=-1)

For a complete demonstration, all options are used. Note the use of NamingConvention which explicitely leaves the Z-locator on the input variable (i.e. data).

We perform the Cross-validation step once more but change the storing option (as wellas the radix given to the output variables).

In [5]:
err = gl.xvalid(data,model,neighU,flag_xvalid_est=-1, flag_xvalid_std=-1,
                namconv=gl.NamingConvention("Xvalid2",True,True,False))
data
Out[5]:
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 8
Total number of samples      = 10

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = data - Locator = z1
Column = 4 - Name = Xvalid.data.esterr - Locator = NA
Column = 5 - Name = Xvalid.data.stderr - Locator = NA
Column = 6 - Name = Xvalid2.data.estim - Locator = NA
Column = 7 - Name = Xvalid2.data.stdev - Locator = NA

We know check all the results gathered on the first sample.

In [6]:
data[0,0:8]
Out[6]:
array([ 1.        , 22.70134452, 83.64117505,  2.50180018, -1.95167723,
       -1.09498517,  0.55012296,  1.78237778])

The printed values correspond to the following information:

  • the sample rank: 1
  • the sample abscissae X: 22.7
  • the sample coordinate Y: 83.64
  • the data value Z: 2.502
  • the cross-validation error Zβˆ—βˆ’Z: -1.952
  • the cross-validation standardized error Zβˆ—βˆ’ZS: -1.095
  • the cross-validation estimated value Zβˆ—: 0.550
  • the standard deviation of the cross-validation error S: 1.781

We can also double-check these results by asking a full dump of all information when processing the first sample. The next chunk does not store any result: it is just there in order to produce some output on the terminal to better understand the process.

In [7]:
if turnPrintON:
    gl.OptDbg.setReference(1)
    err = gl.xvalid(data,model,neighU,flag_xvalid_est=1, flag_xvalid_std=1,
                    namconv=gl.NamingConvention("Xvalid3",True,True,False))
Target location
---------------
Sample #1 (from 10)
Coordinate #1 = 22.701345
Coordinate #2 = 83.641175

Data selected in neighborhood
-----------------------------
       Rank     Sample         x1         x2
          1          1     22.701     83.641
          2          2     82.323     43.955
          3          3     15.270      3.385
          4          4     55.428     19.935
          5          5     93.142     79.864
          6          6     85.694     97.845
          7          7     73.690     37.455
          8          8     32.792     43.164
          9          9     32.178     78.710
         10         10     64.591     82.042

LHS of Kriging matrix
=====================
Dimension of the Covariance Matrix  = 10
Dimension of the Drift Matrix       = 1

       Rank          1          2          3          4          5
          1      4.000      0.000      0.000      0.000      0.000
          2      0.000      4.000      0.000      0.000      0.000
          3      0.000      0.000      4.000      0.000      0.000
          4      0.000      0.000      0.000      4.000      0.000
          5      0.000      0.000      0.000      0.000      4.000
          6      0.000      0.000      0.000      0.000      0.654
          7      0.000      1.932      0.000      0.139      0.000
          8      0.000      0.000      0.000      0.000      0.000
          9      1.954      0.000      0.000      0.000      0.000
         10      0.000      0.000      0.000      0.000      0.012
         11      1.000      1.000      1.000      1.000      1.000

       Rank          6          7          8          9         10
          1      0.000      0.000      0.000      1.954      0.000
          2      0.000      1.932      0.000      0.000      0.000
          3      0.000      0.000      0.000      0.000      0.000
          4      0.000      0.139      0.000      0.000      0.000
          5      0.654      0.000      0.000      0.000      0.012
          6      4.000      0.000      0.000      0.000      0.085
          7      0.000      4.000      0.000      0.000      0.000
          8      0.000      0.000      4.000      0.000      0.000
          9      0.000      0.000      0.000      4.000      0.000
         10      0.085      0.000      0.000      0.000      4.000
         11      1.000      1.000      1.000      1.000      1.000

       Rank         11
          1      1.000
          2      1.000
          3      1.000
          4      1.000
          5      1.000
          6      1.000
          7      1.000
          8      1.000
          9      1.000
         10      1.000
         11      0.000

RHS of Kriging matrix
=====================
Number of active samples    = 10
Total number of equations   = 11
Number of right-hand sides  = 1
Punctual Estimation

       Rank          1
          1      0.823
          2      0.000
          3      0.000
          4      0.000
          5      0.000
          6      0.000
          7      0.000
          8      0.000
          9      1.954
         10      0.000
         11      1.000

(Co-) Kriging weights
=====================
       Rank       Data        Z1*
          1      2.502      0.000
          2      1.266      0.045
          3      2.184      0.064
          4     -2.917      0.063
          5      0.870      0.055
          6     -0.730      0.054
          7      2.955      0.040
          8     -0.573      0.064
          9      0.824      0.553
         10     -0.157      0.063
Sum of weights              1.000

Drift or Mean Information
=========================
       Rank       Mu1*      Coeff
          1      0.256      0.446

Cross-validation results
========================
Target Sample = 1
Variable Z1 
 - True value        =       2.502
 - Estimated value   =       0.550
 - Estimation Error  =      -1.952
 - Std. deviation    =       1.782
 - Normalized Error  =      -1.095

We can also double-check these calculations with a Moving Neighborhood which has been tuned to cover a pseudo-Unique Neighborhood.

In [8]:
neighM = gl.NeighMoving.create()
In [9]:
if turnPrintON:
    gl.OptDbg.setReference(1)
    err = gl.xvalid(data,model,neighM,flag_xvalid_est=1, flag_xvalid_std=1,
                    namconv=gl.NamingConvention("Xvalid4",True,True,False))
Target location
---------------
Sample #1 (from 10)
Coordinate #1 = 22.701345
Coordinate #2 = 83.641175

Data selected in neighborhood
-----------------------------
       Rank     Sample         x1         x2     Sector
          1          2     82.323     43.955          1
          2          3     15.270      3.385          1
          3          4     55.428     19.935          1
          4          5     93.142     79.864          1
          5          6     85.694     97.845          1
          6          7     73.690     37.455          1
          7          8     32.792     43.164          1
          8          9     32.178     78.710          1
          9         10     64.591     82.042          1

LHS of Kriging matrix
=====================
Dimension of the Covariance Matrix  = 9
Dimension of the Drift Matrix       = 1

       Rank          1          2          3          4          5
          1      4.000      0.000      0.000      0.000      0.000
          2      0.000      4.000      0.000      0.000      0.000
          3      0.000      0.000      4.000      0.000      0.000
          4      0.000      0.000      0.000      4.000      0.654
          5      0.000      0.000      0.000      0.654      4.000
          6      1.932      0.000      0.139      0.000      0.000
          7      0.000      0.000      0.000      0.000      0.000
          8      0.000      0.000      0.000      0.000      0.000
          9      0.000      0.000      0.000      0.012      0.085
         10      1.000      1.000      1.000      1.000      1.000

       Rank          6          7          8          9         10
          1      1.932      0.000      0.000      0.000      1.000
          2      0.000      0.000      0.000      0.000      1.000
          3      0.139      0.000      0.000      0.000      1.000
          4      0.000      0.000      0.000      0.012      1.000
          5      0.000      0.000      0.000      0.085      1.000
          6      4.000      0.000      0.000      0.000      1.000
          7      0.000      4.000      0.000      0.000      1.000
          8      0.000      0.000      4.000      0.000      1.000
          9      0.000      0.000      0.000      4.000      1.000
         10      1.000      1.000      1.000      1.000      0.000

RHS of Kriging matrix
=====================
Number of active samples    = 9
Total number of equations   = 10
Number of right-hand sides  = 1
Punctual Estimation

       Rank          1
          1      0.000
          2      0.000
          3      0.000
          4      0.000
          5      0.000
          6      0.000
          7      0.000
          8      1.954
          9      0.000
         10      1.000

(Co-) Kriging weights
=====================
       Rank       Data        Z1*
          1      1.266      0.045
          2      2.184      0.064
          3     -2.917      0.063
          4      0.870      0.055
          5     -0.730      0.054
          6      2.955      0.040
          7     -0.573      0.064
          8      0.824      0.553
          9     -0.157      0.063
Sum of weights              1.000

Drift or Mean Information
=========================
       Rank       Mu1*      Coeff
          1      0.256      0.289

Cross-validation results
========================
Target Sample = 1
Variable Z1 
 - True value        =       2.502
 - Estimated value   =       0.550
 - Estimation Error  =      -1.952
 - Std. deviation    =       1.782
 - Normalized Error  =      -1.095

In the next paragraph, we perform the different graphic outputs that are expected after a cross-validation step. They are provided by the function draw.xvalid which produces:

  • the base map of the absolute value of the cross-validation standardized error
  • the histogram of the cross-validation standardized error
  • the scatter plot of the standardized error as a function of the estimation
  • the scatter plot of the true value as a function of the estimation
In [10]:
fig, axs = gp.init(2,2, figsize=(12,12))
axs[0,0].symbol(data,nameSize="Xvalid.data.stderr", flagAbsSize=True)
axs[0,0].decoration(title="Standardized Errors (absolute value)")
axs[0,1].histogram(data, name="Xvalid.data.stderr", bins=20)
axs[0,1].decoration(title="Histogram of Standardized Errors")
axs[1,0].correlation(data, namey="Xvalid.data.stderr", namex="Xvalid2.data.estim", 
                   asPoint=True, horizLine=True)
axs[1,0].decoration(xlabel="Estimation", ylabel="Standardized Error")
axs[1,1].correlation(data, namey="data", namex="Xvalid2.data.estim", 
                   asPoint=True, regrLine=True, flagSameAxes=True)
axs[1,1].decoration(xlabel="Estimation", ylabel="True Value")
No description has been provided for this image

Difference between Kriging and Cross-ValidationΒΆ

This small paragraph is meant to enhance the difference between Kriging and Cross-validation. Clearly the two main differences are that:

  • Estimation is performed on the data file itself: hence 'input' and 'output' data bases coincide
  • the target sample is temporarily suppressed from the data base before the estimation takes place at its initial location

This is illustrated in the following paragraph where the estimation is performed on the first sample, toggling all the verbose options ON. We also take the opportunity for testing the option for the calculation of the the variance of the estimator (not the estimation error for once).

In [11]:
gl.OptDbg.setReference(1)
err = gl.kriging(data,data,model,neighU, flag_est=True, flag_std=True, flag_varz=True,
                namconv=gl.NamingConvention("Kriging",True,True,False))
data
Target location
---------------
Sample #1 (from 10)
Coordinate #1 = 22.701345
Coordinate #2 = 83.641175

Data selected in neighborhood
-----------------------------
       Rank     Sample         x1         x2
          1          1     22.701     83.641
          2          2     82.323     43.955
          3          3     15.270      3.385
          4          4     55.428     19.935
          5          5     93.142     79.864
          6          6     85.694     97.845
          7          7     73.690     37.455
          8          8     32.792     43.164
          9          9     32.178     78.710
         10         10     64.591     82.042

LHS of Kriging matrix
=====================
Dimension of the Covariance Matrix  = 10
Dimension of the Drift Matrix       = 1

       Rank          1          2          3          4          5
          1      4.000      0.000      0.000      0.000      0.000
          2      0.000      4.000      0.000      0.000      0.000
          3      0.000      0.000      4.000      0.000      0.000
          4      0.000      0.000      0.000      4.000      0.000
          5      0.000      0.000      0.000      0.000      4.000
          6      0.000      0.000      0.000      0.000      0.654
          7      0.000      1.932      0.000      0.139      0.000
          8      0.000      0.000      0.000      0.000      0.000
          9      1.954      0.000      0.000      0.000      0.000
         10      0.000      0.000      0.000      0.000      0.012
         11      1.000      1.000      1.000      1.000      1.000

       Rank          6          7          8          9         10
          1      0.000      0.000      0.000      1.954      0.000
          2      0.000      1.932      0.000      0.000      0.000
          3      0.000      0.000      0.000      0.000      0.000
          4      0.000      0.139      0.000      0.000      0.000
          5      0.654      0.000      0.000      0.000      0.012
          6      4.000      0.000      0.000      0.000      0.085
          7      0.000      4.000      0.000      0.000      0.000
          8      0.000      0.000      4.000      0.000      0.000
          9      0.000      0.000      0.000      4.000      0.000
         10      0.085      0.000      0.000      0.000      4.000
         11      1.000      1.000      1.000      1.000      1.000

       Rank         11
          1      1.000
          2      1.000
          3      1.000
          4      1.000
          5      1.000
          6      1.000
          7      1.000
          8      1.000
          9      1.000
         10      1.000
         11      0.000

RHS of Kriging matrix
=====================
Number of active samples    = 10
Total number of equations   = 11
Number of right-hand sides  = 1
Punctual Estimation

       Rank          1
          1      4.000
          2      0.000
          3      0.000
          4      0.000
          5      0.000
          6      0.000
          7      0.000
          8      0.000
          9      1.954
         10      0.000
         11      1.000

(Co-) Kriging weights
=====================
       Rank       Data        Z1*
          1      2.502      1.000
          2      1.266      0.000
          3      2.184      0.000
          4     -2.917      0.000
          5      0.870      0.000
          6     -0.730      0.000
          7      2.955      0.000
          8     -0.573      0.000
          9      0.824      0.000
         10     -0.157      0.000
Sum of weights              1.000

Drift or Mean Information
=========================
       Rank       Mu1*      Coeff
          1      0.000      0.446

(Co-) Kriging results
=====================
Target Sample = 1
Variable Z1 
 - Estimate  =       2.502
 - Std. Dev. =       0.000
 - Variance  =       0.000
 - Cov(h=0)  =       4.000
 - Var(Z*)   =       4.000
Out[11]:
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 15
Total number of samples      = 10

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = data - Locator = z1
Column = 4 - Name = Xvalid.data.esterr - Locator = NA
Column = 5 - Name = Xvalid.data.stderr - Locator = NA
Column = 6 - Name = Xvalid2.data.estim - Locator = NA
Column = 7 - Name = Xvalid2.data.stdev - Locator = NA
Column = 8 - Name = Xvalid3.data.esterr - Locator = NA
Column = 9 - Name = Xvalid3.data.stderr - Locator = NA
Column = 10 - Name = Xvalid4.data.esterr - Locator = NA
Column = 11 - Name = Xvalid4.data.stderr - Locator = NA
Column = 12 - Name = Kriging.data.estim - Locator = NA
Column = 13 - Name = Kriging.data.stdev - Locator = NA
Column = 14 - Name = Kriging.data.varz - Locator = NA
In [12]:
data[0,0:15]
Out[12]:
array([ 1.        , 22.70134452, 83.64117505,  2.50180018, -1.95167723,
       -1.09498517,  0.55012296,  1.78237778, -1.95167723, -1.09498517,
       -1.95167723, -1.09498517,  2.50180018,  0.        ,  4.        ])

We also take this opportunity to double-check that Kriging is an Exact Interpolator (e.g. at the data point, estimated value coincides with data value and variance of estimation error is zero) even when using a Model containing some Nugget Effect.

So we modify the previous model by adding a Nugget Effect component.

In [13]:
model.addCovFromParam(type=gl.ECov.NUGGET, sill=1.5)
model
Out[13]:
Model characteristics
=====================
Space dimension              = 2
Number of variable(s)        = 1
Number of basic structure(s) = 2
Number of drift function(s)  = 1
Number of drift equation(s)  = 1

Covariance Part
---------------
Spherical
- Sill         =      4.000
- Range        =     30.000
Nugget Effect
- Sill         =      1.500
Total Sill     =      5.500

Drift Part
----------
Universality_Condition
In [14]:
gl.OptDbg.setReference(1)
err = gl.kriging(data,data,model,neighU, flag_est=True, flag_std=True, flag_varz=True,
                namconv=gl.NamingConvention("Kriging2",True,True,False))
data
Target location
---------------
Sample #1 (from 10)
Coordinate #1 = 22.701345
Coordinate #2 = 83.641175

Data selected in neighborhood
-----------------------------
       Rank     Sample         x1         x2
          1          1     22.701     83.641
          2          2     82.323     43.955
          3          3     15.270      3.385
          4          4     55.428     19.935
          5          5     93.142     79.864
          6          6     85.694     97.845
          7          7     73.690     37.455
          8          8     32.792     43.164
          9          9     32.178     78.710
         10         10     64.591     82.042

LHS of Kriging matrix
=====================
Dimension of the Covariance Matrix  = 10
Dimension of the Drift Matrix       = 1

       Rank          1          2          3          4          5
          1      5.500      0.000      0.000      0.000      0.000
          2      0.000      5.500      0.000      0.000      0.000
          3      0.000      0.000      5.500      0.000      0.000
          4      0.000      0.000      0.000      5.500      0.000
          5      0.000      0.000      0.000      0.000      5.500
          6      0.000      0.000      0.000      0.000      0.654
          7      0.000      1.932      0.000      0.139      0.000
          8      0.000      0.000      0.000      0.000      0.000
          9      1.954      0.000      0.000      0.000      0.000
         10      0.000      0.000      0.000      0.000      0.012
         11      1.000      1.000      1.000      1.000      1.000

       Rank          6          7          8          9         10
          1      0.000      0.000      0.000      1.954      0.000
          2      0.000      1.932      0.000      0.000      0.000
          3      0.000      0.000      0.000      0.000      0.000
          4      0.000      0.139      0.000      0.000      0.000
          5      0.654      0.000      0.000      0.000      0.012
          6      5.500      0.000      0.000      0.000      0.085
          7      0.000      5.500      0.000      0.000      0.000
          8      0.000      0.000      5.500      0.000      0.000
          9      0.000      0.000      0.000      5.500      0.000
         10      0.085      0.000      0.000      0.000      5.500
         11      1.000      1.000      1.000      1.000      1.000

       Rank         11
          1      1.000
          2      1.000
          3      1.000
          4      1.000
          5      1.000
          6      1.000
          7      1.000
          8      1.000
          9      1.000
         10      1.000
         11      0.000

RHS of Kriging matrix
=====================
Number of active samples    = 10
Total number of equations   = 11
Number of right-hand sides  = 1
Punctual Estimation

       Rank          1
          1      5.500
          2      0.000
          3      0.000
          4      0.000
          5      0.000
          6      0.000
          7      0.000
          8      0.000
          9      1.954
         10      0.000
         11      1.000

(Co-) Kriging weights
=====================
       Rank       Data        Z1*
          1      2.502      1.000
          2      1.266      0.000
          3      2.184      0.000
          4     -2.917      0.000
          5      0.870      0.000
          6     -0.730      0.000
          7      2.955      0.000
          8     -0.573      0.000
          9      0.824      0.000
         10     -0.157      0.000
Sum of weights              1.000

Drift or Mean Information
=========================
       Rank       Mu1*      Coeff
          1      0.000      0.488

(Co-) Kriging results
=====================
Target Sample = 1
Variable Z1 
 - Estimate  =       2.502
 - Std. Dev. =       0.000
 - Variance  =       0.000
 - Cov(h=0)  =       5.500
 - Var(Z*)   =       5.500
Out[14]:
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 18
Total number of samples      = 10

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = data - Locator = z1
Column = 4 - Name = Xvalid.data.esterr - Locator = NA
Column = 5 - Name = Xvalid.data.stderr - Locator = NA
Column = 6 - Name = Xvalid2.data.estim - Locator = NA
Column = 7 - Name = Xvalid2.data.stdev - Locator = NA
Column = 8 - Name = Xvalid3.data.esterr - Locator = NA
Column = 9 - Name = Xvalid3.data.stderr - Locator = NA
Column = 10 - Name = Xvalid4.data.esterr - Locator = NA
Column = 11 - Name = Xvalid4.data.stderr - Locator = NA
Column = 12 - Name = Kriging.data.estim - Locator = NA
Column = 13 - Name = Kriging.data.stdev - Locator = NA
Column = 14 - Name = Kriging.data.varz - Locator = NA
Column = 15 - Name = Kriging2.data.estim - Locator = NA
Column = 16 - Name = Kriging2.data.stdev - Locator = NA
Column = 17 - Name = Kriging2.data.varz - Locator = NA
In [15]:
data[0,0:21]
Out[15]:
array([ 1.        , 22.70134452, 83.64117505,  2.50180018, -1.95167723,
       -1.09498517,  0.55012296,  1.78237778, -1.95167723, -1.09498517,
       -1.95167723, -1.09498517,  2.50180018,  0.        ,  4.        ,
        2.50180018,  0.        ,  5.5       ])