%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
return false;
}
import numpy as np
import pandas as pd
import sys
import os
import urllib.request
import matplotlib.pyplot as plt
import gstlearn as gl
import gstlearn.plot as gp
The data are stored in CSV format in a file called Pollution.dat. We concentrate on the variable named Pb.
url = 'https://soft.minesparis.psl.eu/gstlearn/data/Pollution/Pollution.dat'
filepath, head = urllib.request.urlretrieve(url)
mydb = gl.Db.createFromCSV(filepath,gl.CSVformat())
mydb.setLocators(["X","Y"],gl.ELoc.X)
mydb.setLocator("Pb",gl.ELoc.Z)
dbfmt = gl.DbStringFormat.createFromFlags(flag_vars=True, flag_extend=True, flag_stats=True,
names=["*Pb"])
mydb.display(dbfmt)
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension          = 2
Number of Columns        = 5
Total number of samples  = 102

Data Base Extension
-------------------
Coor #1 - Min = 109.850 - Max = 143.010 - Ext = 33.16
Coor #2 - Min = 483.660 - Max = 513.040 - Ext = 29.38

Data Base Statistics
--------------------
5 - Name Pb - Locator z1
  Nb of data          = 102
  Nb of active values = 101
  Minimum value       = 3.000
  Maximum value       = 31.600
  Mean value          = 6.104
  Standard Deviation  = 3.594
  Variance            = 12.916

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = X - Locator = x1
Column = 2 - Name = Y - Locator = x2
Column = 3 - Name = Zn - Locator = NA
Column = 4 - Name = Pb - Locator = z1
We note that one sample has no defined value: therefore only 101 values are available. Moreover, the following histogram shows the presence of two outliers (values above 24).
ax = gp.histogram(mydb, name="Pb", bins=50)
ax.decoration(title="Pb (initial)")
We decide to mask these two outliers off. This is an opportunity to create a selection applied to the Data Base.
tab = mydb.getColumn("Pb")
iuid = mydb.addSelection(tab<24)
ax = gp.histogram(mydb, name="Pb", bins=50)
ax.decoration(title="Pb after filtering the two outliers")
The updated statistics show that the active values of the variable Pb now vary between 3 and 12.7. Note that the variance of the Pb variable is now equal to 2.881 (instead of 12.9 prior to masking the outliers off).
mydb.display(dbfmt)
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension           = 2
Number of Columns         = 6
Total number of samples   = 102
Number of active samples  = 99

Data Base Extension
-------------------
Coor #1 - Min = 109.850 - Max = 143.010 - Ext = 33.16
Coor #2 - Min = 483.660 - Max = 513.040 - Ext = 29.38

Data Base Statistics
--------------------
5 - Name Pb - Locator z1
  Nb of data          = 102
  Nb of active values = 99
  Minimum value       = 3.000
  Maximum value       = 12.700
  Mean value          = 5.658
  Standard Deviation  = 1.697
  Variance            = 2.881

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = X - Locator = x1
Column = 2 - Name = Y - Locator = x2
Column = 3 - Name = Zn - Locator = NA
Column = 4 - Name = Pb - Locator = z1
Column = 5 - Name = NewSel - Locator = sel
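Behind the scenes, a selection is just a boolean flag per sample; inactive samples are dropped from every subsequent statistic. The mechanism can be sketched with plain NumPy on made-up values (illustrative numbers, not the actual Pollution data):

```python
import numpy as np

# Hypothetical skewed values with two large outliers at the end
values = np.array([4.1, 5.0, 3.8, 6.2, 5.5, 4.9, 7.0, 5.2, 31.6, 24.8])

# Boolean selection, analogous to mydb.addSelection(tab < 24)
sel = values < 24

print("variance before:", round(values.var(), 3))
print("variance after :", round(values[sel].var(), 3))
```

As in the Data Base above, masking a couple of extreme values shrinks the variance dramatically, because the variance is quadratic in the deviations from the mean.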
ax = mydb.plot(nameColor="Pb",size=50)
ax.decoration(title="Data Set (Outliers have been masked off)")
plt.show()
We first define the geometry of the variogram calculations
myVarioParamOmni = gl.VarioParam()
mydir = gl.DirParam.create(npas=10, dpas=1.)
myVarioParamOmni.addDir(mydir)
We calculate the experimental omni-directional variogram
myvario = gl.Vario(myVarioParamOmni,mydb)
err = myvario.compute(gl.ECalcVario.VARIOGRAM)
The variogram is represented graphically.
ax = myvario.plot()
ax.decoration(title="Omni-directional Variogram for Pb")
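What `compute(gl.ECalcVario.VARIOGRAM)` estimates is the classical semivariogram: gamma(h) = (1/2N(h)) * sum of squared increments over the N(h) pairs whose separation distance falls in the lag bin around h. A minimal NumPy sketch of this estimator on a hypothetical three-point dataset (a toy illustration, not gstlearn's actual implementation):

```python
import numpy as np

def omni_variogram(coords, z, nlag, lag):
    """Classical semivariogram: gamma(h) = sum (z_i - z_j)^2 / (2 N(h)),
    with pairs assigned to the nearest of nlag lags of width lag
    (the role played by npas/dpas in gl.DirParam.create)."""
    n = len(z)
    gamma = np.zeros(nlag)
    npairs = np.zeros(nlag, dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(coords[i] - coords[j])
            k = int(d / lag + 0.5)          # nearest-lag assignment
            if k < nlag:
                gamma[k] += (z[i] - z[j]) ** 2
                npairs[k] += 1
    ok = npairs > 0
    gamma[ok] /= 2.0 * npairs[ok]
    return gamma, npairs

# Three collinear points: the two pairs at distance 1 have squared
# increments 1 and 1; the single pair at distance 2 has increment 0
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
z = np.array([0.0, 1.0, 0.0])
g, n = omni_variogram(coords, z, nlag=3, lag=1.0)
print(g, n)   # gamma at lag 1 is (1 + 1) / (2 * 2) = 0.5; at lag 2 it is 0
```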
Fitting a Model. We call the automatic fitting procedure, providing the list of covariance functions to be tested.
mymodel = gl.Model.createFromDb(mydb)
err = mymodel.fit(myvario,[gl.ECov.EXPONENTIAL,gl.ECov.SPHERICAL])
Visualizing the resulting model, overlaid on the experimental variogram
ax = gp.varmod(myvario,mymodel)
ax.decoration(title="Model for Pb")
mymodel.setDriftIRF()
mymodel.display()
Model characteristics
=====================
Space dimension              = 2
Number of variable(s)        = 1
Number of basic structure(s) = 2
Number of drift function(s)  = 1
Number of drift equation(s)  = 1

Covariance Part
---------------
Exponential
- Sill        = 0.966
- Range       = 0.637
- Theo. Range = 0.212
Spherical
- Sill        = 1.670
- Range       = 5.757
Total Sill    = 2.637

Drift Part
----------
Universality_Condition
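Using the parameters from the printout above, the nested model can be evaluated by hand. The sketch below assumes the standard parameterizations of the exponential basic structure (written with its theoretical scale, 0.212) and of the spherical one; the function and its defaults are illustrative helpers, not part of the gstlearn API:

```python
import numpy as np

def gamma_model(h, exp_sill=0.966, exp_scale=0.212,
                sph_sill=1.670, sph_range=5.757):
    """Nested variogram mirroring the fitted printout:
    exponential + spherical basic structures (standard forms)."""
    h = np.asarray(h, dtype=float)
    # Exponential: sill * (1 - exp(-h / scale))
    g_exp = exp_sill * (1.0 - np.exp(-h / exp_scale))
    # Spherical: sill * (1.5 r - 0.5 r^3), clamped at the range
    r = np.minimum(h / sph_range, 1.0)
    g_sph = sph_sill * (1.5 * r - 0.5 * r ** 3)
    return g_exp + g_sph

print(gamma_model(0.0))     # 0 at the origin
print(gamma_model(100.0))   # beyond both ranges: the total sill, ~2.636
```

The two sills add up to the Total Sill reported by `display()` (0.966 + 1.670, printed as 2.637 after rounding), which the model reaches once h exceeds the larger range.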
We transform the Data into Gaussian. This requires the definition of a transform function called the Gaussian Anamorphosis. This function is expanded on a basis of Hermite polynomials: here 30 polynomials are used.
myanam = gl.AnamHermite(30)
myanam.fitFromLocator(mydb)
myanam.display()
Hermitian Anamorphosis
----------------------
Minimum absolute value for Y  = -2.7
Maximum absolute value for Y  = 2.6
Minimum absolute value for Z  = 3.0029
Maximum absolute value for Z  = 12.9777
Minimum practical value for Y = -2.7
Maximum practical value for Y = 2.6
Minimum practical value for Z = 3.0029
Maximum practical value for Z = 12.9777
Mean                          = 5.65758
Variance                      = 2.86296
Number of Hermite polynomials = 30

Normalized coefficients for Hermite polynomials (punctual variable)
          [,  0]   [,  1]   [,  2]   [,  3]   [,  4]   [,  5]   [,  6]
 [  0,]    5.658   -1.625    0.440   -0.069   -0.017    0.082   -0.061
 [  7,]    0.001    0.036   -0.044    0.004    0.047   -0.030   -0.029
 [ 14,]    0.037    0.007   -0.031    0.010    0.018   -0.019   -0.003
 [ 21,]    0.019   -0.010   -0.014    0.019    0.006   -0.023    0.004
 [ 28,]    0.022   -0.013
We can produce the Gaussian Anamorphosis graphically within its definition domain.
ax = gp.anam(myanam)
ax.decoration(title="Anamorphosis")
The next step consists in translating the target variable ('Pb') into its Gaussian transform. We can check that the newly created variable is centered with a mean close to 0 and a variance close to 1.
err = myanam.rawToGaussianByLocator(mydb)
mydb.display(dbfmt)
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension           = 2
Number of Columns         = 7
Total number of samples   = 102
Number of active samples  = 99

Data Base Extension
-------------------
Coor #1 - Min = 109.850 - Max = 143.010 - Ext = 33.16
Coor #2 - Min = 483.660 - Max = 513.040 - Ext = 29.38

Data Base Statistics
--------------------
5 - Name Pb - Locator NA
  Nb of data          = 102
  Nb of active values = 99
  Minimum value       = 3.000
  Maximum value       = 12.700
  Mean value          = 5.658
  Standard Deviation  = 1.697
  Variance            = 2.881
7 - Name Y.Pb - Locator z1
  Nb of data          = 102
  Nb of active values = 99
  Minimum value       = -2.700
  Maximum value       = 2.513
  Mean value          = -0.005
  Standard Deviation  = 1.007
  Variance            = 1.014

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = X - Locator = x1
Column = 2 - Name = Y - Locator = x2
Column = 3 - Name = Zn - Locator = NA
Column = 4 - Name = Pb - Locator = NA
Column = 5 - Name = NewSel - Locator = sel
Column = 6 - Name = Y.Pb - Locator = z1
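The Hermite anamorphosis is a parametric way of building a Gaussian transform. A non-parametric analogue, which maps empirical quantiles to standard normal quantiles, makes it easy to see why the transformed variable ends up with mean close to 0 and variance close to 1. This rank-based sketch uses only NumPy and the standard library (it is not the gstlearn algorithm):

```python
import numpy as np
from statistics import NormalDist

def normal_scores(z):
    """Rank-based Gaussian transform: send the sample with empirical
    quantile p to the N(0,1) quantile at p (plotting position (r+0.5)/n)."""
    z = np.asarray(z, dtype=float)
    order = z.argsort()
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(z))
    p = (ranks + 0.5) / len(z)            # probabilities strictly in (0, 1)
    nd = NormalDist()
    return np.array([nd.inv_cdf(pi) for pi in p])

rng = np.random.default_rng(0)
z = rng.lognormal(mean=1.7, sigma=0.3, size=99)   # skewed, Pb-like values
y = normal_scores(z)
print(round(y.mean(), 3), round(y.var(), 3))      # near 0 and near 1
```

Because the plotting positions are symmetric around 0.5, the mean of the scores is essentially exactly 0; the variance is close to (slightly below) 1 for a finite sample, just as observed for Y.Pb above.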
The histogram of the transformed values shows the expected bell shape.
ax = gp.histogram(mydb, name="Y.Pb", bins=50)
We calculate the experimental (omni-directional) variogram on the Gaussian transformed variable.
myvarioG = gl.Vario(myVarioParamOmni,mydb)
err = myvarioG.compute(gl.ECalcVario.VARIOGRAM)
We fit the model by automatic fit. In some cases, the resulting model is required to have a total sill equal to 1: this constraint is added to the fitting step.
mymodelG = gl.Model.createFromDb(mydb)
constr = gl.Constraints(1)
err = mymodelG.fit(myvarioG,[gl.ECov.EXPONENTIAL], constr)
ax = gp.varmod(myvarioG,mymodelG)
ax.decoration(title="Model for Gaussian Pb")
We turn the Gaussian values back to the raw scale. This exercise is not very demonstrative when applied to the initial data themselves: in an operational framework, this transform is used to bring newly created values on the Gaussian scale (results of simulations, for example) back to the raw scale.
myanam.gaussianToRaw(mydb,"Y.Pb")
mydb.display(dbfmt)
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension           = 2
Number of Columns         = 8
Total number of samples   = 102
Number of active samples  = 99

Data Base Extension
-------------------
Coor #1 - Min = 109.850 - Max = 143.010 - Ext = 33.16
Coor #2 - Min = 483.660 - Max = 513.040 - Ext = 29.38

Data Base Statistics
--------------------
5 - Name Pb - Locator NA
  Nb of data          = 102
  Nb of active values = 99
  Minimum value       = 3.000
  Maximum value       = 12.700
  Mean value          = 5.658
  Standard Deviation  = 1.697
  Variance            = 2.881
7 - Name Y.Pb - Locator NA
  Nb of data          = 102
  Nb of active values = 99
  Minimum value       = -2.700
  Maximum value       = 2.513
  Mean value          = -0.005
  Standard Deviation  = 1.007
  Variance            = 1.014
8 - Name Z.Y.Pb - Locator z1
  Nb of data          = 102
  Nb of active values = 99
  Minimum value       = 3.003
  Maximum value       = 12.700
  Mean value          = 5.658
  Standard Deviation  = 1.697
  Variance            = 2.881

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = X - Locator = x1
Column = 2 - Name = Y - Locator = x2
Column = 3 - Name = Zn - Locator = NA
Column = 4 - Name = Pb - Locator = NA
Column = 5 - Name = NewSel - Locator = sel
Column = 6 - Name = Y.Pb - Locator = NA
Column = 7 - Name = Z.Y.Pb - Locator = z1
The back transformation, from Gaussian to raw scale, is performed using the Hermite polynomial expansion (with a limited number of polynomials). This is why each datum may not coincide exactly with its initial value, as demonstrated in the next correlation plot.
ax = gp.correlation(mydb, namex="Pb", namey="Z.Y.Pb", asPoint=True)
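The round-trip error comes from truncating the Hermite expansion. The sketch below expands a hypothetical raw-from-Gaussian mapping f on probabilists' Hermite polynomials (via `numpy.polynomial.hermite_e`, not the gstlearn anamorphosis itself) and shows the reconstruction error shrinking as more polynomials are kept, mirroring the role of the 30 polynomials chosen above:

```python
import numpy as np
from numpy.polynomial import hermite_e as He

# Gauss-Hermite quadrature nodes/weights for the weight exp(-y^2 / 2)
nodes, weights = He.hermegauss(60)

def herm_coeffs(f, nmax):
    """Coefficients c_n = E[f(Y) He_n(Y)] / n! for Y ~ N(0,1), by quadrature
    (orthogonality: integral of He_n^2 exp(-y^2/2) dy = sqrt(2*pi) * n!)."""
    fy = f(nodes)
    c = np.empty(nmax)
    fact = 1.0
    for n in range(nmax):
        if n > 0:
            fact *= n
        hn = He.hermeval(nodes, np.eye(nmax)[n])    # He_n evaluated at nodes
        c[n] = (weights * fy * hn).sum() / (np.sqrt(2 * np.pi) * fact)
    return c

f = lambda y: np.exp(0.5 * y)     # a smooth, lognormal-like toy mapping
y = np.linspace(-2.5, 2.5, 201)
errs = {}
for nmax in (5, 30):
    approx = He.hermeval(y, herm_coeffs(f, nmax))
    errs[nmax] = np.abs(approx - f(y)).max()
    print(nmax, "polynomials, max abs error:", errs[nmax])
```

With only 5 polynomials the reconstruction visibly departs from f near the edges of the interval; with 30 the residual is negligible for a smooth mapping, which is why the correlation plot above stays so close to the first bisector.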