Anamorphosis¶

Import packages¶

In [1]:
import numpy as np
import pandas as pd
import sys
import os
import matplotlib.pyplot as plt
import gstlearn as gl
import gstlearn.plot as gp
import gstlearn.document as gdoc

gdoc.setNoScroll()

Reading data¶

The data are stored in a CSV format in the file called Pollution.dat. We concentrate on the varibale named Pb.

In [2]:
filepath = gdoc.loadData("Pollution", "Pollution.dat")
mydb = gl.Db.createFromCSV(filepath,gl.CSVformat())
mydb.setLocators(["X","Y"],gl.ELoc.X)
mydb.setLocator("Pb",gl.ELoc.Z)

dbfmt = gl.DbStringFormat.createFromFlags(flag_vars=True, flag_extend=True, flag_stats=True,
                                         names=["*Pb"]) 
mydb.display(dbfmt)
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 5
Total number of samples      = 102

Data Base Extension
-------------------
Coor #1 - Min =    109.850 - Max =    143.010 - Ext = 33.16
Coor #2 - Min =    483.660 - Max =    513.040 - Ext = 29.38

Data Base Statistics
--------------------
5 - Name Pb - Locator z1
 Nb of data          =        102
 Nb of active values =        101
 Minimum value       =      3.000
 Maximum value       =     31.600
 Mean value          =      6.104
 Standard Deviation  =      3.594
 Variance            =     12.916

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = X - Locator = x1
Column = 2 - Name = Y - Locator = x2
Column = 3 - Name = Zn - Locator = NA
Column = 4 - Name = Pb - Locator = z1
 

We denote that one sample has no value defined: therefore only 101 values are available. Moreover, the following histogram shows the presence of two outliers (value above 24).

In [3]:
ax = gp.histogram(mydb, name="Pb", bins=50)
ax.decoration(title="Pb (initial)")

We decide to mask these two outliers off. This is an opportunity to create a selection applied on the Data Base.

In [4]:
tab = mydb.getColumn("Pb")
iuid = mydb.addSelection(tab<24)
In [5]:
ax = gp.histogram(mydb, name="Pb", bins=50)
ax.decoration(title="Pb after filtering the two outliers")

The updated statistics show that the active values of the variable Pb now vary between 3 and 12.7. Note the variance of the Pb variable is equal to 2.881 (instead of 12.9 prior to masking the outliers off).

In [6]:
mydb.display(dbfmt)
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 6
Total number of samples      = 102
Number of active samples     = 99

Data Base Extension
-------------------
Coor #1 - Min =    109.850 - Max =    143.010 - Ext = 33.16
Coor #2 - Min =    483.660 - Max =    513.040 - Ext = 29.38

Data Base Statistics
--------------------
5 - Name Pb - Locator z1
 Nb of data          =        102
 Nb of active values =         99
 Minimum value       =      3.000
 Maximum value       =     12.700
 Mean value          =      5.658
 Standard Deviation  =      1.697
 Variance            =      2.881

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = X - Locator = x1
Column = 2 - Name = Y - Locator = x2
Column = 3 - Name = Zn - Locator = NA
Column = 4 - Name = Pb - Locator = z1
Column = 5 - Name = NewSel - Locator = sel
 
In [7]:
ax = mydb.plot(nameColor="Pb",size=50)
ax.decoration(title="Data Set (Outliers have been masked off)")
plt.show()

Variograms¶

We first define the geometry of the variogram calculations

In [8]:
myVarioParamOmni = gl.VarioParam()
mydir = gl.DirParam.create(npas=10, dpas=1.)
myVarioParamOmni.addDir(mydir)

We calculate the experimental omni-directional variogram

In [9]:
myvario = gl.Vario(myVarioParamOmni)
err = myvario.compute(mydb,gl.ECalcVario.VARIOGRAM)

The variogram is represented graphically.

In [10]:
ax = myvario.plot()
ax.decoration(title="Omni-directional Variogram for Pb")

Model¶

Fitting a Model. We call the Automatic Fitting procedure providing the list of covariance functions to be tested.

In [11]:
mymodel = gl.Model.createFromDb(mydb)
err = mymodel.fit(myvario,[gl.ECov.EXPONENTIAL,gl.ECov.SPHERICAL])

Visualizing the resulting model, overlaid on the experimental variogram

In [12]:
ax = gp.varmod(myvario,mymodel)
ax.decoration(title="Model for Pb")
In [13]:
mymodel.setDriftIRF()
mymodel.display()
Model characteristics
=====================
Space dimension              = 2
Number of variable(s)        = 1
Number of basic structure(s) = 2
Number of drift function(s)  = 1
Number of drift equation(s)  = 1

Covariance Part
---------------
Exponential
- Sill         =      0.966
- Range        =      0.637
- Theo. Range  =      0.212
Spherical
- Sill         =      1.670
- Range        =      5.757
Total Sill     =      2.637

Drift Part
----------
Universality_Condition
 

Gaussian Anamorphosis¶

We transform the Data into Gaussian. This requires the definition of a transform function called Gaussian Anamophosis. This function is expanded on a basis of Hermite polynomials: here 30 polynomials are used.

In [14]:
myanam = gl.AnamHermite(30)
myanam.fitFromLocator(mydb)
myanam.display()
Hermitian Anamorphosis
----------------------
Minimum absolute value for Y  = -2.7
Maximum absolute value for Y  = 2.6
Minimum absolute value for Z  = 3.0029
Maximum absolute value for Z  = 12.9777
Minimum practical value for Y = -2.7
Maximum practical value for Y = 2.6
Minimum practical value for Z = 3.0029
Maximum practical value for Z = 12.9777
Mean                          = 5.65758
Variance                      = 2.86296
Number of Hermite polynomials = 30
Normalized coefficients for Hermite polynomials (punctual variable)
               [,  0]    [,  1]    [,  2]    [,  3]    [,  4]    [,  5]    [,  6]
     [  0,]     5.658    -1.625     0.440    -0.069    -0.017     0.082    -0.061
     [  7,]     0.001     0.036    -0.044     0.004     0.047    -0.030    -0.029
     [ 14,]     0.037     0.007    -0.031     0.010     0.018    -0.019    -0.003
     [ 21,]     0.019    -0.010    -0.014     0.019     0.006    -0.023     0.004
     [ 28,]     0.022    -0.013
 

We can produce the Gaussian Anamorphosis graphically within its definition domain.

In [15]:
ax = gp.anam(myanam)
ax.decoration(title="Anamorphosis")

The next step consists in translating the target variable ('Pb') into its Gaussian transform. We can check that the newly created variable is centered with a mean close to 0 and a variance close to 1.

In [16]:
err = myanam.rawToGaussianByLocator(mydb)
mydb.display(dbfmt)
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 7
Total number of samples      = 102
Number of active samples     = 99

Data Base Extension
-------------------
Coor #1 - Min =    109.850 - Max =    143.010 - Ext = 33.16
Coor #2 - Min =    483.660 - Max =    513.040 - Ext = 29.38

Data Base Statistics
--------------------
5 - Name Pb - Locator NA
 Nb of data          =        102
 Nb of active values =         99
 Minimum value       =      3.000
 Maximum value       =     12.700
 Mean value          =      5.658
 Standard Deviation  =      1.697
 Variance            =      2.881
7 - Name Y.Pb - Locator z1
 Nb of data          =        102
 Nb of active values =         99
 Minimum value       =     -2.700
 Maximum value       =      2.513
 Mean value          =     -0.005
 Standard Deviation  =      1.007
 Variance            =      1.014

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = X - Locator = x1
Column = 2 - Name = Y - Locator = x2
Column = 3 - Name = Zn - Locator = NA
Column = 4 - Name = Pb - Locator = NA
Column = 5 - Name = NewSel - Locator = sel
Column = 6 - Name = Y.Pb - Locator = z1
 

The histogram of the transformed values show the expected beel shape.

In [17]:
ax = gp.histogram(mydb, name="Y.Pb", bins=50)

Variogram in the Gaussian scale¶

We calculate the experimental (omni-directional) variogram on the Gaussian transformed variable.

In [18]:
myvarioG = gl.Vario(myVarioParamOmni)
err = myvarioG.compute(mydb,gl.ECalcVario.VARIOGRAM)

We fit the model by automatic fit. In some cases, it is required the resulting model to have its sill equal to 1: this constraints is added to the fitting step;

In [19]:
mymodelG = gl.Model.createFromDb(mydb)
constr = gl.Constraints(1)
err = mymodelG.fit(myvarioG,[gl.ECov.EXPONENTIAL], constr)
ax = gp.varmod(myvarioG,mymodelG)
ax.decoration(title="Model for Gaussian Pb")

Back transform from Gaussian to Raw scale¶

We turn the Gaussian values back to the Raw scale. This exercise is not very demonstrative when based on the initial data themselves: in operational framework, we use this transform to turn newly created values in the Gaussian scale (results of Simulations for example) back in the Raw scale.

In [20]:
myanam.gaussianToRaw(mydb,"Y.Pb")
mydb.display(dbfmt)
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 8
Total number of samples      = 102
Number of active samples     = 99

Data Base Extension
-------------------
Coor #1 - Min =    109.850 - Max =    143.010 - Ext = 33.16
Coor #2 - Min =    483.660 - Max =    513.040 - Ext = 29.38

Data Base Statistics
--------------------
5 - Name Pb - Locator NA
 Nb of data          =        102
 Nb of active values =         99
 Minimum value       =      3.000
 Maximum value       =     12.700
 Mean value          =      5.658
 Standard Deviation  =      1.697
 Variance            =      2.881
7 - Name Y.Pb - Locator NA
 Nb of data          =        102
 Nb of active values =         99
 Minimum value       =     -2.700
 Maximum value       =      2.513
 Mean value          =     -0.005
 Standard Deviation  =      1.007
 Variance            =      1.014
8 - Name Z.Y.Pb - Locator z1
 Nb of data          =        102
 Nb of active values =         99
 Minimum value       =      3.003
 Maximum value       =     12.700
 Mean value          =      5.658
 Standard Deviation  =      1.697
 Variance            =      2.881

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = X - Locator = x1
Column = 2 - Name = Y - Locator = x2
Column = 3 - Name = Zn - Locator = NA
Column = 4 - Name = Pb - Locator = NA
Column = 5 - Name = NewSel - Locator = sel
Column = 6 - Name = Y.Pb - Locator = NA
Column = 7 - Name = Z.Y.Pb - Locator = z1
 

The back transformation, from Gaussian to Raw scale, is performed using the Hermite polynomial expansion (with a limited number of polynomials). This is the reason why we may expect each datum not to coincide exactly with its initial value. This is demonstrated in the next correlation plot.

In [21]:
ax = gp.correlation(mydb, namex="Pb", namey="Z.Y.Pb", asPoint=True)