Statistics on Db¶
This file demonstrates the use of the statistics functions, performed on a point data base and on a grid data base (in 2-D).
Import packages¶
import numpy as np
import pandas as pd
import sys
import os
import matplotlib.pyplot as plt
import gstlearn as gl
import gstlearn.plot as gp
import gstlearn.document as gdoc
gdoc.setNoScroll()
Defining the grid data base, called grid. The grid contains three variables whose values are generated randomly (named "SG_i").
grid = gl.DbGrid.create(nx=[150,100])
ngrid = grid.getSampleNumber()
grid.addColumns(gl.VectorHelper.simulateGaussian(ngrid),"SG1",gl.ELoc.Z)
grid.addColumns(gl.VectorHelper.simulateGaussian(ngrid),"SG2",gl.ELoc.Z)
grid.addColumns(gl.VectorHelper.simulateGaussian(ngrid),"SG3",gl.ELoc.Z)
grid
Data Base Grid Characteristics
==============================

Data Base Summary
-----------------
File is organized as a regular grid
Space dimension              = 2
Number of Columns            = 6
Total number of samples      = 15000

Grid characteristics:
---------------------
Origin :      0.000     0.000
Mesh   :      1.000     1.000
Number :        150       100

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x1 - Locator = x1
Column = 2 - Name = x2 - Locator = x2
Column = 3 - Name = SG1 - Locator = NA
Column = 4 - Name = SG2 - Locator = NA
Column = 5 - Name = SG3 - Locator = z1
Defining a point data base, called data, covering the grid extension. The data base contains three variables whose values are generated randomly (named "SD_i").
nech = 100
data = gl.Db.createFromBox(nech, grid.getCoorMinimum(), grid.getCoorMaximum())
data.addColumns(gl.VectorHelper.simulateGaussian(nech),"SD1",gl.ELoc.Z)
data.addColumns(gl.VectorHelper.simulateGaussian(nech),"SD2",gl.ELoc.Z)
data.addColumns(gl.VectorHelper.simulateGaussian(nech),"SD3",gl.ELoc.Z)
data
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 6
Total number of samples      = 100

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = SD1 - Locator = NA
Column = 4 - Name = SD2 - Locator = NA
Column = 5 - Name = SD3 - Locator = z1
The following plot displays the variable SG1 from the Grid Data Base (in color scale) and the variable SD1 from the Point Data Base (in proportional symbols).
ax = grid.plot("SG1")
ax = data.plot(color="white")
ax.decoration(title="Data")
Note that all subsequent tests require a set of statistical operations to be specified. This list is defined once and for all, using the fromKeys utility, in order to make the script more legible.
opers = gl.EStatOption.fromKeys(["NUM", "MEAN", "STDV"])
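Other operators can be requested in the same manner. The following sketch is purely illustrative and assumes that the extra keys ("MINI", "MAXI" and "VAR") are recognized by fromKeys; it is not used in the rest of the script.
# Hedged sketch: a richer list of operators (the extra keys are assumed to be valid)
opersFull = gl.EStatOption.fromKeys(["NUM", "MEAN", "STDV", "MINI", "MAXI", "VAR"])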
In the next paragraph, we calculate some monovariate statistics on the variables contained in the Point Data Base. For all methods, several calls are available, depending on:
- how the target variables are specified
- how the results are produced
gl.dbStatisticsMono(data, ["SD*"], opers = opers)
        Number      Mean  St. Dev.
SD1    100.000     0.161     1.028
SD2    100.000    -0.293     1.020
SD3    100.000     0.118     0.910
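The target variables may also be listed explicitly rather than selected through a wildcard; the following hedged variant should return the same table as the call above.
# Hedged variant: explicit variable names instead of the "SD*" wildcard
gl.dbStatisticsMono(data, ["SD1", "SD2", "SD3"], opers = opers)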
The next command produces the correlation matrix of the selected variables.
gl.dbStatisticsCorrel(data, ["SD*"])
          SD1       SD2       SD3
SD1     1.000    -0.006    -0.186
SD2    -0.006     1.000     0.002
SD3    -0.186     0.002     1.000
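As a hedged cross-check (assuming that the "[]" accessor of the Db returns each variable as a numpy array), the same correlation matrix can be recomputed with numpy:
# Hedged cross-check: rebuild the correlation matrix with numpy
# (the data[name] accessor is assumed to return the variable values)
tabSD = np.column_stack([np.ravel(data[name]) for name in ["SD1", "SD2", "SD3"]])
print(np.round(np.corrcoef(tabSD, rowvar=False), 3))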
The following command prints the statistics on the selected variables (including the correlation matrix).
gl.dbStatisticsPrint(data, ["SD*"], opers=opers, flagCorrel=True)
        Number      Mean  St. Dev.
SD1        100     0.161     1.028
SD2        100    -0.293     1.020
SD3        100     0.118     0.910

Number of isotopic active samples = 100

Correlation matrix
        [,1]     [,2]     [,3]
[1,]   1.000   -0.006   -0.186
[2,]  -0.006    1.000    0.002
[3,]  -0.186    0.002    1.000
The following command provides an array containing the evaluation of a given statistical calculation for a set of variables contained in a Db.
If 'flagMono' is set to True, the statistic is calculated for each variable in turn. Otherwise, the statistic for each variable is calculated several times, each time restricted to the samples where one of the other variables is defined; in that case, the dimension of the output is equal to the square of the number of target variables.
In our case, there is no difference between the contents of these two outputs, as the data set is isotopic.
gl.dbStatisticsMulti(data, ["SD*"], gl.EStatOption.MEAN, flagMono = True)
Mean
----
SD1     0.161
SD2    -0.293
SD3     0.118
gl.dbStatisticsMulti(data, ["SD*"], gl.EStatOption.MEAN, flagMono = False)
Mean
----
          SD1      SD2      SD3
SD1     0.161    0.161    0.161
SD2    -0.293   -0.293   -0.293
SD3     0.118    0.118    0.118
Using the Grid¶
We now calculate the statistics of the data contained in the point Db, per cell of the output DbGrid. This function returns the results as an array of values (whose dimension equals the number of cells of the output grid).
For those calculations, we will consider a coarse grid overlaying the initial grid, but with meshes obtained as multiples of the initial one.
gridC = grid.coarsify([5,5])
gridC
Data Base Grid Characteristics
==============================

Data Base Summary
-----------------
File is organized as a regular grid
Space dimension              = 2
Number of Columns            = 6
Total number of samples      = 600

Grid characteristics:
---------------------
Origin :      2.000     2.000
Mesh   :      5.000     5.000
Number :         30        20

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x1 - Locator = x1
Column = 2 - Name = x2 - Locator = x2
Column = 3 - Name = SG1 - Locator = NA
Column = 4 - Name = SG2 - Locator = NA
Column = 5 - Name = SG3 - Locator = z1
tab = gl.dbStatisticsPerCell(data, gridC, gl.EStatOption.MEAN, "SD1")
iuid = gridC.addColumns(tab, "Mean.SD1", gl.ELoc.Z)
ax = gridC.plot("Mean.SD1")
It may be handier to store the statistic (say, the mean) directly as new variables in the output grid file. These calculations are performed for each input variable (carrying a Z locator) in the input file.
data.setLocators(["SD*"],gl.ELoc.Z)
err = gl.dbStatisticsOnGrid(data, gridC, gl.EStatOption.MEAN)
Obviously, the result for the first variable is identical to the previous calculation (as demonstrated by the scatter plot below), but the statistics for the other variables have been calculated simultaneously.
ax = gp.correlation(gridC,namex="Mean.SD1",namey="Stats.SD1", bins=100)
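The agreement can also be checked numerically. This is a minimal sketch, assuming that the "[]" accessor of the grid returns the stored columns and that cells containing no data carry non-finite values.
# Hedged numeric check that both columns coincide on the informed cells
meanSD1  = np.ravel(gridC["Mean.SD1"])
statsSD1 = np.ravel(gridC["Stats.SD1"])
valid = np.isfinite(meanSD1) & np.isfinite(statsSD1)
print(np.allclose(meanSD1[valid], statsSD1[valid]))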
More interesting is the ability to dilate the cells while performing the calculations. Here, each grid node is dilated by a ring extension of 2, so the initial node extension is multiplied by 5; as a result, very few cells have no data included in their dilated extension.
err = gl.dbStatisticsOnGrid(data, gridC, gl.EStatOption.MEAN, radius=2,
namconv=gl.NamingConvention("Stats.Dilate"))
ax = gridC.plot("Stats.Dilate.SD1")
This same feature can be used to calculate the dispersion variance of blocks (say, the cells of the fine grid) within panels (say, the cells of the coarse grid).
grid.setLocator("SG1",gl.ELoc.Z, cleanSameLocator=True)
err = gl.dbStatisticsOnGrid(grid, gridC, gl.EStatOption.VAR, radius=2,
namconv=gl.NamingConvention("Var.Disp"))
ax = gridC.plot("Var.Disp.SG1")
ax.decoration(title="Dispersion Variance of blocks into panels")