Preamble

In this preamble, we load the gstlearn library.

rm(list=ls())
library(gstlearn)
library(ggplot2)
library(ggpubr)

Main Classes

Here is a (non-exhaustive) list of classes of objects in gstlearn:


Loading a CSV File

We start by downloading the file called Scotland_Temperatures.csv and storing it in the current working directory under the name held in filecsv. We then load it into a data frame (named datcsv) using the R function read.csv.

dlfile = "https://soft.minesparis.psl.eu/gstlearn/data/Scotland/Scotland_Temperatures.csv"
filecsv = "Scotland_Temperatures.csv"
download.file(dlfile, filecsv)
datcsv = read.csv(filecsv)

We can check the contents of the data frame (by simply typing its name) and see that it contains four columns (respectively called Longitude, Latitude, Elevation, January_temp) and 236 rows (header line excluded).

datcsv

Note that the last column contains several values called MISS: this corresponds to the absence of information.
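The effect of the MISS code can be illustrated with base R alone. The sketch below uses a small made-up CSV text (not the actual Scotland data) with the same column names, and compares a plain read.csv call with one that declares MISS as the missing-value code through the na.strings argument.

```r
# Toy CSV content mimicking the layout of Scotland_Temperatures.csv
# (the values below are invented for illustration only)
csv_text <- "Longitude,Latitude,Elevation,January_temp
100.0,900.0,10,1.5
150.0,850.0,200,MISS
200.0,800.0,35,3.0"

# Without na.strings, the "MISS" code prevents the column from being numeric
raw <- read.csv(text = csv_text)
is.numeric(raw$January_temp)

# Declaring "MISS" as the missing-value code yields a numeric column with NA
clean <- read.csv(text = csv_text, na.strings = "MISS")
is.numeric(clean$January_temp)
sum(is.na(clean$January_temp))
```

In the first call the temperature column is not numeric, which is why the missing-value code has to be declared before any computation is attempted on it.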


Creating Db object

We now want to load this information into a data base of the gstlearn package (or Db), which will be called dat. This operation can be performed by reading the CSV file again and loading it directly into a Db.

To do so, we start by creating a CSVformat object using the CSVformat_create function. This object is used to specify various properties of the file we want to load, namely the presence of a header (through the argument flagHeader) and the way missing values are coded in the file (through the argument naString).

Then, the function Db_createFromCSV loads the CSV file directly into a gstlearn data base.

csv = CSVformat_create(flagHeader=TRUE, naString = "MISS")
dat = Db_createFromCSV(filecsv, csv=csv)
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 0
## Number of Columns            = 5
## Maximum Number of UIDs       = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = NA
## Column = 2 - Name = Latitude - Locator = NA
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = NA

Importing Db File

Another solution is to import the data base directly from the set of demonstration files provided together with the package. The file (called fileNF) is stored in a specific format known as a Neutral File.

These NF (or Neutral Files) are currently used for serialization of the gstlearn objects. They will probably be replaced in the future by a facility that backs up the whole workspace in one step.

Note that the contents of the Db are slightly different from the result obtained when reading from CSV. Essentially, some variables have a Locator field defined, while others do not. This concept is described later in this chapter and the difference can be ignored for now.

dlfile = "https://soft.minesparis.psl.eu/gstlearn/data/Scotland/Scotland_Temperatures.NF"
fileNF = "Scotland_Temperatures.NF"
download.file(dlfile, fileNF)
dat = Db_createFromNF(fileNF)
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Maximum Number of UIDs       = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1

Db class

Db objects have a display method, which prints a summary of the contents of the data base.

dat$display()
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Maximum Number of UIDs       = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL

Equivalently, we can simply type the name of the Db object in a console to get the summary of its content.

dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Maximum Number of UIDs       = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1

As described in the “Data Base Summary” section, this Db object contains 5 fields (called Columns) and 236 data points (called samples). Upon inspection, we see that the 4 variables of the CSV file are present (Columns 1 through 4), along with an additional variable called rank (Column 0). The rank variable is present by default in all Db objects and contains the index (starting at 1) of each sample in the data base.

Remark: To get more information on the contents of the Db, we can provide the display method of a Db with a DbStringFormat object used to describe which information we would like to print. Such objects can be created using the function DbStringFormat_createFromFlags. We refer the reader to the documentation of the DbStringFormat class for more details. The example below provides a way to add summary statistics about some variables of the Db to the Db summary.

dbfmt = DbStringFormat_createFromFlags(flag_stats=TRUE, names=c("Elevation", "January_temp"))
dat$display(dbfmt)
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Maximum Number of UIDs       = 5
## Total number of samples      = 236
## 
## Data Base Statistics
## --------------------
## 4 - Name Elevation - Locator NA
##  Nb of data          =        236
##  Nb of active values =        236
##  Minimum value       =      2.000
##  Maximum value       =    800.000
##  Mean value          =    146.441
##  Standard Deviation  =    165.138
##  Variance            =  27270.713
## 5 - Name January_temp - Locator z1
##  Nb of data          =        236
##  Nb of active values =        151
##  Minimum value       =      0.600
##  Maximum value       =      5.200
##  Mean value          =      2.815
##  Standard Deviation  =      1.010
##  Variance            =      1.020
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL

Accessors for Db class

We can also consider the data base as a data frame and use the [ ] accessors. For instance, the full contents of a Db can be displayed as a data frame as follows.

dat[]

We can access one or several variables using their names.

dat["January_temp"]
##   [1] 1.7 2.0 4.6  NA 3.1 3.5 3.4 3.0 4.9 2.9  NA 1.3  NA 4.0 1.7  NA 1.9 3.3
##  [19] 2.3  NA 2.3 2.6  NA 2.7 2.9  NA 1.0 1.2  NA 3.1  NA 3.7 2.1 2.5 2.9  NA
##  [37]  NA  NA 3.1 2.1  NA 2.7 3.0  NA  NA 1.8  NA  NA 2.2 2.9 3.3  NA 5.0 1.6
##  [55]  NA 2.1 3.2 4.2 1.1  NA 2.7 0.6 3.2  NA 2.5 2.0 2.8  NA 3.2 3.2 4.5 3.3
##  [73] 4.1 2.2 1.7 4.3 5.2  NA 1.6 3.9 3.1  NA 3.5 4.7 3.6  NA 1.8 1.7  NA  NA
##  [91]  NA  NA  NA  NA  NA 1.7  NA 3.0 4.6 3.9 3.2 1.3  NA  NA  NA 4.7  NA 2.6
## [109] 2.0 4.7 1.2 2.9 0.9 3.0  NA 3.6 0.7 3.3  NA  NA  NA 2.7  NA 2.7 2.4  NA
## [127]  NA 2.0 2.6  NA 4.3  NA  NA  NA  NA 3.1 3.4 3.1 2.0 1.3 1.9  NA 3.3 2.7
## [145] 4.4  NA 3.0 0.9 0.7  NA 3.6  NA 3.5  NA 2.4 1.0  NA 3.6  NA  NA  NA  NA
## [163] 3.0  NA 3.5 4.0 3.0 3.6  NA 3.2 1.7 2.7 1.9  NA  NA 4.4 1.9 3.3  NA  NA
## [181] 3.5 1.7 3.0  NA 2.7  NA 1.0 3.3  NA  NA 3.2 3.9  NA  NA 3.0  NA 3.8  NA
## [199] 2.8  NA 2.9 1.4 2.6 3.0  NA 2.8 2.9 3.6  NA 2.0 4.6 3.7  NA  NA 4.5 2.7
## [217]  NA 4.7 1.7 1.9 3.5  NA  NA  NA 2.1 2.3 3.1  NA  NA 2.0 2.6 2.8 2.6  NA
## [235] 2.1 2.6

Note that the contents of the Column corresponding to the target variable (i.e. January_temp) are returned as a vector of values (printed along a line). Also note the presence of samples with NA, which correspond to those where the target variable is not informed.
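As a small illustration of how undefined samples can be counted, the base R sketch below takes the first 18 values of the printout above and separates the total number of samples from the number of defined ones. It does not call gstlearn, and the variable names are chosen for this example only.

```r
# First 18 values of January_temp, copied from the printout above
temp_head <- c(1.7, 2.0, 4.6, NA, 3.1, 3.5, 3.4, 3.0, 4.9,
               2.9, NA, 1.3, NA, 4.0, 1.7, NA, 1.9, 3.3)

n_total   <- length(temp_head)       # all samples, defined or not
n_defined <- sum(!is.na(temp_head))  # samples carrying a temperature
n_missing <- sum(is.na(temp_head))   # samples where the variable is NA

# Statistics must skip the NA values explicitly
mean_defined <- mean(temp_head, na.rm = TRUE)

c(total = n_total, defined = n_defined, missing = n_missing)
```

On the full data base, the statistics printed earlier report 236 data and 151 active values for January_temp, which suggests that 85 samples carry no temperature.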

We can also be more restrictive, as in the next example, which extracts samples 10 to 15 of the variables Latitude and Elevation.

dat[10:15, c("Latitude", "Elevation")]

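We can also replace the variable names by their Column indices in the data base. Here the indices follow the usual 1-based R convention, whereas the printout numbers the Columns from 0, so that 3:4 designates Latitude and Elevation.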

dat[10:15, 3:4]

This is not recommended as the Column index of a given variable may vary over time.
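The risk can be illustrated with an ordinary R data frame, used here as a stand-in for a Db. The values are invented, and the sketch only shows the general behaviour of positional versus name-based indexing in base R.

```r
# Toy data frame with the same column names as the tutorial (invented values)
df <- data.frame(rank      = 1:3,
                 Longitude = c(100, 150, 200),
                 Latitude  = c(900, 850, 800),
                 Elevation = c(10, 200, 35))

# With the initial column order, both forms return the same extract
by_name  <- df[2:3, c("Latitude", "Elevation")]
by_index <- df[2:3, 3:4]
same_extract <- identical(by_name, by_index)

# After the columns are reordered, positions 3:4 point to other variables
df2 <- df[, c("rank", "Elevation", "Longitude", "Latitude")]
names_by_index <- names(df2[2:3, 3:4])
names_by_name  <- names(df2[2:3, c("Latitude", "Elevation")])
```

Selecting by name keeps returning Latitude and Elevation, while the positional form returns Longitude and Latitude once the columns have moved.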

Finally, an interesting feature of the [ ] accessors is that they make it easy to incorporate new variables into a Db or to modify existing ones. For instance, in the next example, a new variable newvar is created and added to the data base dat.

dat["newvar"] = 12.3 * dat["Elevation"] - 2.1 * dat["*temp"]
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 6
## Maximum Number of UIDs       = 6
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## Column = 5 - Name = newvar - Locator = NA

Remark: Variable names may be specified using regexp-like wildcard expressions (for instance, the symbol ‘*’ stands for any sequence of characters, so that ["*temp"] selects all the variables whose name ends with temp).
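The wildcard convention can be mimicked in base R, which helps to predict which names a given pattern will select. The sketch below relies on glob2rx (from the utils package) to turn the wildcard into a regular expression. It is only an analogy and makes no claim about how gstlearn performs the matching internally.

```r
# Variable names of the data base after newvar has been added
var_names <- c("rank", "Longitude", "Latitude",
               "Elevation", "January_temp", "newvar")

# Translate the wildcard pattern into a regular expression
pattern <- glob2rx("*temp")

# Keep the names matching the pattern
selected <- var_names[grepl(pattern, var_names)]
selected
```

With these names, only January_temp ends with temp, so the pattern designates a single variable and the arithmetic expression used to build newvar is unambiguous.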


Locators

The locators are used to specify the role assigned to a Column for the rest of the study (unless they are modified). The locator is characterized by its name (Z for a variable and X for a coordinate) within the Enumeration ELoc and its rank.

# List all the locator types defined in the ELoc enumeration
ELoc_printAll()
##   -1 -     UNKNOWN : Unknown locator
##    0 -           X : Coordinate
##    1 -           Z : Variable
##    2 -           V : Variance of measurement error
##    3 -           F : External Drift
##    4 -           G : Gradient component
##    5 -           L : Lower bound of an inequality
##    6 -           U : Upper bound of an inequality
##    7 -           P : Proportion
##    8 -           W : Weight
##    9 -           C : Code
##   10 -         SEL : Selection
##   11 -         DOM : Domain
##   12 -        BLEX : Block Extension
##   13 -        ADIR : Dip direction Angle
##   14 -        ADIP : Dip Angle
##   15 -        SIZE : Object height
##   16 -          BU : Fault UP termination
##   17 -          BD : Fault DOWN termination
##   18 -        TIME : Time variable
##   19 -       LAYER : Layer rank
##   20 -      NOSTAT : Non-stationary parameter
##   21 -        TGTE : Tangent
##   22 -        SIMU : Conditional or non-conditional simulations
##   23 -      FACIES : Facies simulated
##   24 -     GAUSFAC : Gaussian value for Facies
##   25 -        DATE : Date
##   26 -       RKLOW : Rank for lower bound (when discretized)
##   27 -        RKUP : Rank for upper bound (when discretized)
##   28 -         SUM : Constraints on the Sum
## NULL
dat$setLocators(c("Longitude","Latitude"), ELoc_X())
## NULL
dat$setLocator("*temp", ELoc_Z(), cleanSameLocator=TRUE)
## NULL
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 6
## Maximum Number of UIDs       = 6
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## Column = 5 - Name = newvar - Locator = NA

As can be seen in the printout, the variables Longitude and Latitude have been designated as coordinates x1 and x2 respectively (pay attention to the order), and January_temp is the (unique) variable z1. Therefore any subsequent step will be performed as a univariate 2-D process.

The locator is translated into a letter-number pair for better legibility: e.g. x1 for the first coordinate.


Plotting a Db

We plot the contents of a Db using the functions of plot.R. The January_temp variable is represented by symbols whose size is proportional to its value.

p = ggDefaultGeographic()
p = p + plot.point(dat, name_size="January_temp", show.legend.symbol = TRUE,
                   legend.name.size="Temperature")
p = p + plot.decoration(title="My Data Base", xlab="Easting", ylab="Northing")
ggPrint(p)

A more elaborate graphic representation displays the samples with a symbol whose size is proportional to the Elevation and whose color represents the Temperature.

p = ggDefaultGeographic()
p = p + plot.point(dat, name_size="Elevation", name_color="January_temp")
p = p + plot.decoration(title="My Data Base", xlab="Easting", ylab="Northing")
ggPrint(p)


Grid Data Base

On the same area, a terrain model is available (as a demonstration file available in the package distribution). We first download it as an element of a data base defined on a grid support (DbGrid).

dlfile = "https://soft.minesparis.psl.eu/gstlearn/data/Scotland/Scotland_Elevations.NF"
fileNF = "Scotland_Elevations.NF"
download.file(dlfile, fileNF)
grid = DbGrid_createFromNF(fileNF)
grid
## 
## Data Base Grid Characteristics
## ==============================
## 
## Data Base Summary
## -----------------
## File is organized as a regular grid
## Space dimension              = 2
## Number of Columns            = 4
## Maximum Number of UIDs       = 4
## Total number of samples      = 11097
## Number of active samples     = 3092
## 
## Grid characteristics:
## ---------------------
## Origin :     65.000   535.000
## Mesh   :      4.938     4.963
## Number :         81       137
## 
## Variables
## ---------
## Column = 0 - Name = Longitude - Locator = x1
## Column = 1 - Name = Latitude - Locator = x2
## Column = 2 - Name = Elevation - Locator = f1
## Column = 3 - Name = inshore - Locator = sel

We can check that the grid consists of 81 columns and 137 rows, i.e. 11097 grid cells.
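The figures of the printout can be cross-checked with a few lines of arithmetic. The sketch below copies the origin, mesh and number of nodes from the printout (the mesh values are rounded there) and assumes that the origin designates the first node along each axis.

```r
# Grid characteristics copied from the printout (mesh values are rounded)
origin <- c(65, 535)
mesh   <- c(4.938, 4.963)
number <- c(81, 137)

# Total number of grid cells
n_cells <- prod(number)

# Coordinates of the last node along each axis
# (assumption: the origin is the first node)
last_node <- origin + (number - 1) * mesh

# Proportion of cells kept by the selection (3092 active samples)
active_fraction <- 3092 / n_cells

n_cells
last_node
round(active_fraction, 3)
```

The product of the node counts matches the total number of samples reported in the data base summary.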


Selection

The grid contains a variable (called inshore) which is assigned to the sel locator: it corresponds to a Selection, which acts as a binary filter where some grid cells are active and the others are masked off. The count of active samples is given in the previous printout (3092). This selection remains active until it is replaced or deleted (there can be at most one selection defined at a time per data base). This can be seen in the following display, where the Elevation is represented only within the inshore selection.

p = ggDefaultGeographic()
p = p + plot.grid(grid, name_raster="Elevation")
p = p + plot.decoration(title="My Grid", xlab="Easting", ylab="Northing")
ggPrint(p)

Note that any variable can be considered as a Selection: it must simply be assigned to the sel locator using the setLocator method described earlier.
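The idea of a selection acting as a binary filter can be pictured with plain R vectors. In the sketch below, which uses invented values and does not call gstlearn, a 0/1 variable plays the role of the selection, and only the active cells contribute to the statistics.

```r
# Toy grid variable and a 0/1 variable playing the role of the selection
elevation <- c(12, 250, 40, 0, 0, 310)
inshore   <- c(1, 1, 1, 0, 0, 1)      # 1 = active cell, 0 = masked off

# Logical mask derived from the selection variable
active <- inshore == 1

# Counterpart of "Number of active samples" in the printout
n_active <- sum(active)

# Statistics restricted to the active cells only
mean_active <- mean(elevation[active])

n_active
mean_active
```

The masked cells remain stored in the vector; they are simply ignored by the computations, which mirrors the way the grid keeps all of its 11097 cells while only 3092 of them are active.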


Final plot

On this final plot, we combine grid and point representations.

p = ggDefaultGeographic()
p = p + plot.grid(grid, name_raster="Elevation")
p = p + plot.point(dat, name_size="Elevation", sizmin=1, sizmax=3, color="yellow")
p = p + plot.decoration(title="My Grid", xlab="Easting", ylab="Northing")
ggPrint(p)