Preamble

In this preamble, we load the gstlearn package, together with the graphical packages used for the plots of this chapter.

rm(list=ls())
library(gstlearn)
library(ggplot2)
library(ggpubr)
library(ggnewscale)

flagInternetAvailable=TRUE

Main Classes

Here is a (non-exhaustive) list of classes of objects in gstlearn:


Importing External File

Loading a CSV File

We start by downloading the file called Scotland_Temperatures.csv and store it in the current working directory. In this example, the file (whose name is stored in filecsv) is provided in CSV format. We load it into a data frame (named datcsv) using the relevant R command. Note that the keyword “MISS” is used in this file to indicate a missing value. Such values are replaced by NA.

filecsv = "Scotland_Temperatures.csv"
if(flagInternetAvailable){
  download.file(paste0("https://soft.minesparis.psl.eu/gstlearn/data/Scotland/",filecsv), filecsv, quiet=TRUE)
}
datcsv = read.csv(filecsv, na.strings = "MISS")

We can check the contents of the data frame (by simply typing its name) and see that it contains four columns (respectively called Longitude, Latitude, Elevation, January_temp) and 236 rows (header line excluded).

datcsv
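The effect of the na.strings argument can be checked without downloading anything. The sketch below uses a small in-memory table whose layout mimics the file (the three rows of values are made up for illustration) and relies on base R only.

```r
# Toy CSV text with the same header as Scotland_Temperatures.csv
# (the values below are invented for illustration).
toy_csv <- "Longitude,Latitude,Elevation,January_temp
372.1,658.9,255,1.7
303.5,664.3,125,MISS
218.4,597.2,8,4.6"

# "MISS" entries are converted to NA, so the column remains numeric.
toy <- read.csv(text = toy_csv, na.strings = "MISS")

dim(toy)              # number of rows and columns
sapply(toy, class)    # type of each column
colSums(is.na(toy))   # number of missing values per column
```

Without na.strings = "MISS", the January_temp column would be read as character strings, which is usually not what is wanted for a numerical variable.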

Creating Db object from a data.frame

The user can then create a database of the gstlearn package (Db class) directly from the previously imported data.frame.

dat = Db_fromDF(datcsv)
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 0
## Number of Columns            = 4
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = Longitude - Locator = NA
## Column = 1 - Name = Latitude - Locator = NA
## Column = 2 - Name = Elevation - Locator = NA
## Column = 3 - Name = January_temp - Locator = NA

Creating Db object directly from CSV file

These two operations can be combined by reading the CSV file again and loading it directly into a Db.

To do so, we start by creating a CSVformat object using the CSVformat_create function. This object is used to specify various properties of the file we want to load, namely the presence of a header line (through the argument flagHeader) and the way missing values are coded in the file (through the argument naString).

Then, the function Db_createFromCSV loads the CSV file directly into a gstlearn data base.

csv = CSVformat_create(flagHeader=TRUE, naString = "MISS")
dat = Db_createFromCSV(filecsv, csv=csv)
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 0
## Number of Columns            = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = NA
## Column = 2 - Name = Latitude - Locator = NA
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = NA

Note that a “rank” variable has been automatically added. The rank is always 1-based and must be distinguished from an index (0-based) when calling gstlearn functions (except for the [] operator, see below). The rank variable may be useful later for certain functions of the gstlearn package.
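The relation between the two conventions is a simple shift by one. The sketch below illustrates it with two hypothetical helper functions (rank_to_index and index_to_rank are not part of gstlearn; they are only introduced here for illustration).

```r
# Hypothetical helpers (NOT gstlearn functions) converting between
# 1-based ranks and 0-based indices.
rank_to_index <- function(rank) rank - 1L
index_to_rank <- function(index) index + 1L

ranks <- 1:5
indices <- rank_to_index(ranks)

indices                  # 0-based positions of the first five samples
index_to_rank(indices)   # back to the 1-based ranks
```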


Importing Db File from a “Neutral File”

A last solution is to import the data base directly from the set of demonstration files provided together with the package. The file (whose name is stored in fileNF) is stored in a specific format called Neutral File.

Neutral Files (NF) are currently used for serialization of the gstlearn objects. They will probably be replaced in the future by a facility that backs up the whole workspace in one step.

Note that the contents of the Db are slightly different from the result obtained when reading from CSV. Essentially, some variables have a Locator field defined and some do not. This concept is described later in this chapter and the difference can be ignored for now.

fileNF = "Scotland_Temperatures.NF"
if(flagInternetAvailable){
  download.file(paste0("https://soft.minesparis.psl.eu/gstlearn/data/Scotland/",fileNF), fileNF, quiet=TRUE)
}
dat = Db_createFromNF(fileNF)
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1

Discovering Db

The Db class

Db objects (like all objects that inherit from AStringable) have a display method that prints a summary of the contents of the data base. The same summary is printed when typing the name of the variable at the end of a chunk (see above).

dat$display()
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL

As described in the “Data Base Summary” section, this Db object contains 5 fields (called Columns) and 236 data points (called samples). Upon inspection, we see that the 4 variables of the CSV file are present (Columns 1 through 4), along with an additional variable called rank (Column 0).

In addition, the summary indicates that this data base is defined in a 2-D space: this will be described later together with the use of the Locator information.

Remark: To get more information on the contents of the Db, we can provide the display method of a Db with a DbStringFormat object used to describe which information we would like to print. Such objects can be created using the function DbStringFormat_createFromFlags. We refer the reader to the documentation of the DbStringFormat class for more details. The example below provides a way to add summary statistics about some variables of the Db to the Db summary.

dbfmt = DbStringFormat_createFromFlags(flag_stats=TRUE, names=c("Elevation", "January_temp"))
dat$display(dbfmt)
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Total number of samples      = 236
## 
## Data Base Statistics
## --------------------
## 4 - Name Elevation - Locator NA
##  Nb of data          =        236
##  Nb of active values =        236
##  Minimum value       =      2.000
##  Maximum value       =    800.000
##  Mean value          =    146.441
##  Standard Deviation  =    165.138
##  Variance            =  27270.713
## 5 - Name January_temp - Locator z1
##  Nb of data          =        236
##  Nb of active values =        151
##  Minimum value       =      0.600
##  Maximum value       =      5.200
##  Mean value          =      2.815
##  Standard Deviation  =      1.010
##  Variance            =      1.020
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL

Monovariate statistics are better displayed using a single function called dbStatisticsMono. This function expects a vector of enumerators of type EStatOption as statistic operators. Such a vector is created using a static function called fromKeys, which is available in all enumerator classes (i.e. classes that inherit from AEnum).

dbStatisticsMono(dat,
                  names=c("Elevation", "January_temp"),
                  opers=EStatOption_fromKeys(c("MEAN","MINI","MAXI")))
##                    Mean    Minimum    Maximum
##    Elevation     87.974      3.000    387.000
## January_temp      2.815      0.600      5.200
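For comparison, a table with the same layout (one row per variable, one column per statistic) can be built in base R from a data.frame. This is only a sketch on made-up values: each statistic is computed variable by variable, skipping its own NA values, which is not necessarily how dbStatisticsMono selects the samples it uses.

```r
# Made-up values for illustration (two NA in the temperature column).
toy <- data.frame(
  Elevation    = c(255, 125, 8, 46, 313),
  January_temp = c(1.7, NA, 4.6, NA, 3.1)
)

# sapply() returns one column per variable; t() puts variables in rows.
stats <- t(sapply(toy, function(v) {
  c(Mean    = mean(v, na.rm = TRUE),
    Minimum = min(v, na.rm = TRUE),
    Maximum = max(v, na.rm = TRUE))
}))

round(stats, 3)
```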

Accessors for Db class

We can also consider the data base as a data frame and use the [ ] accessors. For instance, the full content of a Db can be displayed as a data.frame as follows.

dat[]

We can access one or several variables. Note that the contents of the Column corresponding to the target variable (i.e. January_temp) are returned as a 1-D vector.

Also note the presence of samples with NA corresponding to those where the target variable is not informed (‘MISS’ in the original dataset file).

dat["January_temp"]
##   [1] 1.7 2.0 4.6  NA 3.1 3.5 3.4 3.0 4.9 2.9  NA 1.3  NA 4.0 1.7  NA 1.9 3.3
##  [19] 2.3  NA 2.3 2.6  NA 2.7 2.9  NA 1.0 1.2  NA 3.1  NA 3.7 2.1 2.5 2.9  NA
##  [37]  NA  NA 3.1 2.1  NA 2.7 3.0  NA  NA 1.8  NA  NA 2.2 2.9 3.3  NA 5.0 1.6
##  [55]  NA 2.1 3.2 4.2 1.1  NA 2.7 0.6 3.2  NA 2.5 2.0 2.8  NA 3.2 3.2 4.5 3.3
##  [73] 4.1 2.2 1.7 4.3 5.2  NA 1.6 3.9 3.1  NA 3.5 4.7 3.6  NA 1.8 1.7  NA  NA
##  [91]  NA  NA  NA  NA  NA 1.7  NA 3.0 4.6 3.9 3.2 1.3  NA  NA  NA 4.7  NA 2.6
## [109] 2.0 4.7 1.2 2.9 0.9 3.0  NA 3.6 0.7 3.3  NA  NA  NA 2.7  NA 2.7 2.4  NA
## [127]  NA 2.0 2.6  NA 4.3  NA  NA  NA  NA 3.1 3.4 3.1 2.0 1.3 1.9  NA 3.3 2.7
## [145] 4.4  NA 3.0 0.9 0.7  NA 3.6  NA 3.5  NA 2.4 1.0  NA 3.6  NA  NA  NA  NA
## [163] 3.0  NA 3.5 4.0 3.0 3.6  NA 3.2 1.7 2.7 1.9  NA  NA 4.4 1.9 3.3  NA  NA
## [181] 3.5 1.7 3.0  NA 2.7  NA 1.0 3.3  NA  NA 3.2 3.9  NA  NA 3.0  NA 3.8  NA
## [199] 2.8  NA 2.9 1.4 2.6 3.0  NA 2.8 2.9 3.6  NA 2.0 4.6 3.7  NA  NA 4.5 2.7
## [217]  NA 4.7 1.7 1.9 3.5  NA  NA  NA 2.1 2.3 3.1  NA  NA 2.0 2.6 2.8 2.6  NA
## [235] 2.1 2.6

The selection can also be more restrictive, as in the following example, where we only consider samples 10 to 15 and the variables rank, Latitude and Elevation. In R, array indices run from 1 to N (1-based). The slice ‘10:15’ denotes the indices {10,11,12,13,14,15} (the last index is included, unlike in Python), which correspond to ranks {10,11,12,13,14,15}. Be careful: for all other functions of the gstlearn package, indices must be provided 0-based.

dat[10:15, c("rank", "Latitude", "Elevation")]
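The inclusive behavior of the slice can be verified on a plain data.frame. The sketch below uses a made-up table of 20 samples (the Latitude values are invented) and shows the 0-based indices that would designate the same samples in the other gstlearn functions.

```r
# In R, 10:15 contains both bounds: six values in total.
idx <- 10:15
length(idx)

# Made-up table of 20 samples for illustration.
toy <- data.frame(rank = 1:20, Latitude = seq(600, 790, by = 10))

# Rows 10 to 15 (1-based), restricted to two named columns.
toy[idx, c("rank", "Latitude")]

# The same samples expressed as 0-based indices.
idx - 1
```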

We can also replace the variable names by their Column indices in the data base (1-based within the [] operator).

dat[10:15, 3:4]

This is not recommended as the Column index of a given variable may vary over time.

A dedicated function is available to convert the whole data base into an appropriate object of the Target Language (here R). A gstlearn Db is converted into a data.frame using toTL.

dat$toTL()

Finally, an interesting feature of the [ ] accessors is that they make it easy to incorporate new variables into a Db or to modify existing ones. For instance, in the next example, a new variable newvar is created and added to the data base dat.

dat["newvar"] = 12.3 * dat["Elevation"] - 2.1 * dat["*temp"]
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 6
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## Column = 5 - Name = newvar - Locator = NA

Remark: Note that variable names may be specified using pattern matching (for instance, the symbol ‘*’ stands for any sequence of characters, meaning that ["*temp"] selects all the variables whose name ends with temp).
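The same kind of selection can be reproduced in base R in order to check which names a given pattern would match. This sketch relies on utils::glob2rx, which converts a wildcard pattern into a regular expression; it only illustrates the idea and is not the matching code used internally by gstlearn.

```r
# Names of the variables of the data base, as listed in the printout above.
vars <- c("rank", "Longitude", "Latitude", "Elevation", "January_temp")

# Convert the wildcard pattern into a regular expression.
pattern <- utils::glob2rx("*temp")

# Keep the names matching the pattern.
vars[grepl(pattern, vars)]
```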

The user can also remove a variable from the data base as follows:

dat$deleteColumn("newvar")
## NULL
dat$display()
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL

Locators

The locators are used to specify the role assigned to a Column for the rest of the study (unless they are modified). The locator is characterized by its name (Z for a variable and X for a coordinate) within the Enumeration ELoc.

dat$setLocators(c("Longitude","Latitude"), ELoc_X())
## NULL
dat$setLocator("*temp", ELoc_Z(), cleanSameLocator=TRUE)
## NULL
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1

As can be seen in the printout, the variables Longitude and Latitude have been designated as coordinates (pay attention to the order) and January_temp is the (unique) variable of interest. Therefore, any subsequent step will be performed as a monovariate 2-D process.

The locator is translated into a letter-number pair for better legibility: e.g. x1 for the first coordinate.

Here are all the roles known by gstlearn:

ELoc_printAll()
##   -1 -     UNKNOWN : Unknown locator
##    0 -           X : Coordinate
##    1 -           Z : Variable
##    2 -           V : Variance of measurement error
##    3 -           F : External Drift
##    4 -           G : Gradient component
##    5 -           L : Lower bound of an inequality
##    6 -           U : Upper bound of an inequality
##    7 -           P : Proportion
##    8 -           W : Weight
##    9 -           C : Code
##   10 -         SEL : Selection
##   11 -         DOM : Domain
##   12 -        BLEX : Block Extension
##   13 -        ADIR : Dip direction Angle
##   14 -        ADIP : Dip Angle
##   15 -        SIZE : Object height
##   16 -          BU : Fault UP termination
##   17 -          BD : Fault DOWN termination
##   18 -        TIME : Time variable
##   19 -       LAYER : Layer rank
##   20 -      NOSTAT : Non-stationary parameter
##   21 -        TGTE : Tangent
##   22 -        SIMU : Conditional or non-conditional simulations
##   23 -      FACIES : Facies simulated
##   24 -     GAUSFAC : Gaussian value for Facies
##   25 -        DATE : Date
##   26 -       RKLOW : Rank for lower bound (when discretized)
##   27 -        RKUP : Rank for upper bound (when discretized)
##   28 -         SUM : Constraints on the Sum
## NULL

More with Db

Plotting a Db

We plot the contents of a Db using the graphical functions of the package (which rely on ggplot2). The color option is used to represent the January_temp variable.

Note: Unavailable values (NA) are displayed in gray. This will be tunable in future versions.

p = ggDefaultGeographic()
p = p + plot.point(dat, nameColor="January_temp", flagLegendColor = TRUE,
                   legendNameColor="Temperature")
p = p + plot.decoration(title="January Temperature", xlab="Easting", ylab="Northing")
ggPrint(p)

A more elaborate graphic representation displays the samples with a symbol whose size is proportional to the Elevation (nameSize) and whose color represents the Temperature (nameColor).

p = ggDefaultGeographic()
p = p + plot.point(dat, nameSize="Elevation", nameColor="January_temp", flagLegendColor = TRUE,
                   legendNameColor="Temperature", legendNameSize="Elevation")
p = p + plot.decoration(title="January Temperature", xlab="Easting", ylab="Northing")
ggPrint(p)

Of course, you can use your own graphical routines (for example, a direct call to ggplot2) by simply accessing the values of the gstlearn data base (using the ‘[ ]’ accessor):

p = ggplot()
p = p + geom_point(data=dat[], mapping=aes(x=dat["x1"], y=dat["x2"], color=dat["January_temp"]))
p = p + labs(color = "Temperature")
p = p + labs(x = "Easting", y = "Northing")
p = p + labs(title = "January Temperature")
plot(p)


Grid Data Base

On the same area, a terrain model is available (as a demonstration file available in the package distribution). We first download it and create the corresponding data base defined on a grid support (DbGrid).

fileNF = "Scotland_Elevations.NF"
if(flagInternetAvailable){
  download.file(paste0("https://soft.minesparis.psl.eu/gstlearn/data/Scotland/",fileNF), fileNF, quiet=TRUE)
}
grid = DbGrid_createFromNF(fileNF)
grid
## 
## Data Base Grid Characteristics
## ==============================
## 
## Data Base Summary
## -----------------
## File is organized as a regular grid
## Space dimension              = 2
## Number of Columns            = 4
## Total number of samples      = 11097
## Number of active samples     = 3092
## 
## Grid characteristics:
## ---------------------
## Origin :     65.000   535.000
## Mesh   :      4.938     4.963
## Number :         81       137
## 
## Variables
## ---------
## Column = 0 - Name = Longitude - Locator = x1
## Column = 1 - Name = Latitude - Locator = x2
## Column = 2 - Name = Elevation - Locator = f1
## Column = 3 - Name = inshore - Locator = sel

We can check that the grid is made of 81 columns and 137 rows, i.e. 11097 grid cells. We can also notice that some locators are already defined (this information is stored in the Neutral File).
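The grid characteristics of the printout are enough to rebuild the positions of the grid nodes by hand. The sketch below does so in base R, assuming that the Origin gives the position of the first node and using the Mesh values as rounded in the printout; both points are assumptions to be checked against the DbGrid documentation.

```r
# Grid characteristics copied from the printout above.
nx <- c(81, 137)         # number of nodes along x and y
x0 <- c(65, 535)         # origin (assumed to be the first node)
dx <- c(4.938, 4.963)    # mesh sizes (rounded values)

# Total number of grid cells.
prod(nx)

# Coordinates of the nodes along each axis.
x <- x0[1] + (0:(nx[1] - 1)) * dx[1]
y <- x0[2] + (0:(nx[2] - 1)) * dx[2]

# One row per grid node, x varying fastest.
nodes <- expand.grid(x = x, y = y)
nrow(nodes)
range(nodes$x)
range(nodes$y)
```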


Selection

We can check the presence of a variable (called inshore) which is assigned to the sel locator: this corresponds to a Selection which acts as a binary filter: some grid cells are active and others are masked off. The count of active samples is given in the previous printout (3092). This selection remains active until it is replaced or deleted (there may not be more than one selection defined at a time per data base). This is what can be seen in the following display where we represent the Elevation only within the inshore selection.

p = ggDefaultGeographic()
p = p + plot.grid(grid, nameRaster="Elevation", flagLegendRaster=TRUE, legendNameRaster="Elevation")
p = p + plot.decoration(title="Elevation", xlab="Easting", ylab="Northing")
ggPrint(p)

Note that any variable can be considered as a Selection: it must simply be assigned to the sel locator using the setLocator method described earlier.
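The effect of a binary selection can be mimicked in base R with a logical mask. The sketch below uses made-up values and assumes, for illustration only, that a selection value of 1 marks an active cell and 0 a masked one.

```r
# Made-up values for six grid cells.
elevation <- c(12, 340, 0, 85, 0, 410)
inshore   <- c(1, 1, 0, 1, 0, 1)   # assumed coding: 1 = active, 0 = masked

# Logical mask of the active cells.
active <- inshore == 1
sum(active)               # number of active cells

# Statistics are restricted to the active cells.
mean(elevation[active])

# For display purposes, masked cells can be turned into NA.
masked <- ifelse(active, elevation, NA)
masked
```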


Final plot

On this final plot, we combine grid and point representations.

p = ggDefaultGeographic()
p = p + plot.grid(grid, nameRaster="Elevation", flagLegendRaster=TRUE, legendNameRaster="Elevation")
p = p + plot.point(dat, nameSize="January_temp", flagLegendSize=TRUE, legendNameSize="Temperature", sizmin=1, sizmax=3, color="yellow")
p = p + plot.decoration(title="Elevation and Temperatures", xlab="Easting", ylab="Northing")
ggPrint(p)