Preamble

In this preamble, we load the gstlearn package, together with the graphics packages used for the plots at the end of this chapter.

rm(list=ls())
library(gstlearn)
library(ggplot2)
library(ggpubr)
library(ggnewscale)

Main Classes

Here is a (non-exhaustive) list of classes of objects in gstlearn:


Importing External File

Loading a CSV File

We start by downloading the file called Scotland_Temperatures.csv and storing it in the current working directory. In this example, the file (whose path is held in filecsv) is provided in CSV format. We load it into a data frame (named datcsv) using the R function read.csv. Note that the keyword "MISS" is used in this file to indicate a missing value. Such values are replaced by NA.

filecsv = loadData("Scotland", "Scotland_Temperatures.csv")
datcsv = read.csv(filecsv, na.strings = "MISS")

We can check the contents of the data frame (by simply typing its name) and see that it contains four columns (respectively called Longitude, Latitude, Elevation, January_temp) and 236 rows (header line excluded).

datcsv
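The effect of the na.strings argument can be checked without the real file. The sketch below uses only base R and a hypothetical three-row extract (the values are made up for illustration, they are not taken from Scotland_Temperatures.csv).

```r
# Hypothetical three-row extract; the values are made up for illustration.
csv_text <- "Longitude,Latitude,Elevation,January_temp
372.1,658.9,255,1.7
303.5,732.3,125,MISS
218.4,798.7,8,3.5"

# na.strings tells read.csv which token stands for a missing value.
demo <- read.csv(text = csv_text, na.strings = "MISS")

# The column is read as numeric and the "MISS" entry becomes NA.
print(is.numeric(demo$January_temp))
print(sum(is.na(demo$January_temp)))
```

Without na.strings, the "MISS" token would force the whole column to be read as character, which is why the argument matters here.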

Creating Db object from a data.frame

The user can then create a database of the gstlearn package (Db class) directly from the previously imported data.frame.

dat = fromTL(datcsv)
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 0
## Number of Columns            = 4
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = Longitude - Locator = NA
## Column = 1 - Name = Latitude - Locator = NA
## Column = 2 - Name = Elevation - Locator = NA
## Column = 3 - Name = January_temp - Locator = NA

Creating Db object directly from CSV file

These operations can also be performed in a single step by reading the CSV file and loading it directly into a Db.

To do so, we start by creating a CSVformat object using the CSVformat_create function. This object is used to specify various properties of the file we want to load, namely the presence of a header line (through the argument flagHeader) and the way missing values are coded in the file (through the argument naString).

Then, the function Db_createFromCSV loads the CSV file directly into a gstlearn data base.

csv = CSVformat_create(flagHeader=TRUE, naString = "MISS")
dat = Db_createFromCSV(filecsv, csv=csv)
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 0
## Number of Columns            = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = NA
## Column = 2 - Name = Latitude - Locator = NA
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = NA

Note that a "rank" variable has been automatically added. The rank is always 1-based and must be distinguished from an index (0-based) when calling gstlearn functions (except for the [] operator, see below). The rank variable may be useful later for certain functions of the gstlearn package.
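The relation between the two conventions is a simple shift by one. The following base-R sketch (the names n, rank and index are hypothetical and do not refer to gstlearn objects) spells it out for five samples.

```r
n <- 5
rank  <- seq_len(n)   # 1-based ranks: 1, 2, ..., n
index <- rank - 1L    # matching 0-based indices: 0, 1, ..., n - 1

print(data.frame(rank = rank, index = index))
```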


Importing Db File from a "Neutral File"

A last solution is to import the data base directly from the set of demonstration files provided together with the package, where it is stored in a specific format (Neutral File). The path of this file is held in fileNF.

These NF (Neutral Files) are currently used for serialization of the gstlearn objects. They will probably be replaced in the future by a facility that backs up the whole workspace in one step.

Note that the contents of the Db are slightly different from the result obtained when reading from CSV. Essentially, some variables have a Locator field defined, while others do not. This concept is described later in this chapter, and the difference can be ignored for now.

fileNF = loadData("Scotland", "Scotland_Temperatures.NF")
dat = Db_createFromNF(fileNF)
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1

Discovering Db

The Db class

Db objects (like all objects that inherit from AStringable) have a display method that prints a summary of the contents of the data base. The same summary is printed when typing the name of the variable at the end of a chunk (see above).

dat$display()
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL

As described in the "Data Base Summary" section, this Db object contains 5 fields (called Columns) and 236 data points (called samples). Upon inspection, we see that the 4 variables of the CSV file are present (Columns 1 through 4), along with an additional variable called rank (Column 0).

In addition, the summary indicates that this data base is 2-D (Space dimension = 2): this will be described later together with the use of the Locator information.

Remark: To get more information on the contents of the Db, we can provide the display method of a Db with a DbStringFormat object used to describe which information we would like to print. Such objects can be created using the function DbStringFormat_createFromFlags. We refer the reader to the documentation of the DbStringFormat class for more details. The example below provides a way to add summary statistics about some variables of the Db to the Db summary.

dbfmt = DbStringFormat_createFromFlags(flag_stats=TRUE, names=c("Elevation", "January_temp"))
dat$display(dbfmt)
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Total number of samples      = 236
## 
## Data Base Statistics
## --------------------
## 4 - Name Elevation - Locator NA
##  Nb of data          =        236
##  Nb of active values =        236
##  Minimum value       =      2.000
##  Maximum value       =    800.000
##  Mean value          =    146.441
##  Standard Deviation  =    165.138
##  Variance            =  27270.713
## 5 - Name January_temp - Locator z1
##  Nb of data          =        236
##  Nb of active values =        151
##  Minimum value       =      0.600
##  Maximum value       =      5.200
##  Mean value          =      2.815
##  Standard Deviation  =      1.010
##  Variance            =      1.020
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL

Monovariate statistics are better displayed using a dedicated function called dbStatisticsMono. This function expects a vector of enumerators of type EStatOption as statistic operators. Such a vector is created using a static function called fromKeys, which is available in all enumerator classes (i.e. classes that inherit from AEnum).

dbStatisticsMono(dat,
                  names=c("Elevation", "January_temp"),
                  opers=EStatOption_fromKeys(c("MEAN","MINI","MAXI")))
##                    Mean    Minimum    Maximum
##    Elevation     87.974      3.000    387.000
## January_temp      2.815      0.600      5.200
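To make the three operators concrete, the sketch below recomputes a mean, a minimum and a maximum with base R on a small made-up data frame. It does not call gstlearn, and the way missing values are handled here (na.rm = TRUE, variable by variable) is an assumption of this sketch, not a statement about how dbStatisticsMono treats them.

```r
# Made-up values, for illustration only.
demo <- data.frame(Elevation    = c(10, 50, 120, 300),
                   January_temp = c(3.1, NA, 2.4, 1.0))

# One row per variable, one column per statistic, ignoring NA values.
stats <- t(sapply(demo, function(v) {
  c(Mean    = mean(v, na.rm = TRUE),
    Minimum = min(v, na.rm = TRUE),
    Maximum = max(v, na.rm = TRUE))
}))

print(stats)
```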

Accessors for Db class

We can also consider the data base as a data frame and use the [ ] accessors. For instance, the full content of a Db can be displayed as a data.frame as follows.

dat[]

We can access one or several variables. Note that the contents of the Column corresponding to the target variable (i.e. January_temp) are returned as a 1D vector.

Also note the presence of NA values for the samples where the target variable is not available ('MISS' in the original dataset file).

dat["January_temp"]
##   [1] 1.7 2.0 4.6  NA 3.1 3.5 3.4 3.0 4.9 2.9  NA 1.3  NA 4.0 1.7  NA 1.9 3.3
##  [19] 2.3  NA 2.3 2.6  NA 2.7 2.9  NA 1.0 1.2  NA 3.1  NA 3.7 2.1 2.5 2.9  NA
##  [37]  NA  NA 3.1 2.1  NA 2.7 3.0  NA  NA 1.8  NA  NA 2.2 2.9 3.3  NA 5.0 1.6
##  [55]  NA 2.1 3.2 4.2 1.1  NA 2.7 0.6 3.2  NA 2.5 2.0 2.8  NA 3.2 3.2 4.5 3.3
##  [73] 4.1 2.2 1.7 4.3 5.2  NA 1.6 3.9 3.1  NA 3.5 4.7 3.6  NA 1.8 1.7  NA  NA
##  [91]  NA  NA  NA  NA  NA 1.7  NA 3.0 4.6 3.9 3.2 1.3  NA  NA  NA 4.7  NA 2.6
## [109] 2.0 4.7 1.2 2.9 0.9 3.0  NA 3.6 0.7 3.3  NA  NA  NA 2.7  NA 2.7 2.4  NA
## [127]  NA 2.0 2.6  NA 4.3  NA  NA  NA  NA 3.1 3.4 3.1 2.0 1.3 1.9  NA 3.3 2.7
## [145] 4.4  NA 3.0 0.9 0.7  NA 3.6  NA 3.5  NA 2.4 1.0  NA 3.6  NA  NA  NA  NA
## [163] 3.0  NA 3.5 4.0 3.0 3.6  NA 3.2 1.7 2.7 1.9  NA  NA 4.4 1.9 3.3  NA  NA
## [181] 3.5 1.7 3.0  NA 2.7  NA 1.0 3.3  NA  NA 3.2 3.9  NA  NA 3.0  NA 3.8  NA
## [199] 2.8  NA 2.9 1.4 2.6 3.0  NA 2.8 2.9 3.6  NA 2.0 4.6 3.7  NA  NA 4.5 2.7
## [217]  NA 4.7 1.7 1.9 3.5  NA  NA  NA 2.1 2.3 3.1  NA  NA 2.0 2.6 2.8 2.6  NA
## [235] 2.1 2.6

The selection can be more restrictive, as in the following example, where we only consider samples 10 to 15 and the variables rank, Latitude and Elevation. In R, array indices run from 1 to N (1-based). The slice '10:15' in R means indices {10,11,12,13,14,15} (the last index is included, unlike in Python), which corresponds to ranks {10,11,12,13,14,15}. Be careful: for all other functions of the gstlearn package, indices must be provided 0-based.

dat[10:15, c("rank", "Latitude", "Elevation")]
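The inclusive behaviour of the slice can be verified on an ordinary data frame. The sketch below uses base R only, with a made-up 20-row table standing in for the Db (the values are hypothetical).

```r
# Made-up table with 20 samples; rank is 1-based as in a gstlearn Db.
demo <- data.frame(rank      = 1:20,
                   Latitude  = seq(600, 790, by = 10),
                   Elevation = seq(5, 100, by = 5))

# 10:15 includes both ends, so six rows are selected.
sub <- demo[10:15, c("rank", "Latitude", "Elevation")]
print(nrow(sub))

# The same samples expressed as 0-based indices.
print(sub$rank - 1L)
```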

We can also replace the variable names by their Column indices (1-based in the [] operator) in the data base.

dat[10:15, 3:4]

This is not recommended as the Column index of a given variable may vary over time.

A dedicated function is available to convert the whole data base into an appropriate object of the Target Language (here R). A gstlearn Db is converted into a data.frame using toTL.

dat$toTL()

Finally, an interesting feature of the [ ] accessors is that they make it easy to add new variables to a Db or to modify existing ones. For instance, in the next example, a new variable newvar is created and added to the data base dat.

dat["newvar"] = 12.3 * dat["Elevation"] - 2.1 * dat["*temp"]
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 6
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## Column = 5 - Name = newvar - Locator = NA
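Since the right-hand side of this assignment is evaluated on the vectors returned by the accessors, missing temperatures propagate to the new variable. The base-R sketch below illustrates this with made-up values; it assumes the accessors return ordinary R numeric vectors, as the printout of dat["January_temp"] suggests.

```r
# Made-up values standing in for dat["Elevation"] and dat["*temp"].
elevation <- c(255, 125, 8)
temp      <- c(1.7, NA, 3.5)

# Element-wise arithmetic: any NA in an operand yields NA in the result.
newvar <- 12.3 * elevation - 2.1 * temp
print(newvar)
```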

Remark: Variable names may be specified using traditional regexp-style expressions (for instance, the symbol '*' stands for any sequence of characters, so that ["*temp"] selects all the variables whose name ends with temp).

The user can also remove a variable from the data base by doing the following:
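As an analogy only, the same kind of selection can be reproduced in base R with glob2rx, which converts a wildcard pattern into a regular expression. This is a sketch of the matching idea and is not claimed to be how gstlearn resolves names internally.

```r
# Variable names as listed in the Db summary above.
vars <- c("rank", "Longitude", "Latitude", "Elevation", "January_temp")

# Convert the wildcard pattern to a regular expression, then match.
pattern <- glob2rx("*temp")
matched <- vars[grepl(pattern, vars)]
print(matched)
```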

dat$deleteColumn("newvar")
## NULL
dat$display()
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL

Locators

The locators are used to specify the role assigned to a Column for the rest of the study (unless they are modified). The locator is characterized by its name (Z for a variable and X for a coordinate) within the Enumeration ELoc.

dat$setLocators(c("Longitude","Latitude"), ELoc_X())
## NULL
dat$setLocator("*temp", ELoc_Z(), cleanSameLocator=TRUE)
## NULL
dat
## 
## Data Base Characteristics
## =========================
## 
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension              = 2
## Number of Columns            = 5
## Total number of samples      = 236
## 
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1

As can be seen in the printout, variables Longitude and Latitude have been designated as coordinates (pay attention to the order) and January_temp is the (unique) variable of interest. Therefore, any subsequent step will be performed as a monovariate 2-D process.

The locator is translated into a letter-number pair for better legibility: e.g. x1 for the first coordinate.

Here are all the roles known by gstlearn:

ELoc_printAll()
##   -1 -     UNKNOWN : Unknown locator
##    0 -           X : Coordinate
##    1 -           Z : Variable
##    2 -           V : Variance of measurement error
##    3 -           F : External Drift
##    4 -           G : Gradient component
##    5 -           L : Lower bound of an inequality
##    6 -           U : Upper bound of an inequality
##    7 -           P : Proportion
##    8 -           W : Weight
##    9 -           C : Code
##   10 -         SEL : Selection
##   11 -         DOM : Domain
##   12 -        BLEX : Block Extension
##   13 -        ADIR : Dip direction Angle
##   14 -        ADIP : Dip Angle
##   15 -        SIZE : Object height
##   16 -          BU : Fault UP termination
##   17 -          BD : Fault DOWN termination
##   18 -        TIME : Time variable
##   19 -       LAYER : Layer rank
##   20 -      NOSTAT : Non-stationary parameter
##   21 -        TGTE : Tangent
##   22 -        SIMU : Conditional or non-conditional simulations
##   23 -      FACIES : Facies simulated
##   24 -     GAUSFAC : Gaussian value for Facies
##   25 -        DATE : Date
##   26 -       RKLOW : Rank for lower bound (when discretized)
##   27 -        RKUP : Rank for upper bound (when discretized)
##   28 -         SUM : Constraints on the Sum
## NULL

More with Db

Plotting a Db

We plot the contents of a Db using functions of the package (which rely on ggplot2). The color option is used to represent the January_temp variable.

Note: Unavailable values (NA) are displayed in gray. This will be tunable in future versions.

p = ggDefaultGeographic()
p = p + plot.point(dat, nameColor="January_temp", flagLegendColor = TRUE,
                   legendNameColor="Temperature")
p = p + plot.decoration(title="January Temperature", xlab="Easting", ylab="Northing")
ggPrint(p)

A more elaborate graphic representation displays the samples with a symbol size proportional to the Elevation (nameSize) and a color representing the Temperature (nameColor).

p = ggDefaultGeographic()
p = p + plot.point(dat, nameSize="Elevation", nameColor="January_temp", flagLegendColor = TRUE,
                   legendNameColor="Temperature", legendNameSize="Elevation")
p = p + plot.decoration(title="January Temperature", xlab="Easting", ylab="Northing")
ggPrint(p)