Main Classes
Here is a (non-exhaustive) list of classes of objects in gstlearn:
- Db, DbGrid: Numerical data base
- DirParam, VarioParam and Vario: Experimental variograms
- Model: Variogram model
- Neigh: Neighborhood
- Anam: Gaussian anamorphosis
- Polygon: 2-D polygonal shapes
- Rule: Lithotype rule for thresholds used for truncated plurigaussian models
Importing External File
Loading a CSV File
We start by downloading the file called Scotland_Temperatures.csv and storing it in the current working directory. In this example, the file (whose path is stored in filecsv) is provided in CSV format. We load it into a data frame (named datcsv) using the relevant R command. Note that the keyword "MISS" is used in this file to indicate a missing value. Such values are replaced by NA.
filecsv = loadData("Scotland", "Scotland_Temperatures.csv")
datcsv = read.csv(filecsv, na.strings = "MISS")
We can check the contents of the data frame (by simply typing its name) and see that it contains four columns (called Longitude, Latitude, Elevation and January_temp, respectively) and 236 rows (header line excluded).
datcsv
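The effect of the na.strings argument can be checked on a small inline sample. The sketch below uses only base R; the three rows are made-up values for illustration and are not taken from Scotland_Temperatures.csv.

```r
# Made-up sample mimicking the layout of the CSV file (illustrative values only)
txt <- "Longitude,Latitude,Elevation,January_temp
100.0,900.0,50,2.5
110.0,910.0,120,MISS
120.0,920.0,300,1.1"

# Entries equal to "MISS" are converted to NA while reading
toy <- read.csv(text = txt, na.strings = "MISS")

print(toy)
print(is.na(toy$January_temp))   # flags the rows where the value was "MISS"
print(class(toy$January_temp))   # type of the column after the conversion
```

Without na.strings, the "MISS" entries would force the whole column to be read as character strings.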
Creating Db object from a data.frame
The user can then create a database of the gstlearn package (Db class) directly from the previously imported data.frame.
dat = fromTL(datcsv)
dat
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 0
## Number of Columns = 4
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = Longitude - Locator = NA
## Column = 1 - Name = Latitude - Locator = NA
## Column = 2 - Name = Elevation - Locator = NA
## Column = 3 - Name = January_temp - Locator = NA
Creating Db object directly from CSV file
These operations can be performed in a single step by reading the CSV file and loading it directly into a Db.
To do so, we start by creating a CSVformat object using the CSVformat_create function. This object specifies various properties of the file we want to load, namely the presence of a header line (through the argument flagHeader) and the way missing values are coded in the file (through the argument naString).
Then, the function Db_createFromCSV loads the CSV file directly into a gstlearn data base.
csv = CSVformat_create(flagHeader=TRUE, naString = "MISS")
dat = Db_createFromCSV(filecsv, csv=csv)
dat
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 0
## Number of Columns = 5
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = NA
## Column = 2 - Name = Latitude - Locator = NA
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = NA
Note that a "rank" variable has been automatically added. The rank is always 1-based and must be distinguished from an index (0-based) when calling gstlearn functions (except for the [] operator, see below). The rank variable may be useful later for certain functions of the gstlearn package.
Importing Db File from a "Neutral File"
A last solution is to import the data base directly from the set of demonstration files provided together with the package (here the file referenced by fileNF), which are stored in a specific format (Neutral File).
These NF (Neutral Files) are currently used for serialization of the gstlearn objects. They will probably be replaced in the future by a facility that backs up the whole workspace in one step.
Note that the contents of the Db are slightly different from the result obtained when reading from CSV. Essentially, some variables have a Locator field defined and some do not. This concept is described later in this chapter; the difference can be ignored for now.
fileNF = loadData("Scotland", "Scotland_Temperatures.NF")
dat = Db_createFromNF(fileNF)
dat
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 2
## Number of Columns = 5
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
Discovering Db
The Db class
Db objects (like all objects that inherit from AStringable) have a display method that prints a summary of the content of the data base. The same output is produced when typing the name of the variable at the end of a chunk (see above).
dat$display()
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 2
## Number of Columns = 5
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL
As described in the "Data Base Summary" section, this Db object contains 5 fields (called Columns) and 236 data points (called samples). Upon inspection, we see that the 4 variables of the CSV file are present (Columns 1 through 4), along with an additional variable called rank (Column 0).
In addition, the printout indicates that this data base is defined in a 2-D space: this will be described later together with the use of the Locator information.
Remark: To get more information on the contents of the Db, we can pass a DbStringFormat object to the display method of the Db to describe which information we would like to print. Such objects can be created using the function DbStringFormat_createFromFlags. We refer the reader to the documentation of the DbStringFormat class for more details. The example below adds summary statistics about some variables of the Db to the Db summary.
dbfmt = DbStringFormat_createFromFlags(flag_stats=TRUE, names=c("Elevation", "January_temp"))
dat$display(dbfmt)
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 2
## Number of Columns = 5
## Total number of samples = 236
##
## Data Base Statistics
## --------------------
## 4 - Name Elevation - Locator NA
## Nb of data = 236
## Nb of active values = 236
## Minimum value = 2.000
## Maximum value = 800.000
## Mean value = 146.441
## Standard Deviation = 165.138
## Variance = 27270.713
## 5 - Name January_temp - Locator z1
## Nb of data = 236
## Nb of active values = 151
## Minimum value = 0.600
## Maximum value = 5.200
## Mean value = 2.815
## Standard Deviation = 1.010
## Variance = 1.020
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL
Monovariate statistics are better displayed using a single function called dbStatisticsMono. This function expects a vector of enumerators of type EStatOption as statistic operators. Such a vector is created using a static function called fromKeys, which is available in all enumerator classes (i.e. those that inherit from AEnum).
dbStatisticsMono(dat,
names=c("Elevation", "January_temp"),
opers=EStatOption_fromKeys(c("MEAN","MINI","MAXI")))
## Mean Minimum Maximum
## Elevation 87.974 3.000 387.000
## January_temp 2.815 0.600 5.200
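The Elevation statistics printed here differ from those in the earlier display printout, while the January_temp statistics agree. One possible explanation, stated here as an assumption rather than a documented fact, is that dbStatisticsMono restricts the computation to samples where all the requested variables are defined. The base R sketch below, on made-up values, shows how such a restriction changes the statistics of a fully informed variable.

```r
# Made-up values: Elevation is always defined, January_temp is not
toy <- data.frame(
  Elevation    = c(10, 200, 50, 800),
  January_temp = c(3.0, NA, 2.0, NA)
)

stats <- function(x) c(Mean = mean(x), Minimum = min(x), Maximum = max(x))

# Statistics over the samples where each variable is individually defined
all_rows <- t(sapply(toy, function(x) stats(x[!is.na(x)])))

# Statistics restricted to samples where every variable is defined
complete <- toy[complete.cases(toy), ]
iso_rows <- t(sapply(complete, stats))

print(all_rows)
print(iso_rows)
```

In this toy case, the January_temp row is identical in both tables, whereas the Elevation row changes because two of its samples are dropped.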
Accessors for Db class
We can also consider the data base as a data frame and use the [ ] accessors. For instance, the full content of a Db can be displayed as a data.frame as follows.
dat[]
We can access one or several variables. Note that the contents of the Column corresponding to the target variable (i.e. January_temp) are returned as a 1-D vector.
Also note the presence of samples with NA, corresponding to those where the target variable is not informed ('MISS' in the original dataset file).
dat["January_temp"]
## [1] 1.7 2.0 4.6 NA 3.1 3.5 3.4 3.0 4.9 2.9 NA 1.3 NA 4.0 1.7 NA 1.9 3.3
## [19] 2.3 NA 2.3 2.6 NA 2.7 2.9 NA 1.0 1.2 NA 3.1 NA 3.7 2.1 2.5 2.9 NA
## [37] NA NA 3.1 2.1 NA 2.7 3.0 NA NA 1.8 NA NA 2.2 2.9 3.3 NA 5.0 1.6
## [55] NA 2.1 3.2 4.2 1.1 NA 2.7 0.6 3.2 NA 2.5 2.0 2.8 NA 3.2 3.2 4.5 3.3
## [73] 4.1 2.2 1.7 4.3 5.2 NA 1.6 3.9 3.1 NA 3.5 4.7 3.6 NA 1.8 1.7 NA NA
## [91] NA NA NA NA NA 1.7 NA 3.0 4.6 3.9 3.2 1.3 NA NA NA 4.7 NA 2.6
## [109] 2.0 4.7 1.2 2.9 0.9 3.0 NA 3.6 0.7 3.3 NA NA NA 2.7 NA 2.7 2.4 NA
## [127] NA 2.0 2.6 NA 4.3 NA NA NA NA 3.1 3.4 3.1 2.0 1.3 1.9 NA 3.3 2.7
## [145] 4.4 NA 3.0 0.9 0.7 NA 3.6 NA 3.5 NA 2.4 1.0 NA 3.6 NA NA NA NA
## [163] 3.0 NA 3.5 4.0 3.0 3.6 NA 3.2 1.7 2.7 1.9 NA NA 4.4 1.9 3.3 NA NA
## [181] 3.5 1.7 3.0 NA 2.7 NA 1.0 3.3 NA NA 3.2 3.9 NA NA 3.0 NA 3.8 NA
## [199] 2.8 NA 2.9 1.4 2.6 3.0 NA 2.8 2.9 3.6 NA 2.0 4.6 3.7 NA NA 4.5 2.7
## [217] NA 4.7 1.7 1.9 3.5 NA NA NA 2.1 2.3 3.1 NA NA 2.0 2.6 2.8 2.6 NA
## [235] 2.1 2.6
The access can also be more restrictive, as in the following example, where we only consider samples 10 to 15 and the variables rank, Latitude and Elevation. In R, array indices run from 1 to N (1-based). The index slice '10:15' in R means indices {10,11,12,13,14,15} (the last index is included, unlike in Python), which corresponds to ranks {10,11,12,13,14,15}. Be careful: for all other functions of the gstlearn package, indices must be provided 0-based.
dat[10:15, c("rank", "Latitude", "Elevation")]
We can also replace the variable names by their Column indices (1-based in the [] operator) in the data base.
dat[10:15, 3:4]
This is not recommended as the Column index of a given variable may vary over time.
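The slicing rules can be tried on an ordinary data.frame. The following base R sketch uses made-up values and a plain data.frame rather than a Db, so it only illustrates the R indexing conventions, not the gstlearn implementation of the [] operator.

```r
# Made-up table with 20 samples (illustrative values only)
toy <- data.frame(rank = 1:20, Latitude = 901:920, Elevation = seq(10, 200, by = 10))

rows <- 10:15                    # inclusive on both ends: 10, 11, 12, 13, 14, 15
print(length(rows))

sub_by_name  <- toy[rows, c("Latitude", "Elevation")]
sub_by_index <- toy[rows, 2:3]   # same columns, selected by 1-based position
print(identical(sub_by_name, sub_by_index))

# The same samples expressed as 0-based indices
# (the convention expected by the other gstlearn functions)
print(rows - 1)
```

Selecting by name stays valid if columns are added or reordered, which is why it is preferred over positional selection.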
A dedicated function is available to convert the whole data base into an appropriate object of the Target Language (here R). A gstlearn Db is converted into a data.frame using toTL.
dat$toTL()
Finally, an interesting feature of the [ ] accessors is that they make it easy to add new variables to a Db or modify existing ones. For instance, in the next example, a new variable newvar is created and added to the data base dat.
dat["newvar"] = 12.3 * dat["Elevation"] - 2.1 * dat["*temp"]
dat
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 2
## Number of Columns = 6
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## Column = 5 - Name = newvar - Locator = NA
Remark: Note that variable names may be specified using regular-expression-like patterns (for instance, the symbol '*' stands for any sequence of characters, meaning that ["*temp"] selects all the variable names ending with temp).
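To see which names a pattern such as "*temp" would select, the matching can be reproduced in base R. This is only a sketch of the wildcard semantics described above: it relies on glob2rx from the utils package and does not call the matching code used internally by gstlearn.

```r
# Variable names of the data base used in this chapter
varnames <- c("rank", "Longitude", "Latitude", "Elevation", "January_temp")

# glob2rx() converts a wildcard pattern into an equivalent regular expression
rx <- utils::glob2rx("*temp")
print(rx)

# Names matching the pattern
matched <- varnames[grepl(rx, varnames)]
print(matched)
```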
The user can also remove a variable from the data base as follows:
dat$deleteColumn("newvar")
## NULL
dat$display()
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 2
## Number of Columns = 5
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL
Locators
Locators are used to specify the role assigned to a Column for the rest of the study (unless they are modified). A locator is characterized by its name (Z for a variable and X for a coordinate) within the enumeration ELoc.
dat$setLocators(c("Longitude","Latitude"), ELoc_X())
## NULL
dat$setLocator("*temp", ELoc_Z(), cleanSameLocator=TRUE)
## NULL
dat
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 2
## Number of Columns = 5
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
As can be seen in the printout, the variables Longitude and Latitude have been designated as coordinates (pay attention to the order) and January_temp is the (unique) variable of interest. Therefore, any subsequent step will be performed as a monovariate 2-D process.
The locator is translated into a letter-number pair for better legibility: e.g. x1 for the first coordinate.
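The labels seen in the printout (x1, x2, z1) can be rebuilt from a role letter and a rank within that role. The helper locator_label below is hypothetical and written in base R purely for illustration; it is not a gstlearn function and only covers the X and Z roles used here.

```r
# Hypothetical helper (not a gstlearn function): build the printed label
# from a role letter and a 1-based rank within that role
locator_label <- function(role, rank) paste0(tolower(role), rank)

# Coordinates receive their rank in the order in which they are listed
coords <- c("Longitude", "Latitude")
labels <- setNames(locator_label("X", seq_along(coords)), coords)

# The single variable of interest receives rank 1 within the Z role
labels["January_temp"] <- locator_label("Z", 1)
print(labels)
```

Listing the coordinates in the opposite order would swap the x1 and x2 labels, which is why the order matters.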
Here are all the roles known by gstlearn:
ELoc_printAll()
## -1 - UNKNOWN : Unknown locator
## 0 - X : Coordinate
## 1 - Z : Variable
## 2 - V : Variance of measurement error
## 3 - F : External Drift
## 4 - G : Gradient component
## 5 - L : Lower bound of an inequality
## 6 - U : Upper bound of an inequality
## 7 - P : Proportion
## 8 - W : Weight
## 9 - C : Code
## 10 - SEL : Selection
## 11 - DOM : Domain
## 12 - BLEX : Block Extension
## 13 - ADIR : Dip direction Angle
## 14 - ADIP : Dip Angle
## 15 - SIZE : Object height
## 16 - BU : Fault UP termination
## 17 - BD : Fault DOWN termination
## 18 - TIME : Time variable
## 19 - LAYER : Layer rank
## 20 - NOSTAT : Non-stationary parameter
## 21 - TGTE : Tangent
## 22 - SIMU : Conditional or non-conditional simulations
## 23 - FACIES : Facies simulated
## 24 - GAUSFAC : Gaussian value for Facies
## 25 - DATE : Date
## 26 - RKLOW : Rank for lower bound (when discretized)
## 27 - RKUP : Rank for upper bound (when discretized)
## 28 - SUM : Constraints on the Sum
## NULL
More with Db
Plotting a Db
We plot the contents of a Db using functions of the package (which relies on ggplot2). The color option is used to represent the January_temp variable.
Note: Unavailable values (NaN) are displayed in gray. This will be tunable in future versions.
p = ggDefaultGeographic()
p = p + plot.point(dat, nameColor="January_temp", flagLegendColor = TRUE,
legendNameColor="Temperature")
p = p + plot.decoration(title="January Temperature", xlab="Easting", ylab="Northing")
ggPrint(p)
A more elaborate graphic representation displays the samples with a symbol size proportional to the Elevation (nameSize) and a color representing the Temperature (nameColor).
p = ggDefaultGeographic()
p = p + plot.point(dat, nameSize="Elevation", nameColor="January_temp", flagLegendColor = TRUE,
legendNameColor="Temperature", legendNameSize="Elevation")
p = p + plot.decoration(title="January Temperature", xlab="Easting", ylab="Northing")
ggPrint(p)
Of course, you can use your own graphical routines (for example, a direct call to ggplot2) by simply accessing the gstlearn data base values (using the '[ ]' accessor):
p = ggplot()
p = p + geom_point(data=dat[], mapping=aes(x=dat["x1"], y=dat["x2"], color=dat["January_temp"]))
p = p + labs(color = "Temperature")
p = p + labs(x = "Easting", y = "Northing")
p = p + labs(title = "January Temperature")
plot(p)
Grid Data Base
On the same area, a terrain model is available (as a demonstration file available in the package distribution). We first download it and create the corresponding data base defined on a grid support (DbGrid).
fileNF = loadData("Scotland", "Scotland_Elevations.NF")
grid = DbGrid_createFromNF(fileNF)
grid
##
## Data Base Grid Characteristics
## ==============================
##
## Data Base Summary
## -----------------
## File is organized as a regular grid
## Space dimension = 2
## Number of Columns = 4
## Total number of samples = 11097
## Number of active samples = 3092
##
## Grid characteristics:
## ---------------------
## Origin : 65.000 535.000
## Mesh : 4.938 4.963
## Number : 81 137
##
## Variables
## ---------
## Column = 0 - Name = Longitude - Locator = x1
## Column = 1 - Name = Latitude - Locator = x2
## Column = 2 - Name = Elevation - Locator = f1
## Column = 3 - Name = inshore - Locator = sel
We can check that the grid consists of 81 columns and 137 rows, i.e. 11097 grid cells. We can also notice that some locators are already defined (this information is stored in the Neutral File).
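The grid characteristics of the printout are enough to recompute the number of cells and the position of any node. The base R sketch below assumes that the origin is the position of the first node and uses the rounded mesh values as printed, so the coordinates it returns are approximate; node_coords is a hypothetical helper, not a gstlearn function.

```r
nx <- c(81, 137)          # number of nodes along each axis
x0 <- c(65, 535)          # origin (assumed to be the first node)
dx <- c(4.938, 4.963)     # mesh (rounded values as printed)

print(prod(nx))           # total number of grid cells

# Hypothetical helper: coordinates of the node with 1-based grid indices (i, j)
node_coords <- function(i, j) x0 + (c(i, j) - 1) * dx

print(node_coords(1, 1))            # first node
print(node_coords(nx[1], nx[2]))    # last node
```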
Selection
We can check the presence of a variable (called inshore) which is assigned to the sel locator: this corresponds to a Selection, which acts as a binary filter: some grid cells are active and others are masked off. The count of active samples is given in the previous printout (3092). This selection remains active until it is replaced or deleted (there cannot be more than one selection defined at a time per data base). This is what can be seen in the following display, where we represent the Elevation only within the inshore selection.
p = ggDefaultGeographic()
p = p + plot.grid(grid, nameRaster="Elevation", flagLegendRaster=TRUE, legendNameRaster="Elevation")
p = p + plot.decoration(title="Elevation", xlab="Easting", ylab="Northing")
ggPrint(p)
Note that any variable can be considered as a Selection: it must simply be assigned to the sel locator using the setLocator method described earlier.
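The filtering behavior of a selection can be mimicked in base R with a 0/1 column used as a mask. The sketch below works on a made-up data.frame, not on a DbGrid; gstlearn applies the selection internally, so this only illustrates the principle.

```r
# Made-up samples (illustrative values only)
toy <- data.frame(
  Elevation = c(120, 0, 35, 0, 410, 60),
  inshore   = c(1, 0, 1, 0, 1, 1)    # 1 = active, 0 = masked off
)

active <- toy$inshore == 1
print(sum(active))                   # number of active samples

# Computations restricted to the active samples
print(mean(toy$Elevation[active]))

# Any variable can serve as a mask, e.g. one derived from a threshold
toy$high <- as.numeric(toy$Elevation > 100)
print(sum(toy$high == 1))
```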
Final plot
On this final plot, we combine grid and point representations.
p = ggDefaultGeographic()
p = p + plot.grid(grid, nameRaster="Elevation", flagLegendRaster=TRUE, legendNameRaster="Elevation")
p = p + plot.point(dat, nameSize="January_temp", flagLegendSize=TRUE, legendNameSize="Temperature", sizmin=1, sizmax=3, color="yellow")
p = p + plot.decoration(title="Elevation and Temperatures", xlab="Easting", ylab="Northing")
ggPrint(p)