Main Classes
Here is a (non-exhaustive) list of classes of objects in
gstlearn:
- Db, DbGrid: Numerical data base
- DirParam, VarioParam and Vario: Experimental variograms
- Model: Variogram model
- Neigh: Neighborhood
- Anam: Gaussian anamorphosis
- Polygon: 2-D polygonal shapes
- Rule: Lithotype rule for thresholds used for truncated plurigaussian
models
Importing External File
Loading a CSV File
We start by downloading the file called Scotland_Temperatures.csv and storing it in the current working directory. In this example, the file (whose path is held in filecsv) is provided in CSV format. We load it into a data frame (named datcsv) using the relevant R command. Note that the keyword “MISS” is used in this file to indicate a missing value. Such values will be replaced by NA.
filecsv = loadData("Scotland", "Scotland_Temperatures.csv")
datcsv = read.csv(filecsv, na.strings = "MISS")
We can check the contents of the data frame (by simply typing its name) and see that it contains four columns (called Longitude, Latitude, Elevation and January_temp, respectively) and 236 rows (header line excluded).
datcsv
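To see the effect of na.strings in isolation, here is a minimal base-R sketch on a made-up three-row extract (the values are illustrative only and are not taken from the actual file):

```r
# Toy CSV content (hypothetical values) using "MISS" for missing data
txt <- "Longitude,Latitude,Elevation,January_temp
372.1,658.9,255,1.7
303.5,732.4,125,MISS
218.4,745.6,8,4.6"

# na.strings tells read.csv which token stands for a missing value
toy <- read.csv(text = txt, na.strings = "MISS")

# The column is read as numeric, with NA where "MISS" appeared
str(toy$January_temp)
sum(is.na(toy$January_temp))
```

Without na.strings, the "MISS" token would force the whole column to be read as character strings.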
Creating Db object from a data.frame
The user can then create a database of the gstlearn
package (Db class) directly from the previously imported
data.frame.
dat = fromTL(datcsv)
dat
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 0
## Number of Columns = 4
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = Longitude - Locator = NA
## Column = 1 - Name = Latitude - Locator = NA
## Column = 2 - Name = Elevation - Locator = NA
## Column = 3 - Name = January_temp - Locator = NA
Creating Db object directly from CSV file
These operations can also be performed in a single step by reading the CSV file and loading it directly into a Db.
To do so, we start by creating a CSVformat object using the CSVformat_create function. This object is used to specify various properties of the file we want to load, namely the presence of a header line (through the argument flagHeader) and the way missing values are coded in the file (through the argument naString).
Then, the function Db_createFromCSV loads the CSV file directly into a gstlearn data base.
csv = CSVformat_create(flagHeader=TRUE, naString = "MISS")
dat = Db_createFromCSV(filecsv, csv=csv)
dat
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 0
## Number of Columns = 5
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = NA
## Column = 2 - Name = Latitude - Locator = NA
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = NA
Note that a “rank” variable has been automatically added. The rank is always 1-based and must be distinguished from an index (0-based) when calling gstlearn functions (except for the [] operator, see below). The rank variable may be useful later for certain functions of the gstlearn package.
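The relation between the two conventions can be sketched in base R as follows. This is only an illustration: it assumes that the 0-based index of a sample is simply its rank minus one.

```r
# Ranks as stored in the "rank" column: 1-based
ranks <- 1:5

# Corresponding 0-based indices (assumed convention: index = rank - 1)
indices <- ranks - 1

data.frame(rank = ranks, index = indices)
```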
Importing Db File from a “Neutral File”
A last solution is to import the data base directly from the set of demonstration files provided with the package (here called fileNF), which are stored in a specific format (Neutral File).
These NF (or Neutral Files) are currently used for serialization of the gstlearn objects. They will probably be replaced in the future by a facility that backs up the whole workspace in one step.
Note that the contents of the Db are slightly different from the result obtained when reading from CSV. Essentially, some variables have a Locator field defined, some do not. This concept will be described later in this chapter and the difference can be ignored for now.
fileNF = loadData("Scotland", "Scotland_Temperatures.NF")
dat = Db_createFromNF(fileNF)
dat
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 2
## Number of Columns = 5
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
Discovering Db
The Db class
Db objects (like all objects that inherit from AStringable) have a display method that prints a summary of the content of the data base. The same happens when typing the name of the variable at the end of a chunk (see above).
dat$display()
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 2
## Number of Columns = 5
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL
As described in the “Data Base Summary” section, this Db object contains 5 fields (called Columns) and 236 data points (called samples). Upon inspection, we see that the 4 variables of the CSV file are present (Columns 1 through 4), alongside an additional variable called rank (Column 0).
In addition, the summary indicates that this data base is defined in a 2-D space: this will be described later together with the use of the Locator information.
Remark: To get more information on the contents of the Db, we can pass to the display method of a Db a DbStringFormat object describing which information we would like to print. Such objects can be created using the function DbStringFormat_createFromFlags. We refer the reader to the documentation of the DbStringFormat class for more details. The example below shows how to add summary statistics about some variables of the Db to the Db summary.
dbfmt = DbStringFormat_createFromFlags(flag_stats=TRUE, names=c("Elevation", "January_temp"))
dat$display(dbfmt)
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 2
## Number of Columns = 5
## Total number of samples = 236
##
## Data Base Statistics
## --------------------
## 4 - Name Elevation - Locator NA
## Nb of data = 236
## Nb of active values = 236
## Minimum value = 2.000
## Maximum value = 800.000
## Mean value = 146.441
## Standard Deviation = 165.138
## Variance = 27270.713
## 5 - Name January_temp - Locator z1
## Nb of data = 236
## Nb of active values = 151
## Minimum value = 0.600
## Maximum value = 5.200
## Mean value = 2.815
## Standard Deviation = 1.010
## Variance = 1.020
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL
Monovariate statistics are better displayed using a single function called dbStatisticsMono. This function expects a vector of enumerators of type EStatOption as statistic operators. Such a vector is created using a static function called fromKeys, which is available in all enumerator classes (i.e. classes that inherit from AEnum).
dbStatisticsMono(dat,
                 names=c("Elevation", "January_temp"),
                 opers=EStatOption_fromKeys(c("MEAN","MINI","MAXI")))
## Mean Minimum Maximum
## Elevation 87.974 3.000 387.000
## January_temp 2.815 0.600 5.200
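The Elevation statistics printed here differ from those reported by display earlier (for instance a mean of 87.974 instead of 146.441). One possible explanation, stated here as an assumption rather than something confirmed by the text, is that dbStatisticsMono only uses the samples where all the requested variables are defined (151 of them for January_temp). The base-R sketch below, on made-up values, shows how such a restriction changes the result:

```r
# Toy data (hypothetical values): temperature is missing for two samples
toy <- data.frame(
  Elevation    = c(10, 50, 200, 400, 800),
  January_temp = c(4.0, 3.5, 2.0, NA, NA)
)

# Statistics per variable, each one using all of its own defined values
mean(toy$Elevation)
mean(toy$January_temp, na.rm = TRUE)

# Statistics restricted to samples where every listed variable is defined
iso <- toy[complete.cases(toy), ]
mean(iso$Elevation)
```

In this toy case the mean elevation drops once the high-altitude samples without temperature are excluded, which is the same direction as the difference seen in the printouts.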
Accessors for Db class
We can also consider the data base as a data frame and use the [ ] accessors. For instance, the full content of a Db can be displayed as a data.frame as follows.
dat[]
We can access one or several variables. Note that the content of the Column corresponding to the target variable (i.e. January_temp) is returned as a 1-D vector.
Also note the presence of samples with NA, corresponding to those where the target variable is not informed (‘MISS’ in the original dataset file).
dat["January_temp"]
## [1] 1.7 2.0 4.6 NA 3.1 3.5 3.4 3.0 4.9 2.9 NA 1.3 NA 4.0 1.7 NA 1.9 3.3
## [19] 2.3 NA 2.3 2.6 NA 2.7 2.9 NA 1.0 1.2 NA 3.1 NA 3.7 2.1 2.5 2.9 NA
## [37] NA NA 3.1 2.1 NA 2.7 3.0 NA NA 1.8 NA NA 2.2 2.9 3.3 NA 5.0 1.6
## [55] NA 2.1 3.2 4.2 1.1 NA 2.7 0.6 3.2 NA 2.5 2.0 2.8 NA 3.2 3.2 4.5 3.3
## [73] 4.1 2.2 1.7 4.3 5.2 NA 1.6 3.9 3.1 NA 3.5 4.7 3.6 NA 1.8 1.7 NA NA
## [91] NA NA NA NA NA 1.7 NA 3.0 4.6 3.9 3.2 1.3 NA NA NA 4.7 NA 2.6
## [109] 2.0 4.7 1.2 2.9 0.9 3.0 NA 3.6 0.7 3.3 NA NA NA 2.7 NA 2.7 2.4 NA
## [127] NA 2.0 2.6 NA 4.3 NA NA NA NA 3.1 3.4 3.1 2.0 1.3 1.9 NA 3.3 2.7
## [145] 4.4 NA 3.0 0.9 0.7 NA 3.6 NA 3.5 NA 2.4 1.0 NA 3.6 NA NA NA NA
## [163] 3.0 NA 3.5 4.0 3.0 3.6 NA 3.2 1.7 2.7 1.9 NA NA 4.4 1.9 3.3 NA NA
## [181] 3.5 1.7 3.0 NA 2.7 NA 1.0 3.3 NA NA 3.2 3.9 NA NA 3.0 NA 3.8 NA
## [199] 2.8 NA 2.9 1.4 2.6 3.0 NA 2.8 2.9 3.6 NA 2.0 4.6 3.7 NA NA 4.5 2.7
## [217] NA 4.7 1.7 1.9 3.5 NA NA NA 2.1 2.3 3.1 NA NA 2.0 2.6 2.8 2.6 NA
## [235] 2.1 2.6
The selection can also be more restrictive, as in the following example, where we only consider samples 10 to 15 and only the variables rank, Latitude and Elevation. In R, array indices run from 1 to N (1-based). The slice ‘10:15’ in R means indices {10,11,12,13,14,15} (the last index is included, unlike in Python), which here corresponds to ranks {10,11,12,13,14,15}. Be careful: for all other functions of the gstlearn package, indices must be provided 0-based.
dat[10:15, c("rank", "Latitude", "Elevation")]
We can also replace the variable names by their Column indices in the data base (1-based in the [] operator).
dat[10:15, 3:4]
This is not recommended as the Column index of a given variable may
vary over time.
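The following base-R sketch illustrates why selecting by name is more robust than selecting by position. It uses a plain data.frame with made-up values (not a Db), so it only illustrates the general principle:

```r
# Toy data frame (hypothetical values) mimicking the Db layout
toy <- data.frame(rank = 1:3, Longitude = c(1, 2, 3),
                  Latitude = c(4, 5, 6), Elevation = c(7, 8, 9))

# Selecting by name or by 1-based position gives the same columns here
by_name  <- toy[, c("Latitude", "Elevation")]
by_index <- toy[, 3:4]
identical(by_name, by_index)

# After inserting a column at the front, positions shift but names do not
toy2 <- cbind(id = c("a", "b", "c"), toy)
names(toy2[, 3:4])
names(toy2[, c("Latitude", "Elevation")])
```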
A dedicated function is available to convert the whole data base into an appropriate object of the Target Language (here R). A gstlearn Db is converted into a data.frame using toTL.
dat$toTL()
Finally, an interesting feature of the [ ] accessors is that they make it easy to incorporate new variables into a Db or to modify existing ones. For instance, in the next example, a new variable newvar is created and added to the data base dat.
dat["newvar"] = 12.3 * dat["Elevation"] - 2.1 * dat["*temp"]
dat
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 2
## Number of Columns = 6
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## Column = 5 - Name = newvar - Locator = NA
Remark: Note that variable names may be specified using wildcard patterns akin to regular expressions (for instance, the symbol ‘*’ stands for any sequence of characters, meaning that ["*temp"] selects all the variable names ending with temp).
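The effect of such a pattern can be reproduced in base R with utils::glob2rx, which converts a wildcard pattern into a regular expression. This is only a sketch of the idea; the matching rules actually implemented inside gstlearn may differ in their details.

```r
vars <- c("rank", "Longitude", "Latitude", "Elevation", "January_temp")

# glob2rx() converts a wildcard pattern into an anchored regular expression
pattern <- utils::glob2rx("*temp")

# Keep the names matching the pattern
grep(pattern, vars, value = TRUE)
```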
The user can also remove a variable from the data base as follows:
dat$deleteColumn("newvar")
## NULL
dat$display()
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 2
## Number of Columns = 5
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
## NULL
Locators
Locators are used to specify the role assigned to a Column for the rest of the study (unless they are modified). A locator is characterized by its name (Z for a variable and X for a coordinate) within the enumeration ELoc.
dat$setLocators(c("Longitude","Latitude"), ELoc_X())
## NULL
dat$setLocator("*temp", ELoc_Z(), cleanSameLocator=TRUE)
## NULL
dat
##
## Data Base Characteristics
## =========================
##
## Data Base Summary
## -----------------
## File is organized as a set of isolated points
## Space dimension = 2
## Number of Columns = 5
## Total number of samples = 236
##
## Variables
## ---------
## Column = 0 - Name = rank - Locator = NA
## Column = 1 - Name = Longitude - Locator = x1
## Column = 2 - Name = Latitude - Locator = x2
## Column = 3 - Name = Elevation - Locator = NA
## Column = 4 - Name = January_temp - Locator = z1
As can be seen in the printout, variables Longitude and Latitude have been designated as coordinates x1 and x2 (pay attention to the order) and January_temp is the (unique) variable of interest. Therefore any subsequent step will be performed as a monovariate 2-D process.
The locator is translated into a (letter, number) pair for better legibility: e.g. x1 for the first coordinate.
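As an illustration of this naming scheme, the labels seen in the printouts can be rebuilt in base R. The helper locator_label below is hypothetical (it is not part of gstlearn) and assumes that the label is the lower-case role name followed by a 1-based rank.

```r
# Hypothetical helper: build the label from a role name and a 1-based rank
locator_label <- function(role, rank) paste0(tolower(role), rank)

locator_label("X", 1:2)   # coordinates
locator_label("Z", 1)     # variable of interest
```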
Here are all the roles known by
gstlearn:
ELoc_printAll()
## -1 - UNKNOWN : Unknown locator
## 0 - X : Coordinate
## 1 - Z : Variable
## 2 - V : Variance of measurement error
## 3 - F : External Drift
## 4 - G : Gradient component
## 5 - L : Lower bound of an inequality
## 6 - U : Upper bound of an inequality
## 7 - P : Proportion
## 8 - W : Weight
## 9 - C : Code
## 10 - SEL : Selection
## 11 - DOM : Domain
## 12 - BLEX : Block Extension
## 13 - ADIR : Dip direction Angle
## 14 - ADIP : Dip Angle
## 15 - SIZE : Object height
## 16 - BU : Fault UP termination
## 17 - BD : Fault DOWN termination
## 18 - TIME : Time variable
## 19 - LAYER : Layer rank
## 20 - NOSTAT : Non-stationary parameter
## 21 - TGTE : Tangent
## 22 - SIMU : Conditional or non-conditional simulations
## 23 - FACIES : Facies simulated
## 24 - GAUSFAC : Gaussian value for Facies
## 25 - DATE : Date
## 26 - RKLOW : Rank for lower bound (when discretized)
## 27 - RKUP : Rank for upper bound (when discretized)
## 28 - SUM : Constraints on the Sum
## NULL
More with Db
Plotting a Db
We plot the contents of a Db using the functions of the package (which rely on ggplot2). The color option is used to represent the January_temp variable.
Note: Non-available values (NaN) are displayed in gray. This will be tunable in future versions.
p = ggDefaultGeographic()
p = p + plot.point(dat, nameColor="January_temp", flagLegendColor = TRUE,
legendNameColor="Temperature")
p = p + plot.decoration(title="January Temperature", xlab="Easting", ylab="Northing")
ggPrint(p)
A more elaborate graphic representation displays the samples with a symbol size proportional to the Elevation (nameSize) and a color representing the Temperature (nameColor).
p = ggDefaultGeographic()
p = p + plot.point(dat, nameSize="Elevation", nameColor="January_temp", flagLegendColor = TRUE,
legendNameColor="Temperature", legendNameSize="Elevation")
p = p + plot.decoration(title="January Temperature", xlab="Easting", ylab="Northing")
ggPrint(p)
Of course, you can use your own graphical routines (for example, a direct call to ggplot2) by simply accessing the gstlearn data base values (using the ‘[ ]’ accessor):
p = ggplot()
p = p + geom_point(data=dat[], mapping=aes(x=dat["x1"], y=dat["x2"], color=dat["January_temp"]))
p = p + labs(color = "Temperature")
p = p + labs(x = "Easting", y = "Northing")
p = p + labs(title = "January Temperature")
plot(p)
Grid Data Base
On the same area, a terrain model is available (as a demonstration
file available in the package distribution). We first download it and
create the corresponding data base defined on a grid support
(DbGrid).
fileNF = loadData("Scotland", "Scotland_Elevations.NF")
grid = DbGrid_createFromNF(fileNF)
grid
##
## Data Base Grid Characteristics
## ==============================
##
## Data Base Summary
## -----------------
## File is organized as a regular grid
## Space dimension = 2
## Number of Columns = 4
## Total number of samples = 11097
## Number of active samples = 3092
##
## Grid characteristics:
## ---------------------
## Origin : 65.000 535.000
## Mesh : 4.938 4.963
## Number : 81 137
##
## Variables
## ---------
## Column = 0 - Name = Longitude - Locator = x1
## Column = 1 - Name = Latitude - Locator = x2
## Column = 2 - Name = Elevation - Locator = f1
## Column = 3 - Name = inshore - Locator = sel
We can check that the grid is made of 81 columns and 137 rows, i.e. 11097 grid cells. We can also notice that some locators are already defined (this information is stored in the Neutral File).
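The grid characteristics from the printout can be checked with a few lines of base R. The sketch uses the rounded values as printed and assumes that node i (0-based) along an axis is located at origin + i * mesh; this convention is an assumption, not something stated above.

```r
# Grid parameters copied from the printout (rounded values)
x0 <- c(65, 535)        # Origin
dx <- c(4.938, 4.963)   # Mesh
nx <- c(81, 137)        # Number of nodes along each axis

# Total number of grid cells
prod(nx)

# Node coordinates along each axis, assuming node i (0-based)
# sits at origin + i * mesh
xs <- x0[1] + (0:(nx[1] - 1)) * dx[1]
ys <- x0[2] + (0:(nx[2] - 1)) * dx[2]
range(xs)
range(ys)
```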
Selection
We can check the presence of a variable (called inshore) which is assigned to the sel locator: this corresponds to a Selection, which acts as a binary filter: some grid cells are active and others are masked off. The count of active samples is given in the previous printout (3092). This selection remains active until it is replaced or deleted (there cannot be more than one selection defined at a time per data base). This can be seen in the following display, where we represent the Elevation only within the inshore selection.
p = ggDefaultGeographic()
p = p + plot.grid(grid, nameRaster="Elevation", flagLegendRaster=TRUE, legendNameRaster="Elevation")
p = p + plot.decoration(title="Elevation", xlab="Easting", ylab="Northing")
ggPrint(p)
Note that any variable can be considered as a Selection: it must simply be assigned to the sel locator using the setLocator method described earlier.
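The filtering role of a selection can be mimicked in base R on a toy vector. The values are made up, and the coding used (1 for active, 0 for masked off) is an assumption made for the sake of the illustration.

```r
# Toy grid values (hypothetical) and a 0/1 selection variable
elevation <- c(120, 35, 0, 410, 0, 88)
inshore   <- c(1, 1, 0, 1, 0, 1)

# Number of active samples
sum(inshore == 1)

# Masked-off samples are ignored, here by turning them into NA
active_elev <- ifelse(inshore == 1, elevation, NA)
mean(active_elev, na.rm = TRUE)
```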
Final plot
On this final plot, we combine grid and point representations.
p = ggDefaultGeographic()
p = p + plot.grid(grid, nameRaster="Elevation", flagLegendRaster=TRUE, legendNameRaster="Elevation")
p = p + plot.point(dat, nameSize="January_temp", flagLegendSize=TRUE, legendNameSize="Temperature", sizmin=1, sizmax=3, color="yellow")
p = p + plot.decoration(title="Elevation and Temperatures", xlab="Easting", ylab="Northing")
ggPrint(p)