Demonstration of gstlearn for the use of a Db
In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
};
Import packages
In [2]:
import numpy as np
import pandas as pd
import sys
import os
import matplotlib.pyplot as plt
import gstlearn as gl
import gstlearn.plot as gp
Global variables
In [3]:
gl.OptCst.define(gl.ECst.NTCOL,6)
gl.law_set_random_seed(13414)
Defining a Data set
The data set is built by drawing samples at random within a given box. The study is carried out in 2-D, but the approach is not limited to two dimensions.
In [4]:
nech = 500
mydb = gl.Db.createFromBox(nech, [0,0], [100, 100])
mydb
Out[4]:
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension = 2
Number of Columns = 3
Maximum Number of UIDs = 3
Total number of samples = 500

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
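To make the sampling step concrete, here is a minimal NumPy sketch of drawing points in a box. It is not gstlearn's implementation: the helper name `sample_in_box`, the use of `numpy.random.default_rng`, and the assumption that the draw is uniform are all illustrative choices, so the coordinates will differ from those produced by `gl.Db.createFromBox`.

```python
import numpy as np

def sample_in_box(nech, coormin, coormax, seed=13414):
    """Hypothetical helper: draw nech points uniformly in an axis-aligned box."""
    rng = np.random.default_rng(seed)
    coormin = np.asarray(coormin, dtype=float)
    coormax = np.asarray(coormax, dtype=float)
    # One row per sample, one column per space dimension
    return rng.uniform(coormin, coormax, size=(nech, coormin.size))

coords = sample_in_box(500, [0, 0], [100, 100])
print(coords.shape)
```

Because the number of columns follows the length of the corner vectors, the same sketch works unchanged for a 3-D box such as `sample_in_box(10, [0, 0, 0], [1, 2, 3])`.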
Displaying the Data set
In [5]:
ax = mydb.plot()
We now draw a random vector of 0-1 values from a Bernoulli distribution with probability 0.2, and add it to the Data Base as a new column named 'sel'.
In [6]:
sel = gl.VectorHelper.simulateBernoulli(nech, 0.2)
gl.VectorHelper.displayStats("Statistics on the Selection vector",sel)
iuid = mydb.addColumns(sel,"sel")
Statistics on the Selection vector
- Number of samples = 500 / 500
- Minimum = 0.000
- Maximum = 1.000
- Mean = 0.186
- St. Dev. = 0.389
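The statistics above can be cross-checked by hand: for a vector of 0-1 values, the standard deviation is tied to the mean, and sqrt(0.186 × (1 − 0.186)) ≈ 0.389, which matches the printed value. The following NumPy sketch illustrates this relation on a simulated vector. It is only an illustration under stated assumptions: the name `sel_sketch` and the generator are hypothetical, and the draw will not reproduce the values returned by `gl.VectorHelper.simulateBernoulli`.

```python
import numpy as np

rng = np.random.default_rng(13414)
nech, proba = 500, 0.2

# A Bernoulli draw is 1 when a uniform value falls below the probability
sel_sketch = (rng.random(nech) < proba).astype(float)

mean = sel_sketch.mean()
std = sel_sketch.std()  # population standard deviation (ddof=0)

# For 0-1 values, the population standard deviation equals sqrt(mean * (1 - mean))
print(mean, std, np.sqrt(mean * (1.0 - mean)))
```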
In [7]:
dbfmt = gl.DbStringFormat.createFromFlags(flag_stats=True, names=["sel"])
mydb.display(dbfmt)
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension = 2
Number of Columns = 4
Maximum Number of UIDs = 4
Total number of samples = 500

Data Base Statistics
--------------------
4 - Name sel - Locator NA
Nb of data = 500
Nb of active values = 500
Minimum value = 0.000
Maximum value = 1.000
Mean value = 0.186
Standard Deviation = 0.389
Variance = 0.151

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = sel - Locator = NA
In [8]:
ax = mydb.plot(name_color="sel")
Extracting a new Data Base upon ranks
We show how to extract a new Data Base by specifying the ranks of the samples to keep from an input Data Base.
In [9]:
ranks = gl.VectorHelper.sampleRanks(mydb.getSampleNumber(), proportion=0.2)
print("Number of selected samples =", len(ranks))
Number of selected samples = 100
In [10]:
mydbred1 = gl.Db.createReduce(mydb, ranks=ranks)
In [11]:
ax = mydbred1.plot()
ax.decoration(title="Extraction by Ranks")
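The rank-based extraction can be pictured with plain NumPy indexing. This is a sketch under stated assumptions, not the gstlearn algorithm: the array `coords` stands in for the coordinate columns, `ranks_sketch` is a hypothetical name, and drawing distinct, sorted, 0-based ranks is an illustrative choice that may differ from what `gl.VectorHelper.sampleRanks` returns.

```python
import numpy as np

rng = np.random.default_rng(13414)
nech, proportion = 500, 0.2
coords = rng.uniform(0.0, 100.0, size=(nech, 2))  # stand-in for the x-1, x-2 columns

# Draw distinct ranks without replacement, then sort them to keep the original order
nsel = int(round(proportion * nech))
ranks_sketch = np.sort(rng.choice(nech, size=nsel, replace=False))

# Keep only the rows whose rank was drawn
reduced = coords[ranks_sketch]
print(len(ranks_sketch), reduced.shape)
```

With a proportion of 0.2 and 500 samples, the reduced array holds 100 rows, in line with the count printed by the notebook.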
Extracting a new Data Base upon selection
We now turn the variable 'sel' into a selection and create a new data set restricted to the active samples only.
In [12]:
mydb.setLocator('sel', gl.ELoc.SEL)
mydbred2 = gl.Db.createReduce(mydb)
mydbred2
Out[12]:
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension = 2
Number of Columns = 4
Maximum Number of UIDs = 4
Total number of samples = 93
Number of active samples = 93

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = sel - Locator = sel
In [13]:
ax = mydbred2.plot()
ax.decoration(title="Extraction by Selection")
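The sample count of the reduced Data Base is consistent with the earlier statistics: a mean of 0.186 over 500 values of 0 or 1 corresponds to 0.186 × 500 = 93 samples set to 1. The sketch below mimics the selection-based extraction with a NumPy boolean mask. It assumes that a sample is active when its selection value is non-zero; the names `sel_sketch` and `active` are hypothetical, and the simulated values do not reproduce those of the notebook.

```python
import numpy as np

rng = np.random.default_rng(13414)
nech = 500
coords = rng.uniform(0.0, 100.0, size=(nech, 2))  # stand-in for the x-1, x-2 columns
sel_sketch = (rng.random(nech) < 0.2).astype(float)

# Assumption: a sample is active when its selection value is non-zero
active = sel_sketch != 0
reduced = coords[active]

# The reduced set holds exactly as many rows as there are active samples
print(int(active.sum()), reduced.shape[0])
```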