Demonstration of gstlearn: using a Db

In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

Import packages

In [2]:
import numpy as np
import pandas as pd
import sys
import os
import matplotlib.pyplot as plt
import gstlearn as gl
import gstlearn.plot as gp

Global variables

In [3]:
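# Limit table printouts to 6 columns
gl.OptCst.define(gl.ECst.NTCOL,6)
# Fix the random seed so that the simulations below are reproducible
gl.law_set_random_seed(13414)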

Defining a Data set

The data set is built by simulating samples at random within a given box. The study is performed in 2-D, but this is not a limitation.

In [4]:
nech = 500
mydb = gl.Db.createFromBox(nech, [0,0], [100, 100])
mydb
Out[4]:
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 3
Maximum Number of UIDs       = 3
Total number of samples      = 500

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
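
To make the construction concrete, here is a minimal plain-Python sketch of drawing points at random inside a box. It assumes the draw is uniform along each axis, which the notebook does not state explicitly; `simulate_points_in_box` is a hypothetical helper written for this illustration and is not part of gstlearn.

```python
import random


def simulate_points_in_box(n, coormin, coormax, seed=None):
    """Draw n points uniformly at random inside an axis-aligned box.

    Hypothetical helper for illustration only; not a gstlearn function.
    The space dimension is given by the length of coormin / coormax.
    """
    if len(coormin) != len(coormax):
        raise ValueError("coormin and coormax must have the same length")
    rng = random.Random(seed)
    return [
        tuple(rng.uniform(lo, hi) for lo, hi in zip(coormin, coormax))
        for _ in range(n)
    ]


# Same box and sample count as in the cell above
points = simulate_points_in_box(500, [0, 0], [100, 100], seed=13414)
print(len(points), len(points[0]))
```

Because the dimension comes from the length of the corner vectors, the same helper works in 3-D by passing three-element corners, in line with the remark that 2-D is not a limitation.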

Displaying the Data set

In [5]:
ax = mydb.plot()

We now draw a vector of random 0/1 integer values from a Bernoulli distribution with probability 0.2 (the probability of drawing 1). This vector is added to the Data Base as a new column named 'sel'.

In [6]:
sel = gl.VectorHelper.simulateBernoulli(nech, 0.2)
gl.VectorHelper.displayStats("Statistics on the Selection vector",sel)
iuid = mydb.addColumns(sel,"sel")
Statistics on the Selection vector
 - Number of samples = 500 / 500
 - Minimum  =      0.000
 - Maximum  =      1.000
 - Mean     =      0.186
 - St. Dev. =      0.389
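
As a rough illustration of what such a vector looks like, the sketch below draws independent 0/1 values in plain Python. `simulate_bernoulli` is a hypothetical stand-in written for this note, not gstlearn's implementation, and its random stream differs from gstlearn's, so the exact values will not match the statistics above.

```python
import random


def simulate_bernoulli(n, proba, seed=None):
    """Return n independent 0/1 draws, each equal to 1 with probability proba.

    Hypothetical helper for illustration only; not a gstlearn function.
    """
    rng = random.Random(seed)
    return [1 if rng.random() < proba else 0 for _ in range(n)]


sel = simulate_bernoulli(500, 0.2, seed=13414)
mean = sum(sel) / len(sel)
print(f"mean = {mean:.3f}")
```

The empirical mean of a finite draw rarely equals the target probability exactly. With 500 samples and a probability of 0.2, the standard error of the mean is sqrt(0.2 * 0.8 / 500), about 0.018, so the value 0.186 reported above is well within ordinary sampling fluctuation.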
 
In [7]:
dbfmt = gl.DbStringFormat.createFromFlags(flag_stats=True, names=["sel"])
mydb.display(dbfmt)
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 4
Maximum Number of UIDs       = 4
Total number of samples      = 500

Data Base Statistics
--------------------
4 - Name sel - Locator NA
 Nb of data          =        500
 Nb of active values =        500
 Minimum value       =      0.000
 Maximum value       =      1.000
 Mean value          =      0.186
 Standard Deviation  =      0.389
 Variance            =      0.151

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = sel - Locator = NA
 
In [8]:
ax = mydb.plot(name_color="sel")

Extracting a new Data Base by ranks

A new Data Base can be extracted from an input Data Base by specifying the ranks of the samples to keep. Here the ranks are drawn at random so as to retain 20% of the samples.

In [9]:
ranks = gl.VectorHelper.sampleRanks(mydb.getSampleNumber(), proportion=0.2)
print("Number of selected samples =", len(ranks))
Number of selected samples = 100
In [10]:
mydbred1 = gl.Db.createReduce(mydb, ranks=ranks)
In [11]:
ax = mydbred1.plot()
ax.decoration(title="Extraction by Ranks")
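
The idea behind extraction by ranks can be sketched in plain Python: draw a fixed number of distinct sample indices, then keep only the rows at those indices. Both helpers below are hypothetical and written for this illustration; how gstlearn draws or orders the ranks internally is not described in this notebook, so the sorting step is an assumption.

```python
import random


def sample_ranks(ntotal, proportion, seed=None):
    """Pick round(ntotal * proportion) distinct ranks in [0, ntotal).

    Hypothetical helper; the ranks are returned sorted (an assumption).
    """
    count = int(round(ntotal * proportion))
    rng = random.Random(seed)
    return sorted(rng.sample(range(ntotal), count))


def reduce_by_ranks(rows, ranks):
    """Keep only the rows whose index appears in ranks."""
    return [rows[i] for i in ranks]


# Toy table: (rank, x1, x2) for 500 samples
rows = [(i, float(i), float(2 * i)) for i in range(500)]
ranks = sample_ranks(len(rows), proportion=0.2, seed=13414)
subset = reduce_by_ranks(rows, ranks)
print(len(subset))  # 100
```

With this approach the size of the extracted set is fixed in advance by the proportion (20% of 500 gives 100 samples), independently of the values stored in the table.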

Extracting a new Data Base by selection

We now turn the variable 'sel' into a selection and create a new data set restricted to the active samples only.

In [12]:
mydb.setLocator('sel', gl.ELoc.SEL)
mydbred2 = gl.Db.createReduce(mydb)
mydbred2
Out[12]:
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 4
Maximum Number of UIDs       = 4
Total number of samples      = 93
Number of active samples     = 93

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = sel - Locator = sel
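
The 93 samples kept here can be cross-checked against the statistics printed earlier for 'sel'. For a 0/1 variable with mean m, the variance computed with a division by the number of samples is m(1 - m). The short check below is a worked verification, not part of the original notebook.

```python
import math

n_total = 500   # samples in the full Data Base
n_active = 93   # samples kept by the selection (those with sel == 1)

mean = n_active / n_total
variance = mean * (1.0 - mean)  # variance of a 0/1 variable, dividing by n
std = math.sqrt(variance)

print(f"{mean:.3f} {variance:.3f} {std:.3f}")  # 0.186 0.151 0.389
```

These values agree with the mean, variance and standard deviation reported for 'sel' in the Data Base statistics.
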
In [13]:
ax = mydbred2.plot()
ax.decoration(title="Extraction by Selection")
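
Conceptually, extraction by selection amounts to filtering rows with a 0/1 mask. The sketch below shows this in plain Python on a toy table; `reduce_by_selection` is a hypothetical helper written for this illustration and is not the gstlearn implementation.

```python
def reduce_by_selection(rows, sel):
    """Keep only the rows whose selection flag equals 1.

    Hypothetical helper for illustration only; not a gstlearn function.
    """
    if len(rows) != len(sel):
        raise ValueError("rows and sel must have the same length")
    return [row for row, flag in zip(rows, sel) if flag == 1]


# Toy table of (rank, x1, x2) rows and a matching 0/1 selection vector
rows = [(0, 1.5, 2.0), (1, 3.0, 4.5), (2, 7.0, 0.5), (3, 9.0, 9.0)]
sel = [1, 0, 0, 1]

active = reduce_by_selection(rows, sel)
print(active)  # [(0, 1.5, 2.0), (3, 9.0, 9.0)]
```

Unlike extraction by ranks, where the number of retained samples is chosen beforehand, the size of the result here depends on the content of the selection vector, which is why the reduced Data Base above holds 93 samples rather than 100.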