Data Base management (Db)

Import packages

In [1]:
import numpy as np
import pandas as pd
import sys
import os
import matplotlib.pyplot as plt
import gstlearn as gl
import gstlearn.plot as gp
import gstlearn.document as gdoc

gdoc.setNoScroll()

Global variables

In [2]:
gl.OptCst.define(gl.ECst.NTCOL,6)
gl.law_set_random_seed(13414)

Defining a Data set

The data set is built by simulating samples at random within a given box. The study is carried out in 2-D, although this is not a limitation.

In [3]:
nech = 500
mydb = gl.Db.createFromBox(nech, [0,0], [100, 100])
mydb
Out[3]:
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 3
Total number of samples      = 500

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
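
To make the sampling step concrete, here is a minimal pure-Python sketch of drawing points at random inside a box. It assumes that each coordinate is drawn independently and uniformly between the lower and upper corner; this is an assumption about the behaviour, not a description of the gstlearn implementation. The helper `points_in_box` is hypothetical and is not part of gstlearn. Because it loops over the corner coordinates, the same sketch works in any space dimension.

```python
import random


def points_in_box(n, lower, upper, seed=None):
    """Draw n points uniformly at random inside an axis-aligned box.

    Hypothetical helper for illustration only (not a gstlearn function).
    lower and upper are the coordinates of the two opposite corners.
    """
    if len(lower) != len(upper):
        raise ValueError("lower and upper must have the same dimension")
    rng = random.Random(seed)
    return [
        [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        for _ in range(n)
    ]


# Same box and sample count as in the cell above
pts = points_in_box(500, [0, 0], [100, 100], seed=13414)
print(len(pts), "points in dimension", len(pts[0]))
```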

Displaying the Data set

In [4]:
ax = mydb.plot()

We now draw a random vector of 0/1 integer values from a Bernoulli distribution with probability 0.2, and add it to the Data Base as a new column named 'sel'.

In [5]:
sel = gl.VectorHelper.simulateBernoulli(nech, 0.2)
gl.VectorHelper.displayStats("Statistics on the Selection vector",sel)
iuid = mydb.addColumns(sel,"sel")
Statistics on the Selection vector
 - Number of samples = 500 / 500
 - Minimum  =      0.000
 - Maximum  =      1.000
 - Mean     =      0.186
 - St. Dev. =      0.389
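
The statistics above can be cross-checked by hand: for a vector of 0/1 values with mean m, the population standard deviation is sqrt(m(1-m)), and sqrt(0.186 × 0.814) ≈ 0.389, which matches the printed value. The sketch below illustrates this with the standard library only. The helper `simulate_bernoulli` is hypothetical (it is not the gstlearn function), and since it uses a different random generator its draws will not reproduce the figures above.

```python
import math
import random


def simulate_bernoulli(n, p, seed=None):
    """Return n independent 0/1 draws, each equal to 1 with probability p.

    Hypothetical stand-in for illustration only (not a gstlearn function).
    """
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]


values = simulate_bernoulli(500, 0.2, seed=13414)
mean = sum(values) / len(values)

# Population standard deviation (division by n, not n - 1)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# For 0/1 data, the standard deviation is fully determined by the mean
std_from_mean = math.sqrt(mean * (1.0 - mean))

print(f"mean = {mean:.3f}")
print(f"std = {std:.3f}, sqrt(m(1-m)) = {std_from_mean:.3f}")
```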
 
In [6]:
dbfmt = gl.DbStringFormat.createFromFlags(flag_stats=True, names=["sel"])
mydb.display(dbfmt)
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 4
Total number of samples      = 500

Data Base Statistics
--------------------
4 - Name sel - Locator NA
 Nb of data          =        500
 Nb of active values =        500
 Minimum value       =      0.000
 Maximum value       =      1.000
 Mean value          =      0.186
 Standard Deviation  =      0.389
 Variance            =      0.151

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = sel - Locator = NA
 
In [7]:
ax = mydb.plot(nameColor="sel")

Extracting a new Data Base by ranks

We show how to extract a new Data Base by specifying the ranks of the samples to keep from an input Data Base.

In [8]:
ranks = gl.VectorHelper.sampleRanks(mydb.getSampleNumber(), proportion=0.2)
print("Number of selected samples =", len(ranks))
Number of selected samples = 100
In [9]:
mydbred1 = gl.Db.createReduce(mydb, ranks=ranks)
In [10]:
ax = mydbred1.plot()
ax.decoration(title="Extraction by Ranks")
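
The idea behind extraction by ranks can be sketched on a plain list of rows. The two helpers below, `sample_ranks` and `reduce_by_ranks`, are hypothetical names used for illustration; they are not the gstlearn functions. The sketch assumes that ranks are drawn without replacement and returned sorted, which may differ from what gstlearn does internally.

```python
import random


def sample_ranks(n, proportion, seed=None):
    """Pick round(n * proportion) distinct ranks in [0, n), sorted.

    Hypothetical helper for illustration only (not a gstlearn function).
    """
    rng = random.Random(seed)
    count = round(n * proportion)
    return sorted(rng.sample(range(n), count))


def reduce_by_ranks(rows, ranks):
    """Return a new list containing only the rows at the given ranks."""
    return [rows[i] for i in ranks]


# Toy table: one row per sample (rank, x1, x2)
toy_rows = [(i, float(i), float(2 * i)) for i in range(500)]
picked = sample_ranks(len(toy_rows), 0.2, seed=13414)
subset = reduce_by_ranks(toy_rows, picked)
print("Number of selected samples =", len(subset))  # 100
```

The input list is left untouched: as with the extraction above, the reduced data set is a new object.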

Extracting a new Data Base by selection

We now turn the variable 'sel' into a selection and create a new data set restricted to the active samples only.

In [11]:
mydb.setLocator('sel', gl.ELoc.SEL)
mydbred2 = gl.Db.createReduce(mydb)
mydbred2
Out[11]:
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension              = 2
Number of Columns            = 4
Total number of samples      = 93
Number of active samples     = 93

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = sel - Locator = sel
In [12]:
ax = mydbred2.plot()
ax.decoration(title="Extraction by Selection")
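
Extraction by selection amounts to keeping the samples whose selection flag is non-zero. The sketch below shows this on a small list; `reduce_by_selection` is a hypothetical helper, not a gstlearn function. It also checks that the 93 active samples reported above agree with the mean of 'sel' printed earlier, since 93 / 500 = 0.186.

```python
def reduce_by_selection(rows, selection):
    """Keep the rows whose selection flag is non-zero (active samples).

    Hypothetical helper for illustration only (not a gstlearn function).
    """
    if len(rows) != len(selection):
        raise ValueError("rows and selection must have the same length")
    return [row for row, flag in zip(rows, selection) if flag]


toy_rows = [("a", 1.0), ("b", 2.0), ("c", 3.0), ("d", 4.0)]
toy_sel = [1, 0, 0, 1]
active = reduce_by_selection(toy_rows, toy_sel)
print(active)  # [('a', 1.0), ('d', 4.0)]

# Consistency check with the figures reported in this notebook
proportion_active = 93 / 500
print(proportion_active)  # 0.186
```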