Data Base management (Db)
Import packages
In [1]:
import numpy as np
import pandas as pd
import sys
import os
import matplotlib.pyplot as plt
import gstlearn as gl
import gstlearn.plot as gp
import gstlearn.document as gdoc
gdoc.setNoScroll()
Global variables
In [2]:
gl.OptCst.define(gl.ECst.NTCOL,6)
gl.law_set_random_seed(13414)
Defining a Data set
The data set is defined by simulating samples at random within a given box. This study is performed in 2-D, but this is not a limitation: the box is given by its lower and upper corners, here [0,0] and [100,100].
In [3]:
nech = 500
mydb = gl.Db.createFromBox(nech, [0,0], [100, 100])
mydb
Out[3]:
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension = 2
Number of Columns = 3
Total number of samples = 500

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
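To make the sampling step concrete, here is a minimal standard-library sketch of drawing points inside a box. It assumes the coordinates are drawn uniformly and independently along each axis. `points_in_box` is a hypothetical helper, not part of gstlearn, and it will not reproduce the coordinates generated by `gl.Db.createFromBox`. Because the box is described by its two corners, the same function works in any space dimension.

```python
import random

def points_in_box(n, lower, upper, seed=None):
    """Return n points drawn uniformly in the box [lower, upper].

    Hypothetical helper for illustration only (not a gstlearn function).
    The space dimension is the length of the corner vectors.
    """
    if len(lower) != len(upper):
        raise ValueError("lower and upper corners must have the same dimension")
    rng = random.Random(seed)
    return [
        [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        for _ in range(n)
    ]

# 2-D case mirroring the box used in this notebook
pts_2d = points_in_box(500, [0, 0], [100, 100], seed=13414)
# The same call works unchanged in 3-D
pts_3d = points_in_box(10, [0, 0, 0], [100, 100, 50], seed=1)
print(len(pts_2d), len(pts_2d[0]), len(pts_3d[0]))
```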
Displaying the Data set
In [4]:
ax = mydb.plot()
Next, we draw a random vector of 0/1 integer values from a Bernoulli distribution, where each value is 1 with probability 0.2. This vector is added to the Data Base as a new column named 'sel'.
In [5]:
sel = gl.VectorHelper.simulateBernoulli(nech, 0.2)
gl.VectorHelper.displayStats("Statistics on the Selection vector",sel)
iuid = mydb.addColumns(sel,"sel")
Statistics on the Selection vector
- Number of samples = 500 / 500
- Minimum = 0.000
- Maximum = 1.000
- Mean = 0.186
- St. Dev. = 0.389
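The printed statistics can be cross-checked by hand. A mean of 0.186 over 500 values equal to 0 or 1 corresponds to 93 ones, and for such a vector the variance equals m(1 - m), where m is the mean. The sketch below uses only the standard library. `simulate_bernoulli` and `summary` are hypothetical helpers, not gstlearn functions, and the divisor n in the variance is an assumption chosen because it is consistent with the printed values.

```python
import math
import random

def simulate_bernoulli(n, p, seed=None):
    """Draw n independent 0/1 values, each equal to 1 with probability p.

    Hypothetical stand-in for illustration; not the gstlearn implementation.
    """
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

def summary(values):
    """Return (mean, variance, standard deviation) using a divisor of n."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var, math.sqrt(var)

# A vector with 93 ones among 500 values has the mean printed above (0.186)
vec = [1] * 93 + [0] * 407
mean, var, std = summary(vec)
print(f"mean={mean:.3f} var={var:.3f} std={std:.3f}")

# For 0/1 data the variance is mean * (1 - mean)
assert math.isclose(var, mean * (1 - mean))

# A fresh draw gives a proportion of ones that fluctuates around p
draw = simulate_bernoulli(500, 0.2, seed=0)
print(sum(draw) / len(draw))
```

With a different seed the proportion of ones changes, which is why the notebook reports 0.186 rather than exactly 0.2.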
In [6]:
dbfmt = gl.DbStringFormat.createFromFlags(flag_stats=True, names=["sel"])
mydb.display(dbfmt)
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension = 2
Number of Columns = 4
Total number of samples = 500

Data Base Statistics
--------------------
4 - Name sel - Locator NA
Nb of data = 500
Nb of active values = 500
Minimum value = 0.000
Maximum value = 1.000
Mean value = 0.186
Standard Deviation = 0.389
Variance = 0.151

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = sel - Locator = NA
In [7]:
ax = mydb.plot(nameColor="sel")
Extracting a new Data Base upon ranks
A new Data Base can be extracted from an input Data Base by specifying the ranks of the samples to keep. Here the ranks are sampled at random so as to retain 20% of the samples.
In [8]:
ranks = gl.VectorHelper.sampleRanks(mydb.getSampleNumber(), proportion=0.2)
print("Number of selected samples =", len(ranks))
Number of selected samples = 100
In [9]:
mydbred1 = gl.Db.createReduce(mydb, ranks=ranks)
In [10]:
ax = mydbred1.plot()
ax.decoration(title="Extraction by Ranks")
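As a rough illustration of extraction by ranks, the following sketch draws a subset of ranks and keeps the matching rows. It assumes that ranks are drawn without replacement and returned in increasing order, which the notebook does not state. `sample_ranks` and `reduce_by_ranks` are hypothetical helpers that do not call gstlearn.

```python
import random

def sample_ranks(ntotal, proportion, seed=None):
    """Pick round(ntotal * proportion) distinct ranks in [0, ntotal).

    Hypothetical helper; sampling without replacement and sorted output
    are assumptions made for this illustration.
    """
    rng = random.Random(seed)
    count = round(ntotal * proportion)
    return sorted(rng.sample(range(ntotal), count))

def reduce_by_ranks(rows, ranks):
    """Return the rows located at the given ranks, in the order of `ranks`."""
    return [rows[i] for i in ranks]

# Toy table: one (rank, x, y) tuple per sample
rows = [(i + 1, float(i), float(2 * i)) for i in range(500)]
ranks = sample_ranks(len(rows), proportion=0.2, seed=1)
reduced = reduce_by_ranks(rows, ranks)
print("Number of selected samples =", len(reduced))
```

Each retained row keeps its original values, so the reduced table is a plain subset of the input.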
Extracting a new Data Base upon selection
We now turn the variable 'sel' into a selection and create a new data set restricted to the active samples only.
In [11]:
mydb.setLocator('sel', gl.ELoc.SEL)
mydbred2 = gl.Db.createReduce(mydb)
mydbred2
Out[11]:
Data Base Characteristics
=========================

Data Base Summary
-----------------
File is organized as a set of isolated points
Space dimension = 2
Number of Columns = 4
Total number of samples = 93
Number of active samples = 93

Variables
---------
Column = 0 - Name = rank - Locator = NA
Column = 1 - Name = x-1 - Locator = x1
Column = 2 - Name = x-2 - Locator = x2
Column = 3 - Name = sel - Locator = sel
In [12]:
ax = mydbred2.plot()
ax.decoration(title="Extraction by Selection")
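Conceptually, extraction by selection is a row mask applied to every column. The sketch below shows this on a toy table stored as a dictionary of columns. It assumes a sample is active when its selection value is nonzero. `reduce_by_selection` is a hypothetical helper and not a gstlearn function.

```python
def reduce_by_selection(columns, sel_name):
    """Keep only the rows whose selection value is nonzero.

    `columns` maps a column name to a list of values (one per sample).
    Hypothetical helper for illustration; not a gstlearn function.
    """
    mask = columns[sel_name]
    return {
        name: [v for v, keep in zip(values, mask) if keep]
        for name, values in columns.items()
    }

# Toy table with the same column names as the Data Base above
table = {
    "rank": [1, 2, 3, 4, 5],
    "x-1": [10.0, 20.0, 30.0, 40.0, 50.0],
    "x-2": [5.0, 15.0, 25.0, 35.0, 45.0],
    "sel": [0, 1, 0, 0, 1],
}
reduced = reduce_by_selection(table, "sel")
print(reduced["rank"])
```

As in the reduced Data Base printed above, the selection column is kept in the result, where every remaining value is 1.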