Statistics

This notebook gives elementary information on the calculation of basic statistics.

In [1]:
import gstlearn.document as gdoc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import gstlearn as gl 
from IPython.display import Markdown

gdoc.setNoScroll()

We will illustrate these notions using the Scotland temperatures data set.

In [2]:
data = pd.read_csv(gdoc.loadData("Scotland", "Scotland_Temperatures.csv"))
data
Out[2]:
     Longitude  Latitude  Elevation  January_temp
0        372.1     658.9        255           1.7
1        303.5     665.9        125             2
2        218.4     597.9          8           4.6
3        245.0     955.0         90          MISS
4        326.8     691.2         32           3.1
...        ...       ...        ...           ...
231      273.2     564.6         47           2.8
232      333.9     730.1         30           2.6
233      185.0     655.0        115          MISS
234      259.8     587.9        119           2.1
235      260.8     668.6        107           2.6

236 rows × 4 columns

In [3]:
z = data["Elevation"].to_numpy()

1) Position

In [4]:
Markdown(gdoc.loadDoc("Statistics_mean.md"))
Out[4]:

Mean

$$\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$$

In [5]:
m = np.mean(z)
rounded = np.round(m,decimals=2)
print("Mean = " + str(rounded))
Mean = 146.44
In [6]:
Markdown(gdoc.loadDoc("Statistics_median.md"))
Out[6]:

Median

The median $m$ of a set of values is a value such that half of the observations are below it and half are above.

  • If the size $n$ is odd, it is the value $z_{(\frac{n+1}{2})}$, where $z_{(i)}$ denotes the $i$-th observation when the values are sorted in increasing order.

  • If $n$ is even, we can take $\frac{z_{(\frac{n}{2})} + z_{(\frac{n}{2}+1)}}{2}$.

The median is less sensitive than the mean to extreme values.
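
As a minimal sketch of this rule (the even case applies here, since $n = 236$):

# Median "by hand": sort the values and apply the odd/even rule
zs = np.sort(z)
n = len(zs)
if n % 2 == 1:
    med = zs[(n + 1)//2 - 1]               # middle value (0-based indexing)
else:
    med = (zs[n//2 - 1] + zs[n//2]) / 2    # mean of the two middle values
print("Median (by hand) =", med)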

In [7]:
print("Median = " + str(np.median(z)))
Median = 85.5
In [8]:
Markdown(gdoc.loadDoc("Statistics_quartiles.md"))
Out[8]:

Quartiles

The quartiles are the values which divide the samples as follows:

  • lower quartile: 25% of the individuals are below
  • upper quartile: 25% of the individuals are above
In [9]:
print("Quartiles = " + str(np.quantile(z,[0,0.25,0.5,0.75,1])))
Quartiles = [  2.    28.5   85.5  216.25 800.  ]
In [10]:
Markdown(gdoc.loadDoc("Statistics_quantiles.md"))
Out[10]:

Quantiles (p)

We can generalize to any proportion $p$.

The $p$-quantile, denoted $q_p$, is a value which divides the samples such that a proportion $p$ of the individuals are below it.

The median is $q_{1/2}$. The lower quartile is $q_{1/4}$ and the upper quartile is $q_{3/4}$.
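
For instance, np.quantile gives any $p$-quantile directly (a minimal sketch with $p = 0.9$):

# 90% of the elevations lie below this value
print("q(0.9) =", np.quantile(z, 0.9))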

2) Dispersion

In [11]:
Markdown(gdoc.loadDoc("Statistics_range.md"))
Out[11]:

Range

$$\max_{i=1,\ldots,n} z_i - \min_{i=1,\ldots,n} z_i$$

In [12]:
print("Range= " + str(np.max(z)-np.min(z)))
Range= 798
In [13]:
Markdown(gdoc.loadDoc("Statistics_interq.md"))
Out[13]:

Inter-quartiles distance

$$q_{3/4} - q_{1/4}$$

In [14]:
print("Inter-quartiles distance = " + str(np.diff(np.quantile(z,[0.25,0.75]))))
Inter-quartiles distance = [187.75]
In [15]:
Markdown(gdoc.loadDoc("Statistics_variance.md"))
Out[15]:

Variance

The variance measures the average squared distance between each individual and the mean:

$$\frac{1}{n}\sum_{i=1}^{n}(z_i - \bar{z})^2$$

It is sometimes more convenient to use the equivalent formula

$$\frac{1}{n}\sum_{i=1}^{n} z_i^2 - \bar{z}^2 = \overline{z^2} - \bar{z}^2$$

Note that for statistical reasons, one often prefers to use

$$\frac{1}{n-1}\sum_{i=1}^{n}(z_i - \bar{z})^2$$

for the variance. The two formulas give close results when $n$ is large.

In [16]:
# Variance
n = len(z)
print("Variance (First formula) = " + str(np.mean((z-np.mean(z))**2)))
print("Variance (Sec.  formula) = " + str(np.mean(z**2)-np.mean(z)**2))
print("Variance (numpy version) = " + str(np.var(z)))
Variance (First formula) = 27270.71258259121
Variance (Sec.  formula) = 27270.712582591204
Variance (numpy version) = 27270.71258259121
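
The unbiased version (with the $\frac{1}{n-1}$ factor) is obtained in numpy through the ddof argument:

# Unbiased variance estimator: divides by n-1 instead of n
print("Variance (unbiased) =", np.var(z, ddof=1))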
In [17]:
Markdown(gdoc.loadDoc("Statistics_std.md"))
Out[17]:

Standard Deviation

To have a measure in the same unit as the variable, one often considers the standard deviation:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(z_i - \bar{z})^2}$$

In [18]:
print("Variance (numpy version) = " + str(np.std(z)))
Variance (numpy version) = 165.13846487899545

3) Distribution

In [19]:
Markdown(gdoc.loadDoc("Statistics_histo.md"))
Out[19]:

Histogram

To have a good idea of the distribution of a variable, one can compute the histogram.

The idea is to

  • divide the range of the variable $[\min, \max]$ into small intervals (here, we only treat the case where all intervals have the same size),
  • compute the number of samples in each interval.

The normalized histogram rescales the ordinate so that the total area is equal to 1.

In [20]:
nbin = 20
ax = plt.hist(z,bins=nbin)
[Figure: histogram of Elevation]
In [21]:
#Histogram (normalized)
ax = plt.hist(z,bins=nbin,density=True)
[Figure: normalized histogram of Elevation]
In [22]:
Markdown(gdoc.loadDoc("Statistics_histocum.md"))
Out[22]:

Cumulated histogram

We can represent the cumulated histogram. It is a function which computes, for each value, the proportion of individuals below this value. It can be written as

$$F(z_c) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{]z_i,+\infty[}(z_c)$$

where $\mathbb{1}_A$ is the indicator function of the set $A$:

$$\mathbb{1}_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{otherwise} \end{cases}$$
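
As a direct translation of this formula (a minimal sketch, the cut-off value 100 being arbitrary):

# F(100): proportion of elevations strictly below the cut-off
zc = 100.
print("F(100) =", np.mean(z < zc))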

In [23]:
#Cumulative histogram
p = 0.8
x = np.sort(z)
y = np.linspace(1/len(z),1,len(x))
a = plt.plot(x, y)
a = plt.scatter(np.quantile(z,p),p,c="r")
[Figure: cumulated histogram, with the 0.8-quantile marked in red]
In [24]:
Markdown(gdoc.loadDoc("Statistics_quantileF.md"))
Out[24]:

Quantile function

If we invert the two axes, we obtain the quantile function which gives, for each proportion $p$, the quantile $q_p$:

$$q(p) = F^{-1}(p)$$

In [25]:
#Quantile function
p = 0.8
plt.plot(y,x)
a = plt.scatter(p,np.quantile(z,p),c="r")
[Figure: quantile function, with the 0.8-quantile marked in red]
In [26]:
Markdown(gdoc.loadDoc("Statistics_Ore.md"))
Out[26]:

Ore

In mining, we often consider the ore function

$$T(z_c) = 1 - F(z_c)$$

Indeed, it gives the proportion of the data which are above a cut-off.

In [27]:
#Ore
ore = 1. - y
a = plt.plot(x,ore)
[Figure: ore function T(zc)]
In [28]:
Markdown(gdoc.loadDoc("Statistics_Metal.md"))
Out[28]:

Metal

$$Q(z_c) = \frac{1}{n}\sum_{i=1}^{n} z_i\, \mathbb{1}_{]z_c,+\infty[}(z_i)$$

It accumulates the values of the samples above the cut-off $z_c$.

In [29]:
#Metal (normalized)
metal = 1/len(x)*(np.sum(x)-np.cumsum(x))
a = plt.plot(x, metal)
[Figure: metal function Q(zc)]
In [30]:
Markdown(gdoc.loadDoc("Statistics_Grade.md"))
Out[30]:

Grade

$$m(z_c) = \frac{Q(z_c)}{T(z_c)}$$

It is the mean value of the samples above the cut-off $z_c$.

In [31]:
#Grade
a = plt.plot(x[:-1],metal[:-1]/ore[:-1])
[Figure: grade function m(zc)]
In [32]:
Markdown(gdoc.loadDoc("Statistics_QT.md"))
Out[32]:

Q(T) curve

We simply represent the Metal with respect to the Ore for various cut-off values $z_c$.

In [33]:
#Q(T) curve
a = plt.plot(ore, metal/metal[0])
a = plt.plot([0,1],[0,1],"--")
[Figure: Q(T) curve with the first bisector (dashed)]
In [34]:
Markdown(gdoc.loadDoc("Statistics_Benefit.md"))
Out[34]:

Conventional benefit

$$B(z_c) = Q(z_c) - z_c\, T(z_c)$$

In [35]:
#Benefit
a = plt.plot(x,metal-x*ore)
[Figure: conventional benefit B(zc)]

4) Bivariate Statistics

In [36]:
Markdown(gdoc.loadDoc("Statistics_Bivariate.md"))
Out[36]:

Now we consider two variables:

  • $z^{(1)} = (z^{(1)}_1, \ldots, z^{(1)}_n)$
  • $z^{(2)} = (z^{(2)}_1, \ldots, z^{(2)}_n)$

and we will study their relationship.

In [37]:
# Keep only the samples where the temperature is informed
temp = data["January_temp"].to_numpy()
elev = data["Elevation"].to_numpy()
sel = temp != "MISS"
z2 = temp[sel].astype("float")
z1 = elev[sel]
In [38]:
Markdown(gdoc.loadDoc("Statistics_Covariance.md"))
Out[38]:

Covariance

We can compute the covariance between the two vectors $z^{(1)}$ and $z^{(2)}$:

$$\mathrm{cov}(z^{(1)}, z^{(2)}) = \frac{1}{n}\sum_{i=1}^{n}\left(z^{(1)}_i - \bar{z}^{(1)}\right)\left(z^{(2)}_i - \bar{z}^{(2)}\right)$$

where $\bar{z}^{(j)}$ is the mean of the variable $z^{(j)}$, $j = 1, 2$.
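
Note that np.cov divides by $n-1$ by default; translating the $\frac{1}{n}$ formula above literally (a minimal sketch) gives a slightly different value:

# Covariance with the 1/n convention (np.cov uses 1/(n-1) unless bias=True)
cov_manual = np.mean((z1 - np.mean(z1)) * (z2 - np.mean(z2)))
print("Covariance (1/n) =", cov_manual)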

In [39]:
# Covariance
print("Covariance = " + str(np.cov(z1,z2)[0,1]))
Covariance = -72.91027814569537
In [40]:
Markdown(gdoc.loadDoc("Statistics_Correlation.md"))
Out[40]:

Correlation coefficient

The covariance depends on the scales of $z^{(1)}$ and $z^{(2)}$. In order to have a scale-invariant measure, we can use the correlation coefficient

$$\rho = \frac{\mathrm{cov}(z^{(1)}, z^{(2)})}{\sqrt{\mathrm{var}(z^{(1)})\,\mathrm{var}(z^{(2)})}}$$

The correlation coefficient lies within $[-1, 1]$.

When it is equal to $-1$ or $1$, the variables are linked by a linear relationship

$$z^{(2)} = a\, z^{(1)} + b$$

where the sign of $a$ corresponds to the sign of $\rho$.

When $\rho = 0$, we say that the variables are uncorrelated, but they can still be linked (in a non-linear way).
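
As a sanity check against np.corrcoef (a minimal sketch; the $\frac{1}{n}$ factors cancel between numerator and denominator):

# Correlation coefficient computed from its definition
rho = np.mean((z1 - np.mean(z1)) * (z2 - np.mean(z2))) / (np.std(z1) * np.std(z2))
print("rho =", rho)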

In [41]:
print("Correlation coefficient",np.round(np.corrcoef(z1,z2)[0,1],4))
Correlation coefficient -0.8023
In [42]:
Markdown(gdoc.loadDoc("Statistics_CovarianceM.md"))
Out[42]:

Covariance matrix

When we have several variables $z^{(1)}, \ldots, z^{(p)}$, we can compute their covariance matrix $\Sigma$, which stores the covariances between each pair of variables:

$$\Sigma = \begin{bmatrix} \mathrm{var}(z^{(1)}) & \mathrm{cov}(z^{(1)}, z^{(2)}) & \ldots & \mathrm{cov}(z^{(1)}, z^{(p)}) \\ \mathrm{cov}(z^{(2)}, z^{(1)}) & \mathrm{var}(z^{(2)}) & \ldots & \mathrm{cov}(z^{(2)}, z^{(p)}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(z^{(p)}, z^{(1)}) & \mathrm{cov}(z^{(p)}, z^{(2)}) & \ldots & \mathrm{var}(z^{(p)}) \end{bmatrix}$$

Note that this matrix is symmetric.

If the variables (centered by their means) are stored in a matrix $Z_c$ (one column per variable), then

$$\Sigma = \frac{1}{n} Z_c^T Z_c$$

where $^T$ denotes transposition.

In other words, $Z_c^T$ is the matrix where each row is a variable.
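
This identity can be checked numerically (a minimal sketch, to be compared with np.cov(z1, z2, bias=True) below):

# Sigma = (1/n) Zc^T Zc, with one centered variable per column of Zc
Zc = np.column_stack([z1 - np.mean(z1), z2 - np.mean(z2)])
Sigma = Zc.T @ Zc / len(z1)
print(Sigma)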

In [43]:
#Covariance matrix
print("Covariance matrix = \n" + str(np.cov(z1,z2)))
Covariance matrix = 
[[ 8.04385263e+03 -7.29102781e+01]
 [-7.29102781e+01  1.02658631e+00]]
In [44]:
print("Variance",np.var(z1) * len(z1)/(len(z1)-1))
Variance 8043.852626931566
In [45]:
print("Covariance matrix = \n" + str(np.cov(z1,z2,bias=True)))
Covariance matrix = 
[[ 7.99058208e+03 -7.24274286e+01]
 [-7.24274286e+01  1.01978773e+00]]
In [46]:
Markdown(gdoc.loadDoc("Statistics_Scatter.md"))
Out[46]:

Scatter plot

We can represent the scatter plot between the two variables (only isotopic samples are represented).

In [47]:
a = plt.scatter(z1,z2,s=1)
[Figure: scatter plot of z2 (temperature) against z1 (elevation)]
In [48]:
Markdown(gdoc.loadDoc("Statistics_Regr.md"))
Out[48]:

Here the relation could be considered as linear. Let's try to find the coefficients of the regression line.

Linear regression

Simple linear regression

We can model the relationship between $z^{(1)}$ and $z^{(2)}$ by using the linear regression model

$$z^{(2)} = a\, z^{(1)} + b + R$$

where $R$ is a residual.

We try to find $(a, b)$ by minimizing the sum of the squared differences between $z^{(2)}_i$ and $a\, z^{(1)}_i + b$:

$$\|R\|^2 = \sum_{i=1}^{n}\left(z^{(2)}_i - (a\, z^{(1)}_i + b)\right)^2.$$

We can show that the coefficient $a$ can be estimated by

$$\hat{a} = \frac{\mathrm{cov}(z^{(1)}, z^{(2)})}{\mathrm{var}(z^{(1)})}$$

and $b$ by

$$\hat{b} = \bar{z}^{(2)} - \hat{a}\, \bar{z}^{(1)}$$

In [49]:
# Regression: ahat = cov(z1,z2)/var(z1), bhat = mean(z2) - ahat*mean(z1)
ahat = np.cov(z1,z2,bias=True)[0,1]/np.var(z1)
bhat = np.mean(z2) - ahat*np.mean(z1)
plt.scatter(z1,z2,s=1)
a = plt.plot([np.min(z1),np.max(z1)],[bhat+ahat*np.min(z1),bhat+ahat*np.max(z1)])
[Figure: scatter plot with the fitted regression line]
In [50]:
Markdown(gdoc.loadDoc("Statistics_MRegr.md"))
Out[50]:

Multiple linear regression

When we have several variables $x^{(1)}, \ldots, x^{(p)}$ to explain a variable of interest $y$, we can also use a linear regression:

$$y = \sum_{j=1}^{p} \beta_j x^{(j)} + \beta_0 + R$$

Note that for convenience, we will rewrite the relation as

$$y = \sum_{j=0}^{p} \beta_j x^{(j)} + R$$

where the variable $x^{(0)}$ is equal to 1.

Last, we can rewrite more compactly

$$y = X\beta + R$$

where

$$\beta = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_p \end{bmatrix}$$

and $X$ is the table with all the observations: the first column contains 1's and then each column is a variable,

$$X = \begin{bmatrix} 1 & x^{(1)} & \ldots & x^{(p)} \end{bmatrix}$$

As in the simple linear regression case, we will try to minimize

$$\|R\|^2 = \|y - X\beta\|^2$$

We can show that

$$\hat{\beta} = (X^T X)^{-1} X^T y$$
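
As a minimal sketch with a single explanatory variable (z1, so that the result should match ahat and bhat computed above):

# Normal equations: betahat = (X^T X)^{-1} X^T y
X = np.column_stack([np.ones(len(z1)), z1])   # first column of 1's
betahat = np.linalg.solve(X.T @ X, X.T @ z2)
print("betahat =", betahat)                   # [bhat, ahat]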

In [51]:
Markdown(gdoc.loadDoc("Statistics_hist2d.md"))
Out[51]:

Bivariate histogram

To represent the two variables together, we can compute a 2d histogram.

In [52]:
# 2d histogram
nbin = 15
ax = plt.hist2d(z1,z2,nbin)
[Figure: 2d histogram of (z1, z2)]
In [53]:
Markdown(gdoc.loadDoc("Statistics_histcond.md"))
Out[53]:

Conditional distribution

Then we could look at the histogram of z2 for a given class of z1.

For instance, if we consider the 2nd class, $z_1 \in [28.6, 54.2]$:

It shows the (empirical) conditional distribution of z2 knowing that $z_1 \in [28.6, 54.2]$.

In [54]:
# Histogram of z2 restricted to the 2nd class of z1 (between edges 1 and 2 of the 2d histogram)
axc = plt.hist(z2[np.where((z1<ax[1][2])*(z1>ax[1][1]))],bins=nbin,density=True)
[Figure: conditional histogram of z2 for the 2nd class of z1]
In [55]:
Markdown(gdoc.loadDoc("Statistics_meancond.md"))
Out[55]:

Conditional mean (or regression)

In the same spirit, we can consider the mean of z2 within each class of z1: it is the conditional mean.

If we iterate over all the classes, we obtain the empirical regression.

In [56]:
## Conditional mean
ax = plt.hist2d(z1,z2,15)
plt.scatter(z1,z2,s=1)
edges = ax[1]                              # bin edges of z1
centers = 0.5*(edges[:-1]+edges[1:])       # center of each class
m = np.full(centers.shape, np.nan)         # nan where a class is empty
for i in range(len(centers)):
    ind = np.where(np.logical_and(z1>edges[i],z1<edges[i+1]))[0]
    if len(ind)>0:
        m[i] = np.mean(z2[ind])
ax = plt.plot(centers,m,c="r")
[Figure: conditional means (red) over the 2d histogram and scatter plot]