Ml - Geek went Freak!

Why is Pearson's correlation parametric and Spearman's non-parametric?

Note: This post takes the perspective of studying monotonic correlation and working toward linear regression.

There seem to be a lot of myths surrounding the Pearson and Spearman coefficients. One such myth is:

Pearson’s correlation is parametric while Spearman’s coefficient is non-parametric.

This is not entirely true.

First, let's lay out the definition of parametric.

Parametric, here, means that Pearson's correlation coefficient assumes a particular distribution, whereas Spearman's correlation coefficient is distribution-free.

As we discussed in a previous blog post, even though Spearman's correlation performs better than Pearson's correlation on non-normal data sets, Pearson's correlation doesn't actually assume a normal distribution.

The notion that “Pearson correlation assumes normal distribution” stems from old practices of using Pearson’s correlation for significance testing.

Moreover, I don’t know how in the seven worlds “Parametric” can be bent to mean “distribution-dependent”. Maybe a little, but this terminology is mostly counter-intuitive.
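To see the practical difference rather than the terminology, here is a minimal sketch (the exponential data set below is a made-up illustration, not from the original post): for a monotonic but heavily skewed, non-normal relationship, Spearman's rank-based coefficient still reports a perfect 1.0 while Pearson's is pulled down, yet neither computation assumes any particular distribution.

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical monotonic, heavily skewed (non-normal) data set
lX = np.arange(0, 10)
lY = np.exp(lX)

# Spearman works on ranks, so the monotonic relationship gives exactly 1.0
print("Spearman: ", spearmanr(lX, lY)[0])

# Pearson works on the raw values, so the non-linearity pulls it below 1.0
print("Pearson:  ", pearsonr(lX, lY)[0])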

References

  1. Why is Pearson parametric and Spearman non-parametric?

Pearson correlation visualization

The range of correlation coefficient is [-1, 1].

A value of zero means that there is no correlation between X and Y.

A value of 1 means there is perfect correlation between them: when X goes up, Y goes up in a perfectly linear fashion.

A value of -1 is a perfect anti-correlation: when X goes up, Y goes down in an exactly linear manner.

Here is an attempt to visualize this relationship:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
%matplotlib inline

# Generate X
lX = np.arange(0, 10)

# Generate Y1: Normal random
lY1 = np.random.randn(10) + 1

# Generate Y2: X * -2
lY2 = lX * -2

# Generate Y3: X * 2
lY3 = lX * 2

# Generate Y4: sin(X)
lY4 = np.sin(lX)

Let's plot these data sets:

plt.plot(lX, lY1, 'b', label="randn")
plt.plot(lX, lY2, 'r', label="-2 * X")
plt.plot(lX, lY3, 'c', label="2 * X")
plt.plot(lX, lY4, 'm', label="sin(X)")
plt.legend(loc="upper left")
plt.show()

Here is the resulting plot:

Plots of dataset

From the plots above, we would expect:

  1. Y1 to have no or zero correlation
  2. Y2 to have anti-correlation
  3. Y3 to have correlation
  4. ????

Let's compute Pearson's r for each pair and see:

print("r(X, Y1) = ", pearsonr(lX, lY1)[0])
print("r(X, Y2) = ", pearsonr(lX, lY2)[0])
print("r(X, Y3) = ", pearsonr(lX, lY3)[0])
print("r(X, Y4) = ", pearsonr(lX, lY4)[0])

r(X, Y1) = 0.284990463813
r(X, Y2) = -1.0
r(X, Y3) = 1.0
r(X, Y4) = 0.0534535063704

As expected, Y2 showed the strongest anti-correlation and Y3 the strongest correlation.

The random normal data set showed a weak correlation of 0.28.

The interesting data set is Y4. Y4 is non-linearly related to X, but the Pearson correlation coefficient can only detect linear correlation, so it comes out close to zero.

statsmodels: Use of add_constant

statsmodels.regression.linear_model.OLS does not include an intercept by default. The user is expected to add one manually if required.

Let's consider the following data set:

Y = b0 + (X * b1)

where b0 = 5 and b1 = 2.

import statsmodels.api as sm
import numpy as np

lX = np.arange(0, 10)
lY1 = (lX * 2) + 5

Let's run OLS on it:

lRes = sm.OLS(lY1, lX).fit()
lRes.params

You would expect OLS to return the array [5., 2.], but it returns:

array([ 2.78947368])

This is because it didn't include an intercept.

We can manually add a constant column to include intercept.

lX4 = np.copy(lX).reshape((10, 1))
lOnes = np.ones(lX4.shape)
lX4 = np.hstack((lOnes, lX4))

lRes = sm.OLS(lY1, lX4).fit()
lRes.params

Now OLS finds the correct parameters, including the intercept: approximately array([5., 2.]) for this noise-free data set.

statsmodels, however, provides a convenience function called add_constant that adds a constant column to the input data set.

lX2 = sm.add_constant(lX)
lRes = sm.OLS(lY1, lX2).fit()
lRes.params

Does Pearson correlation require normality?

There seem to be a lot of myths surrounding the Pearson coefficient. One such myth is:

Pearson correlation only works when the data sets have normal marginal distribution and bivariate normal distribution.

This is not entirely true.

This misconception seems to arise for three reasons:

  1. Confusion between symmetry and normality
  2. Sample data set vs Population data set
  3. Assumption that Pearson correlation will be used for significance testing

Symmetry vs Normality

Disproving the norm

First, let's look at the population data set itself. Should X and Y be marginally normal and bivariate normal for Pearson correlation to work?

Let's test this using four data samples:

  1. Normal linear data, approximated with a linear uniform distribution of low sample count
  2. Uniform data, generated with a linear uniform distribution of high sample count
  3. Bimodal data, generated by artificially fiddling with a uniform distribution
  4. An exponential data set (exponentially growing values, not an exponential distribution)

import numpy as np

lX10 = np.arange(0, 10)
lY1 = lX10.copy()

lX100 = np.arange(0, 100)
lY2 = lX100.copy()

lY3 = lX100.copy()
lY3[20:40] = 20
lY3[60:80] = 80

lY4 = np.exp(lX10)
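The coefficients quoted below can be reproduced roughly along these lines; a minimal sketch, pairing each Y with the X it was generated from:

from scipy.stats import pearsonr

print("r(X10, Y1)  = ", pearsonr(lX10, lY1)[0])    # normal / linear, low sample count
print("r(X100, Y2) = ", pearsonr(lX100, lY2)[0])   # uniform
print("r(X100, Y3) = ", pearsonr(lX100, lY3)[0])   # bimodal
print("r(X10, Y4)  = ", pearsonr(lX10, lY4)[0])    # exponential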

Normal dataset

Normal dataset scatter plot
Normal dataset distribution

The Pearson coefficient for the normal data set is 1.0, as expected.

Uniform dataset

Uniform dataset scatter plot
Uniform dataset distribution

The Pearson coefficient for the uniform data set is 1.0. Hmm. That was surprising!?

Bimodal dataset

Now, let's try something kinky! Let's do bimodal ;).

Bimodal dataset scatter plot
Bimodal dataset distribution

The Pearson coefficient for the bimodal data set is 0.97617685270275845. What????

Exponential dataset

Let's try something different this time: a distribution that is not symmetric.

Exponential dataset scatter plot
Exponential dataset distribution

The Pearson coefficient for the exponential data set is 0.71687040197302898.

Conclusion

So, Pearson correlation seems to do fine on non-normal data, as long as the data is symmetric.

This also means that Pearson correlation is affected by outliers.
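To make the outlier sensitivity concrete, here is a minimal, made-up sketch: a perfectly linear data set with a single corrupted point loses a large chunk of its correlation.

import numpy as np
from scipy.stats import pearsonr

lX = np.arange(0, 10)
lY = lX.astype(float)

print(pearsonr(lX, lY)[0])   # 1.0 for the clean, perfectly linear data

lY[9] = 100.0                # a single outlier
print(pearsonr(lX, lY)[0])   # drops to roughly 0.6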

Why symmetry?

That is simply how the covariance formula works: every residual from the mean enters the sum directly, so in a skewed data set a handful of extreme values on one side dominate both the covariance and the standard deviations.

What about the myth?

Why is normality required then?

Sample data set vs Population data set

Another important thing to note is that the Pearson coefficient works best when the samples of the independent variable are drawn from a uniform or normal distribution.

Assumption that Pearson correlation will be used for significance testing

Significance tests built on top of Pearson correlation do require that the data sets be marginally normal and bivariate normal. But if you are only interested in monotonic correlation and regression, symmetry is enough.
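To be explicit about where the normality assumption actually lives: scipy's pearsonr returns both the coefficient and a p-value, and it is the p-value (the significance test) that leans on distributional assumptions, not the coefficient itself. A minimal sketch with a made-up skewed data set:

import numpy as np
from scipy.stats import pearsonr

lX = np.arange(0, 100)
lY = np.exp(lX / 10.0)   # hypothetical, heavily skewed data

lR, lP = pearsonr(lX, lY)
print("r =", lR)         # the coefficient itself needs no normality assumption
print("p =", lP)         # the significance test behind the p-value does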

References

  1. Pearson’s or Spearman’s correlation with non-normal data
  2. Why is Pearson parametric and Spearman non-parametric?

Covariance

Covariance measures how linearly two data sets, X and Y, vary with each other.

If Y goes up when X goes up, you expect a significant positive value.

If Y goes down when X goes up, you expect a significant negative value.
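Concretely, the (sample) covariance sums the products of each pair's deviations from their respective means; this is the form NumPy's np.cov, used below, computes by default:

$$ \operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$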

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, HTML
from scipy.stats import pearsonr
%matplotlib inline

lX = np.arange(0, 10)

# Y1: X * 2
lY1 = lX * 2

# Y2: (X * -2) + 18
lY2 = (lX * -2) + 18

# Y3: Random integers (no relationship with X)
lY3 = np.random.randint(0, 18, 10)

plt.plot(lX, lY1, 'b')
plt.plot(lX, lY2, 'g')
plt.plot(lX, lY3, 'y')
plt.show()

Plot of data sets

Correlation

When Y increases as X increases, most of the terms in the covariance sum are positive, so they add up to a significant positive number.

lXRes = lX - lX.mean()
lY1Res = lY1 - lY1.mean()

lFrame = pd.DataFrame({
    'X': lXRes,
    'Y1': lY1Res,
    'XY1': lXRes * lY1Res,
}, columns=["X", "Y1", "XY1"])
display(lFrame)
(lXRes * lY1Res).sum()
X Y1 XY1
0 -4.5 -9 ✔ 40.5
1 -3.5 -7 ✔ 24.5
2 -2.5 -5 ✔ 12.5
3 -1.5 -3 ✔ 4.5
4 -0.5 -1 ✔ 0.5
5 0.5 1 ✔ 0.5
6 1.5 3 ✔ 4.5
7 2.5 5 ✔ 12.5
8 3.5 7 ✔ 24.5
9 4.5 9 ✔ 40.5

This results in a positive covariance,

print("cov(X, Y1) = ", np.cov(lX, lY1)[0][1])

cov(X, Y1) = 18.3333333333

Anti-correlation

When Y decreases as X increases, most of the terms in the covariance sum are negative, so they add up to a significant negative number.

lXRes = lX - lX.mean()
lY2Res = lY2 - lY2.mean()
lFrame = pd.DataFrame({
    'X': lXRes,
    'Y2': lY2Res,
    'XY2': lXRes * lY2Res,
}, columns=["X", "Y2", "XY2"])
display(lFrame)
(lXRes * lY2Res).sum()
X Y2 XY2
0 -4.5 9 ✔ -40.5
1 -3.5 7 ✔ -24.5
2 -2.5 5 ✔ -12.5
3 -1.5 3 ✔ -4.5
4 -0.5 1 ✔ -0.5
5 0.5 -1 ✔ -0.5
6 1.5 -3 ✔ -4.5
7 2.5 -5 ✔ -12.5
8 3.5 -7 ✔ -24.5
9 4.5 -9 ✔ -40.5

This results in a negative covariance,

print("cov(X, Y2) = ", np.cov(lX, lY2)[0][1])

cov(X, Y2) = -18.3333333333

No correlation

When there is no correlation between X and Y, the terms in the covariance sum mostly cancel each other out.

lY3Res = lY3 - lY3.mean()
lFrame = pd.DataFrame({
    'X': lXRes,
    'Y3': lY3Res,
    'XY3': lXRes * lY3Res,
}, columns=["X", "Y3", "XY3"])
display(lFrame)
(lXRes * lY3Res).sum()
X Y3 XY3
0 -4.5 -4.8 ✔ 21.6
1 -3.5 -2.8 ✔ 9.8
2 -2.5 5.2 ✘ -13.0
3 -1.5 -3.8 ✔ 5.7
4 -0.5 -2.8 ✔ 1.4
5 0.5 5.2 ✔ 2.6
6 1.5 -1.8 ✘ -2.7
7 2.5 -2.8 ✘ -7.0
8 3.5 1.2 ✔ 4.2
9 4.5 7.2 ✔ 32.4

print("cov(X, Y3) = ", np.cov(lX, lY3)[0][1])

cov(X, Y3) = 6.11111111111

The problem with covariances, as you can see, is that they are not comparable across data sets. Pearson's correlation coefficient computes a better measure of correlation by dividing the covariance by the product of the standard deviations of the two data sets.
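A quick sanity check of this normalization, using the X and Y1 arrays from above (np.cov defaults to the sample covariance, so the matching ddof=1 standard deviations are used):

lCov = np.cov(lX, lY1)[0][1]
lR = lCov / (lX.std(ddof=1) * lY1.std(ddof=1))

print(lR)                    # 1.0
print(pearsonr(lX, lY1)[0])  # 1.0, the same value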

Pearson correlation coefficient

The Pearson correlation coefficient measures the linear relationship (or lack thereof) between two given data sets X and Y.

Conditions

  1. It can only detect the presence or absence of a linear relationship between X and Y

Formula

The Pearson correlation coefficient is the normalized covariance of X and Y:

$$ r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y} $$

Why does it work?

Pearson's correlation coefficient is a refinement of covariance. This blog post explains why and how covariance measures linear correlation between data sets.

The problem with covariances, as you can see, is that they are not comparable across data sets. Pearson's correlation coefficient computes a better measure of correlation by dividing the covariance by the product of the standard deviations of the two data sets.

This keeps the Pearson correlation coefficient between -1 and 1, with -1 indicating anti-correlation, 1 indicating correlation, and 0 indicating no correlation.
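One way to see why this normalization makes values comparable: shifting or (positively) scaling either data set changes the covariance but leaves the coefficient untouched. A minimal sketch with made-up numbers:

import numpy as np
from scipy.stats import pearsonr

lX = np.arange(0, 10)
lY = (lX * 3) + 7

# The covariance changes with the scale of the data...
print(np.cov(lX, lY)[0][1])          # 27.5
print(np.cov(lX, lY * 100)[0][1])    # 2750.0

# ...but the normalized coefficient does not.
print(pearsonr(lX, lY)[0])           # 1.0
print(pearsonr(lX, lY * 100)[0])     # 1.0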

How to interpret?

The range of correlation coefficient is [-1, 1].

A value of zero means that there is no correlation between X and Y.

A value of 1 means there is perfect correlation between them: when X goes up, Y goes up in a perfectly linear fashion.

A value of -1 is a perfect anti-correlation: when X goes up, Y goes down in an exactly linear manner.

Refer to this blog post for visualization of this relationship.

How to use it?

TODO

When to use it?

TODO

References

  1. Linear correlation and Regression

Negative values in numpy's randn

If you are used to the rand function, which generates neat, uniformly distributed random numbers in the range [0, 1), you will be surprised the first time you use randn, for two reasons:

  1. randn generates negative numbers
  2. randn generates numbers greater than 1 and less than -1

Examples

Negative

import numpy as np

lRandom = np.random.randn(10)
print(lRandom[lRandom < 0])

The above code produced the following output during a sample run:

[-0.52004631 -0.4080691 -0.04164258 -0.46942423 -0.84344794 -0.01001501]

Greater than 2

lRandom = np.random.randn(500)
lRandom[lRandom > 2]

The above code produced the following output during a sample run:

[ 2.09666448 2.29351194 2.16025808 2.78635893 2.3467666 2.54232853 2.35466425 2.26961216 2.62167745 2.0261606 2.00743211]

Reason

This is because randn, unlike rand, generates random numbers drawn from the standard normal distribution, with mean = 0 and variance = 1.
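Since we know the distribution, we can even predict roughly how many samples should land beyond any cutoff. A minimal sketch (the cutoff of 2 matches the earlier example):

import numpy as np
from scipy.stats import norm

# P(Z > 2) for a standard normal is about 0.0228, so out of 500 samples
# we expect roughly 11 values above 2 -- consistent with the run above.
print(norm.sf(2))
print(500 * norm.sf(2))

lRandom = np.random.randn(500)
print((lRandom > 2).sum())   # varies from run to run, typically around 11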

If you plot the histogram of the samples from randn, it becomes quite obvious:

import matplotlib.pyplot as plt

lRandom = np.random.randn(5000)

lHist, lBin = np.histogram(lRandom)

plt.plot(lBin[:-1], lHist, 'r--', linewidth=1)
plt.show()

The above code produced the following output during a sample run:

Histogram of randn samples