Numpy - Geek went Freak!

Numpy

Loading sklearn datasets

Datasets, or toy datasets as sklearn calls them, reside in the sklearn.datasets package.

A dataset can be loaded using the corresponding sklearn.datasets.load_*() function.

In this post, let us consider the iris dataset, which can be loaded using sklearn.datasets.load_iris().

By default, sklearn provides datasets as a Bunch object (sklearn.datasets.base.Bunch in older releases; recent releases expose it as sklearn.utils.Bunch).

from sklearn.datasets import load_iris
irisData = load_iris()
print(type(irisData))

The Bunch structure is convenient since it holds data, target, feature_names and target_names. The data and target fields are both numpy.ndarray objects, containing the independent and dependent variables respectively.

from sklearn.datasets import load_iris
irisData = load_iris()
print(type(irisData))
print(type(irisData.data), type(irisData.target))
print(irisData.feature_names)
print(irisData.target_names)
print(irisData.data)
print(irisData.target)

The sklearn load functions can also return the features and targets directly, as a tuple of two numpy.ndarray objects, when called with the return_X_y argument.

from sklearn.datasets import load_iris
irisData = load_iris(return_X_y=True)
print(irisData[0])
print(irisData[1])
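Since return_X_y=True yields a two-element tuple, it can be unpacked straight into two names. The sketch below shows that, along with dictionary-style access on the Bunch; the names X, y, iris and same_data are only illustrative and are not from the original post.

```python
from sklearn.datasets import load_iris

# Unpack the (features, target) tuple directly into two names
X, y = load_iris(return_X_y=True)

print(X.shape)  # (150, 4): 150 samples, 4 features
print(y.shape)  # (150,): one class label per sample

# A Bunch is a dict subclass, so its fields are reachable
# both as attributes and as keys
iris = load_iris()
same_data = (iris["data"] == iris.data).all()
print(same_data)
```

Unpacking is handy when the arrays go straight into an estimator, while the Bunch is the better choice when feature_names and target_names are needed as well.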

Loading statsmodels datasets

statsmodels comes with some sample datasets built in. In this tutorial, we are going to learn how to use them.

The built-in datasets are available in the package statsmodels.api.datasets.

In this tutorial, let's explore statsmodels.api.datasets.fair.

One can load data from the datasets either as a numpy.recarray or as a pandas.core.frame.DataFrame.

statsmodels.api.datasets.fair.load().data provides the data as a numpy.recarray.

statsmodels.api.datasets.fair.load_pandas().data provides the data as a pandas.core.frame.DataFrame.

The following code will display the dataset as a table in an IPython notebook.

import statsmodels.api as sm
dta = sm.datasets.fair.load_pandas().data
dta

Pearson correlation visualization

The Pearson correlation coefficient ranges over [-1, 1].

A value of zero means that there is no linear correlation between X and Y.

A value of 1 means there is perfect correlation between them: when X goes up, Y goes up in a perfectly linear fashion.

A value of -1 is a perfect anti-correlation: when X goes up, Y goes down in an exactly linear manner.

Here is an attempt to visualize this relationship:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
%matplotlib inline

# Generate X
lX = np.arange(0, 10)

# Generate Y1: standard normal noise shifted by 1
lY1 = np.random.randn(10) + 1

# Generate Y2: X * -2
lY2 = lX * -2

# Generate Y3: X * 2
lY3 = lX * 2

# Generate Y4: sin(X)
lY4 = np.sin(lX)

Let's plot these series:

plt.plot(lX, lY1, 'b', label="randn")
plt.plot(lX, lY2, 'r', label="-2 * X")
plt.plot(lX, lY3, 'c', label="2 * X")
plt.plot(lX, lY4, 'm', label="sin(X)")
plt.legend(loc="upper left")
plt.show()

The code above produces the following plot of the data set:

Plots of dataset

From the plots above, we would expect:

  1. Y1 to have little or no correlation
  2. Y2 to have anti-correlation
  3. Y3 to have correlation
  4. Y4: ????

print("r(X, Y1) = ", pearsonr(lX, lY1)[0])
print("r(X, Y2) = ", pearsonr(lX, lY2)[0])
print("r(X, Y3) = ", pearsonr(lX, lY3)[0])
print("r(X, Y4) = ", pearsonr(lX, lY4)[0])

r(X, Y1) = 0.284990463813
r(X, Y2) = -1.0
r(X, Y3) = 1.0
r(X, Y4) = 0.0534535063704

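As we expected, Y2 showed perfect anti-correlation (r = -1) and Y3 showed perfect correlation (r = 1).

The random normal data set showed a weak correlation of 0.28. With only 10 points, a value this far from zero can arise by chance.

The interesting data set is Y4. Y4 is completely determined by X, but the relationship is non-linear, and the Pearson correlation coefficient can only detect linear correlation. That is why r(X, Y4) is only about 0.05.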
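To make this limitation more concrete, here is a minimal sketch with a parabola in place of the sine wave. The arrays x and y are made up for illustration, and np.corrcoef is used instead of scipy's pearsonr only to keep the snippet dependent on numpy alone.

```python
import numpy as np

# x is symmetric around zero, and y is fully determined by x
x = np.arange(-5, 6)
y = x ** 2

# Off-diagonal entry of the 2x2 correlation matrix is Pearson's r
r = np.corrcoef(x, y)[0, 1]
print("r(x, x**2) =", r)
```

Even though y can be predicted exactly from x, r comes out as zero (up to floating-point error), because the rising and falling halves of the parabola cancel each other out in the covariance.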

Pearson correlation coefficient

The Pearson correlation coefficient measures the strength of the linear relationship (or lack thereof) between two given data sets X and Y.

Conditions

  1. It can only detect the presence or absence of a linear relationship between X and Y

Formula

The Pearson correlation coefficient is the covariance of X and Y, normalized by the product of their standard deviations:

$$ r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y} $$
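The formula translates almost directly into numpy. The following is a sketch rather than a replacement for scipy.stats.pearsonr: pearson_r is a hypothetical helper name, and it assumes population covariance and population standard deviations (ddof=0) so that the normalization is consistent.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Population covariance: mean of the products of deviations from the means
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    # np.std defaults to the population standard deviation (ddof=0),
    # which matches the covariance computed above
    return cov_xy / (x.std() * y.std())

lX = np.arange(0, 10)
r_neg = pearson_r(lX, lX * -2)      # perfectly anti-correlated
r_pos = pearson_r(lX, lX * 2)       # perfectly correlated
r_sin = pearson_r(lX, np.sin(lX))   # non-linear relationship
print(r_neg, r_pos, r_sin)
```

Using the sample versions (ddof=1) for both the covariance and the standard deviations would give the same r, since the correction factors cancel.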

Why does it work?

Pearson’s correlation coefficient is an improvement on covariance. This blog post explains why and how covariance measures linear correlation between data sets.

The problem with covariances is that they are not comparable: their magnitude depends on the units and scale of the data. Pearson’s correlation coefficient gives a better measure of correlation by dividing the covariance by the product of the standard deviations of both data sets.

This keeps the Pearson correlation coefficient between -1 and 1, with -1 indicating perfect anti-correlation, 1 indicating perfect correlation and 0 indicating no linear correlation.
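The comparability problem is easy to see numerically. In the sketch below, the arrays x and y are made-up illustrative values; rescaling x (say, from metres to centimetres) changes the covariance but not the correlation coefficient.

```python
import numpy as np

# Small made-up data set, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Same data, with x expressed in units 100 times smaller
x_scaled = 100 * x

cov_original = np.cov(x, y)[0, 1]
cov_scaled = np.cov(x_scaled, y)[0, 1]

r_original = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(x_scaled, y)[0, 1]

print("covariance:", cov_original, "vs", cov_scaled)
print("pearson r: ", r_original, "vs", r_scaled)
```

The covariance grows by the same factor of 100 as the data, while r stays at 0.8 in both cases, which is what makes r comparable across data sets.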

How to interpret?

The Pearson correlation coefficient ranges over [-1, 1].

A value of zero means that there is no linear correlation between X and Y.

A value of 1 means there is perfect correlation between them: when X goes up, Y goes up in a perfectly linear fashion.

A value of -1 is a perfect anti-correlation: when X goes up, Y goes down in an exactly linear manner.

Refer to this blog post for visualization of this relationship.

How to use it?

TODO

When to use it?

TODO


Negative values in numpy's randn

If you are used to the rand function, which generates neat, uniformly distributed random numbers in the range [0, 1), you will be surprised when you use randn for the first time, for two reasons:

  1. randn generates negative numbers
  2. randn generates numbers greater than 1 and less than -1

Examples

Negative

import numpy as np

lRandom = np.random.randn(10)
# Boolean mask keeps only the negative samples
print(lRandom[lRandom < 0])

The above code produced the following output during a sample run:

[-0.52004631 -0.4080691 -0.04164258 -0.46942423 -0.84344794 -0.01001501]

Greater than 2

lRandom = np.random.randn(500)
# Boolean mask keeps only the samples greater than 2
print(lRandom[lRandom > 2])

The above code produced the following output during a sample run:

[ 2.09666448 2.29351194 2.16025808 2.78635893 2.3467666 2.54232853 2.35466425 2.26961216 2.62167745 2.0261606 2.00743211]

Reason

This is because randn, unlike rand, draws its random numbers from the standard normal distribution, which has mean 0 and variance 1. Its values are concentrated around 0 but are not bounded on either side.
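A quick numerical check of those properties is sketched below. The seed, the sample size and the values of mu and sigma are arbitrary choices made for this illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the run repeatable
samples = np.random.randn(100_000)

# Sample statistics should be close to mean 0 and standard deviation 1
sample_mean = samples.mean()
sample_std = samples.std()

# Roughly 68% of standard normal samples fall within one
# standard deviation of the mean
frac_within_one = np.mean(np.abs(samples) <= 1)

print("mean:", sample_mean)
print("std: ", sample_std)
print("fraction in [-1, 1]:", frac_within_one)

# Shifting and scaling gives a normal distribution with other parameters
mu, sigma = 5.0, 2.0
shifted = mu + sigma * samples
print("shifted mean/std:", shifted.mean(), shifted.std())
```

The remaining third or so of the samples lies outside [-1, 1], which is why values above 1 or below -1 show up so readily.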

If you plot the histogram of the samples from randn, it becomes quite obvious:

import matplotlib.pyplot as plt

lRandom = np.random.randn(5000)

lHist, lBin = np.histogram(lRandom)

plot = plt.plot(lBin[:-1], lHist, 'r--', linewidth=1)
plt.show()

The above code produced the following output during a sample run:

Histogram of randn samples