Statsmodels - Geek went Freak!


Loading statsmodels datasets

statsmodels comes with some sample datasets built-in. In this tutorial, we are going to learn how to use datasets in statsmodels.

The built-in datasets live in the statsmodels.datasets package, which is also exposed as statsmodels.api.datasets.

In this tutorial, let's explore statsmodels.api.datasets.fair.

One can load data from the datasets either as numpy.recarray or pandas.core.frame.DataFrame.

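statsmodels.api.datasets.fair.load().data provides data as numpy.recarray. (Newer statsmodels releases return pandas objects from load() as well, so check the type if you rely on a recarray.)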

statsmodels.api.datasets.fair.load_pandas().data provides data as pandas.core.frame.DataFrame.

The following code displays the dataset as a table in an IPython notebook.

import statsmodels.api as sm
dta = sm.datasets.fair.load_pandas().data
dta  # as the last line of a notebook cell, this renders the DataFrame as a table

statsmodels: Use of add_constant

statsmodels.regression.linear_model.OLS does not include an intercept by default. The user is expected to add one manually if required.

Let's consider the following data set:

Y = b0 + (x * b1)

Where b0 = 5, b1 = 2

import statsmodels.api as sm
import numpy as np

lX = np.arange(0, 10)
lY1 = (lX * 2) + 5

Let's run OLS on it:

lRes = sm.OLS(lY1, lX).fit()
lRes.params

You would expect the fitted params to be array([5., 2.]), but lRes.params is:

array([ 2.78947368])

This is because the model did not include an intercept, so the fitted line is forced through the origin.
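
To see where 2.78947368 comes from: with no constant column, least squares fits Y = x * b1 through the origin, and the minimizing slope is sum(x * Y) / sum(x * x). A small check using NumPy only (the name b1_no_intercept is illustrative):

```python
import numpy as np

lX = np.arange(0, 10)
lY1 = (lX * 2) + 5

# Regression through the origin: the single slope that minimizes
# sum((Y - x * b1) ** 2) is sum(x * Y) / sum(x * x).
b1_no_intercept = np.sum(lX * lY1) / np.sum(lX * lX)
print(b1_no_intercept)  # 795 / 285, roughly 2.789
```

The slope comes out larger than 2 because it has to absorb the missing intercept of 5.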

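We can manually add a constant column to include an intercept.

# Reshape to a (10, 1) column, then prepend a column of ones for the intercept
lX4 = np.copy(lX).reshape((10, 1))
lOnes = np.ones(lX4.shape)
lX4 = np.hstack((lOnes, lX4))

lRes = sm.OLS(lY1, lX4).fit()

Now OLS finds the correct params, including the intercept: lRes.params is approximately [5., 2.].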

statsmodels, however, provides a convenience function called add_constant that adds a constant column to the input data set.

lX2 = sm.add_constant(lX)  # prepends a column of ones by default
lRes = sm.OLS(lY1, lX2).fit()