Pythonml - Geek went Freak!

Loading sklearn datasets

Sample datasets, or "toy datasets" as sklearn calls them, reside in the sklearn.datasets package.

A dataset can be loaded by calling the matching sklearn.datasets.load_*() function.
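To see which loaders a given installation provides, one option is to scan the package namespace for names that start with load_. This is only a small sketch built on dir(); the variable name loaders is illustrative, and the exact list depends on the installed sklearn version.

```python
import sklearn.datasets as datasets

# Collect every public name in sklearn.datasets that follows the load_* pattern
loaders = sorted(name for name in dir(datasets) if name.startswith("load_"))

for name in loaders:
    print(name)
```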

In this post, let us consider the iris dataset, which can be loaded using sklearn.datasets.load_iris().

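By default, sklearn returns a dataset as a Bunch object. Current releases expose the class as sklearn.utils.Bunch; older releases reported it as sklearn.datasets.base.Bunch.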

from sklearn.datasets import load_iris
irisData = load_iris()
print(type(irisData))

The Bunch structure is convenient since it holds data, target, feature_names and target_names. The data and target fields are both numpy.ndarray, containing the independent and dependent variables respectively.

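from sklearn.datasets import load_iris
irisData = load_iris()
# The container type, then the types of the two arrays it holds
print(type(irisData))
print(type(irisData.data), type(irisData.target))
# Column names for data and class labels for target
print(irisData.feature_names)
print(irisData.target_names)
# Feature matrix and the integer class index of each sample
print(irisData.data)
print(irisData.target)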
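To make the relationship between the fields concrete, here is a minimal sketch that checks the array shapes and maps the integer targets back to their class names. The names iris, X, y, species and counts are illustrative only and are not part of sklearn.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# One row per sample, one column per entry in feature_names
print(X.shape)  # (150, 4)
print(y.shape)  # (150,)

# target holds integer class indices; indexing target_names with it
# yields the class name of every sample
species = iris.target_names[y]
print(species[:3])

# Number of samples per class
counts = {str(name): int((y == i).sum()) for i, name in enumerate(iris.target_names)}
print(counts)  # {'setosa': 50, 'versicolor': 50, 'virginica': 50}
```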

sklearn's load functions can also return the features and targets directly, as a (data, target) tuple of numpy.ndarray, when called with return_X_y=True.

from sklearn.datasets import load_iris
# With return_X_y=True the result is a (data, target) tuple
irisData = load_iris(return_X_y=True)
print(irisData[0])  # features
print(irisData[1])  # targets
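Since the result is a plain tuple, it can be unpacked in one step. The sketch below also compares the unpacked arrays with the fields of the Bunch to confirm that both call styles yield the same values; X, y, bunch, same_features and same_targets are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris

# Unpack the (data, target) tuple directly
X, y = load_iris(return_X_y=True)

# Load the Bunch form as well and compare its fields with the tuple
bunch = load_iris()
same_features = np.array_equal(X, bunch.data)
same_targets = np.array_equal(y, bunch.target)

print(same_features, same_targets)
```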

Loading statsmodels datasets

statsmodels comes with some sample datasets built in. In this tutorial, we are going to learn how to use them.

The built-in datasets live in the statsmodels.datasets package, which is also reachable as statsmodels.api.datasets.

In this tutorial, let's explore statsmodels.api.datasets.fair.

Depending on the statsmodels version, one can load data from the datasets either as numpy.recarray or as pandas.core.frame.DataFrame.

statsmodels.api.datasets.fair.load().data provides data as numpy.recarray in older releases; newer releases return pandas objects from load() as well.

statsmodels.api.datasets.fair.load_pandas().data provides data as pandas.core.frame.DataFrame.

The following code will display the dataset as a table in an IPython notebook.

import statsmodels.api as sm
dta = sm.datasets.fair.load_pandas().data
dta