### What is a dataframe

###### Sun 24 April 2016

Back when *Big Data* was a genuinely new concept and *Data Scientists*
weren't rock stars, the SciPy ecosystem was
beginning to mature. It felt as though Python deserved to dominate the
data analysis world, but uptake was still slow outside the realm of the
physical sciences and engineering. Something seemed to be missing, and
there was a new language on the block which was stealing all the
attention: R. As well as having more mature statistical libraries, R had
a killer feature for anyone doing ad hoc data investigation: the Data
Frame.

I remember the feeling of relief and optimism when this missing feature was addressed by the arrival of Pandas, bringing data frames to Python. Finally we had a tool which could bring the power of NumPy to a more general audience and stem the flow of attention towards R.

Pandas data frames have the flexibility of R's but use NumPy arrays underneath to give big efficiency gains for bulk computations. Pandas' success is based on how it manages to hide this implementation from users, allowing them to see data frames in terms of whatever they are already familiar with. To a software engineer they look like database tables; to a physicist they may look like annotated matrices; to many with a general numerate background they look like spreadsheets.

I want to demonstrate that data frames are an amalgamation of all of these different models with some clever manipulation under the hood to paper over the cracks. Even though a data frame seems completely general, there are a few restrictions that are not obvious. Understanding the data frame's restrictions also helps us see through its usability tricks to the true data model underneath.

## Inside the data frame

When I want to understand the implementation of a data structure I often
mess around at the IPython prompt to find out what attributes and
methods an object has. Using IPython autocomplete I can see our data
frames have an attribute `_data` which looks promising. What does it
look like?

    print '== DF1 : 2 cols, float, int'
    print df._data, '\n'
    print '== DF1.T'
    print df.T._data, '\n'
    print '== DF2 : 3 cols, float, int, string'
    print df2._data, '\n'
    print '== DF2.T'
    print df2.T._data, '\n'

    == DF1 : 2 cols, float, int
    BlockManager
    Items: Index([u'a', u'b'], dtype='object')
    Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
    FloatBlock: slice(0, 1, 1), 1 x 5, dtype: float64
    IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64

    == DF1.T
    BlockManager
    Items: Int64Index([0, 1, 2, 3, 4], dtype='int64')
    Axis 1: Index([u'a', u'b'], dtype='object')
    FloatBlock: slice(0, 5, 1), 5 x 2, dtype: float64

    == DF2 : 3 cols, float, int, string
    BlockManager
    Items: Index([u'a', u'b', u'c'], dtype='object')
    Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
    FloatBlock: slice(0, 1, 1), 1 x 5, dtype: float64
    IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
    ObjectBlock: slice(2, 3, 1), 1 x 5, dtype: object

    == DF2.T
    BlockManager
    Items: Int64Index([0, 1, 2, 3, 4], dtype='int64')
    Axis 1: Index([u'a', u'b', u'c'], dtype='object')
    ObjectBlock: slice(0, 5, 1), 5 x 3, dtype: object
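The effect of transposing can also be observed without touching private attributes. The following is a minimal sketch that uses only the public `dtypes` attribute. The frames `mixed_numeric` and `mixed_with_text` are hypothetical stand-ins for the two frames inspected above, whose construction is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the frames inspected above.
mixed_numeric = pd.DataFrame({
    'a': np.linspace(0.0, 1.0, 5),        # float64 column
    'b': np.arange(5, dtype=np.int64),    # int64 column
})
mixed_with_text = mixed_numeric.copy()
# Force object dtype so the result does not depend on string-dtype settings.
mixed_with_text['c'] = pd.Series(list('vwxyz'), dtype=object)

# Each column keeps its own dtype ...
print(mixed_numeric.dtypes)

# ... but every column of the transposed frame must hold one value from each
# original column, so a single dtype able to represent all of them is chosen.
numeric_t_dtypes = set(mixed_numeric.T.dtypes)     # float and int together
text_t_dtypes = set(mixed_with_text.T.dtypes)      # numbers and text together
print(numeric_t_dtypes, text_t_dtypes)
```

With these inputs the numeric frame transposes to all-float columns, while adding a single text column pushes every transposed column to `object`.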

It's worth taking a few moments to examine these structures. Each
`BlockManager` contains an `Items` index, an `Axis` index and one
or more typed blocks. Each block appears to be a NumPy array paired with a slice
object which defines which of the frame's columns are stored in that
block. Importantly:

- Each block stores one or more columns but must span all rows
- Each block has one dtype

With this structure Pandas should be able to store multiple columns in one block. Let's check this is the case.

    df3 = pd.DataFrame({'a': np.arange(0, 5), 'b': range(0, 5)})
    df3._data

    BlockManager
    Items: Index([u'a', u'b'], dtype='object')
    Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
    IntBlock: slice(0, 2, 1), 2 x 5, dtype: int64
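To make the grouping rule concrete, here is a small sketch that partitions a frame's columns by dtype, in the way the block listings suggest. `columns_by_dtype` is a hypothetical helper, not part of Pandas, and it says nothing about how the real `BlockManager` orders or consolidates its blocks.

```python
from collections import defaultdict

import numpy as np
import pandas as pd


def columns_by_dtype(frame):
    """Group column labels by dtype name (an illustration, not Pandas internals)."""
    groups = defaultdict(list)
    for label, dtype in frame.dtypes.items():
        groups[str(dtype)].append(label)
    return dict(groups)


# Explicit dtypes keep the result independent of the platform's default integer.
same = pd.DataFrame({'a': np.arange(5, dtype=np.int64),
                     'b': np.arange(5, dtype=np.int64)})
mixed = pd.DataFrame({'a': np.linspace(0.0, 1.0, 5),
                      'b': np.arange(5, dtype=np.int64)})

same_groups = columns_by_dtype(same)      # one group holding both columns
mixed_groups = columns_by_dtype(mixed)    # one group per dtype
print(same_groups)
print(mixed_groups)
```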

## Intermission: NetCDF

If you are familiar with atmospheric sciences or oceanography you are
probably getting a sense of déjà vu right now. The `BlockManager`
looks suspiciously like the way the CF-NetCDF
conventions are used to encode
observation and model data in the environmental sciences. In both cases
variables are stored in homogeneous arrays where their domain is
represented by separate coordinate axis variables. When variables share
a domain they are associated with the same coordinate axes, similar to
how we might align columns in a spreadsheet next to a common coordinate
column.

Both Pandas and CF-NetCDF store data in a fundamentally different way to most relational databases, which store rows consecutively. I will save a full exploration of this for another post, but I will try to whet your appetite by pointing out a further similarity with the emerging trend of columnar databases. In fact, it seems no coincidence that the inventor of Pandas has recently announced involvement in a new columnar technology, Apache Arrow.
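The difference between the two layouts can be sketched with plain NumPy. The example below is an illustration only: a structured array stands in for a row-oriented table and a dictionary of arrays stands in for a columnar store. Neither is meant to represent how any particular database, NetCDF library or Arrow actually lays out its data.

```python
import numpy as np

n_rows = 5

# Row-oriented: each record keeps its float and its integer side by side.
row_store = np.zeros(n_rows, dtype=[('a', np.float64), ('b', np.int64)])

# Column-oriented: each variable lives in its own homogeneous array.
column_store = {
    'a': np.zeros(n_rows, dtype=np.float64),
    'b': np.zeros(n_rows, dtype=np.int64),
}

# The stride is the number of bytes between consecutive values of 'a'.
row_stride = row_store['a'].strides[0]          # steps over 'b' in every record
column_stride = column_store['a'].strides[0]    # values are packed back to back
print(row_stride, column_stride)
```

Reading one variable from the row-oriented layout means stepping over every other field in each record, whereas the columnar layout keeps the values of a variable contiguous.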

## Performance and dtypes

We know that what makes Pandas efficient is its use of NumPy arrays
underneath, so it should be no surprise that it can be slower to
calculate along a row, which cuts across columns, than down a single
column. To demonstrate this let's do
a simple calculation on a column and compare it to the same data after
it has been cast to `object` by transposing. We'll capture these
different representations of the same data as `a1` and `a2`.

    df = pd.DataFrame({'a': np.linspace(0, 100, 10000)})
    df['b'] = df['a'].astype(np.str)
    a1 = df['a']
    a2 = df.T.ix[0]
    print 'Dtypes: a1 =', a1.dtype, ' a2 =', a2.dtype
    print 'a1 == a2:', np.all(a1 == a2)  # Verify the data is the same

    Dtypes: a1 = float64 a2 = object
    a1 == a2: True

    print '== a1'
    %timeit np.sum(a1 * a1)
    print '== a2'
    %timeit np.sum(a2 * a2)

    == a1
    10000 loops, best of 3: 160 µs per loop
    == a2
    1000 loops, best of 3: 1.09 ms per loop

There is roughly a 7-fold degradation in performance when processing the
object array `a2` compared to the float array `a1`, even for this
trivial calculation.
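The session above uses Python 2 syntax and the IPython `%timeit` magic. As a rough sketch of the same comparison, the following uses the standard library's `timeit` module and only public Pandas API. Here `astype(object)` stands in for the transpose as a way of obtaining the object-typed copy, which is an assumption of this sketch. Timings vary by machine and library version, so none are quoted.

```python
import timeit

import numpy as np
import pandas as pd

a1 = pd.Series(np.linspace(0, 100, 10000))    # float64 values
a2 = a1.astype(object)                        # same values boxed as Python objects

# Both representations hold the same data and give the same answer.
sum_float = (a1 * a1).sum()
sum_object = (a2 * a2).sum()

# Time 100 repetitions of each calculation.
t_float = timeit.timeit(lambda: (a1 * a1).sum(), number=100)
t_object = timeit.timeit(lambda: (a2 * a2).sum(), number=100)
print('float64: %.4f s   object: %.4f s' % (t_float, t_object))
```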

## Panels

Finally we should take a quick peek at Pandas' lesser-known data structures that extend the data frame concept to more dimensions. Pandas defines Panel for 3D data and Panel4D for 4D data. The question is which Pandas concept absorbs the extra dimensions: columns or rows?

    wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
                  major_axis=pd.date_range('1/1/2000', periods=5),
                  minor_axis=['A', 'B', 'C', 'D'])

    wp._data

    BlockManager
    Items: Index([u'Item1', u'Item2'], dtype='object')
    Axis 1: <class 'pandas.tseries.index.DatetimeIndex'>
    [2000-01-01, ..., 2000-01-05]
    Length: 5, Freq: D, Timezone: None
    Axis 2: Index([u'A', u'B', u'C', u'D'], dtype='object')
    FloatBlock: slice(0, 2, 1), 2 x 5 x 4, dtype: float64
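One way to picture this result, without relying on `Panel` itself, is as a mapping from item labels to ordinary data frames that all share the same two axes. The sketch below builds that picture from an array of the same 2 x 5 x 4 shape. The `frames` dictionary is an illustrative stand-in and not the Panel implementation.

```python
import numpy as np
import pandas as pd

items = ['Item1', 'Item2']
major_axis = pd.date_range('1/1/2000', periods=5)
minor_axis = ['A', 'B', 'C', 'D']

values = np.random.randn(2, 5, 4)    # items x major_axis x minor_axis

# Each item selects one 2D slice of the array, labelled by the shared axes.
frames = {
    item: pd.DataFrame(values[i], index=major_axis, columns=minor_axis)
    for i, item in enumerate(items)
}

print(frames['Item1'].shape)
```

In this picture the items play the role that columns play in a data frame, except that each one is a 2D table rather than a 1D array.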

## Summing up

I like to think of a Pandas DataFrame as a collection of 1D arrays (the columns) whose elements are aligned along a common axis (the index). The implementation of DataFrames stores columns together rather than rows and is therefore more similar to columnar databases or NetCDF than it is to traditional relational databases.

Pandas goes to great lengths to provide a symmetric interface with respect to columns and rows, upcasting to Python objects where necessary. This symmetry breaks down when you are concerned about the types of your data, for instance when efficiency is important or when you want to enforce uniformity to ensure clean data.

The less common Pandas objects Panel and Panel4D extend the
dimensionality of columns to multi-dimensional items, so Panels are
conceptually collections of 2D arrays with 2 common axes. This is made
less obvious by the change in terminology from `columns` to `items`
in the Panel interface.
