Back when *Big Data* was a genuinely new concept and *Data Scientist's* weren't rock stars, the SciPy ecosystem was beginning to mature. It felt as though Python deserved to dominate the data analysis world but uptake was still slow outside the realm of the physical sciences and engineering. Something seemed to be missing; and there was a new language on the block which was steeling all the attention: R. As well as having more mature statistical libraries, R had a killer feature for anyone doing ad hoc data invstigation: the Data Frame.

I remember the feeling of relief and optimism when this missing feature was addressed by the arival of Pandas, bringing data frames to Python. Finally we had a tool which could bring the power of Numpy to a more general audience and stem the flow of attention towards R.

Pandas data frames have the flexibility of R but used Numpy arrays underneath to give big efficiency gains for bulk computations. Pandas' success is based on how it manages to hide this implementation from the user, allowing them to see them in terms of what they are familiar with. To a software engineer they look like database tables; to a physicist they may look like annotated matrices; to many with a general numerate background they look like spreadsheets.

I want to demonstrate that data frames are an amalgamation of all of these different models with some clever manipulation under the hood to paper over the cracks. Even though a data frame seems completely general, there are a few restrictions that are not obvious. Understanding the data frame's restrictions also helps us see through its usability tricks to the true data model underneath.

## Inside the data frame

When I want to understand the implementation of a data structure I often mess arount on the IPython prompt to find out what attributes and methods an object has. Using IPython autocomplete I can see our data frames have an attribute `_data`

which looks promising. What do they look like?

```
print '== DF1 : 2 cols, float, int'
print df._data, '\n'
print '== DF1.T'
print df.T._data, '\n'
print '== DF2 : 3 cols, float, int, string'
print df2._data, '\n'
print '== DF2.T'
print df2.T._data, '\n'
```

```
== DF1 : 2 cols, float, int BlockManager Items: Index(\[u\'a\',
u\'b\'\], dtype=\'object\') Axis 1: Int64Index(\[0, 1, 2, 3, 4\],
dtype=\'int64\') FloatBlock: slice(0, 1, 1), 1 x 5, dtype: float64
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
== DF1.T BlockManager Items: Int64Index(\[0, 1, 2, 3, 4\],
dtype=\'int64\') Axis 1: Index(\[u\'a\', u\'b\'\], dtype=\'object\')
FloatBlock: slice(0, 5, 1), 5 x 2, dtype: float64
== DF2 : 3 cols, float, int, string BlockManager Items: Index(\[u\'a\',
u\'b\', u\'c\'\], dtype=\'object\') Axis 1: Int64Index(\[0, 1, 2, 3,
4\], dtype=\'int64\') FloatBlock: slice(0, 1, 1), 1 x 5, dtype: float64
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64 ObjectBlock: slice(2, 3,
1), 1 x 5, dtype: object
== DF2.T BlockManager Items: Int64Index(\[0, 1, 2, 3, 4\],
dtype=\'int64\') Axis 1: Index(\[u\'a\', u\'b\', u\'c\'\],
dtype=\'object\') ObjectBlock: slice(0, 5, 1), 5 x 3, dtype: object
```

It's worth taking a few moments to examine these structures. Each `BlockManager`

contains an `Items`

index, a `Axis`

index and one or more typed blocks. The blocks appear to be a numpy array with a slice object which defines which of the frame's columns are stored in that block. Importantly:

- Each block stores one or more columns but must span all rows
- Each block has one dtype

With this structure Pandas should be able to store multiple columns in one block. Let's check this is the case.

```
= pd.DataFrame({'a': np.arange(0, 5), 'b': range(0, 5)})
df3 df3._data
```

```
BlockManager Items: Index(\[u\'a\', u\'b\'\], dtype=\'object\') Axis 1:
Int64Index(\[0, 1, 2, 3, 4\], dtype=\'int64\') IntBlock: slice(0, 2, 1),
2 x 5, dtype: int64
```

## Intermission: NetCDF

If you are familiar with atmospheric sciencies or oceanography you are probably getting a sense of deja vu right now. The `BlockManager`

looks suspiciously like how the CF-NetCF conventions are used to encode observation and model data in the environmental sciences. In both cases variables are stored in homogenous arrays where their domain is represented by separate coordinate axis variables. When variables share a domain they are associated with the same coordinate axes, similar to how we might align columns in a spreadsheet next to a common coordinate column.

Both Pandas and CF-NetCDF store data in a fundamentally different way to most relational databases which store rows consecutively. I will save full exploration of this for another post but I will try to wet your apetite by pointing out a further similarity with the emerging trend of Columnar Databases. In fact, it seems no coincidence that the inventor of Pandas has recently announced involvement in a new columnar technology Apache Arrow

## Performance and dtypes

We know that what makes Pandas efficient is it's use of numpy arrays underneath, so it should be no surprise that it can be slower to calculate across collumns than across rows. To demonstrate this let's do a simple calculation on a column and compare it to the same data after it has been cast to `object`

by transposing. We'll capture these different representations of the same data as `a1`

and `a2`

.

```
= pd.DataFrame({'a': np.linspace(0, 100, 10000)})
df 'b'] = df['a'].astype(np.str)
df[
= df['a']
a1 = df.T.ix[0]
a2 print 'Dtypes: a1 =', a1.dtype, ' a2 =', a2.dtype
print 'a1 == a2:', np.all(a1 == a2) # Verify the data is the same
```

`Dtypes: a1 = float64 a2 = object a1 == a2: True`

```
print '== a1'
%timeit np.sum(a1 * a1)
print '== a2'
%timeit np.sum(a2 * a2)
```

```
== a1 10000 loops, best of 3: 160 µs per loop == a2 1000 loops, best of
3: 1.09 ms per loop
```

There is about 7-fold degradation in performance when processing the object array `a2`

compared to the float array `a1`

even for this trivial calculation.

## Panels

Finally we should take a quick peek at Pandas' lesser known data structures that expand the data frame concept to more dimensions. Pandas defines Panel for 3D data and Panel4D for 4D data. The question is which Pandas concept expands increases in dimensionality: columns or rows?

```
= pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
wp =pd.date_range('1/1/2000', periods=5),
major_axis=['A', 'B', 'C', 'D']) minor_axis
```

` wp._data`

```
BlockManager Items: Index(\[u\'Item1\', u\'Item2\'\], dtype=\'object\')
Axis 1: \<class \'pandas.tseries.index.DatetimeIndex\'\> \[2000-01-01,
\..., 2000-01-05\] Length: 5, Freq: D, Timezone: None Axis 2:
Index(\[u\'A\', u\'B\', u\'C\', u\'D\'\], dtype=\'object\') FloatBlock:
slice(0, 2, 1), 2 x 5 x 4, dtype: float64
```

## Summing up

I like to think of a Pandas DataFrame as a collection of 1D arrays (the columns) whose elements are alligned along a common axis (the index). The implementation of DataFrames stores columns together rather than rows and is therefore more similar to columnar databases or NetCDF than it is to traditional relational databases.

Pandas takes great efforts to provide a symetric interface with respect to the columns and rows, upcasting to Python Objects where necessary. This symetry breaks down when you are concerned about the types of your data, for instance when efficiency is important or when you want to enforce uniformity to ensure clean data.

The less common Pandas objects Panel and Panel4D extend the dimensionality of columns to multi-dimensional items, thus Panels are conceptually collections of 2D arrays with 2 common axes. This is made less obvious by changing the terminology from `columns`

to `items`

in the Panel interface.