Back when Big Data was a genuinely new concept and Data Scientists weren't rock stars, the SciPy ecosystem was beginning to mature. It felt as though Python deserved to dominate the data analysis world, but uptake was still slow outside the realm of the physical sciences and engineering. Something seemed to be missing, and there was a new language on the block stealing all the attention: R. As well as having more mature statistical libraries, R had a killer feature for anyone doing ad hoc data investigation: the Data Frame.
I remember the feeling of relief and optimism when this missing feature was addressed by the arrival of Pandas, bringing data frames to Python. Finally we had a tool that could bring the power of NumPy to a more general audience and stem the flow of attention towards R.
Pandas data frames have the flexibility of R's but use NumPy arrays underneath to give big efficiency gains for bulk computations. Pandas' success rests on how well it hides this implementation from users, allowing them to see data frames in terms of what they already know. To a software engineer they look like database tables; to a physicist they may look like annotated matrices; to many with a general numerate background they look like spreadsheets.
I want to demonstrate that data frames are an amalgamation of all of these different models, with some clever manipulation under the hood to paper over the cracks. Even though a data frame seems completely general, it has a few restrictions that are not obvious. Understanding those restrictions also helps us see through the data frame's usability tricks to the true data model underneath.
Inside the data frame
When I want to understand the implementation of a data structure I often mess around at the IPython prompt to find out what attributes and methods an object has. Using IPython autocomplete I can see our data frames have an attribute _data, which looks promising. What do they look like?
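For reference, df and df2 were created earlier in the post and their exact contents aren't shown in this section, but the BlockManager output below implies frames along these lines (a sketch of my own, not the original code):

import numpy as np
import pandas as pd

# df: a float column and an int column over a default integer index
df = pd.DataFrame({'a': np.linspace(0, 1, 5), 'b': np.arange(5)})
# df2: as df, plus a string column
df2 = pd.DataFrame({'a': np.linspace(0, 1, 5), 'b': np.arange(5),
                    'c': list('vwxyz')})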
print '== DF1 : 2 cols, float, int'
print df._data, '\n'
print '== DF1.T'
print df.T._data, '\n'
print '== DF2 : 3 cols, float, int, string'
print df2._data, '\n'
print '== DF2.T'
print df2.T._data, '\n'
== DF1 : 2 cols, float, int
BlockManager
Items: Index([u'a', u'b'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
FloatBlock: slice(0, 1, 1), 1 x 5, dtype: float64
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64

== DF1.T
BlockManager
Items: Int64Index([0, 1, 2, 3, 4], dtype='int64')
Axis 1: Index([u'a', u'b'], dtype='object')
FloatBlock: slice(0, 5, 1), 5 x 2, dtype: float64

== DF2 : 3 cols, float, int, string
BlockManager
Items: Index([u'a', u'b', u'c'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
FloatBlock: slice(0, 1, 1), 1 x 5, dtype: float64
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
ObjectBlock: slice(2, 3, 1), 1 x 5, dtype: object

== DF2.T
BlockManager
Items: Int64Index([0, 1, 2, 3, 4], dtype='int64')
Axis 1: Index([u'a', u'b', u'c'], dtype='object')
ObjectBlock: slice(0, 5, 1), 5 x 3, dtype: object
It's worth taking a few moments to examine these structures. Each BlockManager contains an Items index, an Axis 1 index, and one or more typed blocks. Each block appears to be a NumPy array together with a slice object defining which of the frame's columns are stored in that block. Importantly:
- Each block stores one or more columns but must span all rows
- Each block has one dtype
With this structure Pandas should be able to store multiple columns in one block. Let's check this is the case.
df3 = pd.DataFrame({'a': np.arange(0, 5), 'b': range(0, 5)})
df3._data
BlockManager
Items: Index([u'a', u'b'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(0, 2, 1), 2 x 5, dtype: int64
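Conversely, if the "one dtype per block" rule holds, adding a column of a different dtype should produce a separate block rather than widening the IntBlock. A quick sketch of my own to test that expectation:

df3['c'] = np.linspace(0, 1, 5)  # add a float column
df3._data  # expect the 2 x 5 IntBlock plus a new 1 x 5 FloatBlock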
Intermission: NetCDF
If you are familiar with the atmospheric sciences or oceanography you are probably getting a sense of déjà vu right now. The BlockManager looks suspiciously like the way the CF-NetCDF conventions encode observation and model data in the environmental sciences. In both cases variables are stored in homogeneous arrays and their domain is represented by separate coordinate axis variables. When variables share a domain they are associated with the same coordinate axes, similar to how we might align columns in a spreadsheet next to a common coordinate column.
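To make the analogy concrete, here is a minimal sketch of my own (not from the original post) of the two-column frame expressed as CF-style NetCDF variables sharing a single coordinate dimension, using the netCDF4 library:

from netCDF4 import Dataset
import numpy as np

# One shared dimension plays the role of the DataFrame index
nc = Dataset('frame.nc', 'w')
nc.createDimension('index', 5)
idx = nc.createVariable('index', 'i8', ('index',))
a = nc.createVariable('a', 'f8', ('index',))  # homogeneous float "column"
b = nc.createVariable('b', 'i8', ('index',))  # homogeneous int "column"
idx[:] = np.arange(5)
a[:] = np.linspace(0.0, 1.0, 5)
b[:] = np.arange(5)
nc.close()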
Both Pandas and CF-NetCDF store data in a fundamentally different way to most relational databases, which store rows consecutively. I will save a full exploration of this for another post, but I will try to whet your appetite by pointing out a further similarity with the emerging trend of columnar databases. In fact, it seems no coincidence that the creator of Pandas has recently announced involvement in a new columnar technology, Apache Arrow.
Performance and dtypes
We know that what makes Pandas efficient is its use of NumPy arrays underneath, so it should be no surprise that working across columns can be slower than working down them. To demonstrate this, let's do a simple calculation on a column and compare it to the same data after it has been upcast to object by transposing a mixed-dtype frame. We'll capture these two representations of the same data as a1 and a2.
df = pd.DataFrame({'a': np.linspace(0, 100, 10000)})
df['b'] = df['a'].astype(np.str)

a1 = df['a']
a2 = df.T.ix[0]
print 'Dtypes: a1 =', a1.dtype, ' a2 =', a2.dtype
print 'a1 == a2:', np.all(a1 == a2)  # Verify the data is the same
Dtypes: a1 = float64  a2 = object
a1 == a2: True
print '== a1'
%timeit np.sum(a1 * a1)
print '== a2'
%timeit np.sum(a2 * a2)
== a1
10000 loops, best of 3: 160 µs per loop
== a2
1000 loops, best of 3: 1.09 ms per loop
There is about a 7-fold degradation in performance when processing the object array a2 compared to the float array a1, even for this trivial calculation.
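If you do find yourself stuck with an upcast array, casting it back to a homogeneous dtype should recover most of the performance. A sketch of my own (timings will vary):

a3 = a2.astype(np.float64)  # cast the object array back to floats
%timeit np.sum(a3 * a3)     # expect timings close to a1's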
Panels
Finally, we should take a quick peek at Pandas' lesser-known data structures that expand the data frame concept to more dimensions. Pandas defines Panel for 3D data and Panel4D for 4D data. The question is: which Pandas concept gains the extra dimensionality, columns or rows?
wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
              major_axis=pd.date_range('1/1/2000', periods=5),
              minor_axis=['A', 'B', 'C', 'D'])
wp._data
BlockManager
Items: Index([u'Item1', u'Item2'], dtype='object')
Axis 1: <class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01, ..., 2000-01-05]
Length: 5, Freq: D, Timezone: None
Axis 2: Index([u'A', u'B', u'C', u'D'], dtype='object')
FloatBlock: slice(0, 2, 1), 2 x 5 x 4, dtype: float64
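The single 2 x 5 x 4 FloatBlock shows that it is the items (the Panel's analogue of columns) that carry the extra dimension: each item is a whole 2D array over the two shared axes. A quick check, as a sketch of my own:

print wp['Item1'].shape  # (5, 4): each item is a 2D DataFrame
print wp['Item1'].index  # the shared major_axis of dates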
Summing up
I like to think of a Pandas DataFrame as a collection of 1D arrays (the columns) whose elements are aligned along a common axis (the index). The implementation of DataFrames stores columns together rather than rows, and is therefore more similar to columnar databases or NetCDF than to traditional relational databases.
Pandas goes to great lengths to provide a symmetric interface with respect to columns and rows, upcasting to Python objects where necessary. This symmetry breaks down when you care about the types of your data, for instance when efficiency is important or when you want to enforce uniformity to ensure clean data.
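As a small illustration of that broken symmetry (a sketch of my own, not from the original post): a single column keeps its dtype, but a single row of a mixed frame must be upcast to object:

dfm = pd.DataFrame({'a': [1.0, 2.0], 'b': [1, 2], 'c': ['x', 'y']})
print dfm['a'].dtype          # float64: columns keep their own dtype
print dfm.ix[0].values.dtype  # object: a row mixes dtypes, so it upcasts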
The less common Pandas objects Panel and Panel4D extend the dimensionality of columns to multi-dimensional items, so Panels are conceptually collections of 2D arrays with two common axes. This is made less obvious by the change in terminology from columns to items in the Panel interface.