What happens when you turn an iterator inside out?

Sun 18 September 2016

This post is an extended version of a lightning talk I gave at *PyConUK 2016* on 2016-09-17

I want share with you an interesting connection between two programming paradigms that, at first glance, seem completely different. This is, of course, not my idea but material on the web is relatively difficult to come by, particularly if you are not into theory. This treatment tries to be more approachable and look from the Python perspective.

So what is an iterator?

As Python programmers we feel we've got a pretty good understanding of iterators. They are a stalwart of imperative programming, allowing us to express many tasks efficiently as loops rather than gathering sets of things into intermediate data structures. Abstractly we have objects that can be iterated over, which we will call iteratables and objects that represent the process of iterating called iterators

iter(iterable)  -> iterator
next(iterator)  -> value

The point being that you can continually call next(iterator) to yield multiple values. Also, if you call iter(iterable) a second time you will get a fresh iterator which will yield values independently (usually, but not necessarily, the same values).

If we are going to perform surgery on this definition we need to be a little more precise. We need to capture all the possible results of calling these functions. In particular, calling next() have several outcomes depending on whether it raises an exception and why.

iter(iterable)  -> iterator
next(iterator)  -> (value | StopIteration | Exception)

We can simplify this notation by removing our names to just leave an expression of what is happening. Both iterable and iterator interfaces have only one method, so we can replace them with single functions:

iterable = () -> iterator
iterator = () -> (value | StopIteration | Exception)

resulting in the expression

() -> ( () -> (value | StopIteration | Exception) )

Inside out?

We are now in a position to perform our manauver. All we have to do is reverse the arrows, turning inputs into outputs and vica versa. Since we have a higher order function in here the result will be a little more subtle than just going backwards.

Our original expression

() -> ( () -> (value | StopIteration | Exception) )


==> () <- ( () <- (value | StopIteration | Exception) )

or rewriting it left to right

==> ( (value | StopIteration | Exception) -> () ) -> ()

But what does this mean? We have a function that takes another function as argument, that second function takes either a value, StopIteration or Exception.

To make what is going on a little more explicit we can notice that any function which takes one of several types could be replaced by a tuple of functions taking each type. In our case we can split our inner function into 3 separate functions:

(value | StopIteration | Exception) -> ()

    == ( (value)     -> (),      # Emit value
         ()          -> (),      # No move values
         (Exception) -> ()       # Error
       ) -> ()

These functions look a bit odd because they don't return anything, they only receive a value or, in the case of StopIteration, there is no need to pass the value at all because there is only one possible reason for calling it. Presumably, something is done to these values though, in that they trigger some sort of computation.

In fact, they are exactly like event handlers. What we have done is replaced an object which allows us to pull values out of it until there are no more values left with an object which you send values until there is an error or there are no more values.

Finally, lets put the names back that we removed earlier. Clearly we don't have iterators/iterables any more so we'll need some different names. What we want is observer/observable:

( (value)     -> (),
  ()          -> (),
  (Exception) -> ()  )  ->  ()

observer   = ( (value) -> (), () -> (), (Exception) -> () )

observable = observer -> ()

This is Reactive Programming with ReactiveX

Observables came out of C# and LINQ and was subsequently been ported to the JVM world and many other languages, inluding JavaScript and Python. They have recently become popular for creating event-driven web apps and are central to events in Angular 2. They can be a higher-level abstraction than Promises supporting a wealth of operators to combine and filter streams of events.

But they are just iterators inside out!

A full discussion of the power of Rx would require it's own blog post but I should include at least one example in Python. Consider it a Hello World for Rx.

from rx import Observable

ob = Observable.from_('Spam egg Spam Spam bacon Spam'.split())

def on_next(value):
    print('Got {}'.format(value))

def on_completed():

subscription = ob.subscribe(on_next, on_completed=on_completed)
Got Spam
Got egg
Got Spam
Got Spam
Got bacon
Got Spam

Why care?

Whether you are developing dynamic web applications or large-scale data pipelines, we are being confronted with new paradigms that should solve our problems. Events, Streams, Asynchronous, Reactive -- all these things are undoubtably good but how are we going to start using them with our existing business logic? One thing we are told is that async and blocking code do not mix and yet everyone knows and understands imperative, iterator-style code.

It's good to realise events and iterators are not so different. One is simply the dual of the other. This changes our perspective allowing us to design our architecture differently. Producers are in control rather than consumers.

I am just beginning to explore the possibilities of Rx. It's clearly useful in JavaScript web apps but what about data pipelines? What about Python?

If you are interested you will like

Or to completely blow your mind this video is a fun summary of duality from the inventor of ReactiveX, Eric Meijer

Category: Software Tagged: python reactivex category_theory


What is a dataframe

Sun 24 April 2016

Back when Big Data was a genuinely new concept and Data Scientist's weren't rock stars, the SciPy ecosystem was beginning to mature. It felt as though Python deserved to dominate the data analysis world but uptake was still slow outside the realm of the physical sciences and engineering ...

Category: Data Science Tagged: pandas python


Read More

Blogging with IPython-notebook

Thu 14 March 2013

I have a new blogging engine to play with. This time I'm going with Pelican which generates static HTML, making it very easy to redeploy. So far I've been stunned at how quickly I can pull content into this blogging system. In particular it was very easy to ...

Category: Meta Tagged: ipython


Read More

Example of using esgf-pyclient to plot via OPeNDAP

Sun 24 February 2013

First import the required modules and select the CEDA ESGF-index node's search service.

from pyesgf.search import SearchConnection
from netCDF4 import Dataset
from mpl_toolkits.basemap import Basemap

CEDA_SERVICE = 'http://esgf-index1.ceda.ac.uk/esg-search/search'

Let's ask the question "How many datasets from the HadCM3 decadal2000 simulations do ...

Category: Data Science Tagged: esgf opendap


Read More

Analysis of replicated CMIP5 datasets

Wed 06 February 2013

This notebook describes the analysis of a snapshot of all replicas in the CMIP5 archive on February 6th 2013. A list of all dataset versions at each of the 3 replicating data centres is read, analysed and compared with the total number of datasets listed in the CMIP5 archive.

This ...

Category: Data Science Tagged: esgf cmip5


Read More

Bringing Git to data archival

Tue 15 January 2013

I am increasingly excited about distributed version control and how it enables easy collaboration between software developers without technical and social barriers such as synchronisation of work and maintenance of control.</p>

The obvious question is how the DVC systems can be applied to scientific collaboration and my particular specialism ...

Category: Data Science Tagged: esgf cmip5


Read More
Page 1 of 1