Iterator Chains as Pythonic Data Processing Pipelines
Here’s another great feature of iterators in Python: By chaining together multiple iterators you can write highly efficient data processing “pipelines.”
If you take advantage of Python’s generator functions and generator expressions, you’ll be building concise and powerful iterator chains in no time.
In this tutorial you’ll find out what this technique looks like in practice and how you can use it in your own programs.
The first time I saw this pattern in action in a PyCon presentation by David Beazley, it simply blew my mind.
But first things first—let’s do a quick recap:
Generators and generator expressions are syntactic sugar for writing iterators in Python. They abstract away much of the boilerplate code needed when writing class-based iterators.
While a regular function produces a single return value, generators produce a sequence of results. You could say they generate a stream of values over the course of their lifetime.
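As a quick reminder of what that boilerplate looks like, here's a class-based iterator next to the generator function that replaces it. This Repeater example is a minimal sketch made up purely for contrast:

class Repeater:
    # The class-based version needs __iter__, __next__,
    # and explicitly managed state.
    def __init__(self, value, times):
        self.value = value
        self.times = times

    def __iter__(self):
        return self

    def __next__(self):
        if self.times <= 0:
            raise StopIteration
        self.times -= 1
        return self.value

def repeater(value, times):
    # The generator version keeps its loop state implicitly.
    for _ in range(times):
        yield value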
For example, I can define the following generator that produces the series of integer values from one to eight by keeping a running counter and yielding a new value every time next() gets called on it:
def integers():
    for i in range(1, 9):
        yield i
You can confirm this behavior by running the following code in a Python REPL:
>>> chain = integers()
>>> list(chain)
[1, 2, 3, 4, 5, 6, 7, 8]
So far, so not-very-interesting. But we’ll quickly change this now. You see, generators can be “connected” to each other in order to build efficient data processing algorithms that work like a pipeline.
Making Generator “Pipelines”
You can take the “stream” of values coming out of the integers() generator and feed them into another generator again. For example, one that takes each number, squares it, and then passes it on:
def squared(seq):
    for i in seq:
        yield i * i
This is what our “data pipeline” or “chain of generators” would do now:
>>> chain = squared(integers())
>>> list(chain)
[1, 4, 9, 16, 25, 36, 49, 64]
And we can keep on adding new building blocks to this pipeline. Data flows in one direction only, and each processing step is shielded from the others via a well-defined interface.
This is similar to how pipelines work in Unix. We chain together a sequence of processes so that the output of each process feeds directly as input to the next one.
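To make the analogy concrete, here's a rough Python counterpart to a shell pipeline like grep ERROR app.log | head -n 2. The log lines and the grep() and head() helpers are made up for illustration:

log_lines = [
    'INFO  starting up',
    'ERROR disk full',
    'INFO  retrying',
    'ERROR timeout',
    'ERROR giving up',
]

def grep(pattern, lines):
    # Pass on only the lines that contain the pattern.
    for line in lines:
        if pattern in line:
            yield line

def head(n, lines):
    # Stop the pipeline after n lines have passed through.
    for i, line in enumerate(lines):
        if i >= n:
            return
        yield line

>>> list(head(2, grep('ERROR', log_lines)))
['ERROR disk full', 'ERROR timeout']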
Building Longer Generator Chains
Why don’t we add another step to our pipeline that negates each value and then passes it on to the next processing step in the chain:
def negated(seq):
    for i in seq:
        yield -i
If we rebuild our chain of generators and add negated at the end, this is the output we get now:
>>> chain = negated(squared(integers()))
>>> list(chain)
[-1, -4, -9, -16, -25, -36, -49, -64]
My favorite thing about chaining generators is that the data processing happens one element at a time. There’s no buffering between the processing steps in the chain:
- The integers generator yields a single value, let’s say 3.
- This “activates” the squared generator, which processes the value and passes it on to the next stage as 3 × 3 = 9.
- The square number yielded by the squared generator gets fed immediately into the negated generator, which modifies it to -9 and yields it again.
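You can watch this element-by-element flow for yourself by instrumenting the generators with print() calls. This is a throwaway sketch for demonstration purposes only:

def integers_verbose():
    for i in range(1, 4):
        print('integers yielding', i)
        yield i

def squared_verbose(seq):
    for i in seq:
        print('squared yielding', i * i)
        yield i * i

Running the shortened chain in a REPL shows the stages taking turns:

>>> list(squared_verbose(integers_verbose()))
integers yielding 1
squared yielding 1
integers yielding 2
squared yielding 4
integers yielding 3
squared yielding 9
[1, 4, 9]

Each integer travels through the whole chain before the next one is produced; nothing is buffered between the stages.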
You could keep extending this chain of generators to build out a processing pipeline with many steps. It would still perform efficiently and could easily be modified because each step in the chain is an individual generator function.
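For instance, say we want to square and negate only the even numbers. Slotting in a hypothetical even_only step doesn't require touching any of the existing stages:

def even_only(seq):
    # Pass on only the even values.
    for i in seq:
        if i % 2 == 0:
            yield i

>>> chain = negated(squared(even_only(integers())))
>>> list(chain)
[-4, -16, -36, -64]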
Chained Generator Expressions
Each individual generator function in this processing pipeline is quite concise. With a little trick, we can shrink down the definition of this pipeline even more, without sacrificing much readability:
integers = range(1, 9)
squared = (i * i for i in integers)
negated = (-i for i in squared)
Notice how I’ve replaced each processing step in the chain with a generator expression built on the output of the previous step. This code is equivalent to the chain of generators we built throughout this tutorial:
>>> negated
<generator object <genexpr> at 0x1098bcb48>
>>> list(negated)
[-1, -4, -9, -16, -25, -36, -49, -64]
The only downside to using generator expressions is that they can’t be configured with function arguments, and they are single-use: once a generator expression has been consumed, it can’t be restarted or reused in the same processing pipeline.
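Here's a quick sketch of that single-use behavior:

>>> squared = (i * i for i in range(1, 4))
>>> list(squared)
[1, 4, 9]
>>> list(squared)
[]

Once the expression has been consumed, iterating over it again produces no further values; you'd have to define it from scratch.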
But of course, you can mix and match generator expressions and regular generators freely when building these pipelines. This helps keep complex pipelines readable.
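For example, you might use a regular generator function for a step that needs a parameter and generator expressions for the simpler steps. The multiplied() helper below is hypothetical, purely to show the mix:

def multiplied(factor, seq):
    # A regular generator function, because it takes an argument.
    for i in seq:
        yield factor * i

>>> integers = range(1, 9)
>>> squared = (i * i for i in integers)
>>> chain = multiplied(-1, squared)
>>> list(chain)
[-1, -4, -9, -16, -25, -36, -49, -64]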
Chained Iterators in Python – Key Takeaways
In this tutorial you saw how chaining together multiple iterators lets you write highly efficient data processing “pipelines.” This is another great feature of iterators in Python:
- Generators can be chained together to form highly efficient and maintainable data processing pipelines.
- Chained generators process each element going through the chain individually.
- Generator expressions can be used to write concise pipeline definitions, but this can impact readability.