Iterator Chains as Pythonic Data Processing Pipelines
Here’s another great feature of iterators in Python: By chaining together multiple iterators you can write highly efficient data processing “pipelines.”
If you take advantage of Python’s generator functions and generator expressions, you’ll be building concise and powerful iterator chains in no time.
In this tutorial you’ll find out what this technique looks like in practice and how you can use it in your own programs.
The first time I saw this pattern in action in a PyCon presentation by David Beazley, it simply blew my mind.
But first things first—let’s do a quick recap:
Generators and generator expressions are syntactic sugar for writing iterators in Python. They abstract away much of the boilerplate code needed when writing class-based iterators.
While a regular function produces a single return value, generators produce a sequence of results. You could say they generate a stream of values over the course of their lifetime.
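As a quick reminder of what that boilerplate looks like, here's a class-based iterator next to the generator function that replaces it. This Repeater example is a minimal sketch made up purely for contrast:

class Repeater:
    # The class-based version needs __iter__, __next__,
    # and explicitly managed state.
    def __init__(self, value, times):
        self.value = value
        self.times = times

    def __iter__(self):
        return self

    def __next__(self):
        if self.times <= 0:
            raise StopIteration
        self.times -= 1
        return self.value

def repeater(value, times):
    # The generator version keeps its loop state implicitly.
    for _ in range(times):
        yield value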
For example, I can define the following generator that produces the series of integer values from one to eight by keeping a running counter and yielding a new value every time next() gets called on it:
def integers():
    for i in range(1, 9):
        yield i
You can confirm this behavior by running the following code in a Python REPL:
>>> chain = integers()
>>> list(chain)
[1, 2, 3, 4, 5, 6, 7, 8]
So far, so not-very-interesting. But we’ll quickly change this now. You see, generators can be “connected” to each other in order to build efficient data processing algorithms that work like a pipeline.
Making Generator “Pipelines”
You can take the “stream” of values coming out of the integers() generator and feed them into another generator again. For example, one that takes each number, squares it, and then passes it on:
def squared(seq):
    for i in seq:
        yield i * i
This is what our “data pipeline” or “chain of generators” would do now:
>>> chain = squared(integers())
>>> list(chain)
[1, 4, 9, 16, 25, 36, 49, 64]
And we can keep on adding new building blocks to this pipeline. Data flows in one direction only, and each processing step is shielded from the others via a well-defined interface.
This is similar to how pipelines work in Unix. We chain together a sequence of processes so that the output of each process feeds directly as input to the next one.
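To make the analogy concrete, here's a rough Python counterpart to a shell pipeline like grep ERROR app.log | head -n 2. The log lines and the grep() and head() helpers are made up for illustration:

log_lines = [
    'INFO  starting up',
    'ERROR disk full',
    'INFO  retrying',
    'ERROR timeout',
    'ERROR giving up',
]

def grep(pattern, lines):
    # Pass on only the lines that contain the pattern.
    for line in lines:
        if pattern in line:
            yield line

def head(n, lines):
    # Stop the pipeline after n lines have passed through.
    for i, line in enumerate(lines):
        if i >= n:
            return
        yield line

>>> list(head(2, grep('ERROR', log_lines)))
['ERROR disk full', 'ERROR timeout']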
Building Longer Generator Chains
Why don’t we add another step to our pipeline that negates each value and then passes it on to the next processing step in the chain:
def negated(seq):
    for i in seq:
        yield -i
If we rebuild our chain of generators and add negated at the end, this is the output we get now:
>>> chain = negated(squared(integers()))
>>> list(chain)
[-1, -4, -9, -16, -25, -36, -49, -64]
My favorite thing about chaining generators is that the data processing happens one element at a time. There’s no buffering between the processing steps in the chain:
- The integers generator yields a single value, let’s say 3.
- This “activates” the squared generator, which processes the value and passes it on to the next stage as 3 × 3 = 9.
- The square number yielded by the squared generator gets fed immediately into the negated generator, which modifies it to -9 and yields it again.
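You can watch this element-by-element flow for yourself by instrumenting the generators with print() calls. This is a throwaway sketch for demonstration purposes only:

def integers_verbose():
    for i in range(1, 4):
        print('integers yielding', i)
        yield i

def squared_verbose(seq):
    for i in seq:
        print('squared yielding', i * i)
        yield i * i

Running the shortened chain in a REPL shows the stages taking turns:

>>> list(squared_verbose(integers_verbose()))
integers yielding 1
squared yielding 1
integers yielding 2
squared yielding 4
integers yielding 3
squared yielding 9
[1, 4, 9]

Each integer travels through the whole chain before the next one is produced; nothing is buffered between the stages.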
You could keep extending this chain of generators to build out a processing pipeline with many steps. It would still perform efficiently and could easily be modified because each step in the chain is an individual generator function.
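For instance, say we want to square and negate only the even numbers. Slotting in a hypothetical even_only step doesn't require touching any of the existing stages:

def even_only(seq):
    # Pass on only the even values.
    for i in seq:
        if i % 2 == 0:
            yield i

>>> chain = negated(squared(even_only(integers())))
>>> list(chain)
[-4, -16, -36, -64]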
Chained Generator Expressions
Each individual generator function in this processing pipeline is quite concise. With a little trick, we can shrink down the definition of this pipeline even more, without sacrificing much readability:
integers = range(1, 9)
squared = (i * i for i in integers)
negated = (-i for i in squared)
Notice how I’ve replaced each processing step in the chain with a generator expression built on the output of the previous step. This code is equivalent to the chain of generators we built throughout this tutorial:
>>> negated
<generator object <genexpr> at 0x1098bcb48>
>>> list(negated)
[-1, -4, -9, -16, -25, -36, -49, -64]
The only downside to using generator expressions is that they can’t be configured with function arguments, and they are single-use: once a generator expression has been consumed, it can’t be restarted or reused in the same processing pipeline.
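Here's a quick sketch of that single-use behavior:

>>> squared = (i * i for i in range(1, 4))
>>> list(squared)
[1, 4, 9]
>>> list(squared)
[]

Once the expression has been consumed, iterating over it again produces no further values; you'd have to define it from scratch.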
But of course, you can mix and match generator expressions and regular generators freely when building these pipelines. This helps keep complex pipelines readable.
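For example, you might use a regular generator function for a step that needs a parameter and generator expressions for the simpler steps. The multiplied() helper below is hypothetical, purely to show the mix:

def multiplied(factor, seq):
    # A regular generator function, because it takes an argument.
    for i in seq:
        yield factor * i

>>> integers = range(1, 9)
>>> squared = (i * i for i in integers)
>>> chain = multiplied(-1, squared)
>>> list(chain)
[-1, -4, -9, -16, -25, -36, -49, -64]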
Chained Iterators in Python – Key Takeaways
In this tutorial you saw how chaining together multiple iterators lets you write highly efficient data processing “pipelines.” This is another great feature of iterators in Python:
- Generators can be chained together to form highly efficient and maintainable data processing pipelines.
- Chained generators process each element going through the chain individually.
- Generator expressions can be used to write concise pipeline definitions, but this can impact readability.