Python Training by Dan Bader

Iterator Chains as Pythonic Data Processing Pipelines

Here’s another great feature of iterators in Python: By chaining together multiple iterators you can write highly efficient data processing “pipelines.”

Python Iterator Chains

If you take advantage of Python’s generator functions and generator expressions, you’ll be building concise and powerful iterator chains in no time.

In this tutorial you’ll find out what this technique looks like in practice and how you can use it in your own programs.

The first time I saw this pattern in action in a PyCon presentation by David Beazley, it simply blew my mind.

But first things first—let’s do a quick recap:

Generators and generator expressions are syntactic sugar for writing iterators in Python. They abstract away much of the boilerplate code needed when writing class-based iterators.

While a regular function produces a single return value, generators produce a sequence of results. You could say they generate a stream of values over the course of their lifetime.

For example, I can define the following generator that produces the series of integer values from one to eight by keeping a running counter and yielding a new value every time next() gets called on it:

def integers():
    for i in range(1, 9):
        yield i

You can confirm this behaviour by running the following code in a Python REPL:

>>> chain = integers()
>>> list(chain)
[1, 2, 3, 4, 5, 6, 7, 8]

So far, so not-very-interesting. But we’ll quickly change this now. You see, generators can be “connected” to each other in order to build efficient data processing algorithms that work like a pipeline.

Making Generator “Pipelines”

You can take the “stream” of values coming out of the integers() generator and feed them into another generator again. For example, one that takes each number, squares it, and then passes it on:

def squared(seq):
    for i in seq:
        yield i * i

This is what our “data pipeline” or “chain of generators” would do now:

>>> chain = squared(integers())
>>> list(chain)
[1, 4, 9, 16, 25, 36, 49, 64]

And we can keep on adding new building blocks to this pipeline. Data flows in one direction only, and each processing step is shielded from the others via a well-defined interface.

This is similar to how pipelines work in Unix. We chain together a sequence of processes so that the output of each process feeds directly as input to the next one.

Building Longer Generator Chains

Why don’t we add another step to our pipeline that negates each value and then passes it on to the next processing step in the chain:

def negated(seq):
    for i in seq:
        yield -i

If we rebuild our chain of generators and add negated at the end, this is the output we get now:

>>> chain = negated(squared(integers()))
>>> list(chain)
[-1, -4, -9, -16, -25, -36, -49, -64]

My favorite thing about chaining generators is that the data processing happens one element at a time. There’s no buffering between the processing steps in the chain:

  1. The integers generator yields a single value, let’s say 3.
  2. This “activates” the squared generator, which processes the value and passes it on to the next stage as 3 × 3 = 9
  3. The square number yielded by the squared generator gets fed immediately into the negated generator, which modifies it to -9 and yields it again.

You could keep extending this chain of generators to build out a processing pipeline with many steps. It would still perform efficiently and could easily be modified because each step in the chain is an individual generator function.

Chained Generator Expressions

Each individual generator function in this processing pipeline is quite concise. With a little trick, we can shrink down the definition of this pipeline even more, without sacrificing much readability:

integers = range(8)
squared = (i * i for i in integers)
negated = (-i for i in squared)

Notice how I’ve replaced each processing step in the chain with a generator expression built on the output of the previous step. This code is equivalent to the chain of generators we built throughout this tutorial:

>>> negated
<generator object <genexpr> at 0x1098bcb48>
>>> list(negated)
[0, -1, -4, -9, -16, -25, -36, -49]

The only downside to using generator expressions is that they can’t be configured with function arguments, and you can’t reuse the same generator expression multiple times in the same processing pipeline.

But of course, you could mix-and-match generator expressions and regular generators freely in building these pipelines. This will help improve readability with complex pipelines.

Chained Iterators in Python – Key Takeaways

In this tutorial you saw how chaining together multiple iterators let’s you write highly efficient data processing “pipelines.” This is another great feature of iterators in Python:

  • Generators can be chained together to form highly efficient and maintainable data processing pipelines.
  • Chained generators process each element going through the chain individually.
  • Generator expressions can be used to write concise pipeline definitions, but this can impact readability.

<strong><em>Improve Your Python</em></strong> with a fresh 🐍 <strong>Python Trick</strong> 💌 every couple of days

Improve Your Python with a fresh 🐍 Python Trick 💌 every couple of days

🔒 No spam ever. Unsubscribe any time.

This article was filed under: python.

Related Articles:
  • What Are Python Generators? – Generators are a tricky subject in Python. With this tutorial you’ll make the leap from class-based iterators to using generator functions and the “yield” statement in no time.
  • Python Iterators: A Step-By-Step Introduction – Understanding iterators is a milestone for any serious Pythonista. With this step-by-step tutorial you’ll understanding class-based iterators in Python, completely from scratch.
  • Generator Expressions in Python: An Introduction – Generator expressions are a high-performance, memory–efficient generalization of list comprehensions and generators. In this tutorial you’ll learn how to use them from the ground up.
  • Make your Python code more readable with custom exception classes – In this short screencast I’ll walk you through a simple code example that demonstrates how you can use custom exception classes in your Python code to make it easier to understand, easier to debug, and more maintainable.
  • Using get() to return a default value from a Python dict – Python’s dictionaries have a “get” method to look up a key while providing a fallback value. This short screencast tutorial gives you a real-world example where this might come in handy.
What the Virtualenv?!

What the Virtualenv?!
See how to avoid common Python packaging pitfalls with this free email course:
» Click here to get the first lesson

Latest Articles:
Python T-Shirts &amp; Hoodies

Python T-Shirts & Hoodies
Your Coworkers Wil Be Mad With Envy
» Browse Python Apparel at Nerdlettering.com

← Browse All Articles