Python Training by Dan Bader

Records, Structs, and Data Transfer Objects in Python

How to implement records, structs, and “plain old data objects” in Python using only built-in data types and classes from the standard library.

Compared to arrays, record data structures provide a fixed number of fields, each field can have a name, and may have a different type.

I’m using the definition of a “record” loosely in this article. For example, I’m also going to discuss types like Python’s built-in tuple that may or may not be considered “records” in a strict sense because they don’t provide named fields.

Python provides several data types you can use to implement records, structs, and data transfer objects. In this article you’ll get a quick look at each implementation and its unique characteristics. At the end you’ll find a summary and a decision making guide that will help you make your own pick.

Alright, let’s get started:

✅ The dict Built-in

Python dictionaries store an arbitrary number of objects, each identified by a unique key. Dictionaries are often also called “maps” or “associative arrays” and allow the efficient lookup, insertion, and deletion of any object associated with a given key.

Using dictionaries as a record data type or data object in Python is possible. Dictionaries are easy to create in Python as they have their own syntactic sugar built into the language in the form of dictionary literals. The dictionary syntax is concise and quite convenient to type.

Data objects created using dictionaries are mutable and there’s little protection against misspelled field names, as fields can be added and removed freely at any time. Both of these properties can introduce surprising bugs and there’s always a trade-off to be made between convenience and error resilience.

car1 = {
    'color': 'red',
    'mileage': 3812.4,
    'automatic': True,
}
car2 = {
    'color': 'blue',
    'mileage': 40231.0,
    'automatic': False,
}

# Dicts have a nice repr:
>>> car2
{'color': 'blue', 'automatic': False, 'mileage': 40231.0}

# Get mileage:
>>> car2['mileage']
40231.0

# Dicts are mutable:
>>> car2['mileage'] = 12
>>> car2['windshield'] = 'broken'
>>> car2
{'windshield': 'broken', 'color': 'blue',
 'automatic': False, 'mileage': 12}

# No protection against wrong field names,
# or missing/extra fields:
car3 = {
    'colr': 'green',
    'automatic': False,
    'windshield': 'broken',
}

✅ The tuple Built-in

Python’s tuples are a simple data structure for grouping arbitrary objects. Tuples are immutable—they cannot be modified once they’ve been created.

Performancewise, tuples take up slightly less memory than lists in CPython and they’re faster to construct at instantiation time. As you can see in the bytecode disassembly below, constructing a tuple constant takes a single LOAD_CONST opcode while constructing a list object with the same contents requires several more operations:

>>> import dis
>>> dis.dis(compile("(23, 'a', 'b', 'c')", '', 'eval'))
  1       0 LOAD_CONST           4 ((23, 'a', 'b', 'c'))
          3 RETURN_VALUE

>>> dis.dis(compile("[23, 'a', 'b', 'c']", '', 'eval'))
  1       0 LOAD_CONST           0 (23)
          3 LOAD_CONST           1 ('a')
          6 LOAD_CONST           2 ('b')
          9 LOAD_CONST           3 ('c')
         12 BUILD_LIST           4
         15 RETURN_VALUE

However you shouldn’t place too much emphasis on these differences. In practice the performance difference will often be negligible and trying to squeeze out extra performance out of a program by switching from lists to tuples will likely be the wrong approach.

A potential downside of plain tuples is that the data you store in them can only be pulled out by accessing it through integer indexes. You can’t give names to individual properties stored in a tuple. This can impact code readability.

Also, a tuple is always an ad-hoc structure. It’s difficult to ensure that two tuples have the same number of fields and the same properties stored on them.

This makes it easy to introduce “slip-of-the-mind” bugs by mixing up the field order, for example. Therefore I would recommend you keep the number of fields stored in a tuple as low as possible.

# Fields: color, mileage, automatic
car1 = ('red', 3812.4, True)
car2 = ('blue', 40231.0, False)

# Tuple instances have a nice repr:
>>> car1
('red', 3812.4, True)
>>> car2
('blue', 40231.0, False)

# Get mileage:
>>> car2[1]
40231.0

# Tuples are immutable:
>>> car2[1] = 12
TypeError: "'tuple' object does not support item assignment"

# No protection against missing/extra fields
# or a wrong order:
>>> car3 = (3431.5, 'green', True, 'silver')

✅ Writing a Custom Class

Classes allow you to define reusable “blueprints” for data objects to ensure each object provides the same set of fields.

Using regular Python classes as record data types is feasible, but it also takes manual work to get the convenience features of other implementations. For example, adding new fields to the __init__ constructor is verbose and takes time.

Also, the default string representation for objects instantiated from custom classes is not very helpful. To fix that you may have to add your own __repr__ method, which again is usually quite verbose and must be updated every time you add a new field.

Fields stored on classes are mutable and new fields can be added freely, which may or may not be what you intend. It’s possible to provide more access control and to create read-only fields using the @property decorator, but this requires writing more glue code.

Writing a custom class is a great option whenever you’d like to add business logic and behavior to your record objects using methods. But this means these objects are technically no longer plain data objects.

class Car:
    def __init__(self, color, mileage, automatic):
        self.color = color
        self.mileage = mileage
        self.automatic = automatic

car1 = Car('red', 3812.4, True)
car2 = Car('blue', 40231.0, False)

# Get the mileage:
>>> car2.mileage
40231.0

# Classes are mutable:
>>> car2.mileage = 12
>>> car2.windshield = 'broken'

# String representation is not very useful
# (must add a manually written __repr__ method):
>>> car1
<Car object at 0x1081e69e8>

✅ The collections.namedtuple Class

The namedtuple class available in Python 2.6+ provides an extension of the built-in tuple data type. Similarly to defining a custom class, using namedtuple allows you to define reusable “blueprints” for your records that ensure the correct field names are used.

Namedtuples are immutable just like regular tuples. This means you cannot add new fields or modify existing fields after the namedtuple instance was created.

Besides that, namedtuples are, well…named tuples. Each object stored in them can be accessed through a unique identifier. This frees you from having to remember integer indexes, or resorting to workarounds like defining integer constants as mnemonics for your indexes.

Namedtuple objects are implemented as regular Python classes internally. When it comes to memory usage they are also “better” than regular classes and just as memory efficient as regular tuples:

>>> from collections import namedtuple
>>> from sys import getsizeof

>>> p1 = namedtuple('Point', 'x y z')(1, 2, 3)
>>> p2 = (1, 2, 3)

>>> getsizeof(p1)
72
>>> getsizeof(p2)
72

Namedtuples can be an easy way to clean up your code and to make it more readable by enforcing a better structure for your data.

I find that going from ad-hoc data types like dictionaries with a fixed format to namedtuples helps me express the intent of my code more clearly. Often when I apply this refactoring I magically come up with a better solution for the problem I’m facing.

Using namedtuples over unstructured tuples and dicts can also make my coworkers’ lives easier because namedtuples make the data passed around “self-documenting”, at least to a degree.

For more information and code examples, check out my tutorial on namedtuples here on dbader.org.

from collections import namedtuple

Car = namedtuple('Car' , 'color mileage automatic')

car1 = Car('red', 3812.4, True)

# Instances have a nice repr:
>>> car1
Car(color='red', mileage=3812.4, automatic=True)

# Accessing fields
>>> car1.mileage
3812.4

# Fields are immtuable:
>>> car1.mileage = 12
AttributeError: "can't set attribute"
>>> car1.windshield = 'broken'
AttributeError: "'Car' object has no attribute 'windshield'"

✅ The typing.NamedTuple Class

This class added in Python 3.6 is the younger sibling of collections.namedtuple. It is very similar to namedtuple, the main difference is an updated syntax for defining new record types and added support for type hints.

Please note that type annotations are not enforced without a separate type checking tool like mypy—but even without tool support they can provide useful hints to other programmers (or be terribly confusing if the type hints get out of date.)

from typing import NamedTuple

class Car(NamedTuple):
    color: str
    mileage: float
    automatic: bool

car1 = Car('red', 3812.4, True)

# Instances have a nice repr
>>> car1
Car(color='red', mileage=3812.4, automatic=True)

# Accessing fields
>>> car1.mileage
3812.4

# Fields are immutable
>>> car1.mileage = 12
AttributeError: "can't set attribute"
>>> car1.windshield = 'broken'
AttributeError: "'Car' object has no attribute 'windshield'"

# Type annotations are not enforced without
# a separate type checking tool like mypy:
>>> Car('red', 'NOT_A_FLOAT', 99)
Car(color='red', mileage='NOT_A_FLOAT', automatic=99)

⚠️ The struct.Struct Class

This class performs conversions between Python values and C structs serialized into Python bytes objects. It can be used to handle binary data stored in files or from network connections, for example.

Structs are defined using a format strings-like mini language that allows you to define the arrangement of various C data types, like char, int, and long, as well as their unsigned variants.

The struct module is seldom used to represent data objects that are meant to be handled purely inside Python code. They’re intended primarily as a data exchange format, rather than a way of holding data in memory that’s only used by Python code.

In some cases packing primitive data into structs may use less memory than keeping it in other data types—but that would be a quite advanced (and probably unnecessary) optimization.

from struct import Struct

MyStruct = Struct('i?f')

data = MyStruct.pack(23, False, 42.0)

# All you get is a blob of data:
>>> data
b'\x17\x00\x00\x00\x00\x00\x00\x00\x00\x00(B'

# Data blobs can be unpacked again:
>>> MyStruct.unpack(data)
(23, False, 42.0)

⚠️ The types.SimpleNamespace Class

Here’s one more “esoteric” choice for implementing data objects in Python. This class was added in Python 3.3 and it provides attribute access to its namespace. It also includes a meaningful __repr__ by default.

As its name proclaims, SimpleNamespace is simple—it’s basically a glorified dictionary that allows attribute access and prints nicely. Attributes can be added, modified, and deleted freely.

from types import SimpleNamespace
car1 = SimpleNamespace(color='red', mileage=3812.4, automatic=True)

# The default repr:
>>> car1
namespace(automatic=True, color='red', mileage=3812.4)

# Instances are mutable
>>> car1.mileage = 12
>>> car1.windshield = 'broken'
>>> del car1.automatic
>>> car1
namespace(color='red', mileage=12, windshield='broken')

Which type should I use for data objects in Python?

As you’ve seen there’s quite a number of different options to implement records or data objects in Python. Generally your decision will depend on your use case:

  • You only have a few (2-3) fields: Using a plain tuple object may be okay because the field order is easy to remember or field names are superfluous. For example, think of an (x, y, z) point in 3D space.
  • You need immutable fields: In this case plain tuples, collections.namedtuple, typing.NamedTuple would all make good options for implementing this type of data object.
  • You need to lock down field names to avoid typos: collections.namedtuple and typing.NamedTuple are your friends.
  • You want to keep things simple: A plain dictionary object might be a good choice due to the convenient syntax that closely resembles JSON.
  • You need full control over your data structure: It’s time to write a custom class with @property setters and getters.
  • You need to add behavior (methods) to the object: You should write a custom class. Either from scratch or by extending collections.namedtuple or typing.NamedTuple.
  • You need to pack data tightly to serialize it to disk or send it over the network: Time to bust out struct.Struct, this is a great use case for it.

If you’re looking for a safe default choice, my general recommendation for implementing a plain record, struct, or data object in Python would be to:

  • use collections.namedtuple in Python 2.x; and
  • its younger sibling typing.NamedTuple in Python 3.

Read the full “Fundamental Data Structures in Python” article series here. This article is missing something or you found an error? Help a brother out and leave a comment below.

<strong><em>Improve Your Python</em></strong> with a fresh 🐍 <strong>Python Trick</strong> 💌 every couple of days

Improve Your Python with a fresh 🐍 Python Trick 💌 every couple of days

🔒 No spam ever. Unsubscribe any time.

This article was filed under: data-structures, programming, and python.

Related Articles:

📘 Python Tricks: The Book — A Buffet of Awesome Python Features: My new book that teaches you the coolest aspects of Python with short and easy to digest examples. » Click here to get a free sample chapter

Latest Articles:

🐍🐍 Peer-to-Peer Learning for Pythonistas PythonistaCafe is an invite-only, online community of Python and software development enthusiasts helping each other succeed and grow: » Learn more at pythonistacafe.com

← Browse All Articles