Records, Structs, and Data Transfer Objects in Python
How to implement records, structs, and “plain old data objects” in Python using only built-in data types and classes from the standard library.
Compared to arrays, record data structures provide a fixed number of fields, each field can have a name, and may have a different type.
I’m using the definition of a “record” loosely in this article. For example, I’m also going to discuss types like Python’s built-in tuple
that may or may not be considered “records” in a strict sense because they don’t provide named fields.
Python provides several data types you can use to implement records, structs, and data transfer objects. In this article you’ll get a quick look at each implementation and its unique characteristics. At the end you’ll find a summary and a decision making guide that will help you make your own pick.
Alright, let’s get started:
✅ The dict
Built-in
Python dictionaries store an arbitrary number of objects, each identified by a unique key. Dictionaries are often also called “maps” or “associative arrays” and allow the efficient lookup, insertion, and deletion of any object associated with a given key.
Using dictionaries as a record data type or data object in Python is possible. Dictionaries are easy to create in Python as they have their own syntactic sugar built into the language in the form of dictionary literals. The dictionary syntax is concise and quite convenient to type.
Data objects created using dictionaries are mutable and there’s little protection against misspelled field names, as fields can be added and removed freely at any time. Both of these properties can introduce surprising bugs and there’s always a trade-off to be made between convenience and error resilience.
car1 = { 'color': 'red', 'mileage': 3812.4, 'automatic': True, } car2 = { 'color': 'blue', 'mileage': 40231.0, 'automatic': False, } # Dicts have a nice repr: >>> car2 {'color': 'blue', 'automatic': False, 'mileage': 40231.0} # Get mileage: >>> car2['mileage'] 40231.0 # Dicts are mutable: >>> car2['mileage'] = 12 >>> car2['windshield'] = 'broken' >>> car2 {'windshield': 'broken', 'color': 'blue', 'automatic': False, 'mileage': 12} # No protection against wrong field names, # or missing/extra fields: car3 = { 'colr': 'green', 'automatic': False, 'windshield': 'broken', }
✅ The tuple
Built-in
Python’s tuples are a simple data structure for grouping arbitrary objects. Tuples are immutable—they cannot be modified once they’ve been created.
Performancewise, tuples take up slightly less memory than lists in CPython and they’re faster to construct at instantiation time. As you can see in the bytecode disassembly below, constructing a tuple constant takes a single LOAD_CONST
opcode while constructing a list object with the same contents requires several more operations:
>>> import dis >>> dis.dis(compile("(23, 'a', 'b', 'c')", '', 'eval')) 1 0 LOAD_CONST 4 ((23, 'a', 'b', 'c')) 3 RETURN_VALUE >>> dis.dis(compile("[23, 'a', 'b', 'c']", '', 'eval')) 1 0 LOAD_CONST 0 (23) 3 LOAD_CONST 1 ('a') 6 LOAD_CONST 2 ('b') 9 LOAD_CONST 3 ('c') 12 BUILD_LIST 4 15 RETURN_VALUE
However you shouldn’t place too much emphasis on these differences. In practice the performance difference will often be negligible and trying to squeeze out extra performance out of a program by switching from lists to tuples will likely be the wrong approach.
A potential downside of plain tuples is that the data you store in them can only be pulled out by accessing it through integer indexes. You can’t give names to individual properties stored in a tuple. This can impact code readability.
Also, a tuple is always an ad-hoc structure. It’s difficult to ensure that two tuples have the same number of fields and the same properties stored on them.
This makes it easy to introduce “slip-of-the-mind” bugs by mixing up the field order, for example. Therefore I would recommend you keep the number of fields stored in a tuple as low as possible.
# Fields: color, mileage, automatic car1 = ('red', 3812.4, True) car2 = ('blue', 40231.0, False) # Tuple instances have a nice repr: >>> car1 ('red', 3812.4, True) >>> car2 ('blue', 40231.0, False) # Get mileage: >>> car2[1] 40231.0 # Tuples are immutable: >>> car2[1] = 12 TypeError: "'tuple' object does not support item assignment" # No protection against missing/extra fields # or a wrong order: >>> car3 = (3431.5, 'green', True, 'silver')
✅ Writing a Custom Class
Classes allow you to define reusable “blueprints” for data objects to ensure each object provides the same set of fields.
Using regular Python classes as record data types is feasible, but it also takes manual work to get the convenience features of other implementations. For example, adding new fields to the __init__
constructor is verbose and takes time.
Also, the default string representation for objects instantiated from custom classes is not very helpful. To fix that you may have to add your own __repr__
method, which again is usually quite verbose and must be updated every time you add a new field.
Fields stored on classes are mutable and new fields can be added freely, which may or may not be what you intend. It’s possible to provide more access control and to create read-only fields using the @property decorator, but this requires writing more glue code.
Writing a custom class is a great option whenever you’d like to add business logic and behavior to your record objects using methods. But this means these objects are technically no longer plain data objects.
class Car: def __init__(self, color, mileage, automatic): self.color = color self.mileage = mileage self.automatic = automatic car1 = Car('red', 3812.4, True) car2 = Car('blue', 40231.0, False) # Get the mileage: >>> car2.mileage 40231.0 # Classes are mutable: >>> car2.mileage = 12 >>> car2.windshield = 'broken' # String representation is not very useful # (must add a manually written __repr__ method): >>> car1 <Car object at 0x1081e69e8>
✅ The collections.namedtuple Class
The namedtuple
class available in Python 2.6+ provides an extension of the built-in tuple
data type. Similarly to defining a custom class, using namedtuple
allows you to define reusable “blueprints” for your records that ensure the correct field names are used.
Namedtuples are immutable just like regular tuples. This means you cannot add new fields or modify existing fields after the namedtuple instance was created.
Besides that, namedtuples are, well…named tuples. Each object stored in them can be accessed through a unique identifier. This frees you from having to remember integer indexes, or resorting to workarounds like defining integer constants as mnemonics for your indexes.
Namedtuple objects are implemented as regular Python classes internally. When it comes to memory usage they are also “better” than regular classes and just as memory efficient as regular tuples:
>>> from collections import namedtuple >>> from sys import getsizeof >>> p1 = namedtuple('Point', 'x y z')(1, 2, 3) >>> p2 = (1, 2, 3) >>> getsizeof(p1) 72 >>> getsizeof(p2) 72
Namedtuples can be an easy way to clean up your code and to make it more readable by enforcing a better structure for your data.
I find that going from ad-hoc data types like dictionaries with a fixed format to namedtuples helps me express the intent of my code more clearly. Often when I apply this refactoring I magically come up with a better solution for the problem I’m facing.
Using namedtuples over unstructured tuples and dicts can also make my coworkers’ lives easier because namedtuples make the data passed around “self-documenting”, at least to a degree.
For more information and code examples, check out my tutorial on namedtuples here on dbader.org.
from collections import namedtuple Car = namedtuple('Car' , 'color mileage automatic') car1 = Car('red', 3812.4, True) # Instances have a nice repr: >>> car1 Car(color='red', mileage=3812.4, automatic=True) # Accessing fields >>> car1.mileage 3812.4 # Fields are immtuable: >>> car1.mileage = 12 AttributeError: "can't set attribute" >>> car1.windshield = 'broken' AttributeError: "'Car' object has no attribute 'windshield'"
✅ The typing.NamedTuple Class
This class added in Python 3.6 is the younger sibling of collections.namedtuple
. It is very similar to namedtuple
, the main difference is an updated syntax for defining new record types and added support for type hints.
Please note that type annotations are not enforced without a separate type checking tool like mypy—but even without tool support they can provide useful hints to other programmers (or be terribly confusing if the type hints get out of date.)
from typing import NamedTuple class Car(NamedTuple): color: str mileage: float automatic: bool car1 = Car('red', 3812.4, True) # Instances have a nice repr >>> car1 Car(color='red', mileage=3812.4, automatic=True) # Accessing fields >>> car1.mileage 3812.4 # Fields are immutable >>> car1.mileage = 12 AttributeError: "can't set attribute" >>> car1.windshield = 'broken' AttributeError: "'Car' object has no attribute 'windshield'" # Type annotations are not enforced without # a separate type checking tool like mypy: >>> Car('red', 'NOT_A_FLOAT', 99) Car(color='red', mileage='NOT_A_FLOAT', automatic=99)
⚠️ The struct.Struct Class
This class performs conversions between Python values and C structs serialized into Python bytes
objects. It can be used to handle binary data stored in files or from network connections, for example.
Structs are defined using a format strings-like mini language that allows you to define the arrangement of various C data types, like char
, int
, and long
, as well as their unsigned
variants.
The struct
module is seldom used to represent data objects that are meant to be handled purely inside Python code. They’re intended primarily as a data exchange format, rather than a way of holding data in memory that’s only used by Python code.
In some cases packing primitive data into structs may use less memory than keeping it in other data types—but that would be a quite advanced (and probably unnecessary) optimization.
from struct import Struct MyStruct = Struct('i?f') data = MyStruct.pack(23, False, 42.0) # All you get is a blob of data: >>> data b'\x17\x00\x00\x00\x00\x00\x00\x00\x00\x00(B' # Data blobs can be unpacked again: >>> MyStruct.unpack(data) (23, False, 42.0)
⚠️ The types.SimpleNamespace Class
Here’s one more “esoteric” choice for implementing data objects in Python. This class was added in Python 3.3 and it provides attribute access to its namespace. It also includes a meaningful __repr__
by default.
As its name proclaims, SimpleNamespace
is simple—it’s basically a glorified dictionary that allows attribute access and prints nicely. Attributes can be added, modified, and deleted freely.
from types import SimpleNamespace car1 = SimpleNamespace(color='red', mileage=3812.4, automatic=True) # The default repr: >>> car1 namespace(automatic=True, color='red', mileage=3812.4) # Instances are mutable >>> car1.mileage = 12 >>> car1.windshield = 'broken' >>> del car1.automatic >>> car1 namespace(color='red', mileage=12, windshield='broken')
Which type should I use for data objects in Python?
As you’ve seen there’s quite a number of different options to implement records or data objects in Python. Generally your decision will depend on your use case:
- You only have a few (2-3) fields: Using a plain tuple object may be okay because the field order is easy to remember or field names are superfluous. For example, think of an
(x, y, z)
point in 3D space. - You need immutable fields: In this case plain tuples,
collections.namedtuple
,typing.NamedTuple
would all make good options for implementing this type of data object. - You need to lock down field names to avoid typos:
collections.namedtuple
andtyping.NamedTuple
are your friends. - You want to keep things simple: A plain dictionary object might be a good choice due to the convenient syntax that closely resembles JSON.
- You need full control over your data structure: It’s time to write a custom class with
@property
setters and getters. - You need to add behavior (methods) to the object: You should write a custom class. Either from scratch or by extending
collections.namedtuple
ortyping.NamedTuple
. - You need to pack data tightly to serialize it to disk or send it over the network: Time to bust out
struct.Struct
, this is a great use case for it.
If you’re looking for a safe default choice, my general recommendation for implementing a plain record, struct, or data object in Python would be to:
- use
collections.namedtuple
in Python 2.x; and - its younger sibling
typing.NamedTuple
in Python 3.
Read the full “Fundamental Data Structures in Python” article series here. This article is missing something or you found an error? Help a brother out and leave a comment below.