Representing Signed Ints as Bytes in Python

2021-11-27

TL;DR: Python doesn’t let you explicitly create signed integers, but the int.to_bytes() method allows you to specify a boolean signed argument, which defaults to False. I set up an experiment to test this, but could have just read the docs instead…

In retrospect, this post is more about the process than the final destination.

The Problem

I’m working through the Designing Data-Intensive Applications book (specifically a BitCask log-based key/value hash index implementation in Python) and I need to convert an integer to bytes.

More specifically, I want key size and value size to be represented by a fixed number of bytes on disk for each record. I’d like to know how many bytes are required to represent various lengths of key size and value size. Ideally, these boundaries should be communicated to anyone using this structure. They will be useful for internal error handling as well.
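
To make the goal concrete, here is a minimal sketch of what a fixed-width size header could look like. The record layout, the field widths (2 bytes for key size, 4 for value size), the big-endian byte order, and the function names are all assumptions made for illustration, not part of any real BitCask implementation. It leans on the fact that int.to_bytes() raises OverflowError when a value does not fit in the requested number of bytes.

```python
from typing import Tuple

# Hypothetical layout: [key size][value size][key bytes][value bytes]
KEY_SIZE_BYTES = 2    # assumed width of the key-size field
VALUE_SIZE_BYTES = 4  # assumed width of the value-size field
BYTE_ORDER = "big"    # assumed; a fixed order keeps files portable across machines


def encode_size(size: int, width: int) -> bytes:
    """Encode a length as exactly `width` bytes, or fail loudly."""
    try:
        return size.to_bytes(width, BYTE_ORDER)
    except OverflowError as exc:
        raise ValueError(f"size {size} does not fit in {width} bytes") from exc


def encode_record(key: bytes, value: bytes) -> bytes:
    """Prefix the key and value with their fixed-width sizes."""
    header = encode_size(len(key), KEY_SIZE_BYTES) + encode_size(len(value), VALUE_SIZE_BYTES)
    return header + key + value


def decode_record(record: bytes) -> Tuple[bytes, bytes]:
    """Read the two size fields, then slice out the key and value."""
    header_end = KEY_SIZE_BYTES + VALUE_SIZE_BYTES
    key_size = int.from_bytes(record[:KEY_SIZE_BYTES], BYTE_ORDER)
    value_size = int.from_bytes(record[KEY_SIZE_BYTES:header_end], BYTE_ORDER)
    key = record[header_end:header_end + key_size]
    value = record[header_end + key_size:header_end + key_size + value_size]
    return key, value


record = encode_record(b"name", b"alice")
print(decode_record(record))  # (b'name', b'alice')
```

Exactly where those size limits fall for a given field width is the question the rest of this post works through.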

In summary, I wanted to know:

How to convert an int to bytes
What values (sizes) I can represent with n bytes

And my assumptions were:

ints in Python feel like signed ints by default, since they accept negative values
int.to_bytes() will also likely default to some form of signed int representation (spoiler alert, I was wrong here)

Structuring the Experiment

I wrote a function to determine the minimum number of bytes required to represent an integer. I’m not sure if there’s an easier way to do this with some built-in methods, but this worked well enough for me.

Then, I wrote a loop to repeatedly append a single byte to an initially empty byte string. On each iteration, I take the length of the byte string and pass it to my function to see how many bytes are required to represent that length.

I could have simply created an int range, but the byte-length method is more representative of the problem I’m ultimately wanting to solve and I may end up using this code for other experiments in the future.

I also could have tested the known boundaries for signed and unsigned ints (127, 128, 255, 256), but computers are fast and a loop felt simpler and more robust.

# (Python 3.8)

import math
import sys

def get_num_bytes_of_int(hashmod_key: int) -> int:
    """
    Given an integer, tell me the minimum number of
    bytes required to represent this integer.
    """

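    # bit_length() gives the bits needed for the magnitude; round up to whole bytes
    min_num_bytes = math.ceil(hashmod_key.bit_length() / 8)
    # to_bytes() raises OverflowError if the value does not fit in min_num_bytes
    return len(hashmod_key.to_bytes(length=min_num_bytes, byteorder=sys.byteorder))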


bytestring = b''
last_num_bytes_required = 0
while True:
    
    bytestring += b'a'
    len_bytestring = len(bytestring)
    num_bytes_required = get_num_bytes_of_int(len_bytestring)
    
    if num_bytes_required != last_num_bytes_required:
        print(f"Byte string of {len_bytestring} length requires {num_bytes_required} bytes.")
        last_num_bytes_required = num_bytes_required

    # setting a boundary / break point for my experiment
    if len_bytestring > 100_000:
        break

And these were the results from the terminal:

Byte string of 1 length requires 1 bytes.
Byte string of 256 length requires 2 bytes.
Byte string of 65536 length requires 3 bytes.

In other words, this experiment showed that one byte can represent an integer value up to 255, while two bytes can represent an integer value up to 65,535. These are the maximum values I would expect from unsigned integers. This is not what I expected, since the default int type in Python appears to be a signed int that accepts negative values.
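
For reference, those boundaries follow directly from the number of bits available. The sketch below computes the range for n bytes both ways, unsigned and two’s complement signed, so the experiment’s output can be compared against each. The helper names are made up for illustration.

```python
from typing import Tuple


def unsigned_range(num_bytes: int) -> Tuple[int, int]:
    """Smallest and largest values an unsigned int of this width can hold."""
    return 0, 2 ** (8 * num_bytes) - 1


def signed_range(num_bytes: int) -> Tuple[int, int]:
    """Smallest and largest values a two's complement int of this width can hold."""
    half = 2 ** (8 * num_bytes - 1)
    return -half, half - 1


for n in (1, 2, 3):
    print(n, unsigned_range(n), signed_range(n))
# 1 (0, 255) (-128, 127)
# 2 (0, 65535) (-32768, 32767)
# 3 (0, 16777215) (-8388608, 8388607)
```

Had to_bytes() been using a signed representation, the jump from one byte to two would have shown up at 128 rather than 256.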

The Plot Thickens

My next test involved simply executing to_bytes() on a negative int to see how it could possibly represent that value in bytes:

# Python REPL:

>>> import sys
>>> x = -1
>>> x.to_bytes(length=2, byteorder=sys.byteorder)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: can't convert negative int to unsigned

I was still confused at this point, but happy that this failed. Had this succeeded and represented the negative int as bytes, I would have had no idea what was going on under the hood. So then, how would you ever represent signed ints in bytes from within Python?

Conclusion

Let’s go read ~~a bunch of secondary sources in the form of tutorials and blog posts on this topic~~ the docs!

The signed argument determines whether two’s complement is used to represent the integer. If signed is False and a negative integer is given, an OverflowError is raised. The default value for signed is False.

There’s my answer. When converting ints to bytes in Python, the default is to treat the output bytes as unsigned.
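
Here is a short sketch of what opting in to the signed representation looks like, using the signed argument on both int.to_bytes() and int.from_bytes(). The byte order is fixed to "big" here purely so the printed bytes are the same on every machine; the variable names are illustrative.

```python
x = -1

# With signed=True, the value is encoded using two's complement.
encoded = x.to_bytes(2, "big", signed=True)
print(encoded)  # b'\xff\xff'

# Decoding must use the same signed setting to get the original value back.
decoded_signed = int.from_bytes(encoded, "big", signed=True)
print(decoded_signed)  # -1

# The same two bytes read as unsigned give a very different number.
decoded_unsigned = int.from_bytes(encoded, "big")
print(decoded_unsigned)  # 65535

# The signed range is narrower: 127 fits in one signed byte, 128 does not.
fits = (127).to_bytes(1, "big", signed=True)
try:
    (128).to_bytes(1, "big", signed=True)
    overflowed = False
except OverflowError:
    overflowed = True
print(fits, overflowed)  # b'\x7f' True
```

Since sizes and lengths are never negative, the unsigned default is a reasonable fit for the key size and value size fields described at the start of this post.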

Secondary Conclusion

Sometimes it’s a better learning exercise to write a quick test to explore a topic, and then go look at the docs. This interactive approach makes the docs more meaningful and helps solidify the content in your mind.
