How to estimate disk space.
A few years ago while interviewing Nelson Elhage for Staff Engineer, he mentioned “estimation” as a particularly valuable skill. I hadn’t thought much about the idea of estimation as a skill, but his comment made me remember one of the best architecture interviews I’ve ever given, where the candidate was able to significantly narrow down the possible solutions by asking for a few details (as best I can remember: queries per second, expected number of rows, and necessary columns).
Historically, I’ve never been particularly good at estimating disk space, so I decided to spend a few hours working on that skill, which has turned into these notes that I hope will be helpful for others looking to improve their estimation as well. I’ll start by sharing a few useful rules for estimating disk space, then pull together a SQLite3 snippet to demonstrate validating an estimate.
Kilobytes (1000 bytes) and Kibibytes (1024 bytes)
The first point of confusion when estimating sizes is whether a kilobyte has 1024 bytes or 1000 bytes. The “correct answer” is that a “kilobyte” (kB) is 1000 bytes and a “kibibyte” (KiB) is 1024 bytes. This distinction continues in other units, e.g. megabyte (MB, 1000^2 bytes) versus mebibyte (MiB, 1024^2 bytes), and gigabyte (GB, 1000^3 bytes) versus gibibyte (GiB, 1024^3 bytes).
The three key things here are: (1) understanding both the kilobyte and kibibyte units, (2) recognizing that someone you’re communicating with may not be familiar with the kibibyte family of units, and (3) adapting your approach to what they’re familiar with. In real life there are no points for being technically correct.
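As a quick sanity check, the two unit families are easy to compare in code. This is a minimal sketch; the helper names are my own, for illustration:

```python
def to_kilobytes(n_bytes):
    # Decimal (SI) units: 1 kB = 1000 bytes
    return n_bytes / 1000

def to_kibibytes(n_bytes):
    # Binary units: 1 KiB = 1024 bytes
    return n_bytes / 1024

one_gigabyte = 1000 ** 3  # 1 GB in bytes
one_gibibyte = 1024 ** 3  # 1 GiB in bytes
# The gap widens with larger units: a gibibyte is ~7.4% bigger than a gigabyte
```

Notice that the discrepancy between the two families grows as the units get larger, which is why it matters more for terabyte-scale estimates than kilobyte-scale ones.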
Sizing UTF-8, ints, and so on
While there are many data types worth being familiar with, the most common ones are:
UTF-8 characters are encoded in 1-4 bytes. Whether you should estimate space usage at 1, 2, 3 or 4 bytes depends on what you’re encoding. For example, if you’re only encoding ASCII values, then 1 byte is sufficient. What’s most important is being able to explain your rationale. For example, saying that you’re taking the most conservative case and using 4 bytes per character is reasonable.
Using UTF-8 sizes as a baseline, you can estimate the size of string, varchar, text, and other similar fields. The precise overhead of those fields will vary depending on the specific technology being used to store the data.
integers depend on the maximum value you need to store, but a good baseline is 4 bytes. If you want to be more precise, figure out the maximum value you need to represent, and from there you can determine the number of bytes necessary to represent it by calculating
maxRepresentableIntegerInNBytes(N) = 2 ^ (N bytes * 8 bits) - 1
For example, the largest number representable in 2 bytes is
2 ^ (2 bytes * 8 bits) - 1 = 65535
floats are typically stored in 4 bytes
booleans are often represented as 1 byte integers (e.g. in MySQL)
enums are often represented as a 2 byte integers (e.g. in MySQL)
datetimes are often represented in 5 bytes (e.g. in MySQL)
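The rules above are easy to verify directly in Python; the `max_int` helper here is my own, not a library function:

```python
import struct

# UTF-8 uses 1-4 bytes depending on the character
assert len("a".encode("utf-8")) == 1    # ASCII
assert len("é".encode("utf-8")) == 2    # Latin-1 supplement
assert len("€".encode("utf-8")) == 3    # currency symbol
assert len("🚀".encode("utf-8")) == 4   # emoji

# Largest integer representable in N bytes
def max_int(n_bytes):
    return 2 ** (n_bytes * 8) - 1

assert max_int(1) == 255
assert max_int(2) == 65535
assert max_int(4) == 4294967295

# struct confirms the fixed widths of common machine types
assert struct.calcsize("<i") == 4   # 4-byte integer
assert struct.calcsize("<f") == 4   # 4-byte float
```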
Equipped with these rules, let’s do a practice run at estimating a database’s size. Imagine we have 10,000 people represented in our database. We track each person’s age and name. The average name is 25 characters long, and maximum age we want to support is 125. How much space should this take?
bytesPerName = 25 characters * 4 bytes = 100 bytes
bytesPerAge = 1 byte  # because 2^(1 byte * 8 bits) - 1 = 255
bytesPerRow = 100 bytes + 1 byte = 101 bytes
totalBytes = 101 bytes * 10,000 rows
totalKiloBytes = (101 * 10000) / 1000           # 1,010 kB
totalMegaBytes = (101 * 10000) / (1000 * 1000)  # 1.01 MB
So, about 1.01 MB to store this. Alternatively, this is 0.96 MiB, calculated via:
(101 * 10000) / (1024 * 1024) # 0.96 MiB
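That arithmetic generalizes into a small helper; the function name and signature here are my own, for illustration:

```python
def estimate_size(rows, bytes_per_row):
    """Return the estimated table size as a (MB, MiB) pair."""
    total_bytes = rows * bytes_per_row
    return total_bytes / 1000 ** 2, total_bytes / 1024 ** 2

# 25-character name at 4 bytes per character, plus a 1-byte age
mb, mib = estimate_size(10_000, 25 * 4 + 1)
# mb is ~1.01, mib is ~0.96
```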
You can now estimate the general size of datasets.
Indexes, replication, etc
There’s a gap between the theoretical cost of storing data and the actual cost of storing data in a database. You might be using a replica-based tool like MongoDB or Cassandra that stores three copies of each piece of data. You might be using a primary-secondary replication system that stores two copies of each piece of data. The storage impact is pretty easy to calculate here (respectively, 3x or 2x the base cost).
Indexes offer a classic tradeoff between additional storage and reduced query times, but exactly how much storage they consume depends on the specifics of the index itself. As a simple approach to sizing an index, determine the size of the columns included in the index, multiply that by the number of rows indexed, and add that total to the underlying storage cost of the data itself. If you create an index including every field, then roughly estimate twice the total storage cost.
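Here is a sketch of how those multipliers stack, reusing the numbers from the earlier example; the three-way replication factor and the single name index are assumptions for illustration:

```python
rows = 10_000
base_bytes = rows * 101        # data cost from the earlier estimate
index_bytes = rows * (25 * 4)  # index over the 100-byte name column
replicas = 3                   # e.g. a store keeping three copies of each row

total_bytes = (base_bytes + index_bytes) * replicas
# (1,010,000 + 1,000,000) * 3 = 6,030,000 bytes, roughly 6 MB
```

A 1 MB dataset quietly becoming 6 MB is exactly the kind of gap between theoretical and actual storage cost that this section is about.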
Depending on the particular database being used, there will be other features that take up space as well, and truly understanding their impact on the size will require understanding the particular database more deeply. The best way to accomplish that is to validate sizing directly with that database.
Validating with SQLite3
The good news is that it’s relatively easy to validate data sizing, which we’ll do now using Python and SQLite3. We’ll start by recreating the above estimation of 10,000 rows each containing a 25 character name and an age.
import uuid
import random
import sqlite3


def generate_users(path, rows):
    db = sqlite3.connect(path)
    cursor = db.cursor()
    cursor.execute("drop table if exists users")
    cursor.execute("create table users (name, age)")
    for i in range(rows):
        name = str(uuid.uuid4())[:25]
        age = random.randint(0, 125)
        cursor.execute("insert into users values (?, ?)", (name, age))
    db.commit()
    db.close()


if __name__ == "__main__":
    generate_users("users.db", 10000)
Earlier we’d estimated this as 0.96 MiB, but running this script I see it’s only 344 KiB, just over a third of the expected space. Debugging our calculation a bit, we can see that we assumed 4 bytes per character, but the names we’re generating (truncated UUID4s) are all ASCII characters, so they actually take 1 byte per character. Let’s re-estimate based on that:
bytesPerName = 25 characters * 1 byte = 25 bytes
bytesPerAge = 1 byte
bytesPerRow = 26 bytes
totalKibiBytes = (26 * 10,000) / 1024  # 254 KiB
Alright, that’s fairly close assuming there’s some overhead, which there certainly is. For example, SQLite3 transparently creates a “rowid” column to use as the primary key, which is a 64-bit integer, requiring up to 8 bytes to represent. If we add those 8 bytes to our previous estimate of 26 bytes per row, then we get an estimated size of 332 KiB, which is comfortably close to the measured 344 KiB.
Go forth and estimate sizes
Estimating the size of data is a relatively straightforward skill that’s both (a) easy to never learn and (b) quite useful once you’ve learned it. It can be useful when architecting systems, reasoning through debugging a complex distributed system problem, and certainly in discussing an architecture problem in an interview. Some of the useful distinctions that disk space estimation can help you answer:
- Can it fit in memory?
- Can it fit on one server with an SSD?
- Does this data need to be sharded across many servers? How many?
- Can this index fit on one server?
- If not, how will you partition the index properly?
- Etc, etc
Even after using these techniques for some time, I continue to find it surprising how effectively they can constrain a solution space.