On Monorepos vs Project repos

I’ve seen some talk about whether to keep everything in one codebase, vs having per-project repositories. The answer is very clear to me: Monolithic repos are a must, but Git submodules are functionally equivalent (as I’ll describe later); You should start with one repo, and then subdivide when you have clear submodules.

The importance of monolithic repos vs per project is not about performance, or even directly about organization of your code. Both are fairly clear. It’s about organization of your build system. Good builds are fully deterministic and idempotent, and that is very hard to achieve with a set of per-project repositories.

The unix standard has a very good layout for where code goes. Shared libraries go in /lib, /usr/lib, /usr/local/lib; Headers go in /usr/include or /usr/local/include. But this isn’t a structure for how to organize your code when writing it; It’s a build environment structure. It makes more sense when your entire organization is sharing just one unix machine and environment, but we’re well beyond that.

Because you’re aiming for determinism, you need to be sure that your build environment is the same each time. The traditional unix file structure is not good for this purpose, at least, not directly. That structure is not rebuilt every time; You copy files on top of it, you add and remove, but you don’t reset. You can accomplish a reset – Reimaging your build machine, or using a docker container. And, in fact, the latter is what many people have switched to. But that docker container is another layer of abstraction, another piece you need to manage for full determinism. You need to kill and rebuild it each build, and most solutions don’t. The run new builds in the same container repeatedly, creating uncertainty.

This is why I like Bazel so much. It removes much of the uncertainty, by rebuilding from scratch each time. It has a separate, well defined environment, that it manages and assures is in the correct state. It’s not magic; You can change things, break things, and fool it. But if you don’t touch it, it does *the right thing* and keeps your structure clean, without taking any risks of breaking your whole machine.

Bazel operates on a concept called a ‘workspace’. There’s not a whole lot to it; You pick an arbitrary root directory, and a flag file defines the workspace root. Everything underneath is considered one logical unit. If you’ve got a monolithic codebase, this is a no-brainer.

Git submodules complicate builds a little, but not much. Instead of saying “this build was built at commit A”, you need to know that it was built at commits (A, B, C). But you probably don’t really care to always know the full set of A, B, C; They may move somewhat independently, but your build infrastructure can and should simply serialize them; Keeping a mapping of simple, linear commit numbers to a tuple of commit hashes for each submodule. There’s one disadvantage – It makes ‘who broke the build’ a race condition, if two modules change at the same time. That is solved by a simple answer: Suck it up, both commiters should debug 🙂

Leave a Reply