Dependencies Belong in Version Control

November 25th, 2023

I believe that all project dependencies belong in version control. Source code, binary assets, third-party libraries, and even compiler toolchains. Everything.

The process of building any project should be trivial. Clone repo, invoke build command, and that's it. It shouldn't require a complex configure script, downloading Strawberry Perl, installing Conda, or any of that bullshit.

In fact, I'll go one step further. A user should be able to perform a clean OS install, download a zip of master, disconnect from the internet, and build. The build process shouldn't require installing any extra tools or content. If it's something the build needs then it belongs in version control.

First Instinct

Your gut reaction may be revulsion. That it's not possible. Or that it's unreasonable.

You're not totally wrong. If you're using Git for version control then committing ten gigabytes of cross-platform compiler toolchains is infeasible.

That doesn't change my claim. Dependencies do belong in version control. Even if it's not practical today due to Git's limitations. More on that later.

Why

Why do dependencies belong in version control? I'll give a few reasons.

  1. Usability
  2. Reliability
  3. Reproducibility
  4. Sustainability

Usability

Committing dependencies makes projects trivial to build and run. I have regularly failed to build open source projects and given up in a fit of frustrated rage.

My background is C++ gamedev. C++ infamously doesn't have a standard build system. Which means every project has its own bullshit build system, project generator, dependency manager, scripting runtimes, etc.

ML and GenAI projects are a god damned nightmare to build. They're so terrible to build that there are countless meta-projects that exist solely to provide one-click installers (example: EasyDiffusion). These installers are fragile and sometimes need to be run several times to succeed.

Commit your dependencies and everything "just works". My extreme frustration with trying, and failing, to build open source projects is what inspired this post.

Reliability

Have you ever had a build fail because of a network error on some third-party server? Commit your dependencies and that will never happen.

There's a whole class of problems that simply disappear when dependencies are committed. Builds won't break because of an OS update. Network errors don't exist. You eliminate "works on my machine" issues caused by someone not having the right version of CUDA installed.

Reproducibility

Builds are much easier to reproduce when version control contains everything. Great build systems are hermetic and allow for deterministic builds. This is only possible when your build doesn't depend on your system environment.

Lockfiles are only a partial solution to reproducibility: they pin library versions, but not the compiler, system packages, or runtime environment. Docker images are a poor man's VCS.

Sustainability

Committing dependencies makes it trivial to recreate old builds. God help you if you try to build a webdev stack from 2013.

In video games it's not uncommon to release old games on new platforms. These games can easily be 10 or 20 years old. How many modern projects will be easy to build in 20 years? Hell, how many will be easy to build in 5?

Commit your dependencies and ancient code bases will be as easy to rebuild as possible. Although new platforms will require new code, of course.

Proof of Life

To prove that this isn't completely crazy I built a proof of life C++ demo. My program is exceedingly simple:

#include <fmt/core.h>

int main() {
  fmt::print("Hello world from C++ πŸ‘‹\n");
  fmt::print("goodbye cruel world from C++ ☠️\n");
  return 0;
}

The folder structure looks like this:

\root
    \sample_cpp_app
        - main.cpp
    \thirdparty
        \fmt (3 MB)
    \toolchains
        \win
            \cmake (106 MB)
            \LLVM (2.5 GB)
            \mingw64 (577 MB)
            \ninja (570 KB)
            \Python311 (20.5 MB)
    - CMakeLists.txt
    - build.bat
    - build.py

The toolchains folder contains five dependencies - CMake, LLVM, MinGW64, Ninja, and Python 3.11. Their combined size is 3.19 gigabytes. No effort was made to trim these folders down in size.

The build.bat file nukes all environment variables and sets PATH=C:\Windows\System32;. This ensures only the included toolchains are used to compile.
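
For illustration, here's a minimal Python sketch of that idea: scrub the environment, then point the build at the committed toolchains. It's purely illustrative; the paths and flags are assumptions, not the demo's actual build.bat/build.py.

import os
import subprocess
from pathlib import Path

ROOT = Path(__file__).resolve().parent

# Use ONLY the committed toolchains plus the bare OS essentials.
env = {
    "PATH": os.pathsep.join([
        r"C:\Windows\System32",
        str(ROOT / "toolchains" / "win" / "cmake" / "bin"),
        str(ROOT / "toolchains" / "win" / "ninja"),
        str(ROOT / "toolchains" / "win" / "LLVM" / "bin"),
    ]),
    "SystemRoot": r"C:\Windows",
}

build_dir = ROOT / "build"
build_dir.mkdir(exist_ok=True)

# Point CMake at the committed compilers so nothing is picked up from the host.
subprocess.run(
    ["cmake", "-G", "Ninja",
     "-DCMAKE_C_COMPILER=clang", "-DCMAKE_CXX_COMPILER=clang++",
     str(ROOT)],
    cwd=build_dir, env=env, check=True)
subprocess.run(["ninja"], cwd=build_dir, env=env, check=True)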

The end result is a C++ project that "just works".

But Wait There's More

Here's where it gets fun. I wrote a Python script that scans the directory for "last file accessed" times to track "touched" files. This lets me check how many toolchain files are actually needed by the build. It produces this output:

Checking initial file access times... πŸ₯ΈπŸ‘¨β€πŸ”¬πŸ”¬

Building... πŸ‘·β€β™‚οΈπŸ’ͺπŸ› οΈ
Compile success! 😁

Checking new file access times... πŸ₯ΈπŸ‘¨β€πŸ”¬πŸ”¬
File Access Stats
    Touched 508 files. Total Size: 272.00 MB
    Untouched 23138 files. Total Size: 2.93 GB
    Touched 2.1% of files
    Touched 8.3% of bytes

Running program...
    Target exe: c:\temp\code\toolchain_vcs\bin\main.exe

Hello world from C++ πŸ‘‹
goodbye cruel world from C++ ☠️

Built and ran successfully! 😍

Well will you look at that!

Despite committing 3 gigabytes of toolchains, we only actually needed a mere 272 megabytes. Well under 10%! Even better, we touched just 2.1% of repo files.

The largest files touched were:

clang++.exe     [116.04 MB]
ld.lld.exe      [86.05 MB]
llvm-ar.exe     [28.97 MB]
cmake.exe       [11.26 MB]
libgcc.a        [5.79 MB]
libstdc++.dll.a [5.32 MB]
libmsvcrt.a     [2.00 MB]
libstdc++-6.dll [1.93 MB]
libkernel32.a   [1.27 MB]
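
If you're curious how the scan works, here's a simplified Python sketch of the approach. It's a sketch only, not the exact script from the demo, and it assumes the filesystem actually updates access times.

import os
from pathlib import Path

def snapshot_access_times(root: Path) -> dict[Path, float]:
    # Record the last-accessed time of every file under root.
    times = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = Path(dirpath) / name
            times[path] = path.stat().st_atime
    return times

before = snapshot_access_times(Path("toolchains"))
# ... run the build here ...
after = snapshot_access_times(Path("toolchains"))

touched = [p for p, atime in after.items() if atime > before.get(p, 0.0)]
touched_bytes = sum(p.stat().st_size for p in touched)
print(f"Touched {len(touched)} files. Total Size: {touched_bytes / 2**20:.2f} MB")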

My key takeaway is this: toolchain file sizes are tractable for version control if you can trim the fat.

This sparks my joy. Imagine cloning a repo, clicking build, and having it just work. What a wonderful and delightful world that would be!

A Vision for the Future

I'd like to paint a small dream for what I will call Next Gen Version Control Software (NGVCS). This is my vision for a Git/Perforce successor. The key features I want from NGVCS are lazy, on-demand file fetching and a system-wide, copy-on-write file cache.

Let's pretend for a moment that every open source project commits their dependencies. Each one contains a full copy of Python, Cuda, Clang, MSVC, libraries, etc. What would happen?

First, the user clones a random GenAI repo. This is near instantaneous as files are not prefetched. The user then invokes the build script. As files are accessed they're downloaded. The very first build may download a few hundred megabytes of data. Notably it does NOT download the entire repo. If the user is on Linux it won't download any binaries for macOS or Windows.

Second, the user clones another GenAI repo and builds. Does this need to re-download gigabytes of duplicated toolchain content? No! Both projects use NGVCS which has a system wide file cache. Since we're also using a copy-on-write file system these files instantly materialize in the second repo at zero cost.
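
To make that concrete, here's a toy Python sketch of the idea: a shared content-addressed store that materializes files into working copies by linking instead of re-downloading. This is purely illustrative and isn't how any existing tool works; the cache path and function names are made up.

import hashlib
import os
from pathlib import Path

CACHE = Path.home() / ".ngvcs_cache"   # hypothetical system-wide store

def materialize(repo_path: Path, data: bytes) -> None:
    # Store each unique blob once, keyed by content hash.
    CACHE.mkdir(parents=True, exist_ok=True)
    blob = CACHE / hashlib.sha256(data).hexdigest()
    if not blob.exists():
        blob.write_bytes(data)          # only the first repo pays the download cost
    repo_path.parent.mkdir(parents=True, exist_ok=True)
    # Hard-link into the working copy; a real system would use copy-on-write
    # clones (reflinks) so local edits can't corrupt the shared cache.
    if not repo_path.exists():
        os.link(blob, repo_path)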

The end result is beautiful. Every project is trivial to fetch, build, and run. And users only have to download the minimum set of files to do so.

The Real World and Counter Arguments

Hopefully I've convinced some of you that committing dependencies is at least a good idea in an ideal world.

Now let's consider the real world and a few counter arguments.

The Elephant in the Room - Git

Unfortunately I must admit that committing dependencies is not practical today. The problem is Git. One of my unpopular opinions is that Git isn't very good. Among its many sins is terrible support for large files and large repositories.

The root issue is that Git's architecture and default behavior expects all users to have a full copy of the entire repo history. Which means every version of every binary toolchain for every platform. Yikes!

There are various workarounds - Git LFS, Git Submodules, shallow clones, partial clones, etc. The problem is these aren't first-class features. They are, imho, second-class hacks. πŸ˜“

In theory Git could be updated to more properly support large projects. I believe Git should be shallow and partial by default. Almost all software projects are de facto centralized. Needing full history isn't the default, it's an edge case. Users should opt in to full history only if they need it.

Containers

An alternative to committing dependencies is to use containers. If you build out of a container you get most, if not all, of the benefits. You can even maintain an archive of docker images that reliably re-build tagged releases.

Congrats, you're now using Docker as your VCS!

My snarky opinion is that Docker and friends primarily exist because modern build systems are so god damned fragile that the only way to reliably build and deploy is to create a full OS image. This is insanity!

Containers shouldn't be required simply to build and run projects. It's embarrassing that this is the world we live in.

Licensing

Not all dependencies are authorized for redistribution. I believe MSVC and Xcode both disallow redistribution of compiler toolchains? Game consoles like Sony PlayStation and Nintendo Switch don't publicly release headers, libs, or compilers.

This is mostly ok. If you're working on a console project then you're already working on a closed source project. Developers already use permission controls to gate access.

The lack of redistribution rights for "normal" toolchains is annoying. However permissive options are available. If committing dependencies becomes common practice then I think it's likely that toolchain licenses will update to accommodate.

Updating Dependencies

Committing library dependencies to version control means they need to be updated. If you have lots of repos to update this could be a moderate pain in the ass.

This is also the opposite of how Linux works. In Linux land you use a hot mess of system libraries sprinkled chaotically across the search path. That way when there is a security fix you update a single .so (or three) and your system is safe.

I think this is largely a non-issue. Are you building and running your services out of Docker? Do you have a fleet of machines? Do you have lockfiles? Do you compile any thirdparty libraries from source? If the answer to any of these questions is yes, and it is, then you already have a non-trivial procedure to apply security fixes.

Committing dependencies to VCS doesn't make security updates much harder. In fact, having a monorepo source of truth can make things easier!

DVCS

One of Git's claims to fame is its distributed nature. At long last developers can commit work from an internetless cafe or airplane!

My NGVCS dream implies de facto centralization. Especially for large projects with large histories. Does that mean an internet connection is required? Absolutely not! Even Perforce, the King of centralized VCS, supports offline mode. Git continues to function locally even when working with shallow and partial clones.

Offline mode and decentralization are independent concepts. I don't know why so many people get this wrong.

Libraries

Do I really think that every library, such as fmt, should commit gigabytes of compilers to version control?

That's a good question. For languages like Rust, which have a universal build system, probably not. For languages like C++ and Python maybe yes! It'd be a hell of a lot easier to contribute to open source projects if step 0 wasn't "spend 8 hours configuring environment to build".

For libraries the answer may be "it depends". For executables I think the answer is "yes, commit everything".

Dreams vs Reality

NGVCS is obviously a dream. It doesn't exist today. Actually, that's not quite true. This is exactly how Google and Meta operate today. In fact, numerous large companies have custom NGVCS equivalents for internal use. Unfortunately there isn't a good solution in the public sphere.

Is committing dependencies reasonable for Git users today? The answer is... almost? It's at least closer than most people realize! A full Python deployment is merely tens to hundreds of megabytes. Clang is only a few gigabytes. A 2TB SSD is only $100. I would enthusiastically donate a few gigabytes of hard drive space in exchange for builds that "just work".

Committing dependencies to Git might be possible to do cleanly today with shallow, sparse, and LFS clones. Maybe. It'd be great if you could run git clone --depth=1 --sparse=windows. Maybe someday.
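
Something in that spirit is already expressible with today's flags, though it's clunky. A rough sketch (placeholder URL, and the big binaries would still want LFS):

git clone --depth=1 --filter=blob:none --sparse https://example.com/project.git
cd project
git sparse-checkout set sample_cpp_app thirdparty toolchains/win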

Conclusion

I strongly believe that dependencies belong in version control. I believe it is "The Right Thing". There are significant benefits to usability, reliability, reproducibility, sustainability, and more.

Committing all dependencies to a Git repo may be more practical than you realize. The actual file size is very reasonable.

Improvements to VCS software can allow repos to commit cross-platform dependencies while allowing users to download the bare minimum amount of content. It's the best of everything.

I hope that I have convinced you that committing dependencies and toolchains is "The Right Thing". I hope that version control systems evolve to accommodate this as a best practice.

Thank you.

Bonus Section

If you read this far, thank you! Here are some extra thoughts I wanted to share but couldn't squeeze into the main article.

Sample Project

The sample project can be downloaded via Dropbox as a 636 MB 7-Zip file. It should be trivial to download and build! Linux and macOS toolchains aren't included because I only have a Windows machine to test on. It's not on GitHub because they have an unnecessary file size limit.

Git LFS

My dream NGVCS has first class support for all the features I mentioned and more.

Git LFS is, imho, a hacky, second class citizen. It works and people use it. But it requires a bunch of extra effort and running extra commands.

Deployment

I have a related rant that not only should all dependencies be checked into version control, but that deployments should also include all dependencies. Yes, deploy 2 GB+ of CUDA DLLs so your exe will reliably run. No, don't force me to use Docker to run your simple Python project.

Git Alternatives

There are a handful of interesting Git alternatives in the pipeline.

  1. Jujutsu - Git but better
  2. Pijul - Somewhat academic patch-based VCS
  3. Sapling - Open source version of Meta's VCS. Not fully usable outside of Meta infra.
  4. XetHub - Git at 100 TB scale to support massive ML models

Git isn't going to be replaced anytime soon, unfortunately. But there are a variety of projects exploring different ideas. VCS is far from a solved problem. Be open minded!

Package Managers

Package managers are not necessarily a silver bullet. Rust's Cargo is pretty good. NPM is fine I guess. Meanwhile Python's package ecosystem is an absolute disaster. There may be a compile-time vs run-time distinction here.

A good package manager is a decent solution. However package managers exist on a largely per-language basis. And sometimes per-platform. Committing dependencies is a guaranteed good solution for all languages on all platforms.

Polyglot projects that involve multiple languages need multiple package managers. Yuck.