Improving Open Source with Amalgamation

July 16, 2016

Open source is wonderful. It makes me so happy to see so many developers creating and contributing to open source.

But sometimes the hardest part of using open source is just getting the damn code to compile. The complexities are abundant. Tracking down deeply nested library dependencies. Installing obscure scripting languages. Integrating yet another build system into an existing pipeline.

Wouldn’t it be nice if it could be a little bit simpler? I’d like to think so.

TL;DR

I wrote a Python script that amalgamates zlib from many files into two files. Amalgamation makes libraries easier to distribute and use. I would love to see open source authors release their libraries in amalgamated form.

Python Script: fts_amalgamate.py
Output: zlib_amalg.h / zlib_amalg.c

Caveat

My background is in video games. We often use Windows as our primary dev environment. We ship on PC (Win, Mac, Linux), mobile (iOS, Android) and console (Xbox, PlayStation, Nintendo). Game dev perspectives tend to be somewhat unique. Our world is quite different from the Linux-centric view that dominates open source.

It’s frustrating to see a library billed as “cross platform” only for Windows to be a total afterthought. The lack of a package manager doesn’t help. But if you want to be cross platform for reals then you’ve got to deal with that and more.

Single File

Build systems are complicated. Everyone has their own. Good open source should be easy to integrate into existing pipelines on existing projects.

A trend I’m happy with is single file libraries. Why? Because dropping a single file into a code base is as easy as it gets! STB is one of the best examples of this.

The easier code is to use the better. More users, more contributors, and more goodness.

SQLite

An exceptionally popular C library is SQLite. They claim to be the most widely deployed database in the world.

Do you know what the source code for SQLite looks like? Here, let me show you.

SQL Files

It’s four files. That’s it. Four simple files. That’s about as easy as you could ask for.

But this is a bit of a lie. SQLite isn’t really 4 files. One of those files is 6.7 megabytes large and 200,000 lines long. The real code base is spread across 140 files. This is an amalgamation.

Amalgamation; the action, process, or result of combining or uniting.

SQLite has a script that combines their many source files into four files. These files compile out of the box with no dependencies. Distribution is a breeze.

Quest

This made me wonder. What if popular C/C++ libraries could be amalgamated by a generic script? That would save a lot of pain and suffering.

So that’s what I did. Here were some of my initial requirements.

My current solution took several iterations to achieve. I started to write about the process to get here. But instead I’m going to skip the process and focus on the result.

zlib

One of my first choices for amalgamation was zlib. Most projects I’ve worked on have used zlib and they’ve all used pre-built binaries. Why? It’s just C code. Not even that much. Only 8,000 lines of actual code. That should be easy for any project to compile.

File Order

First thing to do is determine include order. For small projects this can be done by hand. For anything non-trivial we need automation. Here’s my solution.

  1. Get library to compile in Visual Studio.
  2. Capture build output with /showIncludes.
  3. Write Python script to scrape output and print include order.

Here’s a snippet of what that build output looks like:

Build Output

And here’s what the scraper output looks like:

Scraped Output

All platform includes are stripped out. Headers are sorted. Source files are all accounted for. Just what we need.
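A minimal sketch of such a scraper, not the code from fts_amalgamate.py. It assumes each header shows up as a "Note: including file:" line, that extra spaces before the path indicate nesting depth, and that "platform" means anything outside the project root. The function name and the `project_root` parameter are made up for illustration.

```python
import re
from pathlib import PureWindowsPath

# One "Note: including file:" line per header; the run of spaces before the
# path is assumed to grow by one for each level of nesting.
INCLUDE_NOTE = re.compile(r"Note: including file:( +)(.+)$")

def scrape_include_order(build_log, project_root):
    """Return project header names ordered so nested headers come first."""
    root = PureWindowsPath(project_root)
    ordered = []
    stack = []  # (depth, header name, or None for platform headers)

    def emit(name):
        if name is not None and name not in ordered:
            ordered.append(name)

    for line in build_log.splitlines():
        match = INCLUDE_NOTE.search(line)
        if not match:
            continue
        depth = len(match.group(1))
        path = PureWindowsPath(match.group(2).strip())
        # Leaving a nesting level: everything at this depth or deeper is
        # finished, so it can be emitted before its parent.
        while stack and stack[-1][0] >= depth:
            emit(stack.pop()[1])
        # Anything outside the project root is treated as a platform header.
        name = path.name if root in path.parents else None
        stack.append((depth, name))

    while stack:
        emit(stack.pop()[1])
    return ordered
```

Emitting a header only after the headers it pulls in means the resulting list can be pasted top to bottom without forward references.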

I realize the irony of doing this specific to Visual Studio despite my previous complaints on cross platform. GCC has similar compile flags. If anyone bothers to read this post I’ll add a GCC/Clang version.

Include Masking

We know copy and pasting files alone isn’t enough. We need to mask out any #include that references an amalgamated header.

Includes are matched with a simple regular expression:

Regex

Which is used to turn this:

Regex Input

Into this:

Regex Output

I prefer to comment out lines instead of deleting them. It keeps the original code and amalgamated code just a little bit closer together.
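As a rough illustration of the masking step, here is a sketch under assumptions. The regex below is not necessarily the one in fts_amalgamate.py, and `mask_includes` and `amalgamated_headers` are hypothetical names.

```python
import re

# Matches a whole #include line and captures the quoted or bracketed path.
INCLUDE_RE = re.compile(
    r'^[ \t]*#[ \t]*include[ \t]*[<"]([^>"\n]+)[>"].*$', re.MULTILINE
)

def mask_includes(source, amalgamated_headers):
    """Comment out every #include that names an amalgamated header."""
    def comment_out(match):
        # Compare on the bare file name so "sub/zutil.h" still matches.
        header_name = match.group(1).replace("\\", "/").rsplit("/", 1)[-1]
        if header_name in amalgamated_headers:
            return "// " + match.group(0)
        return match.group(0)  # platform include: leave untouched

    return INCLUDE_RE.sub(comment_out, source)
```

Calling `mask_includes(text, {"zutil.h", "zlib.h"})` leaves lines such as `#include <stdio.h>` alone, since those still need to reach the compiler.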

Collision

When amalgamating we run the risk of name collisions. Different source files may define static variables with the same name. I don’t rely on any sort of magic to solve these. A per-file regex find and replace is used.

Per file regex controls

In this case ZIP existed in one place as a #define with the value 2 and in another place as an enum value. Easy fix; zlib required 10 such fixes.
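A per-file rename table might look something like the sketch below. The file name `first.c` and the replacement `FIRST_ZIP` are invented for the example; they are not the actual zlib fixes.

```python
import re

def apply_renames(file_name, source, renames):
    """Apply the regex find-and-replace pairs registered for one file."""
    for pattern, replacement in renames.get(file_name, []):
        source = re.sub(pattern, replacement, source)
    return source

# Hypothetical table: only first.c gets its ZIP symbol renamed.
# \b keeps the pattern from touching longer names such as UNZIP.
RENAMES = {
    "first.c": [(r"\bZIP\b", "FIRST_ZIP")],
}
```

Keying the table by file name keeps each fix local: the same symbol in another file is left alone, which is the point when only one side of a collision should change.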

Code Duplication

Projects sometimes contain duplicated code. In different source files this is fine. In an amalgamated file it’s a problem.

More regex controls

I don’t love that this is line number based. It’s too fragile. For zlib I removed entire functions. That could be more automagic.
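One way to make line-number-based removal a little less fragile is to check what the first line of each range looks like before deleting it. This is a sketch of that idea, not what fts_amalgamate.py does; the function name and the `expected_start` guard are assumptions.

```python
def remove_line_ranges(source, ranges):
    """Drop 1-based, inclusive line ranges from source.

    Each range is (first, last, expected_start). The guard raises if the
    first line no longer begins with expected_start, which catches line
    numbers that went stale after an upstream update.
    """
    lines = source.splitlines(keepends=True)
    drop = set()
    for first, last, expected_start in ranges:
        if not lines[first - 1].lstrip().startswith(expected_start):
            raise ValueError(
                f"line {first} no longer starts with {expected_start!r}"
            )
        drop.update(range(first, last + 1))
    return "".join(
        line for number, line in enumerate(lines, start=1) if number not in drop
    )
```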

Nesting

A problem with zlib is nesting. Most source files in most projects include all headers at the top. But not always.

In zlib, crc32.c includes crc32.h on line 185. Above that line are many ifdefs and defines. Which, unfortunately, the header needs to compile. This is a problem.

The fix is to nest crc32.h inside crc32.c in zlib_amalg.c. Which is what the compiler normally does anyhow.
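Nesting can be sketched as pasting the header text at the spot where the #include used to be, so any defines above that line are still in effect. The helper name and the marker comments below are illustrative assumptions.

```python
import re

def inline_nested_header(source, header_name, header_text):
    """Replace the first #include of header_name with the header's contents."""
    pattern = re.compile(
        r'^[ \t]*#[ \t]*include[ \t]*"' + re.escape(header_name) + r'".*$',
        re.MULTILINE,
    )

    def paste(match):
        # Keep the original line as a comment, then paste the header body.
        return (
            f"// {match.group(0).strip()} (inlined below)\n"
            f"{header_text.rstrip()}\n"
            f"// end {header_name}"
        )

    return pattern.sub(paste, source, count=1)
```

Only the first occurrence is replaced; a header inlined this way should not also appear in the list of headers pasted at the top of the amalgamated file.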

To be honest I don’t like nesting. I don’t like that zlib headers are dependent on #define bullshit written in random source files. I think that’s sloppy code and bad design. But it’s also code released in 1995 and used by tens of thousands of shipped projects. So I won’t complain too much.

Prefixes

We’re almost done.

Compiler Fix

This code is in a few places in zlib. A workaround for ancient compiler bugs. Which unfortunately causes problems when amalgamated. So I prefix zlib_amalg.h with #define NO_DUMMY_DECL. Sorry 1995 compilers.
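Prefixing can be as simple as writing a few lines before the first pasted file. Here is a sketch of that final assembly step; the `assemble` helper and the begin/end marker comments are assumptions, not the script's actual output format.

```python
def assemble(prefix_lines, files):
    """Join prefix lines and (name, text) pairs into one amalgamated file."""
    chunks = list(prefix_lines)
    for name, text in files:
        # Markers make it easy to trace amalgamated code back to its source.
        chunks.append(f"/* ---- begin {name} ---- */")
        chunks.append(text.rstrip("\n"))
        chunks.append(f"/* ---- end {name} ---- */")
    return "\n".join(chunks) + "\n"

# Example: the define goes ahead of every pasted header.
header = assemble(["#define NO_DUMMY_DECL"], [("zconf.h", "int a;\n")])
```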

Success

Tada!

zlib output

We amalgamated zlib into two simple files. They compile and work as you’d expect. Mission accomplished!

Failure

Now that we’ve amalgamated zlib everything is great, right? Not quite.

There’s a problem. Libraries, such as libpng, which depend on zlib don’t expect an amalgamated version. They #include “zlib.h” and we renamed the file. Oops!

This is the point where I decided to throw in the towel. We could press onward quite a ways. But it’s already apparent that complexities are compounding.

At this point in time my conclusion is that users shouldn’t amalgamate libraries. It doesn’t make things simpler as I’d hoped.

The best solution is for open source authors to officially support and release amalgamations. That’s what SQLite does. In my opinion it would be better if that was standard operating procedure.

Philosophy

If there’s one thing I’d like readers to take away it’s this:

Compiling a C library doesn’t require anything more than code and a compiler.

Compiling C code isn’t hard. Compiling cross-platform code is a solved problem. Libraries don’t have to force users to use inscrutable makefiles with complicated ./configure steps.

Most projects use makefiles. That’s great. Most projects should use makefiles. But makefiles aren’t mandatory. Users shouldn’t be required to use makefiles. Users are going to integrate source code into their own build systems. Let’s make that as easy as possible.

One of the projects I attempted to amalgamate was OpenSSL. Here’s what that involves:

OpenSSL Requirements

Are you shitting me? I shouldn’t have to download Strawberry Perl to compile a C library. That’s absurd.

The problem is that OpenSSL has a convoluted ./configure step. Much of the source code is sprinkled throughout perl scripts. The code doesn’t actually exist until you configure.

Another library I attempted to amalgamate was nanomsg. What could be more nano than a single file? Nanomsg also relies on a configure step. It almost amalgamates after configuring. But not quite.

Nanomsg’s configure excludes certain files depending on your platform. Some files are obvious such as thread_posix.h. But there’s also random files such as poller.c.

Why? It’s so unnecessary. Nanomsg already relies on defines such as NN_HAVE_OSX. All the code needs is to ifdef guard against a define that already exists and is already used.

Nanomsg Config

Misc Thoughts

Here’s some scattered thoughts I wanted to share but didn’t fit in.

fts_amalgamate doesn’t support moving code across files. Libraries sometimes need to move implementations from source to headers to amalgamate. I think the correct fix is for the library author to do this in the raw source.

Amalgamated code compiled 5 to 20 times faster in some cases. Compiling a single file is fast. Pasting the same code over and over and over is not.

For large libraries I would consider up to three files; foo.h / foo_internal.h / foo.c. This may minimize compile times for users.

Amalgamation can prevent the need to add extra include directories to your project.

Configure can be useful to test for platform capabilities. This complexity is primarily needed for *nix permutations.

Libraries may need several defines on the command line. They’re rarely documented. Which is frustrating. Document all such defines at the top of foo_amalg.h. Users can provide them manually or run an optional configure step to generate defines.

My amalgamation script is single file and doesn’t accept cmdline arguments. This is to keep it easy to understand for educational purposes.

Libraries should, in my opinion, include dependencies. This is trivial if those dependencies are single file. IMGUI is a great example.

fts_amalgamate was written solely with C/C++ in mind. It could work for any language with minimal work.

Turning zlib into a true single file would require projects to set multiple defines. To flip things such as ZEXTERN. I chose to keep things simple.

Amalgamated files may still need to be isolated into their own projects. Many old C libs need _CRT_SECURE_NO_WARNINGS to compile with no warnings.

Conclusion

This project has been a fun and educational process. I learned a lot about why certain open source projects are structured the way they are.

I strongly feel that C/C++ libraries can and should be easy to drop into existing build pipelines. There’s always going to be some amount of work. Adding include directories, specifying lib dependencies, adding defines, etc.

I believe many popular libraries should be amalgamated. I believe doing so would make them easier to use. And I certainly believe that compiling C code shouldn’t require installing Strawberry frickin’ Perl.

Python Script: fts_amalgamate.py
Output: zlib_amalg.h / zlib_amalg.c