Parallelism if more than one target depends on the same subdir

Recursive make is especially painful when it comes to parallelism. Take a look at this Makefile fragment:

all: fred bob
subproj:
    touch $@.new
    sleep 1
    mv $@.new $@
fred:
    $(MAKE) subproj
    touch $@
bob:
    $(MAKE) subproj
    touch $@

If we run it serially, it all looks good:

$ rm -f subproj fred bob; make --no-print-directory
make subproj
touch subproj.new
sleep 1
mv subproj.new subproj
touch fred
make subproj
make[1]: 'subproj' is up to date.
touch bob

But if we run it in parallel, life sucks:

$ rm -f subproj fred bob; make -j2 --no-print-directory
make subproj
make subproj
touch subproj.new
touch subproj.new
sleep 1
sleep 1
mv subproj.new subproj
mv subproj.new subproj
mv: cannot stat 'subproj.new': No such file or directory
touch fred
make[1]: *** [subproj] Error 1
make: *** [bob] Error 2

What happened? The sub-make that builds subproj ended up running twice at once, because both fred and bob need to build it.

If fred and bob had declared a dependency on subproj, then GNU make would be smart enough to build it only once; it can do ordering inside a single make process. So this example is a bit contrived. But imagine that fred and bob are two separate applications being built from the same toplevel Makefile, and they both depend on the library in subproj. You'd run into this problem if you use recursive make.

Of course, you might try to solve this by using nonrecursive make, but that's really hard. What if subproj is a library from some other vendor? Will you modify all their makefiles to fit into your nonrecursive makefile scheme? Probably not.

Another common workaround is to have the toplevel Makefile build subproj, then fred and bob. This works, but if you don't run the toplevel Makefile and want to go straight to work in the fred project, building fred won't actually build subproj first, and you'll get errors.

redo solves all these problems. It maintains global locks across all its instances, so you're guaranteed that no two instances will try to build subproj at the same time. And this works even if subproj is a make-based project; you just need a simple subproj.do that runs make subproj.
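The locking idea can be illustrated with a small shell sketch. This is not how redo implements its locks; the build_subproj function and the subproj.lock directory are names made up for this example. It relies only on mkdir being atomic, so two concurrent builders take turns instead of both running the touch/sleep/mv sequence at once:

```shell
#!/bin/sh
# Sketch only: serialize two concurrent builders of the same target.
# build_subproj and subproj.lock are hypothetical names for this example;
# redo's real locking is internal to redo.

build_subproj() {
    # mkdir either creates the directory or fails, atomically, so only
    # one caller at a time gets past this loop.
    until mkdir subproj.lock 2>/dev/null; do
        sleep 1
    done
    if [ -e subproj ]; then
        echo "subproj is up to date"
    else
        # Same steps as the subproj rule in the Makefile fragment above.
        touch subproj.new
        sleep 1
        mv subproj.new subproj
        echo "built subproj"
    fi
    rmdir subproj.lock
}

# Work in a scratch directory so the demo leaves the current one alone.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

build_subproj > fred.log &   # what building fred would trigger
build_subproj > bob.log &    # what building bob would trigger
wait
cat fred.log bob.log
```

Whichever builder wins the race does the work; the other waits for the lock and then finds subproj already built, so the mv never runs twice.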

Dependency problems that only show up during parallel builds

One annoying thing about parallel builds is... they do more things in parallel. A very common problem in make is to have a Makefile rule that looks like this:

all: a b c

When you make all, it first builds a, then b, then c. What if c depends on b? Well, it doesn't matter when you're building in serial. But with -j3, you end up building a, b, and c at the same time, and the build for c crashes. You should have said:

all: a b c
c: b
b: a

and that would have fixed it. But you forgot, and you don't find out until you build with exactly the wrong -j option.

This mistake is easy to make in redo too. But redo does have a tool that helps you debug it: the --shuffle option. --shuffle takes the dependencies of each target and builds them in a random order, so you can get parallel-like results without actually building in parallel.
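To see why a randomized serial order exposes a missing dependency, here is a small shell sketch. It is not redo's --shuffle implementation; shuffle and build are hypothetical helpers, and the rule that c needs b is hard-coded to mimic the undeclared dependency described above.

```shell
#!/bin/sh
# Sketch only: build a, b, c serially in a random order and watch the
# undeclared "c needs b" dependency fail on some runs.

# shuffle: print the lines of stdin in a random order.
# (awk's rand() is seeded from the clock; fine for a demo.)
shuffle() {
    awk 'BEGIN { srand() } { printf "%.9f\t%s\n", rand(), $0 }' |
        sort -n | cut -f2-
}

# build: hypothetical build step. c silently assumes b already exists.
build() {
    case "$1" in
        c)
            [ -e b ] || {
                echo "building c: FAILED, b does not exist yet" >&2
                return 1
            }
            ;;
    esac
    touch "$1" && echo "built $1"
}

# Work in a scratch directory so the demo leaves the current one alone.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

for target in $(printf '%s\n' a b c | shuffle); do
    build "$target" || break
done
```

Run it a few times: any order that puts c before b fails, even though nothing runs in parallel. With redo itself, you would pass --shuffle on the command line instead of shuffling by hand.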

What about distributed builds?

FIXME: So far, nobody has tried redo in a distributed build environment. It surely works with distcc, since that's just a distributed compiler. But there are other systems that distribute more of the build process to other machines.

The most interesting method I've heard of was explained (in public, this is not proprietary information) by someone from Google. Apparently, the Android team uses a tool that mounts your entire local filesystem on a remote machine using FUSE and chroots into that directory. Then you replace the $SHELL variable in your copy of make with one that runs this tool. Because the remote filesystem is identical to yours, the build will certainly complete successfully. After the $SHELL program exits, the changed files are sent back to your local machine. Cleverly, the files on the remote server are cached based on their checksums, so files only need to be re-sent if they have changed since last time. This dramatically reduces bandwidth usage compared to, say, distcc (which mostly just re-sends the same preparsed headers over and over again).

At the time, he promised to open source this tool eventually. It would be pretty fun to play with it.

The problem:

This idea won't work as easily with redo as it did with make. With make, a separate copy of $SHELL is launched for each step of the build (and gets migrated to the remote machine), but make runs only on your local machine, so it can control parallelism and avoid building the same target from multiple machines, and so on. The key to the above distribution mechanism is that it can send files to the remote machine at the beginning of the $SHELL, send them back when the $SHELL exits, and know that nobody cares about them in the meantime. With redo, since the entire script runs inside a shell (and the shell might not exit until the very end of the build), we'd have to do the parallelism some other way.

I'm sure it's doable, however. One nice thing about redo is that the source code is so small compared to make: you can just rewrite it.

Can I convince a sub-redo or sub-make to not use parallel builds?

Yes. Put this in your .do script:

unset MAKEFLAGS

The child makes will then not have access to the jobserver, so they will build serially instead.
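If only one sub-build needs to be serial, ordinary shell scoping lets you limit the unset to a subshell. The sketch below simulates this without running make or redo at all: the MAKEFLAGS value is made up, and child_view is a hypothetical stand-in for a child process that simply reports what it finds in its environment.

```shell
#!/bin/sh
# Simulated value; in a real build, redo or make would have set this.
MAKEFLAGS='-j --jobserver-auth=3,4'
export MAKEFLAGS

# child_view: stands in for a sub-make or sub-redo. It prints the
# MAKEFLAGS value the child process inherits, or <unset>.
child_view() {
    sh -c 'printf "%s\n" "${MAKEFLAGS-<unset>}"'
}

# Unset inside a subshell: only this one child loses the jobserver.
serial_child=$( unset MAKEFLAGS; child_view )

# Back in the parent script, MAKEFLAGS is still set for later children.
parallel_child=$(child_view)

echo "inside subshell: $serial_child"
echo "afterwards:      $parallel_child"
```

In a real .do script the same pattern would wrap the actual sub-make or sub-redo command in parentheses together with the unset.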

What does the "broken --jobserver-auth" error mean?

redo (and GNU make) use the MAKEFLAGS environment variable to pass information about the parallel build environment from one process to the next. Inside MAKEFLAGS is a string that looks like either --jobserver-auth=X,Y or --jobserver-fds=X,Y, depending on the version of make.

If redo finds one of these strings, but the file descriptors named by X and Y are not available in the subprocess, that means some ill-behaved parent process has closed them. This prevents parallelism from working, so redo aborts to let you know something is seriously wrong.
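As a rough illustration of that check (a sketch, not redo's actual code), the following shell fragment pulls the X,Y pair out of MAKEFLAGS and tests whether both descriptors are still open. jobserver_fds and fd_is_open are hypothetical helper names. The sketch only understands the numeric X,Y form described above, and it assumes that /dev/fd, where present, lists the shell's open descriptors (as it does on Linux and macOS).

```shell
#!/bin/sh
# jobserver_fds: print the X,Y part of a --jobserver-auth= or
# --jobserver-fds= word found in the given MAKEFLAGS string.
jobserver_fds() (
    set -f    # no globbing while the string is split into words
    for word in $1; do
        case "$word" in
            --jobserver-auth=*|--jobserver-fds=*)
                printf '%s\n' "${word#*=}"
                ;;
        esac
    done
)

# fd_is_open: succeed if file descriptor $1 is open in this shell.
fd_is_open() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;    # not a plain descriptor number
    esac
    if [ -d /dev/fd ]; then
        [ -e "/dev/fd/$1" ]
    else
        # Fallback: duplicating a closed descriptor fails. Some shells
        # only accept single-digit descriptors here.
        ( : >&"$1" ) 2>/dev/null
    fi
}

fds=$(jobserver_fds "${MAKEFLAGS-}")
if [ -z "$fds" ]; then
    echo "MAKEFLAGS does not advertise a jobserver"
else
    for fd in "${fds%%,*}" "${fds##*,}"; do
        fd_is_open "$fd" ||
            echo "fd $fd is named in MAKEFLAGS but is not open" >&2
    done
fi
```

Dropping something like this at the top of a .do script or a Makefile recipe can help narrow down which layer of the build closed the descriptors.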

GNU make will intentionally close these file descriptors if you write a Makefile rule that contains neither the exact string $(MAKE) nor a leading + character. So you might have had a Makefile rule that looked like this:

subdir/all:
    $(MAKE) -C subdir all

and that worked as expected: the sub-make inherited your parallelism settings. But people are sometimes surprised to find that this doesn't work as expected:

subdir/all:
    make -C subdir all

In that case, the sub-make does not inherit the jobserver file descriptors, so it runs serially. If for some reason you don't want to use $(MAKE) but you do want parallelism, you need to write something like this instead:

subdir/all:
    +make -C subdir all

And similarly, if you recurse into redo instead of make, you need the same trick:

subdir/all:
    +redo subdir/all

There are a few other programs that also close file descriptors. For example, if your .do file starts with #!/usr/bin/env xonsh, you might run into a bug in xonsh where it closes file descriptors incorrectly.

If you really can't stop your program from closing file descriptors that it shouldn't, you can work around the problem by unsetting MAKEFLAGS. This will let your program build, but will disable parallelism.