Containers
"Containers" became popular a few years ago with the emergence of Docker, but they are actually the result of a long line of evolution starting with chroot, a concept which dates all the way back to 1979. The idea of a container, or a chroot, is to run a process or set of processes in a (more or less) isolated environment that's separate from your main operating system.
The first iteration, chroot, only isolated the filesystem: chroot would "change" the "root" directory (hence the name) to a subdirectory of the main filesystem, then run a program that would see only files in that subdirectory. Among other things, this was used as a way to prevent rogue programs from accidentally damaging other files on the system. But it wasn't particularly safe, especially because any program running with administrator privileges could play tricks and eventually switch its root back to the "real" root directory. Separately from security, though, it's sometimes interesting to install a different operating system variant in a subdirectory, then chroot into it and run programs that require that operating system version. For example, if you're running the latest version of Debian Linux, but you want to build an application that only builds correctly on the Debian version from 5 years ago, you can install the 5-years-ago Debian files in a directory, chroot into that, and build your application. The main limitation is that your "host" system and your chroot environment share the same kernel version, and rogue programs usually can find a way to escape the chroot, so it's not useful if your inner system is running dangerous code.
Partly in response to the limitations of chroot, "virtualization" started to gain popularity around 2001, made famous by VMware. (IBM mainframes had been doing something similar for a few decades, but not many people knew how IBM mainframes worked.) Anyway, virtualization simulates a computer's actual hardware and lets you run a different kernel on the virtual hardware, and a filesystem inside that hardware. This has several advantages, including much stricter security separation and the ability to run a different kernel or even a different "guest" operating system than the one on the host. Virtualization used to be pretty slow, but it's gotten faster and faster over the years, especially with the introduction of "paravirtualization," where we emulate special virtual-only "hardware" that needs special drivers in the guest, in exchange for better performance. On Linux, the easiest type of paravirtualization nowadays is kvm (kernel virtual machine), a variant of QEMU.
Virtual machines provide excellent security isolation, but at the expense of
performance, since every VM instance needs to have its own kernel, drivers,
init system, terminal emulators, memory management, swap space, and so on.
In response to this, various designers decided to go back to the old chroot
system and start fixing the isolation limits, one by one. The
history from here gets a bit complicated, since there are many, overlapping,
new APIs that vary between operating systems and versions. Eventually, this
collection of features congealed into what today we call "containers," in
products like OpenVZ,
LXC, and (most famously) Docker.
Why are we talking about all this? Because in this tutorial, we'll use
redo
to build and run three kinds of containers (chroot, kvm, and docker),
sharing the same app build process between all three. redo's dependency and
parallelism management makes it easy to build multiple container types in
parallel, share code between different builds, and use different container
types (each with different tradeoffs) for different sorts of testing.
You can follow along by checking out the redo source and looking in the docs/cookbook/container/ directory.
A Hello World container
Most Docker tutorials start at the highest level of abstraction: download someone else's container, copy your program into it, and run your program in a container. In the spirit of redo's low-level design, we're going to do the opposite, starting at the very lowest level and building our way up. The lowest level is, of course, Hello World, which we compiled (with redo of course) in an earlier tutorial:
In fact, our earlier version of Hello World is a great example of redo's
safe recursion. Instead of producing an app as part of this tutorial, we'll
just redo-ifchange ../hello/hello from within our new project, confident that
redo will figure out any locking, dependency, consistency, and parallelism
issues. (This sort of thing usually doesn't work very well in make, because
you might get two parallel sub-instances of make recursing into the ../hello
directory simultaneously, stomping on each other.)
For our first "container," we're just going to build a usable chroot
environment containing our program (/bin/hello
) and the bare minimum
requirements of an "operating system": a shell (/bin/sh
), an init script
(/init
, which will just be a symlink to /bin/hello
), and, for debugging
purposes, the all-purpose busybox program.
Here's a .do script that will build our simple container filesystem:
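The cookbook's actual script isn't reproduced here, but a minimal sketch of
what simple.fs.do might look like is below (the host busybox location and the
exact copy commands are assumptions):

# simple.fs.do (sketch): build the "simple" directory as a side effect.
# The simple.fs target itself is just an empty marker file.
redo-ifchange ../hello/hello
rm -rf simple
mkdir -p simple/bin
cp ../hello/hello simple/bin/hello
cp "$(command -v busybox)" simple/bin/busybox  # assumes busybox is installed
ln -s busybox simple/bin/sh    # busybox behaves as a shell when run as "sh"
ln -s /bin/hello simple/init   # /init is just a symlink to /bin/hello
: >$3                          # leave the simple.fs target itself empty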
There's a little surprise here. Did you see it above? In current versions of redo, the semantics of a .do script producing a directory as its output are undefined. That's because the redo authors haven't yet figured out quite what ought to happen when a .do file creates a directory. Or rather, what should happen after you create a directory?
Can people redo-ifchange on a file inside that newly created directory? If
so, what if the new directory
contains .do files? What if you redo-ifchange
one of the sub-files before
you redo-ifchange
the directory that contains it, so that the sub-file's
.do doesn't exist yet? And so on. We don't know. So for now, to stop you
from depending on this behaviour, we intentionally made it not work.
Instead of that, you can have a .do script that produces a different
directory as a side effect. So above, simple.fs.do produces a directory
called simple when you run redo simple.fs. simple.fs is the (incidentally
empty) target file, which is remembered by redo as a node in the dependency
tree, so that other scripts can depend upon it using redo-ifchange simple.fs.
The simple directory happens to materialize too, and redo doesn't know
anything about
it, which means it doesn't try to do anything about it, and you don't have
to care what redo's semantics for it might someday be. In other words,
maybe someday we'll find a more elegant way to handle .do files that create
directories, but we won't break your old code when we do.
Okay?
All right, one more catch. Operating systems are complicated, and there's
one more missing piece. Our Hello World program is dynamically linked,
which means it depends on shared libraries elsewhere in the system. You can
see exactly which ones by using the ldd
command:
$ ldd ../hello/hello
linux-vdso.so.1 (0x00007ffd1ffca000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9ddf8fd000)
/lib64/ld-linux-x86-64.so.2 (0x00007f9ddfe9e000)
If we chroot into our simplistic "container" and try to run hello, it won't
work, because those libraries aren't available to programs inside the chroot.
That's the whole point of chroot, after all!
How do we fix it? We get a list of the libraries with ldd, and then we copy
the libraries into place.
Actually, for reasons we'll address below, let's make a copy of the new filesystem and copy the new libraries into that:
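The real libs.fs.do isn't shown here; a rough sketch, assuming we can scrape
the library paths straight out of ldd's output, might be:

# libs.fs.do (sketch): clone the "simple" tree into "libs", then copy in the
# shared libraries that ldd reports for our binaries.
redo-ifchange simple.fs
rm -rf libs
cp -a simple libs
ldd simple/bin/hello simple/bin/busybox 2>/dev/null |
grep -o '/[^ ]*' |     # pull out the absolute paths ldd mentions
sort -u |
while read -r lib; do
    [ -e "$lib" ] || continue       # skips linux-vdso, which isn't a file
    mkdir -p "libs$(dirname "$lib")"
    cp -L "$lib" "libs$lib"
done
: >$3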
So now there's a directory called simple, which contains our program and
some helper programs, and one called libs, which contains all that stuff,
plus the supporting libraries. That latter one is suitable for use with
chroot.
Running a container with unshare and chroot
So let's run it! We can teach redo how to start a program inside any chroot
by using a default.do script. In this case, we'll use default.runlocal.do.
With that file in place, when we run redo whatever.runlocal (for any value of
whatever), redo will first construct the whatever directory (using
redo-ifchange whatever.fs), and then chroot into it and run /init inside.
We'll collect stdout into the redo output (ie. the file outside the chroot
named whatever.runlocal). Also,
the stderr will go to redo's build log, readable with redo-log
or on the console at build time, and if the /init
script returns a nonzero
exit code, so will our script. As a result, the whole container execution
will act like a single node in our build process. It can depend on other
things, and other things can depend on it.
Just one more thing: once upon a time, chroot
was only available to
sysadmins, not normal users. And it's never a good idea to run your build
scripts as root. Luckily, Linux recently got a feature called "user
namespaces" (userns), which, among many other things, lets non-administrator
users use chroot. This is a really great addition.
(Unfortunately, some people worry that user namespaces might create security
holes. From an abundance of caution, many OSes disable user namespaces for
non-administrators by default. So most of this script is just detecting
those situations so it can give you a useful warning. The useful part of
the script is basically just: unshare -r chroot "$2" /init >$3. Alas, the
subsequent error handling makes our script look long and complicated.)
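A stripped-down sketch of what default.runlocal.do boils down to, with nearly
all of that error handling left out (and with ./need.sh, described just
below):

# default.runlocal.do (sketch): chroot into the $2 directory and run /init,
# collecting its stdout as the target. The real script adds many checks.
redo-ifchange "$2.fs" ./need.sh
./need.sh unshare chroot
unshare -r chroot "$2" /init >$3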
Speaking of error handling, the script above calls a script called
./need.sh, which is just a helper that prints a helpful error message and
aborts right away if the listed programs are not available to run, rather
than failing in a more complicated way. We'll use that script more
extensively below.
And that's it! A super simple container!
$ redo libs.runlocal
redo libs.runlocal
redo libs.fs
redo simple.fs
$ time redo libs.runlocal
redo libs.runlocal
real 0m0.112s
user 0m0.060s
sys 0m0.024s
$ du libs
792 libs/bin
156 libs/lib64
1656 libs/lib/x86_64-linux-gnu
1660 libs/lib
3752 libs
$ cat libs.runlocal
Hello, world!
By the way, if this were a docker tutorial, it would still print "Hello, world!" but your container would be >100 megabytes instead of 3.7 megabytes, and it would have taken at least a couple of seconds to start instead of 0.11 seconds. But we'll get to that later. First, now that we have a container, let's do more stuff with it!
Running a container with kvm and initrd
Now you've seen chroot in action, but we can run almost the same container
in kvm (kernel virtual machine) instead, with even greater isolation. kvm
only runs on Linux, so for this step you'll need a Linux machine. And
for our example, we'll just have it run exactly the same kernel you're
already using, although kvm has the ability to use whatever kernel you want.
(You could even build a kernel as part of your redo project, redo-ifchange
it, and then run it with kvm. But we're not going to do that.)
Besides a kernel, kvm needs an "initial ramdisk", which is where it'll get its filesystem. (kvm can't exactly access your normal filesystem, because it's emulating hardware, and there's no such thing as "filesystem hardware." There are tools like the 9p filesystem that make this easier, but 9p isn't available in all kernel builds, so we'll avoid it for now.)
"Initial ramdisk" (initrd) sounds fancy, but it's actually just a tarball (technically, a cpio archive) that the kernel extracts into a ramdisk at boot time. Since we already have the files, making the tarball is easy:
(Ignore that try_fakeroot.sh
thing for now. We'll get to it a bit further
down. In our simple.fs
example, it's a no-op anyway.)
The main thing you need to know is that, unlike tar, cpio takes a list of
files on stdin instead of on the command line, and it doesn't recurse
automatically (so if you give it a directory name, it'll store an entry for
that directory, but not its contents, unless you also provide a list of its
contents). This gives us a lot of power, which we'll use later. For now
we're just doing basically find | cpio -o, which takes all the files and
directories and puts them in a cpio archive file.
$ redo libs.initrd
redo libs.initrd
5163 blocks
1 block
$ cpio -t <libs.initrd
.
bin
bin/hello
bin/busybox
bin/sh
lib64
lib64/ld-linux-x86-64.so.2
lib
lib/x86_64-linux-gnu
lib/x86_64-linux-gnu/libc.so.6
init
7444 blocks
default.initrd.do
also appends another file, rdinit
(the "ram disk init"
script), which is the first thing the kvm Linux kernel will execute after
booting. We use this script to set up a useful environment for our
container's /init
script to run in - notably, it has to write its stdout
to some virtual hardware device, so redo can capture it, and it has to save
its exit code somewhere, so redo knows whether it succeeded or not. Here's a
simple rdinit
script that should work with any container we want to run
using this technique:
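The real rdinit isn't shown here; a sketch along these lines (assuming a
kernel with devtmpfs, and an rdinit kept executable in the source tree so
cpio preserves its mode) conveys the idea:

#!/bin/sh
# rdinit (sketch): pid 1 inside the VM. The kernel is booted with
# rdinit=/rdinit and three serial ports: ttyS0 is the console (stderr),
# ttyS1 collects /init's stdout, ttyS2 collects its exit code.
export PATH=/bin:/usr/bin:/sbin:/usr/sbin
# busybox-only containers have just /bin/sh; give ourselves the other tools.
[ -x /bin/busybox ] && /bin/busybox --install -s /bin
mkdir -p /proc /dev
mount -t proc proc /proc
mount -t devtmpfs dev /dev
/init >/dev/ttyS1 2>/dev/console
echo "$?" >/dev/ttyS2
poweroff -f 2>/dev/null || echo o >/proc/sysrq-trigger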
Configuring a virtual machine can get a little complicated, and there are a million things we might want to do. One of the most important is setting the size of the ramdisk needed for the initrd. Current Linux versions limit the initrd to half the available RAM in the (virtual) machine, so to be safe, we'll make sure to configure kvm to provide at least 3x as much RAM as the size of the initrd. Here's a simple script to calculate that:
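The cookbook's exact calculation isn't reproduced here; a helper along these
lines (the kvmram.sh name and the 64MB of headroom are assumptions) would do
the job:

#!/bin/sh
# kvmram.sh (sketch): print a RAM size, in megabytes, big enough to unpack
# the given initrd (3x its size) plus some headroom for the kernel itself.
initrd="$1"
bytes=$(wc -c <"$initrd")
echo $(( bytes * 3 / 1024 / 1024 + 64 ))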
With all those pieces in place, actually executing the kvm is pretty painless. Notice in particular the three serial ports we create: one for the console (stderr), one for the output (stdout), and one for the exit code:
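Something like the following sketch captures the idea (the kernel path, the
kvmram.sh helper above, and most of the qemu options are assumptions; the
real default.runkvm.do has more error handling):

# default.runkvm.do (sketch): boot the current kernel with our initrd in kvm.
# Serial port 0 is the console (sent to stderr, i.e. the redo log), port 1
# collects /init's stdout into the target, port 2 collects the exit code.
redo-ifchange "$2.initrd" ./kvmram.sh ./need.sh
./need.sh kvm
mem=$(./kvmram.sh "$2.initrd")
echo "$2: kvm memory required: ${mem}M" >&2
codefile=$(mktemp)
kvm -m "$mem" \
    -kernel "/boot/vmlinuz-$(uname -r)" \
    -initrd "$2.initrd" \
    -append "console=ttyS0 rdinit=/rdinit quiet" \
    -display none -no-reboot \
    -serial stdio \
    -serial "file:$3" \
    -serial "file:$codefile" \
    >&2
code=$(tr -d '\r' <"$codefile")
rm -f "$codefile"
[ "$code" = "0" ] || exit 1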
And it works!
$ redo libs.runkvm
redo libs.runkvm
redo libs.initrd
5163 blocks
1 block
libs: kvm memory required: 70M
[ 0.306682] reboot: Power down
ok.
$ time redo libs.runkvm
redo libs.runkvm
libs: kvm memory required: 70M
[ 0.295139] reboot: Power down
ok.
real 0m0.887s
user 0m0.748s
sys 0m0.112s
$ cat libs.runkvm
Hello, world!
Virtual machines have come a long way since 1999: we managed to build an initrd, boot kvm, run our program, and shut down in only 0.9 seconds. It could probably go even faster if we used a custom-built kernel with no unnecessary drivers.
A real Docker container
Okay, that was fun, but nobody in real life cares about all these fast, small, efficient isolation systems that are possible for mortals to understand, right? We were promised a Container System, and a container system has daemons, and authorization, and quotas, and random delays, and some kind of Hub where I can download (and partially deduplicate) someone else's multi-gigabyte Hello World images that are built in a highly sophisticated enterprise-ready collaborative construction process. Come on, tell me, can redo do that?
Of course! But we're going to get there the long way.
First, let's use the big heavy Container System with daemons and delays to run our existing tiny understandable container. After that, we'll show how to build a huge incomprehensible container that does the same thing, so your co-workers will think you're normal.
Docker and layers
Normal people build their Docker containers using a Dockerfile. A Dockerfile is sort of like a non-recursive redo, or maybe a Makefile, except that it runs linearly, without the concept of dependencies or parallelization. In that sense, I guess it's more like an IBM mainframe job control script from 1970. It even has KEYWORDS in ALL CAPS, just like 1970.
Dockerfiles do provide one really cool innovation over IBM job control scripts, which is that they cache intermediate results so you don't have to regenerate them every time. Basically, every step in a Dockerfile copies a container, modifies it slightly, and saves the result for use in the next step. If you modify step 17 and re-run the Dockerfile, it can just start with the container produced by step 16, rather than going all the way back to step 1. This works pretty well, although it's a bit expensive to start and stop a container at each build step, and it's unclear when and how interim containers are expunged from the cache later. And some of your build steps are "install the operating system" and "install the compiler", so each step produces a larger and larger container. A very common mistake among Docker users is to leave a bunch of intermediate files (source code, compilers, packages, etc) installed in the output container, bloating it up far beyond what's actually needed to run the final application.
Spoiler: we're not going to do it that way.
Instead, let's use redo to try to get the same Dockerfile advantages (multi-stage cache; cheap incremental rebuilds) without the disadvantages (launching and unlaunching containers; mixing our build environment with our final output).
To understand how we'll do this, we need to talk about Layers. Unlike our kvm initrd from earlier, a Docker image is not just a single tarball; it's a sequence of tarballs, each containing the set of files changed at each step of the build process. This layering system is how Docker's caching and incremental update system works: if I incrementally build an image starting from step 17, based on the pre-existing output from step 16, then the final image can just re-use layers 1..16 and provide new layers 17..n. Usually, the first few layers (install the operating system, install the compilers, etc) are the biggest ones, so this means a new version of an image takes very little space to store or transfer to a system that already has the old one.
The inside of a docker image looks like this:
$ tar -tf test.image
ae5419fd49e39e4dc0baab438925c1c6e4417c296a8b629fef5ea93aa6ea481c/
ae5419fd49e39e4dc0baab438925c1c6e4417c296a8b629fef5ea93aa6ea481c/VERSION
ae5419fd49e39e4dc0baab438925c1c6e4417c296a8b629fef5ea93aa6ea481c/json
ae5419fd49e39e4dc0baab438925c1c6e4417c296a8b629fef5ea93aa6ea481c/layer.tar
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/VERSION
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/json
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/layer.tar
We could use redo to build a Docker image by simply making a single
layer.tar
of the filesystem (like we did with initrd), adding a VERSION
and json file, and putting those three things into an outer tarball. But if
we want a system that works as well as a Dockerfile, we'll have to make use
of multiple layers.
Our simple
container is already pretty tiny by container standards - 2.6MB
- but it's still a bit wasteful. Most of that space turns out to be from
the dynamic libraries we imported from the host OS. These libraries don't
change when we change Hello World! They belong in their own layer.
Up above, in preparation for this moment, we created libs.fs.do to build a
separate filesystem, rather than adding the libraries inside simple.fs.do,
which would have been easier. Now we can make each of those filesystems its
own layer.
There's one more complication: we did things a bit backwards. In a Dockerfile, you install the libraries first, and then you install your application. When you replace your application, you replace only the topmost layer. We did it the other way around: we installed our application and some debugging tools, then detected which libraries they need and added a layer on top. The most recent versions of Docker, 1.10 and above, are more efficient about handling layers changing in the middle of the stack, but not everyone is using newer Docker versions yet, so let's try to make things efficient for older Docker versions too.
Luckily, since we're starting from first principles, in redo we can do anything we want. We have to generate a tarball for each layer anyway, so we can decide what goes into each layer and then we can put those layers in whatever sequence we want.
Let's start simple. A layer is just a tarball made of a set of files
(again, ignore the try_fakeroot
stuff for now):
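A sketch of default.layer.do (the "-H ustar" trick asks cpio to write tar
format, so the result really is a layer tarball; the .fakedb name is again an
assumption):

# default.layer.do (sketch): strip the checksums off the .list file and
# archive the listed paths out of the $2 directory, in tar format.
redo-ifchange "$2.list" ./try_fakeroot.sh
cut -d' ' -f1 <"$2.list" |
./try_fakeroot.sh "$2.fakedb" sh -c "cd '$2' && cpio -o -H ustar" >$3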
The magic, of course, is in deciding which files go into which layers. In
the script above, that's provided in the .list file corresponding to each
layer. The .list file is produced by default.list.do:
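Here's a sketch of default.list.do (file names, the fileids.py argument, and
the fakeroot handling are assumptions; the idea is what matters):

# default.list.do (sketch): list every file in the $2 directory, augment each
# name with mode/owner/checksum via fileids.py, then (if $2.diffbase names a
# parent layer) drop everything already present in the parent's list.
redo-ifchange "$2.fs" fileids.py ./try_fakeroot.sh
base=/dev/null
if [ -e "$2.diffbase" ]; then
    read -r parent <"$2.diffbase"
    redo-ifchange "$parent.list"
    base="$parent.list"
fi
(cd "$2" && find . | sort) |
./try_fakeroot.sh "$2.fakedb" python fileids.py "$2" |
comm -1 -3 "$base" - >$3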
This requires a bit of explanation. First of all, you probably haven't seen
the very old, but little-known comm
program before. It's often described
as "compare two sorted files" or "show common lines between two files." But
it actually does more than just showing common lines: it can show the lines
that are only in file #1, or only in file #2, or in both files. comm -1 -3
suppresses the output of lines that are only in #1 or that are in both, so
that it will print the lines that are only in the second file.
If we want to make a libs.layer that contains only the files that are not
in simple, then we can use comm -1 -3 to compare simple with libs.
Now, this script is supposed to be able to construct the file list for any layer. To do that, it has to know what parent to compare each layer against. We call that the "diffbase", and for layers that are based on other layers, we put the name of the parent layer in its diffbase file:
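For example, the libs layer is computed relative to simple, so libs.diffbase
presumably contains just that one name:

$ cat libs.diffbase
simple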
(If there's no diffbase, then we use /dev/null as the diffbase. Because if file #1 is empty, then all the lines are only in file #2, which is exactly what we want.)
There's just one more wrinkle: if we just compare lists of files, then we'll
detect newly-added files, but we won't detect modified files. To fix
this, we augment the file list with file checksums before the comparison
(using fileids.py), then strip the checksums back out in default.layer.do
before sending the resulting list to cpio.
The augmented file list looks like this:
$ cat simple.list
. 0040755-0-0-0
./bin 0040755-0-0-0
./bin/busybox 0100755-0-0-ba34fb34865ba36fb9655e724266364f36155c93326b6b73f4e3d516f51f6fb2
./bin/hello 0100755-0-0-22e4d2865e654f830f6bfc146e170846dde15185be675db4e9cd987cb02afa78
./bin/sh 0100755-0-0-e803088e7938b328b0511957dcd0dd7b5600ec1940010c64dbd3814e3d75495f
./init 0120777-0-0-14bdc0fb069623c05620fc62589fe1f52ee6fb67a34deb447bf6f1f7e881f32a
(Side note: the augmentation needs to be added at the end of the line, not
the beginning, so that the file list is still sorted afterwards. comm only
works correctly if both input files are sorted.)
The script for augmenting the file list is fairly simple. It just reads a list of filenames on stdin, checksums those files, and writes the augmented list on stdout:
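The real fileids.py isn't reproduced here; a rough sketch follows (the
directory argument is an assumption, and the real script may handle
ownership differently, since the sample above shows everything as uid/gid 0):

#!/usr/bin/env python
# fileids.py (sketch): read relative filenames on stdin; for each, print the
# name plus "mode-uid-gid-checksum" so two listings can be compared with comm
# to find files that were added *or* modified.
import hashlib, os, sys

topdir = sys.argv[1]

def checksum(path):
    if os.path.islink(path):
        return hashlib.sha256(os.readlink(path).encode('utf-8')).hexdigest()
    elif os.path.isfile(path):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(65536), b''):
                h.update(block)
        return h.hexdigest()
    else:
        return '0'   # directories, device nodes, etc.

for line in sys.stdin:
    name = line.rstrip('\n')
    if not name:
        continue
    path = os.path.join(topdir, name)
    st = os.lstat(path)
    print('%s %07o-%d-%d-%s'
          % (name, st.st_mode, st.st_uid, st.st_gid, checksum(path)))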
Just one more thing! Docker (before 1.10) deduplicates images by detecting that they contain identical layers. When using a Dockerfile, the layers are named automatically using random 256-bit ids. Since Dockerfiles usually don't regenerate earlier layers, the ids of those earlier layers won't change, so future images will contain layers with known ids, which Docker doesn't need to store or transfer again.
We don't want to rely on never rebuilding layers. Instead, we'll adopt a
technique from newer Docker versions (post 1.10): we'll name layers after a
checksum of their contents. Now, we don't want to actually checksum the
whatever.layer
file, because it turns out that tarballs contain a bunch of
irrelevant details, like inode numbers and
mtimes, so they'll have a different
checksum every time they're built. Instead, we'll make a digest of the
whatever.list
file, which conveniently already has a checksum of each
file's contents, plus the interesting subset of the file's attributes.
Docker expects 256-bit layer names, so we might normally generate a sha256
digest using the sha256sum
program, but that's not available on all
platforms. Let's write a python script to do the job instead. To make it
interesting, let's write it as a .do file, so we can generate the sha256 of
anything
by asking for redo-ifchange anything.sha256. This is a good
example of how in redo, .do files can be written in any scripting language,
not just sh.
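For example, something along these lines would work (a sketch, not
necessarily the cookbook's exact script):

#!/usr/bin/env python
# default.sha256.do (sketch): a .do file written in python. redo passes the
# target as argv[1], the target minus the .sha256 extension as argv[2], and
# a temporary output file as argv[3]; writing the result to stdout is fine.
import hashlib, subprocess, sys

target, base, tmp = sys.argv[1:4]
subprocess.check_call(['redo-ifchange', base])   # declare the dependency

h = hashlib.sha256()
with open(base, 'rb') as f:
    for block in iter(lambda: f.read(65536), b''):
        h.update(block)
sys.stdout.write(h.hexdigest() + '\n')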
Let's test it out:
$ redo simple.list.sha256
redo simple.list.sha256
redo simple.list
$ cat simple.list.sha256
4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145
$ rm -rf simple
$ redo simple.list.sha256
redo simple.list.sha256
redo simple.list
redo simple.fs
$ cat simple.list.sha256
4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145
Consistent layer id across rebuilds! Perfect.
Combining layers: building a Docker image
We're almost there. Now that we can produce a tarball for each layer, we have to produce the final tarball that contains all the layers in the right order. For backward compatibility with older Docker versions, we also need to produce a json "manifest" for each layer. In those old versions, each layer was also its own container, so it needed to have all the same attributes as a container, including a default program to run, list of open ports, and so on. We're never going to use those values except for the topmost layer, but they have to be there, so let's just auto-generate them. Here's the script for customizing each layer's json file from a template:
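The real helper isn't shown here; a rough equivalent (the layerjson.py name
and its arguments are inventions for this sketch) could be:

#!/usr/bin/env python
# layerjson.py (sketch): fill in a layer's json manifest from the template.
# usage: layerjson.py TEMPLATE LAYER_ID [PARENT_ID]
import json, sys

template, layer_id = sys.argv[1], sys.argv[2]
parent_id = sys.argv[3] if len(sys.argv) > 3 else None

with open(template) as f:
    obj = json.load(f)
obj['id'] = layer_id
if parent_id:
    obj['parent'] = parent_id
else:
    obj.pop('parent', None)
json.dump(obj, sys.stdout)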
And here's the empty template:
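The exact template isn't reproduced here; a minimal guess at it, with just
enough fields to keep old Docker versions happy, might be:

{
  "id": "",
  "parent": "",
  "created": "1970-01-01T00:00:00Z",
  "architecture": "amd64",
  "os": "linux",
  "config": {
    "Cmd": ["/init"],
    "Env": ["PATH=/bin:/usr/bin:/sbin:/usr/sbin"]
  }
}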
Now we just need to generate all the layers in a subdirectory, and tar them together:
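A sketch of default.image.do (the json.template and layerjson.py names come
from the sketches above and are assumptions, as is the temporary directory
naming):

# default.image.do (sketch): build each layer named in $1.layers (bottom-most
# first), give each one a directory named after its content checksum, and tar
# the directories together topmost-first, the order Docker expects.
redo-ifchange "$1.layers"
layers=$(cat "$1.layers")
for d in $layers; do
    redo-ifchange "$d.layer" "$d.list.sha256"
done
tmpdir="$1.dir"
rm -rf "$tmpdir" && mkdir "$tmpdir"
parent=
order=
for d in $layers; do
    id=$(cat "$d.list.sha256")
    echo "layer: $id $d" >&2
    mkdir "$tmpdir/$id"
    echo 1.0 >"$tmpdir/$id/VERSION"
    python layerjson.py json.template "$id" $parent >"$tmpdir/$id/json"
    cp "$d.layer" "$tmpdir/$id/layer.tar"
    parent=$id
    order="$id $order"    # reverse: topmost layer first in the final tar
done
(cd "$tmpdir" && tar -cf - $order) >$3
rm -rf "$tmpdir"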
This requires a list of layers for each image we might want to create.
Here's the list of two layers for our simple
container:
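It's presumably just the layer names, bottom-most layer first:

$ cat simple.image.layers
libs
simple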
Finally, some people like to compress their Docker images for transport or uploading to a repository. Here's a nice .do script that can produce the .gz compressed version of any file:
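Something like this does the job (a sketch; the real one may differ
slightly):

# default.gz.do (sketch): compress anything into anything.gz.
redo-ifchange "$2"
gzip --rsyncable <"$2" >$3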
Notice the use of --rsyncable. Very few people seem to know about this gzip
option, but it's immensely handy. Normally, if a few bytes change early in a
file, it completely changes gzip's output for all future bytes, which means
that incremental copying of new versions of a file (eg. using rsync) is very
inefficient. With --rsyncable, gzip does a bit of extra work to make sure
that small changes in one part of a file don't affect the gzipped bytes later
in the file, so an updated container will be able to transfer a minimal
number of bytes, even if you compress it.
Let's try it out!
$ redo simple.image.gz
redo simple.image.gz
redo simple.image
redo libs.list.sha256
redo libs.list
redo simple.list
redo libs.layer
3607 blocks
redo simple.list.sha256
redo simple.layer
1569 blocks
layer: b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95 libs
layer: 4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145 simple
$ tar -tf simple.image.gz
4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145/
4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145/VERSION
4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145/json
4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145/layer.tar
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/VERSION
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/json
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/layer.tar
In the above, notice how we build libs.layer first and simple.layer second,
because that's the order of the layers in simple.image.layers. But to
produce libs.list we need to compare the file list against simple.list, so
it declares a dependency on simple.list.
The final simple.image
tarball then includes the layers in reverse order
(topmost to bottommost), because that's how Docker does it. The id of the
resulting docker image is the id of the topmost layer, in this case
4d1fda9f.
Loading and running a Docker image
Phew! Okay, we finally have a completed Docker image in the format Docker expects, and we didn't have to execute even one Dockerfile. Incidentally, that means all of the above steps could run without having Docker installed, and without having any permissions to talk to the local Docker daemon. That's a pretty big improvement (in security and manageability) over running a Dockerfile.
The next step is to load the image into Docker, which is easy:
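A sketch of default.load.do (the empty marker-file trick is the same one
simple.fs used):

# default.load.do (sketch): feed the image tarball to the local docker daemon.
redo-ifchange "$2.image" ./need.sh
./need.sh docker
docker load <"$2.image" >&2
: >$3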
And finally, we can ask Docker to run our image, and capture its output like
we did, so long ago, in default.runlocal.do and default.runkvm.do:
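A sketch of default.rundocker.do, assuming (as in both of our examples) that
the image's topmost layer has the same name as the image itself, so its
checksum is the image id:

# default.rundocker.do (sketch): run the loaded image and capture stdout.
redo-ifchange "$2.load" "$2.list.sha256" ./need.sh
./need.sh docker
docker run --rm "$(cat "$2.list.sha256")" >$3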
The result is almost disappointing in its apparent simplicity:
$ time redo simple.rundocker
redo simple.rundocker
redo simple.load
real 0m2.688s
user 0m0.068s
sys 0m0.036s
$ cat simple.rundocker
Hello, world!
Notice that, for some reason, Docker takes 2.7s to load, launch and run our
tiny container. That's about 3x as long as it takes to boot and run a kvm
virtual machine up above with exactly the same files. This is kind of
weird, since containers are supposed to be much more lightweight than
virtual machines. I'm sure there's a very interesting explanation for this
phenomenon somewhere. For now, notice that you might save a lot of time by
initially testing your containers using default.runlocal
(0.11 seconds)
instead of Docker (2.7 seconds), even if you intend to eventually deploy them
in Docker.
A Debian-based container
We're not done yet! We've built and run a Docker container the hard way, but we haven't built and run an unnecessarily wastefully huge Docker container the hard way. Let's do that next, by installing Debian in a chroot, then packaging it up into a container.
As we do that, we'll recycle almost all the redo infrastructure we built
earlier while creating our simple
container.
Interlude: Fakeroot
It's finally time to talk about that mysterious try_fakeroot.sh
script
that showed up a few times earlier. It looks like this:
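The real script isn't reproduced here, but in spirit it's a small wrapper
like this sketch (interface assumed: a state file, then the command to run):

#!/bin/sh
# try_fakeroot.sh (sketch): run a command under fakeroot if it's available,
# keeping the simulated ownership database in a state file so that separate
# redo targets see consistent permissions; otherwise run the command as-is.
db="$1"; shift
if command -v fakeroot >/dev/null 2>&1; then
    if [ -s "$db" ]; then
        fakeroot -i "$db" -s "$db" -- "$@"
    else
        fakeroot -s "$db" -- "$@"
    fi
else
    "$@"
fi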
fakeroot is a tool, originally developed for the Debian project, that convinces your programs that they are running as root, without actually running them as root. This is mainly so that they can pretend to chown() files, without actually introducing security holes on the host operating system. Debian uses this when building packages: they compile the source, start fakeroot, install to a fakeroot directory, make a tarball of that directory, then exit fakeroot. The tarball then contains the permissions they want.
Normally, fakeroot forgets all its simulated file ownership and permissions
whenever it exits. However, it has -s
(save) and -i
(input) options for
saving the permissions to a file and reloading the permissions from that
file, respectively.
As we build our container layers, we need redo to continually enter
fakeroot, do some stuff, and exit it again. The try_fakeroot.sh
script is
a helper to make that easier.
Debootstrap
The next Debian tool we should look at is debootstrap. This handy program downloads and extracts the (supposedly) minimal packages necessary to build an operational Debian system in a chroot-ready subdirectory. Nice!
In order for debootstrap to work without being an administrator - and you should not run your build system as root - we'll use fakeroot to let it install all those packages.
Unfortunately, debootstrap is rather slow, for two reasons:
- It has to download a bunch of things.
- It has to install all those things.
And after debootstrap has run, all we have is a Debian system, which by itself isn't a very interesting container. (You usually want your container to have an app so it does something specific.)
Does this sound familiar? It sounds like a perfect candidate for Docker layers. Let's make three layers:
- Download the packages.
- Install the packages.
- Install an app.
Here's step one:
On top of that layer, we run the install process:
Since both steps run debootstrap and we might want to customize the set of packages to download+install, we'll put the debootstrap options in their own shared file:
And finally, we'll produce our "application" layer, which in this case is just a shell script that counts then number of installed Debian packages:
Building the Debian container
Now that we have the three filesystems, let's actually generate the Docker layers. But with a catch: we won't actually include the layer for step 1, since all those package files will never be needed again. (Similarly, if we were installing a compiler - and perhaps redo! - in the container so we could build our application in a controlled environment, we might want to omit the "install compiler" layers from the final product.)
So we list just two layers:
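As with simple.image.layers, this is presumably just the layer names,
bottom-most first:

$ cat debian.image.layers
debootstrap
debian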
And the 'debian' layer's diffbase is debootstrap, so we don't include the
same files twice:
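$ cat debian.diffbase
debootstrap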
Running the Debian container
This part is easy. All the parts are already in place. We'll just run
the existing default.rundocker.do:
$ time redo debian.rundocker
redo debian.rundocker
redo debian.load
redo debian.image
redo debian.list.sha256
redo debian.list
redo debian.layer
12 blocks
layer: a542b5976e1329b7664d79041d982ec3d9f7949daddd73357fde17465891d51d debootstrap
layer: d5ded4835f8636fcf01f6ccad32125aaa1fe9e1827f48f64215b14066a50b9a7 debian
real 0m7.313s
user 0m0.632s
sys 0m0.300s
$ cat debian.rundocker
82
It works! Apparently there are 82 Debian packages installed. It took 7.3 seconds to load and run the docker image though, probably because it had to transfer the full contents of those 82 packages over a socket to the docker server (presumably for security reasons) rather than just reading the files straight from disk. Luckily, our chroot and kvm scripts also still work:
$ time redo debian.runlocal
redo debian.runlocal
real 0m0.084s
user 0m0.052s
sys 0m0.004s
$ cat debian.runlocal
82
$ time redo debian.runkvm
redo debian.runkvm
redo debian.initrd
193690 blocks
1 block
debian: kvm memory required: 346M
[ 0.375365] reboot: Power down
ok.
real 0m3.445s
user 0m1.008s
sys 0m0.644s
$ cat debian.runkvm
82
Testing and housekeeping
Let's finish up by providing the usual boilerplate. First, an all.do
that
builds, runs, and tests all the images on all the container platforms.
This isn't a production build system, it's a subdirectory of the redo
package, so we'll skip softly, with a warning, if any of the components are
missing or nonfunctional. If you were doing this in a "real" system, you
could just let it abort when something is missing.
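A sketch of all.do (the real script is more thorough about detecting missing
or nonfunctional pieces like kvm, unshare, fakeroot, and debootstrap; this
sketch only checks for docker as an example of skipping softly):

# all.do (sketch): build and run the containers on every platform we can.
targets="libs.runlocal libs.runkvm debian.runlocal debian.runkvm"
if ./need.sh docker >/dev/null 2>&1; then
    targets="$targets simple.rundocker debian.rundocker"
else
    echo "all: docker not available; skipping the docker tests." >&2
fi
redo-ifchange $targets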
And here's a redo clean
script that gets rid of (most of) the files
produced by the build. We say "most of" the files, because actually we
intentionally don't delete the debdownload and debootstrap directories.
Those take a really long time to build, and redo knows to rebuild them if
their dependencies (or .do files) change anyway. So instead of throwing
away their content on 'redo clean', we'll keep it around.
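A sketch of clean.do (the exact list of patterns is an assumption):

# clean.do (sketch): remove build outputs, but deliberately keep the
# expensive debdownload and debootstrap trees (see xclean below).
rm -rf simple libs debian
rm -f *.fs *.list *.sha256 *.layer *.initrd *.image *.gz \
      *.runlocal *.runkvm *.rundocker *.load *.fakedb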
Still, we want a script that properly cleans up everything, so let's have
redo xclean
(short for "extra clean") wipe out the last remaining
files:
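A sketch of xclean.do:

# xclean.do (sketch): like clean, but also throw away the slow-to-rebuild
# debootstrap output.
redo clean
rm -rf debdownload debootstrap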