Containers
"Containers" became popular a few years ago with the emergence of Docker, but they are actually the result of a long line of evolution starting with chroot, a concept which dates all the way back to 1979. The idea of a container, or a chroot, is to run a process or set of processes in a (more or less) isolated environment that's separate from your main operating system.
The first iteration, chroot, only isolated the filesystem: chroot would "change" the "root" directory (hence the name) to a subdirectory of the main filesystem, then run a program that would see only files in that subdirectory. Among other things, this was used as a way to prevent rogue programs from accidentally damaging other files on the system. But it wasn't particularly safe, especially because any program running with administrator privileges could play tricks and eventually switch its root back to the "real" root directory. Separately from security, though, it's sometimes interesting to install a different operating system variant in a subdirectory, then chroot into it and run programs that require that operating system version. For example, if you're running the latest version of Debian Linux, but you want to build an application that only builds correctly on the Debian version from 5 years ago, you can install the 5-years-ago Debian files in a directory, chroot into that, and build your application. The main limitation is that your "host" system and your chroot environment share the same kernel version, and rogue programs usually can find a way to escape the chroot, so it's not useful if your inner system is running dangerous code.
Partly in response to the limitations of chroot, "virtualization" started to gain popularity around 2001, made famous by VMware. (IBM mainframes had been doing something similar for a few decades, but not many people knew how IBM mainframes worked.) Anyway, virtualization simulates a computer's actual hardware and lets you run a different kernel on the virtual hardware, and a filesystem inside that hardware. This has several advantages, including much stricter security separation and the ability to run a different kernel or even a different "guest" operating system than the one on the host. Virtualization used to be pretty slow, but it's gotten faster and faster over the years, especially with the introduction of "paravirtualization," where we emulate special virtual-only "hardware" that needs special drivers in the guest, in exchange for better performance. On Linux, the easiest type of paravirtualization nowadays is kvm (kernel virtual machine), a variant of QEMU.
Virtual machines provide excellent security isolation, but at the expense of
performance, since every VM instance needs to have its own kernel, drivers,
init system, terminal emulators, memory management, swap space, and so on.
In response to this, various designers decided to go back to the old chroot
system and start fixing the isolation limits, one by one. The
history from here gets a bit complicated, since there are many, overlapping,
new APIs that vary between operating systems and versions. Eventually, this
collection of features congealed into what today we call "containers," in
products like OpenVZ,
LXC, and (most famously) Docker.
Why are we talking about all this? Because in this tutorial, we'll use
redo
to build and run three kinds of containers (chroot, kvm, and docker),
sharing the same app build process between all three. redo's dependency and
parallelism management makes it easy to build multiple container types in
parallel, share code between different builds, and use different container
types (each with different tradeoffs) for different sorts of testing.
You can follow along by checking out the redo source and looking in the docs/cookbook/container/ directory.
A Hello World container
Most Docker tutorials start at the highest level of abstraction: download someone else's container, copy your program into it, and run your program in a container. In the spirit of redo's low-level design, we're going to do the opposite, starting at the very lowest level and building our way up. The lowest level is, of course, Hello World, which we compiled (with redo of course) in an earlier tutorial:
In fact, our earlier version of Hello World is a great example of redo's
safe recursion. Instead of producing an app as part of this tutorial, we'll
just redo-ifchange ../hello/hello from within our new project, confident that
redo will figure out any locking, dependency, consistency, and parallelism
issues. (This sort of thing usually doesn't work very well in make, because
you might get two parallel sub-instances of make recursing into the ../hello
directory simultaneously, stomping on each other.)
For our first "container," we're just going to build a usable chroot
environment containing our program (/bin/hello
) and the bare minimum
requirements of an "operating system": a shell (/bin/sh
), an init script
(/init
, which will just be a symlink to /bin/hello
), and, for debugging
purposes, the all-purpose busybox program.
Here's a .do script that will build our simple container filesystem:
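The cookbook's actual script isn't reproduced here, but a minimal sketch of
what simple.fs.do might look like is below (the host busybox location and the
exact copy commands are assumptions):

# simple.fs.do (sketch): build the "simple" directory as a side effect.
# The simple.fs target itself is just an empty marker file.
redo-ifchange ../hello/hello
rm -rf simple
mkdir -p simple/bin
cp ../hello/hello simple/bin/hello
cp "$(command -v busybox)" simple/bin/busybox  # assumes busybox is installed
ln -s busybox simple/bin/sh    # busybox behaves as a shell when run as "sh"
ln -s /bin/hello simple/init   # /init is just a symlink to /bin/hello
: >$3                          # leave the simple.fs target itself empty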
There's a little surprise here. Did you see it above? In current versions of redo, the semantics of a .do script producing a directory as its output are undefined. That's because the redo authors haven't yet figured out quite what ought to happen when a .do file creates a directory. Or rather, what should happen after you create a directory?
Can people redo-ifchange on a file inside that newly created directory? If
so, what if the new directory
contains .do files? What if you redo-ifchange
one of the sub-files before
you redo-ifchange
the directory that contains it, so that the sub-file's
.do doesn't exist yet? And so on. We don't know. So for now, to stop you
from depending on this behaviour, we intentionally made it not work.
Instead of that, you can have a .do script that produces a different
directory as a side effect. So above, simple.fs.do produces a directory
called simple when you run redo simple.fs. simple.fs is the (incidentally
empty) target file, which is remembered by redo as a node in the dependency
tree, so that other scripts can depend upon it using redo-ifchange simple.fs.
The simple directory happens to materialize too, and redo doesn't know
anything about
it, which means it doesn't try to do anything about it, and you don't have
to care what redo's semantics for it might someday be. In other words,
maybe someday we'll find a more elegant way to handle .do files that create
directories, but we won't break your old code when we do.
Okay?
All right, one more catch. Operating systems are complicated, and there's
one more missing piece. Our Hello World program is dynamically linked,
which means it depends on shared libraries elsewhere in the system. You can
see exactly which ones by using the ldd
command:
$ ldd ../hello/hello
linux-vdso.so.1 (0x00007ffd1ffca000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9ddf8fd000)
/lib64/ld-linux-x86-64.so.2 (0x00007f9ddfe9e000)
If we chroot into our simplistic "container" and try to run hello, it won't
work, because those libraries aren't available to programs inside the chroot.
That's the whole point of chroot, after all!
How do we fix it? We get a list of the libraries with ldd, and then we copy
the libraries into place.
Actually, for reasons we'll address below, let's make a copy of the new filesystem and copy the new libraries into that:
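The real libs.fs.do isn't shown here; a rough sketch, assuming we can scrape
the library paths straight out of ldd's output, might be:

# libs.fs.do (sketch): clone the "simple" tree into "libs", then copy in the
# shared libraries that ldd reports for our binaries.
redo-ifchange simple.fs
rm -rf libs
cp -a simple libs
ldd simple/bin/hello simple/bin/busybox 2>/dev/null |
grep -o '/[^ ]*' |     # pull out the absolute paths ldd mentions
sort -u |
while read -r lib; do
    [ -e "$lib" ] || continue       # skips linux-vdso, which isn't a file
    mkdir -p "libs$(dirname "$lib")"
    cp -L "$lib" "libs$lib"
done
: >$3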
So now there's a directory called simple, which contains our program and
some helper programs, and one called libs, which contains all that stuff,
plus the supporting libraries. That latter one is suitable for use with
chroot.
Running a container with unshare and chroot
So let's run it! We can teach redo how to start a program inside any chroot
by using a default.do script. In this case, we'll use default.runlocal.do.
With that file in place, when we run redo whatever.runlocal (for any value of
whatever), redo will first construct the whatever directory (using
redo-ifchange whatever.fs), and then chroot into it and run /init inside.
We'll collect stdout into the redo output (ie. the file outside the chroot
named whatever.runlocal). Also,
the stderr will go to redo's build log, readable with redo-log
or on the console at build time, and if the /init
script returns a nonzero
exit code, so will our script. As a result, the whole container execution
will act like a single node in our build process. It can depend on other
things, and other things can depend on it.
Just one more thing: once upon a time, chroot
was only available to
sysadmins, not normal users. And it's never a good idea to run your build
scripts as root. Luckily, Linux recently got a feature called "user
namespaces" (userns), which, among many other things, lets non-administrator
users use chroot. This is a really great addition.
(Unfortunately, some people worry that user namespaces might create security
holes. From an abundance of caution, many OSes disable user namespaces for
non-administrators by default. So most of this script is just detecting
those situations so it can give you a useful warning. The useful part of
the script is basically just: unshare -r chroot "$2" /init >$3. Alas, the
subsequent error handling makes our script look long and complicated.)
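A stripped-down sketch of what default.runlocal.do boils down to, with nearly
all of that error handling left out (and with ./need.sh, described just
below):

# default.runlocal.do (sketch): chroot into the $2 directory and run /init,
# collecting its stdout as the target. The real script adds many checks.
redo-ifchange "$2.fs" ./need.sh
./need.sh unshare chroot
unshare -r chroot "$2" /init >$3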
Speaking of error handling, the script above calls a script called
./need.sh, which is just a helper that prints a helpful error message and
aborts right away if the listed programs are not available to run, rather
than failing in a more complicated way. We'll use that script more
extensively below.
And that's it! A super simple container!
$ redo libs.runlocal
redo libs.runlocal
redo libs.fs
redo simple.fs
$ time redo libs.runlocal
redo libs.runlocal
real 0m0.112s
user 0m0.060s
sys 0m0.024s
$ du libs
792 libs/bin
156 libs/lib64
1656 libs/lib/x86_64-linux-gnu
1660 libs/lib
3752 libs
$ cat libs.runlocal
Hello, world!
By the way, if this were a docker tutorial, it would still print "Hello, world!" but your container would be >100 megabytes instead of 3.7 megabytes, and it would have taken at least a couple of seconds to start instead of 0.11 seconds. But we'll get to that later. First, now that we have a container, let's do more stuff with it!
Running a container with kvm and initrd
Now you've seen chroot in action, but we can run almost the same container
in kvm (kernel virtual machine) instead, with even greater isolation. kvm
only runs on Linux, so for this step you'll need a Linux machine. And
for our example, we'll just have it run exactly the same kernel you're
already using, although kvm has the ability to use whatever kernel you want.
(You could even build a kernel as part of your redo project, redo-ifchange
it, and then run it with kvm. But we're not going to do that.)
Besides a kernel, kvm needs an "initial ramdisk", which is where it'll get its filesystem. (kvm can't exactly access your normal filesystem, because it's emulating hardware, and there's no such thing as "filesystem hardware." There are tools like the 9p filesystem that make this easier, but 9p isn't available in all kernel builds, so we'll avoid it for now.)
"Initial ramdisk" (initrd) sounds fancy, but it's actually just a tarball (technically, a cpio archive) that the kernel extracts into a ramdisk at boot time. Since we already have the files, making the tarball is easy:
(Ignore that try_fakeroot.sh
thing for now. We'll get to it a bit further
down. In our simple.fs
example, it's a no-op anyway.)
The main thing you need to know is that, unlike tar, cpio takes a list of
files on stdin instead of on the command line, and it doesn't recurse
automatically (so if you give it a directory name, it'll store an entry for
that directory, but not its contents, unless you also provide a list of its
contents). This gives us a lot of power, which we'll use later. For now
we're just doing basically find | cpio -o, which takes all the files and
directories and puts them in a cpio archive file.
$ redo libs.initrd
redo libs.initrd
5163 blocks
1 block
$ cpio -t <libs.initrd
.
bin
bin/hello
bin/busybox
bin/sh
lib64
lib64/ld-linux-x86-64.so.2
lib
lib/x86_64-linux-gnu
lib/x86_64-linux-gnu/libc.so.6
init
7444 blocks
default.initrd.do
also appends another file, rdinit
(the "ram disk init"
script), which is the first thing the kvm Linux kernel will execute after
booting. We use this script to set up a useful environment for our
container's /init
script to run in - notably, it has to write its stdout
to some virtual hardware device, so redo can capture it, and it has to save
its exit code somewhere, so redo knows whether it succeeded or not. Here's a
simple rdinit
script that should work with any container we want to run
using this technique:
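The real rdinit isn't shown here; a sketch along these lines (assuming a
kernel with devtmpfs, and an rdinit kept executable in the source tree so
cpio preserves its mode) conveys the idea:

#!/bin/sh
# rdinit (sketch): pid 1 inside the VM. The kernel is booted with
# rdinit=/rdinit and three serial ports: ttyS0 is the console (stderr),
# ttyS1 collects /init's stdout, ttyS2 collects its exit code.
export PATH=/bin:/usr/bin:/sbin:/usr/sbin
# busybox-only containers have just /bin/sh; give ourselves the other tools.
[ -x /bin/busybox ] && /bin/busybox --install -s /bin
mkdir -p /proc /dev
mount -t proc proc /proc
mount -t devtmpfs dev /dev
/init >/dev/ttyS1 2>/dev/console
echo "$?" >/dev/ttyS2
poweroff -f 2>/dev/null || echo o >/proc/sysrq-trigger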
Configuring a virtual machine can get a little complicated, and there are a million things we might want to do. One of the most important is setting the size of the ramdisk needed for the initrd. Current Linux versions limit the initrd to half the available RAM in the (virtual) machine, so to be safe, we'll make sure to configure kvm to provide at least 3x as much RAM as the size of the initrd. Here's a simple script to calculate that:
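The cookbook's exact calculation isn't reproduced here; a helper along these
lines (the kvmram.sh name and the 64MB of headroom are assumptions) would do
the job:

#!/bin/sh
# kvmram.sh (sketch): print a RAM size, in megabytes, big enough to unpack
# the given initrd (3x its size) plus some headroom for the kernel itself.
initrd="$1"
bytes=$(wc -c <"$initrd")
echo $(( bytes * 3 / 1024 / 1024 + 64 ))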
With all those pieces in place, actually executing the kvm is pretty painless. Notice in particular the three serial ports we create: one for the console (stderr), one for the output (stdout), and one for the exit code:
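Something like the following sketch captures the idea (the kernel path, the
kvmram.sh helper above, and most of the qemu options are assumptions; the
real default.runkvm.do has more error handling):

# default.runkvm.do (sketch): boot the current kernel with our initrd in kvm.
# Serial port 0 is the console (sent to stderr, i.e. the redo log), port 1
# collects /init's stdout into the target, port 2 collects the exit code.
redo-ifchange "$2.initrd" ./kvmram.sh ./need.sh
./need.sh kvm
mem=$(./kvmram.sh "$2.initrd")
echo "$2: kvm memory required: ${mem}M" >&2
codefile=$(mktemp)
kvm -m "$mem" \
    -kernel "/boot/vmlinuz-$(uname -r)" \
    -initrd "$2.initrd" \
    -append "console=ttyS0 rdinit=/rdinit quiet" \
    -display none -no-reboot \
    -serial stdio \
    -serial "file:$3" \
    -serial "file:$codefile" \
    >&2
code=$(tr -d '\r' <"$codefile")
rm -f "$codefile"
[ "$code" = "0" ] || exit 1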
And it works!
$ redo libs.runkvm
redo libs.runkvm
redo libs.initrd
5163 blocks
1 block
libs: kvm memory required: 70M
[ 0.306682] reboot: Power down
ok.
$ time redo libs.runkvm
redo libs.runkvm
libs: kvm memory required: 70M
[ 0.295139] reboot: Power down
ok.
real 0m0.887s
user 0m0.748s
sys 0m0.112s
$ cat libs.runkvm
Hello, world!
Virtual machines have come a long way since 1999: we managed to build an initrd, boot kvm, run our program, and shut down in only 0.9 seconds. It could probably go even faster if we used a custom-built kernel with no unnecessary drivers.
A real Docker container
Okay, that was fun, but nobody in real life cares about all these fast, small, efficient isolation systems that are possible for mortals to understand, right? We were promised a Container System, and a container system has daemons, and authorization, and quotas, and random delays, and some kind of Hub where I can download (and partially deduplicate) someone else's multi-gigabyte Hello World images that are built in a highly sophisticated enterprise-ready collaborative construction process. Come on, tell me, can redo do that?
Of course! But we're going to get there the long way.
First, let's use the big heavy Container System with daemons and delays to run our existing tiny understandable container. After that, we'll show how to build a huge incomprehensible container that does the same thing, so your co-workers will think you're normal.
Docker and layers
Normal people build their Docker containers using a Dockerfile. A Dockerfile is sort of like a non-recursive redo, or maybe a Makefile, except that it runs linearly, without the concept of dependencies or parallelization. In that sense, I guess it's more like an IBM mainframe job control script from 1970. It even has KEYWORDS in ALL CAPS, just like 1970.
Dockerfiles do provide one really cool innovation over IBM job control scripts, which is that they cache intermediate results so you don't have to regenerate them every time. Basically, every step in a Dockerfile copies a container, modifies it slightly, and saves the result for use in the next step. If you modify step 17 and re-run the Dockerfile, it can just start with the container produced by step 16, rather than going all the way back to step 1. This works pretty well, although it's a bit expensive to start and stop a container at each build step, and it's unclear when and how interim containers are expunged from the cache later. And some of your build steps are "install the operating system" and "install the compiler", so each step produces a larger and larger container. A very common mistake among Docker users is to leave a bunch of intermediate files (source code, compilers, packages, etc) installed in the output container, bloating it up far beyond what's actually needed to run the final application.
Spoiler: we're not going to do it that way.
Instead, let's use redo to try to get the same Dockerfile advantages (multi-stage cache; cheap incremental rebuilds) without the disadvantages (launching and unlaunching containers; mixing our build environment with our final output).
To understand how we'll do this, we need to talk about Layers. Unlike our kvm initrd from earlier, a Docker image is not just a single tarball; it's a sequence of tarballs, each containing the set of files changed at each step of the build process. This layering system is how Docker's caching and incremental update system works: if I incrementally build an image starting from step 17, based on the pre-existing output from step 16, then the final image can just re-use layers 1..16 and provide new layers 17..n. Usually, the first few layers (install the operating system, install the compilers, etc) are the biggest ones, so this means a new version of an image takes very little space to store or transfer to a system that already has the old one.
The inside of a docker image looks like this:
$ tar -tf test.image
ae5419fd49e39e4dc0baab438925c1c6e4417c296a8b629fef5ea93aa6ea481c/
ae5419fd49e39e4dc0baab438925c1c6e4417c296a8b629fef5ea93aa6ea481c/VERSION
ae5419fd49e39e4dc0baab438925c1c6e4417c296a8b629fef5ea93aa6ea481c/json
ae5419fd49e39e4dc0baab438925c1c6e4417c296a8b629fef5ea93aa6ea481c/layer.tar
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/VERSION
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/json
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/layer.tar
We could use redo to build a Docker image by simply making a single
layer.tar
of the filesystem (like we did with initrd), adding a VERSION
and json file, and putting those three things into an outer tarball. But if
we want a system that works as well as a Dockerfile, we'll have to make use
of multiple layers.
Our simple
container is already pretty tiny by container standards - 2.6MB
- but it's still a bit wasteful. Most of that space turns out to be from
the dynamic libraries we imported from the host OS. These libraries don't
change when we change Hello World! They belong in their own layer.
Up above, in preparation for this moment, we created libs.fs.do to build a
separate filesystem, rather than adding the libraries inside simple.fs.do,
which would have been easier. Now we can make each of those filesystems its
own layer.
There's one more complication: we did things a bit backwards. In a Dockerfile, you install the libraries first, and then you install your application. When you replace your application, you replace only the topmost layer. We did it the other way around: we installed our application and some debugging tools, then detected which libraries they need and added a layer on top. The most recent versions of Docker, 1.10 and above, are more efficient about handling layers changing in the middle of the stack, but not everyone is using newer Docker versions yet, so let's try to make things efficient for older Docker versions too.
Luckily, since we're starting from first principles, in redo we can do anything we want. We have to generate a tarball for each layer anyway, so we can decide what goes into each layer and then we can put those layers in whatever sequence we want.
Let's start simple. A layer is just a tarball made of a set of files
(again, ignore the try_fakeroot
stuff for now):
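A sketch of default.layer.do (the "-H ustar" trick asks cpio to write tar
format, so the result really is a layer tarball; the .fakedb name is again an
assumption):

# default.layer.do (sketch): strip the checksums off the .list file and
# archive the listed paths out of the $2 directory, in tar format.
redo-ifchange "$2.list" ./try_fakeroot.sh
cut -d' ' -f1 <"$2.list" |
./try_fakeroot.sh "$2.fakedb" sh -c "cd '$2' && cpio -o -H ustar" >$3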
The magic, of course, is in deciding which files go into which layers. In
the script above, that's provided in the .list file corresponding to each
layer. The .list file is produced by default.list.do:
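Here's a sketch of default.list.do (file names, the fileids.py argument, and
the fakeroot handling are assumptions; the idea is what matters):

# default.list.do (sketch): list every file in the $2 directory, augment each
# name with mode/owner/checksum via fileids.py, then (if $2.diffbase names a
# parent layer) drop everything already present in the parent's list.
redo-ifchange "$2.fs" fileids.py ./try_fakeroot.sh
base=/dev/null
if [ -e "$2.diffbase" ]; then
    read -r parent <"$2.diffbase"
    redo-ifchange "$parent.list"
    base="$parent.list"
fi
(cd "$2" && find . | sort) |
./try_fakeroot.sh "$2.fakedb" python fileids.py "$2" |
comm -1 -3 "$base" - >$3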
This requires a bit of explanation. First of all, you probably haven't seen
the very old, but little-known comm
program before. It's often described
as "compare two sorted files" or "show common lines between two files." But
it actually does more than just showing common lines: it can show the lines
that are only in file #1, or only in file #2, or in both files. comm -1 -3
suppresses the output of lines that are only in #1 or that are in both, so
that it will print the lines that are only in the second file.
If we want to make a libs.layer that contains only the files that are not
in simple, then we can use comm -1 -3 to compare simple with libs.
Now, this script is supposed to be able to construct the file list for any layer. To do that, it has to know what parent to compare each layer against. We call that the "diffbase", and for layers that are based on other layers, we put the name of the parent layer in its diffbase file:
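For example, the libs layer is computed relative to simple, so libs.diffbase
presumably contains just that one name:

$ cat libs.diffbase
simple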
(If there's no diffbase, then we use /dev/null as the diffbase. Because if file #1 is empty, then all the lines are only in file #2, which is exactly what we want.)
There's just one more wrinkle: if we just compare lists of files, then we'll
detect newly-added files, but we won't detect modified files. To fix
this, we augment the file list with file checksums before the comparison
(using fileids.py), then strip the checksums back out in default.layer.do
before sending the resulting list to cpio.
The augmented file list looks like this:
$ cat simple.list
. 0040755-0-0-0
./bin 0040755-0-0-0
./bin/busybox 0100755-0-0-ba34fb34865ba36fb9655e724266364f36155c93326b6b73f4e3d516f51f6fb2
./bin/hello 0100755-0-0-22e4d2865e654f830f6bfc146e170846dde15185be675db4e9cd987cb02afa78
./bin/sh 0100755-0-0-e803088e7938b328b0511957dcd0dd7b5600ec1940010c64dbd3814e3d75495f
./init 0120777-0-0-14bdc0fb069623c05620fc62589fe1f52ee6fb67a34deb447bf6f1f7e881f32a
(Side note: the augmentation needs to be added at the end of the line, not
the beginning, so that the file list is still sorted afterwards. comm only
works correctly if both input files are sorted.)
The script for augmenting the file list is fairly simple. It just reads a list of filenames on stdin, checksums those files, and writes the augmented list on stdout:
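The real fileids.py isn't reproduced here; a rough sketch follows (the
directory argument is an assumption, and the real script may handle
ownership differently, since the sample above shows everything as uid/gid 0):

#!/usr/bin/env python
# fileids.py (sketch): read relative filenames on stdin; for each, print the
# name plus "mode-uid-gid-checksum" so two listings can be compared with comm
# to find files that were added *or* modified.
import hashlib, os, sys

topdir = sys.argv[1]

def checksum(path):
    if os.path.islink(path):
        return hashlib.sha256(os.readlink(path).encode('utf-8')).hexdigest()
    elif os.path.isfile(path):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(65536), b''):
                h.update(block)
        return h.hexdigest()
    else:
        return '0'   # directories, device nodes, etc.

for line in sys.stdin:
    name = line.rstrip('\n')
    if not name:
        continue
    path = os.path.join(topdir, name)
    st = os.lstat(path)
    print('%s %07o-%d-%d-%s'
          % (name, st.st_mode, st.st_uid, st.st_gid, checksum(path)))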
Just one more thing! Docker (before 1.10) deduplicates images by detecting that they contain identical layers. When using a Dockerfile, the layers are named automatically using random 256-bit ids. Since Dockerfiles usually don't regenerate earlier layers, the ids of those earlier layers won't change, so future images will contain layers with known ids, which Docker doesn't need to store or transfer again.
We don't want to rely on never rebuilding layers. Instead, we'll adopt a
technique from newer Docker versions (post 1.10): we'll name layers after a
checksum of their contents. Now, we don't want to actually checksum the
whatever.layer
file, because it turns out that tarballs contain a bunch of
irrelevant details, like inode numbers and
mtimes, so they'll have a different
checksum every time they're built. Instead, we'll make a digest of the
whatever.list
file, which conveniently already has a checksum of each
file's contents, plus the interesting subset of the file's attributes.
Docker expects 256-bit layer names, so we might normally generate a sha256
digest using the sha256sum
program, but that's not available on all
platforms. Let's write a python script to do the job instead. To make it
interesting, let's write it as a .do file, so we can generate the sha256 of
anything
by asking for redo-ifchange anything.sha256. This is a good
example of how in redo, .do files can be written in any scripting language,
not just sh.
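For example, something along these lines would work (a sketch, not
necessarily the cookbook's exact script):

#!/usr/bin/env python
# default.sha256.do (sketch): a .do file written in python. redo passes the
# target as argv[1], the target minus the .sha256 extension as argv[2], and
# a temporary output file as argv[3]; writing the result to stdout is fine.
import hashlib, subprocess, sys

target, base, tmp = sys.argv[1:4]
subprocess.check_call(['redo-ifchange', base])   # declare the dependency

h = hashlib.sha256()
with open(base, 'rb') as f:
    for block in iter(lambda: f.read(65536), b''):
        h.update(block)
sys.stdout.write(h.hexdigest() + '\n')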
Let's test it out:
$ redo simple.list.sha256
redo simple.list.sha256
redo simple.list
$ cat simple.list.sha256
4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145
$ rm -rf simple
$ redo simple.list.sha256
redo simple.list.sha256
redo simple.list
redo simple.fs
$ cat simple.list.sha256
4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145
Consistent layer id across rebuilds! Perfect.
Combining layers: building a Docker image
We're almost there. Now that we can produce a tarball for each layer, we have to produce the final tarball that contains all the layers in the right order. For backward compatibility with older Docker versions, we also need to produce a json "manifest" for each layer. In those old versions, each layer was also its own container, so it needed to have all the same attributes as a container, including a default program to run, list of open ports, and so on. We're never going to use those values except for the topmost layer, but they have to be there, so let's just auto-generate them. Here's the script for customizing each layer's json file from a template:
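The real helper isn't shown here; a rough equivalent (the layerjson.py name
and its arguments are inventions for this sketch) could be:

#!/usr/bin/env python
# layerjson.py (sketch): fill in a layer's json manifest from the template.
# usage: layerjson.py TEMPLATE LAYER_ID [PARENT_ID]
import json, sys

template, layer_id = sys.argv[1], sys.argv[2]
parent_id = sys.argv[3] if len(sys.argv) > 3 else None

with open(template) as f:
    obj = json.load(f)
obj['id'] = layer_id
if parent_id:
    obj['parent'] = parent_id
else:
    obj.pop('parent', None)
json.dump(obj, sys.stdout)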
And here's the empty template:
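The exact template isn't reproduced here; a minimal guess at it, with just
enough fields to keep old Docker versions happy, might be:

{
  "id": "",
  "parent": "",
  "created": "1970-01-01T00:00:00Z",
  "architecture": "amd64",
  "os": "linux",
  "config": {
    "Cmd": ["/init"],
    "Env": ["PATH=/bin:/usr/bin:/sbin:/usr/sbin"]
  }
}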
Now we just need to generate all the layers in a subdirectory, and tar them together:
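A sketch of default.image.do (the json.template and layerjson.py names come
from the sketches above and are assumptions, as is the temporary directory
naming):

# default.image.do (sketch): build each layer named in $1.layers (bottom-most
# first), give each one a directory named after its content checksum, and tar
# the directories together topmost-first, the order Docker expects.
redo-ifchange "$1.layers"
layers=$(cat "$1.layers")
for d in $layers; do
    redo-ifchange "$d.layer" "$d.list.sha256"
done
tmpdir="$1.dir"
rm -rf "$tmpdir" && mkdir "$tmpdir"
parent=
order=
for d in $layers; do
    id=$(cat "$d.list.sha256")
    echo "layer: $id $d" >&2
    mkdir "$tmpdir/$id"
    echo 1.0 >"$tmpdir/$id/VERSION"
    python layerjson.py json.template "$id" $parent >"$tmpdir/$id/json"
    cp "$d.layer" "$tmpdir/$id/layer.tar"
    parent=$id
    order="$id $order"    # reverse: topmost layer first in the final tar
done
(cd "$tmpdir" && tar -cf - $order) >$3
rm -rf "$tmpdir"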
This requires a list of layers for each image we might want to create.
Here's the list of two layers for our simple
container:
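It's presumably just the layer names, bottom-most layer first:

$ cat simple.image.layers
libs
simple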
Finally, some people like to compress their Docker images for transport or uploading to a repository. Here's a nice .do script that can produce the .gz compressed version of any file:
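Something like this does the job (a sketch; the real one may differ
slightly):

# default.gz.do (sketch): compress anything into anything.gz.
redo-ifchange "$2"
gzip --rsyncable <"$2" >$3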
Notice the use of --rsyncable. Very few people seem to know about this gzip
option, but it's immensely handy. Normally, if a few bytes change early in a
file, it completely changes gzip's output for all future bytes, which means
that incremental copying of new versions of a file (eg. using rsync) is very
inefficient. With --rsyncable, gzip does a bit of extra work to make sure
that small changes in one part of a file don't affect the gzipped bytes later
in the file, so an updated container will be able to transfer a minimal
number of bytes, even if you compress it.
Let's try it out!
$ redo simple.image.gz
redo simple.image.gz
redo simple.image
redo libs.list.sha256
redo libs.list
redo simple.list
redo libs.layer
3607 blocks
redo simple.list.sha256
redo simple.layer
1569 blocks
layer: b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95 libs
layer: 4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145 simple
$ tar -tf simple.image.gz
4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145/
4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145/VERSION
4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145/json
4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145/layer.tar
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/VERSION
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/json
b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/layer.tar
In the above, notice how we build libs.layer first and simple.layer second,
because that's the order of the layers in simple.image.layers. But to
produce libs.list we need to compare the file list against simple.list, so
it declares a dependency on simple.list.
The final simple.image
tarball then includes the layers in reverse order
(topmost to bottommost), because that's how Docker does it. The id of the
resulting docker image is the id of the topmost layer, in this case
4d1fda9f.
Loading and running a Docker image
Phew! Okay, we finally have a completed Docker image in the format Docker expects, and we didn't have to execute even one Dockerfile. Incidentally, that means all of the above steps could run without having Docker installed, and without having any permissions to talk to the local Docker daemon. That's a pretty big improvement (in security and manageability) over running a Dockerfile.
The next step is to load the image into Docker, which is easy:
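A sketch of default.load.do (the empty marker-file trick is the same one
simple.fs used):

# default.load.do (sketch): feed the image tarball to the local docker daemon.
redo-ifchange "$2.image" ./need.sh
./need.sh docker
docker load <"$2.image" >&2
: >$3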
And finally, we can ask Docker to run our image, and capture its output like
we did, so long ago, in default.runlocal.do and default.runkvm.do:
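A sketch of default.rundocker.do, assuming (as in both of our examples) that
the image's topmost layer has the same name as the image itself, so its
checksum is the image id:

# default.rundocker.do (sketch): run the loaded image and capture stdout.
redo-ifchange "$2.load" "$2.list.sha256" ./need.sh
./need.sh docker
docker run --rm "$(cat "$2.list.sha256")" >$3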
The result is almost disappointing in its apparent simplicity:
$ time redo simple.rundocker
redo simple.rundocker
redo simple.load
real 0m2.688s
user 0m0.068s
sys 0m0.036s
$ cat simple.rundocker
Hello, world!
Notice that, for some reason, Docker takes 2.7s to load, launch and run our
tiny container. That's about 3x as long as it takes to boot and run a kvm
virtual machine up above with exactly the same files. This is kind of
weird, since containers are supposed to be much more lightweight than
virtual machines. I'm sure there's a very interesting explanation for this
phenomenon somewhere. For now, notice that you might save a lot of time by
initially testing your containers using default.runlocal
(0.11 seconds)
instead of Docker (2.7 seconds), even if you intend to eventually deploy them
in Docker.
A Debian-based container
We're not done yet! We've built and run a Docker container the hard way, but we haven't built and run an unnecessarily wastefully huge Docker container the hard way. Let's do that next, by installing Debian in a chroot, then packaging it up into a container.
As we do that, we'll recycle almost all the redo infrastructure we built
earlier while creating our simple
container.
Interlude: Fakeroot
It's finally time to talk about that mysterious try_fakeroot.sh
script
that showed up a few times earlier. It looks like this:
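The real script isn't reproduced here, but in spirit it's a small wrapper
like this sketch (interface assumed: a state file, then the command to run):

#!/bin/sh
# try_fakeroot.sh (sketch): run a command under fakeroot if it's available,
# keeping the simulated ownership database in a state file so that separate
# redo targets see consistent permissions; otherwise run the command as-is.
db="$1"; shift
if command -v fakeroot >/dev/null 2>&1; then
    if [ -s "$db" ]; then
        fakeroot -i "$db" -s "$db" -- "$@"
    else
        fakeroot -s "$db" -- "$@"
    fi
else
    "$@"
fi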
fakeroot is a tool, originally developed for the Debian project, that convinces your programs that they are running as root, without actually running them as root. This is mainly so that they can pretend to chown() files, without actually introducing security holes on the host operating system. Debian uses this when building packages: they compile the source, start fakeroot, install to a fakeroot directory, make a tarball of that directory, then exit fakeroot. The tarball then contains the permissions they want.
Normally, fakeroot forgets all its simulated file ownership and permissions
whenever it exits. However, it has -s
(save) and -i
(input) options for
saving the permissions to a file and reloading the permissions from that
file, respectively.
As we build our container layers, we need redo to continually enter
fakeroot, do some stuff, and exit it again. The try_fakeroot.sh
script is
a helper to make that easier.
Debootstrap
The next Debian tool we should look at is debootstrap. This handy program downloads and extracts the (supposedly) minimal packages necessary to build an operational Debian system in a chroot-ready subdirectory. Nice!
In order for debootstrap to work without being an administrator - and you should not run your build system as root - we'll use fakeroot to let it install all those packages.
Unfortunately, debootstrap is rather slow, for two reasons:
- It has to download a bunch of things.
- It has to install all those things.
And after debootstrap has run, all we have is a Debian system, which by itself isn't a very interesting container. (You usually want your container to have an app so it does something specific.)
Does this sound familiar? It sounds like a perfect candidate for Docker layers. Let's make three layers:
- Download the packages.
- Install the packages.
- Install an app.
Here's step one:
On top of that layer, we run the install process:
Since both steps run debootstrap and we might want to customize the set of packages to download+install, we'll put the debootstrap options in their own shared file:
And finally, we'll produce our "application" layer, which in this case is just a shell script that counts then number of installed Debian packages:
Building the Debian container
Now that we have the three filesystems, let's actually generate the Docker layers. But with a catch: we won't actually include the layer for step 1, since all those package files will never be needed again. (Similarly, if we were installing a compiler - and perhaps redo! - in the container so we could build our application in a controlled environment, we might want to omit the "install compiler" layers from the final product.)
So we list just two layers:
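As with simple.image.layers, this is presumably just the layer names,
bottom-most first:

$ cat debian.image.layers
debootstrap
debian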
And the 'debian' layer's diffbase is debootstrap, so we don't include the
same files twice:
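$ cat debian.diffbase
debootstrap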
Running the Debian container
This part is easy. All the parts are already in place. We'll just run
the existing default.rundocker.do:
$ time redo debian.rundocker
redo debian.rundocker
redo debian.load
redo debian.image
redo debian.list.sha256
redo debian.list
redo debian.layer
12 blocks
layer: a542b5976e1329b7664d79041d982ec3d9f7949daddd73357fde17465891d51d debootstrap
layer: d5ded4835f8636fcf01f6ccad32125aaa1fe9e1827f48f64215b14066a50b9a7 debian
real 0m7.313s
user 0m0.632s
sys 0m0.300s
$ cat debian.rundocker
82
It works! Apparently there are 82 Debian packages installed. It took 7.3 seconds to load and run the docker image though, probably because it had to transfer the full contents of those 82 packages over a socket to the docker server (presumably for security reasons) rather than just reading the files straight from disk. Luckily, our chroot and kvm scripts also still work:
$ time redo debian.runlocal
redo debian.runlocal
real 0m0.084s
user 0m0.052s
sys 0m0.004s
$ cat debian.runlocal
82
$ time redo debian.runkvm
redo debian.runkvm
redo debian.initrd
193690 blocks
1 block
debian: kvm memory required: 346M
[ 0.375365] reboot: Power down
ok.
real 0m3.445s
user 0m1.008s
sys 0m0.644s
$ cat debian.runkvm
82
Testing and housekeeping
Let's finish up by providing the usual boilerplate. First, an all.do
that
builds, runs, and tests all the images on all the container platforms.
This isn't a production build system, it's a subdirectory of the redo
package, so we'll skip softly, with a warning, if any of the components are
missing or nonfunctional. If you were doing this in a "real" system, you
could just let it abort when something is missing.
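A sketch of all.do (the real script is more thorough about detecting missing
or nonfunctional pieces like kvm, unshare, fakeroot, and debootstrap; this
sketch only checks for docker as an example of skipping softly):

# all.do (sketch): build and run the containers on every platform we can.
targets="libs.runlocal libs.runkvm debian.runlocal debian.runkvm"
if ./need.sh docker >/dev/null 2>&1; then
    targets="$targets simple.rundocker debian.rundocker"
else
    echo "all: docker not available; skipping the docker tests." >&2
fi
redo-ifchange $targets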
And here's a redo clean
script that gets rid of (most of) the files
produced by the build. We say "most of" the files, because actually we
intentionally don't delete the debdownload and debootstrap directories.
Those take a really long time to build, and redo knows to rebuild them if
their dependencies (or .do files) change anyway. So instead of throwing
away their content on 'redo clean', we'll keep it around.
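A sketch of clean.do (the exact list of patterns is an assumption):

# clean.do (sketch): remove build outputs, but deliberately keep the
# expensive debdownload and debootstrap trees (see xclean below).
rm -rf simple libs debian
rm -f *.fs *.list *.sha256 *.layer *.initrd *.image *.gz \
      *.runlocal *.runkvm *.rundocker *.load *.fakedb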
Still, we want a script that properly cleans up everything, so let's have
redo xclean
(short for "extra clean") wipe out the last remaining
files:
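A sketch of xclean.do:

# xclean.do (sketch): like clean, but also throw away the slow-to-rebuild
# debootstrap output.
redo clean
rm -rf debdownload debootstrap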