R plots and LaTeX to pdf (side effects, multiple outputs)

A LaTeX typesetting example

LaTeX is a typesetting system that's especially popular in academia. Among other things, it lets you produce postscript and pdf files from a set of (mostly text) input files.

LaTeX documents often include images and charts. In our example, we'll show how to auto-generate a chart for inclusion using an R script with ggplot2.

To play with this code on your own machine, get the redo source code and look in the docs/cookbook/latex/ directory.

Generating a plot from an R script

First, let's tell redo how to generate our chart. We'll use the R language, and ask it to plot some of its sample data (the mpg, "miles per gallon" data set) and save it to an eps (encapsulated postscript) file. eps files are usually a good format for LaTeX embedded images, because they scale to any printer or display resolution.

First, let's make an R script that generates a plot:

And then a .do file to tie that into redo:

We can build and view the image:

$ redo mpg.eps 
redo  mpg.eps

# View the file on Linux
$ evince mpg.eps

# View the file on MacOS
$ open mpg.eps

Running the LaTeX processor

Here's the first draft of our very important scientific paper:

Notice how it refers to the chart from above, mpg.eps, and a text file, discovery.txt. Let's create the latter as a static file.

With all the parts of our document in places, we can now compile it directly using pdflatex:

$ pdflatex paper.latex 
This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=pdflatex)
 restricted \write18 enabled.
entering extended mode
...[a lot of unnecessary diagnostic messages]...
Output written on paper.pdf (2 pages, 68257 bytes).
Transcript written on paper.log.

But this has a few problems. First of all, it doesn't understand dependencies; if mpg.R changes, it won't know to rebuild mpg.eps. Secondly, the TeX/LaTeX toolchain has an idiosyncracy that means you might have to rebuild your document more than once. In our example, we generate a table of contents, but it ends up getting generated before processing the rest of the content in the document, so it's initially blank. As it continues, LaTeX produces a file called paper.aux with a list of the references needed by the table of contents, and their page numbers. If we run LaTeX over again, it'll use that to build a proper of table of contents.

Of course, life is not necessarily so easy. Once the table of contents isn't blank, it might start to push content onto the next page. This will change all the page numbers! So we'd have to do it one more time. And that might lead to even more subtle problems, like a reference to page 99 changing to page 100, which pushes a word onto the next page, which changes some other page number, and so on. Thus, we need a script that will keep looping, re-running LaTeX until paper.aux stabilizes.

The whole script we'll use is below. Instead of running pdflatex directly, we'll use the regular latex command, which produces a .dvi (DeVice Independent) intermediate file which we can later turn into a pdf or ps file.

LaTeX produces a bunch of clutter files (like paper.aux) that can be used in future runs, but which also make its execution nondeterministic. To avoid that problem, we tell it to use a temporary --output-directory that we delete and recreate before each build (although we might need to run latex multiple times in one build, to get paper.aux to converge).

Virtual targets, side effects, and multiple outputs

Why did we call our script default.runtex.do? Why not default.pdf.do or default.dvi.do, depending what kind of file we ask LaTeX to produce?

The problem is that the latex command actually produces several files in that temporary directory, and we might want to keep them around. If we name our .do file after only one of those outputs, things get messy.

The biggest problem is that redo requires a .do file to write its output to $3 (or stdout), so that it can guarantee the output gets replaced atomically. When there is more than one output, at most one file can be sent to $3; how do you choose which one? Even worse, some programs don't even have the ability to choose the output filename; for an input of paper.latex, the latex command just writes a bunch of files named paper.* directly. You can't ask it to put just one of them in $3.

The easiest way to handle this situation in redo is to use a "virtual target", which is a target name that doesn't actually get created has a file, and has only side effects. You've seen these before: when we use all.do or clean.do, we don't expect to produce a file named all or clean. We expect redo to run a collection of other commands. In make, these are sometimes called ".PHONY rules" because of the way they are declared in a Makefile. But the rules aren't phony, they really are executed; they just don't produce output. So in redo we call them "virtual."

When we redo paper.runtex, it builds our virtual target. There is no paper.runtex file or directory generated. But as a side effect, a directory named paper.tmp is created.

(Side note: it's tempting to name the directory the same as the target. So we could have a paper.runtex directory instead of paper.tmp. This is not inherently a bad idea, but currently redo behaviour is undefined if you redo-ifchange a directory. Directories are weird. If one file in that directory disappears, does that mean you "modified" the output by hand? What if two redo targets modify the same directory? Should we require scripts to only atomically replace an entire output directory via $3? And so on. We might carefully define this behaviour eventually, but for now, it's better to use a separate directory name and avoid the undefined behaviour.)

Depending on side effects produced by virtual targets

Next, we want to produce .pdf and .ps files from the collection of files produced by the latex command, particularly paper.tmp/paper.dvi. To do that, we have to bring our files back from the "virtual target" world into the real world.

Depending on virtual targets is easy; we'll just redo-ifchange paper.runtex. Then we want to materialize paper.dvi from the temporary files in paper.tmp/paper.dvi, which we can do with an efficient hardlink (rather than making an unnecessary copy), like this:

Notice that we don't do redo-ifchange paper.tmp/paper.dvi. That's because redo has no knowledge of that file. If you ask redo to build that file for you, it doesn't know how to do it. You have to ask for paper.runtex, which you know - but redo doesn't know - will produce the input file you want. Then you can safely use it.

Once we have a .do file that produces the "real" (non-virtual, non-side-effect) paper.dvi file, however, it's safe to depend directly on it. Let's use that to produce our .ps and .pdf outputs:

(As above, we include exec >&2 lines because LaTeX tools incorrectly write their log messages to stdout. We need to redirect it all to stderr. That way redo-log can handle all the log output appropriately.)

Explicit dependencies

We've made a generalized script, default.runtex.do, that can compile any .latex file and produce a .tmp directory with its output. But that's not quite enough: different .latex files might have extra dependencies that need to exist before the compilation can continue. In our case, we need the auto-generated mpg.eps that we discussed above.

To make that work, default.runtex.do looks for a .deps file with the same name as the .latex file being processed. It contains just a list of extra dependencies that need to be built. Here's ours:

You can use this same ".deps" technique in various different places in redo. For example, you could have a default.do that can link a C program from any set of .o files. To specify the right set of .o files for target X, default.do might look in an X.deps or X.list file. If you later want to get even fancier, you could make an X.deps.do that programmatically generates the list of dependencies; for example, it might include one set of files on win32 platforms and a different set on unix platforms.

Autodependencies

Our paper.latex file actually includes two files: mpg.eps, which we explicitly depended upon above, and discovery.txt, which we didn't. The latter is a static source file, so we can let redo discover it automatically, based on the set of files that LaTeX opens while it runs. The latex command has a --record option to do this; it produces a file called paper.tmp/paper.fls (.fls is short for "File LiSt").

One of redo's best features is that you can declare dependencies after you've done your build steps, when you have the best knowledge of which files were actually needed. That's why in default.runtex.do, we parse the .fls file and then redo-ifchange on its contents right at the end.

(This brings up a rather subtle point about how redo works. When you run redo-ifchange, redo adds to the list of files which, if they change, mean your target needs to be rebuilt. But unlike make, redo will not actually rebuild those files merely because they're listed as a dependency; it just knows to rebuild your target, which means to run your .do file, which will run redo-ifchange again if it still needs those input files to be fresh.

This avoids an annoying problem in make where you can teach it about which .h files your C program depended on last time, but if you change A.c to no longer include X.h, and then delete X.h, make might complain that X.h is missing, because A.c depended on it last time. redo will simply notice that since X.h is missing, A.c needs to be recompiled, and let your compilation .do script report an error, or not.)

Anyway, this feature catches not just our discovery.txt dependency, but also the implicit dependencies on various LaTeX template and font files, and so on. If any of those change, our LaTeX file needs to be rebuilt.

$ redo --no-detail paper.pdf
redo  paper.pdf
redo    paper.dvi
redo      paper.runtex
redo        mpg.eps

$ redo --no-detail paper.pdf
redo  paper.pdf

$ touch discovery.txt 

$ redo --no-detail paper.pdf
redo  paper.pdf
redo    paper.dvi
redo      paper.runtex

$ redo --no-detail paper.pdf
redo  paper.pdf

Housekeeping

As usual, to polish up our project, let's create an all.do and clean.do.

Because this project is included in the redo source and we don't want redo to fail to build just because you don't have LaTeX or R installed, we'll have all.do quit politely if the necessary tools are missing.