How to Produce a Beautiful Book from the Command Line

Book Production Framework and Examples on GitHub

Introduction

Over the last couple of years, a number of people have asked me how I produce my books.  Most self-published (excuse me, ‘indie-published’) books have an amateurish quality that is easy to spot, and the lack of attention to detail detracts from the reading experience.  Skimping on cover art can be a culprit, but it rarely bears sole blame — or even the majority of it.   Indie-published interiors often are sloppy, even in books with well-designed covers.  For some reason, many authors give scant attention to the interior layout of their books.  Of course, professional publishers know better.    People judge books not just by their covers, but by their interiors as well.  If the visual appeal of your book does not concern you, then read no further.  Your audience most likely will not.

Producing a visually-pleasing book is not an insurmountable problem for the indie-publisher, nor a particularly difficult one.  It just requires a bit of attention.  Even subject to the constraints of print-on-demand publishing, it is quite possible to produce beautiful looking books.  Ebooks prove more challenging because one has less control over them (due to the need for reflowable text), but it is possible to do as well as the major publishers by using some of their tricks.  Moreover, all this can be accomplished from the command-line and without the use of proprietary software.

Now that I’ve finished my fifth book of fiction (and second novel), I figure it’s a good time to describe how I produce my books. I have automated almost the entire process of book and ebook production from the command-line. My process uses only free, open-source software that is well-established, well-documented, and well-maintained.

Though I use Linux, the same toolchain could be employed on a Mac or Windows box with a tiny bit of adaptation. To my knowledge, all the tools I use (or obvious counterparts) are available on both those platforms. In fact, macOS is built on a flavor of Unix, and the tools can be installed via Homebrew or other methods. Windows now has the Windows Subsystem for Linux (WSL), which provides comparable command-line access.

I have made available a full implementation of the system for both novels and collections of poetry, stories, or flash-fiction. Though I discuss some general aspects below, most of the nitty-gritty appears in the GitHub project’s README file and in the in-code documentation. The code is easily adaptable, and you should not feel constrained by the design choices I made. The framework is intended as a proof of concept (though I use it regularly myself), and should serve as a point of departure for your own variant. If you encounter any bugs or have any questions, I encourage you to get in touch. I will do my best to address them in a timely fashion.

Examples

First, let’s see some examples of output (unfortunately, WordPress does not allow epub uploads, but you can generate epubs from the repository and view them in something like Sigil). The novel and collection PDFs are best viewed in dual-page mode since they have a notion of recto and verso pages.

Who would be interested in this

If you’re interested in producing a fiction book from the command-line, it is fair to assume that (1) you’re an author or aspiring author and (2) you’re at least somewhat conversant with shell and some simple scripting. For scripting, I use Python 3, but Perl, Ruby, or any comparable language would work. Even shell scripting could be used.

At the time of this writing, I have produced a total of six books (five fiction books and one mathematical monograph) and have helped friends produce several more. All the physical versions were printed through Ingram, and the ebook versions were distributed on Amazon. Ingram is a major distributor as well, so the print versions also are sold through Amazon and Barnes & Noble, and can be ordered through other bookstores. In the past I used Smashwords to port and distribute the ebook through other platforms (Kobo, Barnes & Noble, etc.), but frankly there isn’t much point these days unless someone (e.g., BookBub) demands it. We’re thankfully past the point where most agents and editors demand Word docs (though a few still do), but producing one for the purpose of submission is possible with a little adaptation using pandoc and docx templates. However, most people accept PDFs these days.

My books so far include two novels, three collections of poetry & flash-fiction, and a mathematical monograph.  I have three other fiction books in the immediate pipeline (another collection of flash-fiction, a short story collection, and a fantasy novel), and several others in various stages of writing.  I do not say this to toot my own horn, but to make clear that the method I describe is not speculative.  It is my active practice.

The main point of this post is to demonstrate that it is quite possible to produce a beautiful literary book using command-line, open-source tools in a reproducible way. The main point of the GitHub project is to show you precisely how to do so. In fact, not only can you produce a lovely book that way, but I would argue it is the best way to go about it! This is true whether your book is a novel or a collection of works.

One reason why such a demonstration is necessary is the dearth of online examples. There are plenty of coding and computer-science books produced from markdown via pandoc. There are plenty of gorgeous mathematics books produced using LaTeX. But there are very few examples in the literary realm, despite the typesetting power of LaTeX and the presence of the phenomenal memoir LaTeX class for precisely this purpose. This post is intended to fill that gap.

A couple of caveats

Lest I oversell, here are a couple of caveats.

  • When I speak of an automated build process, I mean for the interiors of books. I hire artists to produce the covers. Though I have toyed with creating covers from the command line in the past (and it is quite doable), there are reasons to prefer professional help. First, it allows artistic integration of other cover elements such as the title and author name. Three of my books exhibit such integration, but I added those elements myself for the rest (mainly because I lacked the prescience to request them when I commissioned the art early on). I’ll let you guess which look better. The second big reason to use a professional artist comes down to appeal. The book cover is the first thing to grab a potential reader’s eye, and it can make the sale. It also is a key determinant in whether your book looks amateurish or professional. I am no expert on cover design, and am far from skilled as an artist. A professional is much more likely to create an appealing cover. Of course, plenty of professionals do schlocky work, and I strongly advise putting in the effort and money to find a quality freelancer. In my experience, it should cost anywhere from $300 to $800 in today’s dollars. I’ve paid more and gotten less, and I’ve paid less and gotten more. My best experiences were with artists who did not specialize in cover design.
  • The framework I provide on GitHub is intended as a guide, not as pristine code for commercial use. I am not a master of any of the tools involved. I learned them to the extent necessary and no more. I make no representation that my code is elegant, and I wouldn’t be surprised if you could find better and simpler ways to accomplish the same things. This should encourage rather than discourage you from exploring my code. If I can do it, so can you. All you need is basic comfort with the command line and some form of scripting. All the rest can be learned easily. I did not have to spend hundreds of hours learning Python, make, pandoc, and so on. I learned the basics, and googled whatever issues arose. It was quite feasible, and took a tiny fraction of the time involved in writing a novel.

The benefits of a command-line approach

If you’ve come this far, I expect that listing the benefits of a command-line approach is unnecessary. They are roughly the same as for any software project: stability, reproducibility, recovery, and easy maintenance. Source files are plain text, and we can bring to bear a huge suite of relevant tools.

A suggestion vis-a-vis code reuse

One suggestion: resist the urge to unify code. Centralizing scripts to avoid code duplication, or creating a single “universal” script for all your books, may be an enticing proposition. I am sorely tempted to do so whenever I start a new project. My experience is that this wastes more time than it saves. Each project has unforeseeable idiosyncrasies which require adaptation, and changing centralized or universal scripts risks breaking backward compatibility with other projects. When each book stands on its own, reproducibility is much easier, and we are free to customize the build process for a new book without fear of unexpected consequences. It also is easier to encapsulate the complete project for timestamping and other purposes. It’s never pleasant to discover that a backup of your project is missing some dependency that you forgot to include.

A typical author produces new books or revises old ones infrequently. The ratio of time spent maintaining the publication machinery to time spent writing and editing is relatively small. On average, it takes me around 500 hours to write and edit a 100,000-word novel, and around 100 hours for a 100-page collection of flash-fiction and poetry. Adapting the framework from my last book typically takes only a few hours, much of which is spent on adjustments to the cover art.

Even if porting the last book’s framework isn’t that time consuming, why trouble with it at all? Why not centralize common code? The problem is that this produces a dependency on code outside the project. If we change the relevant library or script, then we must worry about the reproducibility of all past books which depend on it. This is a headache.

Under other circumstances, my advice would be different. For example, a small press using this machinery to produce hundreds of books may benefit from code unification. The improved maintainability and time savings from centralization would be significant. In that case, backward-compatibility issues would be dealt with in the same manner as for software: through regression tests. These could be strict (MD5 checksums of the built files) or soft (textual heuristics), depending on the toolchain and how precise the reproducibility must be. For example, a non-visual change such as an embedded date would alter the hash but not the extracted text. The point is that this is doable, but it would require stricter coding standards and carefully considered change metrics.
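
To make this concrete, here is a minimal sketch of what such a regression check might look like in Python. The file names are placeholders, and a real press would likely want something more forgiving than exact string equality on the extracted text.

    # regression_check.py -- sketch of the "strict" vs "soft" checks described above.
    # Paths are hypothetical; point them at your own build output and reference copy.
    import hashlib
    import subprocess

    def md5_of(path):
        """Strict check: any byte-level change (even an embedded date) alters this."""
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()

    def extracted_text(path):
        """Soft check: compare only the text content, via poppler's pdftotext."""
        return subprocess.run(["pdftotext", path, "-"],
                              check=True, capture_output=True, text=True).stdout

    def unchanged(new_pdf, reference_pdf, strict=False):
        if strict:
            return md5_of(new_pdf) == md5_of(reference_pdf)
        return extracted_text(new_pdf) == extracted_text(reference_pdf)

    if __name__ == "__main__":
        ok = unchanged("build/book.pdf", "golden/book.pdf")
        print("reproducible" if ok else "output drifted")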

The other reason to avoid code reuse is the need for flexibility. Unanticipated issues may arise with new projects (e.g., unusually formatted poems), and your stylistic taste may change as well. You also may just want to mix things up a bit, so all your books don’t look the same. Copying the framework to a new book is something you would do a few times a year at most, and probably far less.

Again, if the situation is different, my advice will be too. For example, a publisher producing books which vary only in a known set of layout parameters may benefit from a unified framework. Even in this case, it would be wise to wait until a number of books have been published to see which elements need to be unified and which parameters vary from book to book.

Tools

Here is a list of some tools I use. Most appear in the project but others serve more of a support function.

Core tools

  • pandoc: This is used to convert from markdown to epub and LaTeX. It is an extremely powerful conversion tool written in Haskell. It often requires some configuration to get things to work as desired, but it can do most of what we want.  And no, you do not need to know Haskell to use it.
  • make: The entire process is governed by a plain old Makefile. This allows complete reproducibility.
  • pdfLaTeX: The interior of the print book is compiled from LaTeX into a pdf file via pdfLaTeX. LaTeX affords us a great way to achieve near-total control over the layout. You need not know much LaTeX unless extensive changes to the interior layout are desired. The markdown source text is converted via pandoc to LaTeX through templates, and these templates contain the relevant layout information (a minimal sketch of the whole chain appears after this list).
  • memoir LaTeX class: This is the LaTeX class I use for everything. It is highly customizable, relatively easy to use, and ideally suited to book production. It has been around for a long time, is well-maintained, has a fantastic (albeit long) manual, and boasts a large user community. As with LaTeX, you need not learn its details unless customization of the book layout is desired.  Most simple things will be obvious from the templates I provide.
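
To make the chain concrete, here is a minimal sketch of the markdown-to-book pipeline written as a Python driver. In the actual framework this is orchestrated by the Makefile, and the file names, template names, and pandoc options below are placeholders rather than the project’s real configuration.

    # build.py -- sketch of the markdown -> LaTeX -> PDF (and markdown -> epub) chain.
    # Placeholders throughout; the real targets and flags live in the GitHub project.
    import subprocess

    CHAPTERS = ["chapters/01.md", "chapters/02.md"]   # plain-text markdown sources

    def run(cmd):
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. pandoc stitches the markdown into LaTeX using a memoir-based template.
    run(["pandoc", *CHAPTERS, "--template", "templates/book.tex", "-o", "build/book.tex"])

    # 2. pdfLaTeX compiles the interior; run it twice so page references settle.
    for _ in range(2):
        run(["pdflatex", "-output-directory", "build", "build/book.tex"])

    # 3. pandoc produces the epub directly from the same markdown sources.
    run(["pandoc", *CHAPTERS, "--css", "templates/epub.css", "-o", "build/book.epub"])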

Essential programs, though comparables can be swapped in

  • python3: I write my scripts in Python 3, but any comparable scripting language will do.
  • aspell: This is the command-line spell-checker I use, but any other will do too. It helps if it has a markdown-recognition mode.
  • emacs: I use this as my text editor, but vim or any other text editor will do just fine. As long as it can output plain text files (ascii or unicode, though I personally stick to ascii), you are fine. I also use emacs org-mode for the organizational aspects of the project. One tweak I found very useful is to have the editor highlight anything in quotes. This makes dialogue much easier to parse when editing.
  • pdftools (poppler-utils): Useful tools for splitting out pages of pdfs, etc. Used for ebook production. I use the pdfseparate utility, which allows extraction of a single page from a PDF file. Any comparable utility will work.

Useful programs, but not essential

  • git: I use this for version control. Strictly speaking, version control isn’t needed. However, I highly recommend it. From a development standpoint, I treat writing as I do a software project. This has served me well. Any comparable tool (such as Mercurial) is fine too. Note that the needs of an author are relatively rudimentary. You probably won’t need branching or merging or rebasing or remote repos. Just “git init”, “git commit -a”, “git status”, “git log”, “git diff”, and maybe “git checkout” if you need access to an old version.
  • wdiff, colordiff: I find word-level diffs (wdiff) and colorized diffs (colordiff) very useful for highlighting changes.
  • imagemagick: I use the “convert” tool for generating small images from the cover art. These can be used for the ebook cover or for advertising inserts in other books. “identify” also can be useful when examining image files. A small sketch of these steps appears after this list.
  • pdftk (free version): Useful tools for producing booklets, etc. I don’t use it in this workflow, but felt it was worth mentioning.
  • ebook-convert: Calibre command-line tool for conversion. Pandoc is far better than calibre for most conversions, in my experience. However, ebook-convert can produce mobi and certain other ebook formats more easily.
  • sigil: This is the only non-command-line tool listed, but it is open-source. Before you scoff and stop reading, let me point out that this is the aforementioned “almost” when it comes to automation. However, it is a minor exception. Sigil is not used for any manual intervention or editing. I simply load the epub which pandoc produces into Sigil, click an option to generate the TOC, and then save it. The reason for this little ritual is that Amazon balks at the pandoc-produced TOC for some reason, but seems ok with Sigil’s. It is the same step for every ebook, and it literally takes a minute. Unfortunately, Sigil offers no command-line interface, and there is no other tool (to my knowledge) to do this. Sigil also is useful for visually examining the epub output if you wish. I find that it gives the most accurate rendering of epubs.
  • eog: I use this for viewing images, though any image viewer will do. It may be necessary to scale and crop (and perhaps color-adjust) images for use as book covers or interior images. ImageMagick’s “identify” and “convert” commands are very useful for such adjustments, and eog lets me see the results.
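
As an illustration, the image-related steps above might be scripted along the following lines. The source file name and target sizes are placeholders; check your distributor’s cover specifications before settling on dimensions.

    # covers.py -- sketch of the ImageMagick steps mentioned in the list above.
    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # Report the dimensions and format of the full-resolution cover art.
    run(["identify", "cover/cover_full.tif"])

    # Downscale a copy for the ebook cover (epubs don't need print resolution).
    run(["convert", "cover/cover_full.tif", "-resize", "1600x2560", "cover/cover_ebook.jpg"])

    # A small thumbnail for advertising inserts in other books.
    run(["convert", "cover/cover_full.tif", "-resize", "300x480", "cover/cover_thumb.jpg"])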

How I write

All my files are plain text. I stick to ascii, but these days unicode is fine too. However, rich text is not. Things like italics and boldface are accomplished through markdown.

Originally, I wrote most of my pieces (poems, chapters, stories) in LaTeX, and had scripts which stitched them together into a book or produced them individually for drafts or submissions to magazines. These days, I do everything in markdown  — and a very simple form of markdown at that.

Why not just stick with LaTeX for the source files? It requires too much overhead and gets in the way. For mathematical writing, this overhead is a small price to pay, and the formatting is inextricably tied to the text. But for most fiction and poetry, it is not.

I adhere to the belief that separating format and content is a wise idea, and this has been borne out by my experience. Some inline formatting is inescapable (bold, italics, etc.), and markdown is quite capable of accommodating this. On the rare occasions when more is needed (e.g., a specially formatted poem), the markdown can be augmented with html or LaTeX directly as desired. Pandoc can handle all this and more. It is a very powerful program.

I still leave the heavy formatting (page layout, headers, footers, etc) to LaTeX, but it is concentrated in a few templates, rather than the text source files themselves.

There also is another reason to prefer markdown. From markdown, I can more easily generate epubs or other formats. Doing so from LaTeX is possible, but more trouble than it’s worth (I say this from experience).

What all this means is that I can focus on writing. I produce clear, concise ascii files with minimal format information, and let my scripts build the book from these.

To see a concrete example, as well as all the scripts involved, check out the framework on GitHub.

Book Production Framework and Examples on GitHub

Be Careful Interpreting Covid-19 Rapid Home Test Results

Now that Covid-19 rapid home tests are widely available, it is important to consider how to interpret their results. In particular, I’m going to address two common misconceptions.

To keep things grounded, let’s use some actual data. We’ll assume a false positive rate of 1% and a false negative rate of 35%. These numbers are consistent with a March 2021 metastudy [1]. We’ll denote the false positive rate E_p=0.01 and the false negative rate E_n=0.35.

It may be tempting to assume from these numbers that a positive rapid covid test result means you’re 99% likely to be infected, and a negative result means you’re 65% likely not to be. Neither need be the case. In particular,

  1. A positive result does increase the probability you have Covid, but by how much depends on your prior. This in turn depends on how you are using the test. Are you just randomly testing yourself, or do you have some strong reason to believe you may be infected?

  2. A negative result has little practical impact on the probability you have Covid.

These may seem counterintuitive or downright contradictory. Nonetheless, both are true. They follow from Bayes’ theorem.

Note that when I say that the test increases or decreases “the probability you have Covid,” I refer to knowledge not fact. You either have or do not have Covid, and taking the test obviously does not change this fact. The test simply changes your knowledge of it.

Also note that the limitations on inference I will describe do not detract from the general utility of such tests. Used correctly, they can be extremely valuable. Moreover, from a behavioral standpoint, even a modest, non-negligible probability of being infected may be enough to motivate medical review, further testing, or self-quarantine.

Let’s denote by C the event of having Covid, and by T the event of testing positive for it. P(C) is the prior probability of having covid. It is your pre-test estimate based on everything you know. For convenience, we’ll often use \mu to denote P(C).

If you have no information and are conducting a random test, then it may be reasonable to use the general local infection rate as P(C). If you have reason to believe yourself infected, a higher rate (such as the fraction of symptomatic people who test positive in your area) may be more suitable. P(C) should reflect the best information you have prior to taking the test.

The test adds information to your prior P(C), updating it to a posterior probability of infection P(C|O), where O denotes the outcome of the test: either T or \neg T.

In our notation, P(\neg T|C)= E_n and P(T|\neg C)= E_p. These numbers are properties of the test, independent of the individuals being tested. For example, the manufacturer could test 1000 swabs known to be infected with covid from a petri dish, and E_n would be the number which tested negative divided by 1000. Similarly, they could test 1000 clean swabs, and E_p would be the number which tested positive divided by 1000.

What we care about are the posterior probabilities: (1) the probability P(C|T) that you are infected given that you tested positive, and (2) the probability P(\neg C|\neg T) that you are not infected given that you tested negative. That is, the probabilities that the test correctly reflects your infection status.

Bayes’ theorem tells us that P(A|B)= \frac{P(B|A)P(A)}{P(B)}, a direct consequence of the fact that P(A|B)P(B)= P(B|A)P(A)= P(A\cap B).

If you test positive, what is the probability you have Covid? P(C|T)= \frac{P(T|C)P(C)}{P(T|C)P(C)+P(T|\neg C)P(\neg C)}, which is \frac{(1-E_n)\mu}{(1-E_n)\mu+E_p(1-\mu)}. The prior probability of infection was \mu, so you have improved your knowledge by a factor of \frac{(1-E_n)}{(1-E_n)\mu+E_p(1-\mu)}. For \mu small relative to E_p, this is approximately \frac{1-E_n}{E_p}.

Suppose you randomly tested yourself in MA. According to data from Johns Hopkins [2], at the time of this writing there have been around 48,000 new cases reported in MA over the last 28 days. MA has a population of around 7,000,000. It is reasonable to assume that the actual case rate is twice that reported (in the early days of Covid, the unreported factor was much higher, but let’s assume it presently is only 1\times).

Let’s also assume that any given case tests positive for 14 days. I.e., 24,000 of those reported cases would test positive at any given time in the 4-week period (of course, not all fit neatly into the 28-day window, but if we assume similar rates before and after, this approach is fine). Including the unreported cases, we then have 48,000 active cases at any given time. We thus have a state-wide infection rate of \frac{48000}{7000000}\approx 0.00685, or about 0.7%. We will define \mu_{MA}\equiv 0.00685.

Using this prior, a positive test means you are 45\times more likely to be infected post-test than pre-test. This seems significant! Unfortunately, the actual probability of infection is only P(C|T)\approx 0.31.

This may seem terribly counterintuitive. After all, the test had a 1% false positive rate. Shouldn’t you be 99% certain you have Covid if you test positive? Well, suppose a million people take the test. With a 0.00685 unconditional probability of infection, we expect 6850 of those people to be infected. Since E_n=0.35, only 65% of them, around 4453, will test positive.

However, even with a tiny false positive rate of E_p=0.01, 9932 people who are not infected also will test positive. The problem is that there are so many more uninfected people being tested that E_p=0.01 still generates lots of false positives. If you test positive, you could be among either the 9932 false positives or the 4453 true positives. Your probability of being infected is \frac{4453}{9932+4453}\approx 0.31.
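
If you want to verify these numbers yourself, a few lines of Python suffice (the variable names are mine; the inputs are just the rates and prior used above):

    # Check of the positive-test numbers above (E_p = 0.01, E_n = 0.35, mu_MA = 0.00685).
    E_p, E_n, mu = 0.01, 0.35, 0.00685

    # Directly via Bayes' theorem:
    p_infected_given_pos = (1 - E_n) * mu / ((1 - E_n) * mu + E_p * (1 - mu))
    print(p_infected_given_pos)          # ~0.31
    print(p_infected_given_pos / mu)     # ~45x the prior

    # Via the counting argument, with a million hypothetical test-takers:
    infected = 1_000_000 * mu                    # ~6850
    true_pos = infected * (1 - E_n)              # ~4453
    false_pos = (1_000_000 - infected) * E_p     # ~9932
    print(true_pos / (true_pos + false_pos))     # ~0.31 again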

Returning to the general case, suppose you test negative. What is the probability you do not have Covid? P(\neg C|\neg T)= \frac{P(\neg T|\neg C)P(\neg C)}{P(\neg T|\neg C)P(\neg C)+P(\neg T|C)P(C)}= \frac{(1-E_p)(1-\mu)}{(1-E_p)(1-\mu)+E_n\mu}. For small \mu this is approximately 1 unless E_p is very close to 1. Specifically, it expands to 1-\frac{E_n}{(1-E_p)}\mu+O(\mu^2).

With \mu_{MA} as the prior, the probability of being uninfected post-test is 0.99757 vs 0.99315 pre-test. For all practical purposes, our knowledge has not improved.
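
The negative-test numbers can be checked the same way:

    # Check of the negative-test numbers above.
    E_p, E_n, mu = 0.01, 0.35, 0.00685

    p_clear_given_neg = (1 - E_p) * (1 - mu) / ((1 - E_p) * (1 - mu) + E_n * mu)
    print(p_clear_given_neg)   # ~0.99757, versus the pre-test value 1 - mu = 0.99315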

This too may seem counterintuitive. As an analogy, suppose in some fictional land earthquakes are very rare. Half of them are preceded by a strong tremor the day before (and such a tremor always heralds a coming earthquake), but the other half are unheralded.

If you feel a strong tremor, then you know with certainty that an earthquake is coming the next day. Suppose you don’t feel a strong tremor. Does that mean you should be more confident that an earthquake won’t hit the next day? Not really. Your chance of an earthquake has not decreased by a factor of two. Earthquakes were very rare to begin with, so the default prediction that there wouldn’t be one is only marginally changed by the absence of a tremor the day before.

Of course, \mu_{MA} generally is not the correct prior to use. If you take the test randomly or for no particular reason, then your local version of \mu_{MA} may be suitable. However, if you have a reason to take the test then your \mu is likely to be much higher.

Graphs 1 and 2 below illustrate the information introduced by a positive or negative test result as a function of the choice of prior. In each, the difference in probability is the distance between the posterior and prior graphs. The prior obviously is a straight line since we are plotting it against itself (as the x-axis). Note that graph 1 has an abbreviated x-axis because P(C|T) plateaus quickly.
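
For readers who want to regenerate curves like these, here is a rough sketch using matplotlib. The plotting details are my own choices, not an exact reproduction of the graphs.

    # Posterior vs. prior for positive and negative results (matplotlib assumed).
    import numpy as np
    import matplotlib.pyplot as plt

    E_p, E_n = 0.01, 0.35
    mu = np.linspace(0.0, 1.0, 500)   # the prior, plotted on the x-axis

    post_pos = (1 - E_n) * mu / ((1 - E_n) * mu + E_p * (1 - mu))          # P(C|T)
    post_neg = (1 - E_p) * (1 - mu) / ((1 - E_p) * (1 - mu) + E_n * mu)    # P(not C | not T)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(mu, post_pos, label="posterior P(C|T)")
    ax1.plot(mu, mu, "--", label="prior P(C)")
    ax1.set_xlim(0, 0.2)              # abbreviated x-axis: the curve plateaus quickly
    ax1.set_xlabel("prior")
    ax1.legend()

    ax2.plot(mu, post_neg, label="posterior P(not C | not T)")
    ax2.plot(mu, 1 - mu, "--", label="pre-test P(not C)")
    ax2.set_xlabel("prior")
    ax2.legend()
    plt.show()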

From graph 1, it is clear that except for small priors (such as the general infection rate in an area with very low incidence), a positive result adds a lot of information. For \mu>0.05 the posterior already exceeds 0.75, and it climbs toward certainty as the prior grows.

From graph 2, we see that a negative result never adds terribly much information. When the prior is 1 or 0, we already know the answer, and the Bayesian update does nothing. The largest gain is a little over 0.2, but that’s only attained when the prior is quite high. In fact, there’s not much improvement at all until the prior is over 0.1. If you’re 10% sure you already have covid, a home test will help but you probably should see a doctor anyway.

Note that these considerations are less applicable to PCR tests, which can have sufficiently small E_p and E_n to result in near-perfect information for any realistic prior.

One last point should be addressed. How can tests with manufacturer-specific false positive and false negative rates depend on your initial guess at your infection probability? If you pick an unconditional local infection rate as your prior, how could they depend on the choice of locale (such as MA in our example)? That seems to make no sense. What if we use a smaller locale or a bigger one?

The answer is that the outcome of the test does not depend on such things. It is a chemical test being performed on a particular sample from a particular person. Like any other experiment, it yields a piece of data. The difference arises in what use we make of that data. Bayesian probability tells us how to incorporate the information into our previous knowledge, converting a prior to a posterior. This depends on that knowledge — i.e. the prior. How we interpret the result depends on our assumptions.

A couple of caveats to our analysis:

  1. The irrelevance of a negative result only applies when you have no prior information other than some (low) general infection rate. If you do have symptoms or have recently been exposed or have any other reason to employ a higher prior probability of infection, then a negative result can convey significantly more information. Our dismissal of its worth was contingent on a very low prior.

  2. Even in the presence of a very low prior probability of infection, general testing of students or other individuals is not without value. Our discussion applies only to the interpretation of an individual test result. In aggregate, the use of such tests still would produce a reasonable amount of information. Even if only a few positive cases are caught as a result and the overall exposure rate is lowered only a little, the effect can be substantial. Pathogen propagation is a highly nonlinear process, and a small change in one of the parameters can have a very large effect. One caution, however: if the results aren’t understood for what they are, overconfidence can result, and the aggregate use of testing can have a substantial negative effect if it leads people to relax other precautions.

References:

[1] Dinnes et al., “Rapid, point-of-care antigen and molecular-based tests for diagnosis of SARS-CoV-2 infection.” Note that “specificity” refers to 1-E_p and “sensitivity” refers to 1-E_n; see Wikipedia for further details.

[2] Johns Hopkins Covid-19 Dashboard