I’m compiling interesting public domain books from Project Gutenberg into PDFs using LaTeX. This is mainly for me to improve my LaTeX skills while expanding my knowledge of literature, language, history, and culture. It can be time-consuming, but it’s well worth it and, in my opinion, very satisfying! I’ve compiled the following books so far:

Getting eBooks

The main Project Gutenberg website is “intended for human users only”. However, there are exceptions, as outlined here, that permit automated downloads using wget. Alternatively, one can use a Project Gutenberg mirror site.

Humans can access the text file for an eBook using the URL: ${domain}/files/${ebooknumber}/${ebooknumber}${encoding}.txt, where:

  • ${domain} is https://www.gutenberg.org
  • ${ebooknumber} is the eBook number
  • ${encoding} is the suffix for the text file’s character set encoding; e.g. -0 for UTF-8, -8 for ISO-8859-1; omitted for ASCII files
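The URL format above can be sketched as a small shell function. The function name gutenberg_url is hypothetical (it is not part of any Project Gutenberg tooling), and the eBook number 12345 is only an illustrative value:

```shell
# Build the text-file URL for an eBook on the main site.
# $1: eBook number
# $2: encoding suffix ("-0" for UTF-8, "-8" for ISO-8859-1, empty for ASCII)
gutenberg_url() {
  local domain="https://www.gutenberg.org"
  printf '%s/files/%s/%s%s.txt\n' "$domain" "$1" "$1" "${2:-}"
}

gutenberg_url 12345 -0
# prints https://www.gutenberg.org/files/12345/12345-0.txt
```

The function only prints the URL; it does not download anything, so it can be combined with whichever download method is appropriate.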

If using a mirror site, eBooks can be obtained through the following URL formats:

  • for a 5-digit eBook number; e.g. 12345: ${domain}/1/2/3/4/12345/12345${encoding}.txt
  • for a 1-digit eBook number; e.g. 9: ${domain}/0/9/9${encoding}.txt

${domain} here refers to the mirror’s domain. See the list of mirror sites here.
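The mirror path can be derived from the eBook number with a short shell function. This is only a sketch: the name mirror_path is hypothetical, and it assumes the usual mirror layout in which every digit except the last becomes a directory level, followed by a directory named after the eBook number itself (single-digit numbers go under 0). Prepend the mirror’s domain to the result.

```shell
# Print the path of an eBook's text file relative to a mirror's root.
# $1: eBook number
# $2: encoding suffix ("-0", "-8", or empty for ASCII)
# Assumed layout: 12345 -> 1/2/3/4/12345/12345-0.txt, 9 -> 0/9/9-0.txt
mirror_path() {
  local number="$1"
  local suffix="${2:-}"
  local dir=""
  local rest=""
  if [ "${#number}" -eq 1 ]; then
    dir="0"
  else
    rest="${number%?}"    # every digit except the last
    while [ -n "$rest" ]; do
      # append the first remaining digit as a directory level
      dir="${dir}${dir:+/}${rest%"${rest#?}"}"
      rest="${rest#?}"
    done
  fi
  printf '%s/%s/%s%s.txt\n' "$dir" "$number" "$number" "$suffix"
}

mirror_path 12345 -0
# prints 1/2/3/4/12345/12345-0.txt
```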

Converting eBooks to LaTeX

For eBook text files with a character set encoding other than UTF-8, I first convert them to UTF-8 using iconv. For example, to convert ISO-8859-1 to UTF-8:

iconv -f ISO-8859-1 -t UTF-8 ${ebooknumber}-8.txt -o ${ebooknumber}-0.txt

The text files are then converted to .tex, mostly by find and replace; e.g. each odd occurrence of _, which is used to format italics, is replaced with \textit{ and each even occurrence with }. This can be achieved using sed. This is the current sed code, which could be improved in the future:

sed ':a;N;$!ba;s|_\([^_]*\)_|\\textit{\1}|g' ${ebooknumber}-0.txt > ${ebooknumber}.tex

sed -i -e ':a;N;$!ba;s|"\([^"]*\)"|``\1'"''"'|g' -e 's|#|\\#|g' -e 's|&|\\&|g' ${ebooknumber}.tex

Other forms of modification are done manually.
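To see what the sed substitutions do, they can be tried on a small sample. The sketch below wraps the same expressions in a hypothetical function, to_tex, that reads from standard input; it assumes GNU sed. The sample text is invented for illustration and deliberately spans two lines, since the ':a;N;$!ba' loop exists to join all lines so that italics and quotations can cross line breaks (it has no effect on single-line input).

```shell
# Apply the italics, quotation mark, # and & substitutions to stdin.
# Assumes GNU sed; to_tex is a hypothetical helper name.
to_tex() {
  sed -e ':a;N;$!ba' \
      -e 's|_\([^_]*\)_|\\textit{\1}|g' \
      -e 's|"\([^"]*\)"|``\1'"''"'|g' \
      -e 's|#|\\#|g' \
      -e 's|&|\\&|g'
}

printf '%s\n' 'She said, "It is _very' 'late_, Tom & Co. #1."' | to_tex
# prints:
# She said, ``It is \textit{very
# late}, Tom \& Co. \#1.''
```

Other special characters, such as % and $, are not handled by these expressions and would still need attention.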

LaTeX requirements and compilation

All required packages are available on CTAN. It is recommended to use a TeX distribution, such as TeX Live, to ensure all requirements are satisfied. The book document class is used.

The PDF files are built using either XeLaTeX or LuaLaTeX via Arara:

cd books
for dir in */; do
  # run in a subshell so a failed build cannot leave the loop in the wrong directory
  (cd "${dir%/}" && arara "${dir%/}.tex")
done

The Arara directives used are as follows (replace xelatex with lualatex, if necessary):

% arara: xelatex
% arara: xelatex

Note that Arara requires a Java virtual machine.

An alternative is to use the latest TeX Live Docker image by Island of TeX, which can also be used with GitLab CI. The following is a minimal example of a valid .gitlab-ci.yml configuration:

image: registry.gitlab.com/islandoftex/images/texlive:latest

build:
  before_script:
    - cd books
  script:
    - for dir in */;
      do cd ${dir%%/} &&
      arara ${dir%%/}.tex &&
      cd ..;
      done
  artifacts:
    paths:
      - "**/*.pdf"

The resulting PDF files will be available as artefacts once the build is complete. See this TUGboat article for more information.

To-do

  • fix underfull / overfull boxes
  • miscellaneous formatting
  • compile more books (suggestions are welcome!)

Licence
