I’m compiling interesting public domain books from Project Gutenberg into PDFs using LaTeX. This is mainly for me to improve my LaTeX skills while expanding my knowledge of literature, language, history, and culture. It can be time-consuming, but it’s well worth it and, in my opinion, very satisfying! I’ve compiled the following books so far:
- [Draft] Birds of the Indian Hills [download]
- Ivanhoe [download]
- Robin Hood [download]
- Viking Tales [download]
Getting eBooks
The main Project Gutenberg website is “intended for human users only”. However, there are exceptions, as outlined here, through the use of wget. One can alternatively use a Project Gutenberg mirror site.
Humans can access the text file for an eBook using the URL ${domain}/files/${ebooknumber}/${ebooknumber}${encoding}.txt, where:
- ${domain} is https://www.gutenberg.org
- ${ebooknumber} is the eBook number
- ${encoding} is the text file’s character set encoding; e.g. -0 for UTF-8, -8 for ISO-8859-1; not applicable to ASCII files
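As a minimal sketch of the URL format above, the following shell function assembles the text-file URL from an eBook number and an encoding suffix. The function name gutenberg_url and the example number 12345 are illustrative and not part of this repository. The wget line is left commented out so that nothing is downloaded unless one of the exceptions mentioned above applies.

```shell
# Hypothetical helper: build the text-file URL for an eBook on the main site.
# $1 = eBook number, $2 = encoding suffix (-0, -8, or omitted for ASCII files)
gutenberg_url() {
  printf 'https://www.gutenberg.org/files/%s/%s%s.txt\n' "$1" "$1" "${2:-}"
}

gutenberg_url 12345 -0
# prints https://www.gutenberg.org/files/12345/12345-0.txt

# wget "$(gutenberg_url 12345 -0)"
```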
If using a mirror site, eBooks can be obtained through the following URL formats:
- for a 5-digit eBook number; e.g. 12345: ${domain}/1/2/3/4/12345${encoding}.txt
- for a 1-digit eBook number; e.g. 9: ${domain}/0/9${encoding}.txt
${domain} here refers to the mirror’s domain. See the list of mirror sites here.
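The digit-by-digit directory prefix in these formats can be derived from the eBook number. The sketch below is one way to do it; mirror_prefix is a hypothetical name, and the assembled path simply follows the formats listed above, so it is worth checking against the layout of whichever mirror is used.

```shell
# Hypothetical helper: directory prefix for an eBook number on a mirror,
# i.e. every digit except the last, separated by slashes (0 for 1-digit numbers).
mirror_prefix() {
  lead="${1%?}"                      # drop the last digit
  if [ -z "$lead" ]; then
    printf '0\n'
  else
    # Put a slash after every digit, then remove the trailing slash.
    printf '%s\n' "$lead" | sed 's|.|&/|g; s|/$||'
  fi
}

ebooknumber=12345
encoding=-0
# '${domain}' is printed literally here; substitute the mirror's domain.
printf '%s/%s/%s%s.txt\n' '${domain}' "$(mirror_prefix "$ebooknumber")" "$ebooknumber" "$encoding"
# prints ${domain}/1/2/3/4/12345-0.txt
```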
Converting eBooks to LaTeX
For eBook text files with a character set encoding other than UTF-8, I first convert them to UTF-8 using iconv. To convert ISO-8859-1 to UTF-8:
iconv -f ISO-8859-1 -t UTF-8 ${ebooknumber}-8.txt -o ${ebooknumber}-0.txt
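When several ISO-8859-1 files need converting, the same iconv call can be wrapped in a loop. This is only a sketch: convert_latin1 is a hypothetical helper, it assumes the files follow the -8.txt naming described earlier, and it writes the output with a shell redirect rather than the -o option.

```shell
# Hypothetical helper: convert every *-8.txt file in a directory to UTF-8,
# writing the result next to it with the -0 suffix.
# $1 = directory to scan
convert_latin1() {
  for src in "$1"/*-8.txt; do
    [ -e "$src" ] || continue        # skip when nothing matches the pattern
    iconv -f ISO-8859-1 -t UTF-8 "$src" > "${src%-8.txt}-0.txt"
  done
}
```

For example, convert_latin1 . converts the files in the current directory.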
The text files are then parsed into .tex using mostly find and replace; e.g. replace each odd and even occurrence of _, which is used to format italics, with \textit{ and }, respectively. This can be achieved using sed. This is the current sed code, which could be improved in the future:
sed ':a;N;$!ba;s|_\([^_]*\)_|\\textit{\1}|g' ${ebooknumber}-0.txt > ${ebooknumber}.tex
sed -i -e ':a;N;$!ba;s|"\([^"]*\)"|``\1'"''"'|g' -e 's|#|\\#|g' -e 's|&|\\&|g' ${ebooknumber}.tex
Other forms of modification are done manually.
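To see what the sed expressions above do, they can be run on a small sample through a pipe instead of on a file. The sketch below reuses the same expressions unchanged; the function name gutenberg_to_tex and the two-line sample are made up for illustration, and GNU sed is assumed.

```shell
# Hypothetical wrapper: apply the same substitutions to standard input.
gutenberg_to_tex() {
  # _text_ becomes \textit{text}
  sed ':a;N;$!ba;s|_\([^_]*\)_|\\textit{\1}|g' |
    # "text" becomes ``text'', then # and & are escaped for LaTeX
    sed -e ':a;N;$!ba;s|"\([^"]*\)"|``\1'"''"'|g' -e 's|#|\\#|g' -e 's|&|\\&|g'
}

printf '%s\n' '_Ivanhoe_ & "Robin Hood"' 'Chapter #1' | gutenberg_to_tex
# prints:
# \textit{Ivanhoe} \& ``Robin Hood''
# Chapter \#1
```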
LaTeX requirements and compilation
All required packages are available on CTAN. It is recommended to use a TeX distribution, such as TeX Live, to ensure all requirements are satisfied. The book document class is used.
The PDF files are built using either XeLaTeX or LuaLaTeX via Arara:
cd books
for dir in */; do
  # Build each book in a subshell so that a failed build
  # cannot leave the loop in the wrong directory.
  (cd "${dir%/}" && arara "${dir%/}.tex")
done
The Arara directives used are as follows (replace xelatex with lualatex, if necessary):
% arara: xelatex
% arara: xelatex
Note that Arara requires a Java virtual machine.
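Before building, it can help to check that the tools mentioned above are installed. The following is a small sketch, assuming they are expected on the PATH under the command names arara, java, xelatex and lualatex; missing_tools is a hypothetical helper.

```shell
# Hypothetical helper: print the name of each listed command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Lists whichever of these are missing; prints nothing if all are available.
missing_tools arara java xelatex lualatex
```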
An alternative is to use the latest TeX Live Docker image by Island of TeX, which can also be used with GitLab CI. The following is a minimal example of a valid .gitlab-ci.yml configuration:
image: registry.gitlab.com/islandoftex/images/texlive:latest

build:
  before_script:
    - cd books
  script:
    - for dir in */;
      do cd ${dir%%/} &&
      arara ${dir%%/}.tex &&
      cd ..;
      done
  artifacts:
    paths:
      - "**/*.pdf"
The resulting PDF files will be available as artefacts once the build is complete. See this TUGboat article for more information.
To-do
- fix underfull / overfull boxes
- miscellaneous formatting
- compile more books (suggestions are welcome!)
Licence
- Project Gutenberg eBooks are licensed under the Project Gutenberg License. See the Project Gutenberg website for more information.
- Code is licensed under the MIT License.