Exorcising Ghostscript: 7x speed, superior image quality

I needed to download a PDF of the week's lecture slides for a course I was taking. But, with it weighing in at 170.9MB, 170.3MB of which was taken up by images, Firefox was having none of it as I tried, in vain, to jump quickly from topic to topic – more than 20 seconds to load the last page was truly an experience! Now, the normal solution would have been to drop the file into a native PDF reader, such as MuPDF, but I took the situation as a challenge: could I improve the experience for everyone?

I am afraid of Ghosts…cript, the tool for this sort of business, so I asked ChatGPT. WAIT DON’T LEAVE, my own brain will enter the scene soon. We’re starting with the slop:

gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.7 \
   -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dQUIET -dBATCH \
   -dDetectDuplicateImages=true \
   -dCompressFonts=true \
   -dSubsetFonts=true \
   -dColorImageDownsampleType=/Bicubic \
   -dColorImageResolution=150 \
   -dGrayImageDownsampleType=/Bicubic \
   -dGrayImageResolution=150 \
   -dMonoImageDownsampleType=/Subsample \
   -dMonoImageResolution=300 \
   -dColorImageFilter=/DCTEncode \
   -dGrayImageFilter=/DCTEncode \
   -dJPEGQ=75 \
   -sOutputFile=output.pdf \
   input.pdf

That is basically the same as what a quick search online turns up. I put it in a script, executed it and, 151 seconds later, got a 69MB file out. Not bad at all, but, uh oh, there are JPEGs of text in the slides, and they aren't happy:

The jpegfication gets intense.

We can take the size further down by tinkering with the values, but this problem remains and in fact gets worse. Each iteration taking minutes is also not great. There must be a better way.

My realisation was that this is not really a PDF issue, but an image compression issue “disguised” as one. What I needed was a tool to extract the images and then put them back once cjpegli and oxipng had done their magic. Here, I must admit, my addiction to pulling the slop machine's lever took over and I asked it to do it for me, but it was unable to comply. I knew the right tool for the job, the amazing pdfcpu, so I hinted it towards that, but the vibes were off or something. I had to do the unimaginable: read the manual 😟.

Luckily for me, not only is pdfcpu a sublime piece of software, but its documentation is also wonderful. Armed with the right knowledge, I gave the LLM (yeah, yeah) the exact commands to put together, and… it didn’t work! Long story short, Form XObjects were causing issues: their extracted filenames differ by having two extensions, so I thought of a stupid way to skip them. The final solution is as follows:
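The skip heuristic amounts to counting the dots in each extracted filename: one dot means a plain image, two or more means a nested Form XObject resource. A quick sanity check of the idea in plain bash (the filenames here are made up for illustration, not pdfcpu's actual naming scheme):

```shell
#!/usr/bin/env bash
# Hypothetical names: a normal extracted image has one extension,
# a Form XObject resource ends up with two.
for name in "Im1_12.png" "Fm0_3.pdf.png"; do
  dots=${name//[^.]}          # delete everything except the dots
  if (( ${#dots} >= 2 )); then
    echo "skip $name"         # -> skip Fm0_3.pdf.png
  else
    echo "keep $name"         # -> keep Im1_12.png
  fi
done
```

The `${name//[^.]}` expansion deletes every character that is not a dot, so the length of what remains is the dot count.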

#!/usr/bin/env bash
set -euo pipefail

input="$1"
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

pdfcpu extract -mode image "$input" "$tmpdir"

find "$tmpdir" -type f \
  \( -iname '*.jpg' -o -iname '*.jpeg' \) -print0 \
  | while IFS= read -r -d '' file; do
    cjpegli -q 60 "$file" "$file"
  done

find "$tmpdir" -type f \
  -iname '*.png' -print0 \
  | while IFS= read -r -d '' file; do
    oxipng -o 6 --strip safe --preserve --quiet "$file"
  done

find "$tmpdir" -type f \
  \( -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' \) -print0 \
  | while IFS= read -r -d '' img; do
      name=$(basename -- "$img")
      # Extracted Form XObjects carry two extensions and crash
      # pdfcpu's images update, so skip any name with >=2 dots.
      dots=${name//[^.]}
      if (( ${#dots} >= 2 )); then
        echo "Skipping nested resource (has >=2 dots): $name"
        continue
      fi

      pdfcpu images update "$input" "$img"
  done

The squeeze ended up at 33MB, an 80.7% reduction in size with no discernible drop in visual quality. Since both cjpegli and oxipng are multithreaded, the whole process dropped to 21.68 seconds – an 85.6% reduction in time, i.e. 7x faster.
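For the curious, the percentages check out. A throwaway awk snippet (nothing from the post's toolchain, just arithmetic on the numbers above):

```shell
# Size: 170.9MB -> 33MB; time: 151s -> 21.68s.
awk 'BEGIN {
  printf "size cut: %.1f%%\n", (1 - 33/170.9) * 100   # 80.7%
  printf "time cut: %.1f%%\n", (1 - 21.68/151) * 100  # 85.6%
  printf "speedup:  %.1fx\n",  151/21.68              # 7.0x
}'
```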

One of these is the original, the other the output.

So, there you go. Maybe after this post the crawlers will pick up this new trick. I’m doing my part.

Thanks for reading. Click here to go back to the main page, which may or may not have anything on it.