How to extract images from a PDF in their original format

Question

I'm using pdfimages -j bar.pdf /tmp/image to extract images from a PDF. My objective is to get them in their raw state, as they were added. So if it was a .tif I'd like to get a .tif, and if it's a .jpg I'd like to get a .jpg. I keep getting .ppm for everything I extract. Is it possible to

Accepted Answer

You can't (reliably) know the source image file format by looking at an image in a PDF. For example, TIFF images can be compressed with (off the top of my head) none, RLE, CCITT (a couple of variations), LZW, Flate, or JPEG. If an image in a PDF is compressed with DCT (JPEG), how do you decide whether the source was TIFF or JPEG? If it is compressed with Flate, how do you distinguish between TIFF and PNG? Further, it is the software generating the PDF that decides the compression. I can take a Flate-compressed TIFF image and encode it into a PDF using JPEG2000; or take a CCITT-compressed image and compress it with JBIG2; or take a JPEG image, reduce it to an 8-bit paletted image, and compress it with Flate.

TL;DR: you can't know.
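
To make that concrete: the PDF records the filter applied to each image stream, not the file type the image came from. Here is a minimal Python sketch of what an extractor can do with that information. The filter names are the standard PDF ones; the function name extraction_target and the labels it returns are made up for illustration and are not part of pdfimages or any library.

```python
# Filters whose stream bytes are already in a recognisable image encoding,
# so an extractor can usually write them out without re-encoding.
NATIVE_OUTPUT = {
    "DCTDecode": "jpg",       # JPEG data
    "JPXDecode": "jp2",       # JPEG 2000 data
    "JBIG2Decode": "jbig2",   # JBIG2 data
    "CCITTFaxDecode": "ccitt",  # fax-encoded data, normally wrapped in a TIFF
}

# General-purpose compression filters: decoding yields raw pixel samples
# with no container, so the extractor must pick an output format itself.
DECODED_ONLY = {"FlateDecode", "LZWDecode", "RunLengthDecode"}


def extraction_target(filter_name: str) -> str:
    """Return an illustrative label for what can be written for a filter.

    Hypothetical helper: it only shows that the filter tells you how the
    image is stored in the PDF, never what the original file was.
    """
    name = filter_name.lstrip("/")  # accept "/DCTDecode" or "DCTDecode"
    if name in NATIVE_OUTPUT:
        return NATIVE_OUTPUT[name]
    if name in DECODED_ONLY:
        return "raw pixels"  # e.g. written as .ppm or re-encoded as .png
    return "unknown"


if __name__ == "__main__":
    for f in ("/DCTDecode", "/FlateDecode", "/CCITTFaxDecode"):
        print(f, "->", extraction_target(f))
```

Note that a "jpg" result only means the image is stored as JPEG inside the PDF. The file it was created from could just as well have been a JPEG-compressed TIFF, and a "raw pixels" result could have started life as a TIFF, a PNG, or anything else.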

Answer