.NET Framework - Extract Image From PDF

Asked By Steve on 21-Aug-08 09:27 PM
Hi all

Does anybody know a way to extract an image from a PDF file and save
it as a TIFF?

I have used a scanner to scan documents which are then placed on a server,
but I need to extract the image of the document (just the first page if
there are multiple pages) and save it as a TIFF so I can then use the
Tesseract OCR to get the text in the image.

I think the company I am working for may have a license for Adobe Acrobat
Professional, if that provides a way to do this from my .NET application.

Thank you for your help.

Kind Regards,
Steve




Rick replied on 22-Aug-08 07:10 AM
I don't know of a .NET way exactly; however, you can check out Ghostscript,
which will let you read a PDF and save it as a TIFF. I think you can
specify which page numbers to convert.  You can call Ghostscript from the
command line, passing your parameters via Process.Start.

hth,

Rick
Steve Amey replied on 23-Aug-08 04:33 PM
Hi Rick

Thanks for that. I have downloaded and installed Ghostscript. I have a demo
app that can execute Ghostscript with command line parameters, and at the
moment I can only get the revision number and a thumbnail view of the first
page (JPEG) based on the content I have found.

Do you know the parameters I would need to extract the image on the first
page to a TIFF please? I can't seem to find these anywhere :o(

Here are the args I found to generate a jpeg based on a pdf document:

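Dim astrArgs(6) As String 'Upper bound 6 gives the seven elements used below
astrArgs(0) = "pdf2jpg" 'The first parameter is ignored
astrArgs(1) = "-dNOPAUSE" 'Do not pause between pages
astrArgs(2) = "-dBATCH" 'Exit once the input has been processed
astrArgs(3) = "-dSAFER" 'Restrict file operations
astrArgs(4) = "-sDEVICE=jpeg" 'Output device (JPEG here)
astrArgs(5) = "-sOutputFile=C:\Thumbnail.jpg"
astrArgs(6) = "C:\MyPDFDoc.pdf" 'Input PDF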

Thanks for your help!

Regards,
Steve
Rick replied on 23-Aug-08 08:08 AM
I run mine from .NET with Process.Start like this:

process.StartInfo.Arguments =
String.Format("-dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffg3 -sOutputFile=""{0}""


If you want to read only the first page, you would add -dFirstPage=1
and -dLastPage=1 (see http://web.mit.edu/ghostscript/www/Use.htm )
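
As a rough illustration of how those two switches might be combined with the
others in a Process.Start call, here is a minimal sketch. It is not the exact
code from this thread: the executable path, the file names and the
BuildArguments helper are assumptions made up for the example, and the install
path in particular will differ from machine to machine.

```vbnet
Imports System
Imports System.Diagnostics

Module GhostscriptFirstPage

    ' Hypothetical helper: builds the argument string for rendering only
    ' page 1 of a PDF to a TIFF file with the tiffg3 device.
    Function BuildArguments(ByVal pdfPath As String, ByVal tiffPath As String) As String
        Return String.Format( _
            "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffg3 " & _
            "-dFirstPage=1 -dLastPage=1 -sOutputFile=""{0}"" ""{1}""", _
            tiffPath, pdfPath)
    End Function

    Sub Main()
        Dim gs As New Process()
        ' Assumed install location of the Ghostscript console executable.
        gs.StartInfo.FileName = "C:\gs\bin\gswin32c.exe"
        gs.StartInfo.Arguments = BuildArguments("C:\MyPDFDoc.pdf", "C:\Page1.tif")
        gs.StartInfo.UseShellExecute = False
        gs.StartInfo.CreateNoWindow = True
        gs.Start()
        gs.WaitForExit()
        ' A non-zero exit code suggests Ghostscript reported an error.
        Console.WriteLine("Ghostscript exit code: " & gs.ExitCode.ToString())
    End Sub

End Module
```

Keeping the argument string in its own function makes it easy to print and
check the command line before running it.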

I am slightly confused about what you really want.  I understood you want to
convert the entire first page to a tiff file and then use an OCR program to
read text.  If you want to only extract an image from the first page, I'm
not sure this would work.  I don't know of a facility to extract an image
from a PDF.  You might check iTextSharp, which can create and read PDFs.  If
you know the name of the image, you may be able to extract it.

Also, if you want a Tiff file why are you extracting to a jpeg below?

Rick
Steve Amey replied on 23-Aug-08 05:48 PM
Thank you, I'm generating tiff files now.

The PDF is an image of a scanned document, and I would like to get the text
out of that scanned image. I looked into OCR and came across Tesseract. To my
knowledge, Tesseract can only read a TIFF file and extract the text from it.
If I open a document in Adobe Pro and save the scanned image as a TIFF,
Tesseract reads most of it quite well. My problem is that I have to automate
the process, so I can't open the documents and save the images manually. I
need something that extracts the scanned image from the PDF file and saves it
as a TIFF so Tesseract can read it. I tried iTextSharp already, but I get the
error "PDF header signature not found", which I'm guessing is a problem with
the way the scanner creates the PDF files, so iTextSharp can't open them.

I found some sample code that creates a JPEG, which is what I posted, because
I didn't know how to create a TIFF file. I see now that it's just a case of
changing the -sDEVICE parameter to the one you are using.

Unfortunately, the resulting tiff image is not great quality and tesseract
makes many errors when trying to read it, so I have to find another way or
give up :o(

Thank you for your help. If you know of any other way to do what I'm trying
to do, then I'd love to know! I don't mind paying a small amount for some
commercial software that can extract images from PDF docs that I can use in
.NET, but I haven't found any yet that don't cost hundreds or even thousands
of dollars.
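
One thing that might be worth trying on the quality problem: Ghostscript's -r
switch sets the rendering resolution in dpi, and -sDEVICE accepts other TIFF
devices such as tiffg4 or tiffgray. The sketch below only builds the argument
string. The 300 dpi value, the choice of tiffgray and the BuildArguments
helper are assumptions for illustration, and whether Tesseract reads the
result any better would need to be tested.

```vbnet
Imports System

Module GhostscriptResolution

    ' Hypothetical helper: builds Ghostscript arguments for page 1 only,
    ' with the output device and resolution (dpi) passed in by the caller.
    Function BuildArguments(ByVal pdfPath As String, ByVal tiffPath As String, _
                            ByVal device As String, ByVal dpi As Integer) As String
        Return String.Format( _
            "-dSAFER -dBATCH -dNOPAUSE -sDEVICE={0} -r{1} " & _
            "-dFirstPage=1 -dLastPage=1 -sOutputFile=""{2}"" ""{3}""", _
            device, dpi, tiffPath, pdfPath)
    End Function

    Sub Main()
        ' Print the arguments so they can be checked or pasted into a console.
        Console.WriteLine(BuildArguments("C:\MyPDFDoc.pdf", "C:\Page1.tif", "tiffgray", 300))
    End Sub

End Module
```

The resulting string can be assigned to StartInfo.Arguments in the same way as
the other Ghostscript arguments in this thread.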
Gillard replied on 23-Aug-08 08:48 AM
http://www.foolabs.com/xpdf/

pdfimages - Portable Document Format (PDF) image extractor (version 3.02)
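
For anyone wanting to call pdfimages from .NET, a minimal sketch might look
like the following. The -f and -l options limit extraction to a page range,
and the tool writes files named image-root-nnn with a .ppm or .pbm extension
(or .jpg for JPEG-encoded images when -j is given), so a further conversion
step would still be needed to get a TIFF. The executable path, the file names
and the BuildArguments helper are assumptions for the example.

```vbnet
Imports System
Imports System.Diagnostics

Module PdfImagesFirstPage

    ' Hypothetical helper: builds pdfimages arguments that restrict
    ' extraction to page 1 (-f = first page, -l = last page).
    Function BuildArguments(ByVal pdfPath As String, ByVal imageRoot As String) As String
        Return String.Format("-f 1 -l 1 ""{0}"" ""{1}""", pdfPath, imageRoot)
    End Function

    Sub Main()
        Dim extractor As New Process()
        ' Assumed location of the xpdf command-line tool.
        extractor.StartInfo.FileName = "C:\xpdf\pdfimages.exe"
        ' Output files will start with the image root, e.g. page1-000.ppm.
        extractor.StartInfo.Arguments = BuildArguments("C:\MyPDFDoc.pdf", "C:\Scans\page1")
        extractor.StartInfo.UseShellExecute = False
        extractor.StartInfo.CreateNoWindow = True
        extractor.Start()
        extractor.WaitForExit()
        Console.WriteLine("pdfimages exit code: " & extractor.ExitCode.ToString())
    End Sub

End Module
```

Because pdfimages copies out the embedded image rather than re-rendering the
page, the extracted file keeps the resolution the scanner stored in the PDF.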