How to Remove Corrupt OCR Data from a PDF


Bottom Line: This post shows how to remove corrupt OCR data from a .pdf using only free-as-in-beer software.

One of the most frustrating things I’ve ever tried to do on my computer is remove corrupt or partial OCR text from a .pdf file. You can kind of think of a .pdf file as a “picture” of a document — and like any other picture, you can’t highlight, select, or copy a “picture” of a word. To get around this, you can embed an invisible layer of text data on top of the picture, which can be selected, etc., and looks like a regular document. The problem is when this text data gets messed up… because it’s invisible, you can’t tell that anything is wrong. But when you go to copy something from it… you end up with gobbledegook. And it ends up being really difficult to get rid of this invisible text, and you can’t re-OCR the document until the old text is gone.
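
If you want a rough way to check whether a file carries this kind of invisible text layer, here is a minimal Python sketch. It rests on two assumptions that are mine, not something I tested as part of the workflow below: that the OCR tool wrote its text with PDF text rendering mode 3 (the `3 Tr` operator, which draws text without filling or stroking it), and that the page content streams are either uncompressed or Flate-compressed. The function name `has_invisible_text` is made up for illustration, and the check is only a heuristic; it can miss text in PDFs that use other filters or encryption, and binary image data could in principle trigger a false positive.

```python
import re
import zlib

# Grab the raw bytes between each "stream" and "endstream" keyword.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# "3 Tr" sets text rendering mode 3 (invisible text) in a content stream.
INVISIBLE_RE = re.compile(rb"(?<![\d.])3\s+Tr\b")


def has_invisible_text(pdf_bytes: bytes) -> bool:
    """Heuristically report whether any stream sets text rendering mode 3."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # Assumption: compressed streams use Flate (zlib) compression.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not zlib data, so treat the stream as uncompressed.
            data = raw
        if INVISIBLE_RE.search(data):
            return True
    return False


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(has_invisible_text(handle.read()))
```

Running it on a file before and after the workflow below should, under those assumptions, show whether the hidden text is really gone.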

I have gone through every step and workflow I could find using Acrobat, Photoshop, Skim, CUPS… you name it. I tried all the Acrobat tips about discarded invisible layers and embedded data and preflight… nothing. The closest I could get was by batch printing to an image file and recompiling that into a .pdf (which I could then re-OCR)… but when you’re talking about documents hundreds of pages long, this just isn’t a practical option.

Luckily, I finally figured out a way that works for me, using a “print-to-pdf” utility called PDFwriter and good old Adobe Reader. Basically, you tell Adobe Reader to print the .pdf “as an image,” and you print to PDFwriter. Alone, Reader won’t let you print to .pdf using the system .pdf printer; like Acrobat, it just tells you to “save” instead of printing a .pdf to .pdf. And the system .pdf printer won’t print “as an image”; instead, it just gives you another exact copy of the .pdf (with the invisible text intact). Together, though, it works out.

Here’s the workflow:

  1. Download and install PDFwriter
  2. Add PDFwriter as a printer: open a print dialog (e.g. print something from Preview), then choose Printer (drop-down menu) -> “add printer” -> PDFwriter
  3. Open the .pdf to be converted in Adobe Reader
  4. Print -> Advanced -> “Print as Image”
  5. Choose PDFwriter as the printer and print. By default the file ends up in a strange location,
    /Users/Shared/pdfwriter, which symlinks to
    /private/var/spool/pdfwriter/[username]/
  6. You’re done and can re-OCR the file with whatever OCR tool you prefer.
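
Since the output lands in that out-of-the-way folder, a small script can fetch the newest file for you. This is just a sketch: the function name `collect_latest_pdf` and the Desktop destination are my own placeholders, and it assumes PDFwriter has finished writing the file and that the output has a `.pdf` extension. The spool path is the default mentioned in step 5.

```python
import shutil
from pathlib import Path
from typing import Optional


def collect_latest_pdf(spool_dir: Path, dest_dir: Path) -> Optional[Path]:
    """Move the most recently modified PDF in spool_dir into dest_dir.

    Returns the new path, or None if spool_dir holds no PDF files.
    """
    pdfs = [
        p for p in spool_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    ]
    if not pdfs:
        return None
    # Pick the file with the latest modification time.
    newest = max(pdfs, key=lambda p: p.stat().st_mtime)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / newest.name
    shutil.move(str(newest), str(target))
    return target


if __name__ == "__main__":
    # Default PDFwriter output folder from step 5; the destination is an assumption.
    moved = collect_latest_pdf(
        Path("/Users/Shared/pdfwriter"),
        Path.home() / "Desktop",
    )
    print(moved if moved else "No PDF found in the PDFwriter folder")
```

Note that a file with the same name already sitting in the destination may be overwritten, so point it at an empty folder if that matters.
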
  • Mark

    Thanks!!

    That Print -> Advanced -> “Print as Image” (in any pdf app, including FoxitReader or Adobe Acrobat itself) -> another PDF, then just re-OCR is the key step I’ve been looking for :)

    • http://n8henrie.com/ Nathan Henrie

      Glad to hear it helped, this problem drove me nuts! For me, the bigger problem was that Acrobat wouldn’t let me “print to file” if I was printing to image, and wouldn’t let me print as an image if I was saving to .pdf, so installing a virtual printer (instead of OSX’s built-in pdf printer) was the step I needed.

  • James

    Can’t thank you enough for this! I was despairing of finding a solution until Google led me to you. One thing to keep in mind is that the image quality suffers (slightly but noticeably) when you save as image (I guess the image gets resampled). But it’s a great workaround. Thanks again!

    • http://n8henrie.com/ Nathan Henrie

      Awesome, glad to hear you got it to work! Yes, unfortunately it does have to resample the images. If you use Acrobat (instead of Reader), you can choose the resolution for the “printed” image, so if it’s something really critical you could crank that up for ostensibly better results. Unfortunately Reader doesn’t seem to have that option.

      • James

        Can’t afford Acrobat, I’m using PDFpen, which (as far as I can tell) doesn’t have any of these options. But even with the reduced image quality coming out of Reader, I was still able to re-OCR my document with better results than before. Thanks again.

        • http://n8henrie.com/ Nathan Henrie

          Gotcha. Well either way, glad it worked!

  • Mike

    My concern is with losing the quality of the contents. My PDFs contain scans of images. Every operation that decodes and re-encodes pages causes some amount of data loss. When they are full-color images to be printed into a book then losing the quality isn’t worth it. I try to keep 2 copies at all times, with OCR and without. I tried this printing to PDF / images technique some time back and it does work… it works very well! I do appreciate you taking the time to write it, post it, and now it’s been indexed for everyone to find.

    I was hoping for a complete “Remove Metadata” kind of thing without my images being touched.

    I recalled that Acrobat Pro X has a “Remove Hidden Information” option under the “Protection” section. When I tried it, I did see in the left panel all of the data that’s hidden, including the search text. BUT, when I ran it, not only did it remove all of the OCR data, it also degraded the quality of the images! Not only that, the size of the file almost doubled. It would appear that even Acrobat does a re-encode for some reason.

    I’ll continue to watch this link and thread. It’s a great one. It seems so trivial but I suppose it’s not.

    • http://n8henrie.com/ Nathan Henrie

      Thanks for your input, and I feel your pain. It does indeed degrade the image quality (although if you’re printing to image with Acrobat instead of Reader you can increase the output image quality and minimize your losses at the expense of file size). I’ve read numerous old threads on how to discard hidden text data without losing the image quality, and I’ve spent dozens of hours trying out all those methods unsuccessfully. If you find a free solution that works, please let me know. Until then, this is the best I can do. Thanks for your input and kind words!

  • Evan Allen

    I was looking for a way to just strip the ocr from a pdf… and now I feel dumb. I appreciate the help, but it seems so obvious that you could just print to pdf and that pdf creator would re-sample the document with no metadata. Image quality aside I’m grateful for someone to slap some sense into me.

    • http://n8henrie.com/ Nathan Henrie

      Ha — well, it sure took me a long while! The problem I had was finding an app that would really “print to image” — unfortunately, many apps I tried were still retaining the text info and ruining that strategy. It wasn’t until I found Reader’s hidden option that things finally worked. Glad to hear that it was helpful!

  • Thomas Julou

    What about using Preview’s export function? It allows you to export to a multipage tiff file and then to create a pdf from there:

    1. Open the pdf with the corrupted OCR text in Preview.
    2. Select all pages and use File>Export, selecting TIFF as the export format (using LZW compression might save some disk space…)
    3. Open the newly created TIFF file in Preview, select all pages and use File>Export, selecting PDF as the export format (compression by a Quartz filter is again a good idea)