I'm working through a few workflow scenarios with Acrobat X's scanning and optical character recognition (OCR) for an upcoming comparison study. Along the way, I've picked up two quick insights into Acrobat X.
First, it does a great job of shrinking the file size automatically after the text recognition is run. In many cases, I don't even have to run the "reduce file size" feature to bring multi-page documents to a manageable file size.
Second, the issue that cropped up in Apple's Preview with Acrobat 9 files, after I'd applied the "reduce file size" feature, remains. The image looks blocky and jagged, as if a great deal of information has been discarded.
I initially chalked the issue up to an error in the file reduction algorithms, but the advent of Acrobat X, with its better-than-average standard compression, brings the issue back to the forefront. Even without running the "reduce file size" feature in Acrobat X, any file that's been OCRed will appear jagged and blocky within Apple's Preview application.
Researching the topic online didn't turn up any clues, probably because Acrobat X is so new. Since I couldn't ignore the issue, as it affects all the scans I was applying OCR to, I turned to my contact at Adobe's PR agency for insight.
It turns out that Acrobat X's new rendering engine may borrow a few pages from Acrobat 9's file size reduction playbook. In doing so, it creates a potential rendering issue in Apple's Preview. According to my contact:
Acrobat X's new scan compression technology divides the image into 3 layers - Background (BG), Foreground (FG) & Mask. BG, FG images are highly down sampled while mask is kept at a higher resolution (to maintain text readability).
The layering makes sense, as Adobe has always offered the choice of Image-Text (where the image overlays the underlying OCR text) or Text-Image (where the text attempts to lay out in a pattern closely resembling the image, but the text is the top layer).
I've always opted for Image-Text, as it allows the human looking at the document to read what's actually on the page, should he or she find the OCR text they copied from the PDF a bit, well, lacking.
My contact went on to provide some reasoning behind the mismatch between Preview and Acrobat Reader X, the latter of which seems to display the images at much higher quality:
Here is our hypothesis on the reason of low quality rendering by Preview. In order to render a page, it down samples mask to the resolution of FG (or maybe BG), so it loses on text crispness or quality that a high resolution mask provides. Adobe Reader/Acrobat, on the other hand, up samples images to the highest of the resolution of BG, FG and mask to get the rendered bitmap.
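To make the hypothesis concrete, here is a toy sketch in Python of the two rendering paths my contact describes. Everything in it is my own assumption for illustration: the function names, the tiny 8x8 page, the 4x resolution gap, and the use of simple averaging when the mask is shrunk. It is not Adobe's or Apple's actual code, just a way to see why shrinking the mask would smear thin strokes.

```python
# Toy model of a three-layer scanned page (hypothetical, for illustration only).
# Pixel values are grayscale: 0 = black, 255 = white.
# Mask values: 1 = show foreground (ink), 0 = show background (paper).

def upsample(img, factor):
    """Nearest-neighbour upsample: repeat each pixel `factor` times in x and y."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def downsample(img, factor):
    """Box downsample: average each factor-by-factor block into one pixel."""
    out = []
    for y in range(len(img) // factor):
        row = []
        for x in range(len(img[0]) // factor):
            block = [img[y * factor + dy][x * factor + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def composite(fg, bg, mask):
    """Blend per pixel: mask picks foreground, (1 - mask) picks background."""
    return [[m * f + (1 - m) * b for f, b, m in zip(fr, br, mr)]
            for fr, br, mr in zip(fg, bg, mask)]

def render_reader_like(fg, bg, mask, factor):
    """Upsample FG and BG to the mask's resolution, then composite."""
    return composite(upsample(fg, factor), upsample(bg, factor), mask)

def render_preview_like(fg, bg, mask, factor):
    """Downsample the mask to FG/BG resolution, composite, then enlarge."""
    low_res_page = composite(fg, bg, downsample(mask, factor))
    return upsample(low_res_page, factor)

FACTOR = 4                                   # assumed resolution gap
fg = [[0, 0], [0, 0]]                        # low-res foreground: black ink
bg = [[255, 255], [255, 255]]                # low-res background: white paper
# High-res 8x8 mask with a one-pixel-wide vertical stroke in column 3.
mask = [[1 if x == 3 else 0 for x in range(8)] for _ in range(8)]

reader_page = render_reader_like(fg, bg, mask, FACTOR)
preview_page = render_preview_like(fg, bg, mask, FACTOR)

print("Reader-like row: ", reader_page[0])
print("Preview-like row:", preview_page[0])
```

In this toy case the one-pixel stroke stays one pixel wide and fully black in the first path, while in the second it turns into a four-pixel-wide light gray band (191.25 out of 255). That is roughly the blocky, washed-out look I see in Preview, though a real viewer may shrink the mask in some other way than averaging.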
Is this the reality? I guess we'll have to see whether Apple issues an update to Preview in the near term. If not, I'm stuck either suggesting that every client upgrade to Acrobat Reader X, or choosing to completely forgo the workflow of using Quick Look to view text-heavy OCRed documents.