Recently, while reading Dune: The Battle of Corrin, a 2004 science fiction novel, as a PDF e-book in Foxit Reader, I realized that I could not copy text from the document. Only garbled characters, not readable text, ended up on the clipboard. For the same reason, I could not search the document for specific words or phrases.
What is the issue here? For one thing, the PDF is not copy/print restricted; otherwise I would not have been able to copy text at all. Such restrictions can also be removed easily with any of several PDF unlocker utilities, or by using a reader such as Evince, which does not honor these permission settings and simply allows the user to perform all operations.
In my case, the Copy option is clearly available when I right-click the selected text:
By switching to Text Viewer mode in Foxit Reader, I could see that all the text in this PDF file appears to have been encoded, or at least stored using a custom character set, as no readable text could be found:
At this point, I decided to further investigate the PDF file (download an extract from here) using PDFStreamDumper and see what character sets are embedded. Not surprisingly, PDFStreamDumper was also not able to retrieve any readable text and just showed garbled characters:
By navigating through the streams in the PDF file, I was able to locate one that seems to be responsible for the custom character encoding:
It seems as if the PDF was not generated using a standard character encoding such as Unicode or ANSI. Instead, the author decided to use CID fonts, Adobe’s custom font and character set format, to store the document text. While CID fonts can have many advantages, especially when displaying East Asian text, in this case I believe their use was a deliberate attempt to make copying text from the document a hassle, as most reader applications just copy the original encoded characters, so garbled text is pasted.
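For readers who want to look at such a stream themselves: a PDF font can carry an optional ToUnicode CMap, a small table that tells the reader which Unicode character each stored character code stands for. When that table is missing or unhelpful, readers tend to copy the raw codes instead. Whether that is exactly what happened in this e-book has not been confirmed here, so the fragment below is made up for illustration. The sketch only shows how the two common entry types (bfchar and the simple form of bfrange) could be read once the stream has been decompressed to text; the function name parse_tounicode is hypothetical.

```python
import re

# Made-up fragment in the style of a ToUnicode CMap (NOT taken from the e-book).
SAMPLE_CMAP = """
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
1 beginbfrange
<0044> <0046> <0061>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Return {character code: Unicode string} from bfchar/bfrange sections.

    Limitations of this sketch: the array form of bfrange
    (<lo> <hi> [<a> <b> ...]) is not handled, and bfrange targets are
    assumed to be single code points.
    """
    mapping = {}

    # bfchar: one source code followed by its UTF-16BE destination string.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")

    # bfrange: codes lo..hi map to consecutive code points starting at dst.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)

    return mapping


if __name__ == "__main__":
    for code, char in sorted(parse_tounicode(SAMPLE_CMAP).items()):
        print(f"{code:04X} -> {char!r}")
```

If the fonts in a document have no such table, or the table maps codes to the wrong characters, the mapping has to be worked out some other way.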
Restoring the original text
So how would you go about copying text from such a document? The first obvious way is to study the CID fonts being used (read here for some hints), compare the decoded text and the characters being stored, reverse engineer the mapping rules and write a program to restore the original text from the PDF file based on the rules. Not an easy task for the average computer user, I assume.
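To give an idea of what the "compare and reverse engineer" step could look like, here is a minimal sketch. It assumes the simplest possible case: every garbled character on the clipboard always stands for exactly one real character. You would copy a garbled passage, type out the same passage by hand from what is shown on screen, and let the program learn the table from the two aligned strings. The names learn_mapping, decode and scramble are hypothetical, and scramble is only a stand-in for the unknown encoding so that the example can run on its own; a real document may well need a more elaborate mapping.

```python
def learn_mapping(garbled, readable):
    """Build a garbled -> readable character table from two aligned strings."""
    if len(garbled) != len(readable):
        raise ValueError("samples must have the same length")
    table = {}
    for g, r in zip(garbled, readable):
        # setdefault stores the first pairing; a later mismatch means the
        # one-to-one assumption does not hold for this document.
        if table.setdefault(g, r) != r:
            raise ValueError(f"{g!r} maps to both {table[g]!r} and {r!r}")
    return table


def decode(garbled, table, unknown="?"):
    """Translate garbled text; characters not seen in the sample become '?'."""
    return "".join(table.get(ch, unknown) for ch in garbled)


def scramble(text):
    """Stand-in for the unknown encoding (an assumption for this demo only)."""
    return "".join(chr(ord(ch) + 0x100) for ch in text)


if __name__ == "__main__":
    sample = "the battle of corrin"
    table = learn_mapping(scramble(sample), sample)
    print(decode(scramble("a little face"), table))
```

The longer the hand-typed sample, the fewer characters are left as "?". A sample that contains every letter, digit and punctuation mark used in the book would be needed for a complete table.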
Another possible approach is to use a window text capture tool such as TextCatch, which tries to grab the text from the selected window as it is displayed. However, for most PDF files that I tested, even those without special character encoding, TextCatch does not seem to be able to retrieve any text from the PDF viewer window.
Is there an easier way? How about performing optical character recognition on the PDF file? In Adobe Acrobat, this can be done via the View > Tools > Text Recognition menu:
And this is the result:
Nope, it didn’t work because the stored text, although encoded, still appears renderable to Adobe Acrobat, which refuses to work unless the page is an image that can be OCR’ed.
I tried again by using the Foxit Reader PDF printer to print the document into another PDF file. This way, the resulting file has each page stored as an image and the custom character encoding removed. As a result, Adobe Acrobat OCR now worked properly:
If Adobe Acrobat still says your file contains renderable text, print the printed document to yet another PDF file and try again, which ensures that any left-over text is also converted to graphics.
After saving the resulting file and opening it with Foxit Reader, I could see that the document now contained readable text:
Text inside the PDF can now be searched without issues:
For some reason the characters in the final PDF file, after the OCR process, seem to be thicker than in the original, but this should not be an issue for most people. There could also be some misspelled words as a result of the optical character recognition algorithm. Overall, however, the quality is satisfactory, and most of the text in the original document has been restored well enough to allow copying and searching. The final document that contains searchable text can be downloaded here.
I believe this trick of performing OCR on the PDF-printed copy of the original document will help other people with similar problems. I hope somebody with expertise in the PDF format can help me do a better job by proposing a method to restore the original text more faithfully without using OCR.