Sunday 17 June 2018

PDF - Gripes with PDF file format

So I wanted to extract graphics from a Bank of England report but it turned out to be very involved. I began to get drawn into the PDF file format. Here are some notes on its difficulty.

Firstly, a PDF can be massive, so as a first step I recommend breaking it into single-page PDF files. I have written a blog post here which shows how.

Secondly, PDF files can be encrypted. The Bank of England report is, albeit with a password of an empty string "", which is a little tedious. Luckily the Python library PyPDF2 can decrypt a file with the following code

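def DecryptPdf(pdfFileReader, password):
    # pdfFileReader is a PyPDF2 PdfFileReader (isEncrypted is the PyPDF2 1.x name).
    if pdfFileReader.isEncrypted:
        try:
            # decrypt() returns 0 when the password does not match.
            if pdfFileReader.decrypt(password):
                print('File decrypted')
            else:
                print('File decryption failed: wrong password')
        except Exception as e:
            # Raised, for example, for an unsupported encryption method.
            print('File decryption failed: ' + str(e))
    else:
        print('File not encrypted')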

Thirdly, we have to deal with compression. Even after decrypting, certain portions of a PDF document will be compressed and so read as gibberish in a text editor. Because of this I had great difficulty even scratching the surface of the PDF file format.

What is needed is a good program to help you explore the structure, and thankfully I found PDFXplorer. Here is a screenshot showing a single page of the report being explored; it shows a compressed stream in decompressed view. It also has a Save stream to disk button, which allows the stream to be exported and then viewed in a text editor.
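For the curious, here is roughly what such a decompressed view involves. The most common stream filter in PDF is /FlateDecode, which is zlib/deflate, so Python's standard zlib module can inflate such a stream. This is only a minimal sketch under assumptions: it builds a toy object in memory rather than reading the Bank of England file, the helper name inflate_first_stream is my own invention, and it ignores the /Length entry, other filters and predictors.

```python
import re
import zlib


def inflate_first_stream(pdf_bytes):
    """Return the decompressed bytes of the first stream found in pdf_bytes.

    Assumes the stream uses /FlateDecode (zlib) with no predictor.
    """
    # The stream data sits between the 'stream' and 'endstream' keywords.
    match = re.search(rb"stream\r?\n(.*?)\nendstream", pdf_bytes, re.DOTALL)
    if match is None:
        raise ValueError("no stream found")
    return zlib.decompress(match.group(1))


# Build a toy object that mimics a compressed content stream.
content = b"0.96 0.53 0.05 0.27 k\n241.666 133.557 2.364 -25.886 re\nf\n"
compressed = zlib.compress(content)
toy_object = (
    b"4 0 obj\n<< /Filter /FlateDecode /Length %d >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)

# The gibberish in the middle of toy_object becomes readable text again.
print(inflate_first_stream(toy_object).decode("ascii"))
```

A real file has many streams, and an encrypted file must be decrypted first, so a tool like PDFXplorer is still the easier way to browse.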

Fourthly, the PDF file format is unlike XML, JSON or any other standard text format. After using PDFXplorer to save a stream to disk and examining it in a text editor I found a key section...

/Figure <</MCID 88 >>BDC 
/PlacedGraphic /MC0 BDC 
EMC 
q
39.686 83.091 223.603 129.731 re
W n
0 0 0 1 K
0.5 w 4 M 
/GS0 gs
252.534 204.865 -212.599 -113.386 re
S
Q
0.96 0.53 0.05 0.27 k
/GS0 gs
241.666 133.557 2.364 -25.886 re
f
234.472 126.822 2.416 -19.152 re
f
227.335 128.493 2.416 -20.823 re

To interpret this language one needs to reference Appendix A of this 756 page document. Here is a table of some of the operators signified by the letters

BDC = Begin marked-content
EMC = End marked-content
q = Save graphics state
re = Append rectangle to path
W = Set clipping...
n = End path without filling...
K = Set CMYK color for stroking ops
w = Set line width
M = Set miter limit
gs = Set ... graphics state...
S = Stroke path
Q = Restore graphics state
k = Set CMYK color for nonstroking ops
f = Fill path using nonzero winding
/ = start of a name
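The language is postfix: operands come first and the operator that consumes them comes last. As a rough sketch of how one might start reading it, the snippet below groups the tokens of the fragment above (from q onwards) into (operator, operands) pairs. The function name parse_operations is hypothetical, and the approach is a deliberate simplification: it splits on whitespace only, so it cannot cope with strings, arrays, dictionaries such as <</MCID 88 >> or inline images.

```python
SAMPLE = """\
q
39.686 83.091 223.603 129.731 re
W n
0 0 0 1 K
0.5 w 4 M
/GS0 gs
252.534 204.865 -212.599 -113.386 re
S
Q
0.96 0.53 0.05 0.27 k
/GS0 gs
241.666 133.557 2.364 -25.886 re
f
"""


def parse_operations(stream_text):
    """Group a content stream into (operator, operands) pairs."""
    operations = []
    operands = []
    for token in stream_text.split():
        if token.startswith("/"):
            operands.append(token)  # a name such as /GS0
            continue
        try:
            operands.append(float(token))  # a numeric operand
        except ValueError:
            # Anything else is treated as an operator; it consumes
            # the operands collected so far.
            operations.append((token, operands))
            operands = []
    return operations


for operator, operands in parse_operations(SAMPLE):
    print(operator, operands)
```

Filtering the result for the re operator is then one way to pull out the rectangles, each given as x, y, width and height.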

The line 0.96 0.53 0.05 0.27 k caught my eye, as I was looking for the path data of some blue rectangles in the following graph. The k operator sets the fill (non-stroking) colour using a CMYK (Cyan, Magenta, Yellow, Key) colour code; to convert to RGB see this web site. The lines that follow the CMYK line draw rectangles; they are part of this graphic taken from page 6 of a Bank of England report. The first blue rectangle is shown selected with double arrow handles...
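If you would rather do the colour conversion in code, a minimal sketch follows. It uses the naive, profile-free formula (each RGB channel is 255 x (1 - ink) x (1 - K)), and the function name cmyk_to_rgb_hex is my own. Treat the result as an approximation only: the SVG further down this post has the fill #19518b for the same rectangle, presumably because that converter applied a colour profile, whereas the naive formula gives a somewhat different blue.

```python
def cmyk_to_rgb_hex(c, m, y, k):
    """Naive CMYK -> RGB conversion (no colour profile), as '#rrggbb'.

    All four inputs are fractions in the range 0..1, as in the PDF k operator.
    """
    r = round(255 * (1 - c) * (1 - k))
    g = round(255 * (1 - m) * (1 - k))
    b = round(255 * (1 - y) * (1 - k))
    return "#{:02x}{:02x}{:02x}".format(r, g, b)


# The operands of the k operator from the content stream above.
print(cmyk_to_rgb_hex(0.96, 0.53, 0.05, 0.27))  # prints #0757b1
```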

So, in my opinion, the PDF file format is difficult to work with. I cannot imagine how to begin parsing this document. It is true that there are probably Python libraries to help, but one still needs to browse the document and figure out the right questions to ask any such library.

In a future post I'll show how converting the page to an SVG file facilitates navigation. As a taster, I can show you that the selected blue rectangle gets converted into the following SVG/XML which, whilst it may be verbose, is clearly selectable with some XPath...

<path
    id="path5759"
    style="fill:#19518b;fill-opacity:1;fill-rule:nonzero;stroke:none"
    d="m 241.666,133.557 h 2.364 v -25.886 h -2.364 z" />
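To give a flavour of that selection, here is a small sketch using Python's standard xml.etree.ElementTree, which supports a limited XPath subset. The surrounding svg and g elements are assumptions of mine (the real converted file will have more structure), and rect_to_path_d is a hypothetical helper showing how the four operands of the PDF re operator (x, y, width, height) line up with the relative path commands above.

```python
import xml.etree.ElementTree as ET

# SVG elements live in this namespace, so the XPath needs a prefix mapping.
SVG_NS = {"svg": "http://www.w3.org/2000/svg"}

# Assumed minimal wrapper around the path element shown above.
svg_text = """<svg xmlns="http://www.w3.org/2000/svg">
  <g>
    <path
        id="path5759"
        style="fill:#19518b;fill-opacity:1;fill-rule:nonzero;stroke:none"
        d="m 241.666,133.557 h 2.364 v -25.886 h -2.364 z" />
  </g>
</svg>"""


def rect_to_path_d(x, y, width, height):
    """Express a PDF 're' rectangle as relative SVG path commands."""
    return "m {},{} h {} v {} h {} z".format(x, y, width, height, -width)


root = ET.fromstring(svg_text)
# Select the path by its id attribute, wherever it sits in the tree.
path = root.find(".//svg:path[@id='path5759']", SVG_NS)

print(path.get("d"))
print(rect_to_path_d(241.666, 133.557, 2.364, -25.886))
```

Both lines printed are the same string, which is a nice confirmation that the SVG path really is the rectangle from the content stream.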

Final Thoughts

I didn't much like my dive into the PDF file format and I'd rather not revisit it any time soon. But whether it can be dispensed with depends on one's goals and the alternative technologies available.
