Sunday, 17 June 2018

Python - PDF - Split large file into single pages

In my last post I showed how to extract text from a PDF using Python and the PyPDF2 library. My example was a Bank of England report. Next I wanted to extract the graphics from the report. This turned out to be non-trivial and requires a number of steps. The first step I'd recommend is to break a large document into single-page documents.

Python PDF Splitter

So for reference the test document is at BofE Inflation Report May 2018.

Luckily some code existed on Stack Overflow to break up the pages...

# with thanks to user26294 at Stack Overflow
# https://stackoverflow.com/questions/490195/split-a-multi-page-pdf-file-into-multiple-pdf-files-with-python#answer-490203

from PyPDF2 import PdfFileWriter, PdfFileReader

def DecryptPdf(pdfFileReader,password):
    if pdfFileReader.isEncrypted:
        try:
            pdfFileReader.decrypt(password)
            print ('File decrypted')
        except Exception as e:
            print ('File decryption failed:' + str(e))
    else:
        print ('File not encrypted')

def SuffixFilename(fileName, suffix):
    import os.path
    # splitext splits on the last dot only, so the folder part and any
    # other dots in the path are left untouched
    root, ext = os.path.splitext(fileName)
    return root + suffix + ext


def OutputPage(pdfFileNameSrc,pdfFileNamePage, pageNum):
    #print (pdfFileNameSrc)
    #print (pdfFileNamePage)
    pdfFileSrc = open(pdfFileNameSrc, "rb")
    pdfFileReaderSrc = PdfFileReader(pdfFileSrc)
    DecryptPdf(pdfFileReaderSrc,'')
    
    pageOutput = PdfFileWriter()
    pageOutput.addPage(pdfFileReaderSrc.getPage(pageNum))

    with open(pdfFileNamePage, "wb") as outputStream:
        pageOutput.write(outputStream)
        print('written page%s' % pageNum)
    pdfFileSrc.close()  # tidy up (the parentheses are needed to actually call close)


if __name__ == "__main__": 

    pdfFileNameInflation = "n:\\pdf_skunkworks\\inflation-report-may-2018.pdf"
    # Open the source once here, only to decrypt it and count the pages;
    # OutputPage() reopens the file itself for each page
    with open(pdfFileNameInflation, "rb") as pdfFileInflation:
        pdfFileReaderInflation = PdfFileReader(pdfFileInflation)
        DecryptPdf(pdfFileReaderInflation,'')
        pageCount = pdfFileReaderInflation.numPages

    for i in range(pageCount):
        pdfFileNamePage=SuffixFilename(pdfFileNameInflation,"-page%s" % i)
        OutputPage(pdfFileNameInflation,pdfFileNamePage,i)

I have modified the original code given in the Stack Overflow answer because it would not behave in a loop. I suspect this was some file handle tidy-up issue. My (heavy-handed) solution was to re-initialise the PdfFileReader class in OutputPage() for each iteration. I'm sure a better solution exists, and if you know better then feel free to comment below.
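
For anyone wanting to experiment, here is one possible shape for a less heavy-handed version. It is only a sketch and has not been run against the Bank of England file, so it may hit the same loop problem. It assumes the same PyPDF2 1.x names as the listing above (PdfFileReader, PdfFileWriter, numPages, getPage, addPage). The function names page_filename and split_pdf are made up for illustration. The idea is to keep a single reader open for the whole run and create a fresh writer for each page.

```python
import os.path


def page_filename(src_path, page_index):
    """Build the output name for one page, e.g. report.pdf -> report-page3.pdf."""
    root, ext = os.path.splitext(src_path)
    return "%s-page%d%s" % (root, page_index, ext)


def split_pdf(src_path, password=""):
    """Write each page of src_path to its own file and return the new file names."""
    # Assumption: PyPDF2 1.x is installed. Imported here so that the naming
    # helper above can still be used on its own without the library.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    written = []
    # One file handle and one reader for the whole run. The source stays
    # open until every page has been written out.
    with open(src_path, "rb") as src:
        reader = PdfFileReader(src)
        if reader.isEncrypted:
            reader.decrypt(password)
        for i in range(reader.numPages):
            writer = PdfFileWriter()  # fresh writer per page
            writer.addPage(reader.getPage(i))
            out_path = page_filename(src_path, i)
            with open(out_path, "wb") as out:
                writer.write(out)
            written.append(out_path)
    return written
```

If it works for your file, calling split_pdf() with the path of the report should produce the same -page0, -page1, ... files as the loop above, with the source opened only once.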

Now I have single page pdfs, I can move on ...
