Saturday 16 June 2018

VBA - Python - Using Python to read text from a PDF

So June 2018 is Python month, in which I explore what Python libraries can bring to an Excel VBA developer. Here I show a library that opens a PDF and grabs its text.

PDF Text Extractor

So the use case is this: a central bank publishes a market-sensitive PDF and market participants want to scan its contents as quickly as possible. To do that we first need to get the text out of the PDF file. For the code below I downloaded a Bank of England Inflation Report as a test PDF.

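Use pip install PyPDF2 to install the required PyPDF2 library. The COM registration also relies on win32com, which comes from the pywin32 package. Note that the code uses the PyPDF2 names current in 2018 (PdfFileReader, numPages, getPage, extractText); these were removed in PyPDF2 3.0, so the script as written needs a version below 3.0. Run the Python script once to register the class in the registry; it is then invokable from Excel VBA, and some sample client code is given below.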

# importing required modules
import PyPDF2

class PythonPDFComClass(object):
  
    _reg_clsid_ = "{72BF0D44-56FC-4ADB-B565-1AF16A502F0F}"
    _reg_progid_= 'PythonInVBA.PythonPDFComClass'
    _public_methods_ = ['Initialize','numPages','extractPageText','tidyUp']

    def Initialize(self,pdfFileName):
        
        self.pdfFileName=pdfFileName
        self.pdfFileObj = open(pdfFileName, 'rb')
        self.pdfReader = PyPDF2.PdfFileReader(self.pdfFileObj)
        return str(self.pdfReader)

    def numPages(self):
        return self.pdfReader.numPages

    def extractPageText(self,pageNum):
        # creating a page object
        pageObj = self.pdfReader.getPage(pageNum)
        return pageObj.extractText()

    def tidyUp(self):
        # closing the pdf file object
        self.pdfFileObj.close()

if __name__=='__main__':
    print("Registering COM server...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(PythonPDFComClass)

And now some sample VBA code...

Option Explicit

Sub TestPythonPDFComClass()

    Dim pdfInflationReport As Object
    Set pdfInflationReport = CreateObject("PythonInVBA.PythonPDFComClass")
    
    Call pdfInflationReport.Initialize("N:\inflation-report-may-2018.pdf")
    
    Debug.Print pdfInflationReport.numPages
    Debug.Print pdfInflationReport.extractPageText(5) '* Page 6, 0-based
    
    Stop
    Dim lPageLoop As Long
    For lPageLoop = 0 To pdfInflationReport.numPages - 1
        Debug.Print pdfInflationReport.extractPageText(lPageLoop) '* 0-based
    Next
    
    pdfInflationReport.tidyUp

End Sub

The output before the Stop statement reads:


 50 
 In˜ation Report May 2018   Monetary Policy Summary   iiperiod has reduced the degree to which it is appropriate for the MPC to accommodate an 
extended period of in˜ation above the target. The Committee™s best collective judgement therefore remains that, were the economy to develop broadly 
in line with the May In˜ation Report projections, an ongoing tightening of monetary policy over the forecast period would be appropriate to return 
in˜ation sustainably to its target at a conventional horizon. As previously, however, that judgement relies on the economic data evolving broadly 
in line with the Committee™s projections. For the majority of members, an increase in Bank Rate was not required at this meeting. All members agree 
that any future increases in Bank Rate are likely to be at a gradual pace and to a limited extent.

So it appears the Python library has a few glitches reading the text: "inflation" is spelt not with "fl" but with a "dingbat" character (˜), and the apostrophe in "Committee's" comes out as ™. These and other typos need to be tidied up. A simple VBA.Replace and other VBA string processing could do this easily. In general, if you look at the screenshot below you'll see that the extract is accurate. Other pages, with charts and diagrams, come out more jumbled.
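
The same tidy-up could equally be done on the Python side before the text ever reaches VBA. What follows is only a minimal sketch: the function name tidy_extracted_text and the REPLACEMENTS table are my own hypothetical names, and the two substitutions are inferred from the sample output above, so other PDFs may well need different entries.

```python
# Substitutions inferred from the sample output above. This is an
# assumption based on one page of one PDF, not a complete list.
REPLACEMENTS = {
    "\u02dc": "fl",  # small tilde, seen in "In˜ation"
    "\u2122": "'",   # trademark sign, seen in "Committee™s"
}


def tidy_extracted_text(text):
    """Return text with the known stray characters replaced."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text


if __name__ == "__main__":
    sample = "In\u02dcation Report: the Committee\u2122s best collective judgement"
    print(tidy_extracted_text(sample))
```

If this works for your documents, one option would be to call it inside extractPageText just before returning, so that the VBA client receives cleaned text and needs no string processing of its own.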
