Saturday, 16 June 2018

VBA - Python - Using Python to read text from a PDF

So June 2018 is Python month where I explore the what the Python libraries can bring to an Excel VBA Developer. Here I show a library which opens a pdf and grabs text.

PDF Text Extractor

So the use case is a newly released market sensitive pdf is published by a central bank and market participants want to scan the contents as quickly as possible. But we need to get the text contents out of the pdf file. In the code below I have downloaded a Bank of England Inflation report as a test pdf.

use pip install PyPDF2 to ensure installation of the required PyPDF2 library. Run this Python script once to register it in the registry and then it is invokable from Excel VBA and some sample client code is given below.

# importing required modules
import PyPDF2

class PythonPDFComClass(object):
  
    _reg_clsid_ = "{72BF0D44-56FC-4ADB-B565-1AF16A502F0F}"
    _reg_progid_= 'PythonInVBA.PythonPDFComClass'
    _public_methods_ = ['Initialize','numPages','extractPageText','tidyUp']

    def Initialize(self,pdfFileName):
        
        self.pdfFileName=pdfFileName
        self.pdfFileObj = open(pdfFileName, 'rb')
        self.pdfReader = PyPDF2.PdfFileReader(self.pdfFileObj)
        return str(self.pdfReader)

    def numPages(self):
        return self.pdfReader.numPages

    def extractPageText(self,pageNum):
        # creating a page object
        pageObj = self.pdfReader.getPage(pageNum)
        return pageObj.extractText()

    def tidyUp(self):
        # closing the pdf file object
        self.pdfFileObj.close()

if __name__=='__main__':
    print ("Registering COM server...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(PythonPDFComClass)

And now some sample VBA code...

Option Explicit

Sub TestPythonPDFComClass()

    Dim pdfInflationReport As Object
    Set pdfInflationReport = CreateObject("PythonInVBA.PythonPDFComClass")
    
    Call pdfInflationReport.Initialize("N:\inflation-report-may-2018.pdf")
    
    Debug.Print pdfInflationReport.numPages
    Debug.Print pdfInflationReport.extractPageText(5) '* Page 6, 0-based
    
    Stop
    Dim lPageLoop As Long
    For lPageLoop = 0 To pdfInflationReport.numPages - 1
        Debug.Print pdfInflationReport.extractPageText(lPageLoop) '* 0-based
    Next
    
    pdfInflationReport.tidyUp

End Sub

and the output before the Stop statement reads


 50 
 In˜ation Report May 2018   Monetary Policy Summary   iiperiod has reduced the degree to which it is appropriate for the MPC to accommodate an 
extended period of in˜ation above the target. The Committee™s best collective judgement therefore remains that, were the economy to develop broadly 
in line with the May In˜ation Report projections, an ongoing tightening of monetary policy over the forecast period would be appropriate to return 
in˜ation sustainably to its target at a conventional horizon. As previously, however, that judgement relies on the economic data evolving broadly 
in line with the Committee™s projections. For the majority of members, an increase in Bank Rate was not required at this meeting. All members agree 
that any future increases in Bank Rate are likely to be at a gradual pace and to a limited extent.

So it appears the Python library has a few glitches reading the text, inflation is spelt not with fl but with a "dingbat" character. Other typos needs to be tidied up. A simple VBA.Replace and other VBA string processing could tidy this up easily. In general, if you look at the screenshot below you'll see that it is an accurate extract. Other pages with charts and diagrams are more jumbled.

2 comments:

  1. I found this through a Google search and it appears to be just what I need. On a Windows 10 Enterprise machine I installed (64 bit) Python 3.13, installed git, installed pypdf, installed pywin32. Then I used the python script above with the only change being I reference pypdf rather than pyPDF2. Everything worked up until the last couple of lines. These are the errors I get:

    Traceback (most recent call last):
    File "C:\...\Python\Python313\Lib\site-packages\win32com\server\register.py", line 448, in RegisterClasses
    win32api.FindFiles(sys.argv[0])[0][8]
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
    pywintypes.error: (3, 'FindFirstFile', 'The system cannot find the path specified.')

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "", line 4, in
    win32com.server.register.UseCommandLine(PythonPDFComClass)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
    File "C:\...\Python\Python313\Lib\site-packages\win32com\server\register.py", line 648, in UseCommandLine
    RegisterClasses(*classes, **flags)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
    File "C:\...Python\Python313\Lib\site-packages\win32com\server\register.py", line 452, in RegisterClasses
    raise TypeError(
    "Can't locate the script hosting the COM object - please set _reg_class_spec_ in your object"
    )
    TypeError: Can't locate the script hosting the COM object - please set _reg_class_spec_ in your object

    If this post is still monitored, I would appreciate any suggestions as to what I could do to solve the problem. I know nothing about Python, but am trying to get a solution (pypdf) to parse PDF files in Excel.

    ReplyDelete
  2. I figured it out. I was enter it directly into the python editor, but I needed to save it as a .py file first and run it from the command prompt as "py .py"

    ReplyDelete