So June 2018 is Python month where I explore the what the Python libraries can bring to an Excel VBA Developer. Here I show a library which opens a pdf and grabs text.
PDF Text Extractor
So the use case is a newly released market sensitive pdf is published by a central bank and market participants want to scan the contents as quickly as possible. But we need to get the text contents out of the pdf file. In the code below I have downloaded a Bank of England Inflation report as a test pdf.
use pip install PyPDF2 to ensure installation of the required PyPDF2 library. Run this Python script once to register it in the registry and then it is invokable from Excel VBA and some sample client code is given below.
# importing required modules
import PyPDF2
class PythonPDFComClass(object):
_reg_clsid_ = "{72BF0D44-56FC-4ADB-B565-1AF16A502F0F}"
_reg_progid_= 'PythonInVBA.PythonPDFComClass'
_public_methods_ = ['Initialize','numPages','extractPageText','tidyUp']
def Initialize(self,pdfFileName):
self.pdfFileName=pdfFileName
self.pdfFileObj = open(pdfFileName, 'rb')
self.pdfReader = PyPDF2.PdfFileReader(self.pdfFileObj)
return str(self.pdfReader)
def numPages(self):
return self.pdfReader.numPages
def extractPageText(self,pageNum):
# creating a page object
pageObj = self.pdfReader.getPage(pageNum)
return pageObj.extractText()
def tidyUp(self):
# closing the pdf file object
self.pdfFileObj.close()
if __name__=='__main__':
print ("Registering COM server...")
import win32com.server.register
win32com.server.register.UseCommandLine(PythonPDFComClass)
And now some sample VBA code...
Option Explicit
Sub TestPythonPDFComClass()
Dim pdfInflationReport As Object
Set pdfInflationReport = CreateObject("PythonInVBA.PythonPDFComClass")
Call pdfInflationReport.Initialize("N:\inflation-report-may-2018.pdf")
Debug.Print pdfInflationReport.numPages
Debug.Print pdfInflationReport.extractPageText(5) '* Page 6, 0-based
Stop
Dim lPageLoop As Long
For lPageLoop = 0 To pdfInflationReport.numPages - 1
Debug.Print pdfInflationReport.extractPageText(lPageLoop) '* 0-based
Next
pdfInflationReport.tidyUp
End Sub
and the output before the Stop statement reads
50
In˜ation Report May 2018 Monetary Policy Summary iiperiod has reduced the degree to which it is appropriate for the MPC to accommodate an
extended period of in˜ation above the target. The Committee™s best collective judgement therefore remains that, were the economy to develop broadly
in line with the May In˜ation Report projections, an ongoing tightening of monetary policy over the forecast period would be appropriate to return
in˜ation sustainably to its target at a conventional horizon. As previously, however, that judgement relies on the economic data evolving broadly
in line with the Committee™s projections. For the majority of members, an increase in Bank Rate was not required at this meeting. All members agree
that any future increases in Bank Rate are likely to be at a gradual pace and to a limited extent.
So it appears the Python library has a few glitches reading the text, inflation is spelt not with fl but with a "dingbat" character. Other typos needs to be tidied up. A simple VBA.Replace and other VBA string processing could tidy this up easily. In general, if you look at the screenshot below you'll see that it is an accurate extract. Other pages with charts and diagrams are more jumbled.
I found this through a Google search and it appears to be just what I need. On a Windows 10 Enterprise machine I installed (64 bit) Python 3.13, installed git, installed pypdf, installed pywin32. Then I used the python script above with the only change being I reference pypdf rather than pyPDF2. Everything worked up until the last couple of lines. These are the errors I get:
ReplyDeleteTraceback (most recent call last):
File "C:\...\Python\Python313\Lib\site-packages\win32com\server\register.py", line 448, in RegisterClasses
win32api.FindFiles(sys.argv[0])[0][8]
~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
pywintypes.error: (3, 'FindFirstFile', 'The system cannot find the path specified.')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "", line 4, in
win32com.server.register.UseCommandLine(PythonPDFComClass)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
File "C:\...\Python\Python313\Lib\site-packages\win32com\server\register.py", line 648, in UseCommandLine
RegisterClasses(*classes, **flags)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
File "C:\...Python\Python313\Lib\site-packages\win32com\server\register.py", line 452, in RegisterClasses
raise TypeError(
"Can't locate the script hosting the COM object - please set _reg_class_spec_ in your object"
)
TypeError: Can't locate the script hosting the COM object - please set _reg_class_spec_ in your object
If this post is still monitored, I would appreciate any suggestions as to what I could do to solve the problem. I know nothing about Python, but am trying to get a solution (pypdf) to parse PDF files in Excel.
I figured it out. I was enter it directly into the python editor, but I needed to save it as a .py file first and run it from the command prompt as "py .py"
ReplyDelete