June 2018 is Python month, where I explore what Python libraries can bring to an Excel VBA developer. Here I show a library which opens a PDF and grabs the text.
PDF Text Extractor
The use case: a central bank publishes a market-sensitive PDF and market participants want to scan the contents as quickly as possible. First, though, we need to get the text out of the PDF file. In the code below I use a downloaded Bank of England Inflation Report as a test PDF.
Use pip install PyPDF2 to install the required PyPDF2 library (the win32com modules used for the COM registration come from the pywin32 package). Run this Python script once to register the class in the registry; it is then invokable from Excel VBA, and some sample client code is given below.
# importing required modules
import PyPDF2


class PythonPDFComClass(object):
    # COM registration details: class id, ProgID used by CreateObject in VBA,
    # and the methods exposed to COM clients
    _reg_clsid_ = "{72BF0D44-56FC-4ADB-B565-1AF16A502F0F}"
    _reg_progid_ = 'PythonInVBA.PythonPDFComClass'
    _public_methods_ = ['Initialize', 'numPages', 'extractPageText', 'tidyUp']

    def Initialize(self, pdfFileName):
        # open the pdf in binary mode; the file object is kept for tidyUp
        self.pdfFileName = pdfFileName
        self.pdfFileObj = open(pdfFileName, 'rb')
        self.pdfReader = PyPDF2.PdfFileReader(self.pdfFileObj)
        return str(self.pdfReader)

    def numPages(self):
        return self.pdfReader.numPages

    def extractPageText(self, pageNum):
        # creating a page object (pageNum is 0-based)
        pageObj = self.pdfReader.getPage(pageNum)
        return pageObj.extractText()

    def tidyUp(self):
        # closing the pdf file object
        self.pdfFileObj.close()


if __name__ == '__main__':
    print("Registering COM server...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(PythonPDFComClass)
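The class above uses the PyPDF2 names that were current in 2018. PyPDF2 3.0 and later removed PdfFileReader, numPages, getPage and extractText in favour of PdfReader, len(reader.pages), reader.pages[n] and extract_text(). The following is a minimal sketch of the same extraction with the newer names, assuming PyPDF2 3.x is installed. The helper names extract_page_texts and read_pdf_text are hypothetical and not part of the original class.

```python
def extract_page_texts(reader):
    """Return the text of every page, in 0-based page order.

    `reader` is assumed to be a PyPDF2 3.x PdfReader (or any object with a
    `pages` sequence whose items offer `extract_text()`).
    """
    return [page.extract_text() for page in reader.pages]


def read_pdf_text(pdf_file_name):
    """Open a PDF file and return a list with one string per page."""
    # Imported here so extract_page_texts itself has no PyPDF2 dependency.
    from PyPDF2 import PdfReader

    # The pages are read inside the with-block, while the file is still open.
    with open(pdf_file_name, "rb") as pdf_file:
        return extract_page_texts(PdfReader(pdf_file))
```

With these names the page count is simply len(reader.pages), so a COM wrapper written against PyPDF2 3.x could keep the same four public methods and only change their bodies.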
And now some sample VBA code...
Option Explicit

Sub TestPythonPDFComClass()
    Dim pdfInflationReport As Object
    Set pdfInflationReport = CreateObject("PythonInVBA.PythonPDFComClass")
    Call pdfInflationReport.Initialize("N:\inflation-report-may-2018.pdf")
    Debug.Print pdfInflationReport.numPages
    Debug.Print pdfInflationReport.extractPageText(5) '* Page 6, 0-based
    Stop

    Dim lPageLoop As Long
    For lPageLoop = 0 To pdfInflationReport.numPages - 1
        Debug.Print pdfInflationReport.extractPageText(lPageLoop) '* 0-based
    Next
    pdfInflationReport.tidyUp
End Sub
and the output before the Stop statement reads
50
In˜ation Report May 2018 Monetary Policy Summary iiperiod has reduced the degree to which it is appropriate for the MPC to accommodate an
extended period of in˜ation above the target. The Committee™s best collective judgement therefore remains that, were the economy to develop broadly
in line with the May In˜ation Report projections, an ongoing tightening of monetary policy over the forecast period would be appropriate to return
in˜ation sustainably to its target at a conventional horizon. As previously, however, that judgement relies on the economic data evolving broadly
in line with the Committee™s projections. For the majority of members, an increase in Bank Rate was not required at this meeting. All members agree
that any future increases in Bank Rate are likely to be at a gradual pace and to a limited extent.
So it appears the Python library has a few glitches reading the text: "inflation" is spelt not with "fl" but with a "dingbat" character, and the apostrophe in "Committee's" comes out as "™". Other typos need to be tidied up as well. A simple VBA.Replace and other VBA string processing could tidy this up easily. In general, if you look at the screenshot below you'll see that it is an accurate extract. Other pages with charts and diagrams are more jumbled.
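The clean-up could equally be done on the Python side before the text reaches VBA. The following is a minimal sketch and not part of the original class. The helper name tidy_extracted_text and the replacement table are hypothetical, and the two mappings are assumptions read off the sample output above (a small tilde where the "fl" ligature should be, a trademark sign where the apostrophe should be). Other PDFs or fonts may well need different entries.

```python
# Hypothetical replacement table, inferred from the sample output only.
GLYPH_FIXES = {
    "\u02dc": "fl",  # small tilde seen where the "fl" ligature should be
    "\u2122": "'",   # trademark sign seen where an apostrophe should be
}


def tidy_extracted_text(text, fixes=GLYPH_FIXES):
    """Return text with each mis-decoded character replaced by its fix."""
    for bad, good in fixes.items():
        text = text.replace(bad, good)
    return text


if __name__ == "__main__":
    sample = "In\u02dcation above the target. The Committee\u2122s best collective judgement"
    print(tidy_extracted_text(sample))
```

If this approach were adopted, extractPageText could pass the result of pageObj.extractText() through tidy_extracted_text before returning it, so the VBA client receives text that is already cleaned.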