Thursday 5 July 2018

VBA - Python - OCR - Optical Character Recognition

It was Python month on this blog last month, but I still have plenty of ideas for how to leverage the huge Python ecosystem and bring its functionality to the feet of VBA developers. In this post I play with Optical Character Recognition (OCR) and make it callable from VBA using a COM gateway class.

Tesseract

The OCR engine I use here is Tesseract, which has a long pedigree and happily has Python bindings (the pytesseract package). But it needs some care to install properly.

Tesseract Installation

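I found that  pip install pytesseract  on its own is not enough: it reports success, but it installs only the Python wrapper and not the Tesseract engine that the wrapper calls. To get the engine itself, the following steps were necessary...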

  1. Find a site with a Tesseract Windows binary installer. I found Tesseract at UB Mannheim
  2. Run the Tesseract Windows binary installer. I ran https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.02-20180621.exe
  3. Add the Tesseract directory (for me 'C:\Program Files (x86)\Tesseract-OCR') to the PATH environment variable
  4. Close down and restart all potential client processes (Visual Studio, Excel, Command Windows (cmd.exe), PowerShell, any process) so that they pick up the new PATH.
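
A quick way to confirm that step 3 took effect is to ask Python whether it can see the executable. This is only a minimal sketch using the standard library: the helper name find_tesseract is my own, and it assumes the executable is named tesseract (tesseract.exe on Windows).

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path to the Tesseract executable, or None if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment
    # variable, much as the command prompt does when you type a command.
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("tesseract not found on PATH; recheck step 3 and restart this process")
    else:
        print("tesseract found at", location)
```

If this prints "not found" from a process that was already open when PATH was edited, that is most likely step 4 at work: the process still holds the old PATH.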

We'll also use an imaging library, Pillow, to load the image file, so run  pip install pillow 

OCRBatch COM Gateway Class

I have given this pattern many times over the last month: here is a Python class with enough extra code to allow it to be registered as a COM class and thus be callable from VBA. The script must be run at least once with admin privileges to perform the registration (thereafter this is not required).

## pip install pillow      ## succeeded

## https://github.com/UB-Mannheim/tesseract/wiki
## https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.02-20180621.exe
## add C:\Program Files (x86)\Tesseract-OCR to PATH
## restart Visual Studio so it picks up changes to PATH



class OCRBatch(object):
    _reg_clsid_ = "{A7C1275F-7ABD-4AA2-90E5-462392D821DF}"
    _reg_progid_ = 'PythonInVBA.OCRBatch'
    _public_methods_ = ['RunBatch', 'RunOCR']

    def RunBatch(self, rootDir):
        import os
        for subdir, dirs, files in os.walk(rootDir):
            for file in files:
                # Build the full path to the current file
                filepath = os.path.join(subdir, file)

                # Only process images that follow the input device's naming convention
                if file.endswith(".jpeg") and file.startswith("File_"):
                    ocrFile = filepath + ".txt"
                    self.RunOCR(filepath, ocrFile)

    def RunOCR(self,imageFile, ocrFile):
        import io
        from PIL import Image
        import pytesseract

        img = Image.open(imageFile )
        text = pytesseract.image_to_string(img)

        # The with block closes the file automatically on exit
        with io.open(ocrFile, 'w', encoding="utf-8") as f:
            f.write(text)

if __name__ == '__main__':
    print ("Registering COM servers...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(OCRBatch)

    rootdir="C:\\Users\\Simon\\Downloads\\ocr"
    test = OCRBatch()
    test.RunBatch(rootdir)

The class has two methods, RunBatch and RunOCR. The latter operates on a single file whilst the former operates on a folder and its subfolders. The folder tree is scanned for filenames that start with "File_" and end with ".jpeg" because that is the naming convention of my input device; you should change it to suit yours.
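
If you do need to adapt the filter, one option is to pull the test out into its own function so the prefix and extensions become parameters. The following is just a sketch: is_input_image and its default values are my own illustration, and unlike the class above it matches extensions case-insensitively and accepts more than one extension.

```python
import os


def is_input_image(filename, prefix="File_", extensions=(".jpeg", ".jpg", ".png")):
    """Return True if filename follows the (assumed) naming convention of the input device."""
    name = os.path.basename(filename)
    # Lower-case the name before testing the extension so "File_1.JPEG" also matches.
    return name.startswith(prefix) and name.lower().endswith(tuple(extensions))
```

RunBatch could then call is_input_image(file) in place of the combined endswith/startswith test.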

When run on each file, Tesseract scans the image and generates some text, which is then saved to a file in the same directory with a very similar name (the image filename simply suffixed with ".txt"). For me, the output can be quite jumbled and typically needs cleaning up in a text editor, but it is better than typing it in from scratch.
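
Some of that cleaning up can be automated. Here is a small sketch that only tidies whitespace (repeated spaces and stacks of blank lines); it does nothing about misrecognised characters. The function name tidy_ocr_text is my own, and it is not part of the class above.

```python
import re


def tidy_ocr_text(text):
    """Collapse runs of spaces and surplus blank lines in OCR output."""
    lines = []
    for line in text.splitlines():
        # Squeeze repeated spaces/tabs into one space and trim both ends of the line.
        lines.append(re.sub(r"[ \t]+", " ", line).strip())
    tidied = "\n".join(lines)
    # Reduce three or more consecutive newlines to a single blank line.
    return re.sub(r"\n{3,}", "\n\n", tidied).strip()
```

It could be applied to the text returned by pytesseract.image_to_string before the text is written to the output file.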

Test Script

Just to quickly point out that I've added some test code to the tail of the script, in the __main__ block that also runs the COM registration. I find this a useful place to write some test code and will probably continue this convention.

OS Walk

Again, to quickly point out another (smaller) Python nicety: os.walk recursively walks a directory and its subfolders. In VBA one would need to write some code to do that. Here the Python ecosystem delivers another time saving.
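
To see os.walk on its own, away from the OCR code, here is a minimal sketch that lists every file under a folder whose name ends with a given suffix. The function name find_files and its suffix parameter are my own, for illustration only.

```python
import os


def find_files(root_dir, suffix=".jpeg"):
    """Yield the full path of every file under root_dir whose name ends with suffix."""
    # os.walk visits root_dir and then each subfolder in turn, yielding
    # (current folder, subfolder names, file names) for every folder it visits.
    for folder, _subfolders, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.endswith(suffix):
                yield os.path.join(folder, filename)
```

Calling list(find_files(some_folder)) on a folder of your choice gives all the matching paths in one go, however deeply nested they are.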

Client VBA Code

So thanks to COM registration the client VBA is trivial...

Option Explicit

Sub Test()

    Dim obj As Object
    Set obj = VBA.CreateObject("PythonInVBA.OCRBatch")
    
    obj.RunBatch "C:\Users\Simon\Downloads\ocr"

End Sub
