Wednesday 1 August 2018

Python - HTML Tidy - Script to restructure old help pages

So previously, I gave a Python script that decompiles a compiled help file (*.chm) into its constituent HTML help pages. Many of the resultant pages were very dated, poorly structured and had absolutely no hope of being well-formed enough to be parsed by an XML parser (HTML 3.2 is to blame). I did write some VBA code to rewrite the files some time ago but I have since found a much better technology called HTMLTidy. I came across HTMLTidy during my Python travels because there is a Python veneer library, but actually I found simply shelling out to a subprocess to be a much better approach.

The script below follows on from the help file (*.chm) decompiler and assumes that script has been run and has deposited files into a subdirectory of the Temp folder. This script will also form part of an overall workflow/pipeline.

Install HTMLTidy

Do please install HTMLTidy before attempting to run the code below, and set an environment variable named HTMLTidy (%HTMLTidy%) to point to the folder that contains the executable, tidy.exe. My installation is still in the Downloads folder (as determined by my browser), which shows how unfussy the install is (good).
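As a minimal sketch of how that environment variable can be turned into a path to the executable, assuming the variable is named HTMLTidy and the executable is tidy.exe (the function name and the sample folder are made up for illustration):

```python
import os


def find_tidy_executable(environ=None, var_name="HTMLTidy", exe_name="tidy.exe"):
    """Return the full path to the Tidy executable, or raise if the variable is unset."""
    environ = os.environ if environ is None else environ
    folder = environ.get(var_name)
    if not folder:
        raise RuntimeError(
            "%s environment variable is not defined; "
            "set it to the folder that contains %s" % (var_name, exe_name))
    return os.path.join(folder, exe_name)


# Example with a made-up folder; substitute your own install location.
print(find_tidy_executable({"HTMLTidy": os.path.join("tools", "tidy")}))
```

Passing the environment in as a dictionary is only there to make the function easy to try out; called with no arguments it reads os.environ.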

Returning Tuples To VBA

As usual, the code is callable from Excel VBA by virtue of its COM registration, but there is a slight problem: intrinsic Python types such as tuples do not return to VBA correctly. Attempting to return a tuple to VBA only returns the first element and not the whole tuple. I am happy to report that copying the values from a tuple into a Scripting.Dictionary (COM Type Library: Microsoft Scripting Runtime) and returning the dictionary to VBA solves the problem.

The Python pattern of returning a tuple is lovely and I would like to use it without restriction, but if I want to make a method callable from VBA then I need to ship a second method that converts the tuple to a Scripting.Dictionary. I imagined wanting to do this in so many scenarios that it felt appropriate to write a little helper class...

class Win32DictConverter(object):
    def ConvertTupleToWin32Dict(self, tup):
        import win32com.client
        win32dict = win32com.client.Dispatch("Scripting.Dictionary")

        n = 0
        for x in tup:
            win32dict.Add(n, x)
            n = n + 1
        return win32dict

And here is a usage example taken from this post's script.

    def TidiedFilenameWin32Dict(self, subdir, file):
        return Win32DictConverter().ConvertTupleToWin32Dict(
            self.TidiedFilename(subdir, file))

Other than this little trick the rest of the code here is straightforward.
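The same copying loop can be tried without COM at all. The sketch below is an illustration only: FakeDictionary is a hypothetical stand-in that mimics just the Add method of Scripting.Dictionary, and fill_dictionary is a made-up name. On Windows one would pass in win32com.client.Dispatch("Scripting.Dictionary") instead.

```python
class FakeDictionary(object):
    """Hypothetical stand-in for Scripting.Dictionary, so the sketch runs anywhere."""

    def __init__(self):
        self.items = {}

    def Add(self, key, item):
        # Like the real Add method, refuse to overwrite an existing key.
        if key in self.items:
            raise KeyError("key already exists: %r" % (key,))
        self.items[key] = item


def fill_dictionary(tup, dictionary):
    """Copy each element of tup into dictionary, keyed by its zero-based position."""
    for index, value in enumerate(tup):
        dictionary.Add(index, value)
    return dictionary


result = fill_dictionary(("page.tidied.html", "page.tidied.errors.txt"),
                         FakeDictionary())
print(result.items)
```

Using enumerate supplies the zero-based keys without a hand-maintained counter, and the keys match the order of the tuple, so the VBA side can read the values back by position.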

Code Walkthrough

So the code below essentially shells out to HTML Tidy in a manner similar to previous posts. [I have not used the Python veneer library for HTMLTidy as its defaults were counter-intuitive (to me at least).]
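When shelling out like this, the exit status of the child process is worth a look. The sketch below assumes the exit statuses described in the tidy documentation as I understand it (0 for success, 1 for warnings, 2 for errors); the function name is made up, and a Python one-liner stands in for tidy so the example runs without it installed.

```python
import subprocess
import sys

# Exit statuses assumed from the tidy command line documentation.
TIDY_STATUS = {0: "ok", 1: "warnings", 2: "errors"}


def run_and_classify(command):
    """Run command (a list of arguments) and describe its exit status."""
    completed = subprocess.run(command)
    return TIDY_STATUS.get(completed.returncode, "unexpected exit status")


# Stand-in command: exits with status 1, as tidy would when it emits warnings.
stand_in = [sys.executable, "-c", "import sys; sys.exit(1)"]
print(run_and_classify(stand_in))
```

In a batch run the returned label could be logged next to each file name, which gives a quick first cut of which pages need attention.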

The only design decision to highlight is that I have broken out the code for the renaming of files into a separate class, HTMLTidiedChmFileNamer, as I will probably need to call the logic therein from a later script. This script is, after all, meant to be part of a workflow/pipeline application.
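To make the naming rules concrete, here is a standalone sketch of them written with os.path.join and os.path.splitext. The function name tidied_names and the sample file names are made up for illustration; it is a sketch of the rules, not the class itself.

```python
import os.path


def tidied_names(subdir, filename):
    """Return (tidied_path, error_path), or ("", "") when the file should be skipped."""
    lowered = filename.lower()
    if ".tidied." in lowered:
        return ("", "")          # already an output of a previous run
    if lowered.endswith((".hhc", ".hhk")):
        stem = filename          # keep the extension so contents and index stay distinct
    elif lowered.endswith((".htm", ".html")):
        stem = os.path.splitext(filename)[0]
    else:
        return ("", "")          # not a help page, e.g. an image
    return (os.path.join(subdir, stem + ".tidied.html"),
            os.path.join(subdir, stem + ".tidied.errors.txt"))


for name in ("vblr6.hhc", "intro.htm", "intro.tidied.html", "logo.gif"):
    print(name, "->", tidied_names("pages", name))
```

Returning a pair of empty strings for skipped files mirrors the convention the batch method relies on when deciding whether to call HTMLTidy.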

You can see the code passing arguments to HTMLTidy. These are chosen from the quick reference of available options; see http://tidy.sourceforge.net/docs/quickref.html. There are plenty to choose from.
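The argument list used in this post can be built by a small helper, which makes it easy to inspect before anything is executed. This is a sketch only: build_tidy_command is a made-up name, the file names are examples, and the comments paraphrase the quick reference as I read it.

```python
def build_tidy_command(tidy_exe, source, tidied_file, error_file):
    """Return the argument list for one tidy run, in the order the script uses."""
    return [
        tidy_exe,
        "-output", tidied_file,      # where the cleaned-up HTML is written
        "--doctype", "html5",        # doctype declaration to emit
        "--clean", "yes",            # replace legacy presentational tags with style rules
        "--error-file", error_file,  # where warnings and errors are written
        source,                      # the file to tidy comes last
    ]


command = build_tidy_command("tidy.exe", "intro.htm",
                             "intro.tidied.html", "intro.tidied.errors.txt")
print(command)
```

Passing the arguments as a list, rather than one long string, lets subprocess handle any quoting needed for paths that contain spaces.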

The key class only ships with one method, TidyBatch(), which takes a directory; this directory is recursively walked (a nice feature in Python) and each file that meets the naming rules will be tidied.
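The recursive walk is easy to try on a throwaway directory tree. In the sketch below, find_help_pages and the file names are made up for illustration; it only collects the files that would be tidied and does not call HTMLTidy.

```python
import os
import tempfile


def find_help_pages(root_dir, extensions=(".htm", ".html", ".hhc", ".hhk")):
    """Walk root_dir recursively and return matching file paths, sorted."""
    matches = []
    for subdir, _dirs, files in os.walk(root_dir):
        for name in files:
            lowered = name.lower()
            # Skip output from a previous run and anything that is not a help page.
            if ".tidied." not in lowered and lowered.endswith(extensions):
                matches.append(os.path.join(subdir, name))
    return sorted(matches)


# Build a small temporary tree to walk, then list what was found.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "html"))
    for relative in ("toc.hhc",
                     os.path.join("html", "intro.htm"),
                     os.path.join("html", "intro.tidied.html"),
                     os.path.join("html", "logo.gif")):
        open(os.path.join(root, relative), "w").close()
    found = [os.path.relpath(path, root) for path in find_help_pages(root)]
    print(found)
```

os.walk yields one (directory, subdirectories, files) triple per directory, so nested folders need no explicit recursion.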

This script is part of a larger pipeline/workflow application which will decompile compiled help files (*.chm), tidy them and do further triage so that they meet the HTML5 standard and become convertible to ebooks version 3 (which is strict about HTML5). So we need some logic to handle compiled help file artefacts such as content (*.hhc) files and index (*.hhk) files. I will probably need to call that logic later so I put it in a class of its own, HTMLTidiedChmFileNamer.

import subprocess
import os
import os.path


class HTMLTidiedChmFileNamer(object):
    _reg_clsid_ = "{8807D2B9-C83F-4AEB-A71D-15DBE8EFED9A}"
    _reg_progid_ = 'PythonInVBA.HTMLTidiedChmFileNamer'
    _public_methods_ = ['TidiedFilenameWin32Dict']

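    def TidiedFilename(self, subdir, file):
        # Returns (tidiedFile, errorfile); both are "" if the file is skipped.
        file2 = file.lower()
        tidiedFile = ""
        errorfile = ""

        # Skip output from a previous run (case-insensitive check).
        if ".tidied." not in file2:
            if file2.endswith((".hhc", ".hhk")):
                # Keep the extension so contents and index files stay distinct.
                tidiedFile = os.path.join(subdir, file + ".tidied.html")
                errorfile = os.path.join(subdir, file + ".tidied.errors.txt")

            if file2.endswith((".htm", ".html")):
                # splitext drops only the final extension, so names that
                # contain extra dots are not truncated.
                stem = os.path.splitext(file)[0]
                tidiedFile = os.path.join(subdir, stem + ".tidied.html")
                errorfile = os.path.join(subdir, stem + ".tidied.errors.txt")
        return (tidiedFile, errorfile)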

    def TidiedFilenameWin32Dict(self, subdir, file):
        return Win32DictConverter().ConvertTupleToWin32Dict(
            self.TidiedFilename(subdir, file))


class Win32DictConverter(object):
    def ConvertTupleToWin32Dict(self, tup):
        import win32com.client
        win32dict = win32com.client.Dispatch("Scripting.Dictionary")

        n = 0
        for x in tup:
            win32dict.Add(n, x)
            n = n + 1
        return win32dict


class HTMLTidyChmFiles(object):
    _reg_clsid_ = "{20C361FF-1826-4673-A30D-FABA87FF7910}"
    _reg_progid_ = 'PythonInVBA.HTMLTidyChmFiles'
    _public_methods_ = ['TidyBatch']

    def TidyBatch(self, rootDir):

        if "HTMLTidy" not in os.environ:
            raise Exception(
                "HTMLTidy environment variable not defined, "
                "please define as HTMLTidy's bin folder")

        sHTMLTidyExe = os.path.join(os.environ["HTMLTidy"], "tidy.exe")
        FileNamer = HTMLTidiedChmFileNamer()

        for subdir, dirs, files in os.walk(rootDir):
            for file in files:
                tidiedFile, errorfile = FileNamer.TidiedFilename(subdir, file)
                fullPath = os.path.join(subdir, file)
                if tidiedFile:
                    # http://tidy.sourceforge.net/docs/quickref.html
                    subprocess.run([sHTMLTidyExe, '-output', tidiedFile,
                                    '--doctype', 'html5', '--clean', 'yes',
                                    '--error-file', errorfile, fullPath])


if __name__ == '__main__':
    print("Registering COM servers...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(HTMLTidyChmFiles)
    win32com.server.register.UseCommandLine(HTMLTidiedChmFileNamer)

    rootdir = os.path.join(os.environ["tmp"], 'HelpFileDecompiler', "vblr6")
    test = HTMLTidyChmFiles()
    test.TidyBatch(rootdir)

The above script needs running at least once with administrator rights in order to register the COM classes. Once that is done you can call the classes from VBA with the code below. The if __name__ == '__main__' block at the bottom of the script also runs a test against the decompiled vblr6 files.

VBA Client Code

So as this is an Excel blog I should show you some client VBA code. This helped with testing. Also, I want to demonstrate that it is fine to write code in Python (and other languages) and make it callable from VBA; this helps to expand the horizons of the VBA developer.

Sub TestHTMLTidyChmFiles()
    
    '* assumes vblr6.chm has been decompiled previously (TestHelpFileDecompiler)
    
    Dim objHTMLTidyChmFiles As Object
    Set objHTMLTidyChmFiles = VBA.CreateObject("PythonInVBA.HTMLTidyChmFiles")
    
    Call objHTMLTidyChmFiles.TidyBatch(Environ$("tmp") & "\HelpFileDecompiler\VBLR6\")
    
End Sub


Sub TestHTMLTidiedChmFileNamer()
    
    Dim objFileNamer As Object
    Set objFileNamer = VBA.CreateObject("PythonInVBA.HTMLTidiedChmFileNamer")
    
    Dim dictResults As Scripting.Dictionary
    Set dictResults = objFileNamer.TidiedFilenameWin32Dict(Environ$("tmp") & "\HelpFileDecompiler\VBLR6\", "vblr6.hhc")
    
    Dim vRet As Variant
    vRet = dictResults.Items
    
    Debug.Print vRet(0)
    Debug.Print vRet(1)
    'Stop

End Sub

Final Thoughts

It is with joy that I found HTMLTidy restructures the malformed HTML 3.2 files buried in some *.chm files. There were some howlers, such as duplicate opening <BODY> tags, and I am glad that the resulting files can now be further triaged by loading them into an XML parser. This is necessary because there is more work to be done to get these files up to HTML5 standard.
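A quick way to see whether a page will load into an XML parser is simply to try it. The sketch below uses Python's standard xml.etree.ElementTree; the function name and the two sample pages are made up for illustration, and nothing here has been run against real HTMLTidy output.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def is_well_formed(path):
    """Return True if the file at path parses as XML, otherwise False."""
    try:
        ET.parse(path)
        return True
    except ET.ParseError:
        return False


# Two made-up pages: one with balanced tags, one with a duplicate opening <body>.
samples = {
    "good.html": "<html><head><title>t</title></head><body><p>ok</p></body></html>",
    "bad.html": "<html><body><body><p>broken</body></html>",
}
with tempfile.TemporaryDirectory() as folder:
    results = {}
    for name, text in samples.items():
        path = os.path.join(folder, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        results[name] = is_well_formed(path)
print(results)
```

Bear in mind that a strict XML parser also rejects things that are legal in HTML, such as an unclosed <br> tag or a named entity like &nbsp;, so a False result does not necessarily mean the tidying failed.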

HTMLTidy does the restructuring for me, and no doubt some further command line options would help if I looked long enough in the documentation, but the next task is to validate the files further to give a schedule of triage options. HTMLTidy recommends an HTML validator in its output and I will look at that next.

I thoroughly recommend HTMLTidy over any other known VBA compatible solution for malformed HTML files (yes, even HTML Agility Pack!).
