So previously, I gave a Python script that decompiles a compiled help file (*.chm) into its constituent HTML help pages. Many of the resultant pages were very dated, poorly structured and had absolutely no hope of being well-formed enough to be parsed by an XML parser (HTML 3.2 is to blame). I did write some VBA code to rewrite the files some time ago but I have found a much better technology called HTMLTidy. I came across HTMLTidy during my Python travels because there is a Python veneer library, but I actually found simply shelling out to a subprocess to be a much better approach.
The script below follows on from the help file (*.chm) decompiler and it assumes that script has been run and has deposited files into a subdirectory of the Temp folder. This script will also form part of an overall workflow/pipeline.
Install HTMLTidy
Do please install HTMLTidy before attempting to run the code below, and set an environment variable %HTMLTidy% to point to the folder that contains the executable (tidy.exe). My installation is still in the Downloads folder (as determined by my browser), which shows how unfussy the install is (good).
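Before running anything, it can be worth checking that the environment variable resolves to a real executable. The following is only a minimal sketch: the function names find_tidy_exe and tidy_is_installed are my own (hypothetical) and the executable name tidy.exe is assumed from the script further down.

```python
import os


def find_tidy_exe(environ=None, exe_name="tidy.exe"):
    """Build the path to the Tidy executable from the HTMLTidy variable.

    environ is injectable so the logic can be exercised without touching
    the real environment; it defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    folder = env.get("HTMLTidy")
    if not folder:
        raise RuntimeError("HTMLTidy environment variable is not defined")
    return os.path.join(folder, exe_name)


def tidy_is_installed(environ=None):
    """Return True only if the variable is set and the file exists."""
    try:
        return os.path.isfile(find_tidy_exe(environ))
    except RuntimeError:
        return False
```

Calling tidy_is_installed() with no arguments checks the real environment, which is a cheap guard to run before kicking off a long batch.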
Returning Tuples To VBA
As usual, the code is callable from Excel VBA by virtue of its COM registration, but there is a slight problem: intrinsic Python types such as tuples do not return to VBA correctly. In my testing, attempting to return a tuple to VBA only returned the first element and not the whole tuple. I am happy to report that copying the values from a tuple into a Scripting.Dictionary (COM Type Library: Microsoft Scripting Runtime) and returning the dictionary to VBA solves the problem.
The Python pattern of returning a tuple is lovely and I would like to use it without restriction, but if I want to make a method callable from VBA then I need to ship a second method that converts the tuple to a Scripting.Dictionary. I imagined wanting to do this in so many scenarios that it felt appropriate to write a little helper class...
class Win32DictConverter(object):
    def ConvertTupleToWin32Dict(self, tup):
        import win32com.client
        win32dict = win32com.client.Dispatch("Scripting.Dictionary")
        n = 0
        for x in tup:
            win32dict.Add(n, x)
            n = n + 1
        return win32dict
And here is a usage example taken from this post's script.
    def TidiedFilenameWin32Dict(self, subdir, file):
        return Win32DictConverter().ConvertTupleToWin32Dict(
            self.TidiedFilename(subdir, file))
Other than this little trick the rest of the code here is straightforward.
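The copy loop in the helper can be tried out away from Windows by swapping in a stand-in for the COM object. This is a sketch only: FakeScriptingDictionary and copy_tuple_into are hypothetical names of mine, and the fake mimics just the Add, Keys and Items members of the real Scripting.Dictionary.

```python
class FakeScriptingDictionary:
    """Illustrative stand-in for Scripting.Dictionary (not the COM object)."""

    def __init__(self):
        self._items = {}

    def Add(self, key, item):
        # Like the real dictionary, refuse to add a key twice.
        if key in self._items:
            raise KeyError("key already exists: %r" % (key,))
        self._items[key] = item

    def Keys(self):
        return tuple(self._items.keys())

    def Items(self):
        return tuple(self._items.values())


def copy_tuple_into(target, tup):
    """Add each tuple element to target, keyed by its zero-based index."""
    for index, value in enumerate(tup):
        target.Add(index, value)
    return target
```

On Windows the target would instead be the object returned by win32com.client.Dispatch("Scripting.Dictionary"), as in the helper class above.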
Code Walkthrough
So the code below essentially shells out to HTMLTidy in a manner similar to previous posts. [I have not used the Python veneer library for HTMLTidy as it gave counter-intuitive (to me at least) defaults.]
The only design decision to highlight is that I have broken out the code for the renaming of files into a separate class, HTMLTidiedChmFileNamer, as I will probably need to call the logic therein from a later script. This is because this is meant to be part of a workflow/pipeline application.
You can see the code passing arguments to HTMLTidy. These are taken from the quick reference of available options at http://tidy.sourceforge.net/docs/quickref.html and there are plenty to choose from.
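When shelling out, it can help to look at the exit code rather than ignore it. The sketch below assumes Tidy's documented convention of 0 for success, 1 for warnings and 2 for errors; run_and_classify and TIDY_EXIT_MEANINGS are names I have made up for illustration.

```python
import subprocess

# Assumed mapping, based on Tidy's documented exit status values.
TIDY_EXIT_MEANINGS = {0: "ok", 1: "warnings", 2: "errors"}


def run_and_classify(command):
    """Run a command list and return (exit code, rough meaning).

    Output is captured so that it does not clutter the console when
    many files are processed in a batch.
    """
    completed = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    meaning = TIDY_EXIT_MEANINGS.get(completed.returncode, "unknown")
    return completed.returncode, meaning
```

A batch could then collect the files that came back as "errors" for a closer look, instead of opening every error file by hand.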
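If more options get added over time, building the argument list in one place keeps things readable. This is a sketch under assumptions: build_tidy_command is a hypothetical helper of mine, and it only uses the options that already appear in this post's script (-output, --doctype, --clean, --error-file).

```python
def build_tidy_command(tidy_exe, source, tidied_file, error_file, options=None):
    """Return the argument list for one Tidy run.

    options maps Tidy configuration names to values and is merged over
    the defaults used in this post.
    """
    merged = {"doctype": "html5", "clean": "yes"}
    if options:
        merged.update(options)

    command = [tidy_exe, "-output", tidied_file]
    for name, value in merged.items():
        # Configuration options are passed as --name value pairs.
        command += ["--" + name, str(value)]
    command += ["--error-file", error_file, source]
    return command
```

The returned list can be handed straight to subprocess.run, which avoids any quoting problems with paths that contain spaces.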
The key class only ships with one method, TidyBatch(), which takes a directory; this directory is recursively walked (a nice feature in Python) and each file that meets the naming rules will be tidied.
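The recursive walk can be seen in isolation with a small sketch that just lists the files a batch would pick up. The function name find_candidates is hypothetical; the extensions and the ".tidied." exclusion mirror the naming rules in the script below.

```python
import os


def find_candidates(root_dir, extensions=(".htm", ".html", ".hhc", ".hhk")):
    """Walk root_dir recursively and return files that look tidy-able.

    Files already produced by a previous run (containing ".tidied.")
    are skipped so the batch can be re-run safely.
    """
    matches = []
    for subdir, _dirs, files in os.walk(root_dir):
        for name in files:
            lower = name.lower()
            if ".tidied." in lower:
                continue
            if lower.endswith(extensions):
                matches.append(os.path.join(subdir, name))
    return sorted(matches)
```

Running this first gives a quick count of how many files a batch will touch before any of them are changed.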
This script is part of a larger pipeline/workflow application which will decompile compiled help files (*.chm), tidy them and do further triage so that they meet the HTML5 standard and become convertible to ebooks version 3 (which is strict about HTML5). So we need some logic to handle compiled help file artefacts such as contents (*.hhc) files and index (*.hhk) files, and that logic lives in HTMLTidiedChmFileNamer.
import subprocess
import os
import os.path


class HTMLTidiedChmFileNamer(object):
    _reg_clsid_ = "{8807D2B9-C83F-4AEB-A71D-15DBE8EFED9A}"
    _reg_progid_ = 'PythonInVBA.HTMLTidiedChmFileNamer'
    _public_methods_ = ['TidiedFilenameWin32Dict']

    def TidiedFilename(self, subdir, file):
        file2 = file.lower()
        tidiedFile = ""
        errorfile = ""
        if ".tidied." not in file:
            if file2.endswith((".hhc", ".hhk")):
                tidiedFile = subdir + os.sep + file + ".tidied.html"
                errorfile = subdir + os.sep + file + ".tidied.errors.txt"
            if file2.endswith((".htm", ".html")):
                tidiedFile = (subdir + os.sep +
                              file.split('.')[0] + ".tidied.html")
                errorfile = (subdir + os.sep +
                             file.split('.')[0] + ".tidied.errors.txt")
        return (tidiedFile, errorfile)

    def TidiedFilenameWin32Dict(self, subdir, file):
        return Win32DictConverter().ConvertTupleToWin32Dict(
            self.TidiedFilename(subdir, file))


class Win32DictConverter(object):
    def ConvertTupleToWin32Dict(self, tup):
        import win32com.client
        win32dict = win32com.client.Dispatch("Scripting.Dictionary")
        n = 0
        for x in tup:
            win32dict.Add(n, x)
            n = n + 1
        return win32dict


class HTMLTidyChmFiles(object):
    _reg_clsid_ = "{20C361FF-1826-4673-A30D-FABA87FF7910}"
    _reg_progid_ = 'PythonInVBA.HTMLTidyChmFiles'
    _public_methods_ = ['TidyBatch']

    def TidyBatch(self, rootDir):
        if "HTMLTidy" not in os.environ:
            raise Exception(
                "HTMLTidy environment variable not defined, "
                "please define as HTMLTidy's bin folder")
        sHTMLTidyExe = os.path.join(os.environ["HTMLTidy"], "tidy.exe")
        FileNamer = HTMLTidiedChmFileNamer()
        for subdir, dirs, files in os.walk(rootDir):
            for file in files:
                tidiedFile, errorfile = FileNamer.TidiedFilename(subdir, file)
                fullPath = os.path.join(subdir, file)
                if not tidiedFile == "":
                    # http://tidy.sourceforge.net/docs/quickref.html
                    subprocess.run([sHTMLTidyExe, '-output', tidiedFile,
                                    '--doctype', 'html5', '--clean', 'yes',
                                    '--error-file', errorfile, fullPath])


if __name__ == '__main__':
    print("Registering COM servers...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(HTMLTidyChmFiles)
    win32com.server.register.UseCommandLine(HTMLTidiedChmFileNamer)
    rootdir = os.path.join(os.environ["tmp"], 'HelpFileDecompiler', "vblr6")
    test = HTMLTidyChmFiles()
    test.TidyBatch(rootdir)
The above script needs running at least once with administrator rights in order to register the COM classes. Once that is done you can call it from VBA with the code below. The script's __main__ block also runs a test against the decompiled vblr6 help file.
VBA Client Code
So as this is an Excel blog I should show you some client VBA code. This helped with testing. Also, I want to demonstrate how it is fine to write code in Python (and other languages) and make it callable from VBA; this helps to expand the horizons of the VBA developer.
Sub TestHTMLTidyChmFiles()
    '* assumes vblr6.chm has been decompiled previously (TestHelpFileDecompiler)
    Dim objHTMLTidyChmFiles As Object
    Set objHTMLTidyChmFiles = VBA.CreateObject("PythonInVBA.HTMLTidyChmFiles")
    Call objHTMLTidyChmFiles.TidyBatch(Environ$("tmp") & "\HelpFileDecompiler\VBLR6\")
End Sub

Sub TestHTMLTidiedChmFileNamer()
    Dim objFileNamer As Object
    Set objFileNamer = VBA.CreateObject("PythonInVBA.HTMLTidiedChmFileNamer")
    Dim dictResults As Scripting.Dictionary
    Set dictResults = objFileNamer.TidiedFilenameWin32Dict(Environ$("tmp") & "\HelpFileDecompiler\VBLR6\", "vblr6.hhc")
    Dim vRet As Variant
    vRet = dictResults.Items
    Debug.Print vRet(0)
    Debug.Print vRet(1)
    'Stop
End Sub
Final Thoughts
It is with joy that I found HTMLTidy restructures the malformed HTML 3.2 files buried in some *.chm files. There were some howlers, such as duplicate opening <BODY> tags, and I am glad that the resulting files can now be further triaged by loading them into an XML parser. This is necessary because there is more work to be done to get these files up to HTML5 standard.
HTMLTidy does the restructuring for me, and no doubt if I looked long enough in the documentation I would find more command line options that could help. The next task, though, is to validate the files further to give a schedule of further triage options. HTMLTidy in its output recommends an HTML validator and I will look at that next.
I thoroughly recommend HTMLTidy over any other known VBA-compatible solution for malformed HTML files (yes, even HTML Agility Pack!).
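A quick way to see whether a page will load into an XML parser is to try it and catch the failure. This is a rough sketch using Python's standard library parser; well_formedness_error is a hypothetical name, and note that markup using HTML named entities such as &nbsp; would still be rejected by a plain XML parser even when the tags are balanced.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(markup):
    """Return None if markup parses as XML, else the parser's message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None
```

Feeding it a page with a duplicate opening body tag gives back an error message, whereas a cleanly nested page gives None, so it can act as a pass/fail gate for the next pipeline stage.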