Thursday 2 August 2018

Python - Java - Nu Html Checker - Running an HTML validator on old help pages

In the previous post I showed how to use HTML Tidy to restructure old HTML help pages. In that case they were decompiled from a help file (*.chm), but the technique could apply to any old HTML files. To raise their compliance with HTML5, it is still necessary to triage them further. In its output messages, HTML Tidy recommends validating at http://validator.w3.org/nu/, but in this post I show how to run the same checks locally by downloading the Java jar that drives that web site.

The Nu Html Checker

HTML Tidy recommends the useful web site Nu Html Checker, https://validator.w3.org/nu/#textarea. Before you feel tempted to write code to script against this page, be advised that you can run your own copy of the Nu Html Checker from the command line, so long as you have Java installed.

Install Java

Please install Java before attempting the code below.

Install Nu Html Checker

Instructions on how to get your own copy of the tool are here. I navigated to Nu Html Checker version 18.7.23 and downloaded vnu.jar_18.7.23.zip. When the download completed, I unzipped it and extracted the contained files to a subdirectory in my Downloads folder. For later use, I defined an environment variable %vnu% to point to vnu.jar's parent folder, %userprofile%\Downloads\vnu.jar_18.7.23\dist. A better long-term place to install it would be somewhere in Program Files.

Running Nu Html Checker from command line

With the environment variable %vnu% defined, I can test that the install is working (both Java and the downloaded jar file) with:

C:\>java -jar %vnu%\vnu.jar --version
18.7.23

You can see the Nu Html Checker version number is returned, so all is installed correctly.
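The same command line can be assembled from Python, which is how the script later in this post shells out to Java. The following is only a minimal sketch: the helper names vnu_command and java_on_path are hypothetical and not part of the script below, and it assumes the %vnu% environment variable is defined as described above.

```python
import os
import shutil


def java_on_path():
    """Return True if a 'java' executable can be found on the PATH."""
    return shutil.which("java") is not None


def vnu_command(extra_args, environ=None):
    """Build (but do not run) the argument list for invoking vnu.jar.

    'environ' defaults to os.environ; passing a dict makes the helper
    easy to try out without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    if "vnu" not in environ:
        raise KeyError("vnu environment variable not defined, "
                       "please define as vnu jar's parent folder")
    jar = os.path.join(environ["vnu"], "vnu.jar")
    return ["java", "-jar", jar] + list(extra_args)


# Example: the version check shown above, as an argument list.
# The folder name here is a placeholder, not a real install location.
example = vnu_command(["--version"], {"vnu": "some_folder"})
```

The resulting list can be handed to subprocess.run() once java_on_path() returns True.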

Running Nu Html Checker from command line on a single file

With installation confirmed, we can advance to running the tool on an HTML file. I have some files resulting from a previous post, so I will try this file:

C:\>java -jar %vnu%\vnu.jar --no-langdetect --format xml %Temp%\HelpFileDecompiler\VBLR6\vblr6.hhc.tidied.html
<?xml version='1.0' encoding='utf-8'?>
<messages xmlns="http://n.validator.nu/messages/">
<error url="file:/C:/Users/Simon/AppData/Local/Temp/HelpFileDecompiler/VBLR6/vblr6.hhc.tidied.html" last-line="8" last-column="15" first-column="8">
<message>Element <code xmlns="http://www.w3.org/1999/xhtml">title</code> must not be empty.</message>
<extract>-&gt;
&lt;title&gt;<m>&lt;/title&gt;</m>
&lt;/hea</extract>
</error>

</messages>

C:\>

So we get a report. In this case there is only one message, complaining about an empty title element. The message carries text file co-ordinates (line, column) so we can locate the problem easily. Part of the message is itself entity-encoded HTML and so reads a little cryptically, but the other output formats are not much better.
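Because the report is XML, it can be boiled down to something more readable with Python's standard xml.etree.ElementTree module. The following is a sketch under assumptions: the function name summarise_report is hypothetical, the sample is a shortened copy of the output above, and only error elements with the attributes seen in that output are handled.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <messages> element in the report above.
NS = {"vnu": "http://n.validator.nu/messages/"}

SAMPLE = """<?xml version='1.0' encoding='utf-8'?>
<messages xmlns="http://n.validator.nu/messages/">
<error url="file:/C:/Users/Simon/AppData/Local/Temp/HelpFileDecompiler/VBLR6/vblr6.hhc.tidied.html" last-line="8" last-column="15" first-column="8">
<message>Element <code xmlns="http://www.w3.org/1999/xhtml">title</code> must not be empty.</message>
</error>
</messages>"""


def summarise_report(xml_text):
    """Return a list of (line, column, message text) for each error."""
    if isinstance(xml_text, str):
        # Parse from bytes so the encoding declaration is honoured.
        xml_text = xml_text.encode("utf-8")
    root = ET.fromstring(xml_text)
    rows = []
    for error in root.findall("vnu:error", NS):
        message = error.find("vnu:message", NS)
        # itertext() flattens the nested <code> element into plain text.
        text = "".join(message.itertext()) if message is not None else ""
        rows.append((int(error.get("last-line", 0)),
                     int(error.get("first-column", 0)),
                     text))
    return rows


rows = summarise_report(SAMPLE)
# rows == [(8, 8, 'Element title must not be empty.')]
```

Other message types the checker may emit are ignored here; they would need their own findall() call.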

Running Nu Html Checker from command line on a directory

Running on a whole directory created one huge file. I'd prefer a report file per HTML file. Fortunately we can write some Python code to do this.

Python Script to walk a folder and run Nu Html Checker on each file

If you have been reading my Python posts then the next script follows a familiar pattern. The script has a COM callable class so Excel VBA can call into it, but it also stands alone and is callable by running Python from the command line. This is an Excel blog and I feel obliged to tie non-VBA code back to VBA. In fact, there are two classes. I am working on a series of posts and would like to reuse the naming logic, which explains the HTMLTidiedChmFileNamer class.

The ValidatorReporter class runs the Nu Html Checker. It walks a folder, as found in previous scripts, and it shells a process using subprocess, as in previous scripts. One thing that is new here is that we are shelling to Java. Another thing that is new here is that we are capturing stderr by specifying PIPE in the subprocess.run() arguments; this allows us to read the stderr stream and then write it to a file.
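The stderr capture can be tried in isolation before involving Java. In this minimal sketch the child process is simply another Python interpreter that writes to stderr; the function name run_and_capture_stderr and the child command are assumptions made for illustration and are not part of the script below.

```python
import subprocess
import sys
from subprocess import PIPE


def run_and_capture_stderr(args):
    """Run a command and return (exit code, decoded stderr text)."""
    proc = subprocess.run(args, stderr=PIPE)
    # proc.stderr holds raw bytes, so decode before treating it as text.
    return proc.returncode, proc.stderr.decode("utf-8", errors="replace")


# Stand-in for the java command: a child Python that writes to stderr.
child = [sys.executable, "-c",
         "import sys; sys.stderr.write('report goes here')"]
code, text = run_and_capture_stderr(child)
```

Only stderr is piped here, so anything the child writes to stdout still goes to the console.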

import os
import subprocess
from subprocess import PIPE
import codecs


class HTMLTidiedChmFileNamer(object):
    _reg_clsid_ = "{8807D2B9-C83F-4AEB-A71D-15DBE8EFED9A}"
    _reg_progid_ = 'PythonInVBA.HTMLTidiedChmFileNamer'
    _public_methods_ = ['TidiedFilename']

    def TidiedFilename(self, subdir, file):
        file2 = file.lower()
        tidiedFile = ""
        errorfile = ""
        validationErrorsFile = ""

        if ".tidied." not in file:
            if file2.endswith((".hhc", ".hhk")):
                tidiedFile = subdir + os.sep + file + ".tidied.html"
                errorfile = subdir + os.sep + file + ".tidied.errors.txt"
                validationErrorsFile = (subdir + os.sep + file +
                                        ".tidied.validationErrors.txt")

            if file2.endswith((".htm", ".html")):
                tidiedFile = (subdir + os.sep +
                              file.split('.')[0] + ".tidied.html")
                errorfile = (subdir + os.sep +
                             file.split('.')[0] + ".tidied.errors.txt")
                validationErrorsFile = (subdir + os.sep +
                                        file.split('.')[0] +
                                        ".tidied.validationErrors.txt")
        return (tidiedFile, errorfile, validationErrorsFile)


class ValidatorReporter(object):
    _reg_clsid_ = "{321F338F-75AE-460B-85A2-5C553A39CDE1}"
    _reg_progid_ = 'PythonInVBA.ValidatorReporter'
    _public_methods_ = ['ValidateBatch']

    def ValidateBatch(self, rootDir):

        if "vnu" not in os.environ:
            raise Exception(
                "vnu environment variable not defined, "
                "please define as vnu jar's parent folder")

        sVNUExe = os.path.join(os.environ["vnu"], "vnu.jar")
        FileNamer = HTMLTidiedChmFileNamer()

        for subdir, dirs, files in os.walk(rootDir):
            for file in files:
                tidiedFile, errorfile, validationErrorsFile = (
                    FileNamer.TidiedFilename(subdir, file))
                if not tidiedFile == "":
                    # https://github.com/validator/validator#user-content-usage
                    args = ['java', '-jar', sVNUExe, '--no-langdetect',
                            '--format', 'xml', tidiedFile]
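                    # Capture stderr so the report can be written to a file
                    proc = subprocess.run(args, stderr=PIPE)

                    # Write one report per tidied file; 'with' closes the
                    # file even if the write fails
                    with codecs.open(validationErrorsFile,
                                     "w", "utf-8") as report:
                        report.write(proc.stderr.decode("utf-8"))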

if __name__ == '__main__':
    print("Registering COM servers...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(ValidatorReporter)
    win32com.server.register.UseCommandLine(HTMLTidiedChmFileNamer)
    
    rootdir = os.path.join(os.environ["tmp"], 'HelpFileDecompiler', "vblr6")
    test = ValidatorReporter()
    test.ValidateBatch(rootdir)

The portion of the code that registers the COM classes requires administrator rights. You can comment those lines out and run the script from the command line instead, in a purely Pythonic way.

The code assumes you have a folder with HTML files in it. I gave the code a folder of HTML files extracted from a decompiled *.chm file, and it takes a good while to run.
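One thing to watch in the naming logic: file.split('.')[0] keeps only the text before the first dot, so a file called a.b.html would be reported under the stem a. If your folder contains such names, a variant based on os.path.splitext may be safer. The function below is a hypothetical sketch, not the class used above, and its output only matches the script's for file names containing a single dot.

```python
import os


def report_names(subdir, file):
    """Return (tidied, errors, validation errors) paths, or None to skip."""
    lower = file.lower()
    if ".tidied." in lower:
        return None  # already an output of an earlier step
    if lower.endswith((".hhc", ".hhk")):
        stem = file  # keep the extension, as the script above does
    elif lower.endswith((".htm", ".html")):
        stem = os.path.splitext(file)[0]  # strips only the last extension
    else:
        return None
    base = os.path.join(subdir, stem)
    return (base + ".tidied.html",
            base + ".tidied.errors.txt",
            base + ".tidied.validationErrors.txt")


names = report_names("VBLR6", "a.b.html")
```

Unlike the class above, this version returns None for files it skips rather than a tuple of empty strings.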

Client VBA Code

To prove we can call this Python script from VBA, here is the client code:

Sub TestValidatorReporter()
    
    Dim objValidatorReporter As Object
    Set objValidatorReporter = VBA.CreateObject("PythonInVBA.ValidatorReporter")
    
    objValidatorReporter.ValidateBatch Environ$("tmp") & "\HelpFileDecompiler\VBLR6\"

End Sub

Final Thoughts

For me, the resultant output is huge and will take time to comb through, but it looks like I'll need to load the HTML files into XML parsers and rearrange attributes etc. More Python code to come in this series, so look out for that.
