Thursday 16 August 2018

VBA - JSON - REST APIs - Atomic vs Document-Driven

Very interesting article, Replace RESTful APIs with JSON-Pure saying that REST Apis as strictly described by Roy Fielding are probably not a good idea. Definitely worth a read and has influenced my design.

I want to write a web service that takes information scanned for a VBA project and uploads it to a web server for analysis. Initially, I had thought I'd make a network call for each method I'd find but then that adds up to tons of ntwork calls, also I'd have to expose my URLs to the user and finally I'd have to force fit my data to a REST url paradigm which I found quite challenging. This initial approach could be described as an atomic approach making many individual network calls. An alternative approach is to build up a whole document and post this document all in one network call.

In case you're interested the code that was used as the subject of the experiment is available here, Creating an SVG file with VBA .

So the following is a program that will use the Microsoft Visual Basic for Applications Extensibility 5.3 (VBIDE) library to scan a VBA project attached a workbook for classes and modules and procedures found therein.

The code uses simple string concatenation to build the JSON document but still uses the ScriptControl to debug malformed JSON.

The resultant JSON document is found below as well as the list of atomic calls for comparison. I'm opting for document driven approach for now on.


Option Explicit
Option Private Module

'* Tools->References
' MSScriptControl  Microsoft Script Control 1.0                        C:\Windows\SysWow64\msscript.ocx
' MSXML2           Microsoft XML, v6.0                                 C:\Windows\SysWOW64\msxml6.dll
' Scripting        Microsoft Scripting Runtime                         C:\Windows\SysWOW64\scrrun.dll
' VBIDE            Microsoft Visual Basic for App's Extensibility 5.3  C:\Program Files (x86)\Common Files\Microsoft Shared\VBA\VBA6\VBE6EXT.OLB

'******************************************************************************
'* This block implements the debug boolean (plus a default)
Private Const mbDEBUGJSONDEFAULT As Boolean = True
Private mvDebugJSON As Variant
Private Property Let DebugJSON(ByVal bRHS As Boolean)
    mvDebugJSON = bRHS
End Property
Private Property Get DebugJSON() As Boolean
    If IsEmpty(mvDebugJSON) Then mvDebugJSON = mbDEBUGJSONDEFAULT
    DebugJSON = mvDebugJSON
End Property
'******************************************************************************


Private Sub TestConvertArtefactsToModellerJSON()
    '*
    '* Main entry point, tests the code and shows the JSON parsed into an object
    '* of course one does not have to parse it here, one can just pass JSON string on to a webservice
    '*
    
    Dim wb As Excel.Workbook
    Set wb = Workbooks.Item("SVG.xlsm") '<--- put your own workbook in here!
    
    DebugJSON = False  '* this controls interim parsing, useful for debugging the document's components
    
    Dim sJSON As String
    sJSON = ReadComponentsWithVBIDE(wb)

    Debug.Print sJSON

    Dim objParsed As Object
    Set objParsed = SC.Run("JSON_parse", sJSON)
    
    Stop '* inspect objParsed in Locals Window
    
End Sub

Private Function ReadComponentsWithVBIDE(ByVal wb As Excel.Workbook) As String
    '*
    '* This function scans through all the classes and modules and reports on procedures found therein
    '*

    If wb Is Nothing Then Err.Raise vbObjectError, , "#Null wb!"
    
    Dim vbp As VBIDE.VBProject
    Set vbp = wb.VBProject
    
    Dim sJSON As String
    sJSON = VBA.Replace("{'projectName':'%foo%'", "%foo%", vbp.Name)
    
    Dim sJSONClasses As String
    sJSONClasses = "["
    
    Dim sJSONModules As String
    sJSONModules = "["
    
    Dim vbcLoop As VBIDE.VBComponent
    For Each vbcLoop In vbp.VBComponents
        
        If vbcLoop.Type = vbext_ct_ClassModule Or vbcLoop.Type = vbext_ct_StdModule Then
        
            Dim sJSONComponent As String
            sJSONComponent = ScanComponent(vbcLoop)
            
            Select Case vbcLoop.Type
            
            Case vbext_ct_ClassModule:
                sJSONClasses = sJSONClasses & VBA.IIf(Len(sJSONClasses) > 1, ",", "") & sJSONComponent
                
            Case vbext_ct_StdModule:
                sJSONModules = sJSONModules & VBA.IIf(Len(sJSONModules) > 1, ",", "") & sJSONComponent
            End Select

        End If
    Next
    
    sJSONClasses = sJSONClasses & "]"
    sJSONModules = sJSONModules & "]"
    
    Call ParseAndStringify(sJSONClasses)
    Call ParseAndStringify(sJSONModules)
    
    sJSON = sJSON & ",'classes':" & sJSONClasses
    sJSON = sJSON & ",'modules':" & sJSONModules
    sJSON = sJSON & "}"
        
    ReadComponentsWithVBIDE = ParseAndStringify(sJSON)
    
End Function

Private Function ScanComponent(ByVal vbc As VBIDE.VBComponent) As String
    '*
    '* This function will scan the source code of a component and report on any procedures found therein
    '*
    Dim dicProcs As Scripting.Dictionary
    Set dicProcs = New Scripting.Dictionary
    
    Dim dotnetlistProcs As Object
    Set dotnetlistProcs = VBA.CreateObject("System.Collections.ArrayList") '* need for sorting
    
    Dim cm As VBIDE.CodeModule
    Set cm = vbc.CodeModule
    
    Dim lLineLoop As Long
    For lLineLoop = 1 To cm.CountOfLines
        Dim sProc As String
        
        Dim eProcKind As VBIDE.vbext_ProcKind
        
        sProc = cm.ProcOfLine(lLineLoop, eProcKind)
        
        If Not dicProcs.Exists(sProc) Then
            dicProcs.Add sProc, VBA.Switch(eProcKind = vbext_pk_Get, "get", eProcKind = vbext_pk_Let, "let", _
                                           eProcKind = vbext_pk_Set, "set", eProcKind = vbext_pk_Proc, "proc")
        End If
        
        If Not dotnetlistProcs.contains(sProc) Then
            dotnetlistProcs.Add sProc
        End If
        
    Next lLineLoop
    
    dotnetlistProcs.Sort
    
    Dim sJSON As String
    sJSON = VBA.Replace("{'compName':'%foo%','procs':[", "%foo%", vbc.Name)
    
    Dim l As Long
    For l = 0 To dotnetlistProcs.Count - 1
        sJSON = sJSON & "{'procName':" & "'" & VBA.IIf(Len(dotnetlistProcs.Item(l)) = 0, "(Declarations)", dotnetlistProcs.Item(l)) & "'," 
        sJSON = sJSON & "'procKind':" & "'" & dicProcs.Item(dotnetlistProcs.Item(l)) & "'}" & VBA.IIf(l <> dotnetlistProcs.Count - 1, ",", "")
        
    Next l
    
    sJSON = sJSON & "]}"
    
    ScanComponent = ParseAndStringify(sJSON)

End Function

Private Function ParseAndStringify(ByVal sJSON As String) As String
    '*
    '* this function callable by any code that concatenates JSON will parse and restringify (to reformat) if the module level debug
    '* flag is set to True
    '*
    If DebugJSON Then

        Dim oSC As ScriptControl
        Set oSC = SC
        
        Dim objParsed As Object
        Set objParsed = oSC.Run("JSON_parse", VBA.Replace(sJSON, "'", """")) '* JSON strictly has double quotes not single quotes
        
        Dim sReStringified As String
        ParseAndStringify = oSC.Run("JSON_stringify", objParsed)
        Debug.Print ParseAndStringify
    Else
        ParseAndStringify = VBA.Replace(sJSON, "'", """")
    End If

End Function

Private Function SC() As ScriptControl
    '*
    '* This ScriptControl hosts javascript fragments, some added here, some downloaded from web
    '*
    Static soSC As ScriptControl
    If soSC Is Nothing Then

        Set soSC = New ScriptControl
        soSC.Language = "JScript"

        soSC.AddCode GetJavaScriptLibrary("https://raw.githubusercontent.com/douglascrockford/JSON-js/master/json2.js")
        soSC.AddCode "function JSON_stringify(jsonObj) { return JSON.stringify(jsonObj); } "
        soSC.AddCode "function JSON_parse(sJson) { return JSON.parse(sJson); } "
        soSC.AddCode "function deleteValueByKey(obj,keyName) { delete obj[keyName]; } "
        soSC.AddCode "function setValueByKey(obj,keyName, newValue) { obj[keyName]=newValue; } "
        soSC.AddCode "function enumKeysToMsDict(jsonObj,msDict) { for (var i in jsonObj) { msDict.Add(i,0); }  } "
        

    End If
    Set SC = soSC
End Function

Private Function GetJavaScriptLibrary(ByVal sURL As String) As String
    '*
    '* This assists the ScriptControl to download javascript library
    '*
    Dim xHTTPRequest As MSXML2.XMLHTTP60
    Set xHTTPRequest = New MSXML2.XMLHTTP60
    xHTTPRequest.Open "GET", sURL, False
    xHTTPRequest.send
    GetJavaScriptLibrary = xHTTPRequest.responseText

End Function

So the output of the program is a JSON string, if I pretty print this with the help of https://jsonformatter.curiousconcept.com/ .

 
{  
   "projectName":"SVGHelper",
   "classes":[  
      {  
         "compName":"Point",
         "procs":[  
            {  
               "procName":"(Declarations)",
               "procKind":"proc"
            },
            {  
               "procName":"SetPoint",
               "procKind":"proc"
            },
            {  
               "procName":"x",
               "procKind":"get"
            },
            {  
               "procName":"y",
               "procKind":"get"
            }
         ]
      },
      {  
         "compName":"Points",
         "procs":[  
            {  
               "procName":"(Declarations)",
               "procKind":"proc"
            },
            {  
               "procName":"AddPoint",
               "procKind":"proc"
            },
            {  
               "procName":"Class_Initialize",
               "procKind":"proc"
            },
            {  
               "procName":"Count",
               "procKind":"proc"
            },
            {  
               "procName":"CreatePoint",
               "procKind":"proc"
            },
            {  
               "procName":"Item",
               "procKind":"proc"
            },
            {  
               "procName":"LastPoint",
               "procKind":"proc"
            }
         ]
      },
      {  
         "compName":"SVGPath",
         "procs":[  
            {  
               "procName":"(Declarations)",
               "procKind":"proc"
            },
            {  
               "procName":"AddPoint",
               "procKind":"proc"
            },
            {  
               "procName":"Class_Initialize",
               "procKind":"proc"
            },
            {  
               "procName":"ClosePath",
               "procKind":"proc"
            },
            {  
               "procName":"D_Attribute",
               "procKind":"proc"
            },
            {  
               "procName":"ReflectInBothXAndY",
               "procKind":"proc"
            },
            {  
               "procName":"ReflectInX",
               "procKind":"proc"
            },
            {  
               "procName":"ReflectInY",
               "procKind":"proc"
            },
            {  
               "procName":"SetMove",
               "procKind":"proc"
            }
         ]
      },
      {  
         "compName":"UnionJack",
         "procs":[  
            {  
               "procName":"(Declarations)",
               "procKind":"proc"
            },
            {  
               "procName":"BlueTriangle",
               "procKind":"proc"
            },
            {  
               "procName":"Class_Initialize",
               "procKind":"proc"
            },
            {  
               "procName":"EnglishCross",
               "procKind":"proc"
            },
            {  
               "procName":"MyLargerBlueTriangle",
               "procKind":"proc"
            },
            {  
               "procName":"MySmallerBlueTriangle",
               "procKind":"proc"
            },
            {  
               "procName":"StPatricksCrossBlade",
               "procKind":"proc"
            }
         ]
      },
      {  
         "compName":"SVGTextMessageLengthCalculator",
         "procs":[  
            {  
               "procName":"(Declarations)",
               "procKind":"proc"
            },
            {  
               "procName":"CalculateChunks",
               "procKind":"proc"
            },
            {  
               "procName":"Class_Initialize",
               "procKind":"proc"
            },
            {  
               "procName":"CycleThroughTextMessage",
               "procKind":"proc"
            },
            {  
               "procName":"Initialise",
               "procKind":"proc"
            },
            {  
               "procName":"NavigateToTextMessageAndMeasureWidth",
               "procKind":"proc"
            },
            {  
               "procName":"Terminate",
               "procKind":"proc"
            },
            {  
               "procName":"WriteSVGTextFile",
               "procKind":"proc"
            }
         ]
      },
      {  
         "compName":"SVGTextMessageHeightCalculator",
         "procs":[  
            {  
               "procName":"(Declarations)",
               "procKind":"proc"
            },
            {  
               "procName":"ComputeHeight",
               "procKind":"proc"
            }
         ]
      }
   ],
   "modules":[  
      {  
         "compName":"modUnionJack",
         "procs":[  
            {  
               "procName":"(Declarations)",
               "procKind":"proc"
            },
            {  
               "procName":"CreateFromScratch",
               "procKind":"proc"
            }
         ]
      },
      {  
         "compName":"tstSVGTextMessageCal",
         "procs":[  
            {  
               "procName":"(Declarations)",
               "procKind":"proc"
            },
            {  
               "procName":"TestHeight",
               "procKind":"proc"
            },
            {  
               "procName":"TestLengthCalculator",
               "procKind":"proc"
            }
         ]
      },
      {  
         "compName":"Module1",
         "procs":[  

         ]
      }
   ]
}

The previous iteration of code took each class method and converted into to a REST url but this means (i) tons more network calls; (ii) exposing URLs ; (iii) forcing into REST url paradigm. (Code not given).

http://localhost:1337/VBAModeller/SVGHelper/Classes/Point/
http://localhost:1337/VBAModeller/SVGHelper/Classes/Point/SetPoint
http://localhost:1337/VBAModeller/SVGHelper/Classes/Point/x_propget
http://localhost:1337/VBAModeller/SVGHelper/Classes/Point/y_propget
http://localhost:1337/VBAModeller/SVGHelper/Classes/Points/
http://localhost:1337/VBAModeller/SVGHelper/Classes/Points/AddPoint
http://localhost:1337/VBAModeller/SVGHelper/Classes/Points/Class_Initialize
http://localhost:1337/VBAModeller/SVGHelper/Classes/Points/Count
http://localhost:1337/VBAModeller/SVGHelper/Classes/Points/CreatePoint
http://localhost:1337/VBAModeller/SVGHelper/Classes/Points/Item
http://localhost:1337/VBAModeller/SVGHelper/Classes/Points/LastPoint
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGPath/
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGPath/AddPoint
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGPath/Class_Initialize
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGPath/ClosePath
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGPath/D_Attribute
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGPath/ReflectInBothXAndY
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGPath/ReflectInX
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGPath/ReflectInY
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGPath/SetMove
http://localhost:1337/VBAModeller/SVGHelper/Classes/UnionJack/
http://localhost:1337/VBAModeller/SVGHelper/Classes/UnionJack/BlueTriangle
http://localhost:1337/VBAModeller/SVGHelper/Classes/UnionJack/Class_Initialize
http://localhost:1337/VBAModeller/SVGHelper/Classes/UnionJack/EnglishCross
http://localhost:1337/VBAModeller/SVGHelper/Classes/UnionJack/MyLargerBlueTriangle
http://localhost:1337/VBAModeller/SVGHelper/Classes/UnionJack/MySmallerBlueTriangle
http://localhost:1337/VBAModeller/SVGHelper/Classes/UnionJack/StPatricksCrossBlade
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGTextMessageLengthCalculator/
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGTextMessageLengthCalculator/CalculateChunks
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGTextMessageLengthCalculator/Class_Initialize
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGTextMessageLengthCalculator/CycleThroughTextMessage
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGTextMessageLengthCalculator/Initialise
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGTextMessageLengthCalculator/NavigateToTextMessageAndMeasureWidth
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGTextMessageLengthCalculator/Terminate
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGTextMessageLengthCalculator/WriteSVGTextFile
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGTextMessageHeightCalculator/
http://localhost:1337/VBAModeller/SVGHelper/Classes/SVGTextMessageHeightCalculator/ComputeHeight
http://localhost:1337/VBAModeller/SVGHelper/Modules/modUnionJack/
http://localhost:1337/VBAModeller/SVGHelper/Modules/modUnionJack/CreateFromScratch
http://localhost:1337/VBAModeller/SVGHelper/Modules/tstSVGTextMessageCal/
http://localhost:1337/VBAModeller/SVGHelper/Modules/tstSVGTextMessageCal/TestHeight
http://localhost:1337/VBAModeller/SVGHelper/Modules/tstSVGTextMessageCal/TestLengthCalculator

Saturday 4 August 2018

Chrome Extension - Stop Stray CORS requests

If I type in a web address such as the British Newspaper theguardian.com then I might naively expect all resources to be delivered from that domain name. But the modern web page has all sorts of cross network calls to web analytics and constant delivery networks (CDNs). Web analytics are a fact of modern life; they help firms reach their customers with better targeted adverts. No more scatter-gun adverts, we can now have pertinent products pitched to us individually. This helps allocative efficiency which is a good thing.

If web analytics were restricted to economic transactions then I'm confident there would be no problem. Sadly, some web analytics have been put to political use which is naughty. How does one opt out of a naughty analytics provider?

Use Hosts file to Block WebAnalytics

In the past I have altered my computer's hosts file at C:\Windows\System32\drivers\etc\hosts so a name resolves to loopback interface 127.0.0.1 meaning data destined for an address never leaves your computer. But this is like a sledgehammer to crack a nut.

Chrome Extension CORS filter

The precise technical term for cross network calls is Cross-origin resource sharing (CORS). CORS requires an exchange and interaction between (browser) client and (web) server.

On the server side, by default web servers disallow CORS and programmers have to actively change their code to permit CORS requests and actually even on this blog you'll find an example enabling CORS.

But the loopholes opened on the server side can be closed on the client side.

On the client side, we can write a Chrome Extension to disable CORS requests. I have given Chrome Extension examples before on this blog. This post's extension is slightly different in that it runs as a background script instead of a context script.

So in our example we are going block requests to Facebook domains because at the time of writing they are 'on the naughty step', being criticized by a British Parliamentary Oversight committee. U.S. Congressional oversight committees' reports are currently pending. But the code could be tweaked to apply to all manner of naughtiness.

manifest.json

Here is the manifest.json file. Create a directory, I called mine N:\CORS Chrome Extension\ and copy this there. This is a standard manifest file, it asks for permissions to block web requests.

  {
    "name": "Cross Origin Filter",
    "version": "0.0.6",
    "description": "Helps you stop stray CORS requests.",
    "permissions": [
      "webRequest",
      "webRequestBlocking",
      "*://*/*"
    ],
    "background": {
      "scripts": [
        "bgp.js"
      ],
      "persistent": true
    },
    "manifest_version": 2
  }

bgp.js

Below is the background page script, I called mine bgp.js (it must match entry in manifest.json) and saved this again in folder N:\CORS Chrome Extension\

The code adds a listener to the event onBeforeSendHeaders but whilst other events are available we need to scan through the request headers looking for the Referer so we can establish if the request is cross domain.

The code parses URLs using the URL object ;we only need the hostnames e.g. www.theguardian.com, www.facebook.com so we throw away the parameter string. Once we have the hostnames we can compare them against a list of domain names to block. There are two list matching sections, one compares exactly and the other compares the tail of the domain name.

If a domain matches one we want to block then we create a blockingResponse object and set its cancel property to true. This cancels the webrequest. We print to the console when we've blocked a domain.

chrome.webRequest.onBeforeSendHeaders.addListener(function (details) {
  
  var myVars = {};
  myVars.urlsPresent = false;

  try {
    myVars.requestURL = (new URL(details.url)).hostname;
    for (var i = 0, l = details.requestHeaders.length; i < l; ++i) {
      if (details.requestHeaders[i].name == 'Referer') {
        referer = details.requestHeaders[i].value;
        myVars.refererURL = (new URL(referer)).hostname;
        myVars.crossOrigin = (myVars.refererURL !== myVars.requestURL);
        myVars.urlsPresent = true;
        break;
      }
    }
  }
  catch (err) {
    console.log("Error whilst determining URLs, err.message: " + err.message);
  }

  if (myVars.urlsPresent === true) {
    try {
      myVars.block = false;

      if (myVars.crossOrigin === true) {

        {
          var aBlockCrossOriginEndsWithList = [".fbcdn.net"];
          for (var i = 0, l = aBlockCrossOriginEndsWithList.length; i < l; ++i) {
            if (myVars.requestURL.endsWith( aBlockCrossOriginEndsWithList[i])) {
              //debugger;
              myVars.block = true;
              break;
            }
          }
        }

        {
          var aBlockCrossOriginList = ["connect.facebook.net", "www.facebook.com"];
          for (var i = 0, l = aBlockCrossOriginList.length; i < l; ++i) {
            if (aBlockCrossOriginList[i] == myVars.requestURL) {
              myVars.block = true;
            }
          }
        }


      }
    }
    catch (err) {
      console.log("Error whilst determining blocking, err.message: " + err.message);
    }
  }

  if (myVars.block === true) {
    try {
      console.log("CORS Filter v.0.0.6, blocking " + myVars.requestURL + " from " + myVars.refererURL);
      //debugger;
      blockingResponse = {};
      blockingResponse.cancel = true
      return blockingResponse;
    }
    catch (err) {
      console.log("Error whilst returning blocking response, err.message: " + err.message);
    }
  }

}, { urls: ["*://*/*"] }, ['requestHeaders', 'blocking']);

Here is the console output showing how a web page from the Guardian is having cross domain calls to Facebook blocked.

Thursday 2 August 2018

Python - Java - Nu Html Checker - Running an HTML validator on old help pages

So in previous post I showed how to use HTMLTidy to restructure old HTML help pages in that case decompiled from a help file (*.chm) but could apply to any old HTML files. To raise compliance to HTML5, it is still necessary to further triage them. In its output messages, HTMLTidy recommends validating at http://validator.w3.org/nu/ but in this post we show how one can run this logic locally by downloading the java jar that drives that web site.

The Nu Html Checker

So HTML Tidy recommends the useful web site Nu Html Checker, https://validator.w3.org/nu/#textarea but before you feel tempted to write code to script against this page be advised you can run your own copy of the Nu Html Checker from a command line so long as you have Java installed.

Install Java

Do please install Java before attempting the code below

Install Nu Html Checker

Instructions as to how to get your own copy of the tool are here. So I navigated to Nu Html Checker version 18.7.23 and downloaded vnu.jar_18.7.23.zip . When the download completed, I unzipped it and extracted contained files to a subdirectory in my Downloads folder. For later use, I defined an environment variable %vnu% to point to vnu.jar's parent folder, %userprofile%\Downloads\vnu.jar_18.7.23\dist . A better long run place to install would be somewhere in Program Files.

Running Nu Html Checker from command line

With the environment variable %vnu% defined I can test the install is working (both java and the downloaded jar file) with ...

C:\>java -jar %vnu%\vnu.jar --version
18.7.23

You can see the Nu Html Checker version number is returned, so all is installed correctly.

Running Nu Html Checker from command line on a single file

Installation confirmed, we can confidently advance to running the tool on an HTML file, I have some files resulting from a previous post. So I will try this file

C:\>java -jar %vnu%\vnu.jar --no-langdetect --format xml %Temp%\HelpFileDecompiler\VBLR6\vblr6.hhc.tidied.html
<?xml version='1.0' encoding='utf-8'?>
<messages xmlns="http://n.validator.nu/messages/">
<error url="file:/C:/Users/Simon/AppData/Local/Temp/HelpFileDecompiler/VBLR6/vblr6.hhc.tidied.html" last-line="8" last-column="15" first-column="8">
<message>Element <code xmlns="http://www.w3.org/1999/xhtml">title</code> must not be empty.</message>
<extract>-&gt;
&lt;title&gt;<m>&lt;/title&gt;</m>
&lt;/hea</extract>
</error>

</messages>

C:\>

So we get a report. In this case one message only complaining about an empty title element; the message carries text file co-ordinates (line, column) so we can locate easily. Some of the message is itself entitized HTML and so reads a little cryptically but the other output formats are not much better.

Running Nu Html Checker from command line on a directory

Running on a whole directory created a huge massive file. I'd prefer a report file per HTML file. Fortunately we can write some Python code to do this.

Python Script to walk a folder and run Nu Html Checker on each file

If you have been reading my Python posts then the next script follows a familiar pattern. The script has a COM callable class so Excel VBA can call into it but it also stands alone and is callable by running Python from the command line. This is an Excel blog and I feel obliged to tie non VBA code back to VBA. In fact, there are two classes, I am working on a series of posts and would like to reuse the naming logic so that explains the HTMLTidiedChmFileNamer class.

The ValidatorReporter class runs the Nu Html Checker validation checker. It walks a folder as found in previous scripts. It shells a process using subprocess as in previous scripts. One thing that is new here is that we are shelling to java. Another things that is new here is that we are capturing the stderr by specifying PIPE in subprocess.run() arguments; this allows us to read the stderr stream and then we write it to a file.

import os
import subprocess
from subprocess import PIPE
import codecs


class HTMLTidiedChmFileNamer(object):
    _reg_clsid_ = "{8807D2B9-C83F-4AEB-A71D-15DBE8EFED9A}"
    _reg_progid_ = 'PythonInVBA.HTMLTidiedChmFileNamer'
    _public_methods_ = ['TidiedFilenameWin32Dict']

    def TidiedFilename(self, subdir, file):
        file2 = file.lower()
        tidiedFile = ""
        errorfile = ""
        validationErrorsFile = ""

        if ".tidied." not in file:
            if file2.endswith((".hhc", ".hhk")):
                tidiedFile = subdir + os.sep + file + ".tidied.html"
                errorfile = subdir + os.sep + file + ".tidied.errors.txt"
                validationErrorsFile = (subdir + os.sep + file +
                                        ".tidied.validationErrors.txt")

            if file2.endswith((".htm", ".html")):
                tidiedFile = (subdir + os.sep +
                              file.split('.')[0] + ".tidied.html")
                errorfile = (subdir + os.sep +
                             file.split('.')[0] + ".tidied.errors.txt")
                validationErrorsFile = (subdir + os.sep +
                                        file.split('.')[0] +
                                        ".tidied.validationErrors.txt")
        return (tidiedFile, errorfile, validationErrorsFile)


class ValidatorReporter(object):
    _reg_clsid_ = "{321F338F-75AE-460B-85A2-5C553A39CDE1}"
    _reg_progid_ = 'PythonInVBA.ValidatorReporter'
    _public_methods_ = ['ValidateBatch']

    def ValidateBatch(self, rootDir):

        if "vnu" not in os.environ:
            raise Exception(
                "vnu environment variable not defined, "
                "please define as vnu jar's parent folder")

        sVNUExe = os.path.join(os.environ["vnu"], "vnu.jar")
        FileNamer = HTMLTidiedChmFileNamer()

        for subdir, dirs, files in os.walk(rootDir):
            for file in files:
                tidiedFile, errorfile, validationErrorsFile = 
                        FileNamer.TidiedFilename(subdir, file)
                if not tidiedFile == "":
                    # https://github.com/validator/validator#user-content-usage
                    args = ['java', '-jar', sVNUExe, '--no-langdetect',
                            '--format', 'xml', tidiedFile]
                    proc = subprocess.run(args, stderr=PIPE)

                    file = codecs.open(validationErrorsFile, "w", "utf-8")
                    file.write(proc.stderr.decode("utf-8"))
                    file.close()

if __name__ == '__main__':
    print ("Registering COM servers...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(ValidatorReporter)
    win32com.server.register.UseCommandLine(HTMLTidiedChmFileNamer)
    
    rootdir = os.path.join(os.environ["tmp"], 'HelpFileDecompiler', "vblr6")
    test = ValidatorReporter()
    test.ValidateBatch(rootdir)

The portion of code that registers the COM classes require administrator rights. You can comment them out and run the script from command line instead in a purely Pythonic way.

The code assumes you have a folder with HTML files in it. For me I have given the code a folder of HTML files extracted from a decompiled *.chm file and the code takes a good while.

Client VBA Code

To prove we can call this Python script from VBA here is the client code

Sub TestValidatorReporter()
    
    Dim objValidatorReporter As Object
    Set objValidatorReporter = VBA.CreateObject("PythonInVBA.ValidatorReporter")
    
    objValidatorReporter.ValidateBatch Environ$("tmp") & "\HelpFileDecompiler\VBLR6\"

End Sub

Final Thoughts

For me, the resultant output is huge and will take time to comb through but it looks like I'll need to load HTML files into Xml parsers and rearrange attributes etc. More Python code to come in this series. So look out for that.

Wednesday 1 August 2018

Python - HTML Tidy - Script to restructure old help pages

So previously, I gave a Python script that decompiles a compiled help file (*.chm) into its constituent HTML help pages. Many of the resultant pages were very dated, poorly structured and had absolutely no hope of being well-formed enough to be parsable by an Xml parser (HTML 3.2 is to blame). I did write some VBA code to rewrite the files some time ago but I have found a much better technology called HTMLTidy. I came across HTMLTidy during my Python travels because there is a Python veneer library but actually I found that simple shelling a subprocess to be a much better approach.

The script below follows on from help file (*.chm) decompiler and it assumes that script has been run and has deposited files into a subdirectory of the Temp folder. This script also will form part of an overall workflow/pipeline.

Install HTMLTidy

Do please install HTMLTidy before attempting to run the code below. And set an environment variable %HTMLTidy% to point to the executable's parent folder. My installation is still in the Downloads folder (as determined by my browser) which shows how unfussy the install is (good).

Returning Tuples To VBA

As usual, the code is callable from Excel VBA by virtue of its COM registration but there is a slight problem and that is intrinsic Python types such as tuples do not return to VBA correctly. Attempting to return a tuple to VBA only returns the first element and not the whole list. I am happy to report that copying the values from a tuple into a Scripting.Dictionary (COM Type Library: Microsoft Scripting Runtime) and returning the dictionary to VBA solves the problem.

The Python pattern of returning a tuple is lovely and I would like to use it without restriction but if I want to make a method callable from VBA then I need to ship a second method that converts tuple to Scripting.Dictionary. I imagined wanting to do this in so many scenarios that it felt appropriate to write a little helper class...

class Win32DictConverter(object):
    def ConvertTupleToWin32Dict(self, tup):
        import win32com.client
        win32dict = win32com.client.Dispatch("Scripting.Dictionary")

        n=0
        for x in tup:
            win32dict.Add(n,x)
            n=n+1
        return  win32dict

And here is a usage example taken from this post's script.

    def TidiedFilenameWin32Dict(self, subdir, file):
        return  Win32DictConverter().ConvertTupleToWin32Dict(self.TidiedFilename(subdir, file))

Other than this little trick the rest of the code here is straightforward.

Code Walkthrough

So the code below essentially shell's to HTML Tidy in a manner similar to previous posts. [I have not used the Python veneer library to HTMLTidy as it gave counter-intuitive (to me at least) defaults.]

The only design decision to highlight is that I have broken out the code for the renaming of files into separate class, HTMLTidiedChmFileNamer, as I will probably need to call the logic therein from a later script. This is because this is meant to be part of a workflow/pipeline application.

You can see the code passing arguments to HTMLTidy from a reference of potential arguments see http://tidy.sourceforge.net/docs/quickref.html There are plenty to choose from

The key class only ships with one method, TidyBatch(), which takes a directory; this directory is recursively walked (a nice feature in Python) and each file that meets the naming rules will be tidied.

This script is part of a larger pipeline/workflow application which will decompile compiled help files (*.chm), tidy them and do further triage so meet the HTML5 standard and become convertible to ebooks version 3 (which is strict about HTML5). So we need some logic to handle compiled help file artefacts such as content (*.hhc) files and index (*.hhk) files. I will probably need to call that logic later so I put it in a class of its own, HTMLTidiedChmFileNamer.

import subprocess
import os
import os.path


class HTMLTidiedChmFileNamer(object):
    _reg_clsid_ = "{8807D2B9-C83F-4AEB-A71D-15DBE8EFED9A}"
    _reg_progid_ = 'PythonInVBA.HTMLTidiedChmFileNamer'
    _public_methods_ = ['TidiedFilenameWin32Dict']

    def TidiedFilename(self, subdir, file):
        file2 = file.lower()
        tidiedFile = ""
        errorfile = ""

        if ".tidied." not in file:
            if file2.endswith((".hhc", ".hhk")):
                tidiedFile = subdir + os.sep + file + ".tidied.html"
                errorfile = subdir + os.sep + file + ".tidied.errors.txt"

            if file2.endswith((".htm", ".html")):
                tidiedFile = (subdir + os.sep +
                              file.split('.')[0] + ".tidied.html")
                errorfile = (subdir + os.sep +
                             file.split('.')[0] + ".tidied.errors.txt")
        return (tidiedFile, errorfile)

    def TidiedFilenameWin32Dict(self, subdir, file):
        return Win32DictConverter().ConvertTupleToWin32Dict(
            self.TidiedFilename(subdir, file))


class Win32DictConverter(object):
    def ConvertTupleToWin32Dict(self, tup):
        import win32com.client
        win32dict = win32com.client.Dispatch("Scripting.Dictionary")

        n = 0
        for x in tup:
            win32dict.Add(n, x)
            n = n + 1
        return win32dict


class HTMLTidyChmFiles(object):
    _reg_clsid_ = "{20C361FF-1826-4673-A30D-FABA87FF7910}"
    _reg_progid_ = 'PythonInVBA.HTMLTidyChmFiles'
    _public_methods_ = ['TidyBatch']

    def TidyBatch(self, rootDir):

        if "HTMLTidy" not in os.environ:
            raise Exception(
                "HTMLTidy environment variable not defined, "
                "please define as HTMLTidy's bin folder")

        sHTMLTidyExe = os.path.join(os.environ["HTMLTidy"], "tidy.exe")
        FileNamer = HTMLTidiedChmFileNamer()

        for subdir, dirs, files in os.walk(rootDir):
            for file in files:
                tidiedFile, errorfile = FileNamer.TidiedFilename(subdir, file)
                fullPath = os.path.join(subdir, file)
                if not tidiedFile == "":
                    # http://tidy.sourceforge.net/docs/quickref.html
                    subprocess.run([sHTMLTidyExe, '-output', tidiedFile,
                                    '--doctype', 'html5', '--clean', 'yes',
                                    '--error-file', errorfile, fullPath])


if __name__ == '__main__':
    print ("Registering COM servers...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(HTMLTidyChmFiles)
    win32com.server.register.UseCommandLine(HTMLTidiedChmFileNamer)

    rootdir = os.path.join(os.environ["tmp"], 'HelpFileDecompiler', "vblr6")
    test = HTMLTidyChmFiles()
    test.TidyBatch(rootdir)

The above script needs running at least once with administrator rights in order to register the COM classes. Once that is run you can call from VBA with the code below. The main() function also runs a test.

VBA Client Code

So as this is an Excel blog I should show you some client VBA code. This helped testing. Also, I want to demonstrate how it is fine to write code in Python (and other languages) and make them callable from VBA, this helps to expand the horizons of the VBA developer.

Sub TestHTMLTidyChmFiles()
    
    '* assumes vblr6.chm has been decompiled previously (TestHelpFileDecompiler)
    
    Dim objHTMLTidyChmFiles As Object
    Set objHTMLTidyChmFiles = VBA.CreateObject("PythonInVBA.HTMLTidyChmFiles")
    
    Call objHTMLTidyChmFiles.TidyBatch(Environ$("tmp") & "\HelpFileDecompiler\VBLR6\")
    
End Sub


Sub TestHTMLTidiedChmFileNamer()
    
    Dim objFileNamer As Object
    Set objFileNamer = VBA.CreateObject("PythonInVBA.HTMLTidiedChmFileNamer")
    
    Dim dictResults As Scripting.Dictionary
    Set dictResults = objFileNamer.TidiedFilenameWin32Dict(Environ$("tmp") & "\HelpFileDecompiler\VBLR6\", "vblr6.hhc")
    
    Dim vRet As Variant
    vRet = dictResults.Items
    
    Debug.Print vRet(0)
    Debug.Print vRet(1)
    'Stop

End Sub

Final Thoughts

It is with joy that I found HTMLTidy restructures the mal-formed HTML 3.2 files buried in some *.chm files. There were some howlers such as duplicate opening <BODY>l tags etc and I am glad that the resulting files can now be further triaged by loading into an Xml parser. This is necessary because there is more work to be done to get these files up to HTML5 standard.

HTMLTidy does the restructuring for me and no doubt if I looked long enough in the documentation some command line options could help but the next task is to further validate the files to give a schedule of further triage options. HTMLTidy in its output recommends an HTML validator and I will look at that next.

I thoroughly recommend HTMLTidy over any other known VBA compatible solution for mal-formed HTML files (yes even HTML Agility pack!)