Friday 1 December 2017

Use MSHTML to parse local HTML file without using Internet Explorer (Microsoft HTML Object Library)

So an excellent question came up today on StackOverflow about the parsing of HTML in VBA for when Internet Explorer is unavailable.

Anyone who has done some web scraping will be familiar with creating an instance of Internet Explorer (IE) and the navigating to a web address and then once the page is ready start navigating the DOM using the 'Microsoft HTML Object Library' (MSHTML) type library. The question asks if IE is unavailable what to do. I am in the same situation for my box running Windows 10.

I had suspected it was possible to spin up an instance of MSHTML.HTMLDocument but its creation is not obvious. Thanks to the questioner for asking this now. The answer lies in the MSHTML.IHTMLDocument4.createDocumentFromUrl method. One needs a local file to work with (EDIT: actually one can put a webby url in as well!) but we have a nice tidy Windows API function called URLDownloadToFile to download a file.

This codes runs on my Windows 10 box where Microsoft Edge is running and not Internet Explorer. This is an important find and thanks to the questioner for raising it.


Option Explicit

'* Tools->Refernces Microsoft HTML Object Library


'* MSDN - URLDownloadToFile function - https://msdn.microsoft.com/en-us/library/ms775123(v=vs.85).aspx
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" _
        (ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, _
        ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

Sub Test()

    Dim fso As Object
    Set fso = CreateObject("Scripting.FileSystemObject")

    Dim sLocalFilename As String
    sLocalFilename = Environ$("TMP") & "\urlmon.html"
    
    Dim sURL As String
    sURL = "https://stackoverflow.com/users/3607273/s-meaden"
    
    
    Dim bOk As Boolean
    bOk = (URLDownloadToFile(0, sURL, sLocalFilename, 0, 0) = 0)
    If bOk Then
        If fso.FileExists(sLocalFilename) Then
        
            '* Tools->References Microsoft HTML Object Library
            Dim oHtml4 As MSHTML.IHTMLDocument4
            Set oHtml4 = New MSHTML.HTMLDocument
            
            Dim oHtml As MSHTML.HTMLDocument
            Set oHtml = Nothing
            
            '* IHTMLDocument4.createDocumentFromUrl
            '* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
            Set oHtml = oHtml4.createDocumentFromUrl(sLocalFilename, "")
            
            '* need to wait a little whilst the document parses
            '* because it is multithreaded
            While oHtml.readyState <> "complete"
                DoEvents  '* do not comment this out it is required to break into the code if in infinite loop
            Wend
            Debug.Assert oHtml.readyState = "complete"
            

            Dim sTest As String
            sTest = Left$(oHtml.body.outerHTML, 100)
            Debug.Assert Len(Trim(sTest)) > 50  '* just testing we got a substantial block of text, feel free to delete
            
            '* this is where the page specific logic now goes, here I am getting info from a StackOverflow page
            Dim htmlAnswers As Object 'MSHTML.DispHTMLElementCollection
            Set htmlAnswers = oHtml.getElementsByClassName("answer-hyperlink")
    
            Dim lAnswerLoop As Long
            For lAnswerLoop = 0 To htmlAnswers.Length - 1
                Dim vAnswerLoop
                Set vAnswerLoop = htmlAnswers.Item(lAnswerLoop)
                Debug.Print vAnswerLoop.outerText
            
            Next
    
        End If
    End If
End Sub



No comments:

Post a Comment