So an excellent question came up today on StackOverflow about the parsing of HTML in VBA for when Internet Explorer is unavailable.
Anyone who has done some web scraping will be familiar with creating an instance of Internet Explorer (IE) and the navigating to a web address and then once the page is ready start navigating the DOM using the 'Microsoft HTML Object Library' (MSHTML) type library. The question asks if IE is unavailable what to do. I am in the same situation for my box running Windows 10.
I had suspected it was possible to spin up an instance of MSHTML.HTMLDocument but its creation is not obvious. Thanks to the questioner for asking this now. The answer lies in the MSHTML.IHTMLDocument4.createDocumentFromUrl method. One needs a local file to work with (EDIT: actually one can put a webby url in as well!) but we have a nice tidy Windows API function called URLDownloadToFile to download a file.
This codes runs on my Windows 10 box where Microsoft Edge is running and not Internet Explorer. This is an important find and thanks to the questioner for raising it.
Option Explicit
'* Tools->Refernces Microsoft HTML Object Library
'* MSDN - URLDownloadToFile function - https://msdn.microsoft.com/en-us/library/ms775123(v=vs.85).aspx
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" _
(ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, _
ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
Sub Test()
Dim fso As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Dim sLocalFilename As String
sLocalFilename = Environ$("TMP") & "\urlmon.html"
Dim sURL As String
sURL = "https://stackoverflow.com/users/3607273/s-meaden"
Dim bOk As Boolean
bOk = (URLDownloadToFile(0, sURL, sLocalFilename, 0, 0) = 0)
If bOk Then
If fso.FileExists(sLocalFilename) Then
'* Tools->References Microsoft HTML Object Library
Dim oHtml4 As MSHTML.IHTMLDocument4
Set oHtml4 = New MSHTML.HTMLDocument
Dim oHtml As MSHTML.HTMLDocument
Set oHtml = Nothing
'* IHTMLDocument4.createDocumentFromUrl
'* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
Set oHtml = oHtml4.createDocumentFromUrl(sLocalFilename, "")
'* need to wait a little whilst the document parses
'* because it is multithreaded
While oHtml.readyState <> "complete"
DoEvents '* do not comment this out it is required to break into the code if in infinite loop
Wend
Debug.Assert oHtml.readyState = "complete"
Dim sTest As String
sTest = Left$(oHtml.body.outerHTML, 100)
Debug.Assert Len(Trim(sTest)) > 50 '* just testing we got a substantial block of text, feel free to delete
'* this is where the page specific logic now goes, here I am getting info from a StackOverflow page
Dim htmlAnswers As Object 'MSHTML.DispHTMLElementCollection
Set htmlAnswers = oHtml.getElementsByClassName("answer-hyperlink")
Dim lAnswerLoop As Long
For lAnswerLoop = 0 To htmlAnswers.Length - 1
Dim vAnswerLoop
Set vAnswerLoop = htmlAnswers.Item(lAnswerLoop)
Debug.Print vAnswerLoop.outerText
Next
End If
End If
End Sub
No comments:
Post a Comment