Thursday 4 January 2018

VBA MSHTML Webscraping - Looking for a New Oven

So webscraping is a task that can sometimes be a breeze and other times a pain in the neck. Without a shadow of a doubt the problem of mal-formed HTML where <BR>,<IMG> and <P> tags are not closed thus compromising the structure of the document is one of the most lamentable departures from a standard I can think of. If the tags were closed properly then one could have a sporting chance of running a document through an Xml parser. There was a push to get HTML to be XML compliant with XHTML but it failed. Along with unclosed tags, an Xml parser will choke on textual attributes values without quotes and also characters that need to be entitised like & -> &amp;. If it would parse successfully with Xml then XPath would be the way to navigate the document.

In the absence of Xml nirvana, other strategies have emerged. The prevalent use of JQuery and CSS means the web page developers structure the page so that their Javascript selects nodes based on CSS selectors. We can do this in MSHTML (the type library whose full name is 'Microsoft HTML Object Library') with querySelectorAll and querySelector.

Even so, I find MSHTML to be buggy and querySelectorAll and querySelector don't always work. So I find myself writing code to loop through child nodes and next siblings etc. I also write code to test the node name and the attribtutes of each node. In short, I end up writing some helper functions which I hope I will re-use. I will post code here. But I will also post some findings which are new (to me). I've even ended up writing some good old fashioned string manipulation to excise a snippet of code. I've also written code to allow compatible snippets to be loaded into a mini Xml document. All these are given below in my MSHTMLComparables class.

The task - scrape details of ovens

So we need a new oven and I need to get some details from a local shop's website. The code is given below, I saved the web pages off manually to save network traffic. I am experienced in this but one of two new (to me) things emerged.

MSHTML parses HTML5 <Article> tags as type MSHTML.HTMLUnknownElement

So HTML5 introduces new tags, and until the MSHTML library gets updates it looks like the new tags will be reported as being of type HTMLUnknownElement. So some methods we're not available to hand. Can be worked around fine.

MSHTML if querying style attribute for CSS source use CssText

So HTML elements have attributes, attributes are key value pairs. One such attribute found on an element is the Style element, and it is curious that the value of this attribute is itself a key value pair collection. I wanted to get the full style value but I ended up having to query for the specific CSS name, i.e. Height. Type library browser shows two promising set of methods, getAttribute, setAttribute, removeAttribute and getExpression, setExpression, removeExpression. Looking at the documentation remarks the xxxExpression methods are for the IHTMLDocument2::expando properties and added after the original xxxAttribute methods.

I used objStyle.getAttribute("height") because I knew it existed. I wanted the whole CSS source but the method toString gives only [object] which is a JavaScript object's default string representation. I know a thing or two about JavaScript objects and there string representation in VBA and borrowed my own SO code (below) to investigate but the code generates 150 properties and not the one or two that I was expecting.



...
            Dim objStyle As Object
            Set objStyle = divProductImagesLoop.getAttribute("style")
            
            Dim dicStyleAttrib As Scripting.Dictionary
            Set dicStyleAttrib = New Scripting.Dictionary
            
            Call GetScriptEngine.Run("enumerateKeys", objStyle, dicStyleAttrib) '* this gives JSON rendering instead of "[object Object]"


...

Private Function GetScriptEngine() As ScriptControl
    '* see code from this SO Q & A
    ' https://stackoverflow.com/questions/37711073/in-excel-vba-on-windows-how-to-get-stringified-json-respresentation-instead-of
    Static soScriptEngine As ScriptControl
    If soScriptEngine Is Nothing Then
        Set soScriptEngine = New ScriptControl
        soScriptEngine.Language = "JScript"

        
        soScriptEngine.AddCode "function enumerateKeys(jsonObj, microsoftDict) { " & _
                                    "for (var key in jsonObj) { " & _
                                    "microsoftDict.Add(key, jsonObj[key]);  " & _
                                    "}}"
    End If
    Set GetScriptEngine = soScriptEngine
End Function

Anyway, it turns out the that there is a CssText property which I think is sparsely documented, here is a non-Microsoft blog post.

MSHTMLComparables Class - beginnings of a reusable class for web scraping tasks (I hope)


Option Explicit

Private mdicComparables As New Scripting.Dictionary

Private fso As New Scripting.FileSystemObject


Private Declare Function GetTempFileName Lib "kernel32.dll" Alias "GetTempFileNameA" (ByVal lpszPath As String, _
            ByVal lpPrefixString As String, ByVal wUnique As Long, ByRef lpTempFileName As String) As Long


Friend Function HashOfHash() As Long
    
    Dim vPrimes
    vPrimes = Array(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
    
    Dim lRet As Long

    Dim lIndex As Long
    Dim vKeyLoop As Variant
    For Each vKeyLoop In mdicComparables.Keys
        lRet = lRet + (mdicComparables.HashVal(vKeyLoop) * vPrimes(lIndex))
        lIndex = lIndex + 1
    Next vKeyLoop
    
    Dim vItemLoop
    For Each vItemLoop In mdicComparables.Items
        lRet = lRet + (mdicComparables.HashVal(vKeyLoop) * vPrimes(lIndex))
        lIndex = lIndex + 1
    Next vItemLoop
    
    
    HashOfHash = lRet
End Function

Friend Function TempFile() As String

    TempFile = fso.BuildPath(VBA.Environ$("TMP"), "TempFile" & HashOfHash & ".html")
    
End Function



Friend Function SpawnHTMLFragmentFromSrc(ByVal sSrc As String) As Object
    Debug.Assert Len(sSrc) > 0
    
    Dim sTempFile As String
    sTempFile = TempFile
    
    Dim txtOut As Scripting.TextStream
    Set txtOut = fso.CreateTextFile(sTempFile)
    
    txtOut.WriteLine "<html>"
    txtOut.WriteLine "<head>"
    txtOut.WriteLine "<title>Temporary work file created by " & ThisWorkbook.Path & "</title>"
    txtOut.WriteLine "</head>"
    txtOut.WriteLine "<body>"
    txtOut.WriteLine sSrc
    txtOut.WriteLine "</body>"
    txtOut.WriteLine "</html>"
    txtOut.Close
    
    Set txtOut = Nothing
    
    Dim oHtml4 As MSHTML.IHTMLDocument4
    Set oHtml4 = New MSHTML.HTMLDocument
    
    Dim oHtml As MSHTML.HTMLDocument
    Set oHtml = Nothing
    
    '* IHTMLDocument4.createDocumentFromUrl
    '* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
    Set oHtml = oHtml4.createDocumentFromUrl(sTempFile, "")
    While oHtml.readyState <> "complete"
        DoEvents  '* do not comment this out it is required to break into the code if in infinite loop
    Wend
    Debug.Assert oHtml.readyState = "complete"
    
    
    
    Dim htmlBody As MSHTML.htmlBody
    Set htmlBody = oHtml.querySelector("body")
    
    Set SpawnHTMLFragmentFromSrc = htmlBody.FirstChild
    
    Set oHtml4 = Nothing
    Set oHtml = Nothing
    Set htmlBody = Nothing

End Function


Friend Function SpawnXmlFragmentFromSrc(ByVal sSrc As String) As MSXML2.IXMLDOMElement
    Debug.Assert Len(sSrc) > 0
    
    
    Dim sTempFile As String
    sTempFile = TempFile
    
    Dim txtOut As Scripting.TextStream
    Set txtOut = fso.CreateTextFile(sTempFile)
    txtOut.WriteLine sSrc
    txtOut.Close
    
    Set txtOut = Nothing
    
    Dim oDoc As MSXML2.DOMDocument60
    Set oDoc = New MSXML2.DOMDocument60
    oDoc.load sTempFile

    Dim xmlErr As IXMLDOMParseError2
    Set xmlErr = oDoc.parseError

    Debug.Assert oDoc.parseError.ErrorCode = 0
    
    Set SpawnXmlFragmentFromSrc = oDoc.DocumentElement
    


End Function


Public Function BuildComparablesDict(ByVal vKVPs As Variant) As Scripting.Dictionary
    Set BuildComparablesDict = Nothing

    Dim vKVPLoop As Variant
    For Each vKVPLoop In vKVPs
        If Not IsArray(vKVPLoop) Then Exit Function
        mdicComparables.add vKVPLoop(0), vKVPLoop(1)
    Next

    Set BuildComparablesDict = mdicComparables

End Function


Public Function FindSnippetInSource(ByVal sSrc As String) As String
    On Error GoTo ErrHandler
    Debug.Assert Len(sSrc) > 0
    
    
    If Not mdicComparables.exists("nodeName") Then
        Err.Raise vbObjectError, "#programmer must supply nodeName to find both start and end tag!"
    Else
        Dim sNodeName As String
        sNodeName = mdicComparables.Item("nodeName")
        
        If Len(sNodeName) = 0 Then
            Err.Raise vbObjectError, "#programmer must supply non-null nodeName to find both start and end tag!"
        Else
        
            Dim sStartTag As String
            sStartTag = "<" & sNodeName
            
            Dim sEndTag As String
            sEndTag = "</" & sNodeName & ">"
            
            Dim lIndex As Long
            lIndex = 1
            
            Do
                DoEvents
                Dim bOk As Boolean
                bOk = True
                Dim lFindStartTag As Long
                lFindStartTag = VBA.InStr(lIndex, sSrc, sStartTag, vbTextCompare)
                
                If lFindStartTag > 0 Then
                    lIndex = lFindStartTag + 1
                
                    Dim lFindEndTag As Long
                    lFindEndTag = VBA.InStr(lFindStartTag, sSrc, sEndTag, vbTextCompare)
                    
                    If lFindEndTag > 0 Then
                        
                        Dim sSnippet As String
                        sSnippet = Mid$(sSrc, lFindStartTag, lFindEndTag - lFindStartTag + Len(sEndTag))
                        
                        Dim lFindStartTag2 As Long
                        lFindStartTag2 = VBA.InStr(1, sSnippet, ">", vbTextCompare)
                        
                        If lFindStartTag2 > 0 Then
                            Dim sStartTagAndAttributes As String
                            sStartTagAndAttributes = Left$(sSnippet, lFindStartTag2)
                        
                            Dim sAttributesOnly As String
                            sAttributesOnly = Mid$(sStartTagAndAttributes, Len("<" & sStartTag & " "), Len(sStartTagAndAttributes) - Len("<" & sStartTag & " "))
                        
                            Dim vSplitAttributeBlock As Variant
                            vSplitAttributeBlock = SplitStringBySpaceRespectingQuotes(sAttributesOnly)
                            
                            Dim dicMyAttributes As Scripting.Dictionary
                            Set dicMyAttributes = New Scripting.Dictionary
                            
                            Dim vSplitAttributeBlockLoop As Variant
                            For Each vSplitAttributeBlockLoop In vSplitAttributeBlock
                                
                                Dim vSplitAttributeKVP As Variant
                                vSplitAttributeKVP = VBA.Split(vSplitAttributeBlockLoop, "=")
                                
                                Debug.Assert LBound(vSplitAttributeKVP) = 0
                                Debug.Assert UBound(vSplitAttributeKVP) = 1
                                
                                Dim vCleanedValue As Variant
                                vCleanedValue = vSplitAttributeKVP(1)
                                If Left$(vCleanedValue, 1) = """" Then vCleanedValue = Mid$(vCleanedValue, 2)
                                If Right$(vCleanedValue, 1) = """" Then vCleanedValue = Left$(vCleanedValue, Len(vCleanedValue) - 1)
                            
                                dicMyAttributes.add vSplitAttributeKVP(0), vCleanedValue
                            
                            Next vSplitAttributeBlockLoop
                            
                            Dim sStartTagRewritten As String
                            sStartTagRewritten = UCase$(sStartTag) & " "
                            
                            Dim vRewriteLoop As Variant
                            For Each vRewriteLoop In dicMyAttributes
                                sStartTagRewritten = sStartTagRewritten & vRewriteLoop & "=""" & dicMyAttributes(vRewriteLoop) & """ "
                            Next vRewriteLoop
                            
                            sStartTagRewritten = Trim(sStartTagRewritten) & ">"
                            
                            sSnippet = sStartTagRewritten & Mid$(sSnippet, lFindStartTag2 + 1)
                            
                            Dim vComparableKeyLoop As Variant
                            For Each vComparableKeyLoop In mdicComparables.Keys
                            
                                If StrComp(vComparableKeyLoop, "nodeName", vbTextCompare) <> 0 Then
                                    If Not dicMyAttributes.exists(vComparableKeyLoop) Then
                                        bOk = False '* FAIL
                                    Else
                                        
                                        If Not mdicComparables.Item(vComparableKeyLoop) = dicMyAttributes.Item(vComparableKeyLoop) Then
                                            bOk = False '* FAIL
                                        End If
                                
                                    End If
                                End If
                            
                            Next vComparableKeyLoop
                        End If
                    
                    End If
                
                    
                
                End If
                DoEvents
            Loop Until bOk Or lFindStartTag <= 0
            
            
        End If
    
    End If
    
    If bOk Then FindSnippetInSource = sSnippet
SingleExit:
    Exit Function
ErrHandler:
    Stop
    Resume

End Function

Friend Function SplitStringBySpaceRespectingQuotes(ByVal sToBeSplit As String) As Variant


    Dim sQuotesSpacesPlaceHeld As String
    sQuotesSpacesPlaceHeld = ""

    Dim lLoop As Long
    Dim bInQuote As Boolean
    For lLoop = 1 To Len(sToBeSplit)
    
        Dim sChar As String * 1
        sChar = Mid$(sToBeSplit, lLoop, 1)
        
        Dim lAsc As Long
        lAsc = Asc(sChar)
        
        If lAsc = 34 Then
            
            bInQuote = Not bInQuote
            
            sQuotesSpacesPlaceHeld = sQuotesSpacesPlaceHeld & sChar
        ElseIf lAsc = 32 And bInQuote Then
            '* don't copy over the space because it will break out VBA.Split(sFoo," ") logic
            '* instead copy over "<>" which should appear in HTML/XML attributes blocks (they should be <>)
            '* don't forget to replace the "<>" with " " at the end!
            sQuotesSpacesPlaceHeld = sQuotesSpacesPlaceHeld & "<>"
        Else
            sQuotesSpacesPlaceHeld = sQuotesSpacesPlaceHeld & sChar
        End If
        
        
    Next lLoop
    
    Dim vSplit As Variant
    vSplit = VBA.Split(sQuotesSpacesPlaceHeld, " ")
    
    
    For lLoop = LBound(vSplit) To UBound(vSplit)
        vSplit(lLoop) = VBA.Replace(vSplit(lLoop), "<>", " ")
    Next lLoop
    
    SplitStringBySpaceRespectingQuotes = vSplit
    



End Function

Friend Function NodeMatchesComparables(ByVal objNode As Object) As Boolean
    Dim bOk As Boolean
    bOk = True '* good until proven otherwise

    '* has to match all, so need to loop thru all

    Dim vComparableKeyLoop As Variant
    For Each vComparableKeyLoop In mdicComparables.Keys
        
        Dim vCompare As Variant
        vCompare = Empty
        
        If vComparableKeyLoop = "class" Then
            vCompare = objNode.className
        ElseIf vComparableKeyLoop = "nodeName" Then
            vCompare = objNode.nodeName
        Else
            vCompare = objNode.getAttribute(vComparableKeyLoop)
        End If
        
        
                    
        If IsNull(vCompare) Then
            bOk = False
        Else
            If StrComp(vCompare, mdicComparables.Item(vComparableKeyLoop), vbTextCompare) <> 0 Then
                bOk = False
            End If
        End If
        
    Next vComparableKeyLoop
    
    NodeMatchesComparables = bOk
End Function

Public Function FindNextSiblingNodeByAttibutes(ByVal objStartNode As Object) As Object
    Debug.Assert Not objStartNode Is Nothing
    If mdicComparables.Count > 0 Then
            
        Dim objNodeLoop As Object
        Set objNodeLoop = objStartNode.NextSibling
    
        Do
    
            If NodeMatchesComparables(objNodeLoop) Then
                Set FindNextSiblingNodeByAttibutes = objNodeLoop
                Exit Do
            End If
            Set objNodeLoop = objNodeLoop.NextSibling
        
        Loop Until objNodeLoop Is Nothing
        
    End If

End Function

Public Function FindChildNodeByAttibutes(ByVal objStartNode As Object) As Object
    Debug.Assert Not objStartNode Is Nothing
        
    On Error GoTo ErrHandler
    
    If mdicComparables.Count > 0 Then
    
        Dim objChildNodes As Object
        Set objChildNodes = objStartNode.ChildNodes
    
        Dim lChildNodeLoop As Long
        For lChildNodeLoop = 0 To objChildNodes.Length - 1
            
            Dim objChildNodeLoop As Object
            Set objChildNodeLoop = objChildNodes.Item(lChildNodeLoop)

            If NodeMatchesComparables(objChildNodeLoop) Then
                Set FindChildNodeByAttibutes = objChildNodeLoop
                Exit For
            End If
    
        Next lChildNodeLoop
    End If
SingleExit:
    Exit Function
    
ErrHandler:
    Stop
    Resume

End Function



The main code module - contains web page specific logic


Option Explicit

'* Tools->Refernces Microsoft HTML Object Library

Sub TestScrapeProductDetailsFromMainPages()
    Dim wsOvens As Excel.Worksheet
    Set wsOvens = ThisWorkbook.Worksheets.Item("Ovens")

    Dim dicProducts As Scripting.Dictionary
    Set dicProducts = ScrapeProductDetailsFromMainPages

    wsOvens.Cells.clear



    Dim lRowLoop As Long
    lRowLoop = 2


    Dim dicColumnOrdinals As Scripting.Dictionary
    Set dicColumnOrdinals = New Scripting.Dictionary

    Dim vProductLoop As Variant
    For Each vProductLoop In dicProducts.Keys
        Dim dicProductLoop As Scripting.Dictionary
        Set dicProductLoop = dicProducts.Item(vProductLoop)

        Dim vFeatureKeyLoop As Variant
        For Each vFeatureKeyLoop In dicProductLoop
            
            If VBA.InStr(1, "|CollectionUnavailable|HomeDeliveryAvailable|noStock|CollectInStore|OutOfStock|OnlineOnly|", "|" & vFeatureKeyLoop & "|", vbTextCompare) = 0 Then
            
                If Not dicColumnOrdinals.exists(vFeatureKeyLoop) Then
                    dicColumnOrdinals.add vFeatureKeyLoop, dicColumnOrdinals.Count
                End If
            End If
        Next vFeatureKeyLoop
    Next
    
    
    dicColumnOrdinals.add "HomeDeliveryAvailable", dicColumnOrdinals.Count
    dicColumnOrdinals.add "CollectionUnavailable", dicColumnOrdinals.Count
    dicColumnOrdinals.add "noStock", dicColumnOrdinals.Count
    dicColumnOrdinals.add "CollectInStore", dicColumnOrdinals.Count
    dicColumnOrdinals.add "OutOfStock", dicColumnOrdinals.Count
    dicColumnOrdinals.add "OnlineOnly", dicColumnOrdinals.Count

'

    Dim vColOrd As Variant
    For Each vColOrd In dicColumnOrdinals.Keys
        wsOvens.Cells(1, dicColumnOrdinals.Item(vColOrd) + 1).Value2 = vColOrd
    Next
    
    For Each vProductLoop In dicProducts.Keys
        
        Set dicProductLoop = dicProducts.Item(vProductLoop)

        
        For Each vFeatureKeyLoop In dicProductLoop
            If Not dicColumnOrdinals.exists(vFeatureKeyLoop) Then
                wsOvens.Cells(1, dicColumnOrdinals.Count).Value2 = vFeatureKeyLoop

            End If
            wsOvens.Cells(lRowLoop, dicColumnOrdinals.Item(vFeatureKeyLoop) + 1).Value2 = dicProductLoop(vFeatureKeyLoop)
        Next vFeatureKeyLoop




        lRowLoop = lRowLoop + 1
    Next


    FormatOvensRows

End Sub

Function ScrapeProductDetailsFromMainPages() As Scripting.Dictionary

    Dim fso As Scripting.FileSystemObject
    Set fso = CreateObject("Scripting.FileSystemObject")


    Dim sFiles(1 To 2) As String
    sFiles(1) = "N:\COOKER\Built-in double ovens - Cheap Built-in double ovens Deals _ Currys 1_50.html"
    sFiles(2) = "N:\COOKER\Built-in double ovens - Cheap Built-in double ovens Deals _ Currys 2_50.html"


    Debug.Assert fso.FileExists(sFiles(1))
    Debug.Assert fso.FileExists(sFiles(2))

    Dim sSrc(1 To 2) As String
    sSrc(1) = fso.OpenTextFile(sFiles(1)).ReadAll
    sSrc(2) = fso.OpenTextFile(sFiles(2)).ReadAll


     Dim lFileLoop As Long
     For lFileLoop = 1 To 2


        '* Tools->Refernces Microsoft HTML Object Library
        Dim oHtml4 As MSHTML.IHTMLDocument4
        Set oHtml4 = New MSHTML.HTMLDocument

        Dim oHtml As MSHTML.HTMLDocument
        Set oHtml = Nothing

        '* IHTMLDocument4.createDocumentFromUrl
        '* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
        Set oHtml = oHtml4.createDocumentFromUrl(sFiles(lFileLoop), "")

        '* need to wait a little whilst the document parses
        '* because it is multithreaded
        While oHtml.readyState <> "complete"
            DoEvents  '* do not comment this out it is required to break into the code if in infinite loop
        Wend
        Debug.Assert oHtml.readyState = "complete"


        Dim sTest As String
        sTest = Left$(oHtml.body.outerHTML, 100)
        Debug.Assert Len(Trim(sTest)) > 50  '* just testing we got a substantial block of text, feel free to delete

        '* this is where the page specific logic now goes, here I am getting info from a StackOverflow page
        Dim htmlAnswers As Object 'MSHTML.DispHTMLElementCollection
        Set htmlAnswers = oHtml.getElementsByClassName("result-prd")

        Dim htmlProductListImages As Object
        Set htmlProductListImages = oHtml.getElementsByClassName("productListImage")


        Dim htmlDescs As Object
        Set htmlDescs = oHtml.getElementsByClassName("product-desc")

        Dim htmlAnchorIns As Object
        Set htmlAnchorIns = oHtml.querySelectorAll("a.in")


        Dim htmlDivsMainAmount As Object
        Set htmlDivsMainAmount = oHtml.querySelectorAll("div.main-amount")

        Dim htmlChannels As Object
        Set htmlChannels = oHtml.querySelectorAll("ul.simple.prd-channels")

        Dim htmlPromoMessages As Object
        Set htmlPromoMessages = oHtml.querySelectorAll("div.promoMessages")

        Debug.Assert htmlAnswers.Length = htmlPromoMessages.Length
        Debug.Assert htmlAnswers.Length = htmlChannels.Length
        Debug.Assert htmlAnswers.Length = htmlDivsMainAmount.Length
        Debug.Assert htmlAnswers.Length = htmlProductListImages.Length
        Debug.Assert htmlAnswers.Length = htmlDescs.Length
        Debug.Assert htmlAnswers.Length = htmlAnchorIns.Length


        Dim dicProducts As New Scripting.Dictionary


        Dim lAnswerLoop As Long
        For lAnswerLoop = 0 To htmlAnswers.Length - 1
            Dim dicProductLoop As Scripting.Dictionary
            Set dicProductLoop = New Scripting.Dictionary


            '* each grid cell has its own article
            dicProductLoop.add "articleId", htmlAnswers.Item(lAnswerLoop).getAttribute("id")

            '* grab pictures details like src,alt and CSStext
            AddPictureDetails dicProductLoop, htmlProductListImages.Item(lAnswerLoop)


            '* grab brand and product name
            AddBrandAndName dicProductLoop, htmlAnchorIns.Item(lAnswerLoop)

            '* grab review score (out of 10) and total reviews
            AddRevoo dicProductLoop, htmlAnchorIns.Item(lAnswerLoop)

            '* grab price now, and optionally price befor and saving
            AddPrices dicProductLoop, htmlDivsMainAmount.Item(lAnswerLoop)

            '* check availability, delivery, stock on order etc
            AddChannels dicProductLoop, htmlChannels.Item(lAnswerLoop)
 
            '* get online only
            AddPromoMessages dicProductLoop, htmlPromoMessages.Item(lAnswerLoop)

            '* build a dictionary of dictionaries
            dicProducts.add dicProducts.Count, dicProductLoop
            
            

            Set dicProductLoop = Nothing
            Debug.Print lAnswerLoop
        Next
    Next lFileLoop
    Set ScrapeProductDetailsFromMainPages = dicProducts
End Function

Private Function AddPromoMessages(ByVal dicProductLoop As Scripting.Dictionary, _
            ByVal htmlDivPromoMessages As MSHTML.HTMLDivElement)
            
    Dim oOnLineOnly As MSHTMLComparables
    Set oOnLineOnly = New MSHTMLComparables

    oOnLineOnly.BuildComparablesDict Array(Array("nodeName", "span"), Array("class", "label-online-only"))
    If InStr(1, htmlDivPromoMessages.outerHTML, "label-online-only", vbTextCompare) > 0 Then
        Dim sOnlineOnlySnippet As String
        sOnlineOnlySnippet = oOnLineOnly.FindSnippetInSource(htmlDivPromoMessages.innerHTML)
        
        If Len(sOnlineOnlySnippet) > 0 Then
        
            Dim xmlOnlineOnly As Object
            Set xmlOnlineOnly = oOnLineOnly.SpawnXmlFragmentFromSrc(sOnlineOnlySnippet)
            
            If Not xmlOnlineOnly Is Nothing Then
            
                dicProductLoop.add "OnlineOnly", xmlOnlineOnly.nodeTypedValue
            End If
            
            'Stop
        End If
    End If
            
End Function


Private Function AddChannels(ByVal dicProductLoop As Scripting.Dictionary, _
            ByVal htmlChannelList As MSHTML.HTMLListElement)

    Dim lLoop As Long
    For lLoop = 0 To htmlChannelList.ChildNodes.Length - 1
        Dim objChildNode As Object
        Set objChildNode = htmlChannelList.ChildNodes.Item(lLoop)

        Debug.Assert StrComp(objChildNode.nodeName, "li", vbTextCompare) = 0
        Dim listItem As MSHTML.HTMLLIElement
        Set listItem = objChildNode

        If listItem.className = "nostock" Then

            Dim vNoStock As Variant
            Dim vUnavailable As Variant
            Dim vCollectionUnavailable As Variant
            Dim vOutOfStock As Variant
            Dim vHomeDeliveryAvailable As Variant
            Dim vCollectInStore As Variant


            Dim lListItemChildLoop As Long
            For lListItemChildLoop = 0 To listItem.ChildNodes.Length - 1

                Dim objListItemChildLoop As Object
                Set objListItemChildLoop = listItem.ChildNodes.Item(lListItemChildLoop)
                If objListItemChildLoop.nodeName = "#text" Then
                    vNoStock = vNoStock & Trim(objListItemChildLoop.data) & ". "

                ElseIf StrComp(objListItemChildLoop.nodeName, "SPAN", vbTextCompare) = 0 Then
                    If objListItemChildLoop.className = "email-when-back" Then
                        vNoStock = vNoStock & "Email me when back. "
                    Else
                        Stop
                    End If
                    'Stop
                ElseIf StrComp(objListItemChildLoop.nodeName, "I", vbTextCompare) = 0 Then
                    '*ignore
                Else
                    Stop
                End If


            Next lListItemChildLoop

        ElseIf listItem.className = "available" Then
            If listItem.getAttribute("data-availability") = "homeDeliveryAvailable" Then
            
                Dim objHomedeliveryAvailChildLoop As Object
                For Each objHomedeliveryAvailChildLoop In listItem.ChildNodes
                    If objHomedeliveryAvailChildLoop.nodeName = "#text" Then
                        vHomeDeliveryAvailable = vHomeDeliveryAvailable & objHomedeliveryAvailChildLoop.data & ". "
                    End If
                
                Next
            ElseIf listItem.getAttribute("data-availability") = "collectInStoreUnavailable" Then
                Dim objCollectInStoreUnAvailableLoop2 As Object
                For Each objCollectInStoreUnAvailableLoop2 In listItem.ChildNodes
                    If objCollectInStoreUnAvailableLoop2.nodeName = "#text" Then
                        vCollectionUnavailable = vCollectionUnavailable & objCollectInStoreUnAvailableLoop2.data & ". "
                    End If
                
                Next
                
            ElseIf listItem.getAttribute("data-availability") = "collectInStoreAvailable" Then
                Dim objCollectInStoreAvailableLoop As Object
                For Each objCollectInStoreAvailableLoop In listItem.ChildNodes
                    If objCollectInStoreAvailableLoop.nodeName = "#text" Then
                        vCollectInStore = vCollectInStore & objCollectInStoreAvailableLoop.data & ". "
                    End If
                
                Next
                'Stop
            Else
                Stop
            End If
            'Stop

        ElseIf listItem.className = "unavailable" Then
            If listItem.getAttribute("data-availability") = "collectInStoreUnavailable" Then
                Dim objCollectInStoreUnAvailableLoop1 As Object
                For Each objCollectInStoreUnAvailableLoop1 In listItem.ChildNodes
                    If objCollectInStoreUnAvailableLoop1.nodeName = "#text" Then
                        vCollectionUnavailable = vCollectionUnavailable & objCollectInStoreUnAvailableLoop1.data & ". "
                    End If
                
                Next
                
            
            
                
            ElseIf listItem.getAttribute("data-availability") = "homeDeliveryUnavailable" Then
                Dim objHomeDeliveryUnavailableLoop As Object
                For Each objHomeDeliveryUnavailableLoop In listItem.ChildNodes
                    If objHomeDeliveryUnavailableLoop.nodeName = "#text" Then
                        vOutOfStock = vOutOfStock & objHomeDeliveryUnavailableLoop.data & ". "
                    End If
                
                Next
                
            Else
                Stop
            End If
            'Stop

        Else
            'If listItem.className = "available"
            Stop
        End If

    Next lLoop

    If Not IsEmpty(vHomeDeliveryAvailable) Then
        dicProductLoop.add "HomeDeliveryAvailable", Trim(vHomeDeliveryAvailable)
    End If

    If Not IsEmpty(vNoStock) Then
        dicProductLoop.add "noStock", Trim(vNoStock)
    End If

    If Not IsEmpty(vCollectionUnavailable) Then
        dicProductLoop.add "CollectionUnavailable", Trim(vCollectionUnavailable)
    End If
    
    If Not IsEmpty(vOutOfStock) Then
        dicProductLoop.add "OutOfStock", Trim(vOutOfStock)
    End If

    If Not IsEmpty(vCollectInStore) Then
        dicProductLoop.add "CollectInStore", Trim(vCollectInStore)
    End If

End Function

Private Function AddPrices(ByVal dicProductLoop As Scripting.Dictionary, _
            ByVal htmlDivMainAmount As MSHTML.HTMLDivElement)



    Dim oPrices As MSHTMLComparables
    Set oPrices = New MSHTMLComparables

    oPrices.BuildComparablesDict Array(Array("nodeName", "strong"), Array("class", "price"), Array("data-product", "price"))

    Dim sSnippet As String
    sSnippet = oPrices.FindSnippetInSource(htmlDivMainAmount.innerHTML)

    sSnippet = VBA.Replace(sSnippet, "£", "£")
    sSnippet = VBA.Replace(sSnippet, "£", "")

    Dim xmlMainPriceSnippet As Object
    Set xmlMainPriceSnippet = oPrices.SpawnXmlFragmentFromSrc(sSnippet)

    Dim vPriceNow As Variant
    vPriceNow = xmlMainPriceSnippet.nodeTypedValue
    
    Debug.Assert IsNumeric(vPriceNow)

    dicProductLoop.add "priceNow", CCur(vPriceNow)

    Dim xmlSpanPastAmount As MSHTML.HTMLSpanElement
    Dim xmlStrongSaving As MSHTML.HTMLPhraseElement

    Dim lChildLoop As Long
    For lChildLoop = 0 To htmlDivMainAmount.ChildNodes.Length - 1

        Dim objChildLoop As Object
        Set objChildLoop = htmlDivMainAmount.ChildNodes.Item(lChildLoop)

        If StrComp(objChildLoop.nodeName, "#text", vbTextCompare) = 0 Then
            '* empty text
        ElseIf StrComp(objChildLoop.nodeName, "strong", vbTextCompare) = 0 Then
            If objChildLoop.className = "saving" Then
                Set xmlStrongSaving = objChildLoop
            End If

        ElseIf StrComp(objChildLoop.nodeName, "span", vbTextCompare) = 0 Then
            Set xmlSpanPastAmount = objChildLoop
        End If

    Next lChildLoop

    If Not xmlSpanPastAmount Is Nothing Then
        If xmlSpanPastAmount.ChildNodes.Length > 0 Then
            dicProductLoop.add "pastAmount", xmlSpanPastAmount.FirstChild.innerText
        End If
    End If

    If Not xmlStrongSaving Is Nothing Then
        dicProductLoop.add "saving", xmlStrongSaving.innerText
    End If

End Function


Private Function AddRevoo(ByVal dicProductLoop As Scripting.Dictionary, _
            ByVal aProductDescLoop As MSHTML.HTMLAnchorElement)

    Static oRevooDiv As MSHTMLComparables
    If oRevooDiv Is Nothing Then
        Set oRevooDiv = New MSHTMLComparables
        oRevooDiv.BuildComparablesDict Array(Array("nodeName", "div"), Array("class", "reevoo-placeholder"))
    End If

    Dim sSnippet As String
    sSnippet = oRevooDiv.FindSnippetInSource(aProductDescLoop.outerHTML)

    Dim xmlRevooDiv As MSXML2.IXMLDOMElement
    Set xmlRevooDiv = oRevooDiv.SpawnXmlFragmentFromSrc(sSnippet)

    If Not xmlRevooDiv Is Nothing Then

        If xmlRevooDiv.LastChild.nodeTypedValue = "No reviews yet (0)" Then
            dicProductLoop.add "totalReviews", 0
        Else
            dicProductLoop.add "totalReviews", Application.Evaluate(xmlRevooDiv.LastChild.nodeTypedValue)

            Dim xmlClass As MSXML2.IXMLDOMAttribute
            Set xmlClass = xmlRevooDiv.FirstChild.Attributes.getNamedItem("class")

            Debug.Assert Left$(xmlClass.text, Len("reevoo-score score-")) = "reevoo-score score-"
            Dim vScore As Variant
            vScore = Mid$(xmlClass.text, Len("reevoo-score score-") + 1)
            Debug.Assert IsNumeric(vScore)
            dicProductLoop.add "revooScore", CDbl(vScore)
        End If

    End If



End Function

Private Function AddBrandAndName(ByVal dicProductLoop As Scripting.Dictionary, _
            ByVal aProductDescLoop As MSHTML.HTMLAnchorElement)

    Static oBrandSpan As MSHTMLComparables
    If oBrandSpan Is Nothing Then
        Set oBrandSpan = New MSHTMLComparables
        oBrandSpan.BuildComparablesDict Array(Array("nodeName", "span"), Array("data-product", "brand"))
    End If

    Dim xmlBrandSpan As MSXML2.IXMLDOMElement
    Set xmlBrandSpan = oBrandSpan.SpawnXmlFragmentFromSrc(oBrandSpan.FindSnippetInSource(aProductDescLoop.outerHTML))

    If Not xmlBrandSpan Is Nothing Then
        dicProductLoop.add "brand", xmlBrandSpan.nodeTypedValue
    End If

    Static oNameSpan As MSHTMLComparables
    If oNameSpan Is Nothing Then
        Set oNameSpan = New MSHTMLComparables
        oNameSpan.BuildComparablesDict Array(Array("nodeName", "span"), Array("data-product", "name"))
    End If

    Dim xmlNameSpan As MSXML2.IXMLDOMElement
    Set xmlNameSpan = oNameSpan.SpawnXmlFragmentFromSrc(oNameSpan.FindSnippetInSource(aProductDescLoop.outerHTML))

    If Not xmlNameSpan Is Nothing Then
        dicProductLoop.add "name", xmlNameSpan.nodeTypedValue
    End If


End Function

Private Function AddPictureDetails(ByVal dicProduct As Scripting.Dictionary, _
            ByVal divProductListImageLoop As MSHTML.HTMLDivElement)


    Dim anchor As MSHTML.HTMLAnchorElement
    Set anchor = divProductListImageLoop.FirstChild



    Dim divProductImages As MSHTML.HTMLDivElement
    Set divProductImages = anchor.FirstChild

    Dim objStyle As Object
    Set objStyle = divProductImages.getAttribute("style")

    dicProduct.add "imageCssText", objStyle.getAttribute("cssText")

    Dim imgProductImages As MSHTML.HTMLImg
    Set imgProductImages = divProductImages.FirstChild

    If imgProductImages.nodeName = "DIV" Then
        Set imgProductImages = imgProductImages.NextSibling
    End If

    Debug.Assert imgProductImages.className = "image"
    dicProduct.add "imageSrc", imgProductImages.getAttribute("src")
    dicProduct.add "imageAlt", imgProductImages.getAttribute("alt")
    Debug.Assert Len(imgProductImages.getAttribute("alt")) > 0

End Function


Sub TestSplitAttribsOnSpacesRespectingQuotes()

    Dim o As MSHTMLComparables
    Set o = New MSHTMLComparables

    Dim v
    v = o.SplitStringBySpaceRespectingQuotes("class=""bold italic underline"" id=""45""")
    Debug.Assert v(0) = "class=""bold italic underline"""
    Debug.Assert v(1) = "id=""45"""

End Sub

Sub FormatOvensRows()
    Dim wsOvens As Excel.Worksheet
    Set wsOvens = ThisWorkbook.Worksheets.Item("Ovens")
    wsOvens.Cells.ClearFormats

    Dim rowLoop As Excel.Range
    For Each rowLoop In wsOvens.UsedRange.rows
        If rowLoop.row > 1 Then
            DoEvents
            Dim lRow As Long
            lRow = rowLoop.row - 2
    
            Dim lRowMod50 As Long
            lRowMod50 = lRow Mod 50
    
            Dim lRowMod8 As Long
            lRowMod8 = lRowMod50 Mod 8
    
            Debug.Print lRowMod8
    
            If lRowMod8 \ 4 = 0 Then
                rowLoop.Interior.Color = rgbAliceBlue
    
            End If

        End If
    Next
    
    wsOvens.Names("Me").RefersToR1C1 = "=Ovens!RC"
    
    
    
    Dim rngDataRows As Excel.Range
    Set rngDataRows = wsOvens.UsedRange.Offset(1, 0).Resize(wsOvens.UsedRange.rows.Count - 1)
    
    Dim rngPriceNo As Excel.Range
    Set rngPriceNo = rngDataRows.columns("I")
    
    Dim lNoStockOffset As Long
    lNoStockOffset = Application.Evaluate("MATCH(""noStock"",1:1,0)-MATCH(""priceNow"",1:1,0)")
    
    Dim lOutOfStockOffset As Long
    lOutOfStockOffset = Application.Evaluate("MATCH(""outOfStock"",1:1,0)-MATCH(""priceNow"",1:1,0)")
    
    With rngPriceNo.FormatConditions.add( _
        Type:=xlExpression, _
        Formula1:="=(len(offset(me,0," & lNoStockOffset & "))+len(offset(me,0," & lOutOfStockOffset & ")))>0")
        
        .Font.Color = rgbGrey
    
    End With
    

    With rngPriceNo.Font
        .Color = -16776961
        .TintAndShade = 0
        .name = "Calibri"
        .FontStyle = "Bold"
        .size = 11
        .Strikethrough = False
        .Superscript = False
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .Color = -16776961
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
    
    rngPriceNo.NumberFormat = "$#,##0.00"


    With rngDataRows.columns("J").Font
        .name = "Calibri"
        .FontStyle = "Regular"
        .size = 11
        .Strikethrough = False
        .Superscript = False
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorDark1
        .TintAndShade = -0.349986267
        .ThemeFont = xlThemeFontMinor
    End With


    With rngDataRows.columns("K").Font
        .name = "Calibri"
        .FontStyle = "Bold"
        .size = 11
        .Strikethrough = False
        .Superscript = False
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With


End Sub




So here is a screenshot of the output in Excel worksheet

And here is screenshot of part of the original web page

No comments:

Post a Comment