So webscraping is a task that can sometimes be a breeze and other times a pain in the neck. Without a shadow of a doubt the problem of mal-formed HTML where <BR>,<IMG> and <P> tags are not closed thus compromising the structure of the document is one of the most lamentable departures from a standard I can think of. If the tags were closed properly then one could have a sporting chance of running a document through an Xml parser. There was a push to get HTML to be XML compliant with XHTML but it failed. Along with unclosed tags, an Xml parser will choke on textual attributes values without quotes and also characters that need to be entitised like & -> &amp;. If it would parse successfully with Xml then XPath would be the way to navigate the document.
In the absence of Xml nirvana, other strategies have emerged. The prevalent use of JQuery and CSS means the web page developers structure the page so that their Javascript selects nodes based on CSS selectors. We can do this in MSHTML (the type library whose full name is 'Microsoft HTML Object Library') with querySelectorAll and querySelector.
Even so, I find MSHTML to be buggy and querySelectorAll and querySelector don't always work. So I find myself writing code to loop through child nodes and next siblings etc. I also write code to test the node name and the attribtutes of each node. In short, I end up writing some helper functions which I hope I will re-use. I will post code here. But I will also post some findings which are new (to me). I've even ended up writing some good old fashioned string manipulation to excise a snippet of code. I've also written code to allow compatible snippets to be loaded into a mini Xml document. All these are given below in my MSHTMLComparables class.
The task - scrape details of ovens
So we need a new oven and I need to get some details from a local shop's website. The code is given below, I saved the web pages off manually to save network traffic. I am experienced in this but one of two new (to me) things emerged.
MSHTML parses HTML5 <Article> tags as type MSHTML.HTMLUnknownElement
So HTML5 introduces new tags, and until the MSHTML library gets updates it looks like the new tags will be reported as being of type HTMLUnknownElement. So some methods we're not available to hand. Can be worked around fine.
MSHTML if querying style attribute for CSS source use CssText
So HTML elements have attributes, attributes are key value pairs. One such attribute found on an element is the Style element, and it is curious that the value of this attribute is itself a key value pair collection. I wanted to get the full style value but I ended up having to query for the specific CSS name, i.e. Height. Type library browser shows two promising set of methods, getAttribute, setAttribute, removeAttribute and getExpression, setExpression, removeExpression. Looking at the documentation remarks the xxxExpression methods are for the IHTMLDocument2::expando properties and added after the original xxxAttribute methods.
I used objStyle.getAttribute("height") because I knew it existed. I wanted the whole CSS source but the method toString gives only [object] which is a JavaScript object's default string representation. I know a thing or two about JavaScript objects and there string representation in VBA and borrowed my own SO code (below) to investigate but the code generates 150 properties and not the one or two that I was expecting.
...
Dim objStyle As Object
Set objStyle = divProductImagesLoop.getAttribute("style")
Dim dicStyleAttrib As Scripting.Dictionary
Set dicStyleAttrib = New Scripting.Dictionary
Call GetScriptEngine.Run("enumerateKeys", objStyle, dicStyleAttrib) '* this gives JSON rendering instead of "[object Object]"
...
Private Function GetScriptEngine() As ScriptControl
'* see code from this SO Q & A
' https://stackoverflow.com/questions/37711073/in-excel-vba-on-windows-how-to-get-stringified-json-respresentation-instead-of
Static soScriptEngine As ScriptControl
If soScriptEngine Is Nothing Then
Set soScriptEngine = New ScriptControl
soScriptEngine.Language = "JScript"
soScriptEngine.AddCode "function enumerateKeys(jsonObj, microsoftDict) { " & _
"for (var key in jsonObj) { " & _
"microsoftDict.Add(key, jsonObj[key]); " & _
"}}"
End If
Set GetScriptEngine = soScriptEngine
End Function
Anyway, it turns out the that there is a CssText property which I think is sparsely documented, here is a non-Microsoft blog post.
MSHTMLComparables Class - beginnings of a reusable class for web scraping tasks (I hope)
Option Explicit
Private mdicComparables As New Scripting.Dictionary
Private fso As New Scripting.FileSystemObject
Private Declare Function GetTempFileName Lib "kernel32.dll" Alias "GetTempFileNameA" (ByVal lpszPath As String, _
ByVal lpPrefixString As String, ByVal wUnique As Long, ByRef lpTempFileName As String) As Long
Friend Function HashOfHash() As Long
Dim vPrimes
vPrimes = Array(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
Dim lRet As Long
Dim lIndex As Long
Dim vKeyLoop As Variant
For Each vKeyLoop In mdicComparables.Keys
lRet = lRet + (mdicComparables.HashVal(vKeyLoop) * vPrimes(lIndex))
lIndex = lIndex + 1
Next vKeyLoop
Dim vItemLoop
For Each vItemLoop In mdicComparables.Items
lRet = lRet + (mdicComparables.HashVal(vKeyLoop) * vPrimes(lIndex))
lIndex = lIndex + 1
Next vItemLoop
HashOfHash = lRet
End Function
Friend Function TempFile() As String
TempFile = fso.BuildPath(VBA.Environ$("TMP"), "TempFile" & HashOfHash & ".html")
End Function
Friend Function SpawnHTMLFragmentFromSrc(ByVal sSrc As String) As Object
Debug.Assert Len(sSrc) > 0
Dim sTempFile As String
sTempFile = TempFile
Dim txtOut As Scripting.TextStream
Set txtOut = fso.CreateTextFile(sTempFile)
txtOut.WriteLine "<html>"
txtOut.WriteLine "<head>"
txtOut.WriteLine "<title>Temporary work file created by " & ThisWorkbook.Path & "</title>"
txtOut.WriteLine "</head>"
txtOut.WriteLine "<body>"
txtOut.WriteLine sSrc
txtOut.WriteLine "</body>"
txtOut.WriteLine "</html>"
txtOut.Close
Set txtOut = Nothing
Dim oHtml4 As MSHTML.IHTMLDocument4
Set oHtml4 = New MSHTML.HTMLDocument
Dim oHtml As MSHTML.HTMLDocument
Set oHtml = Nothing
'* IHTMLDocument4.createDocumentFromUrl
'* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
Set oHtml = oHtml4.createDocumentFromUrl(sTempFile, "")
While oHtml.readyState <> "complete"
DoEvents '* do not comment this out it is required to break into the code if in infinite loop
Wend
Debug.Assert oHtml.readyState = "complete"
Dim htmlBody As MSHTML.htmlBody
Set htmlBody = oHtml.querySelector("body")
Set SpawnHTMLFragmentFromSrc = htmlBody.FirstChild
Set oHtml4 = Nothing
Set oHtml = Nothing
Set htmlBody = Nothing
End Function
Friend Function SpawnXmlFragmentFromSrc(ByVal sSrc As String) As MSXML2.IXMLDOMElement
Debug.Assert Len(sSrc) > 0
Dim sTempFile As String
sTempFile = TempFile
Dim txtOut As Scripting.TextStream
Set txtOut = fso.CreateTextFile(sTempFile)
txtOut.WriteLine sSrc
txtOut.Close
Set txtOut = Nothing
Dim oDoc As MSXML2.DOMDocument60
Set oDoc = New MSXML2.DOMDocument60
oDoc.load sTempFile
Dim xmlErr As IXMLDOMParseError2
Set xmlErr = oDoc.parseError
Debug.Assert oDoc.parseError.ErrorCode = 0
Set SpawnXmlFragmentFromSrc = oDoc.DocumentElement
End Function
Public Function BuildComparablesDict(ByVal vKVPs As Variant) As Scripting.Dictionary
Set BuildComparablesDict = Nothing
Dim vKVPLoop As Variant
For Each vKVPLoop In vKVPs
If Not IsArray(vKVPLoop) Then Exit Function
mdicComparables.add vKVPLoop(0), vKVPLoop(1)
Next
Set BuildComparablesDict = mdicComparables
End Function
Public Function FindSnippetInSource(ByVal sSrc As String) As String
On Error GoTo ErrHandler
Debug.Assert Len(sSrc) > 0
If Not mdicComparables.exists("nodeName") Then
Err.Raise vbObjectError, "#programmer must supply nodeName to find both start and end tag!"
Else
Dim sNodeName As String
sNodeName = mdicComparables.Item("nodeName")
If Len(sNodeName) = 0 Then
Err.Raise vbObjectError, "#programmer must supply non-null nodeName to find both start and end tag!"
Else
Dim sStartTag As String
sStartTag = "<" & sNodeName
Dim sEndTag As String
sEndTag = "</" & sNodeName & ">"
Dim lIndex As Long
lIndex = 1
Do
DoEvents
Dim bOk As Boolean
bOk = True
Dim lFindStartTag As Long
lFindStartTag = VBA.InStr(lIndex, sSrc, sStartTag, vbTextCompare)
If lFindStartTag > 0 Then
lIndex = lFindStartTag + 1
Dim lFindEndTag As Long
lFindEndTag = VBA.InStr(lFindStartTag, sSrc, sEndTag, vbTextCompare)
If lFindEndTag > 0 Then
Dim sSnippet As String
sSnippet = Mid$(sSrc, lFindStartTag, lFindEndTag - lFindStartTag + Len(sEndTag))
Dim lFindStartTag2 As Long
lFindStartTag2 = VBA.InStr(1, sSnippet, ">", vbTextCompare)
If lFindStartTag2 > 0 Then
Dim sStartTagAndAttributes As String
sStartTagAndAttributes = Left$(sSnippet, lFindStartTag2)
Dim sAttributesOnly As String
sAttributesOnly = Mid$(sStartTagAndAttributes, Len("<" & sStartTag & " "), Len(sStartTagAndAttributes) - Len("<" & sStartTag & " "))
Dim vSplitAttributeBlock As Variant
vSplitAttributeBlock = SplitStringBySpaceRespectingQuotes(sAttributesOnly)
Dim dicMyAttributes As Scripting.Dictionary
Set dicMyAttributes = New Scripting.Dictionary
Dim vSplitAttributeBlockLoop As Variant
For Each vSplitAttributeBlockLoop In vSplitAttributeBlock
Dim vSplitAttributeKVP As Variant
vSplitAttributeKVP = VBA.Split(vSplitAttributeBlockLoop, "=")
Debug.Assert LBound(vSplitAttributeKVP) = 0
Debug.Assert UBound(vSplitAttributeKVP) = 1
Dim vCleanedValue As Variant
vCleanedValue = vSplitAttributeKVP(1)
If Left$(vCleanedValue, 1) = """" Then vCleanedValue = Mid$(vCleanedValue, 2)
If Right$(vCleanedValue, 1) = """" Then vCleanedValue = Left$(vCleanedValue, Len(vCleanedValue) - 1)
dicMyAttributes.add vSplitAttributeKVP(0), vCleanedValue
Next vSplitAttributeBlockLoop
Dim sStartTagRewritten As String
sStartTagRewritten = UCase$(sStartTag) & " "
Dim vRewriteLoop As Variant
For Each vRewriteLoop In dicMyAttributes
sStartTagRewritten = sStartTagRewritten & vRewriteLoop & "=""" & dicMyAttributes(vRewriteLoop) & """ "
Next vRewriteLoop
sStartTagRewritten = Trim(sStartTagRewritten) & ">"
sSnippet = sStartTagRewritten & Mid$(sSnippet, lFindStartTag2 + 1)
Dim vComparableKeyLoop As Variant
For Each vComparableKeyLoop In mdicComparables.Keys
If StrComp(vComparableKeyLoop, "nodeName", vbTextCompare) <> 0 Then
If Not dicMyAttributes.exists(vComparableKeyLoop) Then
bOk = False '* FAIL
Else
If Not mdicComparables.Item(vComparableKeyLoop) = dicMyAttributes.Item(vComparableKeyLoop) Then
bOk = False '* FAIL
End If
End If
End If
Next vComparableKeyLoop
End If
End If
End If
DoEvents
Loop Until bOk Or lFindStartTag <= 0
End If
End If
If bOk Then FindSnippetInSource = sSnippet
SingleExit:
Exit Function
ErrHandler:
Stop
Resume
End Function
Friend Function SplitStringBySpaceRespectingQuotes(ByVal sToBeSplit As String) As Variant
Dim sQuotesSpacesPlaceHeld As String
sQuotesSpacesPlaceHeld = ""
Dim lLoop As Long
Dim bInQuote As Boolean
For lLoop = 1 To Len(sToBeSplit)
Dim sChar As String * 1
sChar = Mid$(sToBeSplit, lLoop, 1)
Dim lAsc As Long
lAsc = Asc(sChar)
If lAsc = 34 Then
bInQuote = Not bInQuote
sQuotesSpacesPlaceHeld = sQuotesSpacesPlaceHeld & sChar
ElseIf lAsc = 32 And bInQuote Then
'* don't copy over the space because it will break out VBA.Split(sFoo," ") logic
'* instead copy over "<>" which should appear in HTML/XML attributes blocks (they should be <>)
'* don't forget to replace the "<>" with " " at the end!
sQuotesSpacesPlaceHeld = sQuotesSpacesPlaceHeld & "<>"
Else
sQuotesSpacesPlaceHeld = sQuotesSpacesPlaceHeld & sChar
End If
Next lLoop
Dim vSplit As Variant
vSplit = VBA.Split(sQuotesSpacesPlaceHeld, " ")
For lLoop = LBound(vSplit) To UBound(vSplit)
vSplit(lLoop) = VBA.Replace(vSplit(lLoop), "<>", " ")
Next lLoop
SplitStringBySpaceRespectingQuotes = vSplit
End Function
Friend Function NodeMatchesComparables(ByVal objNode As Object) As Boolean
Dim bOk As Boolean
bOk = True '* good until proven otherwise
'* has to match all, so need to loop thru all
Dim vComparableKeyLoop As Variant
For Each vComparableKeyLoop In mdicComparables.Keys
Dim vCompare As Variant
vCompare = Empty
If vComparableKeyLoop = "class" Then
vCompare = objNode.className
ElseIf vComparableKeyLoop = "nodeName" Then
vCompare = objNode.nodeName
Else
vCompare = objNode.getAttribute(vComparableKeyLoop)
End If
If IsNull(vCompare) Then
bOk = False
Else
If StrComp(vCompare, mdicComparables.Item(vComparableKeyLoop), vbTextCompare) <> 0 Then
bOk = False
End If
End If
Next vComparableKeyLoop
NodeMatchesComparables = bOk
End Function
Public Function FindNextSiblingNodeByAttibutes(ByVal objStartNode As Object) As Object
Debug.Assert Not objStartNode Is Nothing
If mdicComparables.Count > 0 Then
Dim objNodeLoop As Object
Set objNodeLoop = objStartNode.NextSibling
Do
If NodeMatchesComparables(objNodeLoop) Then
Set FindNextSiblingNodeByAttibutes = objNodeLoop
Exit Do
End If
Set objNodeLoop = objNodeLoop.NextSibling
Loop Until objNodeLoop Is Nothing
End If
End Function
Public Function FindChildNodeByAttibutes(ByVal objStartNode As Object) As Object
Debug.Assert Not objStartNode Is Nothing
On Error GoTo ErrHandler
If mdicComparables.Count > 0 Then
Dim objChildNodes As Object
Set objChildNodes = objStartNode.ChildNodes
Dim lChildNodeLoop As Long
For lChildNodeLoop = 0 To objChildNodes.Length - 1
Dim objChildNodeLoop As Object
Set objChildNodeLoop = objChildNodes.Item(lChildNodeLoop)
If NodeMatchesComparables(objChildNodeLoop) Then
Set FindChildNodeByAttibutes = objChildNodeLoop
Exit For
End If
Next lChildNodeLoop
End If
SingleExit:
Exit Function
ErrHandler:
Stop
Resume
End Function
The main code module - contains web page specific logic
Option Explicit
'* Tools->Refernces Microsoft HTML Object Library
Sub TestScrapeProductDetailsFromMainPages()
Dim wsOvens As Excel.Worksheet
Set wsOvens = ThisWorkbook.Worksheets.Item("Ovens")
Dim dicProducts As Scripting.Dictionary
Set dicProducts = ScrapeProductDetailsFromMainPages
wsOvens.Cells.clear
Dim lRowLoop As Long
lRowLoop = 2
Dim dicColumnOrdinals As Scripting.Dictionary
Set dicColumnOrdinals = New Scripting.Dictionary
Dim vProductLoop As Variant
For Each vProductLoop In dicProducts.Keys
Dim dicProductLoop As Scripting.Dictionary
Set dicProductLoop = dicProducts.Item(vProductLoop)
Dim vFeatureKeyLoop As Variant
For Each vFeatureKeyLoop In dicProductLoop
If VBA.InStr(1, "|CollectionUnavailable|HomeDeliveryAvailable|noStock|CollectInStore|OutOfStock|OnlineOnly|", "|" & vFeatureKeyLoop & "|", vbTextCompare) = 0 Then
If Not dicColumnOrdinals.exists(vFeatureKeyLoop) Then
dicColumnOrdinals.add vFeatureKeyLoop, dicColumnOrdinals.Count
End If
End If
Next vFeatureKeyLoop
Next
dicColumnOrdinals.add "HomeDeliveryAvailable", dicColumnOrdinals.Count
dicColumnOrdinals.add "CollectionUnavailable", dicColumnOrdinals.Count
dicColumnOrdinals.add "noStock", dicColumnOrdinals.Count
dicColumnOrdinals.add "CollectInStore", dicColumnOrdinals.Count
dicColumnOrdinals.add "OutOfStock", dicColumnOrdinals.Count
dicColumnOrdinals.add "OnlineOnly", dicColumnOrdinals.Count
'
Dim vColOrd As Variant
For Each vColOrd In dicColumnOrdinals.Keys
wsOvens.Cells(1, dicColumnOrdinals.Item(vColOrd) + 1).Value2 = vColOrd
Next
For Each vProductLoop In dicProducts.Keys
Set dicProductLoop = dicProducts.Item(vProductLoop)
For Each vFeatureKeyLoop In dicProductLoop
If Not dicColumnOrdinals.exists(vFeatureKeyLoop) Then
wsOvens.Cells(1, dicColumnOrdinals.Count).Value2 = vFeatureKeyLoop
End If
wsOvens.Cells(lRowLoop, dicColumnOrdinals.Item(vFeatureKeyLoop) + 1).Value2 = dicProductLoop(vFeatureKeyLoop)
Next vFeatureKeyLoop
lRowLoop = lRowLoop + 1
Next
FormatOvensRows
End Sub
Function ScrapeProductDetailsFromMainPages() As Scripting.Dictionary
Dim fso As Scripting.FileSystemObject
Set fso = CreateObject("Scripting.FileSystemObject")
Dim sFiles(1 To 2) As String
sFiles(1) = "N:\COOKER\Built-in double ovens - Cheap Built-in double ovens Deals _ Currys 1_50.html"
sFiles(2) = "N:\COOKER\Built-in double ovens - Cheap Built-in double ovens Deals _ Currys 2_50.html"
Debug.Assert fso.FileExists(sFiles(1))
Debug.Assert fso.FileExists(sFiles(2))
Dim sSrc(1 To 2) As String
sSrc(1) = fso.OpenTextFile(sFiles(1)).ReadAll
sSrc(2) = fso.OpenTextFile(sFiles(2)).ReadAll
Dim lFileLoop As Long
For lFileLoop = 1 To 2
'* Tools->Refernces Microsoft HTML Object Library
Dim oHtml4 As MSHTML.IHTMLDocument4
Set oHtml4 = New MSHTML.HTMLDocument
Dim oHtml As MSHTML.HTMLDocument
Set oHtml = Nothing
'* IHTMLDocument4.createDocumentFromUrl
'* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
Set oHtml = oHtml4.createDocumentFromUrl(sFiles(lFileLoop), "")
'* need to wait a little whilst the document parses
'* because it is multithreaded
While oHtml.readyState <> "complete"
DoEvents '* do not comment this out it is required to break into the code if in infinite loop
Wend
Debug.Assert oHtml.readyState = "complete"
Dim sTest As String
sTest = Left$(oHtml.body.outerHTML, 100)
Debug.Assert Len(Trim(sTest)) > 50 '* just testing we got a substantial block of text, feel free to delete
'* this is where the page specific logic now goes, here I am getting info from a StackOverflow page
Dim htmlAnswers As Object 'MSHTML.DispHTMLElementCollection
Set htmlAnswers = oHtml.getElementsByClassName("result-prd")
Dim htmlProductListImages As Object
Set htmlProductListImages = oHtml.getElementsByClassName("productListImage")
Dim htmlDescs As Object
Set htmlDescs = oHtml.getElementsByClassName("product-desc")
Dim htmlAnchorIns As Object
Set htmlAnchorIns = oHtml.querySelectorAll("a.in")
Dim htmlDivsMainAmount As Object
Set htmlDivsMainAmount = oHtml.querySelectorAll("div.main-amount")
Dim htmlChannels As Object
Set htmlChannels = oHtml.querySelectorAll("ul.simple.prd-channels")
Dim htmlPromoMessages As Object
Set htmlPromoMessages = oHtml.querySelectorAll("div.promoMessages")
Debug.Assert htmlAnswers.Length = htmlPromoMessages.Length
Debug.Assert htmlAnswers.Length = htmlChannels.Length
Debug.Assert htmlAnswers.Length = htmlDivsMainAmount.Length
Debug.Assert htmlAnswers.Length = htmlProductListImages.Length
Debug.Assert htmlAnswers.Length = htmlDescs.Length
Debug.Assert htmlAnswers.Length = htmlAnchorIns.Length
Dim dicProducts As New Scripting.Dictionary
Dim lAnswerLoop As Long
For lAnswerLoop = 0 To htmlAnswers.Length - 1
Dim dicProductLoop As Scripting.Dictionary
Set dicProductLoop = New Scripting.Dictionary
'* each grid cell has its own article
dicProductLoop.add "articleId", htmlAnswers.Item(lAnswerLoop).getAttribute("id")
'* grab pictures details like src,alt and CSStext
AddPictureDetails dicProductLoop, htmlProductListImages.Item(lAnswerLoop)
'* grab brand and product name
AddBrandAndName dicProductLoop, htmlAnchorIns.Item(lAnswerLoop)
'* grab review score (out of 10) and total reviews
AddRevoo dicProductLoop, htmlAnchorIns.Item(lAnswerLoop)
'* grab price now, and optionally price befor and saving
AddPrices dicProductLoop, htmlDivsMainAmount.Item(lAnswerLoop)
'* check availability, delivery, stock on order etc
AddChannels dicProductLoop, htmlChannels.Item(lAnswerLoop)
'* get online only
AddPromoMessages dicProductLoop, htmlPromoMessages.Item(lAnswerLoop)
'* build a dictionary of dictionaries
dicProducts.add dicProducts.Count, dicProductLoop
Set dicProductLoop = Nothing
Debug.Print lAnswerLoop
Next
Next lFileLoop
Set ScrapeProductDetailsFromMainPages = dicProducts
End Function
Private Function AddPromoMessages(ByVal dicProductLoop As Scripting.Dictionary, _
ByVal htmlDivPromoMessages As MSHTML.HTMLDivElement)
Dim oOnLineOnly As MSHTMLComparables
Set oOnLineOnly = New MSHTMLComparables
oOnLineOnly.BuildComparablesDict Array(Array("nodeName", "span"), Array("class", "label-online-only"))
If InStr(1, htmlDivPromoMessages.outerHTML, "label-online-only", vbTextCompare) > 0 Then
Dim sOnlineOnlySnippet As String
sOnlineOnlySnippet = oOnLineOnly.FindSnippetInSource(htmlDivPromoMessages.innerHTML)
If Len(sOnlineOnlySnippet) > 0 Then
Dim xmlOnlineOnly As Object
Set xmlOnlineOnly = oOnLineOnly.SpawnXmlFragmentFromSrc(sOnlineOnlySnippet)
If Not xmlOnlineOnly Is Nothing Then
dicProductLoop.add "OnlineOnly", xmlOnlineOnly.nodeTypedValue
End If
'Stop
End If
End If
End Function
Private Function AddChannels(ByVal dicProductLoop As Scripting.Dictionary, _
ByVal htmlChannelList As MSHTML.HTMLListElement)
Dim lLoop As Long
For lLoop = 0 To htmlChannelList.ChildNodes.Length - 1
Dim objChildNode As Object
Set objChildNode = htmlChannelList.ChildNodes.Item(lLoop)
Debug.Assert StrComp(objChildNode.nodeName, "li", vbTextCompare) = 0
Dim listItem As MSHTML.HTMLLIElement
Set listItem = objChildNode
If listItem.className = "nostock" Then
Dim vNoStock As Variant
Dim vUnavailable As Variant
Dim vCollectionUnavailable As Variant
Dim vOutOfStock As Variant
Dim vHomeDeliveryAvailable As Variant
Dim vCollectInStore As Variant
Dim lListItemChildLoop As Long
For lListItemChildLoop = 0 To listItem.ChildNodes.Length - 1
Dim objListItemChildLoop As Object
Set objListItemChildLoop = listItem.ChildNodes.Item(lListItemChildLoop)
If objListItemChildLoop.nodeName = "#text" Then
vNoStock = vNoStock & Trim(objListItemChildLoop.data) & ". "
ElseIf StrComp(objListItemChildLoop.nodeName, "SPAN", vbTextCompare) = 0 Then
If objListItemChildLoop.className = "email-when-back" Then
vNoStock = vNoStock & "Email me when back. "
Else
Stop
End If
'Stop
ElseIf StrComp(objListItemChildLoop.nodeName, "I", vbTextCompare) = 0 Then
'*ignore
Else
Stop
End If
Next lListItemChildLoop
ElseIf listItem.className = "available" Then
If listItem.getAttribute("data-availability") = "homeDeliveryAvailable" Then
Dim objHomedeliveryAvailChildLoop As Object
For Each objHomedeliveryAvailChildLoop In listItem.ChildNodes
If objHomedeliveryAvailChildLoop.nodeName = "#text" Then
vHomeDeliveryAvailable = vHomeDeliveryAvailable & objHomedeliveryAvailChildLoop.data & ". "
End If
Next
ElseIf listItem.getAttribute("data-availability") = "collectInStoreUnavailable" Then
Dim objCollectInStoreUnAvailableLoop2 As Object
For Each objCollectInStoreUnAvailableLoop2 In listItem.ChildNodes
If objCollectInStoreUnAvailableLoop2.nodeName = "#text" Then
vCollectionUnavailable = vCollectionUnavailable & objCollectInStoreUnAvailableLoop2.data & ". "
End If
Next
ElseIf listItem.getAttribute("data-availability") = "collectInStoreAvailable" Then
Dim objCollectInStoreAvailableLoop As Object
For Each objCollectInStoreAvailableLoop In listItem.ChildNodes
If objCollectInStoreAvailableLoop.nodeName = "#text" Then
vCollectInStore = vCollectInStore & objCollectInStoreAvailableLoop.data & ". "
End If
Next
'Stop
Else
Stop
End If
'Stop
ElseIf listItem.className = "unavailable" Then
If listItem.getAttribute("data-availability") = "collectInStoreUnavailable" Then
Dim objCollectInStoreUnAvailableLoop1 As Object
For Each objCollectInStoreUnAvailableLoop1 In listItem.ChildNodes
If objCollectInStoreUnAvailableLoop1.nodeName = "#text" Then
vCollectionUnavailable = vCollectionUnavailable & objCollectInStoreUnAvailableLoop1.data & ". "
End If
Next
ElseIf listItem.getAttribute("data-availability") = "homeDeliveryUnavailable" Then
Dim objHomeDeliveryUnavailableLoop As Object
For Each objHomeDeliveryUnavailableLoop In listItem.ChildNodes
If objHomeDeliveryUnavailableLoop.nodeName = "#text" Then
vOutOfStock = vOutOfStock & objHomeDeliveryUnavailableLoop.data & ". "
End If
Next
Else
Stop
End If
'Stop
Else
'If listItem.className = "available"
Stop
End If
Next lLoop
If Not IsEmpty(vHomeDeliveryAvailable) Then
dicProductLoop.add "HomeDeliveryAvailable", Trim(vHomeDeliveryAvailable)
End If
If Not IsEmpty(vNoStock) Then
dicProductLoop.add "noStock", Trim(vNoStock)
End If
If Not IsEmpty(vCollectionUnavailable) Then
dicProductLoop.add "CollectionUnavailable", Trim(vCollectionUnavailable)
End If
If Not IsEmpty(vOutOfStock) Then
dicProductLoop.add "OutOfStock", Trim(vOutOfStock)
End If
If Not IsEmpty(vCollectInStore) Then
dicProductLoop.add "CollectInStore", Trim(vCollectInStore)
End If
End Function
Private Function AddPrices(ByVal dicProductLoop As Scripting.Dictionary, _
ByVal htmlDivMainAmount As MSHTML.HTMLDivElement)
Dim oPrices As MSHTMLComparables
Set oPrices = New MSHTMLComparables
oPrices.BuildComparablesDict Array(Array("nodeName", "strong"), Array("class", "price"), Array("data-product", "price"))
Dim sSnippet As String
sSnippet = oPrices.FindSnippetInSource(htmlDivMainAmount.innerHTML)
sSnippet = VBA.Replace(sSnippet, "£", "£")
sSnippet = VBA.Replace(sSnippet, "£", "")
Dim xmlMainPriceSnippet As Object
Set xmlMainPriceSnippet = oPrices.SpawnXmlFragmentFromSrc(sSnippet)
Dim vPriceNow As Variant
vPriceNow = xmlMainPriceSnippet.nodeTypedValue
Debug.Assert IsNumeric(vPriceNow)
dicProductLoop.add "priceNow", CCur(vPriceNow)
Dim xmlSpanPastAmount As MSHTML.HTMLSpanElement
Dim xmlStrongSaving As MSHTML.HTMLPhraseElement
Dim lChildLoop As Long
For lChildLoop = 0 To htmlDivMainAmount.ChildNodes.Length - 1
Dim objChildLoop As Object
Set objChildLoop = htmlDivMainAmount.ChildNodes.Item(lChildLoop)
If StrComp(objChildLoop.nodeName, "#text", vbTextCompare) = 0 Then
'* empty text
ElseIf StrComp(objChildLoop.nodeName, "strong", vbTextCompare) = 0 Then
If objChildLoop.className = "saving" Then
Set xmlStrongSaving = objChildLoop
End If
ElseIf StrComp(objChildLoop.nodeName, "span", vbTextCompare) = 0 Then
Set xmlSpanPastAmount = objChildLoop
End If
Next lChildLoop
If Not xmlSpanPastAmount Is Nothing Then
If xmlSpanPastAmount.ChildNodes.Length > 0 Then
dicProductLoop.add "pastAmount", xmlSpanPastAmount.FirstChild.innerText
End If
End If
If Not xmlStrongSaving Is Nothing Then
dicProductLoop.add "saving", xmlStrongSaving.innerText
End If
End Function
Private Function AddRevoo(ByVal dicProductLoop As Scripting.Dictionary, _
ByVal aProductDescLoop As MSHTML.HTMLAnchorElement)
Static oRevooDiv As MSHTMLComparables
If oRevooDiv Is Nothing Then
Set oRevooDiv = New MSHTMLComparables
oRevooDiv.BuildComparablesDict Array(Array("nodeName", "div"), Array("class", "reevoo-placeholder"))
End If
Dim sSnippet As String
sSnippet = oRevooDiv.FindSnippetInSource(aProductDescLoop.outerHTML)
Dim xmlRevooDiv As MSXML2.IXMLDOMElement
Set xmlRevooDiv = oRevooDiv.SpawnXmlFragmentFromSrc(sSnippet)
If Not xmlRevooDiv Is Nothing Then
If xmlRevooDiv.LastChild.nodeTypedValue = "No reviews yet (0)" Then
dicProductLoop.add "totalReviews", 0
Else
dicProductLoop.add "totalReviews", Application.Evaluate(xmlRevooDiv.LastChild.nodeTypedValue)
Dim xmlClass As MSXML2.IXMLDOMAttribute
Set xmlClass = xmlRevooDiv.FirstChild.Attributes.getNamedItem("class")
Debug.Assert Left$(xmlClass.text, Len("reevoo-score score-")) = "reevoo-score score-"
Dim vScore As Variant
vScore = Mid$(xmlClass.text, Len("reevoo-score score-") + 1)
Debug.Assert IsNumeric(vScore)
dicProductLoop.add "revooScore", CDbl(vScore)
End If
End If
End Function
Private Function AddBrandAndName(ByVal dicProductLoop As Scripting.Dictionary, _
ByVal aProductDescLoop As MSHTML.HTMLAnchorElement)
Static oBrandSpan As MSHTMLComparables
If oBrandSpan Is Nothing Then
Set oBrandSpan = New MSHTMLComparables
oBrandSpan.BuildComparablesDict Array(Array("nodeName", "span"), Array("data-product", "brand"))
End If
Dim xmlBrandSpan As MSXML2.IXMLDOMElement
Set xmlBrandSpan = oBrandSpan.SpawnXmlFragmentFromSrc(oBrandSpan.FindSnippetInSource(aProductDescLoop.outerHTML))
If Not xmlBrandSpan Is Nothing Then
dicProductLoop.add "brand", xmlBrandSpan.nodeTypedValue
End If
Static oNameSpan As MSHTMLComparables
If oNameSpan Is Nothing Then
Set oNameSpan = New MSHTMLComparables
oNameSpan.BuildComparablesDict Array(Array("nodeName", "span"), Array("data-product", "name"))
End If
Dim xmlNameSpan As MSXML2.IXMLDOMElement
Set xmlNameSpan = oNameSpan.SpawnXmlFragmentFromSrc(oNameSpan.FindSnippetInSource(aProductDescLoop.outerHTML))
If Not xmlNameSpan Is Nothing Then
dicProductLoop.add "name", xmlNameSpan.nodeTypedValue
End If
End Function
Private Function AddPictureDetails(ByVal dicProduct As Scripting.Dictionary, _
ByVal divProductListImageLoop As MSHTML.HTMLDivElement)
Dim anchor As MSHTML.HTMLAnchorElement
Set anchor = divProductListImageLoop.FirstChild
Dim divProductImages As MSHTML.HTMLDivElement
Set divProductImages = anchor.FirstChild
Dim objStyle As Object
Set objStyle = divProductImages.getAttribute("style")
dicProduct.add "imageCssText", objStyle.getAttribute("cssText")
Dim imgProductImages As MSHTML.HTMLImg
Set imgProductImages = divProductImages.FirstChild
If imgProductImages.nodeName = "DIV" Then
Set imgProductImages = imgProductImages.NextSibling
End If
Debug.Assert imgProductImages.className = "image"
dicProduct.add "imageSrc", imgProductImages.getAttribute("src")
dicProduct.add "imageAlt", imgProductImages.getAttribute("alt")
Debug.Assert Len(imgProductImages.getAttribute("alt")) > 0
End Function
Sub TestSplitAttribsOnSpacesRespectingQuotes()
Dim o As MSHTMLComparables
Set o = New MSHTMLComparables
Dim v
v = o.SplitStringBySpaceRespectingQuotes("class=""bold italic underline"" id=""45""")
Debug.Assert v(0) = "class=""bold italic underline"""
Debug.Assert v(1) = "id=""45"""
End Sub
Sub FormatOvensRows()
Dim wsOvens As Excel.Worksheet
Set wsOvens = ThisWorkbook.Worksheets.Item("Ovens")
wsOvens.Cells.ClearFormats
Dim rowLoop As Excel.Range
For Each rowLoop In wsOvens.UsedRange.rows
If rowLoop.row > 1 Then
DoEvents
Dim lRow As Long
lRow = rowLoop.row - 2
Dim lRowMod50 As Long
lRowMod50 = lRow Mod 50
Dim lRowMod8 As Long
lRowMod8 = lRowMod50 Mod 8
Debug.Print lRowMod8
If lRowMod8 \ 4 = 0 Then
rowLoop.Interior.Color = rgbAliceBlue
End If
End If
Next
wsOvens.Names("Me").RefersToR1C1 = "=Ovens!RC"
Dim rngDataRows As Excel.Range
Set rngDataRows = wsOvens.UsedRange.Offset(1, 0).Resize(wsOvens.UsedRange.rows.Count - 1)
Dim rngPriceNo As Excel.Range
Set rngPriceNo = rngDataRows.columns("I")
Dim lNoStockOffset As Long
lNoStockOffset = Application.Evaluate("MATCH(""noStock"",1:1,0)-MATCH(""priceNow"",1:1,0)")
Dim lOutOfStockOffset As Long
lOutOfStockOffset = Application.Evaluate("MATCH(""outOfStock"",1:1,0)-MATCH(""priceNow"",1:1,0)")
With rngPriceNo.FormatConditions.add( _
Type:=xlExpression, _
Formula1:="=(len(offset(me,0," & lNoStockOffset & "))+len(offset(me,0," & lOutOfStockOffset & ")))>0")
.Font.Color = rgbGrey
End With
With rngPriceNo.Font
.Color = -16776961
.TintAndShade = 0
.name = "Calibri"
.FontStyle = "Bold"
.size = 11
.Strikethrough = False
.Superscript = False
.Subscript = False
.OutlineFont = False
.Shadow = False
.Underline = xlUnderlineStyleNone
.Color = -16776961
.TintAndShade = 0
.ThemeFont = xlThemeFontMinor
End With
rngPriceNo.NumberFormat = "$#,##0.00"
With rngDataRows.columns("J").Font
.name = "Calibri"
.FontStyle = "Regular"
.size = 11
.Strikethrough = False
.Superscript = False
.Subscript = False
.OutlineFont = False
.Shadow = False
.Underline = xlUnderlineStyleNone
.ThemeColor = xlThemeColorDark1
.TintAndShade = -0.349986267
.ThemeFont = xlThemeFontMinor
End With
With rngDataRows.columns("K").Font
.name = "Calibri"
.FontStyle = "Bold"
.size = 11
.Strikethrough = False
.Superscript = False
.Subscript = False
.OutlineFont = False
.Shadow = False
.Underline = xlUnderlineStyleNone
.ThemeColor = xlThemeColorLight1
.TintAndShade = 0
.ThemeFont = xlThemeFontMinor
End With
End Sub
So here is a screenshot of the output in Excel worksheet
And here is screenshot of part of the original web page
No comments:
Post a Comment