This post follows another Stack Overflow question about web scraping in VBA. I wrote some code, was not happy with it, and so revisited it. It turns out that the selector syntax accepted by querySelectorAll() (CSS selectors, the same syntax jQuery uses) can be quite advanced, a bit like XPath for XML.
Tip #1 When using querySelectorAll() use early binding
I have seen some strangeness when reading the length of a querySelectorAll() result. The problem goes away if you use early binding against the type library (Tools->References->Microsoft HTML Object Library).
Dim oSelectors As MSHTML.IHTMLDOMChildrenCollection
Set oSelectors = oHtml.querySelectorAll("div.blocoCampos input")
Dim lSelectorResultList As Long
lSelectorResultList = oSelectors.Length
Note that the selector above gets input elements which are descendants of div elements with the class 'blocoCampos'.
Tip #2 When using querySelectorAll() use item to acquire each element, not For Each
I have also seen some strangeness with querySelectorAll() results that raises an error on the Next line of a For Each loop. Avoid it by establishing the length of the result list first, then use a standard integer loop and acquire each element with item.
Dim lSelectorResultLoop As Long
For lSelectorResultLoop = 0 To lSelectorResultList - 1
Dim objChild As Object
Set objChild = oSelectors.Item(lSelectorResultLoop)
' ... work with objChild here ...
Next lSelectorResultLoop
Selecting the grandchild anchor under the second span child of a div with an id
Given the following HTML source, the questioner wanted to navigate to the anchor links. The anchors have no id and no class, and neither do their parent span elements, but the spans' parent div elements have a (non-unique) id, so we can start the capture there.
<div id="resumopesquisa">
<div id="itemlistaresultados" style="background-color: #EDEDED">
<span class="labellinha">Acórdãos de Repetitivos</span>
<!-- <span> PIS E ICMS E COFINS E CALCULO E BASE E DE REPETITIVOS.NOTA.
</span> -->
<span><a href="/SCON/jurisprudencia/toc.jsp?livre=ICMS+BASE+DE+CALCULO+PIS+COFINS&repetitivos=REPETITIVOS&&b=ACOR&thesaurus=JURIDICO&p=true">1
documento(s) encontrado(s)</a></span>
</div>
<div id="itemlistaresultados">
<span class="labellinha">Acórdãos</span>
<!-- <span> PIS E ICMS E COFINS E CALCULO E BASE E DE
</span> -->
<span><a href="/SCON/jurisprudencia/toc.jsp?livre=icms+base+de+calculo+pis+cofins&&b=ACOR&thesaurus=JURIDICO&p=true">284
documento(s) encontrado(s)</a></span>
</div>
</div>
So let's build up our selector expression. First let's get the divs by specifying their id (yeah, I know, I thought ids were unique as well) ...
div#itemlistaresultados
But then we need the span that is the second child element of its div; we can get it with the nth-child selector. We simply add a space between the div expression and the span expression to say that the span sits inside the div (a space matches any descendant, not only a direct child) ...
div#itemlistaresultados span:nth-child(2)
Finally we pick out the anchor element with
div#itemlistaresultados span:nth-child(2) a
So we can pass this selector expression to MSHTML's querySelectorAll method (use querySelector for singleton results). Here is the VBA:
Set oHtml = ie.Document
Dim objResultList As MSHTML.IHTMLDOMChildrenCollection
Set objResultList = oHtml.querySelectorAll("div#itemlistaresultados span:nth-child(2) a")
Dim lResultCount As Long
lResultCount = objResultList.Length
Debug.Print
Dim lResultLoop As Long
For lResultLoop = 0 To lResultCount - 1
Dim anchorLoop As MSHTML.HTMLAnchorElement
Set anchorLoop = objResultList.Item(lResultLoop)
Debug.Print anchorLoop.href
Next
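For the singleton case mentioned above, a minimal sketch might look like the following. The procedure name PrintFirstResultLink is hypothetical, and it assumes the Microsoft HTML Object Library reference is set and that the caller passes in an already loaded document. As I understand it, querySelector hands back Nothing when no element matches, so the sketch checks for that before touching href.

```vba
' Hypothetical helper (not part of the original answer): prints the href of
' the first anchor matched by the selector built up above.
' Assumes Tools->References->Microsoft HTML Object Library is ticked.
Public Sub PrintFirstResultLink(ByVal oHtml As MSHTML.HTMLDocument)
    Dim objFirstAnchor As Object

    ' querySelector returns only the first match in document order
    Set objFirstAnchor = oHtml.querySelector("div#itemlistaresultados span:nth-child(2) a")

    If objFirstAnchor Is Nothing Then
        Debug.Print "No matching anchor found"
    Else
        Debug.Print objFirstAnchor.href
    End If
End Sub
```

With the ie variable from the listing above it could be called as PrintFirstResultLink ie.Document.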
Tip #3 When early binding is not required, use late binding to get an aggregated interface
When dealing with an input checkbox it must be understood that its functionality is defined across a great many different interfaces, such as MSHTML.HTMLInputElement, MSHTML.IHTMLInputElement and many more. Perhaps the input element is a worst-case example because its definition is so multifaceted, but for illustration here is what OLEView gives for the interfaces implemented by the coclass MSHTML.HTMLInputElement ...
coclass HTMLInputElement {
[default] dispinterface DispHTMLInputElement;
[default, source] dispinterface HTMLInputTextElementEvents;
[source] dispinterface HTMLInputTextElementEvents2;
[source] dispinterface HTMLOptionButtonElementEvents;
[source] dispinterface HTMLButtonElementEvents;
interface IHTMLElement;
interface IHTMLElement2;
interface IHTMLElement3;
interface IHTMLElement4;
interface IHTMLUniqueName;
interface IHTMLDOMNode;
interface IHTMLDOMNode2;
interface IHTMLDOMNode3;
interface IHTMLDatabinding;
interface IHTMLElement5;
interface IHTMLElement6;
interface IElementSelector;
interface IHTMLDOMConstructor;
interface IHTMLElement7;
interface IHTMLControlElement;
interface IHTMLInputElement;
interface IHTMLInputElement2;
interface IHTMLInputTextElement;
interface IHTMLInputTextElement2;
interface IHTMLInputHiddenElement;
interface IHTMLInputButtonElement;
interface IHTMLInputFileElement;
interface IHTMLOptionButtonElement;
interface IHTMLInputImage;
interface IHTMLInputElement3;
interface IHTMLInputRangeElement;
};
So instead of figuring out which interface in the list above implements a given method, it is better to declare the variable As Object and use late binding; all the methods from all of the interfaces are then aggregated onto a single IDispatch interface.
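As a rough illustration of that aggregation, the sketch below declares the element As Object and calls members that belong to three different interfaces without any casting. The procedure name TickFirstCheckbox is hypothetical, and the selector assumes the page has a checkbox inside a div with the class 'blocoCampos', which the original question does not confirm. The document itself stays early bound, in line with Tip #1.

```vba
' Hypothetical helper: ticks the first checkbox found under div.blocoCampos.
' Assumes Tools->References->Microsoft HTML Object Library is ticked.
Public Sub TickFirstCheckbox(ByVal oHtml As MSHTML.HTMLDocument)
    ' Late bound element: every call below is resolved through IDispatch
    Dim objInput As Object
    Set objInput = oHtml.querySelector("div.blocoCampos input[type=checkbox]")

    If objInput Is Nothing Then
        Debug.Print "No checkbox found"
        Exit Sub
    End If

    objInput.Checked = True          ' member of IHTMLInputElement
    Debug.Print objInput.tagName     ' member of IHTMLElement
    Debug.Print objInput.nodeName    ' member of IHTMLDOMNode
End Sub
```

Had objInput been declared as one specific interface, say MSHTML.IHTMLInputElement, only that interface's members would compile, and reaching the others would need extra variables of the other interface types.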