Thursday 25 January 2018

VBA - Webscraping - jQuery selectors available with MSHTML's querySelector and querySelectorAll

Another Stack Overflow question about web scraping in VBA prompted this post. I wrote some code but was not happy with it, so I revisited it. It turns out that the jQuery-style (CSS) selector syntax accepted by querySelector and querySelectorAll can be quite advanced, a bit like XPath for XML.

Tip #1 When using querySelectorAll() use early binding

I have seen some strangeness when reading the length of a querySelectorAll() result. The problem goes away if you use early binding against the type library (Tools->References->Microsoft HTML Object Library).

    Dim oSelectors As MSHTML.IHTMLDOMChildrenCollection
    Set oSelectors = oHtml.querySelectorAll("div.blocoCampos input")
    
    Dim lSelectorResultList As Long
    lSelectorResultList = oSelectors.Length

Note that the selector above gets input elements which are descendants (at any depth) of div elements with the class 'blocoCampos'.

Tip #2 When using querySelectorAll() use Item to acquire each element, not For Each

I have also seen querySelectorAll() results raise an error on the Next line of a For Each loop. Avoid this by establishing the length of the result list and then using a standard integer loop, acquiring each element with Item.

    Dim lSelectorResultLoop As Long
    For lSelectorResultLoop = 0 To lSelectorResultList - 1

        ' Item is zero-based, hence the loop from 0 to length - 1
        Dim objChild As Object
        Set objChild = oSelectors.Item(lSelectorResultLoop)

    Next lSelectorResultLoop

Selecting grandchild anchor off second span child of a div with id

Given the following HTML source, the questioner wanted to navigate to the anchor links. The anchors have no id and no class; neither do their parent span elements; but the spans' parent div elements have a (non-unique) id, so we can start the capture there.


<div id="resumopesquisa">

  <div id="itemlistaresultados" style="background-color: #EDEDED">
   <span class="labellinha">Acórdãos de Repetitivos</span>
   <!-- <span>  PIS E ICMS E COFINS E CALCULO E BASE E DE REPETITIVOS.NOTA.
   </span> -->
   
   <span><a href="/SCON/jurisprudencia/toc.jsp?livre=ICMS+BASE+DE+CALCULO+PIS+COFINS&repetitivos=REPETITIVOS&&b=ACOR&thesaurus=JURIDICO&p=true">1
     documento(s) encontrado(s)</a></span>
   
  </div>
  
 
 <div id="itemlistaresultados">
  <span class="labellinha">Acórdãos</span>
  <!-- <span>  PIS E ICMS E COFINS E CALCULO E BASE E DE
  </span> -->
  
  <span><a href="/SCON/jurisprudencia/toc.jsp?livre=icms+base+de+calculo+pis+cofins&&b=ACOR&thesaurus=JURIDICO&p=true">284
    documento(s) encontrado(s)</a></span>
  
 </div>
</div>

So let's build up our jQuery selector expression. First let's get the divs by specifying their id (yeah, I know, I thought ids were unique as well) ...

div#itemlistaresultados

But then we need the span element that is the second child of the div; we can get this with the nth-child selector. We simply add a space between the div expression and the span expression to express the ancestor-descendant relationship (a space matches descendants at any depth; use > to restrict to direct children) ...

div#itemlistaresultados span:nth-child(2)

Finally we pick out the anchor element with

div#itemlistaresultados span:nth-child(2) a
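
To check each stage of the expression without driving a browser, one option is to load a cut-down copy of the markup into an in-memory document and print how many elements each stage matches. This is only a minimal sketch, assuming a reference to the Microsoft HTML Object Library. The procedure name TestSelectorStages, the shortened hrefs and the label text are made up for illustration, and how fully nth-child is supported may depend on the document mode MSHTML uses on your machine.

    Sub TestSelectorStages()
        ' Assumes Tools->References->Microsoft HTML Object Library
        Dim oDoc As MSHTML.HTMLDocument
        Set oDoc = New MSHTML.HTMLDocument

        ' Cut-down version of the markup above (hrefs and labels are illustrative)
        oDoc.body.innerHTML = _
            "<div id=""resumopesquisa"">" & _
            "<div id=""itemlistaresultados"">" & _
            "<span class=""labellinha"">Label one</span>" & _
            "<span><a href=""/first"">1 documento(s) encontrado(s)</a></span>" & _
            "</div>" & _
            "<div id=""itemlistaresultados"">" & _
            "<span class=""labellinha"">Label two</span>" & _
            "<span><a href=""/second"">284 documento(s) encontrado(s)</a></span>" & _
            "</div>" & _
            "</div>"

        Dim oMatches As MSHTML.IHTMLDOMChildrenCollection
        Dim vStage As Variant

        ' Print each stage of the selector alongside its match count
        For Each vStage In Array("div#itemlistaresultados", _
                                 "div#itemlistaresultados span:nth-child(2)", _
                                 "div#itemlistaresultados span:nth-child(2) a")
            Set oMatches = oDoc.querySelectorAll(CStr(vStage))
            Debug.Print vStage, oMatches.Length
        Next vStage
    End Sub

Run it from the Immediate window with TestSelectorStages. If a stage reports zero matches, that is the part of the expression to look at.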

So we can put this jQuery selector expression into MSHTML's querySelectorAll method (use querySelector for singleton results). Here is the VBA:

    Set oHtml = ie.Document
    Dim objResultList As MSHTML.IHTMLDOMChildrenCollection
    Set objResultList = oHtml.querySelectorAll("div#itemlistaresultados span:nth-child(2) a")

    Dim lResultCount As Long
    lResultCount = objResultList.Length

    Debug.Print
    Dim lResultLoop As Long
    For lResultLoop = 0 To lResultCount - 1

        Dim anchorLoop As MSHTML.HTMLAnchorElement
        Set anchorLoop = objResultList.Item(lResultLoop)

        Debug.Print anchorLoop.href

    Next
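
For the singleton case, querySelector returns only the first match, so there is no length to read and no loop. A minimal sketch follows, assuming the same early-bound document as above; the procedure name PrintFirstAnchor is hypothetical. The Is Nothing check guards against the selector matching nothing.

    Sub PrintFirstAnchor(ByVal oHtml As MSHTML.HTMLDocument)
        ' querySelector yields the first matching element, or Nothing if no match
        Dim oFirstAnchor As Object
        Set oFirstAnchor = oHtml.querySelector("div#itemlistaresultados span:nth-child(2) a")

        If oFirstAnchor Is Nothing Then
            Debug.Print "No anchor matched the selector"
        Else
            Debug.Print oFirstAnchor.href
        End If
    End Sub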

Tip #3 When early binding is not required use late binding to get the aggregated interface

When dealing with an input checkbox it must be understood that its functionality is defined across a great many types, such as MSHTML.HTMLInputElement, MSHTML.IHTMLInputElement and many more. Perhaps the input element is a worst-case example because it has a multifaceted definition, but for illustration here is what OLEView gives as the interfaces implemented by the coclass MSHTML.HTMLInputElement ...

    coclass HTMLInputElement {
        [default] dispinterface DispHTMLInputElement;
        [default, source] dispinterface HTMLInputTextElementEvents;
        [source] dispinterface HTMLInputTextElementEvents2;
        [source] dispinterface HTMLOptionButtonElementEvents;
        [source] dispinterface HTMLButtonElementEvents;
        interface IHTMLElement;
        interface IHTMLElement2;
        interface IHTMLElement3;
        interface IHTMLElement4;
        interface IHTMLUniqueName;
        interface IHTMLDOMNode;
        interface IHTMLDOMNode2;
        interface IHTMLDOMNode3;
        interface IHTMLDatabinding;
        interface IHTMLElement5;
        interface IHTMLElement6;
        interface IElementSelector;
        interface IHTMLDOMConstructor;
        interface IHTMLElement7;
        interface IHTMLControlElement;
        interface IHTMLInputElement;
        interface IHTMLInputElement2;
        interface IHTMLInputTextElement;
        interface IHTMLInputTextElement2;
        interface IHTMLInputHiddenElement;
        interface IHTMLInputButtonElement;
        interface IHTMLInputFileElement;
        interface IHTMLOptionButtonElement;
        interface IHTMLInputImage;
        interface IHTMLInputElement3;
        interface IHTMLInputRangeElement;
    };

So instead of figuring out which interface in the list above implements a method, it is better to declare the variable As Object to use late binding; all the methods from all of the interfaces are then aggregated onto a single IDispatch interface.
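
As a rough illustration, the sketch below touches members that live on three different interfaces through one late-bound variable. It assumes the page contains at least one checkbox; the procedure name TickFirstCheckbox and the selector are hypothetical, and the comments note where I believe each member is declared.

    Sub TickFirstCheckbox(ByVal oHtml As MSHTML.HTMLDocument)
        ' Late-bound variable: calls are resolved at run time via IDispatch
        Dim oInput As Object
        Set oInput = oHtml.querySelector("input[type='checkbox']")

        If oInput Is Nothing Then Exit Sub   ' no checkbox on the page

        Debug.Print oInput.ID         ' declared on IHTMLElement
        Debug.Print oInput.nodeName   ' declared on IHTMLDOMNode
        Debug.Print oInput.Name       ' declared on IHTMLInputElement
        oInput.Checked = True         ' declared on IHTMLInputElement
    End Sub

With an early-bound variable, each of these calls would need a variable of the matching interface type, or at least one that exposes the member.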
