Sunday 5 April 2020

Python, VBA - Beautiful Soup for webscraping

In this post I use Python's Beautiful Soup library to webscrape data from a web page, hopefully other VBA developers will realise that this is a much better library to use than VBA. But I do give a COM wrapper class so that the Python can be called from VBA.

Use case background

I'm worried about Corona Virus and its impact on stock markets, I very much need to query the London Stock Exchange database of securities to look for some safe government bonds to buy into. I have a list of government bonds from the branch of government that manages the issuance of such bonds but their unique identifier is an ISIN which is unique across the globe. Unfortunately, stockbrokers do not use ISIN numbers and I need to find alternative IDs and ticker codes. The London Stock Exchange (I am UK resident/citizen) has this information so I need to cross-reference the issuance authority's ISINs code with the LSE's database to get codes to present to my stock broker. For this I choose to write a web scraping program for I do not know of an official REST service.

Say No to VBA Webscraping

There was a time on this blog when I shared code that allowed a VBA developer to create an instance of instance of Internet Explorer, navigate to a web page and then traverse the HTML DOM to extract the necessary information. Those days have past now for a number of reasons. Firstly, Internet Explorer is legacy as Microsoft have the Edge browser so it would be foolish to write code against Internet Explorer. Secondly, since I branched out into alternatives to VBA I have discovered the quality of other libaries such as the .NET ecosystem and the Python ecosystem which both have many-fold better libraries.
I'd urge no new development in VBA for webscraping as there are three or four better options depending on your architecture and programming language preferences:
  • Chrome Extension, embedded JavaScript is best placed to traverse an HTML Dom and even sink events (needs a web server to relay info to though)
  • .NET Html Agility Pack, so C# developers would recommends this.
  • Python's Beautiful Soup
  • Webdriver interface, formerly known as Selenium
In this post I'm using Python's Beautiful Soup library but I am giving a COM wrapper so that it is callable from VBA because this is an Excel development blog after all.

I do not wish to replicate the Beautiful Soup documentation so I will be brief. Initially, I felt frustrated because I like to use a CSS selector path such as the following but this didn't not work for me which is a shame as it is a nice one shot declarative way to access an element.
#contentIndex > div.search_results_list > table > tbody > tr > td:nth-child(1)
But, I then read the documentation and saw how easily it was to script against a DOM (and actually a DOM fragment as well!) that the resulting code was perfectly easy to write.
All the web-scraping is packed into a class called LSEWebServices. But I also promised a COM wrapper class and that is LSEWebServicesCOMWrapper which simply delegates to an instance of LSEWebServices; I guess I could have inherited perhaps.

Important note: this code has been updated because London Stock Exchange changes their website

import urllib.request as urllib2
from bs4 import BeautifulSoup
import pythoncom
import logging

class LSEWebServices:

    def ExtractLSESearchResults(self,searchTerm: str):
        try:
            req = urllib2.urlopen("https://www.londonstockexchange.com/search?searchtype=all&q=" + searchTerm)
            html = req.read().decode("utf-8")

            try:
                soup = BeautifulSoup(html, 'html.parser')
                searchResultList = soup.find_all('div','item-separator')
                if (len(searchResultList)==1):
                    subText = str(searchResultList[0])
                    soup2 = BeautifulSoup(subText, 'html.parser')
                    instrumentTidm = soup2.find('span','instrument-tidm').text 
                    instrumentDesc = soup2.find_all('span','instrument-uppercase') [1].text
                    instrumentLink = soup2.find('a','tidm-and-description').attrs['href']
                    return (instrumentTidm, instrumentDesc,instrumentLink)
                pass


            except Exception as ex:
                print(ex)


        except Exception as ex:
            print(ex)

    def ExtractLSESecurityInformation(self, link:str):
        try:
            req = urllib2.urlopen(link)
            html = req.read().decode("utf-8")

            soup = BeautifulSoup(html, 'html.parser')
            instrumentInformation = soup.find_all('div','chart-table-instrument-information')
            if (len(instrumentInformation)==1):
                subText = str(instrumentInformation[0])
                soup2 = BeautifulSoup(subText, 'html.parser')
                spanMarketSegmentParent = soup2.find('span',string=' Market segment ').parent()
                marketSegment  = spanMarketSegmentParent[1].text

                spanSEDOLParent = soup2.find('span',string=' SEDOL ').parent()
                SEDOL  = spanSEDOLParent[1].text

                spanMICParent = soup2.find('span',string=' Market identifier code (MIC) ').parent()
                MIC  = spanMICParent[1].text

                spanISINParent = soup2.find('span',string=' ISIN ').parent()
                ISIN  = spanISINParent[1].text

                return (marketSegment, MIC,  SEDOL, ISIN)

        except Exception as ex:
            print(ex)

def TestLSEWebServices():
    lse = LSEWebServices()
    results = lse.ExtractLSESearchResults("gb0031790826")
    print(results)
    sec_info = lse.ExtractLSESecurityInformation(results[2])
    print(sec_info)
    dummy = 1 # a line of code upon which I can place a breakpoint
    pass

class LSEWebServicesCOMWrapper(object):
    _reg_clsid_ = "{81F3D23E-83E5-42DF-96E8-5042933379CF}"
    _reg_progid_ = 'PythonInVBA.LSEWebServicesCOMWrapper'
    _reg_clsctx_ = pythoncom.CLSCTX_LOCAL_SERVER 
    _public_methods_ = ['ExtractLSESearchResults','ExtractLSESecurityInformation']

    def ExtractLSESearchResults(self,searchTerm: str):
        try:
            lse = LSEWebServices()
            results = lse.ExtractLSESearchResults(searchTerm)
            
            if (results is not None):
                return list(results) 
            else:
                return None

        except Exception as ex:
            print(ex)

    def ExtractLSESecurityInformation(self, link:str):
        try:
            lse = LSEWebServices()
            results = lse.ExtractLSESecurityInformation(link)

            if (results is not None):
                return list(results) #.tolist()
            else:
                return None

        except Exception as ex:
            print(ex)

def TestLSEWebServicesCOMWrapper():
    lse = LSEWebServicesCOMWrapper()
    results = lse.ExtractLSESearchResults("gb0031790826")
    print(results)
    sec_info = lse.ExtractLSESecurityInformation(results[2])
    print(sec_info)
    dummy = 1
    pass

def RegisterThis():
    print("Registering COM servers...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(LSEWebServicesCOMWrapper)

if __name__ == '__main__':
    #TestLSEWebServicesCOMWrapper()
    RegisterThis()

Client VBA Code

So the Python code needs to be run once to register the COM servers and for this you need Administrator rights; once run the following VBA code can run.
All the hard work has been done for us in the Python code. In VBA, we create the COM wrapper class and call a couple of methods, the Python returns Python lists which are converted into COM/VBA variant arrays and we just pull out the relevant item. Simple.
Option Explicit

Sub TestLSEWebServicesCOMWrapper()
    
    Dim obj As Object
    Set obj = VBA.CreateObject("PythonInVBA.LSEWebServicesCOMWrapper")
    
    Dim vResult As Variant
    vResult = obj.ExtractLSESearchResults("gb0031790826")
    
    'Stop
    Dim vSecInfo As Variant
    vSecInfo = obj.ExtractLSESecurityInformation(vResult(2))

    Debug.Print "ISIN:" & vSecInfo(3), vResult(1), "TIDM:" & vResult(0), "SEDOL:" & vSecInfo(2), ""
    
    'Stop

End Sub
And the above code gives the output
ISIN:GB0031790826           UNITED KINGDOM 2% IL TREASURY 35          TIDM:T2IL     SEDOL:3179082 
So now referring back to the use case I can verify that my stock broker recognizes T2IL and 3179082 as identifiers. So we have converted the ISIN into something useful.

1 comment:

  1. Very nice code!!

    question:
    1. Dim obj As Object
    Set obj = VBA.CreateObject("PythonInVBA.LSEWebServicesCOMWrapper")

    i am getting error while setting this object. Where to find proper library?

    ReplyDelete