Monday 25 May 2020

A rather neat piece of plumbing, Chrome extension pushes byte array of jobs data to Excel via Python

Transcript

The United States is suffering from extremely high unemployment and in this post I give an application that harvests job leads from a leading jobs website. The application has numerous technical components, (i) a Chrome extension, (ii) a Python webserver housed as a COM component and (iii) a VBA deserialization component. Taken together they demonstrate transmitting binary data from the browser through to the Excel worksheet.

In the US, initial jobless claims are running at a 4-week average of 3 million and the non-farm payrolls are currently at 20 million. These figures are both depressing and staggering. Europe can expect suffering on similar terms. Hopefully the code in this post can assist some to find work.

Co-browsing vs Web-scraping

Websites depend upon ad revenue to survive and so they need humans to see the adverts placed. Every time a human sees an advert it is known as an impression. Web-scraping is the process of running code to fetch a web page and to scrape data from the HTML; this typically involves the automation of a hidden web browser and as such any adverts on a hidden web page are no longer viewable but rendering ad impression statistics false. Eventually, this means that ad revenue is debased and devalued. As such, I disapprove of web scraping.

Instead, I give a ‘co-browsing’ application where code captures job leads from a web page that a human user is browsing. So this application is only active when a human browses a web page. This means any advert impressions are genuine and website’s revenue is not threatened.

The code

There are three separate parts to this application, (i) the chrome extension, (ii) the Python web server (housed as a COM component) and (iii) the VBA deserialization component. They are all in Github, https://github.com/smeaden/ExcelDevelopmentPlatform/tree/master/PythonWebSeverCallsBackToExcel/

The Chrome Extension

https://github.com/smeaden/ExcelDevelopmentPlatform/tree/master/PythonWebSeverCallsBackToExcel/Chrome%20Extension/

The chrome extension will wait for a jobs page to load and then read the jobs data, it builds a JavaScript array of jobs and when complete it will convert the single dimensioned array of jobs into a two-dimensional grid array where each row is one job and the attributes are spread across the columns.

I convert to a grid because ultimately it will be sent to an Excel session where it is to be pasted onto a worksheet. The grid is then persisted to a byte array instead of JSON to take advantage of a data interchange format native to VB6, VBA that I have re-discovered and that allows a byte array to be deserialized to a VBA (OLE Automation) Variant (two dimensional).

Once converted to a byte array we make an XMLHttpRequest() to the Python web server (see next component). If you are experimenting then you might need to change port number in the code here.

There are two main JavaScript files, content.js and JavaScriptToVBAVariantArray.js. The former houses logic specific to this application whilst the latter is the array conversion code library file which I intend to use across a number of projects.

Python Web Server housed as a COM component

https://github.com/smeaden/ExcelDevelopmentPlatform/tree/master/PythonWebSeverCallsBackToExcel/PythonWebSeverCallsBackToExcel

I have previously written about and given code as to how to write a Python web server housed as a COM component and instantiable from VBA. I have also previously written about and given code as to how to call back into Excel VBA from a Python class.

But there is something new in this Python web server which needs detailing, in short one cannot simply call back into Excel with an interface pointer passed in a different threading apartment; instead the interface pointer has first to be ‘marshalled’. I have encapsulated the plain vanilla callback code in the Python class CallbackInfo and the special marshalling case in its derived class MarshalledCallbackInfo.

In the context of the application, the Python web server is part of the pipeline that passes the byte array from the Chrome extension into Excel VBA. It calls into Excel VBA by calling Application.Run on a (marshalled) Excel.Application pointer. The name of the procedure called by Application.Run is configurable, and passed in. Time to look into the VBA code.

Excel VBA

https://github.com/smeaden/ExcelDevelopmentPlatform/tree/master/PythonWebSeverCallsBackToExcel/ExcelVBA

I do not check into whole workbooks, I check in the individual code modules instead. Thus to build the Excel VBA workbook code base one needs to import the modules. Luckily, I wrote one module called devBuild to import the rest of them. I intend to follow this pattern when placing code in GitHub. Look at the README.md file for more detail. From here, I’ll assume you’ve built a workbook codebase.

I have written about the serialization and deserialization of Variants to byte arrays and back again so I’ll refer you to that post for the details. In short we take the byte array passed from the Chrome extension via the Python web server and deserialize this to a two dimensional variant array which can then be pasted onto the worksheet.

I guess I could write some more code to build a cumulative list but the point of this project was to show binary data being passed from browser to Excel, to demonstrate (a) the plumbing and (b) the binary data interface format (i.e. no JSON).

No comments:

Post a Comment