Friday 29 June 2018

Python - Pandas - ADO - Convert a Pandas DataFrame to an ADO Recordset

So it's Python month on this Excel Development Platform blog where I highlight some Python technologies of interest to Excel (VBA) Developers.

So Python has the Pandas data processing library and one could move logic from VBA into a Python middle tier application server but sometimes you may still want some data processing functionality to remain in the VBA layer. Can a Pandas DataFrame be converted to an ADO Recordset? Yes, but you'll need to convert it into an Xml representation first and then pass the string to VBA where it recreates the ADO recordset.

Python code to convert Pandas dataframe to Xml representation of an ADO Recordset

Much of the Xml representation of an ADO recordset is boilerplate code, however in the first section one can see the column names of Col1,Col2,Col3. Then in the z:row elements the field values are added as attributes and the attribute names must match the column names Col1,Col2,Col3. Then add a tail and return the whole string to VBA.

import pandas as pd
import numpy as np

class PopulationDensity(object):
    _reg_clsid_ = "{C50910CC-F88F-4EA5-86D4-1E5D6AF1F4AE}"
    _reg_progid_= 'PandasInVBA.PopulationDensity'
    _public_methods_ = ['getPivotTable','getADORecordset']

    def getADORecordset(self):
        url="https://raw.githubusercontent.com/datasets/house-prices-uk/master/data/data.csv"
        whole=pd.read_csv(url)

        ## project first three columns
        projected = whole[['Date','Price (All)','Change (All)']]

        ## So now start creating the Xml representation of an ADO Recordset
        sStart = ("<xml xmlns:x='urn:schemas-microsoft-com:office:excel' "  +
        "    xmlns:dt='uuid:C2F41010-65B3-11d1-A29F-00AA00C14882' "  +
        "    xmlns:s='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882' "  +
        "    xmlns:rs='urn:schemas-microsoft-com:rowset' "  +
        "    xmlns:z='#RowsetSchema'>"  +
        "<x:PivotCache>"  +
        "<x:CacheIndex>1</x:CacheIndex>"  +
        "<s:Schema id='RowsetSchema'>"  +
        "<s:ElementType name='row' content='eltOnly'>"  +
        "<s:attribute type='Col1'/>"  +
        "<s:attribute type='Col2'/>"  +
        "<s:attribute type='Col3'/>"  +
        "<s:extends type='rs:rowbase'/>"  +
        "</s:ElementType>"  +
        "<s:AttributeType name='Col1' rs:name='Date'>"  +
        "<s:datatype dt:maxLength='255'/>"  +
        "</s:AttributeType>"  +
        "<s:AttributeType name='Col2' rs:name='Price (All)'>"  +
        "<s:datatype dt:maxLength='255'/>"  +
        "</s:AttributeType>"  +
        "<s:AttributeType name='Col3' rs:name='Change (All)'>"  +
        "<s:datatype dt:maxLength='255'/>"  +
        "</s:AttributeType>"  +
        "</s:Schema>"  +
        "<rs:data>" )

        ## now the data section, we iterate over the rows of the pandas DataFrame
        sData =""
        for index, row in projected.iterrows():
            sData = sData + "<z:row Col1='" + str(row['Date']) + "' Col2='" + str(row['Price (All)']) + "' Col3='" + str(row['Change (All)']) + "'/>"

        sEnd = (
        "</rs:data>"  +
        "</x:PivotCache>"  +
        "</xml>"  )

        return sStart + sData + sEnd 

    def getPivotTable(self):
        pass  # see previous article

if __name__=='__main__':
    print ("Registering COM server...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(PopulationDensity)

Client VBA Code

So here is the VBA code. An Xml Dom document is created, and the string returned from Python is parsed as a document. Then an ADO recordset is created and we call Open passing the DomDocument as the argument. This technique leverages the fact that recordsets can be persisted to xml files.

Sub Test2()

    Dim obj As Object
    Set obj = VBA.CreateObject("PandasInVBA.PopulationDensity")
    
    Dim sDataAsXml As String
    sDataAsXml = obj.getADORecordset
    
    '* Tools->References:Microsoft Xml, v6.0
    Dim domXlPersist As MSXML2.DOMDocument60
    Set domXlPersist = New MSXML2.DOMDocument60
    domXlPersist.LoadXML sDataAsXml
    Debug.Assert domXlPersist.parseError.ErrorCode = 0
    
    '* Tools->References:Microsoft ActiveX Data Object 6.1 Library
    Dim rs As ADODB.Recordset
    Set rs = New ADODB.Recordset
    rs.Open domXlPersist
    

    Dim rngOrigin As Excel.Range
    Set rngOrigin = Sheet3.Cells(12, 1)
    
    '* write column headers
    Dim lFieldLoop As Long
    For lFieldLoop = 0 To rs.Fields.Count - 1
        rngOrigin.Offset(0, lFieldLoop).Value = rs.Fields.Item(lFieldLoop).Name
    Next lFieldLoop
    
    '* write the data, yes, in one line
    rngOrigin.Offset(1, 0).CopyFromRecordset rs

End Sub

Final Thoughts

In the code given I have serialized a Dataframe to a (potentially large) Xml string then on the client side parsed it into a Dom and then an ADO recordset. This is quite a heavy set of operations. If you are calling an in process component then it would be better to pass it back as an OLE Variant. However, the above technique maybe better suited for Flask web services where conversion to strings is standard practice as part of the HTTP protocol.

No comments:

Post a Comment