The Excel Development Platform: pandas

Showing posts with label pandas. Show all posts

Friday, 29 June 2018

Python - Pandas - ADO - Convert a Pandas DataFrame to an ADO Recordset

So it's Python month on this Excel Development Platform blog where I highlight some Python technologies of interest to Excel (VBA) Developers.

So Python has the Pandas data processing library and one could move logic from VBA into a Python middle tier application server but sometimes you may still want some data processing functionality to remain in the VBA layer. Can a Pandas DataFrame be converted to an ADO Recordset? Yes, but you'll need to convert it into an Xml representation first and then pass the string to VBA where it recreates the ADO recordset.

Python code to convert Pandas dataframe to Xml representation of an ADO Recordset

Much of the Xml representation of an ADO recordset is boilerplate code, however in the first section one can see the column names of Col1,Col2,Col3. Then in the z:row elements the field values are added as attributes and the attribute names must match the column names Col1,Col2,Col3. Then add a tail and return the whole string to VBA.

import pandas as pd
import numpy as np

class PopulationDensity(object):
    _reg_clsid_ = "{C50910CC-F88F-4EA5-86D4-1E5D6AF1F4AE}"
    _reg_progid_= 'PandasInVBA.PopulationDensity'
    _public_methods_ = ['getPivotTable','getADORecordset']

    def getADORecordset(self):
        url="https://raw.githubusercontent.com/datasets/house-prices-uk/master/data/data.csv"
        whole=pd.read_csv(url)

        ## project first three columns
        projected = whole[['Date','Price (All)','Change (All)']]

        ## So now start creating the Xml representation of an ADO Recordset
        sStart = ("<xml xmlns:x='urn:schemas-microsoft-com:office:excel' "  +
        "    xmlns:dt='uuid:C2F41010-65B3-11d1-A29F-00AA00C14882' "  +
        "    xmlns:s='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882' "  +
        "    xmlns:rs='urn:schemas-microsoft-com:rowset' "  +
        "    xmlns:z='#RowsetSchema'>"  +
        "<x:PivotCache>"  +
        "<x:CacheIndex>1</x:CacheIndex>"  +
        "<s:Schema id='RowsetSchema'>"  +
        "<s:ElementType name='row' content='eltOnly'>"  +
        "<s:attribute type='Col1'/>"  +
        "<s:attribute type='Col2'/>"  +
        "<s:attribute type='Col3'/>"  +
        "<s:extends type='rs:rowbase'/>"  +
        "</s:ElementType>"  +
        "<s:AttributeType name='Col1' rs:name='Date'>"  +
        "<s:datatype dt:maxLength='255'/>"  +
        "</s:AttributeType>"  +
        "<s:AttributeType name='Col2' rs:name='Price (All)'>"  +
        "<s:datatype dt:maxLength='255'/>"  +
        "</s:AttributeType>"  +
        "<s:AttributeType name='Col3' rs:name='Change (All)'>"  +
        "<s:datatype dt:maxLength='255'/>"  +
        "</s:AttributeType>"  +
        "</s:Schema>"  +
        "<rs:data>" )

        ## now the data section, we iterate over the rows of the pandas DataFrame
        sData =""
        for index, row in projected.iterrows():
            sData = sData + "<z:row Col1='" + str(row['Date']) + "' Col2='" + str(row['Price (All)']) + "' Col3='" + str(row['Change (All)']) + "'/>"

        sEnd = (
        "</rs:data>"  +
        "</x:PivotCache>"  +
        "</xml>"  )

        return sStart + sData + sEnd 

    def getPivotTable(self):
        pass  # see previous article

if __name__=='__main__':
    print ("Registering COM server...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(PopulationDensity)

Client VBA Code

So here is the VBA code. An Xml Dom document is created, and the string returned from Python is parsed as a document. Then an ADO recordset is created and we call Open passing the DomDocument as the argument. This technique leverages the fact that recordsets can be persisted to xml files.

Sub Test2()

    Dim obj As Object
    Set obj = VBA.CreateObject("PandasInVBA.PopulationDensity")
    
    Dim sDataAsXml As String
    sDataAsXml = obj.getADORecordset
    
    '* Tools->References:Microsoft Xml, v6.0
    Dim domXlPersist As MSXML2.DOMDocument60
    Set domXlPersist = New MSXML2.DOMDocument60
    domXlPersist.LoadXML sDataAsXml
    Debug.Assert domXlPersist.parseError.ErrorCode = 0
    
    '* Tools->References:Microsoft ActiveX Data Object 6.1 Library
    Dim rs As ADODB.Recordset
    Set rs = New ADODB.Recordset
    rs.Open domXlPersist
    

    Dim rngOrigin As Excel.Range
    Set rngOrigin = Sheet3.Cells(12, 1)
    
    '* write column headers
    Dim lFieldLoop As Long
    For lFieldLoop = 0 To rs.Fields.Count - 1
        rngOrigin.Offset(0, lFieldLoop).Value = rs.Fields.Item(lFieldLoop).Name
    Next lFieldLoop
    
    '* write the data, yes, in one line
    rngOrigin.Offset(1, 0).CopyFromRecordset rs

End Sub

Final Thoughts

In the code given I have serialized a Dataframe to a (potentially large) Xml string then on the client side parsed it into a Dom and then an ADO recordset. This is quite a heavy set of operations. If you are calling an in process component then it would be better to pass it back as an OLE Variant. However, the above technique maybe better suited for Flask web services where conversion to strings is standard practice as part of the HTTP protocol.

Python - Pandas - VBA - Return a Pandas Pivot Table to VBA (using COM)

So it's Python month on this Excel Development Platform blog where I highlight some Python technologies of interest to Excel (VBA) Developers.

A few days prior I blogged a post about using Pandas to process some very large Eurostat Population Density data set, once I got the Pandas pivot table I saved it to a csv file which is unsatisfactory; the truth is I was having difficulty returning the Pandas pivot table to VBA as a 2d OLE Variant that can be pasted onto a worksheet. In this post I fix that.

Python code to return an OLE Variant from a Pandas Pivot

First the Python code. So the upper half of this code is the same as in the previous pandas article. So this task involves extracting portions of the pivot table, converting them to numpy arrays and then glueing them into one large 2d array.

import pandas as pd
import numpy as np

class PopulationDensity(object):
    #"{C50910CC-F88F-4EA5-86D4-1E5D6AF1F4AE}"
    _reg_clsid_ = "{C50910CC-F88F-4EA5-86D4-1E5D6AF1F4AE}"
    _reg_progid_= 'PandasInVBA.PopulationDensity'
    _public_methods_ = ['getPivotTable']

    def getPivotTable(self):
        ## csv file in zip http://ec.europa.eu/eurostat/cache/GISCO/geodatafiles/GEOSTAT-grid-POP-1K-2011-V2-0-1.zip

        url="C:/Users/Simon/Downloads/GEOSTAT-grid-POP-1K-2011-V2-0-1/Version 2_0_1/GEOSTAT_grid_POP_1K_2011_V2_0_1.csv"
        whole=pd.read_csv(url, low_memory=False)

        ## only want two columns
        populationDensity=whole[['TOT_P','CNTR_CODE']]

        ## trying to replicate graph here http://www.centreforcities.org/wp-content/uploads/2018/04/18-04-16-Square-kilometre-units-of-land-by-population.png
        ## which aggregates the records by brackets

        # https://stackoverflow.com/questions/25010215/pandas-groupby-how-to-compute-counts-in-ranges#answer-25010952
        # also relevant https://stackoverflow.com/questions/21441259/pandas-groupby-range-of-values#answer-21441621
        ranges = [0,10000,15000,20000,25000,30000,35000,40000,45000,1000000]

        ## Yes groupby but we want to work with a DataFrame afterwards
        # https://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-object-to-dataframe
        groupedByBracketandCountry = populationDensity.groupby(['CNTR_CODE',pd.cut(populationDensity['TOT_P'],ranges).astype(str)]).size().to_frame(name = 'count').reset_index()

        # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table
        pivottable2=pd.pivot_table(groupedByBracketandCountry, fill_value=0, values='count', index='TOT_P',columns=['CNTR_CODE'] , aggfunc=np.sum)

        # new code here that extracts portions of pivot table into arrays and then glues then together into one returnable array

        pivottable2Values = pivottable2.values             # this gives the data elements of the pivot table
        pivottable2Columns = pivottable2.columns.values    # this gives the column headers of the pivot table

        # the next line glues one 2d array on top of another (i.e. vertically)
        pivottable2ColumnsAndValues = np.vstack((pivottable2Columns,pivottable2Values))

        # this will be our origin cell, in the top left
        pythontopLeft=np.zeros((1,1))

        # this gets our rows headers in a 2d array, np.reshape upgrades the 1d array to 2d
        indexValues =  np.reshape(pivottable2.index.values, (-1, 1))

        # this next line glues one 2d array on top of another (i.e. vertically)
        pivottable2Index = np.vstack((pythontopLeft,indexValues))

        # this next line glues one 2d array onto the side of another (i.e. horizontally)
        vRet = np.hstack((pivottable2Index,pivottable2ColumnsAndValues))

        # this final line helps the conversion to an OLE Variant pastable onto an Excel worksheet
        return vRet.tolist()

if __name__=='__main__':
    print ("Registering COM server...")
    import win32com.server.register
    win32com.server.register.UseCommandLine(PopulationDensity)

np.vstack, np.hstack and np.reshape

There are some interesting numpy array functions introduced here np.vstack, np.hstack and np.reshape; the first two glue two array either vertically or horizontally; np.reshape can be used to redimension an array.

Without tolist() you get this error

The final requirement to pass back a numpy array to Excel VBA is to use tolist() because numpy array are not convertible to an OLE Variant but lists are. If you forget to add to tolist() then you get the following error.

Unexpected Python Error: ValueError: ndarray is not C-contiguous

VBA Client Code

So now the VBA is quite simple. The Python returns an OLE Variant array of two dimensions, we calculate the dimensions and then paste onto the sheet.

Sub Test()

    Dim obj As Object
    Set obj = VBA.CreateObject("PandasInVBA.PopulationDensity")
    
    Dim v
    v = obj.getPivotTable
    
    Dim lRows As Long
    lRows = UBound(v, 1) - LBound(v, 1)
    
    Dim lColumns As Long
    lColumns = UBound(v, 2) - LBound(v, 2)
    
    Dim rng As Excel.Range
    Set rng = Sheet3.Range(Sheet3.Cells(1, 1), Sheet3.Cells(lRows, lColumns))
    rng.Value = v
End Sub