Tuesday 26 June 2018

Python - Pandas - Data Processing

Introduction

So its Python month on this blog where I'm highlighting Python libraries of interest to Excel programmers.

Python's pandas library rivals not only Excel worksheet data processing function but also SQL and even C#'s LINQ.

Installing pandas

Install from a command windows with admin rights...

pip install pandas
...
Installing collected packages: six, python-dateutil, numpy, pandas
Successfully installed numpy-1.14.5 pandas-0.23.1 python-dateutil-2.7.3 six-1.11.0

So install went very well. Of interest is the library called six which helps code span the Python2 vs Python3 divide; whilst https://pypi.org/project/numpy/ gives array processing.

Simple CSV Data Set - UK House Prices

Found a smallish data set of UK House prices and will use that to begin playing with pandas.

Rows and Columns of data set

I am reminded of VBA two dimensional arrays when working with pandas data sets and when programming one needs to get the rows and columns. Here I give code to show rows and columns. Also this exemplifies sub-selecting (projecting) the first three columns

import pandas as pd

url="https://raw.githubusercontent.com/datasets/house-prices-uk/master/data/data.csv"
whole=pd.read_csv(url)

## drill into shapes tuple to get rows
print(str(whole.shape[0]) + " rows")

## project first three columns
projected = whole[['Date','Price (All)','Change (All)']]
## drill into shapes tuple to get columns
print(str(projected.shape[1]) + " columns")

print (projected.shape) ## outputs (261, 3) 

print(projected.index)  ## outputs RangeIndex(start=0, stop=261, step=1)

print(projected)        ## outputs top and bottom of table

The final print statement outputs the following...

#           Date  Price (All)  Change (All)
#0    1952-11-01         1891           0.0
#1    1953-02-01         1891           0.0
#2    1953-05-01         1891           0.0
#3    1953-08-01         1881           0.0
#4    1953-11-01         1872          -1.0
#5    1954-02-01         1863          -1.5
#..          ...          ...           ...
#256  2016-11-01       205937           4.5
#257  2017-02-01       206665           4.1
#258  2017-05-01       209971           2.8
#259  2017-08-01       211672           2.6
#260  2017-11-01       211433           2.7

Big Data Set - European Population Density Grid

So the maximum number of rows for an Excel worksheet is 1,048,576 rows (2^20) and I have found a data set with double that. The EU statistics agency, EuroStat, divides Europe into a series of grids and counts the people living in the grid; this data set is known as the GEOSTAT 2011 population grid, here is the download page which hosts the zip file. Within the zip is csv file, Version 2_0_1\GEOSTAT_grid_POP_1K_2011_V2_0_1.csv.

I came across this data thanks to a housing report published by think tank UkOnward written by Neil O'Brien and it referenced an article from Centre For Cities. They have processed the data nicely to illustrate how different EU member states have different dispersions of population densities; I'd interpret this as saying some countries do high rise residential blocks better than others. Here is their nice graphic.

But I wanted the underlying data, so I wrote a Python program that does the number crunching and here is a portion of the edited csv file which looks like it agrees with the graph once you ignore the 0-10000 bracket which isn't on the graphic.

TOT_P BE DE ES FR IT NL UK
(0, 10000] 25399 214363 62516 372112 171736 30550 127750
(10000, 15000] 29 211 455 234 343 51 223
(15000, 20000] 23 51 241 73 93 17 49
(20000, 25000] 3 8 132 32 36 4
(25000, 30000] 2 87 14 5
(30000, 35000] 45 21 3
(35000, 40000] 23 9
(40000, 45000] 13 7
(45000, 1000000] 10 3

And here is the source code


import pandas as pd
import numpy as np

## csv file in zip http://ec.europa.eu/eurostat/cache/GISCO/geodatafiles/GEOSTAT-grid-POP-1K-2011-V2-0-1.zip

url="C:/Users/Simon/Downloads/GEOSTAT-grid-POP-1K-2011-V2-0-1/Version 2_0_1/GEOSTAT_grid_POP_1K_2011_V2_0_1.csv"
whole=pd.read_csv(url, low_memory=False)

## only want two columns
populationDensity=whole[['TOT_P','CNTR_CODE']]

## trying to replicate graph here http://www.centreforcities.org/wp-content/uploads/2018/04/18-04-16-Square-kilometre-units-of-land-by-population.png
## which aggregates the records by brackets

# https://stackoverflow.com/questions/25010215/pandas-groupby-how-to-compute-counts-in-ranges#answer-25010952
# also relevant https://stackoverflow.com/questions/21441259/pandas-groupby-range-of-values#answer-21441621
ranges = [0,10000,15000,20000,25000,30000,35000,40000,45000,1000000]

## Yes groupby but we want to work with a DataFrame afterwards
# https://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-object-to-dataframe
groupedByBracketandCountry = populationDensity.groupby(['CNTR_CODE',pd.cut(populationDensity['TOT_P'],ranges)]).size().to_frame(name = 'count').reset_index()

#print (groupedByBracketandCountry) ## outputs the new DataFrame with the brackets

# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table
pivottable2=pd.pivot_table(groupedByBracketandCountry, values='count', index='TOT_P',columns=['CNTR_CODE'] , aggfunc=np.sum)
#print (pivottable2)

pivottable2.to_csv("C:/Users/Simon/Downloads/GEOSTAT-grid-POP-1K-2011-V2-0-1/Version 2_0_1/CountPivot.csv")

Pandas is a massive library and I need to go off and read the documentation. A pandas DataFrame is akin to an ADO Recordset familiar to VBA programmers. I used some groupby functionality without using SQL. Also the above program uses a pivot table in code without needing to locate a table on a worksheet like Excel does. Pandas looks extremely useful.

No comments:

Post a Comment