Introduction
So its Python month on this blog where I'm highlighting Python libraries of interest to Excel programmers.
Python's pandas library rivals not only Excel worksheet data processing function but also SQL and even C#'s LINQ.
Installing pandas
Install from a command windows with admin rights...
pip install pandas
...
Installing collected packages: six, python-dateutil, numpy, pandas
Successfully installed numpy-1.14.5 pandas-0.23.1 python-dateutil-2.7.3 six-1.11.0
So install went very well. Of interest is the library called six which helps code span the Python2 vs Python3 divide; whilst https://pypi.org/project/numpy/ gives array processing.
Simple CSV Data Set - UK House Prices
Found a smallish data set of UK House prices and will use that to begin playing with pandas.
Rows and Columns of data set
I am reminded of VBA two dimensional arrays when working with pandas data sets and when programming one needs to get the rows and columns. Here I give code to show rows and columns. Also this exemplifies sub-selecting (projecting) the first three columns
import pandas as pd
url="https://raw.githubusercontent.com/datasets/house-prices-uk/master/data/data.csv"
whole=pd.read_csv(url)
## drill into shapes tuple to get rows
print(str(whole.shape[0]) + " rows")
## project first three columns
projected = whole[['Date','Price (All)','Change (All)']]
## drill into shapes tuple to get columns
print(str(projected.shape[1]) + " columns")
print (projected.shape) ## outputs (261, 3)
print(projected.index) ## outputs RangeIndex(start=0, stop=261, step=1)
print(projected) ## outputs top and bottom of table
The final print statement outputs the following...
# Date Price (All) Change (All)
#0 1952-11-01 1891 0.0
#1 1953-02-01 1891 0.0
#2 1953-05-01 1891 0.0
#3 1953-08-01 1881 0.0
#4 1953-11-01 1872 -1.0
#5 1954-02-01 1863 -1.5
#.. ... ... ...
#256 2016-11-01 205937 4.5
#257 2017-02-01 206665 4.1
#258 2017-05-01 209971 2.8
#259 2017-08-01 211672 2.6
#260 2017-11-01 211433 2.7
Big Data Set - European Population Density Grid
So the maximum number of rows for an Excel worksheet is 1,048,576 rows (2^20) and I have found a data set with double that. The EU statistics agency, EuroStat, divides Europe into a series of grids and counts the people living in the grid; this data set is known as the GEOSTAT 2011 population grid, here is the download page which hosts the zip file. Within the zip is csv file, Version 2_0_1\GEOSTAT_grid_POP_1K_2011_V2_0_1.csv.
I came across this data thanks to a housing report published by think tank UkOnward written by Neil O'Brien and it referenced an article from Centre For Cities. They have processed the data nicely to illustrate how different EU member states have different dispersions of population densities; I'd interpret this as saying some countries do high rise residential blocks better than others. Here is their nice graphic.
But I wanted the underlying data, so I wrote a Python program that does the number crunching and here is a portion of the edited csv file which looks like it agrees with the graph once you ignore the 0-10000 bracket which isn't on the graphic.
TOT_P | BE | DE | ES | FR | IT | NL | UK |
(0, 10000] | 25399 | 214363 | 62516 | 372112 | 171736 | 30550 | 127750 |
(10000, 15000] | 29 | 211 | 455 | 234 | 343 | 51 | 223 |
(15000, 20000] | 23 | 51 | 241 | 73 | 93 | 17 | 49 |
(20000, 25000] | 3 | 8 | 132 | 32 | 36 | 4 | |
(25000, 30000] | 2 | 87 | 14 | 5 | |||
(30000, 35000] | 45 | 21 | 3 | ||||
(35000, 40000] | 23 | 9 | |||||
(40000, 45000] | 13 | 7 | |||||
(45000, 1000000] | 10 | 3 |
And here is the source code
import pandas as pd
import numpy as np
## csv file in zip http://ec.europa.eu/eurostat/cache/GISCO/geodatafiles/GEOSTAT-grid-POP-1K-2011-V2-0-1.zip
url="C:/Users/Simon/Downloads/GEOSTAT-grid-POP-1K-2011-V2-0-1/Version 2_0_1/GEOSTAT_grid_POP_1K_2011_V2_0_1.csv"
whole=pd.read_csv(url, low_memory=False)
## only want two columns
populationDensity=whole[['TOT_P','CNTR_CODE']]
## trying to replicate graph here http://www.centreforcities.org/wp-content/uploads/2018/04/18-04-16-Square-kilometre-units-of-land-by-population.png
## which aggregates the records by brackets
# https://stackoverflow.com/questions/25010215/pandas-groupby-how-to-compute-counts-in-ranges#answer-25010952
# also relevant https://stackoverflow.com/questions/21441259/pandas-groupby-range-of-values#answer-21441621
ranges = [0,10000,15000,20000,25000,30000,35000,40000,45000,1000000]
## Yes groupby but we want to work with a DataFrame afterwards
# https://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-object-to-dataframe
groupedByBracketandCountry = populationDensity.groupby(['CNTR_CODE',pd.cut(populationDensity['TOT_P'],ranges)]).size().to_frame(name = 'count').reset_index()
#print (groupedByBracketandCountry) ## outputs the new DataFrame with the brackets
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table
pivottable2=pd.pivot_table(groupedByBracketandCountry, values='count', index='TOT_P',columns=['CNTR_CODE'] , aggfunc=np.sum)
#print (pivottable2)
pivottable2.to_csv("C:/Users/Simon/Downloads/GEOSTAT-grid-POP-1K-2011-V2-0-1/Version 2_0_1/CountPivot.csv")
Pandas is a massive library and I need to go off and read the documentation. A pandas DataFrame is akin to an ADO Recordset familiar to VBA programmers. I used some groupby functionality without using SQL. Also the above program uses a pivot table in code without needing to locate a table on a worksheet like Excel does. Pandas looks extremely useful.
No comments:
Post a Comment