Over the last few posts I have been working towards a solution that allows code to scrape data from a Bank of England PDF. So far I have broken the PDF up into separate SVG files, one per page. SVG files are easier to work with because they are XML-based.
XPath in VBA
VBA is my first language, so I can quickly give some test code to demonstrate the XPath logic before I delve into a Python solution.
Sub TestXml()
    '*Tools->References->Microsoft XML, v6.0
    Dim xml As MSXML2.DOMDocument60
    Set xml = New MSXML2.DOMDocument60
    'Load synchronously so the parse result can be checked straight after Load
    xml.async = False
    'Map the svg prefix to the SVG namespace so the XPath can match namespaced elements
    xml.setProperty "SelectionNamespaces", "xmlns:svg='http://www.w3.org/2000/svg'"
    xml.Load "C:\Users\Simon\Downloads\pdf_skunkworks\inflation-report-may-2018-page6.svg"
    Debug.Assert xml.parseError.ErrorCode = 0

    'Each series is identified by the exact style string (fill colour) of its path elements
    Dim xmlBluePaths As MSXML2.IXMLDOMNodeList
    Set xmlBluePaths = xml.SelectNodes("//svg:path[@style='fill:#19518b;fill-opacity:1;fill-rule:nonzero;stroke:none']")
    Debug.Assert xmlBluePaths.Length = 28

    Dim xmlRedPaths As MSXML2.IXMLDOMNodeList
    Set xmlRedPaths = xml.SelectNodes("//svg:path[@style='fill:#a80c3d;fill-opacity:1;fill-rule:nonzero;stroke:none']")
    Debug.Assert xmlRedPaths.Length = 28

    Dim xmlGreyPaths As MSXML2.IXMLDOMNodeList
    Set xmlGreyPaths = xml.SelectNodes("//svg:path[@style='fill:#a98b6e;fill-opacity:1;fill-rule:nonzero;stroke:none']")
    Debug.Assert xmlGreyPaths.Length = 28

    'Print the first blue path and its path data
    Dim xmlElement As MSXML2.IXMLDOMElement
    Set xmlElement = xmlBluePaths.Item(0)
    Debug.Print xmlElement.xml
    Debug.Print xmlElement.getAttribute("d")
End Sub
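For comparison, here is a rough sketch of the same XPath idea in Python, using only the standard library's xml.etree.ElementTree. This is my own illustration under some assumptions: the helper name find_paths_by_style and the tiny SAMPLE_SVG document (including its second, red path) are made up for the example, and the real page would be loaded from the SVG file rather than from a string.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map, playing the role of SelectionNamespaces in the VBA code
SVG_NS = {"svg": "http://www.w3.org/2000/svg"}
BLUE_STYLE = "fill:#19518b;fill-opacity:1;fill-rule:nonzero;stroke:none"
RED_STYLE = "fill:#a80c3d;fill-opacity:1;fill-rule:nonzero;stroke:none"


def find_paths_by_style(root, style):
    """Return every svg:path element whose style attribute equals `style` exactly.

    Hypothetical helper for this sketch; ElementTree supports the
    [@attrib='value'] predicate used here.
    """
    return root.findall(f".//svg:path[@style='{style}']", SVG_NS)


# Stand-in document for illustration only. The first path is the element shown
# in this post; the second path is invented so there is something to filter out.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <g>
    <path id="path670" style="fill:#19518b;fill-opacity:1;fill-rule:nonzero;stroke:none" d="m 241.666,133.557 h 2.364 v -25.886 h -2.364 z"/>
    <path id="made-up-red" style="fill:#a80c3d;fill-opacity:1;fill-rule:nonzero;stroke:none" d="m 0,0 h 1 v 1 h -1 z"/>
  </g>
</svg>"""

# For a real file one would use: root = ET.parse(path_to_svg).getroot()
root = ET.fromstring(SAMPLE_SVG)
blue_paths = find_paths_by_style(root, BLUE_STYLE)
red_paths = find_paths_by_style(root, RED_STYLE)

print(len(blue_paths), len(red_paths))
print(blue_paths[0].get("d"))
```

As in the VBA version, the match is on the exact style string, so it only works while the SVG exporter writes the style properties in the same order every time.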
The next problem, however, is how to parse the path data, which can be found in the d attribute of a path element. Here is an example of an element...
<path xmlns="http://www.w3.org/2000/svg" id="path670" style="fill:#19518b;fill-opacity:1;fill-rule:nonzero;stroke:none" d="m 241.666,133.557 h 2.364 v -25.886 h -2.364 z"/>
Within that element one can see the path data packed into the d attribute...
m 241.666,133.557 h 2.364 v -25.886 h -2.364 z
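To make the commands concrete: in SVG path data a lowercase m is a relative move, h and v are relative horizontal and vertical lines, and z closes the shape, so this path traces a rectangle (a bar in the chart). The sketch below is only a toy illustration of that reading, not a general parser and not the library mentioned in this post. The function name path_vertices is made up, and it assumes only the four relative commands seen above and plain decimal numbers.

```python
import re


def path_vertices(d):
    """Walk relative m/h/v/z commands and return the absolute corner points.

    Toy sketch: handles only the commands in the example path and
    plain decimals (no exponents, no curves, no absolute commands).
    """
    x = y = 0.0
    points = []
    # Split into (command letter, argument text) pairs
    for cmd, args in re.findall(r"([a-zA-Z])([^a-zA-Z]*)", d):
        nums = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", args)]
        if cmd == "m":
            # A relative move at the start of a path is measured from the origin
            x, y = x + nums[0], y + nums[1]
        elif cmd == "h":
            x += nums[0]
        elif cmd == "v":
            y += nums[0]
        elif cmd == "z":
            break  # close path: back to the first point, no new vertex
        else:
            raise ValueError(f"command {cmd!r} is not handled by this sketch")
        points.append((x, y))
    return points


corners = path_vertices("m 241.666,133.557 h 2.364 v -25.886 h -2.364 z")
xs = [p[0] for p in corners]
ys = [p[1] for p in corners]
width = max(xs) - min(xs)    # about 2.364
height = max(ys) - min(ys)   # about 25.886
print(corners)
print(width, height)
```

The width and height of the rectangle fall straight out of the h and v arguments, which is what makes these paths a usable source for the chart's underlying values.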
So we need code to parse this path data. I am not going to give that code in VBA, though; instead I have a Python library to show you in the next post.