Thursday 7 June 2018

Python - Javascript - MutationObserver - detecting and POSTing changes to a page

So in the last post I showed how to write a message queue (I have improved that code, so the latest version is on this page). Next I write code in an HTML+javascript web page which detects changes in the web page and posts those changes to our message queue.

At this point I must confess to using Visual Studio to create new Python projects; it gives me Intellisense, but I still run code from the command window. The VS project is relevant in this post because the Python needs to be changed to serve up a web page and also accept POST requests, and those requests should come from the same domain as the page, otherwise one gets irritating cross domain errors. So keeping the web page and the Python script in the same project makes sense.

So here is a screenshot of my Visual Studio project explorer window.

Chrome Only Please (No IE)

By the way, I only use Chrome for this project. IE is going away, a fact which prompted me to investigate other ways of web-scraping. So this little project has arisen out of the need to move away from IE.

ClockWithMutationObserver.html

So we need a page, ClockWithMutationObserver.html, that displays a clock (with thanks to w3schools.com). The clock's div has an id of clock. Save it in the same directory as the Python script.

<!DOCTYPE html>
<html>
<head>
    <script>
        function startTime() {
            var today = new Date();
            var h = today.getHours();
            var m = today.getMinutes();
            var s = today.getSeconds();
            m = padZero(m);
            s = padZero(s);
            document.getElementById('clock').innerHTML =
                h + ":" + m + ":" + s;
            var t = setTimeout(startTime, 1000);
        }
        function padZero(i) {
            if (i < 10) { i = "0" + i };  // add zero in front of numbers < 10
            return i;
        }
    </script>

</head>

<body onload="startTime()">

    <div style="font-size:72pt" id="clock"></div>

    <script>

        console.log("entering startObserving");
        var MutationObserver = window.MutationObserver || window.WebKitMutationObserver || window.MozMutationObserver;
        if (MutationObserver == null)
            console.log("MutationObserver not available");

        // mutation observer code from https://developer.mozilla.org/en-US/docs/Web/API/MutationObserver
        var targetNode = document.getElementById('clock');

        // Options for the observer (which mutations to observe)
        var config = { attributes: true, childList: true };

        // Callback function to execute when mutations are observed
        var callback = function (mutationsList) {

            for (var mutation of mutationsList) {
                //debugger;
                //console.log(mutation);  //uncomment to see the full MutationRecord
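                // attribute changes and pure removals carry no added nodes, so skip them
                if (mutation.addedNodes.length === 0) continue;
                var shorterMutationRecord = "{ target: div#clock, newData: " + mutation.addedNodes[0].data + " }";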

                console.log(shorterMutationRecord);

                var xhr = new XMLHttpRequest();
                xhr.open("POST", "http://127.0.0.1:8000");
                //xhr.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
                xhr.send(shorterMutationRecord);

            }
        };

        // Create an observer instance linked to the callback function
        var observer = new MutationObserver(callback);

        // Start observing the target node for configured mutations
        observer.observe(targetNode, config);

        // Later, you can stop observing
        //observer.disconnect();

    </script>
</body>
</html>

So the above page gives a nice large clock (in 72pt), something like this

20:37:01

Javascript MutationObserver

So in the world of Javascript the Mozilla Developer Network is a good source of documentation. Thankfully, they have a good page on MutationObserver which allows us to detect changes to the DOM.

In the above web page there are two blocks of JavaScript: (i) the one in the head drives the clock itself; (ii) the one at the bottom of the body is the MutationObserver logic. We find the element we want to observe, then we define a callback function that runs whenever it changes.

MutationRecords

When our callback function is called, we loop through the changes. Each change comes with a detailed MutationRecord, and these are worth investigating. In the code, the line //console.log(mutation); is commented out; uncomment it if you want to see the rich detail for each change in the Chrome console. Because of all that detail, I copy only the parts I want into something smaller, in fact a string, because that is what I will POST back.
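
On the receiving side, the shortened record arrives as plain text rather than JSON. As a minimal sketch, assuming the exact string layout built in the callback above, a consumer could split it back into its parts like this. The names parse_record and RECORD_PATTERN are hypothetical and do not appear in the original code.

```python
import re

# Hypothetical helper: the pattern assumes the exact layout built in the
# page's callback, "{ target: <name>, newData: <value> }".
RECORD_PATTERN = re.compile(
    r"^\{ target: (?P<target>.+?), newData: (?P<new_data>.*) \}$"
)


def parse_record(text):
    """Split a posted record string into a dict, or return None if it does not match."""
    match = RECORD_PATTERN.match(text.strip())
    if match is None:
        return None
    return {"target": match.group("target"), "newData": match.group("new_data")}


print(parse_record("{ target: div#clock, newData: 20:37:01 }"))
# {'target': 'div#clock', 'newData': '20:37:01'}
```

A pattern like this is brittle if the value itself ever contains the separators, which is one reason a structured format might be preferable if the messages grow more complex.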

XHR to same domain avoids cross domain errors

We then use an AJAX XHR call to POST the data. It is helpful (but not strictly obligatory) to POST back to the same origin (scheme, host and port) whence the page came; this helps to avoid cross-origin errors.
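
To make "same origin" concrete, here is a rough sketch of the comparison a browser makes: two URLs share an origin only when scheme, host and port all match. The helpers origin and same_origin are hypothetical illustrations, not part of the project, and they ignore edge cases a real browser handles.

```python
from urllib.parse import urlsplit

# Default ports, used when a URL does not spell out its port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies a URL's origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(page_url, target_url):
    """True when both URLs have the same scheme, host and port."""
    return origin(page_url) == origin(target_url)


# The page POSTs to http://127.0.0.1:8000, so the address it was loaded from matters.
print(same_origin("http://127.0.0.1:8000/", "http://127.0.0.1:8000"))  # True
print(same_origin("http://localhost:8000/", "http://127.0.0.1:8000"))  # False
```

The second result is the practical point: hosts are compared as text, so a page loaded from http://localhost:8000 would be making a cross-origin POST to http://127.0.0.1:8000, even though both names reach the same machine.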

PythonHTTPMessageQueue.py

So I have some updated Python web server message queue code here. The main change is that all GET requests serve up the ClockWithMutationObserver.html file.


# with thanks to https://blog.anvileight.com/posts/simple-python-http-server/#do-get

from http.server import HTTPServer, SimpleHTTPRequestHandler
from io import BytesIO
import tempfile
from socketserver import ThreadingMixIn

class MyHTTPRequestHandler(SimpleHTTPRequestHandler):

    def do_GET(self):
        self.path = '/ClockWithMutationObserver.html'
        return SimpleHTTPRequestHandler.do_GET(self)

    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        body = self.rfile.read(content_length)
        self.send_response(200)
        self.end_headers()
        response = BytesIO()
        response.write(b'This is POST request. ')
        response.write(b'Received: ')
        response.write(body)


        # added code to write message to tempfile in temp directory
        msgFName = msgFileName()

        with open(msgFName, 'w+') as msg:
            msg.write(body.decode("utf-8"))
            msg.flush()

        self.wfile.write(response.getvalue())

        # finally add to console so we can see it in the command window
        print(body.decode('utf-8'))

class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    """Handle requests in a separate thread."""        

def msgFileName():
    # this function uses the date time to generate a filename which should
    # normally be unique and allows the files to be sorted chronologically
    import datetime
    import os
    timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S.%f')
    # os.path.join picks the right path separator for the operating system
    fileName = os.path.join(queueDir, timestamp + '.txt')
    return fileName


def TempDir():
    #this creates a new directory in the temp folder
    return tempfile.mkdtemp(prefix='MsgQueue')

#Main processing starts here
queueDir =TempDir() #queueDir is in global scope
httpd = ThreadedHTTPServer(('localhost', 8000), MyHTTPRequestHandler)

print("Serve forever, message queue dir:" + queueDir)
httpd.serve_forever()  #code will disappear in here
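
Because the message filenames start with a fixed-width timestamp, sorting them by name also sorts them by time. As a minimal sketch of the consuming side, assuming the .txt naming used by msgFileName above, something like the following could read the queue back in order. The function read_queue is hypothetical and not part of the original script.

```python
import os
import tempfile


def read_queue(queue_dir):
    """Return the text of every .txt message in queue_dir, oldest first."""
    # Fixed-width timestamp names mean alphabetical order is chronological order.
    names = sorted(n for n in os.listdir(queue_dir) if n.endswith(".txt"))
    messages = []
    for name in names:
        with open(os.path.join(queue_dir, name)) as msg:
            messages.append(msg.read())
    return messages


# Demo against a throwaway directory rather than the real queue folder.
demo_dir = tempfile.mkdtemp(prefix="MsgQueueDemo")
for name, text in [("20180607_203702.000000.txt", "later"),
                   ("20180607_203701.000000.txt", "earlier")]:
    with open(os.path.join(demo_dir, name), "w") as msg:
        msg.write(text)

print(read_queue(demo_dir))  # ['earlier', 'later']
```

To use it against the real queue, pass the directory that the server prints at start-up.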

Running the code, screen shots

So start the Python script, open Chrome and its console window, and browse to the address http://127.0.0.1:8000. We get to watch the clock running, and we also see activity in the Chrome console window, the command window and the message queue folder. Here are the screenshots.
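
If you want to exercise the POST handler without a browser, a small test client can send a message directly. This is only a sketch using the standard library; build_post and post_message are hypothetical names, and the URL is the address used in this post.

```python
import urllib.request

SERVER_URL = "http://127.0.0.1:8000"  # address used in this post


def build_post(url, text):
    """Build a POST request whose body is the UTF-8 encoded text."""
    return urllib.request.Request(url, data=text.encode("utf-8"), method="POST")


def post_message(url, text):
    """Send the text to the server and return the reply body as a string."""
    with urllib.request.urlopen(build_post(url, text), timeout=5) as response:
        return response.read().decode("utf-8")


# With the Python server running, a call such as this sends one message:
#   print(post_message(SERVER_URL, "{ target: div#clock, newData: 20:37:01 }"))
```

Each call should add one more file to the message queue folder and echo the message in the command window, just as a POST from the page does.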

Final Thoughts

What have we achieved here? Well, we've written code to detect changes in a web page and then POST those changes to an HTTP-based message queue. The next step would be to detect changes in someone else's page.

What has this got to do with Excel? This example is a Python web server but in this post I have demonstrated that it is possible to use Excel as a web server and so Excel could easily have replaced the Python web server. But this is Python month!
