According to this incredibly understated and subtle Forbes article, the human race produces approximately 2.5 quintillion bytes of data every day - and naughty IoT is to blame for the majority of this rabid data procreation.  Sure, we can't argue with the modest conclusions of the magazine most famous for producing a list of the world's top 100 quintillionaires, and the world's top quintillion companies.  After all, our very own highly scientific research into worldwide data production, entitled Biggie Smalls Data (summarised on the greasy rear of a Hobgoblin beer mat), also concluded that 2.49 quintillion bytes of that total volume (± a 2.49 quintillion margin of error) is actually generated by hacked Samsung-Weyland-Yutani Corporation Internet-connected fridges engaging in distributed denial of service attacks against "smart" cat flaps, online toasters, self-aware light-bulbs and - of significantly less concern to our elected officials - entire nation states.

In other words, after the golden Age of Enlightenment, we are all now living in the mouldy brown age of Big Data.  Yes, boomers, millennials and generations X, Y and Z... that unfortunately includes all of you.

Furthermore, according to glossy marketing literature produced by companies that - quite literally - own 99.9% of Planet Earth, we should all be extremely excited about this fact.  Even thankful.  Just ask Cambridge Analytica.  They were certainly grateful for the many legitimate uses they found for our personal data.

But just how excited should we be?  Well, take a look at this illuminating diagram we found on Wikipedia describing Notoriously B.I.G. data.  It depicts a giant VHS video tape (of what we assume to be Alien) being chased up Corn Du by a menacing A Clockwork Orange-style gang of cassette tapes, vinyl records and books.

Yes, it really is that exciting.


And not being ones to intentionally pass up on a chance to contribute our very own quintillion bytes' worth of meaningless JSON messages, we're going to take this opportunity to do some "random data stuff" with the (pardon our Catalan) "@&%! load of data" collected during Chariots of Wire.

Yes, we're looking at you, Big Poppa - the big, ballsy field that is data science.  We're going to come after you big time in our wet, stinky trail running shoes.  And it might just end up being quite Juicy (the pointless data manipulations, that is, not our odorous footwear).

Gears of Bore 3

Look inside our tiny running backpack sized to fit two skinny hamsters, and what do we find?

No, not that half-eaten prawn sandwich from the previous run which now smells like a relative's home-made Brie.  Or those energy gels that turn us fluorescent green like the Incredible Hulk... permanently.  Nope.  We find the following outdoor necessities that should always be taken with us in place of non-essential items*, such as food, water, navigation devices, and smartphone.

  • That's right.  We're persisting with the use of ESP32 development boards, tactically attached to InvenSense MPU6050 Inertial Measurement Units.  These were deployed back in Chariots of Wire, where we invested some time in obtaining accelerometer and gyroscope readings using I2C and MicroPython.  This all culminated in the estimation of "pitch" values of our individual leg segments, using a Complementary Filter.  Oh, what fun that was.  Almost as much fun as carol singing in a Cotswold hamlet where the houses are a kilometre apart along a disused farm track.
  • And if you were paying attention (we don't blame you if you weren't), you would also recall that we had these measurements dispatched back to an MQTT broker running on a Raspberry Pi.  We aren't going to need any more gadgetry than this.  For now, the same Raspberry Pi will be able to comfortably run InfluxDB - an open-source time-series database - to store incoming data in chronological order.
  • Mr. Gates doesn't appear at #2 in the 2019 Rich List for no reason (we're not at all jealous; Mo Money, Mo Problems and all that).  We'll perform much of the data analysis and visualisation on a wholly unrelated Windows 10 laptop, since we can run both Python and Grafana on it (simultaneously, while surfing the web for dangerous Windows viruses), and experiment with all things Pandas and Matplotlib.

*Please don't follow this advice. Be safe outdoors. Always. Especially around mole burrows.


It's a marathon, not a print()

This is the fourth instalment in the I-O-Mr-T series which has been slowly trundling along the outside lane for some time.  And since seeded sprinters are allegedly gifted middle lanes in competitive running, we are starting to feel the pressure to perform like a pro.  Like, maybe, even get our hands on some silverware.

Below are the amateurs that have already jumped the gun and started the race in lanes 1-3.  Losers.  All of 'em!

  1. Gold Filing
  2. Castle Track-a-lot
  3. Chariots of Wire

The SIGH!tinerary

Be prepared.  Don't blink!

In this post, colourful things are going to happen before your very eyes in quickfire succession.  A quintillion code snippets and resulting screenshots of questionable value will be flashed at you like score cards at a Strictly Come Dancing final skipped through on 30× fast-forward.  And, through the mystical power of numbers, claims will shamelessly be made to downright embellish the truth.  Like 24,000 more trained nurses.  Or 40 new hospitals.  We will literally quote anything as if it were fact, and expect the audience to believe it.  Ha ha.  The joke's on you!

But in order to impose some much-needed order on the proceedings, here's what we've printed on the side of our big bold campaign bus before getting it stuck on a peaty Wiltshire countryside road.

  • We begin exactly where we left off with our experiments back in Chariots of Wire.  We pretend that we have some IMU measurements at hand (which we actually do), and further exaggerate an urgent need to store them in a specialised time-series database.  Yes, that's right.  To this end, we will install InfluxDB on the Raspberry Pi and casually interact with a newly created database using both InfluxDB's own CLI, and the InfluxDB library available in Python - imaginatively called InfluxDB-Python.
  • Where there is data, inevitably, there are graphs.  Pretty ones.  Ones that fatally convince a willing audience that they too see evidence in the numbers... like a statistical correlation between a 999% increase in alien sightings, and a 999% increase in the consumption of magic mushrooms.  You might just remember the bona fide visualisation tool - Grafana - from Hard Grapht.  Grafana can be launched on our Windows 10 laptop and made to connect directly over the network to the InfluxDB database running on the Raspberry Pi.  Pointy-pointy-clickity-click later, suddenly, the Sky is the Limit.  We have pretty graphs being displayed in our browser.  Of such outstanding quality that we can even convince the electorate that Planet Earth is, in fact, Flat Eric.
  • But we have made the assumption that no serious data scientist works with tools that look semi-decent.  It's time to import our data from our InfluxDB database back into Python, feed numerical bamboo shoots to Pandas, and set up our own data Adjustment Bureau starring Matplotlib Damon.  Here, we're looking to further reduce the noise from our IMU sensor readings, and for bonus points, we will attempt to estimate step rate using some peak detection.
  • Of course, it's holiday bonus season.  The sales department has been in touch via a Slack channel that you never knew you were a member of.  They want some prettier graphs.  Now!  For our own job security, we are going to heed their advice and store the processed data back into InfluxDB.  Are Paul and Pamela happy with the results?  We will never find out... since they have called it a day and are playing golf with an "important client".  We'll let them off, it's already 10:00 AM.
  • But no serious endeavour in the world of IT these days can be considered legitimate without the use of the clooouuuud!  Let's finally upload the results of our analysis in CSV format to an AWS S3 bucket so that we can take a sledgehammer to this academic nut with some heavy-handed machine learning.

Here's a photograph of a distinct land mass staring at the cloud, thus definitively proving the theory that the Earth is indeed Flat Eric.


You say daaata, I say dayta, let's call the whole thing off

Actually, let's not.

What did we "accomplish" in Chariots of Wire?  We had inexplicably strapped some MPU6050 Inertial Measurement Units (IMUs) to our legs and believed that we were on the verge of some sort of exciting scientific breakthrough.  Yet, in truth, all we managed to achieve was haphazard monitoring of IMU readings from our legs, courtesy of ESP32 development boards, and a novelty HTML5-based curiosity that displayed the positions our legs found themselves in, in near real-time.

If we're being kind to ourselves, however, we did demonstrate that accelerometer and gyroscope readings can be combined using a Complementary Filter to adequately estimate the angular displacement (pitch) of the segments of our legs, relative to gravity.
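
As a quick refresher of the principle, the filter blends the fast-but-drifty gyroscope with the slow-but-stable accelerometer.  The smoothing factor and sample interval below are illustrative, not necessarily the exact values we used:

alpha = 0.98  # illustrative smoothing factor
dt = 0.01     # illustrative sample interval in seconds

def complementary_filter(pitch_prev, gyro_rate, accel_pitch):
    # Integrate the gyro for short-term accuracy, then nudge the result
    # towards the accelerometer's gravity-based estimate to curb drift.
    return alpha * (pitch_prev + gyro_rate * dt) + (1 - alpha) * accel_pitch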


But now that we have copious amounts of data incoming, we really need to figure out what to do with it.  And to fabricate a completely ludicrous use case, we will use the dataset from the IMUs taken during a 2-minute toilet break on a Friday night. 

This leads us very nicely on to the first constituent of this experiment: data storage.  In essence, we have time-dependent IMU readings that need to be stored in vast quantities, in order.  So why not use the popular open-source time-series database - InfluxDB - for this purpose? To this end, we'll install InfluxDB on the Raspberry Pi, and get our IMU readings populating a database in it.

Then, we'll point Grafana at our InfluxDB database for some pretty situation-room style graphics.  For some extra role play, we don our lab coats and pretend that we're legitimate data scientists, by charting colourful stuff in Matplotlib with the help of charming Pandas.

Data SIGH!ence

Right.  First up: InfluxDB.  We discovered that it installs pretty easily onto a Raspberry Pi running Raspbian OS.  Let's give it a try:

sudo apt-get install influxdb


Once it's installed, it kinda runs in the background without much fuss.  And it's accessible both locally and over the network afterwards (until the end of time, or until Flat Eric rises to reclaim the planet we call home).

If you don't take our word for it, you can always check whether it's running, you know.

sudo systemctl status influxdb | grep Active


Now that InfluxDB is running, we can create a database, and populate measurements (InfluxDB's take on tables) in it, which is sort of the whole point.  Still in Raspbian, we can interact with InfluxDB from the Linux bash prompt, using the InfluxDB CLI.

We didn't find the CLI automatically installed though, so we had an additional step to install the client.

sudo apt-get install influxdb-client


Got our hands on the good old fashioned CLI?  Good.

Keying in influx now launches the shell, from which we can do useful stuff like manually create databases, and inject and query data.  This clearly isn't the manner in which we're going to interact with InfluxDB going forwards, but it's a great way to directly access the service and its data through the backdoor.  And speaking of backdoors, if we were taking this seriously, we would secure the service with a suitable username and password... but we'll leave that Everyday Struggle to another day.
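
(Purely for the record, that struggle on InfluxDB 1.x boils down to creating an admin user from the influx shell and setting auth-enabled = true under the [http] section of /etc/influxdb/influxdb.conf.  The password below is very much a placeholder.)

CREATE USER "admin" WITH PASSWORD 'ChangeMePlease' WITH ALL PRIVILEGES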

For starters, let's create a database called imu_readings to house our IMU data.

create database imu_readings
show databases


That's about it. It's ready to house our data.

But using the CLI is all very arduous and unsophisticated.  Naturally, we want to use Python to interact with InfluxDB.  Which is why we need to install the handy InfluxDB-Python library.

sudo pip3 install influxdb


There we have it.  All ready to go.

Next, we rearrange the payload being received by our MQTT broker into the JSON format required by InfluxDB.  This could be performed automatically by a separate application subscribed to the MQTT topic.  Or by using something that involves a little more AWSorcery, like AWS IoT Greengrass.  Either way, what we're talking about here is a Python application that receives and commits data to the InfluxDB database - we'll sketch one a little further below, once we've seen write_points() in action.  That's it.  Panic over.

Notice how we have added measurement and tags keys, and offloaded a selection of our data into the fields key.  If we don't specify a separate time key, the server automatically timestamps the entry when the data is committed.  If the payload already contains a timestamp of when the measurement was taken, it is likely to be more desirable to commit this as a time value, to reflect the actual moment when the data was collected.

{
    "measurement": "imu_reading",
    "tags": {
        "dev_id": "right_leg",
        "imu": "0"
    },
    "fields": {
        "accel_pitch_angle": 195.0,
        "angle_pitch_filtered": 195.47,
        "angle_pitch_filtered_previous": 195.43,
        "gyro_pitch_angle": 195.49
    }
}
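
Purely for illustration, a point carrying its own capture time would hypothetically look like this (the timestamp is made up, and the fields are trimmed for brevity):

{
    "measurement": "imu_reading",
    "time": "2019-11-22T22:34:10.123456Z",
    "tags": {
        "dev_id": "right_leg",
        "imu": "0"
    },
    "fields": {
        "accel_pitch_angle": 195.0
    }
}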

If we want to store the original object in the database, we use the write_points() method of the InfluxDBClient.

from influxdb import InfluxDBClient

# Connect to InfluxDB (defaults to localhost:8086) and select our database.
client = InfluxDBClient()
client.get_list_database()  # sanity check: list the available databases
client.switch_database("imu_readings")

# write_points() expects a list of points, so wrap our single entry in one.
json_body = [
    {
        "measurement": "imu_reading",
        "tags": {
            "dev_id": "right_leg",
            "imu": "0"
        },
        "fields": {
            "accel_pitch_angle": 195.0,
            "angle_pitch_filtered": 195.47,
            "angle_pitch_filtered_previous": 195.43,
            "gyro_pitch_angle": 195.49
        }
    }
]
client.write_points(json_body)

Run this, and the data is entered into our imu_readings database.  Notice how the JSON entry is assigned to a Python list first.  This is because the write_points() method can actually handle multiple entries at the same time - a useful feature when you have tonnes of data ready to be committed.
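
And, to make good on the bridging application mentioned earlier, here's a hedged sketch of the receive-and-commit loop.  We're assuming the paho-mqtt client library, a hypothetical topic name of imu/readings, and that the ESP32s publish flat JSON payloads containing the field names seen above - none of which you should treat as gospel.

import json
import paho.mqtt.client as mqtt
from influxdb import InfluxDBClient

influx = InfluxDBClient()
influx.switch_database("imu_readings")

def on_message(client, userdata, msg):
    # Reshape the incoming MQTT payload into InfluxDB's point format.
    payload = json.loads(msg.payload)
    point = {
        "measurement": "imu_reading",
        "tags": {"dev_id": payload["dev_id"], "imu": payload["imu"]},
        "fields": {key: payload[key] for key in (
            "accel_pitch_angle", "angle_pitch_filtered",
            "angle_pitch_filtered_previous", "gyro_pitch_angle")},
    }
    influx.write_points([point])

mqtt_client = mqtt.Client()
mqtt_client.on_message = on_message
mqtt_client.connect("localhost")  # broker runs on the same Raspberry Pi
mqtt_client.subscribe("imu/readings")  # hypothetical topic name
mqtt_client.loop_forever()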


Let's switch back momentarily to the InfluxDB CLI and impartially investigate if we can see this entry.

use imu_readings
select * from imu_reading


Yeah, there's data.

Well, that was fun.  But it's all a bit bland so far.  Let's spice it up with a bit of Grafana-na-na-na.

We covered both Kibana and Grafana back in Hard Grapht so we won't do a deep dive here.  Needless to say, InfluxDB is a supported Data Source in Grafana so it's simply a case of pointing it at our Raspberry Pi running InfluxDB.


Save and test the Data Source and we should receive some sort of confirmation that all is in order.


And the time-series data in our InfluxDB database is now conveniently available for over-eager dashboarding.  Just for fun, let's fabricate some arbitrary entries by committing totally fictitious JSON data and see if they appear in our graph in Grafana.


Yes, we realise, this still isn't that insightful.  So we're now going to switch to using real data collected by our IMUs attached to both legs... during a Friday night toilet break.

Picture the scene... Death in Paradise is on TV.  But you've probably already seen this episode, like 7 times before.  You've also had a few bottles of Dom Pérignon to drink like a true rockstar.  So your moment to RISE UP is now!

This specific dataset captures periods in which we're sitting on the sofa watching TV, with some walking and climbing up the stairs thrown in for good measure.


Zooming into some segments reveals probable patterns emerging, and they appear to coincide with the steps taken to walk up or down the stairs (reflected by alternating IMU measurements between the legs).  Yet, the data is highly noisy at this point, which is why we think the readings could benefit from some smoothing.


We're now going to turn our attention to Pandas - the extremely popular Python data analysis tool.

The first step is to connect Python to our InfluxDB database running on the Raspberry Pi, and retrieve data pertaining to a specific time frame of interest using client.query().  We assign the extracted data to a Pandas DataFrame object.  The plot() and show() methods allow us to quickly display the data using Matplotlib - a convenient way for us humans to continually validate our analysis, visually.

from influxdb import InfluxDBClient
import pandas as pd
import matplotlib.pyplot as plt

# Connect to the InfluxDB instance on the Raspberry Pi.
client = InfluxDBClient(host="rosie-02.local")
client.switch_database("imu_readings")

# Pull a two-minute window of readings into a Pandas DataFrame.
df = pd.DataFrame(
    client.query(
        "select * from imu_reading where time > '2019-11-22T22:34:10Z' AND time < '2019-11-22T22:36:10Z'"
    ).get_points()
)

# Plot each leg/IMU combination on the same axes.
ax = df[(df.imu=="0") & (df.dev_id=="right_leg")].plot(
    x="time", y="angle_pitch_filtered", label="Right Leg IMU 0", title="Anatomy of a 2-minute Bathroom Break"
)
df[(df.imu=="1") & (df.dev_id=="right_leg")].plot(
    x="time", y="angle_pitch_filtered", ax=ax, label="Right Leg IMU 1"
)
df[(df.imu=="0") & (df.dev_id=="left_leg")].plot(
    x="time", y="angle_pitch_filtered", ax=ax, label="Left Leg IMU 0"
)
df[(df.imu=="1") & (df.dev_id=="left_leg")].plot(
    x="time", y="angle_pitch_filtered", ax=ax, label="Left Leg IMU 1"
)
plt.xticks(rotation=5)
plt.show()


The data displayed natively in Matplotlib is identical to what we observed back in Grafana, but without the Blockbuster feel.  True, this is more Hollyoaks than Hollywood, but we quickly learn to appreciate its uses.


We zoom in, and we can instantly start to observe the noise in the readings.  While we enlightened inhabitants of Planet Earth may be able to detect the peaks and troughs in the IMU values visually, it would be useful to smooth the curves at this point to make the results more explicit (to both our eyes, and eventually, to a computer program).


For the remainder of the post, we're going to focus on just one IMU - IMU 0 on the right leg - but the method applies equally to the others.  The chosen IMU is attached to our upper leg, so its fluctuation in pitch should provide us with a reliable indication of the rate at which we are moving our legs backwards and forwards.

We apply rolling().mean() to the angle_pitch_filtered column to compute a simple moving average.  This introduces a slight lag proportional to the window size, but that really isn't important in this context (unlike to that wolf in Wall Street, for example).  What is more useful is that the peaks and troughs now become significantly more prominent.

# Compute a simple moving average over the filtered pitch values.
sma_window = 20
df_rl_imu0 = pd.DataFrame(df[(df.imu=="0") & (df.dev_id=="right_leg")])
df_rl_imu0.reset_index(inplace=True)
df_rl_imu0["angle_pitch_filtered_sma"] = df_rl_imu0["angle_pitch_filtered"].rolling(window=sma_window).mean()

# Plot the raw signal against its smoothed counterpart.
ax = df_rl_imu0.plot(
    x="time", y="angle_pitch_filtered", label="Original Noisy Signal", title="Right Leg IMU 0"
)
df_rl_imu0.plot(
    x="time", y="angle_pitch_filtered_sma", ax=ax, label="Simple Moving Average, window="+str(sma_window)
)
plt.xticks(rotation=5)
plt.show()

While this waveform is "decipherable" to us humans, how does a machine recognise what are effectively rolling maximum and minimum values of the curve?

Thankfully, using Pandas together with the venerable SciPy Python library, there are many ways in which we can employ sophisticated mathematical operations to analyse the dataset.  Specifically in this case, we can employ the scipy.signal.find_peaks() method to identify the local maxima (and, run against a negated signal, the minima).  This function appears to have a whole load of tunable parameters that we ought to investigate, but here we'll baselessly use width=10, height=10.  Because we can.

import scipy.signal
import numpy as np

# find_peaks() returns (indices, properties); we only want the indices.
df_rl_imu0_peaks = pd.DataFrame(
    scipy.signal.find_peaks(df_rl_imu0["angle_pitch_filtered_sma"], width=10, height=10)[0]
)
df_rl_imu0_peaks.columns = ["peak_index_ref"]

# Flag the rows of the main DataFrame that correspond to detected peaks.
df_rl_imu0["peak"] = df_rl_imu0.index.isin(df_rl_imu0_peaks.peak_index_ref)

ax = df_rl_imu0.plot(x="time", y="angle_pitch_filtered_sma", label="Simple Moving Average, window="+str(sma_window), title="Right Leg IMU 0 Peak Detection")
df_rl_imu0[df_rl_imu0["peak"]==True].plot(x="time", y="angle_pitch_filtered_sma", linestyle="None", marker="+", ax=ax, label="Peaks Detected")
plt.xticks(rotation=5)
plt.show()

We plot the results against our filtered curve, and we can see that the peaks have been somewhat successfully identified (with a few notable exceptions during periods of inactivity).

And since we have been able to identify the peaks, we can call .diff() on the timestamps of those peaks to work out the durations between them... which we'll crudely interpret as the step rate.

df_rl_imu0["duration"] = df_rl_imu0[df_rl_imu0["peak"]==True].time.diff() / np.timedelta64(1, 's')
ax = df_rl_imu0.plot(x="time", y="angle_pitch_filtered_sma", label="Simple Moving Average, window="+str(sma_window), title="Right Leg IMU 0 Durations between Peaks")
df_rl_imu0[df_rl_imu0["peak"]==True].plot(x="time", y="angle_pitch_filtered_sma", linestyle="None", marker="+", ax=ax, label="Peaks Detected")
df_rl_imu0[df_rl_imu0["peak"]==True].plot(x="time", y="duration", secondary_y=True, ax=ax, label="Duration between Peaks (s)")
plt.xticks(rotation=5)
plt.show()

The result is annoyingly skewed by the anomalous peaks that were identified, but generally, it appears to have recognised that our legs are moving quicker when going up or down the stairs, as opposed to when remaining seated or standing.  What a revelation!

And for the sake of pleasing Paul and Pamela who have now returned from the golf course intoxicated, and are demanding some slides for the client presentation "tomorrow" (despite it being Friday), we can store our processed data back into InfluxDB using write_points() again, once we reorganise the JSON into the required format.
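
A minimal sketch of that reorganisation - assuming a hypothetical measurement name of imu_reading_processed, and storing just the smoothed column - might iterate over the DataFrame like so:

# Hypothetical measurement name; one point per row of processed data.
points = []
for _, row in df_rl_imu0.dropna(subset=["angle_pitch_filtered_sma"]).iterrows():
    points.append({
        "measurement": "imu_reading_processed",
        "tags": {"dev_id": "right_leg", "imu": "0"},
        "time": row["time"].isoformat(),
        "fields": {"angle_pitch_filtered_sma": row["angle_pitch_filtered_sma"]},
    })
client.write_points(points)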


Here you go, P&P.  Shove this graphic into your client's inbox and watch that deal close.


Optionally, these datasets (both pre- and post-processing) could be exported in their entirety into CSV format so that they can be ingested by some sort of highly advanced machine-learning lifeform running on quintillion CPU cores and hyper-distributed deployments of Windows Vista.  Sprinkle in some other environmental measurements that we have a habit of taking while navigating the outdoors, engineer additional attributes, and we may be able to find some interesting relationships between how our legs work and the world around us.

Then again, maybe not.


There.  Have yourself A Merry Little Comma Separated Value for Christmas.
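
For the curious, the export itself is a Pandas one-liner (the filename is ours, not gospel):

# Dump the processed DataFrame to CSV, ready for cloud ingestion.
df_rl_imu0.to_csv("imu_rl_imu0_processed.csv", index=False)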


Finally, one last mention of the aforementioned sentient, machine learning overlord that crunches through numbers and spits out its heartless assessments of our athletic incompetency like yesterday's prawn sandwich.  Such beasts these days are most likely to be couched high up in the clouds, so it's only natural to zoom our CSV files up to some form of cloud storage.

We'll upload ours to AWS S3 for the time being.  It's a commonly used unceremonious dumping ground for files, after all.
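
Assuming your AWS credentials are already configured and the boto3 library is installed, the upload is mercifully brief.  The bucket and key names below are made up:

import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "imu_rl_imu0_processed.csv",          # local file
    "my-imu-bucket",                      # hypothetical bucket name
    "chariots/imu_rl_imu0_processed.csv"  # object key within the bucket
)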


There we have it.  The files are safely stored in S3 on the cloud... ready for the next phase of the shenanigans.


We've reached this particular finishing line.

Our eyes are red and puffy from being Hypnotised by a continuous, blurry stream of digits. A quintillion of them, in fact.  Our legs are now itching to don a pair of unwashed shoes, and take this monstrous rig outdoors to collect some real data.  And once safe back at home, to analyse the numbers like concerned shoppers inspecting their mile-long receipts after an entire day spent out at an IKEA.

Thanks a quintillion for staying with us.  We can sort of tell: this particular trail is about to get a little mucky.

Update 1 - January 2020


If you are thinking about running around the local trails, braving an illuminated 3D-printed curiosity on your leg, you should at least wait until night time to avoid causing unnecessary alarm!


...And ensure that the wearable is indeed... ahem... "wearable" by sanding it down carefully, or better still, by giving it some serious padding.


Athlete's footnote

How to get started on InfluxDB is documented thoroughly here:
Below, you will find the venerable Python library that allows us to interact with InfluxDB using Python.  It's imaginatively called "InfluxDB-Python".  The name kinda gives it away, doesn't it?
...But if we want to interact with InfluxDB outside of Python, from the comfort of our operating system shell, we can use its command line interface:
Pandas can be both a widely adopted Python library for working with large volumes of data... and more than one instance of a less adopted bear native to South Central China.
SciPy is an open source Python library that allows us to perform complex mathematical operations that are at the heart of science, engineering and other trades that will make our parents proud.  So complex, that we can't tell you much else about it, other than where you might find more information on it.
