We all know that sinking feeling.
A bunch of avocados that we desperately need for our weekend guacamole has been delivered by Ocado*, but they arrive as hard as Vinnie Jones in a blustery weekend fixture down in South West London. And like the label on the packaging so helpfully instructs, we're now expected to ripen them ourselves, at home, at room temperature.
What?!
Let's get this right... so although they've been expertly delivered by a man in uniform, we now have to leave our exotic green companions in a meditative state for a few more days? All while ensuring that they are exposed to a highly non-scientific, subjective temperature of room-degrees-Celsius? Yes, so it turns out, this is indeed the level of effort demanded of us shoppers to materialise the perfect condiment to our burritos. And this - as it transpires - becomes the rather nonsensical pretext to this entire blog post**.
Well, then, challenge accepted. Because we are likely to only require the most basic of IoT paraphernalia to get this particular fiesta ready... or so we thought.
...True, our experience from Frozen Pi taught us that it is easy enough to deploy a battalion of microprocessors hooked up to temperature sensors, connect them to the Curly Wurly Web, and store readings in a DynamoDB NoSQL table using AWS IoT. All worked perfectly, until ... well... it didn't. Because - in reality - we had little way of knowing if our imprecision instruments of science dropped off The Fruitrix like flawed of the flies, and definitely, no way for the devices to take corrective action. Automagically.
Which is why in typical Rosie the Red Robot fashion, and very, Very, VERY late for Valentine's Day, we will attempt to over-engineer a baffling solution that binds 2 microprocessors together in (un)holy matrimony to implement High Availability Avocado Monitoring™, or as our self-submission to the Nobel Prize is aptly entitled, How to make good HA2M. And like an episode of 90 Day Fiancé (which we swear we never watch, except on the occasions that it inexplicably appears on screen, and we can't turn it off - honest), we'll investigate if this couple can successfully work through their challenges, in sickness and in health. Will this result in an enviable green card?
Will it be 90-day disarray, or happily ever after?
* Other online grocery retailers are available, but rather disappointingly, their names do not rhyme with aforementioned fruit.
** It's been brought to our attention by our eagle-eyed readership totalling 4, that it is possible to purchase "ready to eat" variants. In typical British fashion, however, we will choose not to buy them, but complain vehemently anyway like we have been grossly wronged.
Unexpected item(s) in the bagging area:
- Two of our bravest mushrooms were expected to gloriously return from our previous adventure - Frozen Pi. And by mushrooms, what we actually mean are ESP8266 microprocessors equipped with a shiitake-shaped DS18B20 protrusion. But there was already a Hitchcockian twist to this journey from the outset...
- Not so long after we started to write additional MicroPython code for this instalment, we began to encounter consistent crashes on the connect() method of the umqtt.simple.MQTTClient MQTT module. Fearing a resource constraint, we adopted monastic practices to avoid overwhelming our device with sizeable modules and lingering variables, and eventually we even resorted to frozen bytecode to minimise memory consumption (since the code no longer needs compiling on the device).
- ...But sometimes - like every England football team since 1966 - we just have to admit defeat and move on. After all, for an additional quid or two, we can upgrade to a more capable device - the ESP8266's equally popular successor, the ESP32. And by all accounts we were left quite content after the swap, because our problem immediately went away. This led us to the conclusion that we were indeed being constrained by resources (probably the RAM).
- We've been using 4 × Ni-MH AA rechargeable batteries ever since our battery power conundrum of Frozen Pi. Using an ESP32 development board, we have the ability to regulate the total battery voltage down to the 3.3V required by the ESP32.
- Ah ha. What new strange curiosities are we going to wheel out today? We have 2 × AT24C256 EEPROMs in our possession, for reasons described later. EEPROM stands for (remember this for the Thursday night pub quiz down at the village local)... Electrically Erasable Programmable Read Only Memory. Honestly, who on earth comes up with the names for this stuff? And could they not have placed "Programmable" at the very start of the acronym instead for greater comedic effect?
- A couple of spare 4.7kΩ resistors won't go amiss to use as pull-up resistors for the I2C lines (although we didn't appear to require them initially). There's a great deep-dive on I2C by Sparkfun. In short, it's a protocol that pulls a line down to 0V to signal the presence of "data"; the rest of the time, the lines need to be anchored to a positive voltage so they don't float. Many I2C devices (especially those packaged up as modules) tend to have pull-ups built in. The jury's still very much out as to whether we need them in this instance. While we're at it, Sparkfun also have a great page on EEPROMs.
- We can't really think of anything else... except some mini-breadboards and cables are required to complete the union between our ESP32s. All absurdities are experienced after the matrimony. Bit like real life, then.
Evolutionary history:
This is very much a follow-up to our previous instalment in the I-O-Tea series - Frozen Pi. Going back further, we were introduced to these microprocessors back in Raspberry Bye, Hello. We even hacked a Gro-Egg display using one for no reason whatsoever.
Online (dis)ordering:
At face value, this mission doesn't seem overly complicated (but that's exactly what we said the last time, and the one before...) Here's the recipe that we would have written down had we had access to an empty Walkers crisp packet:
- We somewhat pseudo-scientifically concluded in Frozen Pi that our mushrooms require a little more voltage, to take into consideration the voltage drop introduced by the development board's regulator, and the minimum operating voltages of both the ESP and the temperature sensor. For immediate results, we'll upgrade our holders to house 4 × Ni-MH AA rechargeable batteries. We won't go anywhere near the 3.3V lines with this power source; we are relying on the development board's regulator to perform that conversion via the Vin pin.
- We promised you some PEEPROM action. Our EEPROMs can be contacted using I2C, which means we have to connect the usual suspects: namely the Vcc, Ground, Clock and Data lines. We'll also test our competency in writing data to / reading data from these memory devices, using MicroPython's built-in I2C class.
- Guess what? We'll then write some more MicroPython code to share rudimentary bit-level / byte-level information between the two ESP32s. What's more, we will make the secondary ESP32 automatically take over by checking what's being written to the EEPROM. This is the logic that makes this the very HA2M that we set out to achieve in our party manifesto. The primary node will be tasked with connecting to Wi-Fi, taking the temperature reading, and dispatching this to AWS IoT Core. There, through a Rule, it will be dumped in a DynamoDB table, just like before.
- Oh yes. Let's also introduce some other AWS components. We would quite like to be alerted if something of note happens with the cluster. For this, we'll use AWS Lambda, and relentlessly spam ourselves with emails through the use of AWS Simple Email Service (SES). We won't explain here how we set AWS SES up, since that is a whole different topic in itself (and there are plenty of tutorials on it).
Semi-skimmed:
Right. So what exactly are we trying to attempt here? We begin this post right where we left the previous one, like a dodgy follow-up to an EastEnders cliffhanger (who did kill Dirty Den?) And, indeed, it initially feels like we're mopping up the mess in The Vic left from the night before.
Let's recap.
We deployed a small but brave squadron of 7 × ESP8266s (NodeMCUs) one snowy week to take temperature readings using a DS18B20 temperature sensor. These data points were then duly dispatched back to AWS IoT Core. What's more, we created a Rule in IoT Core to dump our data into a DynamoDB table.
So far so good.
Yet, our feeling of triumph didn't last very long. Because while it looked so promising at the start, approaching the 24-hour mark, our devices started to disappear off the face of the Earth like dodos.
We won't spend too long discussing the life expectancies of the ESP8266s (even if they are hypnotised into deep sleep periodically) as we discussed this to some extent back in Frozen Pi. But from extracting "non-responsive" ESP8266s and performing overdramatic Holby City-style A&E triage, armed with a serial cable and a multimeter in place of a stethoscope and defibrillator, the reasons for their distress became painfully clear.
Our doctor's notes recorded:
- Battery voltage dropping below the minimum required to keep the ESP8266 stable. Yeah, who knew electronics required power, right? With a majority of our casualties, this was the reason. Below 2.5V, the ESP8266 starts to become temperamental. Approaching 2.3V, it no longer powers on. Where the earlier graph shows miraculous immortality, or sporadic resurrections, that was us manually intervening with either new batteries or, rather cheekily, mains power. In other words, we cheated.
- Super dodgy code. Yes, we admit it. There was an occasion or two where we simply discovered the ESP8266 stuck on a MicroPython prompt, either caused by suspect hardware (dislodged or non-responsive temperature sensor being the main culprit) or some form of exception that we forgot to handle. In most cases, restarting the ESP8266 brought it back to life. Shame on us. Must do better.
- On the odd occasion, the root cause remained an X-Files-style mystery (although we never did blame extraterrestrials for our mishaps, however tempting). Sometimes, it appeared as though the ESP8266 didn't receive the interrupt signal from the RTC to wake itself up from deep sleep. On other occasions, we suspected that the "little hands" of "curious individuals" had disrupted our network of sensors. Oh, so it was the aliens.
...So what if we could have 2 microprocessors identically configured, armed with the same code, but only one wakes up from deep sleep, takes the temperature reading using its own DS18B20, connects to AWS IoT Core and submits the reading? While its trusted soulmate only wakes up to check that its better half has been doing his / her job. If not (and there could be many reasons why not, like dead batteries, dodgy code, or reasons even Mulder and Scully's scriptwriters would find hard to explain away), the ever-monogamous partner can assume the primary role and do what the failed node was supposed to be doing.
Also, why not allow the surviving node to notify us of a failure, so that we can dispatch some poor, overworked individual (official title: high availability avocado monitoring system - live incident consultant / engineer, or HA2MSLICE on the call-out rota) to go and investigate?
This approach might also gain a wholly unintended benefit: because it works "offline" from a networking perspective, the secondary node should in theory outlast the primary in terms of battery life, leaving a window of opportunity to replace / recharge the previous primary before they both end up pushing daisies. HA2MSLICE no longer has to answer management's awkward questions after unplanned disruption to the guacamole supply chain. Instead, failures are proactively managed and fixed.
Now, it's worth bearing in mind that true resiliency needs to be implemented in all aspects of a solution (for example, what if our Wi-Fi, home router, or broadband provider fails on us?) Here's a simple tip to implement high availability at the avocado tier, to safeguard against one fatally catching fire: maintain a hot spare. You can thank us for this later.
This is all starting to sound rather ridiculous, just for some avocados, and we wholeheartedly agree. But you know by now that we don't just give up in the face of absurdity.
Let's salsa!
Full fat:
What we're attempting to create here is some sort of active / standby cluster, common in computing. Except in most cases these days, they work over the network to do the mutual prodding (heartbeat), and both nodes tend to be up simultaneously, as power consumption isn't the overarching concern. Here, we're trying a different approach, mainly because:
- Each ESP32 spends a majority of its time in deep sleep, only waking up to take readings and to send them to AWS IoT. This means their status needs to be stored "externally" to the unit. Previously, the devices were rudely awakened every 2 minutes or so. We're now going to up the ante - to every 15 minutes. Therefore, the chances are, they never see each other online.
- Everyone tells us that Wi-Fi networking consumes a lot of power. So connecting both devices to the network feels like a waste if one doesn't need to.
...This is how we ended up where we are: a DIY, hodgepodge of a cheap and cheerful cluster, made possible using external memory. Say ¡hola! to our external EEPROMs.
EEPROM is a type of non-volatile memory that typically allows storage of small amounts of data. In other words, the data persists even after the device is power cycled, and if deployed externally to our microprocessor and accessed over I2C, it is available for read / write operations to any other devices that share the same I2C bus. Yes, that includes the "other" ESP32 in this loving relationship... and you can now start to see how these begin to serve our purpose.
We could have, of course, opted to use a single EEPROM. And that would have reduced the complexity somewhat. But then, that EEPROM becomes a single point of failure. And which power source would it be attached to? We have opted to use an EEPROM per cluster node (ESP32) - meaning 2 in this example. Let's see how we get on.
Now, before we start showing lurid pictures of our ESP32 in dangerous liaisons with an EEPROM, we realise we owe you a fuller explanation as to why we migrated from the ESP8266 to ESP32. There are many differences in the feature set, and there are plenty of resources online that explain what these are (but think more memory and an additional processor core for starters).
But in our case, our special relationship with the ESP8266 soured for the below reasons. And - we swear - it was a case of it is you, not me.
The larger the MicroPython code became - even when we froze the code into bytecode (.mpy) - the more it kept crashing, for unknown reasons, on the connect() method of umqtt.simple.MQTTClient. This single line was taking 40-50 seconds anyway, as previously noted, so we can only assume we were hitting a resource constraint... probably RAM.
As we discovered, there are a number of useful commands that can be run in MicroPython to check memory utilisation. And when we say "useful", we mean they could have been useful. They didn't help us much here to understand if memory was definitively the issue. Or - more likely - we didn't know what we were looking for...
import gc
gc.mem_free()   # bytes of heap currently available
gc.mem_alloc()  # bytes of heap currently allocated
import micropython
micropython.mem_info(1)  # verbose report, including a map of heap usage
Incidentally, at any point, you can issue a gc.collect() command to try and free up memory via garbage collection. But this didn't really fix our issue.
We tried a lot of other things, like swapping hardware, power supplies, splitting code out into smaller modules, and couples therapy, but we couldn't attribute the crash to any root cause in particular (although memory availability during negotiation of encryption for MQTT traffic continues to remain the chief suspect). The only way we could work around it was to arbitrarily cut down on the size of the code, which was impractical in the long run, since we would have to sacrifice functionality.
Moreover, even when it did work, using a less powerful microprocessor that took 40-50 seconds on the connect() was rather counterproductive. It meant that the ESP8266 was running and consuming power for almost a minute before it went into deep sleep. On a 1- or 2-minute cycle, that is actually quite a large proportion of the time. The MQTT connect() - still with identical AWS IoT Core broker requirements of certificates, private key and SSL encryption - completes, we are happy to report, in under 5 seconds on the ESP32. That's a hefty Lidl-level saving right there. And probably why the ESP32 appears in AWS's partner device catalogue, albeit for their preferred operating system - Amazon FreeRTOS - while the ESP8266 is nowhere to be seen.
Let's summarise.
Using near-identical MicroPython code on the ESP32 stopped the microprocessor from crashing, and completed the entire execution in under 5 seconds. When used with ESP8266, it hopelessly crashed, and on the odd occasion it worked (with substantially reduced code), it was taking just under a minute to complete the MQTT connection. There was simply no reason to continue down this particular overgrown garden path inhabited by deadly vipers and man-eating badgers. Thanks ESP8266, but no thanks. We will remember the good times fondly.
Right... back to scheduled events. Here's the promised snap of an ESP32 holding hands with an EEPROM, in what we will grandly call a "single node" configuration. If you lose power to this baby, or the ESP32 crashes and doesn't recover, or gets stuck, or some nefarious state actor decides to walk off with it, your avocados are as dead as Dirty Den.
Since they are now connected up, let's see if we can do stuff with our EEPROM.
We're using I2C, which means we need to instantiate MicroPython's I2C class, supplying it with the GPIO pins we're planning to use for Clock (SCL) and Data (SDA), along with the clock frequency. Perform a scan, and it should be possible to detect the EEPROM as a slave device on the I2C bus.
from machine import I2C, Pin

# SCL on GPIO 5, SDA on GPIO 4, running the bus at 400kHz
i2c = I2C(scl=Pin(5), sda=Pin(4), freq=400000)

# Returns a list of addresses of the slave devices that responded
i2c.scan()
Cool - we have the I2C address of our EEPROM: 80 in integer, or 0x50 in hex. Incidentally, we can change the EEPROM's I2C slave address using some jumper pins, which toggle the configuration of the A0/A1/A2 address selection pins. Something for us to keep in mind when we introduce the 2nd EEPROM on the same I2C bus.
Let's try writing some data.
We'll start by writing a single byte (8 bits) to memory location 0x00, with only the right-most Least Significant Bit set high (00000001). This notation could, for example, be used to denote an 8-bit integer (1 in this case), or there might be special significance attached to this bit that only makes sense in our heads. Note that we convert the input into bytes using the int.to_bytes() method before sending it to the EEPROM using writeto_mem() of the I2C class. Later, we can retrieve it using readfrom_mem(). We can see the output in byte format, but can also convert it back into an integer to verify the result.

num_bytes = 1

# Input in binary (8 bits) - stored as integer
input = 0b00000001
i2c.writeto_mem(0x50, 0x00, input.to_bytes(num_bytes, "big"), addrsize=16)

# Output in bytes
output = i2c.readfrom_mem(0x50, 0x00, num_bytes, addrsize=16)
print(output)

# Converted to integer
output = int.from_bytes(output, "big")
print(output)

if input == output:
    print("success!")
That's about it. Sorry to disappoint, but we really are not trying to do very much here.
Note that we are rather brazenly writing bytes of data to physical memory addresses, directly, without any form of memory management or file system. There is no Memory Safety, and it's quite easy to overwrite adjacent addresses unintentionally (overflow), or read from adjacent addresses that have no relevance to our current operation (over-read). That is why we are only working with known, fixed memory addresses, of pre-defined purpose and length. This method is in no way scalable to address other use cases.
We'll try this again. This time, we'll randomly generate a 16-bit number (2 bytes), write this to memory (location 0x01), and then retrieve it.

from urandom import getrandbits

num_bytes = 2

# Input in integer (16 bits) - stored as integer
input = getrandbits(16)
print(input)

i2c.writeto_mem(80, 1, input.to_bytes(num_bytes, "big"), addrsize=16)

# Output back as integer
output = i2c.readfrom_mem(80, 1, num_bytes, addrsize=16)
output = int.from_bytes(output, "big")
print(output)

if input == output:
    print("success!")
We can repeat the read portion of this test after power cycling the EEPROM (to confirm data is persistent) and by using another ESP32 (to ensure data can be accessed from a different node).
That was the how; now, here is the what.
We will define up front which addresses are used for tracking what aspects of the cluster. As such, we're going to crudely assign the following memory locations for our no-frills, easyCluster™, that even Stelios would be proud of.
| Let's call it... | Address | ...Used for? |
| --- | --- | --- |
| EEPROM_CTRL_PRIMARY | 0x00 | 1 byte. Each bit represents which node is currently the primary, starting with the least significant bit (e.g. 00000001 for node 0). If more than one claims to be primary ("split-brain"), the client will only assume the node with the lowest number to be primary (and correct the EEPROM during the next iteration). |
| EEPROM_CTRL_RETRY | 0x01 | 1 byte. Each bit represents which nodes have been suspected of failure, but have been given another chance (e.g. 00001010 for nodes 1 and 3). |
| EEPROM_CTRL_DEAD | 0x02 | 1 byte. Each bit represents which nodes have been labelled as clinically dead (e.g. 00000110 for nodes 1 and 2). Dead nodes need to be re-admitted into the cluster manually (because there's probably something seriously wrong with them). |
| <SPARE> | 0x03 | 1 byte. Currently spare, but it might be useful to store other binary information about member nodes in the future. Like their political affiliation (yes, pretty much binary these days), biometric data (does the microprocessor have fingers: yes or no = binary), or whether the node has consented to the storage of its private data in the EEPROM in accordance with GDPR. |
Interestingly, use of 8 bits could allow us to keep track of up to 8 nodes. Or we could split these into 4 groups of 2 bits each, to track a small integer (0-3) per group. One for another day.
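To make the bit-twiddling a little more concrete, here's a minimal sketch of how a node might claim the primary role and work out who currently holds it. The constant name matches the table above; everything else (node numbers, helper names) is purely our illustration, reusing the i2c object created earlier.

# Sketch only: assumes the i2c object from earlier, and an EEPROM answering on 0x50
EEPROM_CTRL_PRIMARY = 0x00
NODE_ID = 0  # this node's position in the bit field

def read_ctrl_byte(eeprom_addr, ctrl_addr):
    # Fetch a single control byte from the EEPROM and return it as an integer
    return int.from_bytes(i2c.readfrom_mem(eeprom_addr, ctrl_addr, 1, addrsize=16), "big")

def write_ctrl_byte(eeprom_addr, ctrl_addr, value):
    # Write a single control byte back to the EEPROM
    i2c.writeto_mem(eeprom_addr, ctrl_addr, value.to_bytes(1, "big"), addrsize=16)

def claim_primary(eeprom_addr):
    # Set only our bit in the primary byte (00000001 for node 0)
    write_ctrl_byte(eeprom_addr, EEPROM_CTRL_PRIMARY, 1 << NODE_ID)

def current_primary(eeprom_addr):
    # Lowest set bit wins, which settles any "split-brain" in favour of the lowest node number
    value = read_ctrl_byte(eeprom_addr, EEPROM_CTRL_PRIMARY)
    for node in range(8):
        if value & (1 << node):
            return node
    return None  # nobody is claiming to be primary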
We now use the next available addresses to form a map of node sequence numbers. A WHAT???
We will use 2 bytes each for sequence numbers. The node reporting in will only increment its own actual sequence number, and copy across what the numbers are currently for the other nodes. During the next iteration, if the other nodes' actual sequence numbers have not changed since it was last checked and copied across, we can start to suspect that a node might be in jeopardy... If this continues to be the case, we'll mark the node as "dead".
That's the theory.
In order to avoid false positives, we'll use the aforementioned retry bits to signal if the sequence number hasn't changed for another subsequent iteration. And the cherry on this tortilla? If that node suspected of failure is currently primary, we change the primary bit in the EEPROM to be the node that detected the error - a failover occurs. Standby becomes Active. Secondary becomes Primary. Ken becomes Barbie, then becomes a Pokemon. The world is changed irreversibly, forever. Clearly without the retry bit, we'll probably find that there will be constant failovers taking place, as there is likely to be a natural mismatch in up times between the primary node - doing actual work - and the secondary (slacking off).
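Pulled together, the sequence number check, the retry bit and the failover amount to something like the sketch below. To be clear, this is a simplified illustration rather than our finished code: it reuses the read_ctrl_byte / write_ctrl_byte helpers and EEPROM_CTRL_PRIMARY constant from the sketch above, it only looks at one EEPROM, and the layout of the sequence number map (an "actual" value per node plus our last-seen copy, from a hypothetical address 0x04 onwards) is our own guess.

EEPROM_CTRL_RETRY = 0x01
EEPROM_CTRL_DEAD = 0x02
EEPROM_SEQ_ACTUAL = 0x04  # hypothetical: 2 bytes per node, incremented by that node
EEPROM_SEQ_COPY = 0x0C    # hypothetical: 2 bytes per node, the checker's last-seen copy

def read_word(eeprom_addr, mem_addr):
    return int.from_bytes(i2c.readfrom_mem(eeprom_addr, mem_addr, 2, addrsize=16), "big")

def write_word(eeprom_addr, mem_addr, value):
    i2c.writeto_mem(eeprom_addr, mem_addr, value.to_bytes(2, "big"), addrsize=16)

def health_check(eeprom_addr, me, peer):
    actual = read_word(eeprom_addr, EEPROM_SEQ_ACTUAL + peer * 2)
    last_seen = read_word(eeprom_addr, EEPROM_SEQ_COPY + peer * 2)
    retry = read_ctrl_byte(eeprom_addr, EEPROM_CTRL_RETRY)

    if actual != last_seen:
        # The peer has reported in since we last looked: all is forgiven
        write_ctrl_byte(eeprom_addr, EEPROM_CTRL_RETRY, retry & ~(1 << peer))
    elif not retry & (1 << peer):
        # First missed update: suspicious, but it gets one more chance
        write_ctrl_byte(eeprom_addr, EEPROM_CTRL_RETRY, retry | (1 << peer))
    else:
        # Second missed update: mark the peer as clinically dead...
        dead = read_ctrl_byte(eeprom_addr, EEPROM_CTRL_DEAD)
        write_ctrl_byte(eeprom_addr, EEPROM_CTRL_DEAD, dead | (1 << peer))
        # ...and if the deceased was primary, seize the role - a failover
        if read_ctrl_byte(eeprom_addr, EEPROM_CTRL_PRIMARY) & (1 << peer):
            write_ctrl_byte(eeprom_addr, EEPROM_CTRL_PRIMARY, 1 << me)

    # Remember what we saw, and bump our own number (wrapping back to 0 here for illustration)
    write_word(eeprom_addr, EEPROM_SEQ_COPY + peer * 2, actual)
    write_word(eeprom_addr, EEPROM_SEQ_ACTUAL + me * 2,
               (read_word(eeprom_addr, EEPROM_SEQ_ACTUAL + me * 2) + 1) % 0x10000)

The last-seen copies live in the EEPROM too, because the ESP32 forgets everything in RAM each time it drops into deep sleep.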
These sequence numbers being 2-byte unsigned integers, we can count from 0 to 65,535 (2^16 - 1) before we enter a condition we haven't yet decided how to handle. Perhaps we'll reset the sequence down to 0 (sensible). Or set the ESP32 on fire (not so sensible).
Incidentally, this mechanism is similar to that of a bona fide solution: a watchdog timer (we assume it's bona fide, because its Wikipedia entry has a picture of a Mars Rover). Except, here, there is a bit of cross-node checking taking place to confirm the health of one another, and the ability to assume the role of primary, rather than a hopeful reset of the microcontroller to see if this resolves whatever impediment it had encountered. We could, of course, combine this with a watchdog timer too. Like with our beloved Windows PCs, there's nothing wrong with an occasional, cheeky reboot to see if that annoying Excel cell issue can't be resolved through full-blown power cycling of the gaming rig.
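For completeness, MicroPython does expose the ESP32's hardware watchdog, so combining the two would only take a few lines. A minimal sketch (the 60-second timeout is an arbitrary choice of ours):

from machine import WDT

# If we fail to feed the watchdog within 60 seconds, the ESP32 resets itself
wdt = WDT(timeout=60000)

# ... health check, temperature reading, MQTT publish ...

wdt.feed()  # pat the dog to prove we're still alive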
In total, we've reserved 20 bytes for our cluster meta-data so far - which should leave 32,748 bytes free, still approximately (a mind-boggling) 32KB. In the future, we could use this vast storage capacity to share other interesting information between the nodes, but without any form of dynamic Memory Management, or a File System, or GDPR consent, we will quickly lose our sanity.
And the larger the data being written - compounded further by data that varies in size - the more other interesting considerations come into play, such as page write buffer sizes and EEPROM write times. All of this requires investigation if we are handling larger volumes and / or bigger data. Which, thankfully, we're not.
Back to our mission.
This entire mechanism works on the principle that all EEPROM content is kept the same. In other words, whenever new state information is written back by a node, it is written back to all EEPROMs on the shared I2C bus, even those of the other nodes. If we can't write to a particular EEPROM, we'll mark that node as being "dead".
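In code, that principle boils down to a loop over every EEPROM we know about, with a failed write condemning the owning node. Roughly along these lines - the node-to-address mapping and the helper reuse are our own assumptions:

# Sketch: mirror a control byte to every EEPROM, and mark owners of unreachable ones as dead.
# Assumes each node's EEPROM sits at a known address (set via the A0/A1/A2 pins),
# plus the read_ctrl_byte / write_ctrl_byte helpers and EEPROM_CTRL_DEAD from earlier.
EEPROM_ADDRESSES = {0: 0x50, 1: 0x51}

def write_everywhere(ctrl_addr, value):
    failed_nodes = []
    for node, eeprom_addr in EEPROM_ADDRESSES.items():
        try:
            i2c.writeto_mem(eeprom_addr, ctrl_addr, value.to_bytes(1, "big"), addrsize=16)
        except OSError:
            # Couldn't reach this EEPROM: its owner gets marked as dead on the survivors
            failed_nodes.append(node)
    for node in failed_nodes:
        for eeprom_addr in EEPROM_ADDRESSES.values():
            try:
                dead = read_ctrl_byte(eeprom_addr, EEPROM_CTRL_DEAD)
                write_ctrl_byte(eeprom_addr, EEPROM_CTRL_DEAD, dead | (1 << node))
            except OSError:
                pass  # this will be the unreachable EEPROM itself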
Next time you find yourself bored, surprise your family members with a loving sketch of random electronics and rambling thoughts that make them question your sanity. Impact is maximised, if left casually attached to the family fridge door, next to this week's school activities and work by toddlers of dubious artistic quality.
We now enter the ESP32s into eternal wedlock, without the prenup, where they can moan / grumble / complain for all eternity about who is the major contributor to this relationship, and whether the other half is pulling their weight. The I2C data and clock lines are fused together - creating a single I2C bus. Another cable creates a common ground.
And lastly, a mysterious cable between GPIO pins provides a function that we may or may not need: a way to manually signal that a device is busy doing stuff on the I2C lines. We christened it the busy line. Apparently the I2C protocol can support multi-master, but we're not entirely sure what happens when 2 nodes try to do stuff with the EEPROMs at precisely the same time. We thought it might be more within our control if we have a line that is held high when a node is about to perform an action on the I2C bus - forcing the others to wait. Ultimately, we will probably look at how I2C multi-master works, and whether we can do away with this complexity. After all, we appear to be inadvertently introducing a single point of failure. As an example, what happens if a node crashes but continues to hold this line high?
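For what it's worth, here's roughly how we picture the busy line working - a crude lock, with all of the caveats above still applying (the GPIO number is just an example):

from machine import Pin
import time

BUSY_PIN = 15  # example GPIO wired between the two boards

def i2c_lock():
    # Wait for the other node to finish, then claim the line ourselves
    busy = Pin(BUSY_PIN, Pin.IN, Pin.PULL_DOWN)
    while busy.value():
        time.sleep_ms(10)
    busy.init(Pin.OUT, value=1)
    return busy

def i2c_unlock(busy):
    # Drop the line and go back to listening
    busy.init(Pin.IN, Pin.PULL_DOWN)

And yes, there is still a race between checking the line and claiming it - yet another reason to read up on proper I2C multi-master arbitration.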
Now let's talk about pull-up resistors. Our observation was that when both nodes are powered on, both EEPROMs were happily accessible to either node without any specific pull-up resistors being in place. Results were slightly different when one "side" completely lost power (including the EEPROM). We need to work out the specifics of why this happens - although we noticed that additional 4.7kΩ pull-up resistors did help - so we suspect it has something to do with the fact that an I2C clock / data line is left floating, leading to spurious signals.
Right. The scene is now set. The camera crew have arrived. So have the disgruntled relatives that believe that this particular microprocessor marriage is a sham.
But hang on a second. What's that? Someone has spoken up when asked if there were any objections to the wedlock. They say they would quite like to be notified of all relationship status updates, in nitty-gritty detail. Preferably in real-time. Over the Internet. Ok, let's deploy a notification mechanism.
For this, we'll create a new topic - rosie/event - to send our notable events to via MQTT (from the current or new primary node, of course), and just because we do actually want to be notified, we'll set up a Rule in AWS IoT Core, and an accompanying Lambda function that uses Simple Email Service to spam us with the gossip.
Here's an example of a rosie/event MQTT message.
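As a rough illustration (the field names and the already-connected mqtt_client are our own), publishing one of these from the node looks something like:

import ujson

# Illustrative event payload; the real messages may carry different fields
event = {
    "cluster": "avocado",
    "node": 1,
    "event": "FAILOVER",
    "detail": "Node 0 marked dead; node 1 assuming the primary role",
}
mqtt_client.publish("rosie/event", ujson.dumps(event))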
We create a simple rule that reacts to MQTT messages being published on the rosie/event topic, and fires off a Python function in Lambda called send_cluster_mail. Lambda is an AWS offering commonly known as serverless: it allows us to write event-driven code that is invoked when certain events take place - in our case, the MQTT message being picked up by our IoT Core Rule.
The Lambda function - send_cluster_mail - has access to the content of the MQTT message through the event data structure. Clearly, AWS Simple Email Service needs to be set up, with our mail domain and email addresses registered and verified, for this to work end to end. If done, the code simply constructs a notification email using content from event, and dispatches it to our email address using AWS SES.
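As a rough sketch - the recipient addresses, subject line and exact shape of event are our own guesses - send_cluster_mail might look something like this:

import boto3

ses = boto3.client("ses")

def send_cluster_mail(event, context):
    # Invoked via the IoT Core Rule for each message published on rosie/event;
    # the Rule passes the MQTT payload through as the 'event' dictionary
    subject = "HA2M cluster event: {}".format(event.get("event", "UNKNOWN"))
    body = "Node {} reports: {}".format(event.get("node"), event.get("detail"))

    ses.send_email(
        Source="cluster@example.com",  # must be a verified SES identity
        Destination={"ToAddresses": ["us@example.com"]},
        Message={
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
    )
    return {"status": "sent"}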
Ok, we think that's it. Piña coladas at the ready. It's honeymoon time!!!
We have a high-level method called health_check(), and if we run it on node 0, when node 1 hasn't yet been admitted into the cluster ("dead"), it shows node 0 as being primary. Clearly, we have additional code following it to go on and connect to the Wi-Fi, take temperature readings, and finally update its own sequence number, if the node recognises itself to be the rightful primary.
Node 1 looks powered on and healthy, so let's re-admit it back into the cluster using another high-level method - reset_node().
Node 1 is now back in the cluster, and no longer marked as dead. Node 0 carries on being the primary, as there is no reason to fail over at this point. But it's always nice to have someone to back you up.
Let's now leave Node 0 stuck at this prompt (meaning it won't update its own sequence number any more). It's kind of stuck. Help!
We are now on Node 1, running health_check(). The first time it is run, it notices that Node 0's sequence number hasn't changed and marks it as suspicious. The second time it is run, it observes that the sequence number still hasn't changed... so it forcefully seizes the primary role from Node 0 and marks it as dead. Now being primary, it will connect to Wi-Fi and dispatch the readings.
In this current deployment, someone would have to manually run reset_node() again once we verify that the issue is fixed on node 0, but it could be enhanced to periodically check if node 0 has been fixed, and re-admit it back into the cluster as a secondary node.
This is what the entire flow looks like on the primary node (with the secondary node dead) when left to run automatically.
While this gives the illusion of a degree of autonomy and intelligence, behind the scenes, all that is happening is that cluster meta-data is being written to and read from the EEPROMs.
And all while this is going on...?
Our Lambda function appears to have taken effect, and a torrent of notification emails is arriving in our inbox. Of course, a slight tweak to the Lambda function could trigger some other services instead - perhaps an SMS message, or a full-blown workflow on some sort of HA2MSLICE dispatch system. We aren't limited to using Lambda either; we could trigger a response directly at the IoT Core Rule level.
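For instance, swapping the SES call for Amazon SNS would get us a text message instead - something along these lines (the phone number is, of course, made up):

import boto3

sns = boto3.client("sns")

def send_cluster_sms(event, context):
    # Same trigger as before, but nag us by SMS rather than email
    sns.publish(
        PhoneNumber="+447700900000",  # made-up UK number
        Message="HA2M: node {} says {}".format(event.get("node"), event.get("detail")),
    )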
As the health check is performed every time the microprocessor awakes from deep sleep - every 15 minutes in our case - the resulting flurry of emails arriving in our inbox did become rather annoying for this individual. Perhaps some additional logic could be implemented, either on the device or in AWS, to note that someone has already been notified of the event.
That's it.
Like that annoyingly hyperactive couple that always tells you what they are up to on social media, like, you know, every 15 minutes, we have a pair of microprocessors that appear to be keeping close tabs on each other, and informing the rest of the world about every little hiccup and argument (minus the stylishly filtered holiday snaps).
Now, this is still quite buggy, and because there are just too many potential failure scenarios, we'd need to work out exactly which of those this is good for, and how it could be improved. Those of you who are observant (thank you for making it this far - we assure you, this is nearing its end AND there is a video of an avocado catching fire as a reward) will have also noticed that bytes to track sequence numbers have been reserved for 2 additional nodes, so could this eventually become a 4-node cluster? If so, could the secondary nodes be doing something useful with their time while remaining offline (maybe processing data)?
All that can wait for another day. We can safely say that we've rigorously followed the instructions provided on our avocados, and taken the notice rather seriously. They have been ripened at room temperature, and we have continuous readings in our DynamoDB table to prove it, thanks to a bit of HA2M.
¡Arriba, Arriba! We believe we have earned our burritos topped with guacamole.
Note: Please don't forget to implement high availability of the avocados themselves by deploying a hot swappable spare.
Code
The code is still buggy. Once it's polished, we'll provide a link to it here.
References:
As ever, datasheets should be looked at before using a new component. Here is one for the AT24C256 that we have decided to use:
We don't pretend to know half of what I2C is truly capable of. But thankfully for us, there are others that do. Sparkfun has a great overview of this serial protocol.
While we're at it, Sparkfun also have a great page on EEPROMs as well:
MicroPython has a guide on how to optimise Python code. While moving onto the ESP32 resolved all of our issues, we have continued to adopt some of those practices, such as using pre-compiled frozen bytecode.
More generally, here's the official documentation for the MicroPython I2C class:
The venerable ESP32 really requires no introduction. There are some great resources online.
- https://www.espressif.com/en/products/hardware/esp32/overview
- http://esp32.net/
- https://en.wikipedia.org/wiki/ESP32
- AWS IoT Core - https://aws.amazon.com/iot-core/
- AWS DynamoDB - https://aws.amazon.com/dynamodb/
- AWS Lambda - https://aws.amazon.com/lambda/
- AWS SES - https://aws.amazon.com/ses/