MQTT Temp & Humidity sensor - gone crazy

Hi @TravisE_NCD_Technica ,

I’ve got a number of MQTT direct-to-cloud temperature and humidity sensors feeding data into some cloud software through AWS.

I have the sensors reporting at an interval that equates to once per hour. For 3 to 4 months everything was perfect; however, on Friday night one of my sensors basically went mad, transmitting every few hundred milliseconds for 2 hours. That put 12,000 records in my linked DB table and triggered 12,000 alert emails, which pretty much crashed the database and had an adverse effect on my monthly email costs.

It’s a nightmare situation in a commercial product.

I’ve rebooted the sensor and restored my table, and everything is back up and running; however, I am VERY concerned about this happening again.

Can you offer any advice here? Are there any firmware-based settings I can change to prevent a repeat occurrence?

Thanks.

Hi Scott,

I have never seen that. So far these sensors have been really solid. Honestly, I’m not even sure how the sensor would be capable of reporting at an interval like that; that’s really, really fast. Is it possible this happened somewhere else in the system?

Hi @TravisE_NCD_Technica ,

As an update on this issue…

The first time, I wrote this off as a one-off; however, I’ve since had 3 further failures across 3 different devices.

My setup is currently 3 devices, all in different geographical locations (on different Wi-Fi networks), all reporting a shadow update once per hour into AWS. From AWS I run a rule that creates a DynamoDB table entry from each shadow update output. My ERP software queries this data to produce charts, log out-of-tolerance events, etc.
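For reference, the rule is essentially of this shape (a simplified sketch only, not my exact rule; the attribute names and the “+” thing-name wildcard are placeholders, and the DynamoDB write itself is configured as the rule’s action):

SELECT state.reported.temperature AS temperature,
       state.reported.humidity    AS humidity,
       topic(3)                   AS thing_name,
       timestamp()                AS received_at
FROM '$aws/things/+/shadow/update/accepted'

So every accepted shadow update produces exactly one table write, which is why my write volume tracks the update volume one-for-one.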

Looking at the AWS logs, I can see that over the past 2 weeks I have had 3 separate error events (1 per device) where the number of “UpdateThingShadow.Accepted” events vastly exceeded the expected rate, which should be just 1 per hour:

  1. Event 1 - 17th Jan 01:00 UTC: a single sensor sent 3,013 shadow updates over 15 minutes

  2. Event 2 - 17th Jan 04:00 UTC: a single sensor sent 3,020 shadow updates over 15 minutes

  3. Event 3 - 20th Jan 20:00 UTC: a single sensor sent 37,892 shadow updates over 15 minutes

I have attached the log graphs showing the above.

This is a massive issue for me, as my rules run for each shadow update, creating tens of thousands of table entries in DynamoDB. My table’s write capacity gets breached and the system ultimately crashes. As things stand, the devices just aren’t stable enough for my application.

Can you please advise here? I need some form of solution to stabilise the shadow updates from these devices.

Please let me know if I can get you any more info on the topic. I want to see these working and will support however you need.

Regards

Scott
Shadow Update Charts.pdf (117.3 KB)

Hi Scott,

This is really strange. In the case of Event 3, 15 minutes is 900 seconds, so publishing 37,892 MQTT messages works out to roughly one publish every 24 milliseconds on average (900,000 ms ÷ 37,892 ≈ 24 ms).

This is a single-threaded device, which means it can only do one thing at a time. The call to publish to MQTT is a blocking call, meaning everything halts until the publish succeeds or the request times out. I just ran a test, and it takes approximately 70 milliseconds to publish to a local MQTT broker (not even going over the internet, just a broker running on my computer on the same network). That was test code on a device that isn’t reading a sensor; it just publishes messages to the broker as quickly as possible in a loop. So I don’t see it being possible for the device to successfully publish 37,892 messages over the course of 15 minutes, not to mention the device would likely run out of RAM trying to publish that fast.
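The benchmark loop was roughly the following shape (a sketch only, using the common Arduino PubSubClient API against a plain-TCP local broker; the production firmware talks TLS on port 8883, so real-world publishes would only be slower):

#include <WiFi.h>          // ESP32-style Wi-Fi; adjust for your board
#include <PubSubClient.h>

WiFiClient net;
PubSubClient client(net);

void setup() {
  Serial.begin(115200);
  WiFi.begin("ssid", "password");            // placeholder credentials
  while (WiFi.status() != WL_CONNECTED) delay(100);
  client.setServer("192.168.1.10", 1883);    // local broker, plain TCP
  while (!client.connect("bench-test")) delay(500);
}

void loop() {
  unsigned long start = millis();
  client.publish("bench/topic", "{\"temp\":21.5}");  // blocking publish
  client.loop();                                     // service the MQTT client
  Serial.println(millis() - start);  // ~70 ms per publish in my test
}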

Being an intermittent problem, this is difficult to track down. I would need to sniff network traffic with Wireshark (e.g. with a capture filter of “tcp port 8883”) to see these requests going out to the broker in order to definitively confirm it actually happening at the device.

I will keep digging in the code. There is already a check in the firmware against the previous publish time, so it only publishes at the interval specified by the user:

if (millis() > lastReport + settings->reportInterval) {

The call to publish is inside that if statement, so it should only be called at the programmed interval.
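One general caveat I will double-check while digging (a standard Arduino millis() consideration, not a confirmed cause here): millis() is a 32-bit unsigned counter that wraps roughly every 49.7 days, and in the form above the sum lastReport + settings->reportInterval can wrap before millis() does, which would make the condition evaluate true on every pass through the loop until the counter catches up. The conventional rollover-safe form compares with unsigned subtraction, for example:

// Hypothetical illustration, not the actual firmware code.
// Unsigned subtraction wraps correctly across the millis() rollover,
// so the elapsed-time test stays valid at any uptime.
if (millis() - lastReport >= settings->reportInterval) {
    lastReport = millis();
    publishSensorReading();  // placeholder for the real publish call
}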

In these published messages, are you able to see the transmission_count variable incrementing for each publish, or is it constant during these periods? That might tell us a lot.

Also, if this rule runs on every Shadow Update event, what is to say it’s the device updating the shadow? Do any variables in the shadow change when this happens?

For what it’s worth, we’ve sold a lot of these and I have no other reports of this happening. I just can’t see how it would be possible for the device to do this.

Thanks Scott. I look forward to helping you resolve this in one way or another.
Travis