MQTT Temp & Humidity sensor - gone crazy

Hi @TravisE_NCD_Technica ,

I’ve got a bunch of MQTT direct-to-cloud temperature and humidity sensors feeding data into some cloud software through AWS.

I have the sensors reporting at an interval that equates to once per hour. For 3 to 4 months everything has been perfect; however, on Friday night one of my sensors basically went mad, transmitting every few milliseconds for two hours. That created 12,000 records in my linked DB table and resulted in 12,000 alert emails. This pretty much crashed the database and had an adverse effect on my monthly email costs.

It’s a nightmare situation in a commercial product.

I’ve rebooted the sensor, restored my table and everything is back up and running; however, I am VERY concerned about this happening again.

Can you offer any advice here? Are there any firmware-based settings I can change to prevent a repeat occurrence?

Thanks.

Hi Scott,

I have never seen that. So far these sensors have been really solid. Honestly I’m not even sure how the sensor would be capable of reporting at an interval like that. That’s really really fast. Is it possible this happened somewhere else in the system?

Hi @TravisE_NCD_Technica ,

As an update on this issue…

The first time I wrote this off as a “one-off”; however, I’ve since had 3 further failures across 3 different devices.

My setup is currently 3 devices, all in different geographical locations (different Wi-Fi networks), all reporting a shadow update once per hour into AWS. From AWS I run a rule to create a DynamoDB table entry from each shadow update. My ERP software queries this data to produce charts, log out-of-tolerance events, etc.

When looking at the AWS logs I can see that over the past 2 weeks I have had 3 different error events (1 per device) where the number of “UpdateThingShadow.Accepted” events has vastly exceeded the expected rate, which should be just 1 per hour:

  1. Event 1 - 17th Jan 01:00 UTC a single sensor sent 3,013 shadow updates over 15 minutes

  2. Event 2 - 17th Jan 04:00 UTC a single sensor sent 3,020 shadow updates over 15 minutes

  3. Event 3 - 20th Jan 20:00 UTC a single sensor sent 37,892 shadow updates over 15 minutes

I have attached the log graphs showing the above.

This is a massive issue for me as my rules run for each shadow update and create tens of thousands of table entries in DynamoDB. My table write capacity is breached and the system ultimately crashes. Currently the devices just aren’t stable enough for me.

Can you please advise here? I need some form of solution to stabilise the shadow updates from these devices.

Please let me know if I can get you any more info on the topic. I want to see these working and will support however you need.

Regards

Scott
Shadow Update Charts.pdf (117.3 KB)

Hi Scott,

This is really strange. In the case of Event 3 that is a report interval of approximately 24 milliseconds: in order to publish 37,892 MQTT messages in 15 minutes the device would have to publish about once every 24 milliseconds on average.

This is a single-threaded device, which means it can only do one thing at a time. The call to publish to MQTT is a blocking call, meaning everything halts until the publish succeeds or the request times out. I just ran a test and it takes approximately 70 milliseconds to publish to a local MQTT broker (not even going over the internet, just a broker running on my computer on the same network). I don’t see how it is possible for the device to successfully publish 37,892 messages over the course of 15 minutes, not to mention the device would just run out of RAM trying to publish that fast. This was test code on a device that isn’t reading a sensor; it just publishes messages to the broker as quickly as possible in a loop.
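For illustration, the test was along these lines (a rough sketch rather than the exact code I ran; it assumes an ESP-class module and the common PubSubClient library, with placeholder network and broker details):

#include <WiFi.h>
#include <PubSubClient.h>

WiFiClient net;
PubSubClient mqtt(net);

void setup() {
  Serial.begin(115200);
  WiFi.begin("my-ssid", "my-password");                  // placeholder credentials
  while (WiFi.status() != WL_CONNECTED) delay(100);

  mqtt.setServer("192.168.1.50", 1883);                  // broker on the local network
  while (!mqtt.connect("publish-speed-test")) delay(500);

  const int count = 1000;
  uint32_t start = millis();
  for (int i = 0; i < count; i++) {
    mqtt.publish("test/speed", "{\"temp\":21.5,\"hum\":40.2}");  // blocking publish
    mqtt.loop();                                         // service the MQTT connection
  }
  uint32_t elapsed = millis() - start;
  Serial.printf("average ms per publish: %lu\n", (unsigned long)(elapsed / count));
}

void loop() {}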

An intermittent problem like this is difficult to track down. I would need to sniff network traffic with Wireshark and watch these requests going out to the broker on port 8883 to definitively confirm it is actually happening.

I will keep digging in the code. There is already a check in the firmware for the previous publish time so it only publishes at the interval specified by the user:

if(millis() > lastReport+settings->reportInterval){

The call to publish is inside that if statement so it can only be called at the interval programmed.
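In simplified form the structure is something like this (an illustrative sketch of what I just described, not the firmware verbatim; publishReading() stands in for the actual publish routine):

void loop() {
  // other housekeeping (Wi-Fi / MQTT keep-alive, sensor reads) runs every pass

  if(millis() > lastReport+settings->reportInterval){
    publishReading();        // blocking MQTT publish of the current reading
    lastReport = millis();   // remember when we last published
  }
}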

In these published messages are you able to see the transmission_count variable incrementing for each publish or is it constant during these periods? That might tell us a lot.

Also if this rule is just running on Shadow Update event what is to say it’s the device updating the shadow? Do any variables in the shadow change when this happens?

For what it’s worth we’ve sold a lot of these and I do not have any reports of this happening. I just can’t see how it’s possible the device would be capable of doing this.

Thanks Scott. I look forward to helping you resolve this in one way or another.
Travis

Hi @TravisE_NCD_Technica

Apologies it’s taken me so long to reply here. Unfortunately I am still having issues with either these devices or my setup.

I currently have three MQTT temp & humidity devices testing on client sites and want to deploy many more, but just can’t the way they are behaving. I would say each device falls over approximately once per month.

When a device falls over I get thousands of shadow reports in AWS, where a rule creates thousands of bad data points in my DynamoDB table. Each time I have to remove the erroneous data.

I can’t continue with these devices like this so really want to get to the bottom of this.

I’ve attached two images from AWS. These show sensor IDs 2 and 3 both falling over last night within an hour of each other. This triggered around 2,000 shadow events each in just a few minutes until I stopped the rules manually. I’ve had to set up alerts on table write consumption to act as a trigger to turn off the rules (so frustrating).

On the AWS side they are seeing all of these shadow updates so it seems to me the devices are somehow doing this.

As mentioned, I am keen to get to the bottom of this, but have VERY limited data available to pinpoint the issue.

Regarding your questions above:

In these published messages are you able to see the transmission_count variable incrementing for each publish or is it constant during these periods? That might tell us a lot.

I cannot see the shadow report content unless watching / listening at the time, which is not feasible while waiting for the error to occur. I can see that the data written into the DynamoDB table via the rule DOES change. The temperature and humidity figures and timestamp all change, which would lead me to believe each shadow contains different data based on the time of report.

Also if this rule is just running on Shadow Update event what is to say it’s the device updating the shadow? Do any variables in the shadow change when this happens?

AWS sees each shadow update as a standalone event - see the charts attached.


Not knowing the true cause of this and being unable to replicate the issue on my end makes it pretty difficult to troubleshoot and correct. Really the only solution I can think of, which is not elegant, is to add a publish throttle to the device so it can only publish readings at most once per second.
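Something along these lines is what I have in mind (just a sketch of the idea with placeholder names, not actual firmware code):

#include <Arduino.h>

// Hypothetical publish throttle: never send more than one reading per second,
// regardless of what the normal report-interval check decides.
const uint32_t MIN_PUBLISH_GAP_MS = 1000;
uint32_t lastPublishMs = 0;

void publishReading() {
  // stand-in for the real MQTT/shadow publish call
}

bool throttledPublish() {
  if (millis() - lastPublishMs < MIN_PUBLISH_GAP_MS) {
    return false;            // too soon since the last publish: skip it
  }
  lastPublishMs = millis();  // record the time of this publish
  publishReading();
  return true;
}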

Would you say it takes several days for this issue to occur? Do you have any way to trigger this failure with repeatability?

I will note that we have sold a lot of these devices at this point and this is the only instance of this happening that I am aware of. Perhaps it’s something strange with AWS specifically.

Hi @TravisE_NCD_Technica ,

Thanks for the quick response, and I do appreciate this one is hard to find with so little info.

The devices only fail very intermittently and there appears to be no way of predicting it.

I find myself wondering if it’s a connectivity thing… e.g., if the devices expect a 200 response back like an API and they don’t get it, do they try to push an update again?

In any case, I like the idea of throttling. I take it this could be done with a firmware update? And, if so, could we throttle even further, to 15 minutes? My devices are set to transmit only once per hour, and with a 15-minute throttle I would feel confident the issue would never reoccur. When a device does fail, it’s never for more than 3 or 4 minutes.

Keen to try this fix as I’m on site with my client in a week’s time and will have physical access to two of the three devices.

What’s the next step?

Scott

I just pushed v1.0.4 firmware to the repo here:

You can update the firmware as explained in the product user guide.

This new version adds additional checking against the last publish time, and also adds a mandatory 100 ms wait time between publishes, which honestly should never affect anything.

Let me know what you find.

Hi @TravisE_NCD_Technica ,

Thanks for the support. This morning I set up another MQTT device with the NEW v1.0.4 firmware. I’ve now got three on the old firmware and one on the new firmware. I will monitor all devices over the coming weeks / months and report back.

Scott

Sounds good Scott. Let me know what you learn.

Hi @TravisE_NCD_Technica ,

One of my sensors, NOT with the latest firmware, just failed.

This gave me the opportunity to download the database entries created by the sensor when it failed. I’ve attached an XLS for your reference.

You can see that up until Epoch time 1710004454051 the sensor was reporting at 3,600,000 ms (1 hour) intervals, correctly as configured. After 1710004454051 (highlighted in red) the sensor starts reporting pretty much constantly, every few milliseconds.

Note that every data set is unique with a different temperature and humidity, which would make me think each entry in the database was a unique shadow coming from the device.

NOTE - this is from one of the sensors on the old firmware; however, I wanted to provide the data to you for your views.

I will keep monitoring the sensors and report back.

Scott
Export.xlsx (295.7 KB)

Hi Scott,

Would you say it’s plausible that the sensor failed after approximately 49 days? If so, this could have something to do with a rollover of the processor’s internal clock. Although, looking at how that would react, it seems the device would stop reporting in that event rather than reporting at a fast interval.
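For reference, if the timing is driven by the Arduino-style millis() counter (as the line I quoted earlier suggests), that counter is an unsigned 32-bit millisecond count, so it wraps back to zero roughly every 49.7 days. A quick back-of-envelope check:

#include <Arduino.h>

void setup() {
  Serial.begin(115200);
  // millis() wraps after 2^32 milliseconds:
  const double wrapMs = 4294967296.0;                        // 2^32
  Serial.println(wrapMs / 1000.0 / 60.0 / 60.0 / 24.0, 2);   // prints 49.71 (days)
}

void loop() {}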

Let me know.

Thank you,
Travis

Hi @TravisE_NCD_Technica ,

I had not spotted that … VERY interesting indeed!!!

YES, exactly 49 days between events for ALL sensors.

I have three sensors and here is the event data:

You can see from the table above that SN1 has been installed the longest and has failed 3 times, EXACTLY 49 days apart each time.

Further, SN2 and SN3, which we installed in a different geographical location later on, failed exactly on the 17th Jan and 24th Mar, also exactly 49 days apart.

It seems to me something is happening every 49 days in the device which is causing this. Based on the data I can even forecast the next event with fair confidence.

Just a thought… but could I have a firmware build that reboots the devices once per week? It might seem a crude fix, but it would resolve my issue if they auto-rebooted once a week or once a day. It might also iron out any sensors going offline due to other connectivity issues.
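Something like the following is what I’m picturing (purely illustrative, assuming the module is ESP-based and exposes a restart call like ESP.restart()):

#include <Arduino.h>

// Hypothetical weekly self-reboot guard
const uint32_t REBOOT_AFTER_MS = 7UL * 24UL * 60UL * 60UL * 1000UL;  // 7 days

void checkScheduledReboot() {
  if (millis() >= REBOOT_AFTER_MS) {
    ESP.restart();   // uptime counter restarts from zero, far short of the ~49-day mark
  }
}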

Hoping this now helps you find the gremlin… and before 44 days time when it will reoccur

Scott

Ok. Let me see if I can implement a fix in the firmware.

Like I said, the internal clock may be rolling over after 49 days, but by that logic I would expect the sensor to stop reporting rather than reporting in fast succession. Let me investigate further.

@scott.wells

I just uploaded v1.0.5. From my testing this should correct the bug you are encountering. I had to simulate advancing the clock to 49 days to test but in that simulation it worked. We will know for certain the bug is fixed if the devices run 50 days with no issues.

Repo for firmware available here:
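For anyone following along: if the interval check is written the way I quoted earlier (millis() > lastReport + settings->reportInterval), the addition itself can wrap once lastReport gets within one report interval of the 32-bit limit. From that point the comparison evaluates true on every pass through the loop, so the device publishes continuously until millis() itself rolls over, which would explain a flood lasting minutes rather than the device going quiet. The usual rollover-safe pattern compares elapsed time with an unsigned subtraction; the fix is along these lines (a sketch, not the shipped code verbatim):

#include <Arduino.h>

uint32_t lastReport = 0;
uint32_t reportIntervalMs = 3600000UL;   // stands in for settings->reportInterval (1 hour)

void publishReading() {
  // stand-in for the existing blocking MQTT/shadow publish routine
}

void setup() {}

void loop() {
  // Unsigned subtraction stays correct across the 2^32 ms wrap of millis():
  // even after millis() rolls over and becomes numerically smaller than
  // lastReport, (millis() - lastReport) still yields the true elapsed time.
  if (millis() - lastReport >= reportIntervalMs) {
    publishReading();
    lastReport = millis();
  }
}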

Thanks @TravisE_NCD_Technica

I will start updating devices and record the date they were turned on. Like you say, then it is just a case of waiting 49 days.

Scott

Sounds good Scott. I look forward to hearing about the results and I appreciate your patience working through the problem.

Keep me updated.

Travis

Hey @scott.wells

I made a mistake on that last firmware. It was not accounting for the report interval of the sensor being variable. I just pushed a new firmware.bin to the repo. Can you update the firmware on your sensors again? I sincerely apologize for the brain lapse there.

Hi @TravisE_NCD_Technica -

I take it version 1.0.5 is still the latest, but the files have been updated in the repo.

I haven’t updated any of my sensors yet so no problem there. I’ve downloaded the current files and will start tomorrow.

Scott

Yes, I did not increment the version number since this was a fix to a fix. The latest is still 1.0.5.