Missed motion events; outage?

dds82 · 2019-12-30 15:30:55 UTC

Today, I happened to have full logging turned on in one of my pistons just as it failed to run, and I saw something very different. Right in the middle of its execution, it timed out with an error that it took more than 20 app execution seconds.

I don’t have the live logs for this anymore (I’m not sure how helpful they would be anyway), but this is a piston that normally executes in under a second. It does not have any waits, and parallelism is turned on, so it should never be waiting at a semaphore.

Here’s the piston:

The timeout would happen somewhere inside the “on events” block, preventing it from turning Switch 16 and Switch 17 on or off.

Goofball · 2019-12-30 16:47:47 UTC

Also seeing the same with missed/missing events in Webcore but no issues on the SmartThings side. Actiontiles also appears to not be missing anything when I go look at my panels. In the last two days, possibly longer I’ve had long term existing, functional pistons fail to fire on switch on/off or contact/motion state change events.

This morning the scheduled task at 9A local time to turn on a virtual switch and run my downstairs Neato BotVac fired as expected, and ST changed the switch to ON. The second piston that watches for the switch state to change to on and actually send the start command to the vacuum failed to see the state change. This has worked reliably for months before this.

Yesterday evening, twice the long-term piston that monitors my garage entry door contact and turns on the garage light for 2 minutes failed to turn the light back off after the wait was up. Had what was likely another instance of this a week or so ago that was attributed to something else or someone leaving the light on after using the manual switch.

Also yesterday evening the long-term piston that monitors for a virtual switch to turn on and temporarily bump up my thermostat temps failed to see and fire on the on state change for my downstairs thermostat. The same one for the upstairs thermostat/v-switch fired as expected.

There’s definitely something going on with missed events.

ian_boje · 2019-12-30 17:00:30 UTC

In the logs, I see some negative numbers for some events. For example:
+0ms ╔Received event [My home].time = 1577724588004 with a delay of -778ms

Is that sort of message normal? Could one of the servers have a clock that is off? I ran into something once running on a VM: Ping timings were coming back as negative like this. The system engineers adjusted some setting on the VM host to correct the problem.

dds82 · 2019-12-30 19:56:20 UTC

I think this is normal; I’ve seen this for a while. My understanding is that events aren’t guaranteed to be fired at an exact time – they can be fired at time T +/- 2 seconds.

In fact, I’ve even seen double events when the event fires early enough (it fires again at the correct time). I had to write code in one particular piston that is sensitive to double events to detect a double event and ignore the second one.

Oldcomputerwiz · 2019-12-31 03:22:17 UTC

Just watched it happen in our bedroom. My bride walked into the bedroom, no lights. I quickly checked and ST registered motion and none of the conditional switches that would prevent the light from coming on were on.

She walked into the master bath and the light came on in the bath as it should. While in the bath the motion sensor timed out. She walked back into the bedroom and the bedroom light turned on as it is supposed to. I don’t have any logs but I know ST saw the motion sensor switch to active because the dashboard had updated.

andynbaker · 2019-12-31 03:32:52 UTC

I’m having motion based lighting rules not fire either. Same symptoms as everyone else: just started acting up since the last ST update, events seem to log in ST but webcore doesn’t trigger. I’m getting weird errors in the live logs too. I’ll try to capture and post.

no matter, i know i need to clean up my pistons. I have stuff with a ton of triggers and lots of semaphores at times.

ian_boje · 2019-12-31 03:33:54 UTC

At this point, I’ve simulated 52 “active” and 52 “inactive”, but the 2nd piston has only seen 41 of each. Earlier today, there was an imbalance of activations and inactivations on the second piston. This eliminates my earlier theory that I was only missing the motion clear events, at least in the test/simulated pistons.

I’m not sure what the next steps are. Is there a smartthings support number to call?

ian_boje · 2019-12-31 03:34:50 UTC

Do you know the date of the last update?

andynbaker · 2019-12-31 03:35:33 UTC

late november, i think? V2 and V3 hubs only. firmware 28 i think

dds82 · 2019-12-31 04:21:03 UTC

This has nothing to do with the hub. I don’t have a hub in my setup.

It started acting up for me after that big outage on December 19.

Earlier this evening, corresponding to a piston failure, I found this in the live logs for one of my devices:

7:04:13 PM: error org.springframework.jdbc.UncategorizedSQLException: Hibernate operation: could not execute query; uncategorized SQLException for SQL [select this_.id as id88_0_, this_.version as version88_0_, this_.date_created as date3_88_0_, this_.keyas key4_88_0_, this_.type as type88_0_, this_.value as value88_0_ from server_config this_ where this_.key=?]; SQL state [null]; error code [0]; [SimpleAsyncTaskExecutor-702] Timeout: Pool empty. Unable to fetch a connection in 30 seconds, none available[size:50; busy:1; idle:0; lastwait:30000].; nested exception is org.apache.tomcat.jdbc.pool.PoolExhaustedException: [SimpleAsyncTaskExecutor-702] Timeout: Pool empty. Unable to fetch a connection in 30 seconds, none available[size:50; busy:1; idle:0; lastwait:30000]. @line 121 (sendSwitch)

andynbaker · 2019-12-31 04:36:13 UTC

I’m getting similar/same springframework errors in my logs

ian_boje · 2019-12-31 11:22:25 UTC

Are you running tomcat, or do you think that’s from webcore? Googling around, I think the default connection pool is 100, but that error seems to imply that it is configured for 50.

dds82 · 2019-12-31 11:43:18 UTC

It’s coming from somewhere in the SmartThings ecosystem. I used to think it was from webcore but then I found it in the live logs for a device that uses an official DTH.

I’m starting to suspect that the December 19 outage wasn’t actually resolved properly.

ian_boje · 2019-12-31 15:14:59 UTC

@dds82 I think I agree with you.

Perhaps I’ll revise my test to see if webcore tests clean.

Late last night, I opened an email ticket with them. The auto-reply said that they were unusually busy.

dds82 · 2019-12-31 15:21:13 UTC

Same here. I also posted on their forum. Let’s see what gets their attention.

More likely, unusually understaffed due to the holiday week.

ian_boje · 2019-12-31 15:26:27 UTC

Half of me wondered if the guys that can make the fixes are out on holiday until next year.

ian_boje · 2019-12-31 16:43:08 UTC

The support replied back saying that they suggest contacting the webcore dev team, and post to the community website.

dds82 · 2019-12-31 16:59:14 UTC

That’s what happened when I reported it the first time too. I reported it last night as happening in their own DTH. I’m still waiting for a response to that.

ian_boje · 2019-12-31 18:24:56 UTC

Well, I have a feeling that webcore is fine. So far, no missed events when bouncing test events through IFTTT. I’ll keep watching. The smartthings events are at about 21% loss now for me.

dds82 · 2019-12-31 20:18:58 UTC

I think I may have finally gotten their attention. See this thread: