Topic: Context and thread for executing actions

Offline GeekGoneOld

  • Jr. Member
  • **
  • Posts: 80
  • Karma: +3/-0
Context and thread for executing actions
« on: January 15, 2019, 07:54:53 pm »
I have a plugin (Vera2 UI5) that expects to have a service action of a device instance triggered by an HTTP call from an external device as follows:

http://<ip>:3480/data_request?id=action&DeviceNum=<dev>&serviceId=<svc>&action=SetPresent&newPresentValue=<val>
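For illustration, here is a minimal sketch of how an external client might assemble that request in Python. The IP address, device number and service ID in the example call are made-up placeholders, and `build_action_url` is a hypothetical helper, not part of any Vera library; only the parameter names come from the URL above.

```python
from urllib.parse import urlencode

def build_action_url(ip, device_num, service_id, value, port=3480):
    """Assemble the data_request URL for the SetPresent action."""
    params = {
        "id": "action",
        "DeviceNum": device_num,
        "serviceId": service_id,
        "action": "SetPresent",
        "newPresentValue": value,
    }
    # urlencode percent-escapes reserved characters such as ':' in the service ID
    return "http://%s:%d/data_request?%s" % (ip, port, urlencode(params))

# Placeholder values, for illustration only
url = build_action_url("192.168.1.50", 42, "urn:example-com:serviceId:Presence1", 1)
print(url)
```

The resulting string can be passed to any HTTP client (for example `urllib.request.urlopen`) on the calling device.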

No problem triggering the action.  The action reads and writes local variables and also calls luup.variable_set and luup.call_delay (the latter for a timeout check).

So my problem is that more than one HTTP client triggers the action for a single device instance, and the clients are asynchronous.  When the actions are triggered more than, say, 2 seconds apart, everything runs just fine.  When two HTTP clients trigger the same action within about 200ms of each other, I get a Luup crash/restart.

So this raises my concerns about which thread and which context are running at any given time.  It was my understanding (perhaps wrong!) from many messages in this forum that each device instance (I only have 1 of these) is in its own Luup context (assuming that means its own local variable instances) and also has two threads.  I don't know why 2 threads.  What is not clear is whether the HTTP parsing calls the action function directly or whether it "messages", "queues" or similar, the device to perform the action.  Or worse yet, whether it creates a job (which I haven't seen defined very well) that executes the action, with the job having its own thread.

So, I ask:

1. what is a luup context
2. what are the two threads per device instance used for
3. in what thread are the service actions of a device instance executed
4. in what thread and context are callbacks executed (specifically call_delay, but other callbacks as well)
5. what is a job and in what thread and context are the jobs run

Though I'm no expert, I have 30 years of experience in multitasking, embedded programming.  I always prefer to understand the architecture of the system so I can then solve my own problems rather than bug you guys to fix them.  That is why I intentionally haven't posted code or other specifics.  Having said that, I'm very happy to answer any questions and provide any code.

Also, I'm very happy to read manuals, so if the answer is in a manual, please give me a link.  I couldn't figure out the answers from the plugin development docs, though.

Regards
Keith

Offline rafale77

  • Community Beta
  • Hero Member
  • ******
  • Posts: 1749
  • Karma: +101/-27
  • HA ≠ IoT as a blue sky is cloudless.
Re: Context and thread for executing actions
« Reply #1 on: January 15, 2019, 09:42:26 pm »
Reading this with great interest as I am observing similar issues with HTTP calls sent from one client to multiple devices. On a Vera Plus (UI7) it works, but with limitations. If I send too many commands within a short time (I have not been able to quantify the time) I often get a Luup reload. I initially thought of it as a command queuing problem, but looking at my logs, it is more likely that the Luup engine itself cannot handle the jobs: a job hanging and timing out appears to be what crashes the Luup engine, as I am getting exit code 137.

Code:
2019-01-15 16:42:56 - LuaUPnP Terminated with Exit Code: 137
02 01/15/19 16:42:58.206 JobHandler_LuaUPnP::Run: pid 27983 didn't exit <0x77fe8320>
02 01/15/19 16:42:58.648 UserData::TempLogFileSystemFailure start 1 <0x77fe8320>
02 01/15/19 16:42:58.750 UserData::TempLogFileSystemFailure 7621 res:1

Though it may not be exactly the same problem, since you are dealing with a single device and multiple commands, I am seeing a crash with multiple devices and commands. I too have been looking for the command queuing limits of the engine.
openLuup (79 devices, 141 scenes, 19 apps) master to VeraPlus (142 zwave nodes, 8 Zigbee nodes, 221 devices,  20 scenes , 2 apps) +  Hubitat (15 Zigbee nodes) + Home-Assistant (API Integrations). Bridged to Siri and Alexa. Homewave. VeraPlus ExtRooted and mios server independent.

Offline GeekGoneOld

Re: Context and thread for executing actions
« Reply #2 on: January 16, 2019, 12:54:25 pm »
As I mentioned, in the action code I make calls to luup.variable_set (two).  I have now changed that to make them conditional and happen far less often.  I'm going to watch for "back to back" executions of the action (which is what seems to trigger the Luup restart) and see whether this changes things (i.e. less likely to restart Luup now).

I have a few other reorganizations of code I can do (haven't yet) to do far less work in the action (e.g. executing much code in a call_delay callback) to see if I can minimize the work that is done without a break (i.e. back to back executions of the entire action).

My suspicion is that I am crashing because I'm doing back to back executions of the action without a break in between.  Although this is completely "legal", perhaps the system is unable to do enough periodic housecleaning (garbage collection maybe?) and it incorrectly concludes a failure (memory leak maybe?) and crashes Luup.  If I only do a tiny bit in the action (I can't prevent back to back calls to the action) and schedule the rest with call_delay, perhaps it will get the break it needs to clean house.  This will be my next test if this one proves better but not perfect.

I'm totally guessing since I am unfamiliar with the Lua (or Luup) environment.  My guesses are that the problem is buried in the system, not in my code.  How arrogant of me!

Will advise on the current test (fewer luup.variable_set calls in callback) when it has a bit of runtime history...

Offline rafale77

Re: Context and thread for executing actions
« Reply #3 on: January 17, 2019, 02:46:58 am »
I have experienced the same. Most of my commands, though, come from openLuup, and I am testing some code to add intervals between commands on openLuup. I am fairly certain now that the vera has no command queue.

Offline rigpapa

  • Beta Testers
  • Hero Member
  • *****
  • Posts: 1121
  • Karma: +187/-3
Re: Context and thread for executing actions
« Reply #4 on: January 17, 2019, 07:50:34 am »
If the action you are running is enclosed in a <run> tag in the implementation file (I_xxx.xml), you might want to try changing that to a <job> tag. This changes the execution environment from an immediate, in-line run of the action's implementation to deferred/queued. The use of <job> requires that you return a job completion status from the implementation, with return 4,0 being the most common success indication, and return 2,0 indicating failure.

I've found that many deadlocks can be avoided by careful selection of what runs in line, and what is deferred, particularly when communicating with certain classes of Z-Wave devices (locks, in particular... the deadlocks, oh, so many).
Author of Reactor, DelayLight, SiteSensor, Rachio, Deus Ex Machina II, Intesis WMP Gateway, Auto Virtual Thermostat and VirtualSensor plugins. Vera Plus w/100+ Z-wave devices. Vera3, Lite. Hassio, Slapdash.

Offline rafale77

Re: Context and thread for executing actions
« Reply #5 on: January 17, 2019, 11:48:33 am »
Thank you. Unfortunately the core implementation code on the vera does not appear accessible, so I don't know whether they run as job or run. However, I am now testing akbooer's suggested code on openLuup to see if openLuup can delay its calls. It appears that even a very slight delay has been able to address my problem with this one particular scene, but I have another which could potentially be worse, triggering 50+ zwave actions, that I will test.

Offline akbooer

  • Beta Testers
  • Master Member
  • *****
  • Posts: 6387
  • Karma: +290/-70
  • "Less is more"
Re: Context and thread for executing actions
« Reply #6 on: January 17, 2019, 12:07:25 pm »
Quote from: rafale77
Unfortunately the core implementation code on the vera does not appear accessible so I don't know whether they run as job or run

FWIW, IIRC, a job will return a job number as part of its return data, whereas a run does not.
3x Vera Lite-UI5/Edge-UI7, 25x Fibaro, 23x TKB, 9x MiniMote, 2x NorthQ Power, 2x Netatmo, 1x Foscam FI9831P, 9x Philips Hue,
Razberry, MySensors Arduino, HomeWave, AltUI, AltHue, DataYours, Grafana, openLuup, ZWay, ZeroBrane Studio.

Offline GeekGoneOld

Re: Context and thread for executing actions
« Reply #7 on: January 17, 2019, 09:29:19 pm »
Turns out that removing the guarantee of two luup.variable_set calls in the action did not improve it.  It crashed after running the action without calling luup.variable_set once.

So I added the line os.execute("echo 3 > /proc/sys/vm/drop_caches") just before returning from the action and it has not failed yet.  It is very premature, and I'm not convinced, since I have been checking top and free and I still see very little free memory and a lot of cache.

@rigpapa: I'm interested in the <job> vs <run>.  Sounds promising!  I don't really see a downside to doing that.

@rafale77: keep me informed of what you try and what is better, worse, disastrous and, most importantly, successful.  Your situation and mine might be the same.

Offline GeekGoneOld

Re: Context and thread for executing actions
« Reply #8 on: January 17, 2019, 10:01:33 pm »
So I searched re <run> vs <job> and looky here!

http://forum.micasaverde.com/index.php/topic,28583.msg204271.html#msg204271

Sure seems to support rigpapa's suggestion.

Offline GeekGoneOld

Re: Context and thread for executing actions
« Reply #9 on: January 20, 2019, 12:17:28 am »
So adding  "os.execute("echo 3 > /proc/sys/vm/drop_caches")" helped a lot but didn't cure the problem.  Still got Luup reloads with same status (245).

Eliminated os.execute and changed <run> to <job> and changed the return true to return 4,nil.  Amazing how much code there is out there that has no return at all in either <run> or <job> tags of actions.  Even MCV code.  Lots of it.

I will let you know of the stability in the next couple of days.

@rafale77: I have very little in my action but it is always triggered by HTTP request.  I think you are in the same boat.  I'm starting to think that although my code is perfectly acceptable to be <run> (i.e. it executes quickly), in the big picture it is only the tail end of an HTTP request and, for all I know, HTTP is held up pending my return code.  It may be that if the action comes from HTTP, <job> is the only safe way.  Again, let's see about the stability.

Offline akbooer

Re: Context and thread for executing actions
« Reply #10 on: January 20, 2019, 03:19:52 am »
Quote from: GeekGoneOld
... changed <run> to <job> and changed the return true to return 4,nil.  Amazing how much code there is out there that has no return at all in either <run> or <job> tags of actions.  Even MCV code.  Lots of it.

AFAIK, this is not an issue.  As for scenes, a blank return from either run or job code should do the expected thing: exit successfully.

That's what I found in my testing whilst reverse engineering the Luup engine for openLuup.

Offline GeekGoneOld

Re: Context and thread for executing actions
« Reply #11 on: January 20, 2019, 12:53:59 pm »
Well, changing from <run> to <job> didn't help.  It still crashes when two jobs run back to back.  Surprised me.

So now I have it as <job> and I added back in os.execute("echo 3 > /proc/sys/vm/drop_caches"), which definitely helped.  I also added os.execute("sync") before drop_caches to free a bit more.  Oddly, neither of these seems to free much memory.  Maybe they just slow things down a bit and let things settle on the back to back jobs.

Essentially, I'm still believing that the two executions back to back are an issue as is lack of free memory. [edit] Keep in mind I'm on Vera 2!

Just out of curiosity, what is exit code 245 from Luup?  I suspect it simply reflects the signal 11 (segmentation fault - likely from too little free mem) noted immediately after.
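As a rough sketch of the arithmetic behind that guess: on Linux, a shell reports a process killed by signal N as exit code 128 + N, and a negative status -N squeezed into an unsigned byte shows up as 256 - N. Whether the LuaUPnP wrapper reports its codes either way is an assumption on my part; `guess_signal` is a hypothetical helper, not a Luup function.

```python
# Hypothetical helper, not part of Luup: guess which signal an exit code encodes.
SIGNAL_NAMES = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}

def guess_signal(exit_code):
    """Return (signal_number, name) if the code looks signal-related, else None."""
    candidates = []
    if 128 < exit_code < 256:
        candidates.append(exit_code - 128)  # shell convention: 128 + N
        candidates.append(256 - exit_code)  # -N wrapped into an unsigned byte
    for n in candidates:
        if n in SIGNAL_NAMES:
            return n, SIGNAL_NAMES[n]
    return None

print(guess_signal(137))  # (9, 'SIGKILL')
print(guess_signal(245))  # (11, 'SIGSEGV')
```

Under those assumptions, 245 lines up with signal 11, and the 137 reported earlier in the thread lines up with signal 9 (the process being killed rather than crashing by itself).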
« Last Edit: January 20, 2019, 12:56:09 pm by GeekGoneOld »

Offline rafale77

Re: Context and thread for executing actions
« Reply #12 on: January 20, 2019, 01:00:03 pm »
Most of my http calls to the vera come from openLuup, and akbooer suggested slowing down the rate of the calls by changing the openLuup code. With that change, it appears that my problems have been resolved for now.

See what I tried to do below:

+----------+  eth   +---------------+  uart  +--------------+  zwave  +---------------+
| openLuup | <----> | Vera Luup API | <----> | ZM5304 Zwave | <-----> | Zwave network |
+----------+        +---------------+        +--------------+         +---------------+

The Vera Luup to Zwave chip and the Zwave network itself are the bottlenecks and the vera should really be queuing its data flow to the Zwave chip like pretty much every controller I know, but it does not.
To have a queue you need memory. Sorin replied to my question on queueing saying that they believe the zwave SOC (ZM5304) does the queuing, but it does not have the memory to do it. So essentially the Luup engine locks up either because it is getting more http calls than it can store and queue, or because it is a passthrough that forwards every call it gets to the zwave chip through its serial API, which then locks up and causes the vera luup to reload.
The change from run to job was done in openLuup so that it prevents this problem upstream. The ethernet interface and openLuup itself are very significantly faster than the other links in the chain. It is a little like a funnel of bandwidth and processing power that narrows going to the right in the diagram. Something in the chain either needs to queue the calls so they get spaced out in time, or we need multiple veras and zwave networks so the calls can be spaced out in space.

It seems to me like you are trying to get the vera to queue the incoming calls, but it may be that either it is too slow (now that I know you have a vera 2) to process the calls, or that it is just passing them through to the zwave radio hoping it will queue, and it is the serial interface (likely USB in your case?) API which crashes. Did you consider slowing down the rate of the incoming calls?
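One possible way to slow the rate on the sending side, sketched in Python: each client enforces a minimum gap, plus optional random jitter, between its own calls. The `Throttle` class and its parameters are hypothetical, and this only spaces out calls from a single client; it does nothing to coordinate independent clients.

```python
import random
import time

class Throttle:
    """Enforce a minimum gap between calls made by this one client."""

    def __init__(self, min_gap, jitter=0.0, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap  # seconds that must pass between two sends
        self.jitter = jitter    # extra random delay, 0..jitter seconds
        self.clock = clock      # injectable so the class can be tested
        self.sleep = sleep
        self.last = None        # time of the previous send, if any

    def wait(self):
        """Sleep if the previous send was too recent; return seconds waited."""
        now = self.clock()
        delay = 0.0
        if self.last is not None:
            gap = self.min_gap + random.uniform(0.0, self.jitter)
            delay = max(0.0, self.last + gap - now)
        if delay > 0:
            self.sleep(delay)
        self.last = now + delay
        return delay
```

A client would call `throttle.wait()` immediately before each HTTP request. The jitter is there so that clients on identical timers are less likely to stay in lockstep.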
« Last Edit: January 20, 2019, 01:13:23 pm by rafale77 »

Offline GeekGoneOld

Re: Context and thread for executing actions
« Reply #13 on: January 20, 2019, 04:27:38 pm »
I'm not actually hitting the z-wave so that isn't an issue.  I'm actually only doing call_delay (to schedule a timeout check) and (maybe) some variable_sets.  Pretty benign!

It is also hard to slow down or schedule the HTTP calls.  Essentially, external devices (RPi) are monitoring for presence on bluetooth and when they see it, they call the action of the associated Vera device and repeat that periodically (every 30s) while still present.  I have multiple RPis monitoring throughout the home so I may be present in more than one, though the presence also includes strength so it is easy to see which room I'm in.  Pretty hard to co-ordinate these independent devices.  I wouldn't have thought I had to!

My next try might be to co-ordinate the processing in Vera.  If I use <run> to receive the info then call_delay to schedule the processing of same, I can ensure that I space out the processing.  This won't eliminate concurrent event calls but might minimise their effect.  Like jumping through hoops!!!
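A rough model of that idea, written in Python rather than Lua so it can be run anywhere: the action only records the incoming value, and a single timer handles the recorded values one at a time. The `schedule` argument is a stand-in for luup.call_delay (the real call takes a function name and a delay in seconds), and all names here are hypothetical.

```python
class DeferredProcessor:
    """Record values cheaply in the action; process them later, one per timer tick."""

    def __init__(self, schedule, process, spacing=1):
        self.schedule = schedule  # stand-in for luup.call_delay
        self.process = process    # the real work, done outside the action
        self.spacing = spacing    # seconds between processing steps
        self.pending = []
        self.timer_armed = False

    def on_action(self, value):
        """Cheap part: runs inside the action and only records the value."""
        self.pending.append(value)
        if not self.timer_armed:
            self.timer_armed = True
            self.schedule(self.drain_one, self.spacing)

    def drain_one(self):
        """Deferred part: handle one recorded value, re-arm if more remain."""
        self.timer_armed = False
        if self.pending:
            self.process(self.pending.pop(0))
        if self.pending:
            self.timer_armed = True
            self.schedule(self.drain_one, self.spacing)

# Fake scheduler for the demo: collect callbacks instead of waiting on a clock
timers = []
handled = []
proc = DeferredProcessor(lambda fn, delay: timers.append(fn), handled.append)
proc.on_action(1)
proc.on_action(0)  # second call arrives back to back
while timers:
    timers.pop(0)()
print(handled)  # [1, 0]
```

Only one timer is outstanding at any time, so two back to back calls cost two cheap appends, and the processing steps end up at least `spacing` seconds apart.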

@rafale77: glad yours is working.  @akbooer has helped many, many people along the way.  Don't forget to give him Karma.

Offline rafale77

Re: Context and thread for executing actions
« Reply #14 on: January 20, 2019, 04:47:32 pm »
I just ran a stress test and I can confirm that it passes with flying colors with my setup.

Thank you for explaining. It seems then that the design issue in the Luup engine is not new and is deep. Actually the fact that I am not really seeing much delay anymore in my scene actions tells me that zwave is not the bottleneck. It is the vera itself which is the problem. It may not be able to receive and process too many http commands within a short time. It is lacking a command queue buffer on both the incoming and the outgoing commands.

Quote from: GeekGoneOld
@rafale77: glad yours is working.  @akbooer has helped many, many people along the way.  Don't forget to give him Karma.

Thank you for reminding me. He actually got quite a few from me lately.  ;D