Missing components when running more than one node 3 Months ago
Hello,
One of my subsystems has 2 nodes.
Node A runs NM with communicator.
Node B runs only NM.
Sometimes Node A can't see all of Node B's components (different components are missing on each run).
This never happened when each subsystem had only one node.
My temporary workaround is to reset the whole subsystem (sometimes several times).
What could be causing this, and how can I resolve it?
Thanks,
Mayan.
Cohen
OpenJAUS Contributor
Posts: 13
Re: Missing components when running more than one node 3 Months ago
This has to do with the way that node-to-node component discovery is implemented in the node manager. The most likely cause is that you are resetting node managers and components faster than they are able to time out in the remaining node manager's system tree. Be sure to bring up both node managers first, and do not restart your components very quickly. Wait at least 5 seconds after a component is shut down before restarting it.
If you still have significant problems, other solutions exist, but they require significant changes to the node manager and component code.
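To make the restart advice above concrete, here is a minimal Python sketch of a helper that enforces a minimum delay between a component's shutdown and its restart. It is not OpenJAUS code: `start_component` is a hypothetical launcher you would supply, and the 5-second value is simply the figure recommended above.

```python
import time

# Minimum wait taken from the advice above; not read from any OpenJAUS config.
RESTART_DELAY_S = 5.0


def seconds_until_safe_restart(shutdown_time, now, delay=RESTART_DELAY_S):
    """Return how much longer to wait before restarting a component."""
    return max(0.0, delay - (now - shutdown_time))


def restart_when_safe(start_component, shutdown_time, delay=RESTART_DELAY_S,
                      clock=time.monotonic, sleep=time.sleep):
    """Sleep out the remaining delay, then call the hypothetical launcher."""
    remaining = seconds_until_safe_restart(shutdown_time, clock(), delay)
    if remaining > 0:
        sleep(remaining)
    return start_component()
```

Record `time.monotonic()` when the component goes down and pass it as `shutdown_time`; the `clock` and `sleep` parameters exist only so the helper can be exercised without real waiting.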
Re: Missing components when running more than one node 2 Months, 3 Weeks ago
For the last few days I did as you suggested, and it works.
Unfortunately, this solution is not good enough for me because:
1. The nodes are independent, and it isn't right for the components of one node to wait until the second node comes up (the components of node A should work and send information out even if node B is down).
2. The nodes crash frequently, and I can't wait 5 seconds every time that happens.
What's the alternative?
Thanks,
Mayan.
Cohen
OpenJAUS Contributor
Posts: 13
Re: Missing components when running more than one node 2 Months, 3 Weeks ago
I think the root of your issue is the second point you made above. Figure out what is causing your components to crash, and you will no longer have this thrashing issue where components check into the node manager very often. The system is designed to assume a certain amount of stability while still being usable in a lossy environment, so we added a buffer to the time it takes for a component to time out of the system in case of intermittent loss of communications. Once your components run reliably, I think you will see better performance from the system as a whole.
There's 10 types of people in the world; those that understand binary and those that don't.
Re: Missing components when running more than one node 2 Months, 3 Weeks ago
Mayan,
The nodes can still be started independently. Starting both node managers first is just helpful for ensuring the discovery process works in a controlled way. You should be able to start the components of node A even if node B is down.
Danny is correct about the crashing problem. It is a definite source of confusion for the node managers and the discovery sequence. If components crash, the only solution is to let them time out. One alternative might be to decrease the timeout period in the code, but that can be risky.
However, if the node manager itself is crashing, then that is certainly a problem we should fix. If that is the case, please let us know, and tell us what OS you are working in (Windows, Linux, etc.) so we can figure out how to get debug information from you.
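On why decreasing the timeout can be risky: the small Python sketch below checks whether a component would be dropped given the times at which its heartbeats actually arrived. It is a toy model, not the node manager's logic, and the heartbeat times and timeout values are illustrative assumptions.

```python
def expired_during(heartbeat_times, timeout, end_time):
    """Return True if any gap between received heartbeats exceeds `timeout`.

    heartbeat_times: sorted arrival times (seconds) of heartbeats that got through.
    end_time: time at which observation stops.
    """
    last = heartbeat_times[0]
    for t in list(heartbeat_times[1:]) + [end_time]:
        if t - last > timeout:
            return True
        last = t
    return False


# A healthy component on a lossy link: some heartbeats between 2.0 s and 4.5 s were lost.
arrivals = [0.0, 1.0, 2.0, 4.5, 5.5]
print(expired_during(arrivals, timeout=3.0, end_time=6.0))  # False
print(expired_during(arrivals, timeout=2.0, end_time=6.0))  # True
```

With the longer timeout the 2.5-second gap is tolerated; with the shorter one the same running component would be removed from the system tree even though it never crashed.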
Tom
Re: Missing components when running more than one node 2 Months, 3 Weeks ago
It's true that my components crash from time to time, but I didn't change the original NM timeout (3 sec), so I believe the node has enough time to remove them before they check in again. Anyway, I always wait for the timeout because I want all my components to have instance id 1.
As for the node crashing, I don't think it's related only to the components (sometimes it crashes when it's the only running component). I'm using Windows.
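The interaction between the timeout and instance ids described above can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: it assumes the node manager hands out the lowest free instance id and keeps a crashed component's entry until the timeout elapses. `ToySystemTree` and the component id 38 are hypothetical and do not come from the OpenJAUS code.

```python
class ToySystemTree:
    """Toy model of a timeout-based component registry (not OpenJAUS code)."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout   # seconds; 3 s is the NM default quoted above
        self._last_seen = {}     # (component_id, instance_id) -> last check-in time

    def prune(self, now):
        """Drop entries that have not been heard from within the timeout."""
        stale = [key for key, seen in self._last_seen.items()
                 if now - seen > self.timeout]
        for key in stale:
            del self._last_seen[key]

    def check_in(self, component_id, now):
        """Register a component and return the lowest free instance id (assumed policy)."""
        self.prune(now)
        instance = 1
        while (component_id, instance) in self._last_seen:
            instance += 1
        self._last_seen[(component_id, instance)] = now
        return instance


tree = ToySystemTree(timeout=3.0)
print(tree.check_in(38, now=0.0))  # 1
# The component crashes and restarts 1 s later, before the old entry has expired.
print(tree.check_in(38, now=1.0))  # 2
```

In this model, waiting longer than the timeout before restarting lets the stale entry be pruned first, so the restarted component gets instance id 1 again.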
Cohen
OpenJAUS Contributor
Posts: 13