Need help designing a fix for JENKINS-50504

me-3
Hi all,

I just filed JENKINS-50504 to describe a bug that I hit a few times a month. In short, when the master's connection to an SSH slave times out and a new connection is opened, jobs still keep running under the old Remoting channel, but their workspaces get handed out to new jobs (because the logic that checks for a workspace being in use doesn't take this case into account), and then both jobs clobber each other and fail.

I have written a detailed evaluation of the issue in the bug. The cause of the problem is that WorkspaceList#inUse is a Map<FilePath, Entry> and FilePath#equals checks that the channels are the same to consider a given FilePath equal to another one. In my case, the channel reference of the proposed workspace is a new channel (because the node reconnected), while the entry in inUse references the old channel (because the job is still running under the old channel). As a result, the workspace is not considered to be in use and is handed out to a new job.
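To make the failure mode concrete, here is a minimal sketch of the lookup miss. The `Channel` and `WsPath` classes are hypothetical stand-ins written for this example, not the real Remoting `Channel` or `hudson.FilePath`. The only behavior they are assumed to share with the real classes is the one described above: equality takes the channel into account.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleChannelDemo {
    /** Hypothetical stand-in for a Remoting channel; only its identity matters here. */
    static final class Channel {
        final String name;
        Channel(String name) { this.name = name; }
    }

    /** Simplified stand-in for FilePath: equality covers the channel AND the remote path. */
    static final class WsPath {
        final Channel channel;
        final String remote;
        WsPath(Channel channel, String remote) {
            this.channel = channel;
            this.remote = remote;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof WsPath)) return false;
            WsPath that = (WsPath) o;
            return this.channel == that.channel && this.remote.equals(that.remote);
        }
        @Override public int hashCode() {
            return 31 * System.identityHashCode(channel) + remote.hashCode();
        }
    }

    /** Returns whether the workspace looks "in use" when checked through the given channel. */
    static boolean looksInUse(Map<WsPath, String> inUse, Channel channel, String remote) {
        return inUse.containsKey(new WsPath(channel, remote));
    }

    public static void main(String[] args) {
        Map<WsPath, String> inUse = new HashMap<>();
        Channel oldChannel = new Channel("jenkins-node (before reconnect)");
        Channel newChannel = new Channel("jenkins-node (after reconnect)");
        String workspace = "/var/jenkins/workspace/job";

        // A running build registered its workspace under the old channel.
        inUse.put(new WsPath(oldChannel, workspace), "build #1");

        System.out.println(looksInUse(inUse, oldChannel, workspace)); // true
        // Same directory, new channel: the lookup misses, so the workspace looks free.
        System.out.println(looksInUse(inUse, newChannel, workspace)); // false
    }
}
```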

Since this bug impacts me at least twice a month and takes down a large percentage of my Jenkins jobs, I would like to try and contribute a fix. However, I need help designing a solution. I can think of two ugly solutions:

1. When a node reconnects due to an I/O error, update the entries in WorkspaceList#inUse to reference the new channel in the key to the map. This would fix the bug. However, it seems ugly to use the new channel in the "in use" map, because the job is still technically running under the old channel.

2. Maintain a list of all channels that a given node has ever had open (including channels that got closed due to timeout). Then, when checking for a workspace being in use, construct a proposed FilePath for each one of those channels, and fail if any of them has an entry in the "in use" map. This design concerns me because of the potential for this list of old channels to grow in size without bound.
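As a rough illustration of option 1, the re-keying step might look like the sketch below. Everything here is hypothetical: `Channel`, `WsPath`, and `rekey` are simplified stand-ins invented for the example, not Jenkins core APIs, and the sketch assumes the caller already holds whatever lock guards the map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RekeyOnReconnectDemo {
    /** Hypothetical stand-in for a Remoting channel. */
    static final class Channel {
        final String name;
        Channel(String name) { this.name = name; }
    }

    /** Simplified stand-in for FilePath, with channel-sensitive equality. */
    static final class WsPath {
        final Channel channel;
        final String remote;
        WsPath(Channel channel, String remote) {
            this.channel = channel;
            this.remote = remote;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof WsPath)) return false;
            WsPath that = (WsPath) o;
            return this.channel == that.channel && this.remote.equals(that.remote);
        }
        @Override public int hashCode() {
            return 31 * System.identityHashCode(channel) + remote.hashCode();
        }
    }

    /**
     * Moves every entry keyed under oldChannel to an equivalent key under newChannel.
     * Returns the number of entries moved. The caller is assumed to hold the map's lock.
     */
    static int rekey(Map<WsPath, String> inUse, Channel oldChannel, Channel newChannel) {
        // Collect first, then mutate, to avoid modifying the map while iterating over it.
        List<WsPath> stale = new ArrayList<>();
        for (WsPath p : inUse.keySet()) {
            if (p.channel == oldChannel) stale.add(p);
        }
        for (WsPath p : stale) {
            String owner = inUse.remove(p);
            inUse.put(new WsPath(newChannel, p.remote), owner);
        }
        return stale.size();
    }

    public static void main(String[] args) {
        Map<WsPath, String> inUse = new HashMap<>();
        Channel oldChannel = new Channel("old");
        Channel newChannel = new Channel("new");
        inUse.put(new WsPath(oldChannel, "/ws/job"), "build #1");

        int moved = rekey(inUse, oldChannel, newChannel);
        System.out.println(moved); // 1
        System.out.println(inUse.containsKey(new WsPath(newChannel, "/ws/job"))); // true
    }
}
```

Option 2 would instead leave the map alone and repeat the lookup once per remembered channel, which is where the unbounded-list concern comes from.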

Could someone with more familiarity with Jenkins core weigh in with a better way to solve this problem? If so, I could try to submit a pull request.

Thanks in advance,
Basil

--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/18fbc648-1294-4be5-bf2c-9da43cdc807c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Need help designing a fix for JENKINS-50504

Ivan Fernandez Calvo
The proposed workaround could cause concurrency issues. I think the most important thing is to find out why the agent is not disconnected and keeps the old connection. Did you check the open connections from the agent to the master with netstat? There should be two connections, the old one and a new one. Does the agent have more than one slave.jar process running? Are your agents VMs or bare metal? Did you tune your TCP stack with proper keepalive values?

Re: Need help designing a fix for JENKINS-50504

me-3
Hi Ivan,

Thanks for your reply. I'm not exactly sure how my proposed workaround would necessarily cause concurrency issues. Doesn't that depend on how it's implemented? I agree that it's strange that the agent wasn't disconnected and still keeps the old connection to the master, even though new jobs use a new connection. Doesn't this violate the invariant implied by the implementation of WorkspaceList#inUse, which is that the entries in the map always represent the latest channel for a given node? This definitely seems like a core bug to me. I don't believe I should need to tune my TCP stack, because pipeline claims to be resilient to network outages. If the master logs "SEVERE: I/O error in channel jenkins-node" and "INFO: Attempting to reconnect jenkins-node", then why do jobs continue running on the old connection, violating the invariant in WorkspaceList#inUse?

Thanks,
Basil

On Sunday, April 1, 2018 at 6:47:02 AM UTC-7, Ivan Fernandez Calvo wrote:
> The proposed workaround could cause concurrency issues. I think the most important thing is to find out why the agent is not disconnected and keeps the old connection. Did you check the open connections from the agent to the master with netstat? There should be two connections, the old one and a new one. Does the agent have more than one slave.jar process running? Are your agents VMs or bare metal? Did you tune your TCP stack with proper keepalive values?

Re: Need help designing a fix for JENKINS-50504

Oleg Nenashev
This issue seems to be Pipeline-specific (actually DurableTask-specific). Standard Freestyle jobs should abort immediately on agent disconnection, but Pipeline jobs may recover and continue using the workspace.

> However, it seems ugly to use the new channel in the "in use" map, because the job is still technically running under the old channel.

No, it should be running under the new channel. The old channel gets disposed, and Remoting 3.14+ adds some diagnostics for these cases (e.g. JENKINS-45294). This currently causes some issues in Durable Task, which does not always recreate the FilePath and the underlying workspace (JENKINS-41854 and other similar issues with "Channel is closing or closed").

WorkspaceList#inUse should certainly be reacquired by Pipeline when it reconnects to a new agent. I would guess it happens even now (or not?), but clearly there is a potential for race conditions between recovered jobs and new submissions.

The proposed patch may help, although workspace management is not really the strongest part of Jenkins core. I would rather suggest redesigning it so that workspaces can be tracked independently of the node state (the proposed change does the same for a single cache). Better UI and workspace release features could be added as a bonus.
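One way to picture tracking that is independent of the node state is to key the "in use" map on the node name plus the remote path rather than on anything tied to a live channel. The sketch below is only an illustration under that assumption; `WorkspaceKey` and `Tracker` are hypothetical names made up for this example and do not exist in Jenkins core.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class NodeKeyedWorkspacesDemo {
    /** Hypothetical key: identifies a workspace by node name and path, not by channel. */
    static final class WorkspaceKey {
        final String nodeName;
        final String remote;
        WorkspaceKey(String nodeName, String remote) {
            this.nodeName = nodeName;
            this.remote = remote;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof WorkspaceKey)) return false;
            WorkspaceKey that = (WorkspaceKey) o;
            return nodeName.equals(that.nodeName) && remote.equals(that.remote);
        }
        @Override public int hashCode() {
            return Objects.hash(nodeName, remote);
        }
    }

    /** Hypothetical tracker whose entries survive channel reconnects. */
    static final class Tracker {
        private final Map<WorkspaceKey, String> inUse = new HashMap<>();

        /** Returns true if the workspace was free and is now owned by the given owner. */
        synchronized boolean tryAcquire(String nodeName, String remote, String owner) {
            return inUse.putIfAbsent(new WorkspaceKey(nodeName, remote), owner) == null;
        }

        /** Frees the workspace, if it was held. */
        synchronized void release(String nodeName, String remote) {
            inUse.remove(new WorkspaceKey(nodeName, remote));
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        System.out.println(tracker.tryAcquire("jenkins-node", "/ws/job", "build #1")); // true
        // The node reconnects here; no channel is part of the key, so nothing changes.
        System.out.println(tracker.tryAcquire("jenkins-node", "/ws/job", "build #2")); // false
        tracker.release("jenkins-node", "/ws/job");
        System.out.println(tracker.tryAcquire("jenkins-node", "/ws/job", "build #2")); // true
    }
}
```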

BR, Oleg
 

On Monday, April 2, 2018 at 10:08:28 PM UTC+2, [hidden email] wrote:
> Hi Ivan,
>
> Thanks for your reply. I'm not exactly sure how my proposed workaround would necessarily cause concurrency issues. Doesn't that depend on how it's implemented? I agree that it's strange that the agent wasn't disconnected and still keeps the old connection to the master, even though new jobs use a new connection. Doesn't this violate the invariant implied by the implementation of WorkspaceList#inUse, which is that the entries in the map always represent the latest channel for a given node? This definitely seems like a core bug to me. I don't believe I should need to tune my TCP stack, because pipeline claims to be resilient to network outages. If the master logs "SEVERE: I/O error in channel jenkins-node" and "INFO: Attempting to reconnect jenkins-node", then why do jobs continue running on the old connection, violating the invariant in WorkspaceList#inUse?
>
> Thanks,
> Basil

Re: Need help designing a fix for JENKINS-50504

Jesse Glick-4
On Wed, Apr 4, 2018 at 4:26 AM, Oleg Nenashev <[hidden email]> wrote:
> WorkspaceList#inUse should be reacquired by Pipeline for sure when it
> reconnects to a new agent. I would guess it happens even now (or not?)

No, currently a lock is acquired only when a `node` (or `ws`) body is
started. I made a note in JENKINS-41854 about this.

Re: Need help designing a fix for JENKINS-50504

me-3
In reply to this post by Oleg Nenashev
Thanks for pointing out JENKINS-45294. That is exactly what I am facing, at least twice a month. It causes severe disruption to my users, so I need to come up with a plan. I see that the bug is unassigned. If it isn't fixed soon, I might have to try to fix it myself by necessity. I suppose the best way to start would be by writing a test case that triggers the issue. Does the JenkinsRule test harness provide any functionality for setting up this kind of scenario? I see there are some existing tests that restart Jenkins, but I'm not sure how to write an automated test that makes a node disconnect and reconnect in the manner described in the bug. Any advice or pointers to existing code or tests would be appreciated.

Re: Need help designing a fix for JENKINS-50504

me-3
I meant "Thanks for pointing out JENKINS-41854" below.

On Tuesday, April 10, 2018 at 5:42:20 PM UTC-7, [hidden email] wrote:
> Thanks for pointing out JENKINS-45294. That is exactly what I am facing, at least twice a month. It causes severe disruption to my users, so I need to come up with a plan. I see that the bug is unassigned. If it isn't fixed soon, I might have to try to fix it myself by necessity. I suppose the best way to start would be by writing a test case that triggers the issue. Does the JenkinsRule test harness provide any functionality for setting up this kind of scenario? I see there are some existing tests that restart Jenkins, but I'm not sure how to write an automated test that makes a node disconnect and reconnect in the manner described in the bug. Any advice or pointers to existing code or tests would be appreciated.

Re: Need help designing a fix for JENKINS-50504

Jesse Glick-4
There are some tests in `workflow-durable-task-step` which simulate broken connections as well as restarts, so if the issue is indeed reliably reproducible, you could probably do it that way.

A test case would certainly be a valuable contribution. I doubt there is a straightforward, localized fix—my proposed approach involves adding new APIs in core Pipeline code that would involve somewhat subtle changes to multiple plugins and an understanding of serialization semantics including pickles.
