Jenkins Agent/Slave on Windows Disconnect Issue

Jenkins Agent/Slave on Windows Disconnect Issue

Vinod Krishna

Hi, 


We have around 10 Jenkins agents, each running on its own Windows 2016 EC2 instance. Java_slave runs as a service. The Jenkins master runs on a separate Amazon Linux instance. We can establish connectivity between the master and the agents, and jobs run fine.

However, for some reason, the service goes offline at varying intervals and then comes back online. This happens repeatedly, and the Windows Event Viewer shows little beyond the message "Jenkins Slave stopping", after which the service comes back online. We installed the New Relic APM agent on the server to check the Java metrics, and heap consumption is minimal. The agents and the server run the same Java version (jdk1.8.0_211). We have not been able to find the root cause of the service stopping abruptly, which kills the jobs running on the agents.


"windows agent was marked offline: Connection was broken: java.nio.channels.ClosedChannelException"


Thanks in advance. 

Vinod

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/35187bff-2f22-4eb8-8bcf-2e82161f71f5%40googlegroups.com.

Re: Jenkins Agent/Slave on Windows Disconnect Issue

D'raj
Try increasing the AWS ELB idle timeout; by default it's 60 seconds.
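
For readers wondering where this setting lives: it is a load balancer attribute on the AWS side. The following is a minimal sketch, assuming boto3 is installed and credentials are configured. The thread does not say which load balancer type is in use, so both the Application Load Balancer and the Classic Load Balancer request shapes are shown. The helper names, the ARN, and the load balancer name are hypothetical.

```python
from typing import Any, Dict


def alb_idle_timeout_request(load_balancer_arn: str, seconds: int) -> Dict[str, Any]:
    """Build kwargs for the elbv2 modify_load_balancer_attributes call (ALB)."""
    if seconds < 1:
        raise ValueError("idle timeout must be at least 1 second")
    # AWS enforces its own upper bound; it is not checked here.
    return {
        "LoadBalancerArn": load_balancer_arn,
        "Attributes": [
            {"Key": "idle_timeout.timeout_seconds", "Value": str(seconds)},
        ],
    }


def classic_idle_timeout_request(load_balancer_name: str, seconds: int) -> Dict[str, Any]:
    """Build kwargs for the elb modify_load_balancer_attributes call (Classic ELB)."""
    if seconds < 1:
        raise ValueError("idle timeout must be at least 1 second")
    return {
        "LoadBalancerName": load_balancer_name,
        "LoadBalancerAttributes": {"ConnectionSettings": {"IdleTimeout": seconds}},
    }


def apply_alb_idle_timeout(load_balancer_arn: str, seconds: int) -> None:
    """Send the ALB request. Needs boto3 and AWS credentials at call time."""
    import boto3  # third-party; imported lazily so the builders work without it

    client = boto3.client("elbv2")
    client.modify_load_balancer_attributes(
        **alb_idle_timeout_request(load_balancer_arn, seconds)
    )


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only; nothing is sent to AWS here.
    print(alb_idle_timeout_request("arn:aws:elasticloadbalancing:example", 300))
```

The same attribute can also be changed in the AWS console under the load balancer's attributes. The 300-second value is only an example, not a number taken from this thread.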


In reply to Vinod Krishna's post of Saturday, 30 May 2020 03:22:47 UTC+5:30.

Re: Jenkins Agent/Slave on Windows Disconnect Issue

monger_39
Have you looked at the remoting logs on the agent?
I've had (and still have) the same issue. Often I see an error like this in the remoting logs on the node:
   "Reader thread killed by OutOfMemoryError
  java.lang.OutOfMemoryError: unable to create new native thread
  "
which, by the way, does not necessarily mean "out of memory". It apparently can also indicate "unable to create new thread".
The exact reasons for the latter are still not 100% clear to me.
I'm very curious/anxious to have more info here too...

In reply to D'raj's post of Wednesday, June 3, 2020, 07:40:53 AM GMT+2.

Re: Jenkins Agent/Slave on Windows Disconnect Issue

Vinod Krishna
Thanks for the response!

I did check the remoting logs; all I see is the following:

Jun 04, 2020 1:57:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Jun 04, 2020 1:57:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to <myci.example.com>:50000
Jun 04, 2020 1:57:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Jun 04, 2020 1:57:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: 65:f3:2a:9c:fc:ec:55:9f:49:de:49:a0:bf:27:ff:93
Jun 04, 2020 1:57:28 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Jun 04, 2020 1:59:15 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated


No log entry says what triggers the termination of the service. However, it comes back online after some time.
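
One thing the log above does show is how long the connection survived. The following is a minimal sketch that pairs each "Connected" status with the next "Terminated" status and reports the lifetime in seconds. It assumes the two-line java.util.logging layout seen in the excerpt and English month and AM/PM names; the function names are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

# Timestamp layout used in the excerpt, e.g. "Jun 04, 2020 1:57:27 PM"
STAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"

SAMPLE = """\
Jun 04, 2020 1:57:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Jun 04, 2020 1:57:28 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Jun 04, 2020 1:59:15 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
"""


def parse_status_events(log_text: str) -> List[Tuple[datetime, str]]:
    """Return (timestamp, message) pairs from header + "INFO: ..." line pairs."""
    events: List[Tuple[datetime, str]] = []
    stamp = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("INFO: "):
            if stamp is not None:
                events.append((stamp, line[len("INFO: "):]))
            stamp = None
            continue
        # A header line starts with five space-separated timestamp tokens.
        try:
            stamp = datetime.strptime(" ".join(line.split(" ")[:5]), STAMP_FORMAT)
        except ValueError:
            stamp = None
    return events


def connection_durations(events: List[Tuple[datetime, str]]) -> List[float]:
    """Seconds between each "Connected" and the following "Terminated"."""
    durations: List[float] = []
    connected_at = None
    for stamp, message in events:
        if message == "Connected":
            connected_at = stamp
        elif message == "Terminated" and connected_at is not None:
            durations.append((stamp - connected_at).total_seconds())
            connected_at = None
    return durations


if __name__ == "__main__":
    print(connection_durations(parse_status_events(SAMPLE)))  # prints [107.0]
```

If the lifetimes collected over many disconnects cluster around a fixed value, that points toward a timeout somewhere on the network path rather than a problem inside the agent JVM.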

In reply to monger_39's post of Thursday, 4 June 2020 07:07:11 UTC-4.

Re: Jenkins Agent/Slave on Windows Disconnect Issue

Vinod Krishna
In reply to this post by D'raj
I wonder if that is going to help. The ELB idle timeout applies only to the connections between (1) the client and the ELB and (2) the ELB and the backend instance. In this case, only the Jenkins master is behind the ALB, and that connection is fine. The Windows agents mentioned here are not part of the ELB setup, but they can be considered clients connecting to the ELB. I can try increasing the timeout, though I am not sure it will help.
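
The reasoning above can be put as a small check: if the agents connect as clients through the load balancer, their connection is subject to the idle timeout, and it is at risk whenever it stays quiet for at least that long. This is a sketch under that assumption; the function name is hypothetical, and the 300-second quiet gap is an illustrative value, not a number taken from this thread.

```python
def dropped_by_idle_timeout(idle_timeout_s: float, longest_quiet_gap_s: float) -> bool:
    """Return True if a connection with the given longest quiet gap is at risk.

    A load balancer idle timeout closes a connection once no data has flowed
    for the timeout period. The comparison is deliberately conservative: a gap
    equal to the timeout is treated as at risk.
    """
    if idle_timeout_s <= 0 or longest_quiet_gap_s < 0:
        raise ValueError("timeout must be positive and gap must be non-negative")
    return longest_quiet_gap_s >= idle_timeout_s


if __name__ == "__main__":
    # Default 60 s timeout with a connection that can be quiet for 300 s.
    print(dropped_by_idle_timeout(60, 300))
    # Timeout raised above the longest quiet gap.
    print(dropped_by_idle_timeout(400, 300))
```

Raising the timeout above the longest quiet period, or making the connection send traffic more often than the timeout, are the two ways to make the check return False.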


Re: Jenkins Agent/Slave on Windows Disconnect Issue

Vinod Krishna
In reply to this post by D'raj
Hi,

It looks like increasing the ELB Timeout helped us! Thanks a lot!


Re: Jenkins Agent/Slave on Windows Disconnect Issue

slide
How did you modify this setting? In the Jenkins cloud configuration, or on AWS itself?

In reply to Vinod Krishna's post of Mon, Jun 8, 2020 at 2:55 PM.