remoting issue

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

remoting issue

Glenn Burkhardt
We've been seeing job termination due to slave disconnects, and would like help in solving this issue.

Jenkins 2.140
ssh-slaves-plugin 1.29.4
Ubuntu 16.04.5
Hypervisor, plugin Libvirt 1.8.6
Slave guest:  Ubuntu 18.04.1

Jenkins and the VMs are all running on the same machine, so network activity shouldn't be an issue.

I've been looking at the wiki note here: https://wiki.jenkins.io/display/JENKINS/Remoting+issue

and the anomaly I've noticed is repeated in the slave.log file created by Jenkins (SocketTimeoutException):

Feb 06, 2019 8:37:58 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/remoting as a remoting work directory
Both error and output logs will be printed to /home/jenkins/remoting
<===[JENKINS REMOTING CAPACITY]===>channel started
Remoting version: 3.25
This is a Unix agent
Evacuated stdout
Agent successfully connected and online
ERROR: Connection terminated
java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
Caused: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
ERROR: Socket connection to SSH server was lost
java.net.SocketTimeoutException: The connect timeout expired
    at com.trilead.ssh2.Connection$1.run(Connection.java:762)
    at com.trilead.ssh2.util.TimeoutService$TimeoutThread.run(TimeoutService.java:91)
Slave JVM has not reported exit code before the socket was lost
[02/06/19 08:41:05] [SSH] Connection closed.

The remoting log on the slave has

Feb 06, 2019 8:41:05 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel channel
java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2671)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3146)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:858)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:354)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)

and the /var/log/jenkins/jenkins.log has:

Feb 06, 2019 8:41:05 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel ubuntu-122-2
java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)

Feb 06, 2019 8:41:12 AM hudson.slaves.RetentionStrategy$Demand check
INFO: Disconnecting computer ubuntu-122-1 as it has been idle for 1 min 2 sec
Feb 06, 2019 8:41:12 AM hudson.plugins.libvirt.VirtualMachineSlaveComputer disconnect
INFO: Virtual machine "ubuntu-20170406-122-1" (slave "ubuntu-122-1") is to be shut down.reason: Offline because computer was idle; it will be relaunched when needed. (hudson.slaves.OfflineCause$IdleOfflineCause)

The syslog file on the slave doesn't indicate any anomalies between the startup and termination initiated by Jenkins:

Feb  6 08:39:31 VirtualBox systemd[1]: Starting Clean php session files...
Feb  6 08:39:31 VirtualBox systemd[1]: Started Clean php session files.
Feb  6 08:41:43 VirtualBox systemd[1]: Stopping User Manager for UID 998...
Feb  6 08:41:43 VirtualBox systemd[1294]: Stopped target Default.

Thanks.

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/79b107de-2331-4f79-af8c-40c9e6c51d39%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: remoting issue

Ivan Fernandez Calvo
>Jenkins and the VMs are all running on the same machine, so network activity shouldn't be an issue.

network is not an issue, but performance response, Could it be? it is not a good idea to run the Jenkins Master and Agents in the same machine, if you use Docker container and you do not limit the resources used by each container (memory and CPU), they will fight for resources and some of them will be killed by docker with a nice OOM.

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/424e7cf6-4788-47d5-8e2d-5c4fa3f72418%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: remoting issue

Glenn Burkhardt
My reading of the code indicates that the timeout value is set by "kexTimeout" in com\trilead\ssh2\Connection.java at line 693.  That appears to be set in SSHLauncher.openConnection():1184.  The value we're using for 'launchTimeoutMillis' should be 15000, assuming that it comes from "Startup Idle" in the slave configuration (we enter '15', since the help says that the units are seconds).  The machine we're using has 64gb of RAM, and 24 cores.  It's possible that it's a performance issue, but I doubt it.  I'll try monitoring that a bit...

Thanks for your response.

On Wednesday, February 6, 2019 at 11:14:53 AM UTC-5, Ivan Fernandez Calvo wrote:
>Jenkins and the VMs are all running on the same machine, so network activity shouldn't be an issue.

network is not an issue, but performance response, Could it be? it is not a good idea to run the Jenkins Master and Agents in the same machine, if you use Docker container and you do not limit the resources used by each container (memory and CPU), they will fight for resources and some of them will be killed by docker with a nice OOM.

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/21922cc3-652d-4fee-b6d6-b90fef7a4bba%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: remoting issue

Ivan Fernandez Calvo
This timeout, it is only for the connection stage, and it includes whole retry  reconnections, long history, the default value is 210 seconds, less than 30-60 seconds it is not a good value and only if you have reties to 0. 

I do not know how many agents you spin at the same time, I would try to find which number is the limit, I meant, If I spin 50 agent and is stable for some hours, I would increase de number until I find the limit

Un Saludo 
Ivan Fernandez Calvo

El 6 feb 2019, a las 19:01, Glenn Burkhardt <[hidden email]> escribió:

My reading of the code indicates that the timeout value is set by "kexTimeout" in com\trilead\ssh2\Connection.java at line 693.  That appears to be set in SSHLauncher.openConnection():1184.  The value we're using for 'launchTimeoutMillis' should be 15000, assuming that it comes from "Startup Idle" in the slave configuration (we enter '15', since the help says that the units are seconds).  The machine we're using has 64gb of RAM, and 24 cores.  It's possible that it's a performance issue, but I doubt it.  I'll try monitoring that a bit...

Thanks for your response.

On Wednesday, February 6, 2019 at 11:14:53 AM UTC-5, Ivan Fernandez Calvo wrote:
>Jenkins and the VMs are all running on the same machine, so network activity shouldn't be an issue.

network is not an issue, but performance response, Could it be? it is not a good idea to run the Jenkins Master and Agents in the same machine, if you use Docker container and you do not limit the resources used by each container (memory and CPU), they will fight for resources and some of them will be killed by docker with a nice OOM.

--
You received this message because you are subscribed to a topic in the Google Groups "Jenkins Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/jenkinsci-users/VbdIfb4ua9A/unsubscribe.
To unsubscribe from this group and all its topics, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/21922cc3-652d-4fee-b6d6-b90fef7a4bba%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/E0FF82A5-C542-42C1-AA02-75FA9A30DB5B%40gmail.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: remoting issue

Glenn Burkhardt
I see now that the value for "kexTimeout" should be the 210 value you reference.  BTW, we are using only 5 agents.

I found reports of a similar problem with connection timeouts here:


and as an experiment, followed the recommendation of increasing the connection time to an absurdly high value, viz., 6000.  The connection timeouts stopped, but we're still seeing communictions problems.  It looks like it's originating on the slaves.  The remoting log contains:

Feb 11, 2019 8:15:03 AM hudson.remoting.ProxyOutputStream$Chunk$1 run
WARNING: Failed to ack the stream
java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:313)
    at hudson.remoting.StandardOutputStream.write(StandardOutputStream.java:83)
    at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:89)
    at hudson.remoting.ChunkedOutputStream.sendBreak(ChunkedOutputStream.java:62)
    at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:46)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:47)
    at hudson.remoting.Channel.send(Channel.java:721)
    at hudson.remoting.ProxyOutputStream$Chunk$1.run(ProxyOutputStream.java:270)
    at hudson.remoting.PipeWriter$1.run(PipeWriter.java:158)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Feb 11, 2019 8:15:03 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel channel
java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2671)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3146)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:858)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:354)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)



On Wednesday, February 6, 2019 at 1:43:24 PM UTC-5, Ivan Fernandez Calvo wrote:
This timeout, it is only for the connection stage, and it includes whole retry  reconnections, long history, the default value is 210 seconds, less than 30-60 seconds it is not a good value and only if you have reties to 0. 

I do not know how many agents you spin at the same time, I would try to find which number is the limit, I meant, If I spin 50 agent and is stable for some hours, I would increase de number until I find the limit

Un Saludo 
Ivan Fernandez Calvo

El 6 feb 2019, a las 19:01, Glenn Burkhardt <<a href="javascript:" target="_blank" gdf-obfuscated-mailto="KKiyiVN3GgAJ" rel="nofollow" onmousedown="this.href=&#39;javascript:&#39;;return true;" onclick="this.href=&#39;javascript:&#39;;return true;">gbbur...@...> escribió:

My reading of the code indicates that the timeout value is set by "kexTimeout" in com\trilead\ssh2\Connection.java at line 693.  That appears to be set in SSHLauncher.openConnection():1184.  The value we're using for 'launchTimeoutMillis' should be 15000, assuming that it comes from "Startup Idle" in the slave configuration (we enter '15', since the help says that the units are seconds).  The machine we're using has 64gb of RAM, and 24 cores.  It's possible that it's a performance issue, but I doubt it.  I'll try monitoring that a bit...

Thanks for your response.

On Wednesday, February 6, 2019 at 11:14:53 AM UTC-5, Ivan Fernandez Calvo wrote:
>Jenkins and the VMs are all running on the same machine, so network activity shouldn't be an issue.

network is not an issue, but performance response, Could it be? it is not a good idea to run the Jenkins Master and Agents in the same machine, if you use Docker container and you do not limit the resources used by each container (memory and CPU), they will fight for resources and some of them will be killed by docker with a nice OOM.

--
You received this message because you are subscribed to a topic in the Google Groups "Jenkins Users" group.
To unsubscribe from this topic, visit <a href="https://groups.google.com/d/topic/jenkinsci-users/VbdIfb4ua9A/unsubscribe" target="_blank" rel="nofollow" onmousedown="this.href=&#39;https://groups.google.com/d/topic/jenkinsci-users/VbdIfb4ua9A/unsubscribe&#39;;return true;" onclick="this.href=&#39;https://groups.google.com/d/topic/jenkinsci-users/VbdIfb4ua9A/unsubscribe&#39;;return true;">https://groups.google.com/d/topic/jenkinsci-users/VbdIfb4ua9A/unsubscribe.
To unsubscribe from this group and all its topics, send an email to <a href="javascript:" target="_blank" gdf-obfuscated-mailto="KKiyiVN3GgAJ" rel="nofollow" onmousedown="this.href=&#39;javascript:&#39;;return true;" onclick="this.href=&#39;javascript:&#39;;return true;">jenkinsci-use...@googlegroups.com.
To view this discussion on the web visit <a href="https://groups.google.com/d/msgid/jenkinsci-users/21922cc3-652d-4fee-b6d6-b90fef7a4bba%40googlegroups.com?utm_medium=email&amp;utm_source=footer" target="_blank" rel="nofollow" onmousedown="this.href=&#39;https://groups.google.com/d/msgid/jenkinsci-users/21922cc3-652d-4fee-b6d6-b90fef7a4bba%40googlegroups.com?utm_medium\x3demail\x26utm_source\x3dfooter&#39;;return true;" onclick="this.href=&#39;https://groups.google.com/d/msgid/jenkinsci-users/21922cc3-652d-4fee-b6d6-b90fef7a4bba%40googlegroups.com?utm_medium\x3demail\x26utm_source\x3dfooter&#39;;return true;">https://groups.google.com/d/msgid/jenkinsci-users/21922cc3-652d-4fee-b6d6-b90fef7a4bba%40googlegroups.com.
For more options, visit <a href="https://groups.google.com/d/optout" target="_blank" rel="nofollow" onmousedown="this.href=&#39;https://groups.google.com/d/optout&#39;;return true;" onclick="this.href=&#39;https://groups.google.com/d/optout&#39;;return true;">https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/cbb19c5d-5031-48b9-9e5c-193810e922d3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: remoting issue

Ivan Fernandez Calvo
In reply to this post by Glenn Burkhardt
It could be tons of things, Probably, if you go to the node log (JENKINS_URL/computer/NODE_NAME/log) we could see if the agent connect or not, then if the communication is broken before/after connection and in which point.

El miércoles, 6 de febrero de 2019, 16:39:17 (UTC+1), Glenn Burkhardt escribió:
We've been seeing job termination due to slave disconnects, and would like help in solving this issue.

Jenkins 2.140
ssh-slaves-plugin 1.29.4
Ubuntu 16.04.5
Hypervisor, plugin Libvirt 1.8.6
Slave guest:  Ubuntu 18.04.1

Jenkins and the VMs are all running on the same machine, so network activity shouldn't be an issue.

I've been looking at the wiki note here: <a href="https://wiki.jenkins.io/display/JENKINS/Remoting+issue" target="_blank" rel="nofollow" onmousedown="this.href=&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fwiki.jenkins.io%2Fdisplay%2FJENKINS%2FRemoting%2Bissue\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNH64po7FABb6HumJokoRmFhTcfuGA&#39;;return true;" onclick="this.href=&#39;https://www.google.com/url?q\x3dhttps%3A%2F%2Fwiki.jenkins.io%2Fdisplay%2FJENKINS%2FRemoting%2Bissue\x26sa\x3dD\x26sntz\x3d1\x26usg\x3dAFQjCNH64po7FABb6HumJokoRmFhTcfuGA&#39;;return true;">https://wiki.jenkins.io/display/JENKINS/Remoting+issue

and the anomaly I've noticed is repeated in the slave.log file created by Jenkins (SocketTimeoutException):

Feb 06, 2019 8:37:58 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/remoting as a remoting work directory
Both error and output logs will be printed to /home/jenkins/remoting
<===[JENKINS REMOTING CAPACITY]===>channel started
Remoting version: 3.25
This is a Unix agent
Evacuated stdout
Agent successfully connected and online
ERROR: Connection terminated
java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
Caused: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
ERROR: Socket connection to SSH server was lost
java.net.SocketTimeoutException: The connect timeout expired
    at com.trilead.ssh2.Connection$1.run(Connection.java:762)
    at com.trilead.ssh2.util.TimeoutService$TimeoutThread.run(TimeoutService.java:91)
Slave JVM has not reported exit code before the socket was lost
[02/06/19 08:41:05] [SSH] Connection closed.

The remoting log on the slave has

Feb 06, 2019 8:41:05 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel channel
java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2671)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3146)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:858)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:354)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)

and the /var/log/jenkins/jenkins.log has:

Feb 06, 2019 8:41:05 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel ubuntu-122-2
java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)

Feb 06, 2019 8:41:12 AM hudson.slaves.RetentionStrategy$Demand check
INFO: Disconnecting computer ubuntu-122-1 as it has been idle for 1 min 2 sec
Feb 06, 2019 8:41:12 AM hudson.plugins.libvirt.VirtualMachineSlaveComputer disconnect
INFO: Virtual machine "ubuntu-20170406-122-1" (slave "ubuntu-122-1") is to be shut down.reason: Offline because computer was idle; it will be relaunched when needed. (hudson.slaves.OfflineCause$IdleOfflineCause)

The syslog file on the slave doesn't indicate any anomalies between the startup and termination initiated by Jenkins:

Feb  6 08:39:31 VirtualBox systemd[1]: Starting Clean php session files...
Feb  6 08:39:31 VirtualBox systemd[1]: Started Clean php session files.
Feb  6 08:41:43 VirtualBox systemd[1]: Stopping User Manager for UID 998...
Feb  6 08:41:43 VirtualBox systemd[1294]: Stopped target Default.

Thanks.

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/0ffc8193-e585-4821-9aec-3b4eec70e629%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: remoting issue

niristotle okram
This thing caught my eye:

Feb 11, 2019 8:15:03 AM hudson.remoting.ProxyOutputStream$Chunk$1 run
WARNING: Failed to ack the stream 


That indicates a TCP issue over there. Try this 


[root@ip-172-31-2-59 ec2-user]# ethtool -K eth0 sg off
Actual changes:
scatter-gather: off
    tx-scatter-gather: off
tcp-segmentation-offload: off
    tx-tcp-segmentation: off [requested on]
generic-segmentation-offload: off [requested on]




 

On Mon, Feb 11, 2019 at 9:16 AM Ivan Fernandez Calvo <[hidden email]> wrote:
It could be tons of things, Probably, if you go to the node log (JENKINS_URL/computer/NODE_NAME/log) we could see if the agent connect or not, then if the communication is broken before/after connection and in which point.

El miércoles, 6 de febrero de 2019, 16:39:17 (UTC+1), Glenn Burkhardt escribió:
We've been seeing job termination due to slave disconnects, and would like help in solving this issue.

Jenkins 2.140
ssh-slaves-plugin 1.29.4
Ubuntu 16.04.5
Hypervisor, plugin Libvirt 1.8.6
Slave guest:  Ubuntu 18.04.1

Jenkins and the VMs are all running on the same machine, so network activity shouldn't be an issue.

I've been looking at the wiki note here: https://wiki.jenkins.io/display/JENKINS/Remoting+issue

and the anomaly I've noticed is repeated in the slave.log file created by Jenkins (SocketTimeoutException):

Feb 06, 2019 8:37:58 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/remoting as a remoting work directory
Both error and output logs will be printed to /home/jenkins/remoting
<===[JENKINS REMOTING CAPACITY]===>channel started
Remoting version: 3.25
This is a Unix agent
Evacuated stdout
Agent successfully connected and online
ERROR: Connection terminated
java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
Caused: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
ERROR: Socket connection to SSH server was lost
java.net.SocketTimeoutException: The connect timeout expired
    at com.trilead.ssh2.Connection$1.run(Connection.java:762)
    at com.trilead.ssh2.util.TimeoutService$TimeoutThread.run(TimeoutService.java:91)
Slave JVM has not reported exit code before the socket was lost
[02/06/19 08:41:05] [SSH] Connection closed.

The remoting log on the slave has

Feb 06, 2019 8:41:05 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel channel
java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2671)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3146)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:858)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:354)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)

and the /var/log/jenkins/jenkins.log has:

Feb 06, 2019 8:41:05 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel ubuntu-122-2
java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)

Feb 06, 2019 8:41:12 AM hudson.slaves.RetentionStrategy$Demand check
INFO: Disconnecting computer ubuntu-122-1 as it has been idle for 1 min 2 sec
Feb 06, 2019 8:41:12 AM hudson.plugins.libvirt.VirtualMachineSlaveComputer disconnect
INFO: Virtual machine "ubuntu-20170406-122-1" (slave "ubuntu-122-1") is to be shut down.reason: Offline because computer was idle; it will be relaunched when needed. (hudson.slaves.OfflineCause$IdleOfflineCause)

The syslog file on the slave doesn't indicate any anomalies between the startup and termination initiated by Jenkins:

Feb  6 08:39:31 VirtualBox systemd[1]: Starting Clean php session files...
Feb  6 08:39:31 VirtualBox systemd[1]: Started Clean php session files.
Feb  6 08:41:43 VirtualBox systemd[1]: Stopping User Manager for UID 998...
Feb  6 08:41:43 VirtualBox systemd[1294]: Stopped target Default.

Thanks.

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/0ffc8193-e585-4821-9aec-3b4eec70e629%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Regards
nirish okram

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/CAPzcO4ibDezMCL9T9AwcRc7_DRyqyKciDwUnYX1k95tPZqFyuQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: remoting issue

Glenn Burkhardt
In reply to this post by Ivan Fernandez Calvo
I've attached the two logs.  I did make a change after seeing 'Failed to ack the stream' to turn off scatter/gather.  But I think that's a bit of a "hail mary".

On Monday, February 11, 2019 at 10:16:05 AM UTC-5, Ivan Fernandez Calvo wrote:
It could be tons of things, Probably, if you go to the node log (JENKINS_URL/computer/NODE_NAME/log) we could see if the agent connect or not, then if the communication is broken before/after connection and in which point.


--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/733c4024-b246-4dfb-b50c-3a7c85641e1b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

slave.log (6K) Download Attachment
remoting-122-2.log (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: remoting issue

niristotle okram
Okay going by the logs now, here are some of the possible things that you can look into:

1. jdk versions consistent across the master and slaves
4. "Jenkins 2.58+ only support java8 or higher, this includes your agents JVM." so you can try jdk7 instead of 8

i will bet on 1 and 4. All the best!





On Mon, Feb 11, 2019 at 2:52 PM Glenn Burkhardt <[hidden email]> wrote:
I've attached the two logs.  I did make a change after seeing 'Failed to ack the stream' to turn off scatter/gather.  But I think that's a bit of a "hail mary".

On Monday, February 11, 2019 at 10:16:05 AM UTC-5, Ivan Fernandez Calvo wrote:
It could be tons of things, Probably, if you go to the node log (JENKINS_URL/computer/NODE_NAME/log) we could see if the agent connect or not, then if the communication is broken before/after connection and in which point.


--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/733c4024-b246-4dfb-b50c-3a7c85641e1b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Regards
nirish okram

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/CAPzcO4iiX8iOZviNzXfynLDjkyt7pCUQ8Z7G_8kmg%3DaCS-2%3DTw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: remoting issue

Ivan Fernandez Calvo
I think you can discard JVM versions issues and connectivity issues, because the agent connects and copy the remoting.jar, How much time takes between the agent connect and the channel is broken? Check you do not have any configuration  that kill idle connections, also try to assign memory to the remoting process (-Xmx256m -Xms256m), check those files dmesg, kernel.log, syslog, and messages of the agent to see if there is some error.

```
[02/11/19 08:14:54] [SSH] Checking java version of /usr/lib/jvm/jdk1.8.0_161/bin/java
[02/11/19 08:14:55] [SSH] /usr/lib/jvm/jdk1.8.0_161/bin/java -version returned 1.8.0_161.
[02/11/19 08:14:55] [SSH] Starting sftp client.
[02/11/19 08:14:55] [SSH] Copying latest remoting.jar...
[02/11/19 08:14:55] [SSH] Copied 776,265 bytes.
Expanded the channel window size to 4MB
[02/11/19 08:14:55] [SSH] Starting agent process: cd "/home/jenkins" && /usr/lib/jvm/jdk1.8.0_161/bin/java  -jar remoting.jar -workDir /home/jenkins
Feb 11, 2019 8:14:55 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/remoting as a remoting work directory
Both error and output logs will be printed to /home/jenkins/remoting
<===[JENKINS REMOTING CAPACITY]===>   channel started
Remoting version: 3.25
This is a Unix agent
Evacuated stdout
Agent successfully connected and online
```

El mar., 12 feb. 2019 a las 5:45, niristotle okram (<[hidden email]>) escribió:
Okay going by the logs now, here are some of the possible things that you can look into:

1. jdk versions consistent across the master and slaves
4. "Jenkins 2.58+ only support java8 or higher, this includes your agents JVM." so you can try jdk7 instead of 8

i will bet on 1 and 4. All the best!





On Mon, Feb 11, 2019 at 2:52 PM Glenn Burkhardt <[hidden email]> wrote:
I've attached the two logs.  I did make a change after seeing 'Failed to ack the stream' to turn off scatter/gather.  But I think that's a bit of a "hail mary".

On Monday, February 11, 2019 at 10:16:05 AM UTC-5, Ivan Fernandez Calvo wrote:
It could be tons of things, Probably, if you go to the node log (JENKINS_URL/computer/NODE_NAME/log) we could see if the agent connect or not, then if the communication is broken before/after connection and in which point.


--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].


--
Regards
nirish okram

--
You received this message because you are subscribed to a topic in the Google Groups "Jenkins Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/jenkinsci-users/VbdIfb4ua9A/unsubscribe.
To unsubscribe from this group and all its topics, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/CAPzcO4iiX8iOZviNzXfynLDjkyt7pCUQ8Z7G_8kmg%3DaCS-2%3DTw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/CAKo5QrrMnacuS05sUA-EBjECqf5eyB58QPCHg%3DTMpWjBNzxhMw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: remoting issue

Glenn Burkhardt
I have to agree that the Java version is a red herring.

Our VMs have been more stable in the last few days.  The changes that I think might have had an effect are:

a)  Increasing the connection timeout to 60000
b) Turning off scatter/gather

But I also upgraded them all with the most recent Ubuntu updates, including Oracle Java 8.201, and Linux version 4.15.0-45-generic.

I'll wait a few days, and then try backing out changes to see which makes the problem return.  Thanks.

On Tuesday, February 12, 2019 at 2:28:21 AM UTC-5, Ivan Fernandez Calvo wrote:
I think you can discard JVM versions issues and connectivity issues, because the agent connects and copy the remoting.jar, How much time takes between the agent connect and the channel is broken? Check you do not have any configuration  that kill idle connections, also try to assign memory to the remoting process (-Xmx256m -Xms256m), check those files dmesg, kernel.log, syslog, and messages of the agent to see if there is some error.

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/0cbe8b8e-cb2c-4f95-9225-8e7ced80ce4a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: remoting issue

Ivan Fernandez Calvo
I dunno if it is your case, but things like VMotion that changes the VM to a less loaded rack, energy saving, any kind of power on network, and in general every that suspends the VM will break the connection, Jenkins check on agents every 4 min, if the agent does not respond on time Jenkins breaks the channel.

El 18 feb 2019, a las 16:17, Glenn Burkhardt <[hidden email]> escribió:

I have to agree that the Java version is a red herring.

Our VMs have been more stable in the last few days.  The changes that I think might have had an effect are:

a)  Increasing the connection timeout to 60000
b) Turning off scatter/gather

But I also upgraded them all with the most recent Ubuntu updates, including Oracle Java 8.201, and Linux version 4.15.0-45-generic.

I'll wait a few days, and then try backing out changes to see which makes the problem return.  Thanks.

On Tuesday, February 12, 2019 at 2:28:21 AM UTC-5, Ivan Fernandez Calvo wrote:
I think you can discard JVM versions issues and connectivity issues, because the agent connects and copy the remoting.jar, How much time takes between the agent connect and the channel is broken? Check you do not have any configuration  that kill idle connections, also try to assign memory to the remoting process (-Xmx256m -Xms256m), check those files dmesg, kernel.log, syslog, and messages of the agent to see if there is some error.

--
You received this message because you are subscribed to a topic in the Google Groups "Jenkins Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/jenkinsci-users/VbdIfb4ua9A/unsubscribe.
To unsubscribe from this group and all its topics, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/0cbe8b8e-cb2c-4f95-9225-8e7ced80ce4a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-users/C0BE11E8-E103-4BBF-A273-6AF43535E312%40gmail.com.
For more options, visit https://groups.google.com/d/optout.