Ping failed. Terminating

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Ping failed. Terminating

Steele, Richard-2
I have a Windows XP slave machine (running in a VM) that periodically dies with the cryptic message "Ping failed. Terminating" in its log file.  This then seems to tickle http://issues.hudson-ci.org/browse/HUDSON-5413, and I end up having to restart the master to recover.

I can't quite figure out why only this slave is affected, but it's pretty consistent, at least twice a day.

Anyone else see this?  Anyone know what it means?  I see it's coming from Launcher.java when the PingThread, but why?

It would be helpful if the message had a little more information in it, like a timestamp, because then maybe I could correlate to a specific job.

Thanks,
Rich
Reply | Threaded
Open this post in threaded view
|

Re: Ping failed. Terminating

Steele, Richard-2
The node that keeps shutting down on my typically runs a job that's configured to collect code coverage (using the emma plugin) as well as tasks, warnings, and findbugs.  I've turned all of these off and the node seems to be stable again.

I'm going to reenable them one at a time to see which, if any, are the culprit.  Anyone have any theories?  Which of these plugins could knock a channel off-line, causing the ping to fail, causing the node stop with a call to System.exit()?

Rich

On Wed, Feb 3, 2010 at 11:41 AM, Steele, Richard <[hidden email]> wrote:
I have a Windows XP slave machine (running in a VM) that periodically dies with the cryptic message "Ping failed. Terminating" in its log file.  This then seems to tickle http://issues.hudson-ci.org/browse/HUDSON-5413, and I end up having to restart the master to recover.

I can't quite figure out why only this slave is affected, but it's pretty consistent, at least twice a day.

Anyone else see this?  Anyone know what it means?  I see it's coming from Launcher.java when the PingThread, but why?

It would be helpful if the message had a little more information in it, like a timestamp, because then maybe I could correlate to a specific job.

Thanks,
Rich

Reply | Threaded
Open this post in threaded view
|

Re: Ping failed. Terminating

Steele, Richard-2
It seems that using the Emma plugin is somehow causing the channel to get closed; once closed, the PingThread must lose contact with the master and therefore it kills the node's slave process.  Once killed, the server is no longer able to reconnect to the node and must be restarted.

To diagnose the problem, I disabled the emma plugin, ran the job manually 10 or so times with no problems; I did the same after enabling the emma plugin in the project and I got the following after a single attempt at building the project:

From the job's console:

BUILD SUCCESSFUL
Total time: 1 minute 47 seconds
Recording test results
Emma: looking for coverage reports in the entire workspace: C:\Hudson\workspace\X
Emma: found 3 report files:
          C:\Hudson\workspace\X\LMDEVRSTEELE4_X\png\ant-build-tools\build\reports\emma-report\coverage.xml
          C:\Hudson\workspace\X\LMDEVRSTEELE4_X\png\api\build\reports\emma-report\coverage.xml
          C:\Hudson\workspace\X\LMDEVRSTEELE4_X\png\fw\tools\build\reports\emma-report\coverage.xml
Emma: stored 3 report files in the build folder: C:\Hudson\jobs\X\builds\2010-02-08_15-14-15\emma
Emma: Coverage: Classes 23/70 (33%). Methods 130/470 (28%). Blocks 1553/7545 (21%). Lines 359.5/1709 (21%).
No emails were triggered.
Finished: UNSTABLE

From the SCM polling log:

Started on Feb 8, 2010 3:31:15 PM
remote file operation failed: C:\Hudson\workspace\X at hudson.remoting.Channel@1a99984:LMDEVRSTEELE4
hudson.util.IOException2: remote file operation failed: C:\Hudson\workspace\X at hudson.remoting.Channel@1a99984:LMDEVRSTEELE4
    at hudson.FilePath.act(FilePath.java:690)
    at hudson.FilePath.act(FilePath.java:676)
    at hudson.FilePath.exists(FilePath.java:944)
    at hudson.model.AbstractProject.pollSCMChanges(AbstractProject.java:1041)
    at hudson.triggers.SCMTrigger$Runner.runPolling(SCMTrigger.java:319)
    at hudson.triggers.SCMTrigger$Runner.run(SCMTrigger.java:346)
    at hudson.util.SequentialExecutionQueue$QueueEntry.run(SequentialExecutionQueue.java:118)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Connection reset by peer: socket write error
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(Unknown Source)
    at java.net.SocketOutputStream.write(Unknown Source)
    at hudson.remoting.SocketOutputStream.write(SocketOutputStream.java:58)
    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    at java.io.BufferedOutputStream.flush(Unknown Source)
    at java.io.ObjectOutputStream$BlockDataOutputStream.flush(Unknown Source)
    at java.io.ObjectOutputStream.flush(Unknown Source)
    at hudson.remoting.Channel.send(Channel.java:418)
    at hudson.remoting.Request.call(Request.java:105)
    at hudson.remoting.Channel.call(Channel.java:551)
    at hudson.FilePath.act(FilePath.java:683)
    ... 12 more
Done. Took 0 ms
No changes

Anyone have any ideas why the emma plugin would cause this?

At this point, I'm just happy we've restored stability to our job.  Having the emma plugin track code coverage was nice, but it's far from critical.

Rich

On Fri, Feb 5, 2010 at 7:20 AM, Steele, Richard <[hidden email]> wrote:
The node that keeps shutting down on my typically runs a job that's configured to collect code coverage (using the emma plugin) as well as tasks, warnings, and findbugs.  I've turned all of these off and the node seems to be stable again.

I'm going to reenable them one at a time to see which, if any, are the culprit.  Anyone have any theories?  Which of these plugins could knock a channel off-line, causing the ping to fail, causing the node stop with a call to System.exit()?

Rich


On Wed, Feb 3, 2010 at 11:41 AM, Steele, Richard <[hidden email]> wrote:
I have a Windows XP slave machine (running in a VM) that periodically dies with the cryptic message "Ping failed. Terminating" in its log file.  This then seems to tickle http://issues.hudson-ci.org/browse/HUDSON-5413, and I end up having to restart the master to recover.

I can't quite figure out why only this slave is affected, but it's pretty consistent, at least twice a day.

Anyone else see this?  Anyone know what it means?  I see it's coming from Launcher.java when the PingThread, but why?

It would be helpful if the message had a little more information in it, like a timestamp, because then maybe I could correlate to a specific job.

Thanks,
Rich


Reply | Threaded
Open this post in threaded view
|

Re: Ping failed. Terminating

Keith Kowalczykowski-3
I'm seeing similar behavior on our build box, which I posted about several months ago, but didn't get anywhere with resolving it. Our issue is more drastic in that we can not start any slaves without crashing with the ping failed message. The issue came to light in an upgrade from 1.332 to 1.338, and then we promptly reverted back. I recently had some time to go back and debug this further, and found that specifically the issue comes up due to changes in 1.337. I suspect the culprit is the MultiClassLoader support that Kohsuke relanded  in 1.337 (see http://issues.hudson-ci.org/browse/HUDSON-5048). Can others who are experiencing this issue try reverting to 1.336, to see if it resolves their issue. I'm going to revive my post on the dev list to see if we can't make any progress in getting this resolved.

    -Keith

P.S. My original message from a few months ago about the issue:


Hi All,

We recently updated our hudson install from 1.332 to 1.338 and are now not
able to start our hudson slaves. We launch the slave agent via ssh (ie. ssh
USER@HOST "java -jar bin/slave.jar"), which has worked flawlessly in the
past. Now, when the slave launches, it waits for 1 minute (the timeout in
PingThread.java) and then dies with "Ping failed. Terminating". Below is the
stack trace? Any ideas on what the issue is and/or how to debug it futher?

-Keith

<===[HUDSON REMOTING CAPACITY]===>channel started
Ping failed. Terminating
Unexpected error in launching a slave. This is probably a bug in Hudson
hudson.remoting.RequestAbortedException:
hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request.call(Request.java:137)
at hudson.remoting.Channel.call(Channel.java:547)
at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:305)
at hudson.slaves.CommandLauncher.launch(CommandLauncher.java:111)
at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:180)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.ja
va:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9
08)
at java.lang.Thread.run(Thread.java:619)
[12/22/09 12:49:49] slave agent was terminated
Caused by: hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request.abort(Request.java:257)
at hudson.remoting.Channel.terminate(Channel.java:594)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:872)
java.io.EOFException
at
java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.ja
va:2554)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:852)
Caused by: java.io.EOFException
at
java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.ja
va:2554)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:852)




Steele, Richard wrote:
It seems that using the Emma plugin is somehow causing the channel to get closed; once closed, the PingThread must lose contact with the master and therefore it kills the node's slave process.  Once killed, the server is no longer able to reconnect to the node and must be restarted.

To diagnose the problem, I disabled the emma plugin, ran the job manually 10 or so times with no problems; I did the same after enabling the emma plugin in the project and I got the following after a single attempt at building the project:

From the job's console:

BUILD SUCCESSFUL
Total time: 1 minute 47 seconds
Recording test results
Emma: looking for coverage reports in the entire workspace: C:\Hudson\workspace\X
Emma: found 3 report files:
          C:\Hudson\workspace\X\LMDEVRSTEELE4_X\png\ant-build-tools\build\reports\emma-report\coverage.xml
          C:\Hudson\workspace\X\LMDEVRSTEELE4_X\png\api\build\reports\emma-report\coverage.xml
          C:\Hudson\workspace\X\LMDEVRSTEELE4_X\png\fw\tools\build\reports\emma-report\coverage.xml
Emma: stored 3 report files in the build folder: C:\Hudson\jobs\X\builds\2010-02-08_15-14-15\emma
Emma: Coverage: Classes 23/70 (33%). Methods 130/470 (28%). Blocks 1553/7545 (21%). Lines 359.5/1709 (21%).
No emails were triggered.
Finished: UNSTABLE

From the SCM polling log:

Started on Feb 8, 2010 3:31:15 PM
remote file operation failed: C:\Hudson\workspace\X at hudson.remoting.Channel@1a99984:LMDEVRSTEELE4
hudson.util.IOException2: remote file operation failed: C:\Hudson\workspace\X at hudson.remoting.Channel@1a99984:LMDEVRSTEELE4
    at hudson.FilePath.act(FilePath.java:690)
    at hudson.FilePath.act(FilePath.java:676)
    at hudson.FilePath.exists(FilePath.java:944)
    at hudson.model.AbstractProject.pollSCMChanges(AbstractProject.java:1041)
    at hudson.triggers.SCMTrigger$Runner.runPolling(SCMTrigger.java:319)
    at hudson.triggers.SCMTrigger$Runner.run(SCMTrigger.java:346)
    at hudson.util.SequentialExecutionQueue$QueueEntry.run(SequentialExecutionQueue.java:118)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Connection reset by peer: socket write error
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(Unknown Source)
    at java.net.SocketOutputStream.write(Unknown Source)
    at hudson.remoting.SocketOutputStream.write(SocketOutputStream.java:58)
    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    at java.io.BufferedOutputStream.flush(Unknown Source)
    at java.io.ObjectOutputStream$BlockDataOutputStream.flush(Unknown Source)
    at java.io.ObjectOutputStream.flush(Unknown Source)
    at hudson.remoting.Channel.send(Channel.java:418)
    at hudson.remoting.Request.call(Request.java:105)
    at hudson.remoting.Channel.call(Channel.java:551)
    at hudson.FilePath.act(FilePath.java:683)
    ... 12 more
Done. Took 0 ms
No changes

Anyone have any ideas why the emma plugin would cause this?

At this point, I'm just happy we've restored stability to our job.  Having the emma plugin track code coverage was nice, but it's far from critical.

Rich

On Fri, Feb 5, 2010 at 7:20 AM, Steele, Richard <[hidden email]> wrote:
The node that keeps shutting down on my typically runs a job that's configured to collect code coverage (using the emma plugin) as well as tasks, warnings, and findbugs.  I've turned all of these off and the node seems to be stable again.

I'm going to reenable them one at a time to see which, if any, are the culprit.  Anyone have any theories?  Which of these plugins could knock a channel off-line, causing the ping to fail, causing the node stop with a call to System.exit()?

Rich


On Wed, Feb 3, 2010 at 11:41 AM, Steele, Richard <[hidden email]> wrote:
I have a Windows XP slave machine (running in a VM) that periodically dies with the cryptic message "Ping failed. Terminating" in its log file.  This then seems to tickle http://issues.hudson-ci.org/browse/HUDSON-5413, and I end up having to restart the master to recover.

I can't quite figure out why only this slave is affected, but it's pretty consistent, at least twice a day.

Anyone else see this?  Anyone know what it means?  I see it's coming from Launcher.java when the PingThread, but why?

It would be helpful if the message had a little more information in it, like a timestamp, because then maybe I could correlate to a specific job.

Thanks,
Rich