Quantcast

Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

GLundh
This post was updated on .
We are pretty sure we have tracked down one of the reasons for channels being terminated in the middle of builds.

As seen in a few JIRA issues:

hudson.util.IOException2: Failed to join the process
        at hudson.Proc$RemoteProc.join(Proc.java:312)
        at hudson.Launcher$ProcStarter.join(Launcher.java:280)
        at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:83)
[...]
Caused by: hudson.remoting.ChannelClosedException: channel is already closed
        at hudson.remoting.Channel.send(Channel.java:412)
        at hudson.remoting.Request.call(Request.java:105)
        at hudson.remoting.Channel.call(Channel.java:555)
[...]

Our build system have 100+ slaves. Every now and then we lose important builds due to the connections being lost to seemingly random build machines.

We have tried various keep alive options to ensure that the connection to the build machine is always kept open. With no real success. For us, this is a critical flaw.

But what we have seen lately, is that a random number of slaves lose their connection whenever the GC starts/stops working. This is a reproducible error.

The reason seems to be that the GC takes 90+ seconds to complete during heavy system load. During this time the GC freezes the Hudson process and cannot receive the ping being sent through the PingThread executed in Launcher.java or Engine.java. When the GC is done, the ping (if executed prior to the GC starts working) directly times out and the channel is closed unconditionally. I guess this is the reason Hudson forcibly terminates the channel. If not, the closed channel is still directly related to the GC taking 60sec+ to complete its work.

Probably the same issue that can be seen in this JIRA:
http://issues.hudson-ci.org/browse/HUDSON-6566

Can we please get a way to optionally disable the ping, both from master side and slave side? At least for verification purposes.

Br
Gustaf Lundh - Sony Ericsson
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

John Pederzolli
Re: Regarding the IOException2/ChannelClosedException many of us has been seeing lately. Although having a way to disable the ping &/or having some type of fall-back mechanism would be ideal, have you tried enabling concurrent GC?

- John




On 7/7/10 4:20 AM, "Lundh, Gustaf" <Gustaf.Lundh@...> wrote:

We are pretty sure we have tracked down one of the reasons for channels being terminated in the middle of builds.
 
As seen in a few JIRA issues:
 
hudson.util.IOException2: Failed to join the process
        at hudson.Proc$RemoteProc.join(Proc.java:312)
        at hudson.Launcher$ProcStarter.join(Launcher.java:280)
        at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:83)
[...]
Caused by: hudson.remoting.ChannelClosedException: channel is already closed
        at hudson.remoting.Channel.send(Channel.java:412)
        at hudson.remoting.Request.call(Request.java:105)
        at hudson.remoting.Channel.call(Channel.java:555)
[...]
 
Our build system have 100+ slaves. Every now and then we lose important builds due to the connections being lost to seemingly random build machines.
 
We have tried various keep alive options to ensure that the connection to the build machine is always kept open. With no real success. For us, this is a critical flaw.
 
But what we have seen lately, is that a random number of slaves lose their connection whenever the GC starts/stops working. This is a reproducible error.
 
The reason seems to be that the GC takes 90+ seconds to complete during heavy system load. During this time the GC freezes the Hudson process and cannot receive the ping being sent through the PingThread executed in Launcher.java or Engine.java. When the GC is done, the ping (if executed prior to the GC starts working) directly times out and the channel is closed unconditionally. I guess this is the reason Hudson forcibly terminates the channel. If not, the closed channel is still directly related to the GC taking 60sec+ to complete its work.
 
Probably the same issue that can be seen in this JIRA:
http://issues.hudson-ci.org/browse/HUDSON-6566
 
Can we please get a way to optionally disable the ping, both from master side and slave side? At least for verification purposes.
 
Br
Gustaf Lundh - Sony Ericsson

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

Tom Huybrechts
In reply to this post by GLundh
I've made a few patches in my Hudson instance to deal with this
(increase ping timeout, reduce JUnit history).
Try building it from
http://github.com/huybrechts/hudson/tree/production (it is based on
source of May 28).

The cobertura plugin is also really good at eating all your memory.
Either disable it, or patch it to disable history graphs.

Have a look at your build results too: e.g. do you really need those
50.000 low priority FindBugs warnings that nobody ever looks at ?

Tom

On Wed, Jul 7, 2010 at 1:20 PM, Lundh, Gustaf
<[hidden email]> wrote:

> We are pretty sure we have tracked down one of the reasons for channels
> being terminated in the middle of builds.
>
>
>
> As seen in a few JIRA issues:
>
>
>
> hudson.util.IOException2: Failed to join the process
>         at hudson.Proc$RemoteProc.join(Proc.java:312)
>         at hudson.Launcher$ProcStarter.join(Launcher.java:280)
>         at
> hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:83)
>
> [...]
>
> Caused by: hudson.remoting.ChannelClosedException: channel is already closed
>         at hudson.remoting.Channel.send(Channel.java:412)
>         at hudson.remoting.Request.call(Request.java:105)
>         at hudson.remoting.Channel.call(Channel.java:555)
>
> [...]
>
>
>
> Our build system have 100+ slaves. Every now and then we lose important
> builds due to the connections being lost to seemingly random build machines.
>
>
>
> We have tried various keep alive options to ensure that the connection to
> the build machine is always kept open. With no real success. For us, this is
> a critical flaw.
>
>
>
> But what we have seen lately, is that a random number of slaves lose their
> connection whenever the GC starts/stops working. This is a reproducible
> error.
>
>
>
> The reason seems to be that the GC takes 90+ seconds to complete during
> heavy system load. During this time the GC freezes the Hudson process and
> cannot receive the ping being sent through the PingThread executed in
> Launcher.java or Engine.java. When the GC is done, the ping (if executed
> prior to the GC starts working) directly times out and the channel is closed
> unconditionally. I guess this is the reason Hudson forcibly terminates the
> channel. If not, the closed channel is still directly related to the GC
> taking 60sec+ to complete its work.
>
>
>
> Probably the same issue that can be seen in this JIRA:
>
> http://issues.hudson-ci.org/browse/HUDSON-6566
>
>
>
> Can we please get a way to optionally disable the ping, both from master
> side and slave side? At least for verification purposes.
>
>
>
> Br
>
> Gustaf Lundh - Sony Ericsson

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

GLundh
In reply to this post by John Pederzolli
John Pederzolli wrote
Although having a way to disable the ping &/or having some type of fall-back mechanism would be ideal, have you tried enabling concurrent GC?
Hi, John!

No, currently we only set Max/Min Heap to decent values and run the VM with -XX:+UseParallelGC.

As I understand the Sun JavaVM  documentation, the Concurrent GC still will pause the application threads during the major collections. And from the VM FAQs I have also gotten the impression that the Concurrent GC also brings a quite hard performance penalty.

However, I will of course investigate your suggestion further and hopefully be able to give it a try for evaluation purposes in our environment.

Thanks for the suggestion!

Further on, I would also like to take this post as an opportunity, to add my question from from the JIRA issue to the mail thread:

Is there a possibility that the channel can get so _congested_ from one (or many) ongoing heavy tasks (file copy/searching/parsing/server load/etc), that the PingThread cannot complete its request on the same channel - and therefor actually terminate an healthy connection?

Best Regards
Gustaf Lundh - Sony Ericsson
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

Dean Yu
On 7/7/10 1:48 PM, "GLundh" <[hidden email]> wrote:
> Further on, I would also like to take this post as an opportunity, to add my
> question from from the JIRA issue to the mail thread:
>
> Is there a possibility that the channel can get so _congested_ from one (or
> many) ongoing heavy tasks (file copy/searching/parsing/server load/etc),
> that the PingThread cannot complete its request on the same channel - and
> therefor actually terminate an healthy connection?

In my experience, this is very possible. I diagnosed as much as part of
this:

http://issues.hudson-ci.org/browse/HUDSON-5413

  -- Dean

>
> Best Regards
> Gustaf Lundh - Sony Ericsson


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

GLundh
In reply to this post by Tom Huybrechts
Tom Huybrechts wrote
I've made a few patches in my Hudson instance to deal with this
(increase ping timeout, reduce JUnit history).
Try building it from
http://github.com/huybrechts/hudson/tree/production (it is based on
source of May 28).

The cobertura plugin is also really good at eating all your memory.
Either disable it, or patch it to disable history graphs.

Have a look at your build results too: e.g. do you really need those
50.000 low priority FindBugs warnings that nobody ever looks at ?

Tom
Thank you!

I will download the code and give it a run. How delightful that you actually had a commit with the following message:

  "increase slave connection timeout to 4 minutes, to account for long full GCs on server"

Still, let us continue the discussion. I though feel we need a more permanent fix in the official trunk.

Any other core developers that want to comment on the issue? Since this ping issue brings more problems than it solves, how about disabling ping by default and use the old --ping option to enable it...?  (This counts for both launcher.java and engine.java)

Or at least let us increase the timeout to 4-5min. I'd be happy to submit a patch if someone can OK this.

Br
Gustaf Lundh - Sony Ericsson
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

GLundh
In reply to this post by Dean Yu
Hi, Dean!

Thanks for the interesting link.

It seems quite necessary to rework parts of the channel communication code, since it appears that the same base-issue affects various parts in Hudson. E.g. channels gets easily congested and renders critical parts of the system bug ridden or non-working.

But until a more major and long term change can be worked out, I would like to solve our (and apparently other's) issue first. E.g. allowing the optional use of the PingThread.

Suggestion:
A "--noping" option for Hudson. Does that seems reasonable? In that case, I'll be happy to commit a patch.
 
Br
Gustaf Lundh - Sony Ericsson


-----Original Message-----
From: Dean Yu [mailto:[hidden email]]
Sent: den 7 juli 2010 22:55
To: [hidden email]; Lundh, Gustaf
Subject: Re: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

On 7/7/10 1:48 PM, "GLundh" <[hidden email]> wrote:
> Further on, I would also like to take this post as an opportunity, to add my
> question from from the JIRA issue to the mail thread:
>
> Is there a possibility that the channel can get so _congested_ from one (or
> many) ongoing heavy tasks (file copy/searching/parsing/server load/etc),
> that the PingThread cannot complete its request on the same channel - and
> therefor actually terminate an healthy connection?

In my experience, this is very possible. I diagnosed as much as part of
this:

http://issues.hudson-ci.org/browse/HUDSON-5413

  -- Dean

>
> Best Regards
> Gustaf Lundh - Sony Ericsson


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

GLundh
For us, this issue is very critical.

I would like to revert the Launcher.java's "boolean ping = true" to "ping = false". So that the -ping is no longer no-op and ping is off by default:

Launcher.java:
86      // no-op, but left for backward compatibility
87   @Option(name="-ping")
88   public boolean ping = true; // <- Switch to false

Since Hudson generates high amounts of soft/weak references, the GC has to work really hard. And any Hudson installation that goes up to +2GB-4GB will see the same long GC execution times and issues as many of us have seen. E.g. lost slaves and builds.

This _really_ needs to be fixed asap.

Is everyone OK with me reverting this behavior? Kohsuke?

Br
Gustaf Lundh - Team Hudson @ Sony Ericsson


-----Original Message-----
From: Lundh, Gustaf [mailto:[hidden email]]
Sent: den 8 juli 2010 09:58
To: Dean Yu; [hidden email]
Subject: RE: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

Hi, Dean!

Thanks for the interesting link.

It seems quite necessary to rework parts of the channel communication code, since it appears that the same base-issue affects various parts in Hudson. E.g. channels gets easily congested and renders critical parts of the system bug ridden or non-working.


But until a more major and long term change can be worked out, I would like to solve our (and apparently other's) issue first. E.g. allowing the optional use of the PingThread.

Suggestion:
A "--noping" option for Hudson. Does that seems reasonable? In that case, I'll be happy to commit a patch.
 
Br
Gustaf Lundh - Sony Ericsson

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: RE: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

Amarkanth Ranganamayna (amrangan)
Hello Gustaf,

"IOException2/ChannelClosedException" issue has been bothering us for a
couple of months now. I'm glad you found a potential fix. I'll look
forward to this patch.


Thanks,
Amar

-----Original Message-----
From: Lundh, Gustaf [mailto:[hidden email]]
Sent: Friday, July 09, 2010 5:21 AM
To: [hidden email]
Cc: Kohsuke Kawaguchi
Subject: RE: Regarding the IOException2/ChannelClosedException many of
us has been seeing lately.

For us, this issue is very critical.

I would like to revert the Launcher.java's "boolean ping = true" to
"ping = false". So that the -ping is no longer no-op and ping is off by
default:

Launcher.java:
86      // no-op, but left for backward compatibility
87   @Option(name="-ping")
88   public boolean ping = true; // <- Switch to false

Since Hudson generates high amounts of soft/weak references, the GC has
to work really hard. And any Hudson installation that goes up to
+2GB-4GB will see the same long GC execution times and issues as many of
us have seen. E.g. lost slaves and builds.

This _really_ needs to be fixed asap.

Is everyone OK with me reverting this behavior? Kohsuke?

Br
Gustaf Lundh - Team Hudson @ Sony Ericsson


-----Original Message-----
From: Lundh, Gustaf [mailto:[hidden email]]
Sent: den 8 juli 2010 09:58
To: Dean Yu; [hidden email]
Subject: RE: Regarding the IOException2/ChannelClosedException many of
us has been seeing lately.

Hi, Dean!

Thanks for the interesting link.

It seems quite necessary to rework parts of the channel communication
code, since it appears that the same base-issue affects various parts in
Hudson. E.g. channels gets easily congested and renders critical parts
of the system bug ridden or non-working.



But until a more major and long term change can be worked out, I would
like to solve our (and apparently other's) issue first. E.g. allowing
the optional use of the PingThread.

Suggestion:
A "--noping" option for Hudson. Does that seems reasonable? In that
case, I'll be happy to commit a patch.
 
Br
Gustaf Lundh - Sony Ericsson

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

Andrew Bayer
In reply to this post by GLundh
+1 - the ping is a nice feature but not worth the pain of these issues.

A.

On Fri, Jul 9, 2010 at 5:21 AM, Lundh, Gustaf <[hidden email]> wrote:
For us, this issue is very critical.

I would like to revert the Launcher.java's "boolean ping = true" to "ping = false". So that the -ping is no longer no-op and ping is off by default:

Launcher.java:
86          // no-op, but left for backward compatibility
87         @Option(name="-ping")
88         public boolean ping = true; // <- Switch to false

Since Hudson generates high amounts of soft/weak references, the GC has to work really hard. And any Hudson installation that goes up to +2GB-4GB will see the same long GC execution times and issues as many of us have seen. E.g. lost slaves and builds.

This _really_ needs to be fixed asap.

Is everyone OK with me reverting this behavior? Kohsuke?

Br
Gustaf Lundh - Team Hudson @ Sony Ericsson


-----Original Message-----
From: Lundh, Gustaf [mailto:[hidden email]]
Sent: den 8 juli 2010 09:58
To: Dean Yu; [hidden email]
Subject: RE: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

Hi, Dean!

Thanks for the interesting link.

It seems quite necessary to rework parts of the channel communication code, since it appears that the same base-issue affects various parts in Hudson. E.g. channels gets easily congested and renders critical parts of the system bug ridden or non-working.


But until a more major and long term change can be worked out, I would like to solve our (and apparently other's) issue first. E.g. allowing the optional use of the PingThread.

Suggestion:
A "--noping" option for Hudson. Does that seems reasonable? In that case, I'll be happy to commit a patch.

Br
Gustaf Lundh - Sony Ericsson

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

Mike Dillon-5
In reply to this post by GLundh
I rather see an approach where the ping is retained, but the timeout
exception can be treated as non-fatal. In the case that a ping timeout
is not treated as fatal, the Launcher should keep a running count of the
number of failed pings and the start time for the outage. In an ideal
world, these numbers would then be available via JMX for monitoring by a
third system that watched the health of both the master and all slaves.

-md

begin Lundh, Gustaf quotation:

> For us, this issue is very critical.
>
> I would like to revert the Launcher.java's "boolean ping = true" to "ping = false". So that the -ping is no longer no-op and ping is off by default:
>
> Launcher.java:
> 86      // no-op, but left for backward compatibility
> 87   @Option(name="-ping")
> 88   public boolean ping = true; // <- Switch to false
>
> Since Hudson generates high amounts of soft/weak references, the GC has to work really hard. And any Hudson installation that goes up to +2GB-4GB will see the same long GC execution times and issues as many of us have seen. E.g. lost slaves and builds.
>
> This _really_ needs to be fixed asap.
>
> Is everyone OK with me reverting this behavior? Kohsuke?
>
> Br
> Gustaf Lundh - Team Hudson @ Sony Ericsson
>
>
> -----Original Message-----
> From: Lundh, Gustaf [mailto:[hidden email]]
> Sent: den 8 juli 2010 09:58
> To: Dean Yu; [hidden email]
> Subject: RE: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.
>
> Hi, Dean!
>
> Thanks for the interesting link.
>
> It seems quite necessary to rework parts of the channel communication code, since it appears that the same base-issue affects various parts in Hudson. E.g. channels gets easily congested and renders critical parts of the system bug ridden or non-working.
>
>
> But until a more major and long term change can be worked out, I would like to solve our (and apparently other's) issue first. E.g. allowing the optional use of the PingThread.
>
> Suggestion:
> A "--noping" option for Hudson. Does that seems reasonable? In that case, I'll be happy to commit a patch.
>  
> Br
> Gustaf Lundh - Sony Ericsson
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

Alan Harder-2
In reply to this post by GLundh
I'm committing this change for next week's release.. hope this will be
helpful:

1) change defaults from ping every 5 min with 1 min timeout,
   to ping every 10 min with 4 min timeout.
2) add hudson.remoting.Launcher.pingTimeoutSec
   and hudson.remoting.Launcher.pingIntervalSec system properties to
override
   defaults.  Setting either to <= 0 disables the ping thread.

    - Alan


Lundh, Gustaf wrote:

> For us, this issue is very critical.
>
> I would like to revert the Launcher.java's "boolean ping = true" to "ping = false". So that the -ping is no longer no-op and ping is off by default:
>
> Launcher.java:
> 86      // no-op, but left for backward compatibility
> 87   @Option(name="-ping")
> 88   public boolean ping = true; // <- Switch to false
>
> Since Hudson generates high amounts of soft/weak references, the GC has to work really hard. And any Hudson installation that goes up to +2GB-4GB will see the same long GC execution times and issues as many of us have seen. E.g. lost slaves and builds.
>
> This _really_ needs to be fixed asap.
>
> Is everyone OK with me reverting this behavior? Kohsuke?
>
> Br
> Gustaf Lundh - Team Hudson @ Sony Ericsson
>
>
> -----Original Message-----
> From: Lundh, Gustaf [mailto:[hidden email]]
> Sent: den 8 juli 2010 09:58
> To: Dean Yu; [hidden email]
> Subject: RE: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.
>
> Hi, Dean!
>
> Thanks for the interesting link.
>
> It seems quite necessary to rework parts of the channel communication code, since it appears that the same base-issue affects various parts in Hudson. E.g. channels gets easily congested and renders critical parts of the system bug ridden or non-working.
>
>
> But until a more major and long term change can be worked out, I would like to solve our (and apparently other's) issue first. E.g. allowing the optional use of the PingThread.
>
> Suggestion:
> A "--noping" option for Hudson. Does that seems reasonable? In that case, I'll be happy to commit a patch.
>  
> Br
> Gustaf Lundh - Sony Ericsson
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>  

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Regarding the IOException2/ChannelClosedException many of us has been seeing lately.

GLundh
Fantastic!

Thanks for the quick fix! Let us know where we should send the bottle of wine. :)

Br
Gustaf Lundh @ Sony Ericsson


Alan Harder-2 wrote
I'm committing this change for next week's release.. hope this will be
helpful:

1) change defaults from ping every 5 min with 1 min timeout,
   to ping every 10 min with 4 min timeout.
2) add hudson.remoting.Launcher.pingTimeoutSec
   and hudson.remoting.Launcher.pingIntervalSec system properties to
override
   defaults.  Setting either to <= 0 disables the ping thread.

    - Alan
Loading...