slow slave performance and yourkit license

slow slave performance and yourkit license

John McNair
I have recently been experiencing severe performance problems with slaves.  Slaves used to run almost identically to the master node, but now take 3-6 times as long.  All the build machines are very similarly spec'ed VMs running on very similar h/w.

I ran a single node for a long time and only added a slave when 303 was current.  Performance was almost identical on slaves and masters.  On the afternoon of 6/11, I noticed a dramatic slowdown on the slaves.  20 minute builds were taking 1 hour.  I had upgraded IcedTea on the master node in the morning, so I downgraded.  This did not impact anything.  I investigated system level issues, especially the ethernet driver.  The end of that story is that I can scp large files consistently from master to slave at a steady 3.0-3.2 MB/s.  The physical hosts share a single 100 Mb switch.  Also, the builds that take one hour on the slaves via hudson still only take 15-20 minutes when run manually on the slaves.

I have been able to narrow the issue down (I think).  I noticed that archiving artifacts from slave to master takes an extraordinary amount of time.  jnettop puts the slave->master network usage at consistently less than 60 kB/s, even during artifact archival, which is roughly 2% of the available bandwidth given this combination of switch, drivers, ssh, etc.

I upgraded to the current Hudson build at the time (may have been 309) as part of troubleshooting, but that made no difference.  310 and 311 show the same performance.  I have made some guesses inside FilePath, but have not been able to influence performance with my custom builds either.

So I have a few developer questions:
1. Is anyone else seeing this with large artifact builds on a sufficient network connection (i.e., LAN)?
2. Is there any buffering/flushing logic that might be sensitive to environmental concerns?  The symptoms resemble a buffering problem, but I can't prove that.
3. Any recent code changes that are likely linked to this?
4. Any suggestions to resolve this?

Answers to the above would be appreciated, but in addition, is it still possible to get a YourKit license?  I am making random guesses, building, deploying, starting a build, etc.  That's not nearly as effective as measuring performance through a profiler.

--
John McNair
[hidden email]

Re: slow slave performance and yourkit license

Kohsuke Kawaguchi
John McNair wrote:

> [...]
> I have been able to narrow the issue down (I think).  I noticed that
> archiving artifacts from slave to master takes an extraordinary amount of
> time.  jnettop puts the slave->master network usage at consistently less
> than 60 kB/s, even during artifact archival, which is roughly 2% of the
> available bandwidth given this combination of switch, drivers, ssh, etc.
I believe I fixed this in 1.311. In my deployment this made a
substantial improvement.

> [...]
> 2. Is there any buffering/flushing logic that might be sensitive to
> environmental concerns?  The symptoms resemble a buffering problem, but I
> can't prove that.
This is often the cause, and indeed the bug in 1.311 was caused by
excessive flushing by TarOutputStream.
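
To make the failure mode concrete, here is a toy sketch (not Hudson's
actual remoting code; the sink class and its 5 ms per-flush delay are
invented purely for illustration): when every flush costs a round trip,
flushing per 512-byte tar record dominates the transfer, while buffering
in front of the channel amortizes it.

import java.io.*;

// Toy sink where every flush() costs a fixed simulated "round trip"
// (5 ms, illustrative only) -- a stand-in for flushing across the channel.
class SlowFlushSink extends FilterOutputStream {
    SlowFlushSink(OutputStream out) { super(out); }
    @Override public void flush() throws IOException {
        try { Thread.sleep(5); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        super.flush();
    }
}

public class FlushCostDemo {
    public static void main(String[] args) throws IOException {
        byte[] record = new byte[512];  // tar works in 512-byte records

        // Flush after every record: one simulated round trip per 512 bytes.
        OutputStream eager = new SlowFlushSink(new ByteArrayOutputStream());
        long t0 = System.nanoTime();
        for (int i = 0; i < 400; i++) { eager.write(record); eager.flush(); }
        long eagerMs = (System.nanoTime() - t0) / 1_000_000;

        // Same bytes, buffered, flushed once at the end.
        OutputStream buffered = new BufferedOutputStream(
                new SlowFlushSink(new ByteArrayOutputStream()), 1 << 20);
        t0 = System.nanoTime();
        for (int i = 0; i < 400; i++) buffered.write(record);
        buffered.flush();
        long bufferedMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.printf("flush-per-record: %d ms, buffered: %d ms%n", eagerMs, bufferedMs);
    }
}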

> [...]
> Answers to the above would be appreciated, but in addition, is it still
> possible to get a YourKit license?  I am making random guesses, building,
> deploying, starting a build, etc.  That's not nearly as effective as
> measuring performance through a profiler.

If you are a Hudson committer, I can certainly request additional
licenses. If not, I'll make you a committer, so let me know your
java.net ID.

--
Kohsuke Kawaguchi
Sun Microsystems                   http://weblogs.java.net/blog/kohsuke/


Re: slow slave performance and yourkit license

John McNair


On Mon, Jun 22, 2009 at 7:00 PM, Kohsuke Kawaguchi <[hidden email]> wrote:
> [...]
> I believe I fixed this in 1.311. In my deployment this made a
> substantial improvement.
> [...]

311 is the same for me.  Your fix looked like the right thing to do.  I just must be hitting something else.

I think I can put together a trivial test pom to expose this.  Just depend on hudson.war and have the pom zip up its dependencies.

My java.net id is pamdirac.

--
John McNair
[hidden email]

Re: slow slave performance and yourkit license

Kohsuke Kawaguchi

A test case would be very helpful. In the meantime, I'm adding you as a
Hudson committer.

John McNair wrote:
> [...]
> My java.net id is pamdirac.
--
Kohsuke Kawaguchi
Sun Microsystems                   http://weblogs.java.net/blog/kohsuke/


Re: slow slave performance and yourkit license

John McNair
Already a committer.  But thanks for your vigilance!

On Tue, Jun 23, 2009 at 4:56 PM, Kohsuke Kawaguchi <[hidden email]> wrote:
> A test case would be very helpful. In the meantime, I'm adding you as a
> Hudson committer.

--
John McNair
[hidden email]

Re: slow slave performance and yourkit license

John McNair
https://hudson.dev.java.net/issues/show_bug.cgi?id=3922

--
John McNair
[hidden email]

Re: slow slave performance and yourkit license

John McNair
I have some follow-up input.  YourKit might still be useful. :)  In the meantime, I've done a little println profiling.  Slave->master file copies are roughly 50x more expensive than straight scp on two different 100 Mbit switches that I have tested.  The culprit seems to be writeObject().

FilePath uses RemoteOutputStream, which uses ProxyOutputStream, which uses Channel.  ProxyOutputStream sends a Chunk object per buffer flush via Channel.send().  writeObject() charges a relatively high constant penalty per invocation.  This is not a big deal for larger, coarser transfers, but it is a high price to pay every 8 KB.
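
Here is a standalone sketch of that per-call tax (the Chunk class below is my own stand-in for illustration, not hudson.remoting's, and the reset() per send is an assumption about each send having to stand alone):

import java.io.*;

public class WriteObjectTaxDemo {
    // Stand-in for the remoting Chunk: a command object carrying one buffer.
    static class Chunk implements Serializable {
        private static final long serialVersionUID = 1L;
        final byte[] payload;
        Chunk(byte[] payload) { this.payload = payload; }
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[8 * 1024];   // one 8 KB buffer per flush
        int n = 4096;                      // 32 MB total

        // Path A: one Chunk through writeObject() per 8 KB.
        ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream());
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            oos.writeObject(new Chunk(buf));
            oos.reset();  // assumed: no handle sharing between independent sends
        }
        long objMs = (System.nanoTime() - t0) / 1_000_000;

        // Path B: the same bytes written raw.
        OutputStream raw = new BufferedOutputStream(new ByteArrayOutputStream());
        t0 = System.nanoTime();
        for (int i = 0; i < n; i++) raw.write(buf);
        raw.flush();
        long rawMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.printf("writeObject: %d ms, raw: %d ms for 32 MB%n", objMs, rawMs);
    }
}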

I switched Chunk to Externalizable with no benefit.  I wrapped a BufferedOutputStream around the RemoteOutputStream inside FilePath.copyTo(OutputStream) and set the buffer size to 1MB.  This helped but not enough.  And it's a bad solution anyway.  It was just a little experiment to see how the performance could be influenced.

I think the real solution would involve more of a true streaming mechanism for cases such as file transfers, as opposed to using the command pattern.  This is at a fairly fundamental spot inside Hudson's core, so I would like to solicit some input from other people who probably care.  Has anyone else noticed this and spent any thought on a suitable solution?

--
John McNair
[hidden email]

Re: slow slave performance and yourkit license

John McNair
I've done some more investigation.  I spiked out a mechanism to bypass Chunk.  I created a Command impl called RemoteFile that does its own file streaming.  I made it Externalizable, and it works roughly like this:

public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
    // This impl doesn't really exist; in practice the ObjectInput is an
    // ObjectInputStream, hence the cast
    IOUtils.copy((InputStream) in, fileOutputStream);
}

public void writeExternal(ObjectOutput out) throws IOException {
    // This impl doesn't really exist either; the ObjectOutput is likewise
    // an ObjectOutputStream in practice
    IOUtils.copy(fileInputStream, (OutputStream) out);
}

The above setup means that we can stream a file with one call to writeObject(), and we can send the file as a true stream without having to buffer too much in memory.

This did improve performance somewhat, but I still haven't gotten better than 530 kB/s when the pipe has roughly 12 MB/s available.  I have tried 8 KB and 64 KB buffers for the file streaming, which should be sufficient on a low-latency connection (same switch).
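
For what it's worth, a back-of-envelope model (the 15 ms figure is fitted, not measured): a fixed per-chunk cost reproduces the 530 kB/s I see at 8 KB, but the same model predicts large gains at 64 KB that aren't materializing, which makes me doubt the bottleneck is a simple per-chunk constant.

// Back-of-envelope model: throughput = size / (perChunkCost + size/bandwidth).
// A fitted 15 ms per-chunk cost matches the ~530 kB/s seen at 8 KB, but
// predicts big gains at 64 KB that I'm not seeing -- so the real cost is
// probably not a simple per-chunk constant. (Numbers fitted, not measured.)
public class ThroughputModel {
    public static void main(String[] args) {
        double bandwidth = 12e6;          // ~12 MB/s available on the pipe
        double perChunkCost = 0.015;      // seconds; fitted to the 8 KB case
        for (int size : new int[] {8 << 10, 64 << 10, 1 << 20}) {
            double tput = size / (perChunkCost + size / bandwidth);
            System.out.printf("%8d-byte chunks -> %6.0f kB/s%n", size, tput / 1000);
        }
    }
}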

I'm now at a loss for where I'm spending time.  Any more suggestions from others out there?

I will keep investigating as time allows.

--
John McNair
[hidden email]

Re: slow slave performance and yourkit license

Andrew Chandler
Well, I can confirm that we have a fairly expensive transfer back to master from slave once our builds are done (we have 100 MB artifacts).  Not sure it's the same thing, but it does seem unreasonable.



On Mon, 2009-06-29 at 12:16 -0400, John McNair wrote:
> [...]
> I'm now at a loss for where I'm spending time.  Any more suggestions from
> others out there?