Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. And more.

Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. And more.

GLundh
Our slaves continuously lock up, making our build cluster unstable. Jobs on these slaves never start the actual build process and cannot be aborted. This is a critical issue.

I see on the tracker and the mailing list that a decent number of people are running into this issue.

So I'm starting this new thread, in the hope of bringing a new and more active discussion on how the issue should be approached and solved, and also to avoid fragmenting the discussion over the various issues on the tracker that touch on this behavior.

I would say the issue is so critical that it needs to be brought up again and re-prioritized, since we know it can actually bring down entire build environments.

First, let's take a look at the effects of the issue.
----------------------------------------------------

1.
When a job is started on a slave that has gone into a hung state, this is the only console output:
"Building remotely on slave001"

- Nothing more will happen. Stuck forever.

The stuck jobs cannot be aborted. Disconnecting a slave (or even removing it) does not help. The slave is stuck.

2.
Due to the deadlock, scripts cannot be run on the slaves through http://<server>/computer/<slave>/script
This of course makes debugging attempts harder.

3.
Thread dumps from the slave cannot be served through the UI either. No pages can be served if there is any type of communication with the affected slave through Request.call().

This has a quite nasty side effect. For instance, when users try to browse a workspace allocated on a hung slave, the HTTP request hangs. If a few users are using the build system, it won't take long until all HTTP-serving threads are occupied and the UI hangs completely.

The same issue goes for all UI calls that communicate with the slave, like issuing a slave thread dump.

4.
In certain circumstances, whole plug-ins / features (SCM polling?) in Hudson can get locked down during a remote call,
for instance when a singleton plug-in calls a method like hudson.EnvVars.getRemote(EnvVars.java:202) against a deadlocked node.

When the plug-in hangs, snowballing can occur, spreading hung jobs across all slaves. E.g. all jobs using that plug-in can get locked down/stuck.

5.
When a slave hangs, it hangs forever. This is interesting, as we will soon see that the affected streams use TIMED_WAITING.

There is no way to break free, except manually interrupting the waiting threads (which is of course risky and not even remotely a solution; a sketch of such an interrupt follows after this list).

6.
Issues and discussions of interest:
http://issues.hudson-ci.org/browse/HUDSON-5977
http://issues.hudson-ci.org/browse/HUDSON-5413

http://hudson.361315.n4.nabble.com/Hang-using-Hudson-Clone-Workspace-SCM-plugin-td2304264.html
http://hudson.361315.n4.nabble.com/Executors-hanging-after-completion-td1597996.html
http://hudson.361315.n4.nabble.com/Polling-hung-td1310838.html
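
As promised above, the only manual escape we have found is interrupting the stuck reader thread from the master's script console. This is a hedged sketch only, not a recommendation: the thread name is an example, and interrupting remoting threads can leave the channel in a corrupt state.

        // Risky last resort: find and interrupt a stuck channel reader thread.
        // "slave001" is an example name; adjust to the affected slave.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().equals("Channel reader thread: slave001")
                    && t.getState() == Thread.State.TIMED_WAITING) {
                t.interrupt(); // may corrupt the channel; use only when already stuck
            }
        }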


Next, let's take a look at the affected threads:
------------------------------------------------

The locked slaves will always have a reader thread looking like this:

        Channel reader thread: slave001

        "Channel reader thread: slave001" Id=611 Group=main TIMED_WAITING on [B@35c5a763
                at java.lang.Object.wait(Native Method)
                -  waiting on [B@35c5a763
                at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:147)
                at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:122)
                at hudson.remoting.ProxyOutputStream$Chunk.execute(ProxyOutputStream.java:164)
                at hudson.remoting.Channel$ReaderThread.run(Channel.java:884)

E.g. a "FastPipedOutputStream.write(FastPipedOutputStream.java:147)" that never completes.

Interestingly enough, it is a TIMED_WAITING state with a 10-minute timeout, but the thread never seems to be able to get out of the FastPipedOutputStream.write() method.
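
Presumably the timeout never helps because the timed wait sits inside the classic guarded-wait loop: when the wait expires, the writer re-checks the buffer, finds it still full, and waits again. A minimal standalone sketch of that pattern (not Hudson's actual code):

        // Minimal sketch of a bounded pipe buffer, illustrating why a thread in
        // write() shows TIMED_WAITING yet never escapes: the wait re-arms in a
        // loop until a reader drains the buffer. Not Hudson code.
        class BoundedBuffer {
            private final byte[] buf = new byte[256];
            private int count;

            synchronized void write(byte b) throws InterruptedException {
                while (count == buf.length)     // buffer full: block the writer
                    wait(10 * 60 * 1000L);      // wakes after 10 min, loops, waits again
                buf[count++] = b;
                notifyAll();
            }

            synchronized byte read() throws InterruptedException {
                while (count == 0)
                    wait();
                byte b = buf[--count];
                notifyAll();                    // frees any writer blocked above
                return b;
            }
        }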

The tasks/jobs getting stuck always look similar:

        Executor #0 for slave001 : executing job-matrix » var1,var2,var3 #459 / waiting for hudson.remoting.Channel@408b8462:slave001

        "Executor #0 for slave001 : executing job-matrix » var1,var2,var3 #459 / waiting for hudson.remoting.Channel@408b8462:slave001" Id=80 Group=main TIMED_WAITING on hudson.remoting.UserRequest@85e26c5
                at java.lang.Object.wait(Native Method)
                -  waiting on hudson.remoting.UserRequest@85e26c5
                at hudson.remoting.Request.call(Request.java:122)
                at hudson.remoting.Channel.call(Channel.java:555)
                at hudson.FilePath.act(FilePath.java:742)
                at hudson.FilePath.act(FilePath.java:735)
                at hudson.FilePath.mkdirs(FilePath.java:801)
                at hudson.model.AbstractProject.checkout(AbstractProject.java:1036)
                at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:479)
                at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:411)
                at hudson.model.Run.run(Run.java:1258)
                at hudson.matrix.MatrixRun.run(MatrixRun.java:130)
                at hudson.model.ResourceController.execute(ResourceController.java:88)
                at hudson.model.Executor.run(Executor.java:129)


The thread is never able to complete the call to hudson.remoting.Request.call(Request.java:122). Interestingly enough, a 30-minute timeout is also defined in the hudson.remoting.Request.call(Request.java:122) method. However, this does not seem to help the thread break free at all.

I should probably also mention that not only regular jobs lock up; all monitoring threads will also lock up:

        "Monitoring thread for Architecture started on Wed Aug 11 14:36:58 CEST 2010 / waiting for hudson.remoting.Channel@408b8462:slave001" Id=5254 Group=main TIMED_WAITING on hudson.remoting.UserRequest@2f27f05f
                at java.lang.Object.wait(Native Method)
                -  waiting on hudson.remoting.UserRequest@2f27f05f
                at hudson.remoting.Request.call(Request.java:122)
                at hudson.remoting.Channel.call(Channel.java:555)
                at hudson.node_monitors.ArchitectureMonitor$DescriptorImpl.monitor(ArchitectureMonitor.java:44)
                at hudson.node_monitors.ArchitectureMonitor$DescriptorImpl.monitor(ArchitectureMonitor.java:41)
                at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:200)

        "Monitoring thread for Clock Difference started on Wed Aug 11 14:36:58 CEST 2010 / waiting for hudson.remoting.Channel@408b8462:slave001" Id=5255 Group=main TIMED_WAITING on hudson.remoting.UserRequest@2b150330
                at java.lang.Object.wait(Native Method)
                -  waiting on hudson.remoting.UserRequest@2b150330
                at hudson.remoting.Request.call(Request.java:122)
                at hudson.remoting.Channel.call(Channel.java:555)
                at hudson.model.Slave.getClockDifference(Slave.java:225)
                at hudson.node_monitors.ClockMonitor$1.monitor(ClockMonitor.java:53)
                at hudson.node_monitors.ClockMonitor$1.monitor(ClockMonitor.java:49)
                at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:200)
       
...

Similar traces can also be found for all served web pages that make calls to the slave.

I fetched a few thread dumps from the hung slaves using jstack. All dumps from locked slaves look the same:


----------------- THREAD DUMP STARTS (SCROLL DOWN) ------------------
2010-08-11 16:02:05
Full thread dump Java HotSpot(TM) Server VM (17.0-b16 mixed mode):

"Attach Listener" daemon prio=10 tid=0x08527400 nid=0x90e waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

"pool-1-thread-105" prio=10 tid=0x08525800 nid=0x56d waiting for monitor entry [0x758fb000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at hudson.remoting.Request.call(Request.java:101)
        - waiting to lock <0x7c4403a0> (a hudson.remoting.Channel)
        at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:160)
        at hudson.remoting.$Proxy5.fetch2(Unknown Source)
        at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:122)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        - locked <0x7c441120> (a hudson.remoting.RemoteClassLoader)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at hudson.remoting.MultiClassLoaderSerializer$Input.resolveClass(MultiClassLoaderSerializer.java:112)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1575)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1496)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1732)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at hudson.remoting.UserRequest.deserialize(UserRequest.java:178)
        at hudson.remoting.UserRequest.perform(UserRequest.java:98)
        at hudson.remoting.UserRequest.perform(UserRequest.java:48)
        at hudson.remoting.Request$2.run(Request.java:270)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

"pool-1-thread-104" prio=10 tid=0x08523c00 nid=0x3d0 waiting for monitor entry [0x7594c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at hudson.remoting.Request$2.run(Request.java:281)
        - waiting to lock <0x7c4403a0> (a hudson.remoting.Channel)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

[Cutting out "pool-1-thread-56" to "pool-1-thread-103". Looking exactly the same]

"pool-1-thread-55" prio=10 tid=0x083dcc00 nid=0x4009 waiting for monitor entry [0x77809000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at hudson.remoting.Request$2.run(Request.java:281)
        - waiting to lock <0x7c4403a0> (a hudson.remoting.Channel)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

"pool-1-thread-54" prio=10 tid=0x083d6400 nid=0x3e36 runnable [0x77753000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        - locked <0x7c4926a0> (a java.io.BufferedOutputStream)
        at java.io.PrintStream.write(PrintStream.java:430)
        - locked <0x7c4403f8> (a java.io.PrintStream)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838)
        at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1809)
        at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1294)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1146)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
        at hudson.remoting.Channel.send(Channel.java:417)
        - locked <0x7c4403a0> (a hudson.remoting.Channel)
        at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:112)
        - locked <0x7c7df988> (a hudson.remoting.ProxyOutputStream)
        at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:100)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        - locked <0x7c7df9a0> (a java.io.BufferedOutputStream)
        at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:161)
        at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:118)
        at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:72)
        - locked <0x7c7dfbc8> (a java.util.zip.GZIPOutputStream)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        - locked <0x7c7dfbe8> (a hudson.util.io.TarArchiver$1)
        at org.apache.tools.tar.TarBuffer.writeBlock(TarBuffer.java:410)
        at org.apache.tools.tar.TarBuffer.writeRecord(TarBuffer.java:388)
        at hudson.org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:337)
        at hudson.util.io.TarArchiver.visit(TarArchiver.java:103)
        at hudson.util.DirScanner$Glob.scan(DirScanner.java:104)
        at hudson.FilePath.writeToTar(FilePath.java:1543)
        at hudson.FilePath.access$800(FilePath.java:159)
        at hudson.FilePath$33.invoke(FilePath.java:1484)
        at hudson.FilePath$33.invoke(FilePath.java:1481)
        at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899)
        at hudson.remoting.UserRequest.perform(UserRequest.java:114)
        at hudson.remoting.UserRequest.perform(UserRequest.java:48)
        at hudson.remoting.Request$2.run(Request.java:270)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

"Ping thread for channel hudson.remoting.Channel@14da8f4:channel" daemon prio=10 tid=0x084ae400 nid=0x35fc waiting for monitor entry [0x7785a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at hudson.remoting.Channel.send(Channel.java:411)
        - waiting to lock <0x7c4403a0> (a hudson.remoting.Channel)
        at hudson.remoting.Request.callAsync(Request.java:170)
        at hudson.remoting.Channel.callAsync(Channel.java:582)
        at hudson.remoting.PingThread.ping(PingThread.java:99)
        at hudson.remoting.PingThread.run(PingThread.java:81)

"Channel reader thread: channel" prio=10 tid=0x08488800 nid=0x35fb runnable [0x778ab000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:199)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        - locked <0x7c4410a8> (a java.io.BufferedInputStream)
        at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249)
        at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542)
        at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at hudson.remoting.Channel$ReaderThread.run(Channel.java:869)

"Low Memory Detector" daemon prio=10 tid=0x77a04800 nid=0x35f9 runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x77a02800 nid=0x35f8 waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x77a00800 nid=0x35f7 waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x083a0000 nid=0x35f6 runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x0838fc00 nid=0x35f5 waiting for monitor entry [0x77c7b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at hudson.remoting.Channel.send(Channel.java:411)
        - waiting to lock <0x7c4403a0> (a hudson.remoting.Channel)
        at hudson.remoting.RemoteInvocationHandler.finalize(RemoteInvocationHandler.java:197)
        at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
        at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:83)
        at java.lang.ref.Finalizer.access$100(Finalizer.java:14)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:160)

"Reference Handler" daemon prio=10 tid=0x0838b000 nid=0x35f4 in Object.wait() [0x77ccc000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x7c443068> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x7c443068> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x082ea000 nid=0x35ee in Object.wait() [0xb69ea000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x7c4403a0> (a hudson.remoting.Channel)
        at java.lang.Object.wait(Object.java:485)
        at hudson.remoting.Channel.join(Channel.java:650)
        - locked <0x7c4403a0> (a hudson.remoting.Channel)
        at hudson.remoting.Launcher.main(Launcher.java:419)
        at hudson.remoting.Launcher.runWithStdinStdout(Launcher.java:365)
        at hudson.remoting.Launcher.run(Launcher.java:205)
        at hudson.remoting.Launcher.main(Launcher.java:167)

"VM Thread" prio=10 tid=0x08388800 nid=0x35f3 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x082f1400 nid=0x35ef runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x082f2c00 nid=0x35f0 runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x082f4000 nid=0x35f1 runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x082f5800 nid=0x35f2 runnable

"VM Periodic Task Thread" prio=10 tid=0x77a0e800 nid=0x35fa waiting on condition

JNI global references: 1079

----------------- THREAD DUMP ENDS ------------------

The interesting part is the lock on <0x7c4403a0> (hudson.remoting.Channel).

It is locked in "pool-1-thread-54", which is always hanging at hudson.remoting.Channel.send(Channel.java:417).

Funnily enough, <0x7c4403a0> is also shown as locked in "main" prio=10 tid=0x082ea000 nid=0x35ee in Object.wait() [0xb69ea000].
But I'm sure there is a reasonable explanation for this. I guess it is just how jstack presents previously acquired locks.

So I was wondering: is there any core expert available to take a look at the deadlock? Is there currently an ongoing plan or discussion on how to solve the clogging issues? Has the channel issue been touched upon on IRC?

Any hints on how to continue? There is only so much debugging you can do in a production environment.
I'm willing to spend some time on this issue, but some help is needed along the way.

What's the way forward? Introducing more timeouts, to at least keep the slaves from locking up forever?

How about starting to draft out some QoS mechanism?

Br
Gustaf Lundh - Sony Ericsson

RE: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

GLundh
> <original post by GLundh. UPDATED>
> Our jobs/slaves continuously lock up, making our build cluster unstable.
>
> I see on the tracker and the mailing list that a decent number of people are running
> into this issue.
>
> So I'm starting this new thread, in hope to bring a new and more active discussion
> on how the issue should be approached and solved. And also to avoid
> fragmenting the discussion over the various issues on the tracker that
> touches on this behavior.
>
> I would say the issue is so critical that it needs to be brought up again and
> re-prioritized, since we know it can actually bring down entire build environments.
>
> Issues and discussions of interest:
> http://issues.hudson-ci.org/browse/HUDSON-5977
> http://issues.hudson-ci.org/browse/HUDSON-5413
>
> http://hudson.361315.n4.nabble.com/Hang-using-Hudson-Clone-Workspace-SCM-plugin-td2304264.html
> http://hudson.361315.n4.nabble.com/Executors-hanging-after-completion-td1597996.html
> http://hudson.361315.n4.nabble.com/Polling-hung-td1310838.html
> </original post>


I have put in some time to investigate the issue and to provide a proof-of-concept solution for finally fixing this bad bug.

Let me present a typical case where this congestion issue appears and locks up a job/slave:

1. Job is started.

2. The executor thread on the master sends a Request to the slave, issuing some stream operations. Example: unzipping a saved workspace to the slave using the Clone Workspace SCM plugin:

(From the Master's ThreadDump)

"Executor #0 for test : executing b #19 / waiting for hudson.remoting.Channel@47338:test" prio=10 tid=0x75b28800 nid=0x702e in Object.wait() [0x75769000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x7cebedd8> (a hudson.remoting.UserRequest)
        at hudson.remoting.Request.call(Request.java:122)
        - locked <0x7cebedd8> (a hudson.remoting.UserRequest)
        at hudson.remoting.Channel.call(Channel.java:555)
        at hudson.FilePath.act(FilePath.java:742)
        at hudson.FilePath.act(FilePath.java:735)
        at hudson.FilePath.unzip(FilePath.java:415)
        at hudson.FileSystemProvisioner$Default$WorkspaceSnapshotImpl.restoreTo(FileSystemProvisioner.java:227)
        at hudson.plugins.cloneworkspace.CloneWorkspaceSCM$Snapshot.restoreTo(CloneWorkspaceSCM.java:344)
        at hudson.plugins.cloneworkspace.CloneWorkspaceSCM.checkout(CloneWorkspaceSCM.java:126)
        at hudson.model.AbstractProject.checkout(AbstractProject.java:1038)
        at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:479)
        at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:411)
        at hudson.model.Run.run(Run.java:1258)
        at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
        at hudson.model.ResourceController.execute(ResourceController.java:88)
        at hudson.model.Executor.run(Executor.java:129)

3. This Request is received by the slave's ChannelReaderThread.

Like most other Requests, it is executed in its own thread, leaving the ChannelReaderThread free for new incoming Requests.

Since this particular Request involves a stream being written to the slave, ProxyOutputStream.Chunks are sent rapidly to the slave. These chunks are written to a buffer (FastPipedOutputStream) which is consumed by the original Request (working in its own thread, remember?).

4. Now, the main problem is that the incoming chunks are written and executed on the ChannelReaderThread (not in a separate thread like usual remote Requests). When the stream buffer is full, the write waits and blocks the ChannelReaderThread on the slave:

(From the Slave's ThreadDump)

"Channel reader thread: channel" prio=10 tid=0x7750b000 nid=0x732e in Object.wait() [0x774fe000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xad849f60> (a [B)
        at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:147)
        - locked <0xad849f60> (a [B)
        at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:122)
        at hudson.remoting.ProxyOutputStream$Chunk.execute(ProxyOutputStream.java:164)
        at hudson.remoting.Channel$ReaderThread.run(Channel.java:890)

The problem is that the original Request now stumbles upon a new class not yet available on the slave (for instance hudson.util.IOUtils, which is needed to unzip a stream), and the remote class loader is invoked.

Remember that the original Request was responsible for consuming the incoming chunks. Now it is taking a break to request a new class, i.e. the slave asks the master to send the class:

(From the Slave's ThreadDump)

"pool-1-thread-1 / waiting for hudson.remoting.Channel@86fe26:channel" prio=10 tid=0x08d32c00 nid=0x7330 in Object.wait() [0x7745c000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xad8a0300> (a hudson.remoting.RemoteInvocationHandler$RPCRequest)
        at hudson.remoting.Request.call(Request.java:122)
        - locked <0xad8a0300> (a hudson.remoting.RemoteInvocationHandler$RPCRequest)
        at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:160)
        at hudson.remoting.$Proxy5.fetch2(Unknown Source)
        at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:122)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        - locked <0xadb5aa18> (a hudson.remoting.RemoteClassLoader)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        - locked <0xadb5aa18> (a hudson.remoting.RemoteClassLoader)
        at hudson.FilePath.unzip(FilePath.java:476)
        at hudson.FilePath.access$000(FilePath.java:159)
        at hudson.FilePath$2.invoke(FilePath.java:417)
        at hudson.FilePath$2.invoke(FilePath.java:415)
        at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899)
        at hudson.remoting.UserRequest.perform(UserRequest.java:114)
        at hudson.remoting.UserRequest.perform(UserRequest.java:48)
        at hudson.remoting.Request$2.run(Request.java:270)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
       
Notice the jump to the RemoteClassLoader after "at hudson.FilePath.unzip(FilePath.java:476)". This is where the new class (IOUtils) was stumbled upon.

So a request is sent to the master, asking for the missing class. The master sends it back for the slave to load.

But! As we remember, the ChannelReaderThread is already stuck, trying to write chunks to the already full stream buffer. Hence the Response containing the class we asked for (in this case IOUtils) cannot be retrieved:

(From the Slave's ThreadDump)

"Channel reader thread: channel" prio=10 tid=0x7750b000 nid=0x732e in Object.wait() [0x774fe000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xad849f60> (a [B)
        at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:147)
        - locked <0xad849f60> (a [B)
        at hudson.remoting.FastPipedOutputStream.write(FastPipedOutputStream.java:122)
        at hudson.remoting.ProxyOutputStream$Chunk.execute(ProxyOutputStream.java:164)
        at hudson.remoting.Channel$ReaderThread.run(Channel.java:890)
       
And we cannot consume from the buffer to make way for the class Response, since the thread responsible for consuming the stream is waiting for that very class. Well, you get the point.

----------
All "hanged jobs" ThreadDumps I have seen so far, points to this exact deadlock. A stuck ChannelReaderThread due to a full FastPipedOutputStream buffer - which cannot be consumed because the consuming thread is waiting for a class to be fetched.
----------
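
The clog itself is easy to reproduce outside Hudson with the standard-library piped streams (the java.io analogue of FastPipedOutputStream). A small self-contained demo, for illustration only:

        import java.io.PipedInputStream;
        import java.io.PipedOutputStream;

        // Standalone demo (not Hudson code): once the pipe buffer fills and no
        // reader drains it, the writer parks inside write() and never returns.
        public class PipeClogDemo {
            public static void main(String[] args) throws Exception {
                PipedInputStream in = new PipedInputStream(256); // small buffer
                final PipedOutputStream out = new PipedOutputStream(in);

                Thread writer = new Thread(new Runnable() {
                    public void run() {
                        try {
                            out.write(new byte[1024]); // more than the buffer holds
                            System.out.println("write completed"); // never reached
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
                writer.setDaemon(true); // let the JVM exit despite the stuck writer
                writer.start();
                writer.join(2000);

                // Nobody reads from 'in', so the writer is still parked:
                System.out.println(writer.getState()); // TIMED_WAITING
            }
        }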

This issue would of course not exist if the ProxyOutputStream commands (Chunk/EOF/Unexport/Flush) were executed on their own thread (not blocking the ChannelReaderThread).

So I did a proof-of-concept solution, adding an ExecutorService on the channel, responsible for handling the ProxyOutputStream commands in a FIFO manner (newSingleThreadExecutor()).
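
In rough outline, the proof-of-concept does something like the following. This is a sketch of the idea only; the names are illustrative, and the actual patch (posted separately) is authoritative:

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        // Sketch of the PoC idea: give the channel a dedicated single-threaded
        // executor so pipe commands run off the reader thread, but still in
        // strict FIFO order. Illustrative only; see the attached patch.
        class ChannelSketch {
            private final ExecutorService pipeWriter =
                    Executors.newSingleThreadExecutor();

            // Called by the reader thread for each incoming Chunk/EOF/Unexport/Flush.
            void executePipeCommand(Runnable command) {
                // Hand off and return immediately, so the reader thread stays
                // free to deliver class-loading Responses.
                pipeWriter.submit(command);
            }
        }

The single-threaded executor matters: it keeps the chunks ordered exactly as they arrived, while decoupling the reader thread from the pipe buffer.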

It seems to work perfectly fine. I even shrank the FastPipedInputStream's buffer from 65k to 256 bytes, to make sure it did not congest even in critical situations.

I do understand that the core developers may want to take their own approach to solving the issue, maybe refining the patch for performance or similar. But since this is a critical issue, I would really like someone to take a look at the proof-of-concept solution soon and produce a solid patch to be integrated into the core. I can also do it myself, if I can get some code reviews and a little discussion around the issue and a suitable final solution.

Best Regards
Gustaf Lundh - Team Hudson @ Sony Ericsson

RE: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

GLundh
And here is the patch.

Br
Gustaf Lundh

-----Original Message-----
From: Lundh, Gustaf [mailto:[hidden email]]
Sent: den 2 september 2010 14:04
To: [hidden email]
Subject: RE: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

> [Previous message quoted in full; snipped. See the complete analysis and proposed solution above.]


deadlockpatch.txt (4K) Download Attachment

Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

Vojtech Juranek
Hi,
any feedback from the Hudson core developers? Personally, it seems to me like a good
fix for this issue, but I would appreciate any comment on it from the core
developers, on what they think about it.
Thanks
Vojtech

> [Previous messages quoted in full; snipped.]



Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

Alan Harder-2
You might have to wait a week or so, as Kohsuke is traveling now..

Glad to see some work on this.. maybe sometime I can remove the
workaround code in copyartifact ;-)



Vojtech Juranek wrote:
> [Quoted text snipped.]



RE: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

GLundh
Hi, Alan.

Yes. He sure is out and traveling. I actually met him at the Copenhagen user meet-up last Monday :) (Btw, a big thanks to Kohsuke and Nokia for hosting a great meet-up!)

I did not have the chance to discuss the issue that much (quite a few people were there), but he mentioned that he had started some work on the piping issue himself.

If he is planning a larger and more time-demanding rework of the channel/remoting functionality, perhaps a solution like this could work until that ongoing work is ready for primetime.

Because of the severity of the issue, and with not much happening, we are currently thinking of forking Hudson internally to get the patch in place. Running internal forks is really not a great situation to be in.
 
Br
Gustaf



-----Original Message-----
From: Alan Harder [mailto:[hidden email]]
Sent: den 8 september 2010 17:26
To: [hidden email]
Subject: Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

you might have to wait a week or so, as Kohsuke is traveling now..

Glad to see some work on this.. maybe sometime I can remove the
workaround code in copyartifact ;-)



Vojtech Juranek wrote:

> Hi,
> any feedback from Hudson core developers? Personally, it seems to me as a good
> fix of this issue, but I would appreciate any comment on this from core
> developers, what they think about it.
> Thanks
> Vojtech
>
>  
>> And here is the patch.
>>
>> Br
>> Gustaf Lundh
>>
>> -----Original Message-----
>> From: Lundh, Gustaf [mailto:[hidden email]]
>> Sent: den 2 september 2010 14:04
>> To: [hidden email]
>> Subject: RE: Discussion regarding HUDSON-5977, Dead lock condition due to
>> pipe clogging. PROPOSED SOLUTION!
>>
>>    
>>> <original post by GLundh. UPDATED>
>>> Our jobs/slaves continuously locks up, making our build-cluster unstable.
>>>
>>> I see on the tracker and the mailing list that a decent number of people
>>> is running into this issue.
>>>
>>> So I'm starting this new thread, in hope to bring a new and more active
>>> discussion on how the issue should be approach and solved. And also to
>>> avoid fragmenting > the discussion over the various different issues on
>>> the tracker that touches on this behavior.
>>>
>>> I would say the issue is so critical that it needs to be brought up again
>>> and re-prioritized. Since we know it can actually can bring down full
>>> building environments.
>>>
>>> Issues and discussions of interest:
>>> http://issues.hudson-ci.org/browse/HUDSON-5977
>>> http://issues.hudson-ci.org/browse/HUDSON-5413
>>>
>>> http://hudson.361315.n4.nabble.com/Hang-using-Hudson-Clone-Workspace-SCM-
>>> plugin-td2304264.html
>>> http://hudson.361315.n4.nabble.com/Executors-hanging-after-completion-td
>>> 1597996.html
>>> http://hudson.361315.n4.nabble.com/Polling-hung-td1310838.html
>>> </original post>
>>>      
>> I have put in some time investigating the issue and also providing a
>> proof-of-concept solution to finally fix this bad bug.
>>
>> Let me present a typical case where this congestion issue appears and
>> locks a job/slave:
>>
>> 1. Job is started.
>>
>> 2. The executor thread on the master sends a Request to the slave, issuing
>> some stream operations. Example: Unzipping a saved workspace to the slave
>> using the Clone Workspace SCM plugin:


Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

Alan Harder-2
Out of curiosity, what is the use case that makes this a severe issue
for you?  Just wondering if a workaround similar to what copyartifact
does would help..

In the case of copyartifact, I forced a small operation first, to get any
classloading to happen while there is only a small amount of traffic in
the other direction.. then, when I start the potentially large operation,
no classloading happens to cause the pipe clogging..
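
Roughly, the trick looks like this (a minimal sketch only, assuming a plain remoting Channel; Noop and primeThenCopy are illustrative names, not the actual copyartifact code):

    import hudson.remoting.Callable;
    import hudson.remoting.Channel;

    class PrimingSketch {
        // A trivial callable: its only purpose is a tiny round trip that
        // triggers any needed remote classloading while the pipe is quiet.
        static final class Noop implements Callable<Void, RuntimeException> {
            private static final long serialVersionUID = 1L;
            public Void call() { return null; }
        }

        static void primeThenCopy(Channel channel) throws Exception {
            channel.call(new Noop()); // classloading happens here...
            // ...so the potentially large streamed operation that follows
            // will not interleave classloader requests with bulk traffic.
        }
    }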

    - Alan


Lundh, Gustaf wrote:

> Because of the severity of the issue, and with not much happening, we are
> currently considering forking Hudson internally to get the patch in place.
> Running an internal fork is really not a great situation to be in.


Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

Vojtech Juranek
In reply to this post by Alan Harder-2
> you might have to wait a week or so, as Kohsuke is traveling now..

thanks for the info! (in fact, I was wondering why nobody was commenting on this
issue... :-)

> Glad to see some work on this.. maybe sometime I can remove the
> workaround code in copyartifact ;-)


RE: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

GLundh
In reply to this post by Alan Harder-2
I have seen various plug-ins and pieces of functionality that trigger this issue.

For example the Emma publisher (though it has been a while since that one hung a job), SCM polling routines, trigger plug-ins, and other occasions.

Your fix works great for the CopyArtifact plugin (awesome plug-in BTW). However, since it is hard to predict which remote call will hang the job/slave next, it seems more appropriate for us to fix the underlying issue than to apply and multiply the fix all over :)

Br
Gustaf



Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

Alan Harder-2

Lundh, Gustaf wrote:
> Your fix works great for the CopyArtifact plugin (awesome plug-in BTW). However, since it is hard to predict which remote call will hang the job/slave next, it seems more appropriate for us to fix the underlying issue than to apply and multiply the fix all over :)

yup.. was just wondering if some workaround code might be easier to
maintain for a little while than a forked Hudson...

    - Alan




Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

Andrew Bayer
Is this patch attached to an issue in JIRA yet?

While it's distinctly possible that Kohsuke is working on a rewrite of aspects of the remoting logic, that's no reason not to fix a bug in the near term. I'm also curious whether this might have anything to do with, or any similarity to, the ongoing problem of extreme slowness in artifact archiving. I'm not nearly familiar enough with the remoting code to tell, but it seems like these could be related.

In any case, I'd say that if Kohsuke doesn't have an issue with the code itself when he gets back online, I'd strongly support getting this into core.

A.




Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

Paul Weiss-2
In reply to this post by Alan Harder-2
I see the same issue during the post-build artifact publishing phase.  I saw your patch for the plugin and tried a similar patch for that case.  It seems to work, but I agree with Gustaf that the underlying issue should be fixed.

I don't understand your reticence.  Who said anything about a forked Hudson?

This is a serious bug that often requires a restart of the whole farm.

-P



Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

Alan Harder-2


Paul Weiss wrote:
> I don't understand your reticence.  Who said anything about a forked Hudson?
>
> This is a serious bug that often requires a restart of the whole farm.
>  

Perhaps you read the thread too quickly.. Gustaf did, in reference to
working around the problem until this bug is fixed.  I was offering a
suggestion towards that end, nothing more.  I have no objection to the
bug being fixed.




Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

GLundh
Alan Harder-2 wrote:
> Perhaps you read the thread too quickly.. Gustaf did, in reference to
> working around the problem until this bug is fixed.  I was offering a
> suggestion towards that end, nothing more.  I have no objection to the
> bug being fixed.

Yes. This is correct.

I was merely trying to emphasize how critical this issue is for our environment. If we cannot get the issue patched upstream, we will have to fork Hudson internally at our company, and maintaining a private fork with exclusive patches is really not a preferable situation.

Kohsuke has now commented on the issue accordingly:

"I'm testing a patch similar to what's proposed by GLundh, except it does throttling to avoid unbounded growth of the queue. Plan to integrate it to the trunk soon, but I'm bit cautious as the remoting affects many things.

If there are other people willing to test the modification, I'll post the modified binary."
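
To make the throttling idea concrete, here is a minimal sketch of a window-based write throttle, assuming a simple ack protocol (the Window class is hypothetical; the real hudson.remoting.PipeWindow differs in detail):

    // Writers draw permission from the window before sending; readers
    // acknowledge consumed bytes. The writer blocks instead of letting
    // the send queue grow without bound.
    class Window {
        private int available;

        Window(int size) { available = size; }

        synchronized int get(int max) throws InterruptedException {
            while (available == 0)
                wait();                      // block the writer
            int n = Math.min(max, available);
            available -= n;
            return n;
        }

        synchronized void ack(int n) {
            available += n;
            notifyAll();                     // wake blocked writers
        }
    }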

I hope we can get this one into core really soon, and that the code is posted to SVN. I would really like to help out with testing it.

Br
Gustaf Lundh

Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

Kohsuke Kawaguchi
My plan was to cut the RC branch today and then merge back those
changes to the trunk.

I agree with the importance of the fix.




--
Kohsuke Kawaguchi



Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

Andrew Bayer
Might make sense to put the changes on a branch for now, so we can build it at ci.hudson-labs.org and get some real testing in on it before it gets to trunk - just a thought.

A.




Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

Paul Weiss-2
In reply to this post by Alan Harder-2
> Because of the severity of the issue, and with not much happening, we are
> currently considering forking Hudson internally to get the patch in place.
> Running an internal fork is really not a great situation to be in.

Why, so he did. My apologies. Although, admittedly, when I read this
initially I took it to mean "running a patched version", which I do all
the time, not actually publishing a new fork for the world to see.

In any event, I'm glad this now appears to be converging.

-P






Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

GLundh
In reply to this post by Kohsuke Kawaguchi
Kohsuke Kawaguchi wrote:
> My plan was to cut the RC branch today and then merge back those
> changes to the trunk.
>
> I agree with the importance of the fix.

Since the patch, things actually seem to have gotten worse. Archiving hangs more often than ever, and we also see hangs during other phases, sometimes in the middle of a build.

A typical set of stack dumps looks like this (note that in each case the slave-side writer threads are parked in hudson.remoting.PipeWindow$Real.get(), waiting for window space that never arrives, while the master side waits on its end of the pipe):

-----------------------------
Example 1, Hang on archiving:
-----------------------------
On master:
-----------

Channel reader thread: slave001

"Channel reader thread: slave001" Id=742 Group=main RUNNABLE (in native)
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:199)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        -  locked java.io.BufferedInputStream@6c191b43
        at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249)
        at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542)
        at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at hudson.remoting.Channel$ReaderThread.run(Channel.java:925)
       

Executor #0 for slave001 : executing Nightly_Project #18

"Executor #0 for slave001 : executing Nightly_Project #18" Id=214 Group=main TIMED_WAITING on [B@5f91f3e1
        at java.lang.Object.wait(Native Method)
        -  waiting on [B@5f91f3e1
        at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:171)
        at hudson.util.HeadBufferingStream.read(HeadBufferingStream.java:61)
        at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:221)
        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
        at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92)
        at org.apache.tools.tar.TarBuffer.readBlock(TarBuffer.java:257)
        at org.apache.tools.tar.TarBuffer.readRecord(TarBuffer.java:223)
        at hudson.org.apache.tools.tar.TarInputStream.read(TarInputStream.java:345)
        at java.io.FilterInputStream.read(FilterInputStream.java:90)
        at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1025)
        at org.apache.commons.io.IOUtils.copy(IOUtils.java:999)
        at hudson.util.IOUtils.copy(IOUtils.java:33)
        at hudson.FilePath.readFromTar(FilePath.java:1565)
        at hudson.FilePath.copyRecursiveTo(FilePath.java:1491)
        at hudson.tasks.ArtifactArchiver.perform(ArtifactArchiver.java:117)
        at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
        at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:601)
        at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:580)
        at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:558)
        at hudson.model.Build$RunnerImpl.post2(Build.java:157)
        at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
        at hudson.model.Run.run(Run.java:1303)
        at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
        at hudson.model.ResourceController.execute(ResourceController.java:88)
        at hudson.model.Executor.run(Executor.java:137)
       
Pipe writer thread: slave001

"Pipe writer thread: slave001" Id=1071 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42218847
        at sun.misc.Unsafe.park(Native Method)
        -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42218847
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)
       
stderr copier for remote agent on slave001

"stderr copier for remote agent on slave001" Id=314 Group=main RUNNABLE (in native)
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:177)
        at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)


Slave:
------

Thread Dump
Channel reader thread: channel

"Channel reader thread: channel" Id=9 Group=main RUNNABLE (in native)
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:199)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        -  locked java.io.BufferedInputStream@10fbabc
        at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249)
        at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542)
        at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at hudson.remoting.Channel$ReaderThread.run(Channel.java:925)


main

"main" Id=1 Group=main WAITING on hudson.remoting.Channel@bfea1d
        at java.lang.Object.wait(Native Method)
        -  waiting on hudson.remoting.Channel@bfea1d
        at java.lang.Object.wait(Object.java:485)
        at hudson.remoting.Channel.join(Channel.java:706)
        at hudson.remoting.Launcher.main(Launcher.java:419)
        at hudson.remoting.Launcher.runWithStdinStdout(Launcher.java:365)
        at hudson.remoting.Launcher.run(Launcher.java:205)
        at hudson.remoting.Launcher.main(Launcher.java:167)


Ping thread for channel hudson.remoting.Channel@bfea1d:channel

"Ping thread for channel hudson.remoting.Channel@bfea1d:channel" Id=10 Group=main TIMED_WAITING
        at java.lang.Thread.sleep(Native Method)
        at hudson.remoting.PingThread.run(PingThread.java:86)


Pipe writer thread: channel

"Pipe writer thread: channel" Id=12 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1d934a6
        at sun.misc.Unsafe.park(Native Method)
        -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1d934a6
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)


pool-1-thread-43

"pool-1-thread-43" Id=308 Group=main WAITING on hudson.remoting.PipeWindow$Real@192aa08
        at java.lang.Object.wait(Native Method)
        -  waiting on hudson.remoting.PipeWindow$Real@192aa08
        at java.lang.Object.wait(Object.java:485)
        at hudson.remoting.PipeWindow$Real.get(PipeWindow.java:99)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:133)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:112)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:144)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:108)
        at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:103)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        -  locked java.io.BufferedOutputStream@8bd2f5
        at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:161)
        at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:118)
        at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:72)
        -  locked java.util.zip.GZIPOutputStream@19d6ea2
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        -  locked hudson.util.io.TarArchiver$1@1beddbb
        at org.apache.tools.tar.TarBuffer.writeBlock(TarBuffer.java:410)
        at org.apache.tools.tar.TarBuffer.writeRecord(TarBuffer.java:388)
        at hudson.org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:337)
        at hudson.util.io.TarArchiver.visit(TarArchiver.java:103)
        at hudson.util.DirScanner$Glob.scan(DirScanner.java:104)
        at hudson.FilePath.writeToTar(FilePath.java:1543)
        at hudson.FilePath.access$800(FilePath.java:159)
        at hudson.FilePath$33.invoke(FilePath.java:1484)
        at hudson.FilePath$33.invoke(FilePath.java:1481)
        at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899)
        at hudson.remoting.UserRequest.perform(UserRequest.java:114)
        at hudson.remoting.UserRequest.perform(UserRequest.java:48)
        at hudson.remoting.Request$2.run(Request.java:270)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

        Number of locked synchronizers = 1
        - java.util.concurrent.locks.ReentrantLock$NonfairSync@1cf2917


pool-1-thread-59

"pool-1-thread-59" Id=496 Group=main WAITING on hudson.remoting.PipeWindow$Real@19b2abb
        at java.lang.Object.wait(Native Method)
        -  waiting on hudson.remoting.PipeWindow$Real@19b2abb
        at java.lang.Object.wait(Object.java:485)
        at hudson.remoting.PipeWindow$Real.get(PipeWindow.java:99)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:133)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:112)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:144)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:108)
        at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:103)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        -  locked java.io.BufferedOutputStream@d6bfd4
        at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:161)
        at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:118)
        at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:72)
        -  locked java.util.zip.GZIPOutputStream@2a0635
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        -  locked hudson.util.io.TarArchiver$1@19aea1a
        at org.apache.tools.tar.TarBuffer.writeBlock(TarBuffer.java:410)
        at org.apache.tools.tar.TarBuffer.writeRecord(TarBuffer.java:388)
        at hudson.org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:337)
        at hudson.util.io.TarArchiver.visit(TarArchiver.java:103)
        at hudson.util.DirScanner$Glob.scan(DirScanner.java:104)
        at hudson.FilePath.writeToTar(FilePath.java:1543)
        at hudson.FilePath.access$800(FilePath.java:159)
        at hudson.FilePath$33.invoke(FilePath.java:1484)
        at hudson.FilePath$33.invoke(FilePath.java:1481)
        at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899)
        at hudson.remoting.UserRequest.perform(UserRequest.java:114)
        at hudson.remoting.UserRequest.perform(UserRequest.java:48)
        at hudson.remoting.Request$2.run(Request.java:270)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

        Number of locked synchronizers = 1
        - java.util.concurrent.locks.ReentrantLock$NonfairSync@17f8f55


pool-1-thread-62

"pool-1-thread-62" Id=503 Group=main WAITING on hudson.remoting.PipeWindow$Real@1ec2d3b
        at java.lang.Object.wait(Native Method)
        -  waiting on hudson.remoting.PipeWindow$Real@1ec2d3b
        at java.lang.Object.wait(Object.java:485)
        at hudson.remoting.PipeWindow$Real.get(PipeWindow.java:99)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:133)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:112)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:144)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:112)
        at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:103)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        -  locked java.io.BufferedOutputStream@b3e1b4
        at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:161)
        at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:118)
        at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:72)
        -  locked java.util.zip.GZIPOutputStream@1eebddc
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        -  locked hudson.util.io.TarArchiver$1@c51653
        at org.apache.tools.tar.TarBuffer.writeBlock(TarBuffer.java:410)
        at org.apache.tools.tar.TarBuffer.writeRecord(TarBuffer.java:388)
        at hudson.org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:337)
        at hudson.util.io.TarArchiver.visit(TarArchiver.java:103)
        at hudson.util.DirScanner$Glob.scan(DirScanner.java:104)
        at hudson.FilePath.writeToTar(FilePath.java:1543)
        at hudson.FilePath.access$800(FilePath.java:159)
        at hudson.FilePath$33.invoke(FilePath.java:1484)
        at hudson.FilePath$33.invoke(FilePath.java:1481)
        at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899)
        at hudson.remoting.UserRequest.perform(UserRequest.java:114)
        at hudson.remoting.UserRequest.perform(UserRequest.java:48)
        at hudson.remoting.Request$2.run(Request.java:270)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

        Number of locked synchronizers = 1
        - java.util.concurrent.locks.ReentrantLock$NonfairSync@13caab8


pool-1-thread-73

"pool-1-thread-73" Id=518 Group=main RUNNABLE
        at sun.management.ThreadImpl.dumpThreads0(Native Method)
        at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:374)
        at hudson.Functions.getThreadInfos(Functions.java:775)
        at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:75)
        at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:71)
        at hudson.remoting.UserRequest.perform(UserRequest.java:114)
        at hudson.remoting.UserRequest.perform(UserRequest.java:48)
        at hudson.remoting.Request$2.run(Request.java:270)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

        Number of locked synchronizers = 1
        - java.util.concurrent.locks.ReentrantLock$NonfairSync@d941af


Finalizer

"Finalizer" Id=3 Group=system WAITING on java.lang.ref.ReferenceQueue$Lock@1ac9043
        at java.lang.Object.wait(Native Method)
        -  waiting on java.lang.ref.ReferenceQueue$Lock@1ac9043
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)


Reference Handler

"Reference Handler" Id=2 Group=system WAITING on java.lang.ref.Reference$Lock@332be0
        at java.lang.Object.wait(Native Method)
        -  waiting on java.lang.ref.Reference$Lock@332be0
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)


Signal Dispatcher

"Signal Dispatcher" Id=4 Group=system RUNNABLE



------------------------------------
Example 2, Hang on Emma publisher:
------------------------------------
Master:
--------

Channel reader thread: slave002

"Channel reader thread: slave002" Id=848 Group=main RUNNABLE (in native)
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:199)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        -  locked java.io.BufferedInputStream@20ec980d
        at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249)
        at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542)
        at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at hudson.remoting.Channel$ReaderThread.run(Channel.java:925)
       
       
Executor #0 for slave002 : executing Nightly_Project_2 #19 / waiting for hudson.remoting.Channel@1291a472:slave002

"Executor #0 for slave002 : executing Nightly_Project_2 #19 / waiting for hudson.remoting.Channel@1291a472:slave002" Id=381 Group=main TIMED_WAITING on hudson.remoting.UserRequest@60459d77
        at java.lang.Object.wait(Native Method)
        -  waiting on hudson.remoting.UserRequest@60459d77
        at hudson.remoting.Request.call(Request.java:122)
        at hudson.remoting.Channel.call(Channel.java:607)
        at hudson.FilePath.act(FilePath.java:742)
        at hudson.FilePath.act(FilePath.java:735)
        at hudson.FilePath.copyTo(FilePath.java:1360)
        at hudson.FilePath.copyTo(FilePath.java:1335)
        at hudson.plugins.emma.EmmaPublisher.saveCoverageReports(EmmaPublisher.java:91)
        at hudson.plugins.emma.EmmaPublisher.perform(EmmaPublisher.java:123)
        at hudson.tasks.BuildStepMonitor$3.perform(BuildStepMonitor.java:36)
        at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:601)
        at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:580)
        at hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:558)
        at hudson.model.Build$RunnerImpl.post2(Build.java:157)
        at hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:528)
        at hudson.model.Run.run(Run.java:1303)
        at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
        at hudson.model.ResourceController.execute(ResourceController.java:88)
        at hudson.model.Executor.run(Executor.java:137)
       
       
Pipe writer thread: slave002

"Pipe writer thread: slave002" Id=971 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6577a5f4
        at sun.misc.Unsafe.park(Native Method)
        -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6577a5f4
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)


stderr copier for remote agent on slave002

"stderr copier for remote agent on slave002" Id=414 Group=main RUNNABLE (in native)
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:177)
        at hudson.util.StreamCopyThread.run(StreamCopyThread.java:60)


Slave:
------


Thread Dump
/bin/sh -xe /tmp/hudson5168499638544882529.sh: stdout copier

"/bin/sh -xe /tmp/hudson5168499638544882529.sh: stdout copier" Id=325 Group=main WAITING on hudson.remoting.PipeWindow$Real@1f61ce1
        at java.lang.Object.wait(Native Method)
        -  waiting on hudson.remoting.PipeWindow$Real@1f61ce1
        at java.lang.Object.wait(Object.java:485)
        at hudson.remoting.PipeWindow$Real.get(PipeWindow.java:99)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:133)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:112)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:144)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:112)
        at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:103)
        at hudson.remoting.RemoteOutputStream.write(RemoteOutputStream.java:110)
        at hudson.util.StreamCopyThread.run(StreamCopyThread.java:61)


Channel reader thread: channel

"Channel reader thread: channel" Id=9 Group=main RUNNABLE (in native)
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:199)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        -  locked java.io.BufferedInputStream@507660
        at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249)
        at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542)
        at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at hudson.remoting.Channel$ReaderThread.run(Channel.java:925)


main

"main" Id=1 Group=main WAITING on hudson.remoting.Channel@bfea1d
        at java.lang.Object.wait(Native Method)
        -  waiting on hudson.remoting.Channel@bfea1d
        at java.lang.Object.wait(Object.java:485)
        at hudson.remoting.Channel.join(Channel.java:706)
        at hudson.remoting.Launcher.main(Launcher.java:419)
        at hudson.remoting.Launcher.runWithStdinStdout(Launcher.java:365)
        at hudson.remoting.Launcher.run(Launcher.java:205)
        at hudson.remoting.Launcher.main(Launcher.java:167)


Ping thread for channel hudson.remoting.Channel@bfea1d:channel

"Ping thread for channel hudson.remoting.Channel@bfea1d:channel" Id=10 Group=main TIMED_WAITING
        at java.lang.Thread.sleep(Native Method)
        at hudson.remoting.PingThread.run(PingThread.java:86)


Pipe writer thread: channel

"Pipe writer thread: channel" Id=12 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@e20b3f
        at sun.misc.Unsafe.park(Native Method)
        -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@e20b3f
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)


pool-1-thread-134

"pool-1-thread-134" Id=837 Group=main RUNNABLE
        at sun.management.ThreadImpl.dumpThreads0(Native Method)
        at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:374)
        at hudson.Functions.getThreadInfos(Functions.java:775)
        at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:75)
        at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:71)
        at hudson.remoting.UserRequest.perform(UserRequest.java:114)
        at hudson.remoting.UserRequest.perform(UserRequest.java:48)
        at hudson.remoting.Request$2.run(Request.java:270)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

        Number of locked synchronizers = 1
        - java.util.concurrent.locks.ReentrantLock$NonfairSync@1481207


pool-1-thread-83

"pool-1-thread-83" Id=548 Group=main WAITING on hudson.remoting.PipeWindow$Real@fd8c0e
        at java.lang.Object.wait(Native Method)
        -  waiting on hudson.remoting.PipeWindow$Real@fd8c0e
        at java.lang.Object.wait(Object.java:485)
        at hudson.remoting.PipeWindow$Real.get(PipeWindow.java:99)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:133)
        at hudson.remoting.ProxyOutputStream._write(ProxyOutputStream.java:108)
        at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:103)
        at hudson.remoting.RemoteOutputStream.write(RemoteOutputStream.java:110)
        at hudson.Util.copyStream(Util.java:418)
        at hudson.FilePath$30.invoke(FilePath.java:1365)
        at hudson.FilePath$30.invoke(FilePath.java:1360)
        at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1899)
        at hudson.remoting.UserRequest.perform(UserRequest.java:114)
        at hudson.remoting.UserRequest.perform(UserRequest.java:48)
        at hudson.remoting.Request$2.run(Request.java:270)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

        Number of locked synchronizers = 1
        - java.util.concurrent.locks.ReentrantLock$NonfairSync@1e368ec


Finalizer

"Finalizer" Id=3 Group=system WAITING on java.lang.ref.ReferenceQueue$Lock@8bb582
        at java.lang.Object.wait(Native Method)
        -  waiting on java.lang.ref.ReferenceQueue$Lock@8bb582
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)


Reference Handler

"Reference Handler" Id=2 Group=system WAITING on java.lang.ref.Reference$Lock@397fa0
        at java.lang.Object.wait(Native Method)
        -  waiting on java.lang.ref.Reference$Lock@397fa0
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)


Signal Dispatcher

"Signal Dispatcher" Id=4 Group=system RUNNABLE


Best Regards
Gustaf Lundh

Re: Discussion regarding HUDSON-5977, Dead lock condition due to pipe clogging. PROPOSED SOLUTION!

Paul Weiss-2
I see similar failures.  Multiple hangs, mostly in archiving, but  
sometimes it appears to hang when writing to stdout.
-P

On Tue, 21 Sep 2010 04:53:03 -0400, GLundh <[hidden email]>  
wrote:

> Since the patch, things actually seem to have gotten worse. Archiving hangs
> more often than ever, and we also see hangs during other phases, sometimes
> in the middle of a build.
