[JIRA] Created: (HUDSON-8370) Job is getting stuck before it can even start building

Kohsuke Kawaguchi
Administrator
Job is getting stuck before it can even start building
-------------------------------------------------------

                 Key: HUDSON-8370
                 URL: http://issues.hudson-ci.org/browse/HUDSON-8370
             Project: Hudson
          Issue Type: Bug
          Components: master-slave
    Affects Versions: current
         Environment: Master on Windows 2003 Server. Slaves on various Windows versions connected by SSH running on cygwin.
            Reporter: geoffrey_crandall


Opening a bug case regarding previous discussion: http://hudson.361315.n4.nabble.com/Job-is-getting-stuck-before-it-can-even-start-building-td3053144.html

After upgrading to version 1.386 the issue appears less often, but I caught it again and dug into the master's thread dump.

The stuck job is the parent of a multi-configuration job, although it is usually the child jobs that get stuck. In this case the child jobs have all completed, but the parent is stuck with this thread dump:

"Executor #-1 for slave1 : executing NightlyTest #522" Id=3185 Group=main BLOCKED on hudson.remoting.Channel@8fb08f owned by "Monitoring thread for Response Time started on Wed Dec 29 02:30:59 EET 2010" Id=3078
        at hudson.remoting.Request.call(Request.java:100)
        -  blocked on hudson.remoting.Channel@8fb08f
        at hudson.remoting.Channel.call(Channel.java:629)
        at hudson.FilePath.act(FilePath.java:742)
        at hudson.FilePath.act(FilePath.java:735)
        at hudson.FilePath.mkdirs(FilePath.java:801)
        at hudson.model.AbstractProject.checkout(AbstractProject.java:1117)
        at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:480)
        at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:412)
        at hudson.model.Run.run(Run.java:1325)
        at hudson.matrix.MatrixBuild.run(MatrixBuild.java:152)
        at hudson.model.ResourceController.execute(ResourceController.java:88)
        at hudson.model.Executor.run(Executor.java:139)
        at hudson.model.OneOffExecutor.run(OneOffExecutor.java:61)

That executor thread is apparently waiting for the thread that owns the Channel lock:

"Monitoring thread for Response Time started on Wed Dec 29 02:30:59 EET 2010" Id=3078 Group=main RUNNABLE (in native)
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(Unknown Source)
        at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
        at java.io.BufferedOutputStream.flush(Unknown Source)
        -  locked java.io.BufferedOutputStream@15a57c6
        at java.io.FilterOutputStream.flush(Unknown Source)
        at hudson.remoting.BinarySafeStream$2.flush(BinarySafeStream.java:305)
        at java.io.ObjectOutputStream$BlockDataOutputStream.flush(Unknown Source)
        at java.io.ObjectOutputStream.flush(Unknown Source)
        at hudson.remoting.Channel.send(Channel.java:472)
        -  locked hudson.remoting.Channel@8fb08f
        at hudson.remoting.Request.callAsync(Request.java:170)
        at hudson.remoting.Channel.callAsync(Channel.java:656)
        at hudson.node_monitors.ResponseTimeMonitor$1.monitor(ResponseTimeMonitor.java:58)
        at hudson.node_monitors.ResponseTimeMonitor$1.monitor(ResponseTimeMonitor.java:52)
        at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:200)

I have no idea why that thread is stuck, but the monitor relaunches every hour, and each new thread gets blocked on the same lock:

"Monitoring thread for Response Time started on Wed Dec 29 03:30:59 EET 2010" Id=3363 Group=main BLOCKED on hudson.remoting.Channel@8fb08f owned by "Monitoring thread for Response Time started on Wed Dec 29 02:30:59 EET 2010" Id=3078
        at hudson.remoting.Channel.send(Channel.java:465)
        -  blocked on hudson.remoting.Channel@8fb08f
        at hudson.remoting.Request.callAsync(Request.java:170)
        at hudson.remoting.Channel.callAsync(Channel.java:656)
        at hudson.node_monitors.ResponseTimeMonitor$1.monitor(ResponseTimeMonitor.java:58)
        at hudson.node_monitors.ResponseTimeMonitor$1.monitor(ResponseTimeMonitor.java:52)
        at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:200)
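
The pattern in these dumps (one thread holding the Channel monitor while it sits in a write that never returns, and every later caller going BLOCKED on that monitor) can be reproduced in isolation. The following is a minimal sketch, not Hudson code: the `channel` object stands in for the shared `hudson.remoting.Channel` instance, the latch stands in for the native write that does not complete, and the class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;

public class ChannelLockDemo {
    // Hypothetical stand-in for the shared hudson.remoting.Channel instance.
    private static final Object channel = new Object();

    /**
     * Starts a "holder" thread that keeps the monitor while it waits on a latch
     * (simulating a write that never returns), then a "waiter" thread that tries
     * to take the same monitor. Returns the waiter's state once it is stuck.
     */
    public static Thread.State demo() throws InterruptedException {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (channel) {
                held.countDown();
                try {
                    // Waiting here does NOT release the monitor on `channel`.
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");
        holder.start();
        held.await(); // make sure the holder owns the monitor first

        Thread waiter = new Thread(() -> {
            synchronized (channel) {
                // Only reached after the holder leaves its synchronized block.
            }
        }, "waiter");
        waiter.start();

        // Poll (for at most 5 seconds) until the waiter is parked on the monitor.
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (waiter.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = waiter.getState();

        // Let the holder finish so both threads can exit cleanly.
        release.countDown();
        holder.join();
        waiter.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("waiter state while holder is stuck: " + demo());
    }
}
```

In the sketch the waiter recovers as soon as the holder is released. In the dumps above nothing releases the holder, which would explain why every new monitoring thread and HTTP request piles up behind it.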

In the thread dump I can see hundreds of threads blocked on the same hudson.remoting.Channel@8fb08f. For example:

Monitoring thread for Architecture started on Wed Dec 29 02:30:59 EET 2010
Monitoring thread for Clock Difference started on Wed Dec 29 02:30:59 EET 2010
Monitoring thread for Free Disk Space started on Wed Dec 29 02:30:59 EET 2010
Monitoring thread for Free Swap Space started on Wed Dec 29 02:30:59 EET 2010
Monitoring thread for Free Temp Space started on Wed Dec 29 02:30:59 EET 2010

And they all relaunch every hour and get blocked. If I try to get the system information for the slave with the stuck job, the request gets blocked like all the other threads:

"Handling GET /hudson/computer/slave1/systemInfo : http-80-8" Id=299 Group=main BLOCKED on hudson.remoting.Channel@8fb08f owned by "Monitoring thread for Response Time started on Wed Dec 29 02:30:59 EET 2010" Id=3078
        at hudson.remoting.Request.call(Request.java:100)
        -  blocked on hudson.remoting.Channel@8fb08f
        at hudson.remoting.Channel.call(Channel.java:629)
        at hudson.util.RemotingDiagnostics.getSystemProperties(RemotingDiagnostics.java:55)
        at hudson.model.Computer.getSystemProperties(Computer.java:730)
        ...

As these new threads launch and get blocked, they pile up, and the only fix I have found is to restart Hudson. Is there any way I could investigate further why the Response Time thread gets stuck and blocks everything else?
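
One way to narrow this down without reading the whole dump by hand is to group the blocked threads by the thread that owns the lock they are waiting for. The sketch below uses only the standard `java.lang.management` API; the class and method names are hypothetical, and it assumes a Java 8 or later runtime. It only sees threads of the JVM it runs in, so to inspect the Hudson master it would have to run inside the master's JVM or be adapted to a remote JMX connection.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BlockedThreadReport {
    /**
     * Returns a map from lock-owner thread name to the names of the threads
     * currently BLOCKED on a monitor that owner holds.
     */
    public static Map<String, List<String>> blockedByOwner() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, List<String>> result = new TreeMap<>();
        // false, false: skip locked-monitor and synchronizer details, which
        // are not needed to find out who owns the contended lock.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // thread ended while the dump was being taken
            }
            String owner = info.getLockOwnerName();
            if (info.getThreadState() == Thread.State.BLOCKED && owner != null) {
                result.computeIfAbsent(owner, k -> new ArrayList<>())
                      .add(info.getThreadName());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        blockedByOwner().forEach((owner, waiters) ->
            System.out.println(owner + " blocks " + waiters.size()
                + " thread(s): " + waiters));
    }
}
```

Applied to the situation described here, such a report would be expected to show a single owner (the 02:30:59 Response Time monitoring thread) with all the other threads listed under it, which points the investigation at that one thread's native write rather than at the blocked threads.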
