Quantcast

Patch to fix concurrent build problem

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
20 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Patch to fix concurrent build problem

Paul Weiss-2
Without this patch, it is possible for jobs to be run concurrently even if  
they don't have the "Execute concurrent builds if necessary" box checked.

You will only run into this problem if the job in question is a  
parameterized build, and there are several versions of the job queued with  
different parameters.

The problem is that each executor chooses the next job to run  
independently, and there is a race condition:  more than one executor can  
choose a job of the same project to run -- as long as it hasn't actually  
started the job another executor can choose a job of the same type.  This  
patch closes the gap.

I have been running this for a while.  I only noticed it again when I  
started running a released version rather than my patched one, and this  
problem resurfaced.

Please have a quick look before I check in.

The idea is that when we pop an item off the Queue we record it (with a  
timestamp) in a hash table.  We use this hash table to avoid popping  
another of the same type.  When an executor then starts the job it removes  
it from the hash table.



-P
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

diffs.txt (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Alan Harder-2
seems kind of bandaid-ish.. perhaps Kohsuke will suggest an alternate
refactor that closes the gap w/o introducing another data structure.

if this popped Map is used, here are my comments:

- can't the "popped.remove" call in taskIsPopped generate a
ConcurrentModificationException?  perhaps the loop should use an
Iterator and call Iterator.remove() to remove an element.

- taskIsPopped may modify the map but is not synchronized.
  (could use a ConcurrentHashMap and then neither this method nor
completePop would need synchronization?)

- just to match the style of existing Hudson code, maybe remove the
spaces before () in method definitions

Thanks,

    - Alan

FYI, see:
http://issues.hudson-ci.org/browse/HUDSON-5653
http://issues.hudson-ci.org/browse/HUDSON-5229



Paul G. Weiss wrote:

> Without this patch, it is possible for jobs to be run concurrently
> even if they don't have the "Execute concurrent builds if necessary"
> box checked.
>
> You will only run into this problem if the job in question is a
> parameterized build, and there are several versions of the job queued
> with different parameters.
>
> The problem is that each executor chooses the next job to run
> independently, and there is a race condition:  more than one executor
> can choose a job of the same project to run -- as long as it hasn't
> actually started the job another executor can choose a job of the same
> type.  This patch closes the gap.
>
> I have been running this for a while.  I only noticed it again when I
> started running a released version rather than my patched one, and
> this problem resurfaced.
>
> Please have a quick look before I check in.
>
> The idea is that when we pop an item off the Queue we record it (with
> a timestamp) in a hash table.  We use this hash table to avoid popping
> another of the same type.  When an executor then starts the job it
> removes it from the hash table.
>
>
>
> -P
> ------------------------------------------------------------------------

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Paul Weiss-2
Alan,

Thanks for commenting.  My responses are interspersed below.

-P

On Fri, 28 May 2010 16:52:00 -0400, Alan Harder <[hidden email]>
wrote:

> seems kind of bandaid-ish.. perhaps Kohsuke will suggest an alternate  
> refactor that closes the gap w/o introducing another data structure.
>

No argument there.  I would certainly welcome a better solution.  This is  
the best that I came
up with w/o a lot of restructuring, and it solved the problem.

> if this popped Map is used, here are my comments:
>
> - can't the "popped.remove" call in taskIsPopped generate a  
> ConcurrentModificationException?  perhaps the loop should use an  
> Iterator and call Iterator.remove() to remove an element.
>
> - taskIsPopped may modify the map but is not synchronized.
>   (could use a ConcurrentHashMap and then neither this method nor  
> completePop would need synchronization?)
>

The taskIsPopped method is only called from within allowNewBuildableTask,  
which is only called from with maintain, which is synchronized.  And  
completePop, which does the "popped.remove" is also synchronized.  The  
third use of "popped", is in Queue.pop, but in a synchronized section.  
Furthermore, none of these methods occurs in the dynamic scope of the  
others, so I don't see how a ConcurrentModificationException can occur.

That being said, I suppose there is no harm is adding synchronization to  
taskIsPopped, just in case it is called from another place some day.

> - just to match the style of existing Hudson code, maybe remove the  
> spaces before () in method definitions
>

Done.  I'll commit these changes after I build and test, unless you have  
any other suggestions.


> Thanks,
>
>     - Alan
>
> FYI, see:
> http://issues.hudson-ci.org/browse/HUDSON-5653
> http://issues.hudson-ci.org/browse/HUDSON-5229
>
>
>
> Paul G. Weiss wrote:
>> Without this patch, it is possible for jobs to be run concurrently even  
>> if they don't have the "Execute concurrent builds if necessary" box  
>> checked.
>>
>> You will only run into this problem if the job in question is a  
>> parameterized build, and there are several versions of the job queued  
>> with different parameters.
>>
>> The problem is that each executor chooses the next job to run  
>> independently, and there is a race condition:  more than one executor  
>> can choose a job of the same project to run -- as long as it hasn't  
>> actually started the job another executor can choose a job of the same  
>> type.  This patch closes the gap.
>>
>> I have been running this for a while.  I only noticed it again when I  
>> started running a released version rather than my patched one, and this  
>> problem resurfaced.
>>
>> Please have a quick look before I check in.
>>
>> The idea is that when we pop an item off the Queue we record it (with a  
>> timestamp) in a hash table.  We use this hash table to avoid popping  
>> another of the same type.  When an executor then starts the job it  
>> removes it from the hash table.
>>
>>
>>
>> -P
>> ------------------------------------------------------------------------
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Kohsuke Kawaguchi-3
In reply to this post by Alan Harder-2

If I understand the root cause correctly, the problem is that the
following two events do not happen atomically from the queue scheduling
perspective:

   (1) the removal of a task from a queue
   (2) creation of new build that puts the executor into the building
       state

Given that, I think a better fix is to expand the scope of the lock on
Queue so that the calling executor would lock Queue for the entire
duration of (1) and (2) --- namely from grabJob() to
task.createExecutable().

This requires a bit of re-ordering of notifications and so on, but I
think it's all quite feasible.

I believe extending this atomicity is also necessary to fix the
locks&latch behavior.

I'll test and commit the change once 1.361 ships. If Paul could also
verify the fix that would be greatly appreciated.


On 05/28/2010 01:52 PM, Alan Harder wrote:

> seems kind of bandaid-ish.. perhaps Kohsuke will suggest an alternate
> refactor that closes the gap w/o introducing another data structure.
>
> if this popped Map is used, here are my comments:
>
> - can't the "popped.remove" call in taskIsPopped generate a
> ConcurrentModificationException?  perhaps the loop should use an
> Iterator and call Iterator.remove() to remove an element.
>
> - taskIsPopped may modify the map but is not synchronized.
>    (could use a ConcurrentHashMap and then neither this method nor
> completePop would need synchronization?)
>
> - just to match the style of existing Hudson code, maybe remove the
> spaces before () in method definitions
>
> Thanks,
>
>      - Alan
>
> FYI, see:
> http://issues.hudson-ci.org/browse/HUDSON-5653
> http://issues.hudson-ci.org/browse/HUDSON-5229
>
>
>
> Paul G. Weiss wrote:
>>  Without this patch, it is possible for jobs to be run concurrently
>>  even if they don't have the "Execute concurrent builds if necessary"
>>  box checked.
>>
>>  You will only run into this problem if the job in question is a
>>  parameterized build, and there are several versions of the job queued
>>  with different parameters.
>>
>>  The problem is that each executor chooses the next job to run
>>  independently, and there is a race condition:  more than one executor
>>  can choose a job of the same project to run -- as long as it hasn't
>>  actually started the job another executor can choose a job of the same
>>  type.  This patch closes the gap.
>>
>>  I have been running this for a while.  I only noticed it again when I
>>  started running a released version rather than my patched one, and
>>  this problem resurfaced.
>>
>>  Please have a quick look before I check in.
>>
>>  The idea is that when we pop an item off the Queue we record it (with
>>  a timestamp) in a hash table.  We use this hash table to avoid popping
>>  another of the same type.  When an executor then starts the job it
>>  removes it from the hash table.
>>
>>
>>
>>  -P
>>  ------------------------------------------------------------------------
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Kohsuke Kawaguchi | InfraDNA, Inc. | http://infradna.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Paul Weiss-2
On Sat, 05 Jun 2010 02:17:16 -0400, Kohsuke Kawaguchi  
<[hidden email]> wrote:

>
> If I understand the root cause correctly, the problem is that the  
> following two events do not happen atomically from the queue scheduling  
> perspective:
>
>    (1) the removal of a task from a queue
>    (2) creation of new build that puts the executor into the building
>        state
>
> Given that, I think a better fix is to expand the scope of the lock on  
> Queue so that the calling executor would lock Queue for the entire  
> duration of (1) and (2) --- namely from grabJob() to  
> task.createExecutable().
>
> This requires a bit of re-ordering of notifications and so on, but I  
> think it's all quite feasible.
>
> I believe extending this atomicity is also necessary to fix the  
> locks&latch behavior.
>
> I'll test and commit the change once 1.361 ships. If Paul could also  
> verify the fix that would be greatly appreciated.
>
>

Thanks, Kohsuke.  Will do.
-P


> On 05/28/2010 01:52 PM, Alan Harder wrote:
>> seems kind of bandaid-ish.. perhaps Kohsuke will suggest an alternate
>> refactor that closes the gap w/o introducing another data structure.
>>
>> if this popped Map is used, here are my comments:
>>
>> - can't the "popped.remove" call in taskIsPopped generate a
>> ConcurrentModificationException?  perhaps the loop should use an
>> Iterator and call Iterator.remove() to remove an element.
>>
>> - taskIsPopped may modify the map but is not synchronized.
>>    (could use a ConcurrentHashMap and then neither this method nor
>> completePop would need synchronization?)
>>
>> - just to match the style of existing Hudson code, maybe remove the
>> spaces before () in method definitions
>>
>> Thanks,
>>
>>      - Alan
>>
>> FYI, see:
>> http://issues.hudson-ci.org/browse/HUDSON-5653
>> http://issues.hudson-ci.org/browse/HUDSON-5229
>>
>>
>>
>> Paul G. Weiss wrote:
>>>  Without this patch, it is possible for jobs to be run concurrently
>>>  even if they don't have the "Execute concurrent builds if necessary"
>>>  box checked.
>>>
>>>  You will only run into this problem if the job in question is a
>>>  parameterized build, and there are several versions of the job queued
>>>  with different parameters.
>>>
>>>  The problem is that each executor chooses the next job to run
>>>  independently, and there is a race condition:  more than one executor
>>>  can choose a job of the same project to run -- as long as it hasn't
>>>  actually started the job another executor can choose a job of the same
>>>  type.  This patch closes the gap.
>>>
>>>  I have been running this for a while.  I only noticed it again when I
>>>  started running a released version rather than my patched one, and
>>>  this problem resurfaced.
>>>
>>>  Please have a quick look before I check in.
>>>
>>>  The idea is that when we pop an item off the Queue we record it (with
>>>  a timestamp) in a hash table.  We use this hash table to avoid popping
>>>  another of the same type.  When an executor then starts the job it
>>>  removes it from the hash table.
>>>
>>>
>>>
>>>  -P
>>>  ------------------------------------------------------------------------
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Alan Harder-2
In reply to this post by Kohsuke Kawaguchi-3
Hopefully the refactor will also address the problem mentioned here:
http://issues.hudson-ci.org/browse/HUDSON-5125?focusedCommentId=139223&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_139223



Kohsuke Kawaguchi wrote:

>
> If I understand the root cause correctly, the problem is that the
> following two events do not happen atomically from the queue
> scheduling perspective:
>
>   (1) the removal of a task from a queue
>   (2) creation of new build that puts the executor into the building
>       state
>
> Given that, I think a better fix is to expand the scope of the lock on
> Queue so that the calling executor would lock Queue for the entire
> duration of (1) and (2) --- namely from grabJob() to
> task.createExecutable().
>
> This requires a bit of re-ordering of notifications and so on, but I
> think it's all quite feasible.
>
> I believe extending this atomicity is also necessary to fix the
> locks&latch behavior.
>
> I'll test and commit the change once 1.361 ships. If Paul could also
> verify the fix that would be greatly appreciated.
>
>
> On 05/28/2010 01:52 PM, Alan Harder wrote:
>> seems kind of bandaid-ish.. perhaps Kohsuke will suggest an alternate
>> refactor that closes the gap w/o introducing another data structure.
>>
>> if this popped Map is used, here are my comments:
>>
>> - can't the "popped.remove" call in taskIsPopped generate a
>> ConcurrentModificationException?  perhaps the loop should use an
>> Iterator and call Iterator.remove() to remove an element.
>>
>> - taskIsPopped may modify the map but is not synchronized.
>>    (could use a ConcurrentHashMap and then neither this method nor
>> completePop would need synchronization?)
>>
>> - just to match the style of existing Hudson code, maybe remove the
>> spaces before () in method definitions
>>
>> Thanks,
>>
>>      - Alan
>>
>> FYI, see:
>> http://issues.hudson-ci.org/browse/HUDSON-5653
>> http://issues.hudson-ci.org/browse/HUDSON-5229
>>
>>
>>
>> Paul G. Weiss wrote:
>>>  Without this patch, it is possible for jobs to be run concurrently
>>>  even if they don't have the "Execute concurrent builds if necessary"
>>>  box checked.
>>>
>>>  You will only run into this problem if the job in question is a
>>>  parameterized build, and there are several versions of the job queued
>>>  with different parameters.
>>>
>>>  The problem is that each executor chooses the next job to run
>>>  independently, and there is a race condition:  more than one executor
>>>  can choose a job of the same project to run -- as long as it hasn't
>>>  actually started the job another executor can choose a job of the same
>>>  type.  This patch closes the gap.
>>>
>>>  I have been running this for a while.  I only noticed it again when I
>>>  started running a released version rather than my patched one, and
>>>  this problem resurfaced.
>>>
>>>  Please have a quick look before I check in.
>>>
>>>  The idea is that when we pop an item off the Queue we record it (with
>>>  a timestamp) in a hash table.  We use this hash table to avoid popping
>>>  another of the same type.  When an executor then starts the job it
>>>  removes it from the hash table.
>>>
>>>
>>>
>>>  -P
>>>  ------------------------------------------------------------------------
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

drazi
Hi,

Alan Harder-2 wrote:

> Hopefully the refactor will also address the problem mentioned here:
> http://issues.hudson-ci.org/browse/HUDSON-5125?focusedCommentId=139223&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_139223

I'm the one who wrote that comment. It seems I've been duplicating effort here - I managed to completely miss this discussion.

Kohsuke Kawaguchi wrote:
>
> If I understand the root cause correctly, the problem is that the
> following two events do not happen atomically from the queue
> scheduling perspective:
>
>   (1) the removal of a task from a queue
>   (2) creation of new build that puts the executor into the building
>       state
>
> Given that, I think a better fix is to expand the scope of the lock on
> Queue so that the calling executor would lock Queue for the entire
> duration of (1) and (2) --- namely from grabJob() to
> task.createExecutable().
>
> This requires a bit of re-ordering of notifications and so on, but I
> think it's all quite feasible.
>

This all sounds good. As long as jobs move out of the queue and into building state atomically, all should be well.

That is only half the story though. As I described on the HUDSON-5125 ticket, I found there's a similar problem when a build finishes. The fix I suggested is simple but probably not ideal. I guess the better way would be to make the triggering of downstream builds happen before the build goes into postprocessing state.

I'm keen to see this issue resolved so I'd be happy to help with any testing etc.

Cheers

Stu

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Kohsuke Kawaguchi-3
In reply to this post by Paul Weiss-2

I pushed this change to the trunk as rev.31906. This was done after the
RC branch creation to maximize the soak time, so this change is
scheduled for 1.363 currently.

All the tests pass with this change, so there's no obvious problem. I
hope people have a chance to try it out.


On 06/06/2010 05:09 AM, Paul G. Weiss wrote:

> On Sat, 05 Jun 2010 02:17:16 -0400, Kohsuke Kawaguchi
> <[hidden email]>  wrote:
>
>>
>>  If I understand the root cause correctly, the problem is that the
>>  following two events do not happen atomically from the queue scheduling
>>  perspective:
>>
>>     (1) the removal of a task from a queue
>>     (2) creation of new build that puts the executor into the building
>>         state
>>
>>  Given that, I think a better fix is to expand the scope of the lock on
>>  Queue so that the calling executor would lock Queue for the entire
>>  duration of (1) and (2) --- namely from grabJob() to
>>  task.createExecutable().
>>
>>  This requires a bit of re-ordering of notifications and so on, but I
>>  think it's all quite feasible.
>>
>>  I believe extending this atomicity is also necessary to fix the
>>  locks&latch behavior.
>>
>>  I'll test and commit the change once 1.361 ships. If Paul could also
>>  verify the fix that would be greatly appreciated.
>>
>>
>
> Thanks, Kohsuke.  Will do.
> -P
>
>
>>  On 05/28/2010 01:52 PM, Alan Harder wrote:
>>>  seems kind of bandaid-ish.. perhaps Kohsuke will suggest an alternate
>>>  refactor that closes the gap w/o introducing another data structure.
>>>
>>>  if this popped Map is used, here are my comments:
>>>
>>>  - can't the "popped.remove" call in taskIsPopped generate a
>>>  ConcurrentModificationException?  perhaps the loop should use an
>>>  Iterator and call Iterator.remove() to remove an element.
>>>
>>>  - taskIsPopped may modify the map but is not synchronized.
>>>     (could use a ConcurrentHashMap and then neither this method nor
>>>  completePop would need synchronization?)
>>>
>>>  - just to match the style of existing Hudson code, maybe remove the
>>>  spaces before () in method definitions
>>>
>>>  Thanks,
>>>
>>>       - Alan
>>>
>>>  FYI, see:
>>>  http://issues.hudson-ci.org/browse/HUDSON-5653
>>>  http://issues.hudson-ci.org/browse/HUDSON-5229
>>>
>>>
>>>
>>>  Paul G. Weiss wrote:
>>>>   Without this patch, it is possible for jobs to be run concurrently
>>>>   even if they don't have the "Execute concurrent builds if necessary"
>>>>   box checked.
>>>>
>>>>   You will only run into this problem if the job in question is a
>>>>   parameterized build, and there are several versions of the job queued
>>>>   with different parameters.
>>>>
>>>>   The problem is that each executor chooses the next job to run
>>>>   independently, and there is a race condition:  more than one executor
>>>>   can choose a job of the same project to run -- as long as it hasn't
>>>>   actually started the job another executor can choose a job of the same
>>>>   type.  This patch closes the gap.
>>>>
>>>>   I have been running this for a while.  I only noticed it again when I
>>>>   started running a released version rather than my patched one, and
>>>>   this problem resurfaced.
>>>>
>>>>   Please have a quick look before I check in.
>>>>
>>>>   The idea is that when we pop an item off the Queue we record it (with
>>>>   a timestamp) in a hash table.  We use this hash table to avoid popping
>>>>   another of the same type.  When an executor then starts the job it
>>>>   removes it from the hash table.
>>>>
>>>>
>>>>
>>>>   -P
>>>>   ------------------------------------------------------------------------
>>>
>>>  ---------------------------------------------------------------------
>>>  To unsubscribe, e-mail: [hidden email]
>>>  For additional commands, e-mail: [hidden email]
>>>
>>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Kohsuke Kawaguchi | InfraDNA, Inc. | http://infradna.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

vtintillier
Hi Kohsuke,

when you say it will fix the lock and latches behavior, what do you mean?

We use it and we sometimes have an issue where jobs with a common lock will start at the same time (I mean will be removed from the queue and put on an executor). Then there is a second check of who have the lock, and one of the job is in fact sleeping waiting for the lock, but is shown as building in the UI, and thus a build is created. This causes issues with Perforce polling, which checks in previous build (ie the one that is sleeping) and find the workspace as being offline, and thus trigger another (unnecessary build).

So I understand there may be flaws in the Perforce plugin, but I wanted to know if your fix will prevent two builds to be removed from the queue at the same time, despite having a common lock.

Thanks for the great work.

Vivien

On Thu, Jun 10, 2010 at 4:31 AM, Kohsuke Kawaguchi <[hidden email]> wrote:

I pushed this change to the trunk as rev.31906. This was done after the RC branch creation to maximize the soak time, so this change is scheduled for 1.363 currently.

All the tests pass with this change, so there's no obvious problem. I hope people have a chance to try it out.



On 06/06/2010 05:09 AM, Paul G. Weiss wrote:
On Sat, 05 Jun 2010 02:17:16 -0400, Kohsuke Kawaguchi
<[hidden email]>  wrote:


 If I understand the root cause correctly, the problem is that the
 following two events do not happen atomically from the queue scheduling
 perspective:

   (1) the removal of a task from a queue
   (2) creation of new build that puts the executor into the building
       state

 Given that, I think a better fix is to expand the scope of the lock on
 Queue so that the calling executor would lock Queue for the entire
 duration of (1) and (2) --- namely from grabJob() to
 task.createExecutable().

 This requires a bit of re-ordering of notifications and so on, but I
 think it's all quite feasible.

 I believe extending this atomicity is also necessary to fix the
 locks&latch behavior.

 I'll test and commit the change once 1.361 ships. If Paul could also
 verify the fix that would be greatly appreciated.



Thanks, Kohsuke.  Will do.
-P


 On 05/28/2010 01:52 PM, Alan Harder wrote:
 seems kind of bandaid-ish.. perhaps Kohsuke will suggest an alternate
 refactor that closes the gap w/o introducing another data structure.

 if this popped Map is used, here are my comments:

 - can't the "popped.remove" call in taskIsPopped generate a
 ConcurrentModificationException?  perhaps the loop should use an
 Iterator and call Iterator.remove() to remove an element.

 - taskIsPopped may modify the map but is not synchronized.
   (could use a ConcurrentHashMap and then neither this method nor
 completePop would need synchronization?)

 - just to match the style of existing Hudson code, maybe remove the
 spaces before () in method definitions

 Thanks,

     - Alan

 FYI, see:
 http://issues.hudson-ci.org/browse/HUDSON-5653
 http://issues.hudson-ci.org/browse/HUDSON-5229



 Paul G. Weiss wrote:
 Without this patch, it is possible for jobs to be run concurrently
 even if they don't have the "Execute concurrent builds if necessary"
 box checked.

 You will only run into this problem if the job in question is a
 parameterized build, and there are several versions of the job queued
 with different parameters.

 The problem is that each executor chooses the next job to run
 independently, and there is a race condition:  more than one executor
 can choose a job of the same project to run -- as long as it hasn't
 actually started the job another executor can choose a job of the same
 type.  This patch closes the gap.

 I have been running this for a while.  I only noticed it again when I
 started running a released version rather than my patched one, and
 this problem resurfaced.

 Please have a quick look before I check in.

 The idea is that when we pop an item off the Queue we record it (with
 a timestamp) in a hash table.  We use this hash table to avoid popping
 another of the same type.  When an executor then starts the job it
 removes it from the hash table.



 -P
 ------------------------------------------------------------------------

 ---------------------------------------------------------------------
 To unsubscribe, e-mail: [hidden email]
 For additional commands, e-mail: [hidden email]




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




--
Kohsuke Kawaguchi | InfraDNA, Inc. | http://infradna.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Rich Carr
This post has NOT been accepted by the mailing list yet.
In reply to this post by Kohsuke Kawaguchi-3
This sounds very much like the fix I'm looking for.  I hope it treats queue removal/job initiation as atomic for flyweight jobs as well as for "real" jobs.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Rich Carr
This post has NOT been accepted by the mailing list yet.
I downloaded the war file from the lastSuccessfulBuild and I still see downstream matrix/flyweight builds being started after an upstream matrix/flyweightbuild has left the queue but before it has been initiated.

- Rich
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Paul Weiss-2
In reply to this post by Kohsuke Kawaguchi-3
On Wed, 09 Jun 2010 22:31:19 -0400, Kohsuke Kawaguchi  
<[hidden email]> wrote:

>
> I pushed this change to the trunk as rev.31906. This was done after the  
> RC branch creation to maximize the soak time, so this change is  
> scheduled for 1.363 currently.
>
> All the tests pass with this change, so there's no obvious problem. I  
> hope people have a chance to try it out.
>
>

This isn't working.  We're still seeing concurrent builds.  We had this  
problem licked with my fix, so I suggest we go back to that.

-P


> On 06/06/2010 05:09 AM, Paul G. Weiss wrote:
>> On Sat, 05 Jun 2010 02:17:16 -0400, Kohsuke Kawaguchi
>> <[hidden email]>  wrote:
>>
>>>
>>>  If I understand the root cause correctly, the problem is that the
>>>  following two events do not happen atomically from the queue  
>>> scheduling
>>>  perspective:
>>>
>>>     (1) the removal of a task from a queue
>>>     (2) creation of new build that puts the executor into the building
>>>         state
>>>
>>>  Given that, I think a better fix is to expand the scope of the lock on
>>>  Queue so that the calling executor would lock Queue for the entire
>>>  duration of (1) and (2) --- namely from grabJob() to
>>>  task.createExecutable().
>>>
>>>  This requires a bit of re-ordering of notifications and so on, but I
>>>  think it's all quite feasible.
>>>
>>>  I believe extending this atomicity is also necessary to fix the
>>>  locks&latch behavior.
>>>
>>>  I'll test and commit the change once 1.361 ships. If Paul could also
>>>  verify the fix that would be greatly appreciated.
>>>
>>>
>>
>> Thanks, Kohsuke.  Will do.
>> -P
>>
>>
>>>  On 05/28/2010 01:52 PM, Alan Harder wrote:
>>>>  seems kind of bandaid-ish.. perhaps Kohsuke will suggest an alternate
>>>>  refactor that closes the gap w/o introducing another data structure.
>>>>
>>>>  if this popped Map is used, here are my comments:
>>>>
>>>>  - can't the "popped.remove" call in taskIsPopped generate a
>>>>  ConcurrentModificationException?  perhaps the loop should use an
>>>>  Iterator and call Iterator.remove() to remove an element.
>>>>
>>>>  - taskIsPopped may modify the map but is not synchronized.
>>>>     (could use a ConcurrentHashMap and then neither this method nor
>>>>  completePop would need synchronization?)
>>>>
>>>>  - just to match the style of existing Hudson code, maybe remove the
>>>>  spaces before () in method definitions
>>>>
>>>>  Thanks,
>>>>
>>>>       - Alan
>>>>
>>>>  FYI, see:
>>>>  http://issues.hudson-ci.org/browse/HUDSON-5653
>>>>  http://issues.hudson-ci.org/browse/HUDSON-5229
>>>>
>>>>
>>>>
>>>>  Paul G. Weiss wrote:
>>>>>   Without this patch, it is possible for jobs to be run concurrently
>>>>>   even if they don't have the "Execute concurrent builds if  
>>>>> necessary"
>>>>>   box checked.
>>>>>
>>>>>   You will only run into this problem if the job in question is a
>>>>>   parameterized build, and there are several versions of the job  
>>>>> queued
>>>>>   with different parameters.
>>>>>
>>>>>   The problem is that each executor chooses the next job to run
>>>>>   independently, and there is a race condition:  more than one  
>>>>> executor
>>>>>   can choose a job of the same project to run -- as long as it hasn't
>>>>>   actually started the job another executor can choose a job of the  
>>>>> same
>>>>>   type.  This patch closes the gap.
>>>>>
>>>>>   I have been running this for a while.  I only noticed it again  
>>>>> when I
>>>>>   started running a released version rather than my patched one, and
>>>>>   this problem resurfaced.
>>>>>
>>>>>   Please have a quick look before I check in.
>>>>>
>>>>>   The idea is that when we pop an item off the Queue we record it  
>>>>> (with
>>>>>   a timestamp) in a hash table.  We use this hash table to avoid  
>>>>> popping
>>>>>   another of the same type.  When an executor then starts the job it
>>>>>   removes it from the hash table.
>>>>>
>>>>>
>>>>>
>>>>>   -P
>>>>>   ------------------------------------------------------------------------
>>>>
>>>>  ---------------------------------------------------------------------
>>>>  To unsubscribe, e-mail: [hidden email]
>>>>  For additional commands, e-mail: [hidden email]
>>>>
>>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Kohsuke Kawaguchi-3
In reply to this post by vtintillier
On 06/10/2010 12:16 AM, Vivien Tintillier wrote:
> Hi Kohsuke,
>
> when you say it will fix the lock and latches behavior, what do you mean?

I haven't looked into this in enough details yet, but IIRC, Stephen told
me that the reason his locks&latches plugin doesn't fully work is
because of the lack of atomicity in certain operations. I believe my fix
had fixed that underlying problem.

This is captured in http://issues.hudson-ci.org/browse/HUDSON-1083

> We use it and we sometimes have an issue where jobs with a common lock will
> start at the same time (I mean will be removed from the queue and put on an
> executor). Then there is a second check of who have the lock, and one of the
> job is in fact sleeping waiting for the lock, but is shown as building in
> the UI, and thus a build is created. This causes issues with Perforce
> polling, which checks in previous build (ie the one that is sleeping) and
> find the workspace as being offline, and thus trigger another (unnecessary
> build).

It sounds like the same symptom as HUDSON-1083, made more complicated by
a bug in the Perforce plugin.


> So I understand there may be flaws in the Perforce plugin, but I wanted to
> know if your fix will prevent two builds to be removed from the queue at the
> same time, despite having a common lock.

Paul mentioned that it's not working, so maybe there's still a problem here.

>
> Thanks for the great work.
>
> Vivien
>
> On Thu, Jun 10, 2010 at 4:31 AM, Kohsuke Kawaguchi<[hidden email]>wrote:
>
>>
>>  I pushed this change to the trunk as rev.31906. This was done after the RC
>>  branch creation to maximize the soak time, so this change is scheduled for
>>  1.363 currently.
>>
>>  All the tests pass with this change, so there's no obvious problem. I hope
>>  people have a chance to try it out.
>>
>>
>>
>>  On 06/06/2010 05:09 AM, Paul G. Weiss wrote:
>>
>>>  On Sat, 05 Jun 2010 02:17:16 -0400, Kohsuke Kawaguchi
>>>  <[hidden email]>   wrote:
>>>
>>>
>>>>   If I understand the root cause correctly, the problem is that the
>>>>   following two events do not happen atomically from the queue scheduling
>>>>   perspective:
>>>>
>>>>     (1) the removal of a task from a queue
>>>>     (2) creation of new build that puts the executor into the building
>>>>         state
>>>>
>>>>   Given that, I think a better fix is to expand the scope of the lock on
>>>>   Queue so that the calling executor would lock Queue for the entire
>>>>   duration of (1) and (2) --- namely from grabJob() to
>>>>   task.createExecutable().
>>>>
>>>>   This requires a bit of re-ordering of notifications and so on, but I
>>>>   think it's all quite feasible.
>>>>
>>>>   I believe extending this atomicity is also necessary to fix the
>>>>   locks&latch behavior.
>>>>
>>>>   I'll test and commit the change once 1.361 ships. If Paul could also
>>>>   verify the fix that would be greatly appreciated.
>>>>
>>>>
>>>>
>>>  Thanks, Kohsuke.  Will do.
>>>  -P
>>>
>>>
>>>    On 05/28/2010 01:52 PM, Alan Harder wrote:
>>>>
>>>>>   seems kind of bandaid-ish.. perhaps Kohsuke will suggest an alternate
>>>>>   refactor that closes the gap w/o introducing another data structure.
>>>>>
>>>>>   if this popped Map is used, here are my comments:
>>>>>
>>>>>   - can't the "popped.remove" call in taskIsPopped generate a
>>>>>   ConcurrentModificationException?  perhaps the loop should use an
>>>>>   Iterator and call Iterator.remove() to remove an element.
>>>>>
>>>>>   - taskIsPopped may modify the map but is not synchronized.
>>>>>     (could use a ConcurrentHashMap and then neither this method nor
>>>>>   completePop would need synchronization?)
>>>>>
>>>>>   - just to match the style of existing Hudson code, maybe remove the
>>>>>   spaces before () in method definitions
>>>>>
>>>>>   Thanks,
>>>>>
>>>>>       - Alan
>>>>>
>>>>>   FYI, see:
>>>>>   http://issues.hudson-ci.org/browse/HUDSON-5653
>>>>>   http://issues.hudson-ci.org/browse/HUDSON-5229
>>>>>
>>>>>
>>>>>
>>>>>   Paul G. Weiss wrote:
>>>>>
>>>>>>   Without this patch, it is possible for jobs to be run concurrently
>>>>>>   even if they don't have the "Execute concurrent builds if necessary"
>>>>>>   box checked.
>>>>>>
>>>>>>   You will only run into this problem if the job in question is a
>>>>>>   parameterized build, and there are several versions of the job queued
>>>>>>   with different parameters.
>>>>>>
>>>>>>   The problem is that each executor chooses the next job to run
>>>>>>   independently, and there is a race condition:  more than one executor
>>>>>>   can choose a job of the same project to run -- as long as it hasn't
>>>>>>   actually started the job another executor can choose a job of the same
>>>>>>   type.  This patch closes the gap.
>>>>>>
>>>>>>   I have been running this for a while.  I only noticed it again when I
>>>>>>   started running a released version rather than my patched one, and
>>>>>>   this problem resurfaced.
>>>>>>
>>>>>>   Please have a quick look before I check in.
>>>>>>
>>>>>>   The idea is that when we pop an item off the Queue we record it (with
>>>>>>   a timestamp) in a hash table.  We use this hash table to avoid popping
>>>>>>   another of the same type.  When an executor then starts the job it
>>>>>>   removes it from the hash table.
>>>>>>
>>>>>>
>>>>>>
>>>>>>   -P
>>>>>>
>>>>>>   ------------------------------------------------------------------------
>>>>>>
>>>>>
>>>>>   ---------------------------------------------------------------------
>>>>>   To unsubscribe, e-mail: [hidden email]
>>>>>   For additional commands, e-mail: [hidden email]
>>>>>
>>>>>
>>>>>
>>>>
>>>  ---------------------------------------------------------------------
>>>  To unsubscribe, e-mail: [hidden email]
>>>  For additional commands, e-mail: [hidden email]
>>>
>>>
>>>
>>
>>  --
>>  Kohsuke Kawaguchi | InfraDNA, Inc. | http://infradna.com/
>>
>>  ---------------------------------------------------------------------
>>  To unsubscribe, e-mail: [hidden email]
>>  For additional commands, e-mail: [hidden email]
>>
>>
>


--
Kohsuke Kawaguchi | InfraDNA, Inc. | http://infradna.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Stephen Haberman

> I believe my fix had fixed that underlying problem

I just tried Hudson trunk ([32268]) and locks and latches is still not working.

However, it looks like both HUDSON-6819 and HUDSON-5125 are fixed.

This is what I'm seeing with jobs base1, child1, child2, child3 where child1/2/3 are downstream of base1 and all 4 have a lock via the locks & latches plugin.

1. Push a change of base1/filea, child1/fileb in the same svn changeset
- I briefly see both base1 and child1 queued to build
2. Only base1 runs, child1 stays queued and waits (good!, this is HUDSON-5125 seemingly fixed)
3. base1 completes and:
- child1 starts running
- child2 and child3 get queued as downstream projects of base1 and wait, both say they are blocked on child1 (good!, locks and latches is working here, blocking against a build already in the queue)
4. child1 completes and:
- child2 and child3 both start running (fail!, locks and latches should have prevented that--it looks like since neither child2/child3 was already in the queue, they didn't see each other and both went ahead)
- both child2 and child3 are now only queued once (good!, this looks like HUDSON-6819 is fixed)

I'm encouraged--Hudson's concurrent build/locks and latches behavior has always been frustrating, but now it is the closest to working that I've ever seen. It would be really great to get the one remaining "fail" case fixed.

Note that I also have synchronous polling turned on so that dependent projects get queued in order--wondering out loud, if you have maximumThreads=1, does that essentially get you syn polling? Given the syn polling option is not available in the UI (I changed the XML file manually), perhaps it is an old option that should be removed?

- Stephen
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Stephen Haberman

> 4. child1 completes and:
> - child2 and child3 both start running (fail!, locks and latches should have
> prevented that--it looks like since neither child2/child3 was already in the
> queue, they didn't see each other and both went ahead)

I may be mistaken here--I just noticed that even though child2 and
child3 were both running concurrently on different executors, the
console output of one of them was blocked and didn't proceed until
immediately after the other finished.

I need to watch this a few more times, but it looks like locks and
latches did it's thing at least twice (now three) times in a row.

It's really strange for both jobs to be in executors but one blocking on
the other. For one it really screws with build times--the build time of
child2 becomes time-waiting-for-child1 + child2-regular-time.

It would be great if locks and latches did it's thing before jobs get
assigned to executors, not after.

- Stephen


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

Paul Weiss-2
In reply to this post by Paul Weiss-2
Tried again with 1.364.  Still awful.  I can't go beyond 1.362 now or I  
constantly get concurrent builds when I shouldn't.

Attached is a screen shot (is that allowed?) that shows the problem.    
Notice that there are three builds running for a job that is not marked as  
concurrent.  Notice also that they started a second apart.  Also, there is  
a fourth build pending that is not being run because of concurrency.

Can we please go back to my solution?  Although it was criticized as being  
"bandaid-ish", it did work.

-P




On Wed, 23 Jun 2010 16:54:39 -0400, Paul G. Weiss <[hidden email]> wrote:

> On Wed, 09 Jun 2010 22:31:19 -0400, Kohsuke Kawaguchi  
> <[hidden email]> wrote:
>
>>
>> I pushed this change to the trunk as rev.31906. This was done after the  
>> RC branch creation to maximize the soak time, so this change is  
>> scheduled for 1.363 currently.
>>
>> All the tests pass with this change, so there's no obvious problem. I  
>> hope people have a chance to try it out.
>>
>>
>
> This isn't working.  We're still seeing concurrent builds.  We had this  
> problem licked with my fix, so I suggest we go back to that.
>
> -P
>
>
>> On 06/06/2010 05:09 AM, Paul G. Weiss wrote:
>>> On Sat, 05 Jun 2010 02:17:16 -0400, Kohsuke Kawaguchi
>>> <[hidden email]>  wrote:
>>>
>>>>
>>>>  If I understand the root cause correctly, the problem is that the
>>>>  following two events do not happen atomically from the queue  
>>>> scheduling
>>>>  perspective:
>>>>
>>>>     (1) the removal of a task from a queue
>>>>     (2) creation of new build that puts the executor into the building
>>>>         state
>>>>
>>>>  Given that, I think a better fix is to expand the scope of the lock  
>>>> on
>>>>  Queue so that the calling executor would lock Queue for the entire
>>>>  duration of (1) and (2) --- namely from grabJob() to
>>>>  task.createExecutable().
>>>>
>>>>  This requires a bit of re-ordering of notifications and so on, but I
>>>>  think it's all quite feasible.
>>>>
>>>>  I believe extending this atomicity is also necessary to fix the
>>>>  locks&latch behavior.
>>>>
>>>>  I'll test and commit the change once 1.361 ships. If Paul could also
>>>>  verify the fix that would be greatly appreciated.
>>>>
>>>>
>>>
>>> Thanks, Kohsuke.  Will do.
>>> -P
>>>
>>>
>>>>  On 05/28/2010 01:52 PM, Alan Harder wrote:
>>>>>  seems kind of bandaid-ish.. perhaps Kohsuke will suggest an  
>>>>> alternate
>>>>>  refactor that closes the gap w/o introducing another data structure.
>>>>>
>>>>>  if this popped Map is used, here are my comments:
>>>>>
>>>>>  - can't the "popped.remove" call in taskIsPopped generate a
>>>>>  ConcurrentModificationException?  perhaps the loop should use an
>>>>>  Iterator and call Iterator.remove() to remove an element.
>>>>>
>>>>>  - taskIsPopped may modify the map but is not synchronized.
>>>>>     (could use a ConcurrentHashMap and then neither this method nor
>>>>>  completePop would need synchronization?)
>>>>>
>>>>>  - just to match the style of existing Hudson code, maybe remove the
>>>>>  spaces before () in method definitions
>>>>>
>>>>>  Thanks,
>>>>>
>>>>>       - Alan
>>>>>
>>>>>  FYI, see:
>>>>>  http://issues.hudson-ci.org/browse/HUDSON-5653
>>>>>  http://issues.hudson-ci.org/browse/HUDSON-5229
>>>>>
>>>>>
>>>>>
>>>>>  Paul G. Weiss wrote:
>>>>>>   Without this patch, it is possible for jobs to be run concurrently
>>>>>>   even if they don't have the "Execute concurrent builds if  
>>>>>> necessary"
>>>>>>   box checked.
>>>>>>
>>>>>>   You will only run into this problem if the job in question is a
>>>>>>   parameterized build, and there are several versions of the job  
>>>>>> queued
>>>>>>   with different parameters.
>>>>>>
>>>>>>   The problem is that each executor chooses the next job to run
>>>>>>   independently, and there is a race condition:  more than one  
>>>>>> executor
>>>>>>   can choose a job of the same project to run -- as long as it  
>>>>>> hasn't
>>>>>>   actually started the job another executor can choose a job of the  
>>>>>> same
>>>>>>   type.  This patch closes the gap.
>>>>>>
>>>>>>   I have been running this for a while.  I only noticed it again  
>>>>>> when I
>>>>>>   started running a released version rather than my patched one, and
>>>>>>   this problem resurfaced.
>>>>>>
>>>>>>   Please have a quick look before I check in.
>>>>>>
>>>>>>   The idea is that when we pop an item off the Queue we record it  
>>>>>> (with
>>>>>>   a timestamp) in a hash table.  We use this hash table to avoid  
>>>>>> popping
>>>>>>   another of the same type.  When an executor then starts the job it
>>>>>>   removes it from the hash table.
>>>>>>
>>>>>>
>>>>>>
>>>>>>   -P
>>>>>>   ------------------------------------------------------------------------
>>>>>
>>>>>  ---------------------------------------------------------------------
>>>>>  To unsubscribe, e-mail: [hidden email]
>>>>>  For additional commands, e-mail: [hidden email]
>>>>>
>>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [hidden email]
>>> For additional commands, e-mail: [hidden email]
>>>
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

concurrency.png (197K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

mesc
This post was updated on .
In reply to this post by Stephen Haberman
CONTENTS DELETED
The author has deleted this message.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Re: Patch to fix concurrent build problem

martin.loiselle@medrium.com
This post has NOT been accepted by the mailing list yet.
On vacation:
I will be out of the office starting the Dec. 23rd and
will return on January 10.

for any problem related to your email or computer, please send an email to [hidden email].

For any emergency, call at 1-418-264-6911
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Patch to fix concurrent build problem

mesc
In reply to this post by Stephen Haberman
Stephen Haberman wrote
It would be great if locks and latches did it's thing before jobs get
assigned to executors, not after.
Is there any update/progress for this issue?
Locks and Latches is a very helpful plugin IMHO - would be even better when it would leave the builds blocked out in Hudson's build queue ... :-)
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Re: Patch to fix concurrent build problem

martin.loiselle@medrium.com
This post has NOT been accepted by the mailing list yet.
On vacation:
I will be out of the office starting the Dec. 23rd and
will return on January 10.

for any problem related to your email or computer, please send an email to [hidden email].

For any emergency, call at 1-418-264-6911
Loading...