Bug 402867 - Hudson job configuration update not atomic and kills executors
Summary: Hudson job configuration update not atomic and kills executors
Status: NEW
Alias: None
Product: Hudson
Classification: Technology
Component: Core
Version: 3.0.0
Hardware: All All
Importance: P3 major
Target Milestone: ---
Assignee: Winston Prakash CLA
QA Contact: Geoff Waymark CLA
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-03-11 04:41 EDT by Alexander Link CLA
Modified: 2013-03-11 04:53 EDT

See Also:


Attachments
Screenshots of Dead Executors and Details (33.18 KB, application/x-zip-compressed)
2013-03-11 04:41 EDT, Alexander Link CLA

Description Alexander Link CLA 2013-03-11 04:41:34 EDT
Created attachment 228179
Screenshots of Dead Executors and Details

Hudson job configuration update via /hudson/job/<job>/config.xml is not atomic and can kill (all) Executors!

Our Hudson-based build infrastructure currently has 4000 jobs, which are created and updated by a job generator. The generator sends POST requests to /hudson/job/<job>/config.xml to update the job configurations.
During load tests, after a few hours almost all executors (master and slaves) were dead, killed by an IllegalArgumentException thrown in BaseProjectProperty.
The cause is that the update performed by AbstractItem.doConfigDotXml() is not atomic: between XmlFile.unmarshal (AbstractItem:488) and onLoad (AbstractItem:489), the job can be taken from the queue to run. This window is small, but we hit it regularly in our scenario.
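
To illustrate the kind of window described above, here is a minimal, self-contained sketch. It is not Hudson code: the classes Property, InPlaceJob and SnapshotJob and the key "builders" are hypothetical stand-ins. The sketch assumes the failure mode is "new state becomes visible before its key is set", and it simulates the concurrent queue thread with a callback that runs between the two steps, so the outcome is deterministic.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ConfigReloadSketch {

    /** Hypothetical stand-in for a project property; its key is only set in a later step. */
    static final class Property {
        volatile String key; // null until the "onLoad"-like step has run

        String getValue() {
            if (key == null) {
                throw new IllegalArgumentException("Project property should have not null propertyKey");
            }
            return "value-of-" + key;
        }
    }

    static Property initialised() {
        Property p = new Property();
        p.key = "builders";
        return p;
    }

    /** Two-step update in place: readers can observe the state between the steps. */
    static final class InPlaceJob {
        volatile Property property = initialised();

        void reload(Runnable betweenSteps) {
            property = new Property(); // step 1: new state is visible, key still null
            betweenSteps.run();        // stands in for a queue thread running right here
            property.key = "builders"; // step 2: key is finally set
        }
    }

    /** Alternative: prepare the new state privately, then publish it in one step. */
    static final class SnapshotJob {
        final AtomicReference<Property> property = new AtomicReference<>(initialised());

        void reload(Runnable betweenSteps) {
            Property fresh = new Property(); // not yet visible to readers
            betweenSteps.run();              // readers still see the old, complete state
            fresh.key = "builders";
            property.set(fresh);             // single publication of the finished object
        }
    }

    /** Returns true if a read performed between the two steps throws. */
    static boolean inPlaceReaderFails() {
        InPlaceJob job = new InPlaceJob();
        boolean[] failed = {false};
        job.reload(() -> {
            try {
                job.property.getValue();
            } catch (IllegalArgumentException e) {
                failed[0] = true;
            }
        });
        return failed[0];
    }

    /** Same read, but against the snapshot-style job. */
    static boolean snapshotReaderFails() {
        SnapshotJob job = new SnapshotJob();
        boolean[] failed = {false};
        job.reload(() -> {
            try {
                job.property.get().getValue();
            } catch (IllegalArgumentException e) {
                failed[0] = true;
            }
        });
        return failed[0];
    }

    public static void main(String[] args) {
        System.out.println("in-place reader fails: " + inPlaceReaderFails());
        System.out.println("snapshot reader fails: " + snapshotReaderFails());
    }
}
```

In this sketch the in-place variant reports a failed read and the snapshot variant does not. Whether a similar publish-when-complete approach fits AbstractItem.doConfigDotXml() is not something this report establishes.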

We were able to reproduce this issue (Git tag: hudson-parent-3.0.0):
* Set a breakpoint at hudson.model.Queue.schedule(Queue.java:426)
* Set a breakpoint at org.eclipse.hudson.model.project.property.BaseProjectProperty.setKey(BaseProjectProperty.java:52)
* Trigger the Test job in the Hudson UI
* Keep the Queue.schedule breakpoint suspended
* Send a POST request with the config.xml content to http://localhost:8080/hudson/job/Test/config.xml to update the job config
* Keep the BaseProjectProperty.setKey breakpoint suspended
* Release the Queue.schedule breakpoint and wait a few seconds
* End the debug session...
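
For the POST step in the list above, a request along these lines could be used. This is only a sketch: the class name ConfigUpdateRequest is hypothetical, the Content-Type header is an assumption, and authentication or other instance-specific settings are not covered. It uses the java.net.http client available since Java 11.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class ConfigUpdateRequest {

    /** Builds the POST request for <base>/job/<job>/config.xml without sending it. */
    static HttpRequest build(String hudsonBaseUrl, String jobName, String configXml) {
        URI uri = URI.create(hudsonBaseUrl + "/job/" + jobName + "/config.xml");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/xml") // assumption, not taken from the report
                .POST(HttpRequest.BodyPublishers.ofString(configXml))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ConfigUpdateRequest <path-to-config.xml>");
            return;
        }
        // The job configuration to upload is read from the file given on the command line.
        String configXml = Files.readString(Path.of(args[0]));
        HttpRequest request = build("http://localhost:8080/hudson", "Test", configXml);

        // Sends the request; with the setKey breakpoint active this call blocks
        // until the breakpoint is released.
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP status: " + response.statusCode());
    }
}
```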

You will notice that at least one executor has died (see screenshot); its details show this exception:
java.lang.IllegalArgumentException: Project property should have not null propertyKey
	at org.eclipse.hudson.model.project.property.BaseProjectProperty.getCascadingValue(BaseProjectProperty.java:93)
	at org.eclipse.hudson.model.project.property.BaseProjectProperty.getValue(BaseProjectProperty.java:120)
	at hudson.model.BaseBuildableProject.getBuildersList(BaseBuildableProject.java:153)
	at hudson.model.Project.getResourceActivities(Project.java:54)
	at hudson.model.AbstractProject.getResourceList(AbstractProject.java:1485)
	at hudson.model.Queue.isBuildBlocked(Queue.java:921)
	at hudson.model.Queue.maintain(Queue.java:969)
	at hudson.model.Queue.pop(Queue.java:806)
	at hudson.model.Executor.grabJob(Executor.java:183)
	at hudson.model.Executor.run(Executor.java:113)

Here you can see the stack from AbstractItem.doConfigDotXml to BaseProjectProperty.setKey:
Daemon Thread [Handling POST /hudson/job/Test/config.xml : http-8080-7] (Suspended (entry into method setKey in BaseProjectProperty))	
	BooleanProjectProperty(BaseProjectProperty<T>).setKey(String) line: 53	
	FreeStyleProject(Job<JobT,RunT>).buildProjectProperties() line: 400	
	FreeStyleProject(AbstractProject<P,R>).buildProjectProperties() line: 351	
	FreeStyleProject(BaseBuildableProject<P,B>).buildProjectProperties() line: 100	
	FreeStyleProject.buildProjectProperties() line: 87	
	FreeStyleProject(Job<JobT,RunT>).onLoad(ItemGroup<Item>, String) line: 356	
	FreeStyleProject(AbstractProject<P,R>).onLoad(ItemGroup<Item>, String) line: 323	
	FreeStyleProject(BaseBuildableProject<P,B>).onLoad(ItemGroup<Item>, String) line: 91	
	FreeStyleProject(AbstractItem).doConfigDotXml(StaplerRequest, StaplerResponse) line: 489	
	[...]
Comment 1 Alexander Link CLA 2013-03-11 04:53:18 EDT
PS: Do you have any recommendations on how to work around this issue?