Bug 472651 - Exception on startup without useful information - _runmap
Summary: Exception on startup without useful information - _runmap
Status: NEW
Alias: None
Product: Hudson
Classification: Technology
Component: Core
Version: 3.2.2
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Winston Prakash CLA
QA Contact: Geoff Waymark CLA
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-07-14 14:27 EDT by Stuart Lorber CLA
Modified: 2015-07-15 13:53 EDT
CC List: 5 users

See Also:


Attachments
runmap exception on startup (3.27 KB, text/plain)
2015-07-14 14:27 EDT, Stuart Lorber CLA
Hudson log after deletion of all _runmap.xml files. (78.01 KB, application/octet-stream)
2015-07-14 17:54 EDT, Stuart Lorber CLA

Description Stuart Lorber CLA 2015-07-14 14:27:56 EDT
Created attachment 255193
runmap exception on startup

Exception on startup says runmap cannot be created.

There's no way of knowing from the attached exception which job has bad data.
Comment 1 Bob Foster CLA 2015-07-14 15:40:22 EDT
You don't need to know which job. Shut down Hudson, delete all the _runmap.xml files, and restart.
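
The delete step above could be scripted roughly as follows. This is only a sketch: it assumes the files are literally named _runmap.xml and sit somewhere under the Hudson home directory, and the function names and the dry_run parameter are made up for illustration. It should only be run while Hudson is shut down.

```python
import os


def find_runmaps(hudson_home):
    """Yield the path of every _runmap.xml file found under hudson_home."""
    for dirpath, _dirnames, filenames in os.walk(hudson_home):
        if "_runmap.xml" in filenames:
            yield os.path.join(dirpath, "_runmap.xml")


def delete_runmaps(hudson_home, dry_run=True):
    """Return all _runmap.xml paths; delete them unless dry_run is set.

    Hypothetical helper, not part of Hudson. Run it only while Hudson is down.
    """
    paths = sorted(find_runmaps(hudson_home))
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths
```

Calling delete_runmaps(home) with the default dry_run=True only lists what would be removed, which leaves a record of the affected files before anything is deleted.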
Comment 2 Stuart Lorber CLA 2015-07-14 17:54:11 EDT
Unfortunately this didn't work (see attached hudson.log).

I've tried this before.

If there's "bad" data on disk (for instance, a build that wasn't deleted completely and left behind an empty directory, or a directory with just a build.xml file in it), then you get exceptions as shown in the attached log.
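
A rough way to look for the two kinds of leftovers described above (an empty build directory, or one holding nothing but build.xml) is sketched below. It assumes the caller passes in the directory that holds one job's build directories; the exact on-disk layout is not spelled out in this report, and the function names are hypothetical. It only reports and changes nothing.

```python
import os


def classify_build_dir(path):
    """Return a problem label for one build directory, or None if it looks fine."""
    entries = os.listdir(path)
    if not entries:
        return "empty"
    if entries == ["build.xml"]:
        return "only build.xml"
    return None


def find_bad_build_dirs(builds_dir):
    """Return (path, problem) pairs for immediate subdirectories of builds_dir."""
    problems = []
    for name in sorted(os.listdir(builds_dir)):
        path = os.path.join(builds_dir, name)
        # Skip plain files and symlinks; only real directories are checked.
        if os.path.islink(path) or not os.path.isdir(path):
            continue
        problem = classify_build_dir(path)
        if problem:
            problems.append((path, problem))
    return problems
```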

I know I have this type of corruption on some jobs which I will clean up as part of my upgrade to 3.3.0 but even after cleanup in a trial run I got this exception so I missed some bad juju somewhere.
Comment 3 Stuart Lorber CLA 2015-07-14 17:54:37 EDT
Created attachment 255194
Hudson log after deletion of all _runmap.xml files.
Comment 4 Bob Foster CLA 2015-07-14 19:08:44 EDT
Well, it works in the sense that the _runmap.xml files will be recreated from data on disk, but it doesn't work if the data on disk is corrupt or inconsistent.

In the past I have written programs to scan a Hudson installation, detect and report errors and inconsistencies, and optionally fix them. This was very useful in the early "team" days, as the changing implementation left almost all Hudson homes corrupt in some way. But perhaps we should resurrect something of the sort for data problems people see in the field?

I'm not sure and invite thoughts on this. This would be a utility and would need to be run when Hudson is not running. As such it would not be part of Hudson, though we might be able to figure out how to make it part of the distribution.

In one sense, it seems like a cop-out, coming along behind and cleaning up after what are arguably bugs to begin with. On the other hand, we can't go back and undo damage caused by past versions - except maybe at startup, which I haven't really considered - and use of the file system for all Hudson-related data allows inconsistencies caused by server crashes, file system lockups, network failures, etc.

Be interested to hear what you and Winston and any others think.
Comment 5 Stuart Lorber CLA 2015-07-15 09:48:31 EDT
My suggestions:

1. Make it a CLI plugin so it can be updated easily and is "part" of Hudson.
2. Make it check config files versus reality.
   For instance:
   1. teams.xml
   2. views.xml (?)
   3. _runmap.xml
      Compare what the _runmap.xml file says versus what's on disk.
      - Extra directories that aren't referenced in _runmap.xml.  These are essentially "dead" directories as the user can't see them through the UI.  They would come from builds that were incorrectly purged or deleted: removed from _runmap.xml but left on disk, partly or completely.
   4. Scan jobs for bad build directories - empty directories, directories with just a build.xml in them.
   5. Run a "dummy" startup routine that would build the _runmap.xml files as though they didn't exist and report those that would cause an exception.
   6. Do not fix anything automatically.  Produce a report.  Auto-fix might be something for down the road.  The report can then contain whatever system info you want to include (i.e. Hudson version).
   7. Option to email report results to a Hudson email address so you guys can scan these for the types of things that are occurring on customer systems.
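
The "_runmap.xml versus what's on disk" comparison from item 3 might look something like the sketch below. The format of _runmap.xml is not shown in this report, so parsing is left out on purpose: referenced_ids is assumed to be whatever list of build identifiers the caller has already extracted from the file, and it is also an assumption that a build directory's name is its identifier. The function name and the result keys are invented for illustration. In line with item 6, it only produces a report.

```python
import os


def compare_runmap_to_disk(referenced_ids, builds_dir):
    """Compare build ids referenced by a run map with directories on disk.

    referenced_ids: ids already extracted from _runmap.xml (parsing not shown).
    builds_dir: directory assumed to hold one subdirectory per build.
    """
    on_disk = set()
    for name in os.listdir(builds_dir):
        path = os.path.join(builds_dir, name)
        # Count only real directories, not plain files or symlinks.
        if os.path.isdir(path) and not os.path.islink(path):
            on_disk.add(name)
    referenced = set(referenced_ids)
    return {
        # "Dead" directories: on disk but invisible through the UI.
        "unreferenced_on_disk": sorted(on_disk - referenced),
        # Referenced builds whose directory is gone.
        "missing_on_disk": sorted(referenced - on_disk),
    }
```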

If it's a report, I guess Hudson could be running and it could be a CLI.  If it did auto-repairing it would have to be a stand-alone utility like you mention but it would make distribution / updates harder.
Comment 6 Bob Foster CLA 2015-07-15 11:39:55 EDT
> Do not fix anything automatically.

Was only contemplating fixing as an option.

> Make it a CLI

If it's a CLI command, it can never repair. 

I think we have different perspectives. You're reluctant to trust a tool to hack your Hudson home; I'm wary of admins hacking the disk base. :)

I'd worry about scenarios like this:

- Command reports a dozen or so problems.
- User takes Hudson down (or not!), does a bunch of "fixing".
- User restarts Hudson, gets a bunch of exceptions because of incorrect "fixes".
- User writes bug report, perhaps complaining of lost data.
- Loop.

Of course, the "user" in that scenario isn't you, but it could be me. Manual changes to the disk base are error-prone, there is no audit trail and there is no recovery.

Maybe the "fix on startup" option was the best.
Comment 7 Stuart Lorber CLA 2015-07-15 12:55:24 EDT
Understood.

I think the list I gave of scenarios that have caused problems - or things to check - still applies.

These are places where I've had problems - some of them consistently.

If you come up with something and want it tested...please let me know.
Comment 8 Bob Foster CLA 2015-07-15 13:53:01 EDT
I appreciate the list.

Yeah, it comes down to "if you come up with something". So many bugs, so little time. But these chronic problems are annoying, for sure.