Created attachment 255193: runmap exception on startup

Exception on startup says runmap cannot be created. There's no way of knowing from the attached exception which job has bad data.
You don't need to know which job. Shut down Hudson, delete all the _runmap.xml files, and restart.
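A minimal sketch of that cleanup step, to be run only while Hudson is shut down. The function name `find_runmap_files` and its `delete` flag are hypothetical, and the script searches the whole Hudson home rather than assuming where the _runmap.xml files live. By default it only lists the files, so the result can be reviewed before anything is removed.

```python
from pathlib import Path


def find_runmap_files(hudson_home, delete=False):
    """Return every _runmap.xml under hudson_home; delete them only if asked.

    hudson_home is assumed to be the Hudson home directory.
    """
    found = sorted(Path(hudson_home).rglob("_runmap.xml"))
    if delete:
        for path in found:
            path.unlink()  # Hudson should recreate the file on the next startup
    return found
```

Calling it once without `delete=True` and checking the returned paths is the cautious way to use it.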
Unfortunately this didn't work (see attached hudson.log). I've tried this before. If there's "bad" data on disk, for instance a build that wasn't deleted completely and left behind an empty directory or a directory with just a build.xml file in it, then you get exceptions as shown in the attached log. I know I have this type of corruption on some jobs, which I will clean up as part of my upgrade to 3.3.0, but even after cleanup in a trial run I got this exception, so I missed some bad juju somewhere.
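The two kinds of leftover described above (an empty build directory, or one holding only build.xml) can be detected with a short scan. This is a sketch under assumptions: `find_suspect_build_dirs` is a hypothetical name, `builds_dir` is assumed to be one job's directory of per-build subdirectories, and symlinked entries are skipped on the assumption that they are aliases rather than real builds. It reports only and changes nothing on disk.

```python
from pathlib import Path


def find_suspect_build_dirs(builds_dir):
    """Return (path, reason) pairs for build directories that look incomplete."""
    suspects = []
    for entry in sorted(Path(builds_dir).iterdir()):
        # Skip plain files (such as _runmap.xml) and symlinked aliases.
        if entry.is_symlink() or not entry.is_dir():
            continue
        names = {child.name for child in entry.iterdir()}
        if not names:
            suspects.append((entry, "empty"))
        elif names == {"build.xml"}:
            suspects.append((entry, "only build.xml"))
    return suspects
```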
Created attachment 255194: Hudson log after deletion of all _runmap.xml files.
Well, it works in the sense the _runmap.xml files will be recreated from data on disk, but it doesn't work if the data on disk is corrupt or inconsistent.

In the past I have written programs to scan a Hudson installation, detect and report errors and inconsistencies, and optionally fix them. This was very useful in the early "team" days, as the changing implementation left almost all Hudson homes corrupt in some way. But perhaps we should resurrect something of the sort for data problems people see in the field? I'm not sure and invite thoughts on this.

This would be a utility and would need to be run when Hudson is not running. As such it would not be part of Hudson, though we might be able to figure out how to make it part of the distribution.

In one sense, it seems like a cop-out, coming along behind and cleaning up after what are arguably bugs to begin with. On the other hand, we can't go back and undo damage caused by past versions - except maybe at startup, which I haven't really considered - and use of the file system for all Hudson-related data allows inconsistencies caused by server crashes, file system lockups, network failures, etc.

Be interested to hear what you and Winston and any others think.
My suggestions:

1. Make it a CLI plugin so it can be updated easily and is "part" of Hudson.
2. Make it check config files versus reality. For instance:
   1. teams.xml
   2. views.xml (?)
   3. _runmap.xml
      Compare what the _runmap.xml file says versus what's on disk.
      - Extra directories that aren't referenced in _runmap.xml. These are essentially "dead" directories, as the user can't see them through the UI. These would be builds that were purged or deleted incorrectly: removed from _runmap.xml but still on disk, either partly or completely.
4. Scan jobs for bad build directories: empty directories, or directories with just a build.xml in them.
5. Run a "dummy" startup routine that would build the _runmap.xml files as though they didn't exist and report those that would cause an exception.
6. Do not fix anything automatically. Produce a report. Auto-fix might be something for down the road. The report can then contain whatever system info you want to include (e.g. Hudson version).
7. Option to email report results to a Hudson email address so you guys can scan these for the types of things that are occurring on customer systems.

If it's a report, I guess Hudson could be running and it could be a CLI. If it did auto-repairing it would have to be a stand-alone utility like you mention, but that would make distribution / updates harder.
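The "_runmap.xml versus disk" comparison could be sketched roughly as follows. The format of _runmap.xml is not described in this thread, so the sketch does not parse it: it assumes the caller has already extracted the set of build directory names the file references. `compare_runmap_to_disk` and the report keys are hypothetical names. In keeping with the suggestion not to fix anything automatically, it only produces a report.

```python
from pathlib import Path


def compare_runmap_to_disk(referenced, builds_dir):
    """Report differences between referenced build names and directories on disk.

    referenced: build directory names taken from _runmap.xml (extraction
                is left to the caller; the file format is not assumed here).
    builds_dir: directory holding one subdirectory per build.
    """
    on_disk = {
        p.name
        for p in Path(builds_dir).iterdir()
        if p.is_dir() and not p.is_symlink()
    }
    referenced = set(referenced)
    return {
        # "Dead" directories: on disk but invisible through the UI.
        "unreferenced_on_disk": sorted(on_disk - referenced),
        # Referenced builds whose directory is gone.
        "missing_on_disk": sorted(referenced - on_disk),
    }
```

Returning plain sorted lists keeps the result easy to print or attach to a report alongside other system info.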
> Do not fix anything automatically.

Was only contemplating fixing as an option.

> Make it a CLI

If it's a CLI command, it can never repair.

I think we have different perspectives. You're reluctant to trust a tool to hack your Hudson home; I'm wary of admins hacking the disk base. :) I'd worry about scenarios like this:

- Command reports a dozen or so problems.
- User takes Hudson down (or not!), does a bunch of "fixing".
- User restarts Hudson, gets a bunch of exceptions because of incorrect "fixes".
- User writes bug report, perhaps complaining of lost data.
- Loop.

Of course, the "user" in that scenario isn't you, but it could be me. Manual changes to the disk base are error-prone, there is no audit trail and there is no recovery. Maybe the "fix on startup" option was the best.
Understood. I think the list of scenarios that have caused problems (or things to check) still applies. These are places where I've had problems, some of them consistently. If you come up with something and want it tested, please let me know.
I appreciate the list. Yeah, it comes down to "if you come up with something". So many bugs, so little time. But these chronic problems are annoying, for sure.