Bug 487044 - Migrate some (most) cron jobs to Releng HIPP Instance
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Releng
Version: 4.6
Hardware: PC Linux
Importance: P1 enhancement
Target Milestone: 4.6.1
Assignee: David Williams CLA
Depends on: 490516 492493 495761 496316 496335
Blocks: 496280 490440 490554 495750 496281 496282 496315
Reported: 2016-02-02 16:13 EST by David Williams CLA
Modified: 2016-06-17 18:27 EDT

Description David Williams CLA 2016-02-02 16:13:17 EST
The purpose of this bug is to track "migration" of certain cronjobs to a 
Hudson HIPP Instance "owned" by Platform releng. By making it "specific" to the releng team, it allows some work to be done for Platform and Equinox (see bug 486401) without opening up security issues too much. 

This may be a "temporary" solution until our entire builds can "run on Hudson". But since that is a year or more away, this allows incremental progress to be made. 

The "first wave" will be to move what some committers have as "cronjobs" to this Hudson instance. 

The "second wave" would be to use this Hudson instance instead of "e4Build" to do more work. This minimizes the number of committers (or build IDs) needed with "shell access" to eclipse.org. 

The reason I am motivated to do this now is so that it would be easier for other committers to "back up" our core release engineering team (i.e. me :) in the
event they were not available. Also, others, such as John Arthorne, had a cronjob related to the Equinox Download page, and since he is becoming inactive at Eclipse, having that cronjob run here allows others to pick up that work if changes are needed, or similar.
Comment 1 David Williams CLA 2016-02-02 17:06:48 EST
I should have provided the URL: 
https://hudson.eclipse.org/releng/
Comment 2 David Williams CLA 2016-03-13 17:59:58 EDT
One thing I have learned already is that our "BUILD_ID" is a name clash with Hudson's BUILD_ID. 

I have made this change so we always compute our own BUILD_ID, and do not let it be "passed in". Not sure if there will be other impacts, or not.

http://git.eclipse.org/c/platform/eclipse.platform.releng.aggregator.git/commit/?id=54170664b8d511c8cef175942e7b9b9d22b56f81

I think there are Hudson plugins that let you define format of BUILD_ID, but ... not sure it is worth that, since normally we compute it anyway.
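
A minimal sketch of the "always compute our own BUILD_ID" idea. This is not the code from the commit above; the function name compute_build_id and its arguments are hypothetical, and the build-type-plus-timestamp shape is only assumed from build names such as N20160326-1500.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the ID from a build type letter and a timestamp,
# ignoring any BUILD_ID that Hudson may have exported into the environment.
compute_build_id() {
  local build_type="$1"                       # e.g. N, I, M
  local stamp="${2:-$(date +%Y%m%d-%H%M)}"    # optional fixed timestamp
  printf '%s%s\n' "$build_type" "$stamp"
}

# Overwrite whatever was passed in, rather than trusting it.
BUILD_ID="$(compute_build_id N)"
export BUILD_ID
echo "Using BUILD_ID=${BUILD_ID}"
```

Computing the value unconditionally means the Hudson-provided BUILD_ID never leaks into the build, at the cost of not being able to pass one in on purpose.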
Comment 3 David Williams CLA 2016-03-26 19:03:31 EDT
One of the latest attempts failed during a "post build" step. The one for "createReports.sh". 

16:39:33  /shared/eclipse/builds/4N

16:48:43  Proxy tunneling failed: Bad RequestUnable to establish SSL connection.
16:48:43  tar: org.eclipse.cbi.p2repo.analyzers.product-linux.gtk.x86_64.tar.gz: Cannot open: No such file or directory
16:48:43  tar: Error is not recoverable: exiting now
16:48:43  /shared/eclipse/builds/4N/production/createReports.sh: line 55: /shared/eclipse/builds/4N/siteDir/eclipse/downloads/drops4/N20160326-1500/reportApplication/p2analyze: No such file or directory
16:48:43  /shared/eclipse/builds/4N/production/createReports.sh: line 60: /shared/eclipse/builds/4N/siteDir/eclipse/downloads/drops4/N20160326-1500/reportApplication/p2analyze: No such file or directory
16:48:43  
16:48:43     ERROR. exit code: 127
16:48:43     ERROR. message: Error occurred during createReports.sh
16:48:43  

I've tried to fix this by fully specifying more path info, and will try again.
Comment 4 David Williams CLA 2016-03-28 17:05:46 EDT
For some reason, initiating a job from Hudson's Releng HIPP doesn't handle proxies or external links the way it does when run from e4Build. See bug 490516.
Comment 5 David Williams CLA 2016-03-28 21:13:02 EDT
(In reply to David Williams from comment #4)
> For some reason, initiating a job from Hudson's Releng HIPP doesn't handle
> proxies or external links as when ran from e4Build. See bug 490516.

I could not fix the proxy issue even after several attempts, so I will revert to using cron jobs from the e4Build account. If nothing else, it will confirm whether the "https://" errors continue to show up, or whether it is indeed a "real" proxy setting issue. 

My *guess* is some of the "Hudson Variables" will need to be added directly to the mvn command that does the build -- that we may be "losing" them, as we invoke one process and then another.
Comment 6 David Williams CLA 2016-03-30 15:45:08 EDT
I will add an extra "work item" to this bug (since it is not worth its own entry): once 'cron jobs' are working well, another "task" that is sometimes required of "release engineering" is to remove a build from an existing composite, if the build is "bad". Removing it not only prevents "early testers" from installing a bad build, but removing it from composites such as 4.6-N-builds is sometimes needed to allow the "Gerrit jobs" to continue to work until a good build can be produced. 

Assuming it is easy to do "at the same time", I will track it in this bug, but if it turns out to be complicated I will branch it off.
Comment 7 David Williams CLA 2016-03-30 15:48:36 EDT
(In reply to David Williams from comment #6)
Markus, I added you to CC since I heard you mention documenting "how to delete a bad build from composite" -- and it is fine if you do -- but if things work out, I'd like to have more such tasks controllable from the Releng HIPP instance. 

For the same reason, that "someone else can do it" more easily than logging into a shell terminal (and risk doing it wrong :) -- not that YOU would. 

Just FYI.
Comment 8 David Williams CLA 2016-05-01 13:43:24 EDT
Just to add another problem discovered. 

I tried to move some of the "clean up" cron jobs to this HIPP instance. These are the jobs that remove, for example, old builds from build machine, etc. 

I did not try the cleanup jobs of the "download server", but the clean up job for the "build server" did not work. The "genie.releng" user gets "permission denied" when it tries to remove "drop directories". 

The reason seems to be the "group" is not set correctly on those directories. 
The directory itself, such as 
/shared/public/eclipse/builds/4I/siteDir/eclipse/downloads/drops4/I20160430-2000
has the correct user and group (e4Build and eclipse.platform.releng) but it does not have the setgid "s-bit" set correctly (even though 'drops2' does) so everything under that has the group set to "common". 

I've opened bug 492493 to get webmaster help with that, but since not urgent, set its status to minor.
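
To illustrate the "s-bit" problem, here is a small sketch of how the setgid bit on a drop directory could be checked. The helper names has_setgid and report_setgid are hypothetical; the check itself relies only on the standard `-g` file test and `chmod g+s`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: true if the directory has the setgid bit, so that
# files created under it inherit the directory's group.
has_setgid() {
  [ -d "$1" ] && [ -g "$1" ]
}

# Print a one-line report, including the usual fix when the bit is missing.
report_setgid() {
  local dir="$1"
  if has_setgid "$dir"; then
    echo "setgid set: $dir"
  else
    echo "setgid MISSING: $dir (fix with: chmod g+s \"$dir\")"
  fi
}
```

Running report_setgid against the drops4 and drops2 directories would show the difference described above. Note that setting the bit now does not change the group of files that already exist underneath.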
Comment 9 David Williams CLA 2016-06-08 15:53:56 EDT
Since we are starting up N-builds again, I think it is a good time to focus more on this transition. 

If I recall, the N-builds (and I-builds, etc.) would "run fine" using a "cron job" on Hudson, except that the "proxy settings" apparently were not correct when the JavaDoc was being created and that program could not form links to the "oracle" website (or ... something like that). 

My guess was that "ANT_OPTS" needs to be "passed through" to the build process, and made use of by the JavaDoc step. Not sure why it worked when "e4Build" was the user id running the cronjobs, but it could have to do with "user settings" I suppose. And it would be best not to rely on "user settings" but to have everything independent of any particular user. 
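
As a sketch of what "passing ANT_OPTS through" could look like: the helper name proxy_java_opts and the host proxy.example.org:9898 are placeholders, not the real Eclipse proxy settings. The four properties are the standard Java networking system properties; whether the JavaDoc step actually honors them when set this way is exactly the open question here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a proxy host and port into the standard Java
# networking system properties (http.proxyHost and friends).
proxy_java_opts() {
  local host="$1" port="$2"
  printf '%s %s %s %s\n' \
    "-Dhttp.proxyHost=${host}" "-Dhttp.proxyPort=${port}" \
    "-Dhttps.proxyHost=${host}" "-Dhttps.proxyPort=${port}"
}

# Append to any existing ANT_OPTS and export, so child processes see it.
# proxy.example.org:9898 is a placeholder value.
ANT_OPTS="${ANT_OPTS:+${ANT_OPTS} }$(proxy_java_opts proxy.example.org 9898)"
export ANT_OPTS
echo "ANT_OPTS=${ANT_OPTS}"
```

Setting the value explicitly in the job, rather than inheriting it from a login profile, keeps the build independent of any particular user's settings.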

So, for now, I am going to be running the N-builds from 
https://hudson.eclipse.org/releng/job/N-build/
even though the JavaDoc will be broken. And, perhaps we can get enough data and run enough experiments to get JavaDoc running well again.
Comment 10 David Williams CLA 2016-06-16 23:12:44 EDT
This work is nearly done. 

All the cron jobs that were under "e4Build" ID have been migrated -- these were the cronjobs that "did the build". 

All have been migrated except one, which was a relatively minor nicety to remind me to comment out any rebuilds that we had done. I have opened bug 496280 to implement that, someday. 

= = = = = = 

The other place where there were cronjobs were under my own Eclipse committer Id. 

Those were the ones to "promote" builds to the download server, and similar. 

All the ones involving "promotion" or "collecting test results" have been migrated. 

But in fact, this whole process can be simplified now that we are running on Hudson with a "blessed" id. I have opened bug 496281 to handle "promotion of the builds". 

Gathering and processing the test results can also be simplified, but is a bit more complicated, so entered another bug for that, bug 496282. 

= = = = = =

What remains? Only 2 related to Eclipse builds and tests, and one related to Orbit which I won't cover here. 

The two for Eclipse are related to the Derby database server -- used for the performance tests -- currently running on "build.eclipse.org". 

One job would just check several times a day to make sure it is still running. And, honestly, I added this at the beginning because it would sometimes "go down" -- I think from out of memory issues or similar -- and it has not "gone down" for probably over a year or so, so I am not sure we need that any more -- but if we did, we could move that to the Hudson Releng HIPP. 

The other was a cronjob executed on "reboot" -- it would start up the database. 

It does not really make sense to have that sort of "reboot" job on Hudson, since these are completely different machines. We might be able to get webmasters to add it to ?something? to start it up on reboot. Or, perhaps we could simply have some Hudson jobs to issue "start" to it, and manually start it if we ever found it to be "down". And, honestly, I am only 70% sure it needs to be running before the performance tests (or analysis jobs) are run -- for all I know, they might be able to "start it up" on their own? 

= = = = =

Still need to do some testing to confirm all works, but so far it is. 

The "N-builds" have been running for a week or more from the Hudson HIPP. The significant change I made just tonight was to remove the "production" section and the "sdk" section on the build machine. Both of these held "standard scripts" and would have to be updated "by hand" from time to time if there were changes made. But no more: the Releng HIPP simply gets the latest from master and executes.
Comment 11 David Williams CLA 2016-06-17 09:59:06 EDT
It is significant that last night's N-build worked completely from 
https://hudson.eclipse.org/releng/

That is, there was no "production" or "sdk" on the build machine. All scripts to execute came from https://hudson.eclipse.org/releng/. 

One previously untested part was the "performance tests". Previously, "running the analysis" for the performance tests required that we start "xvfb" on the build machine. I added logic so that "if running_on_hudson" we do NOT use that xvfb route (it depended on a script that was "build time only" and should not be checked into Git); instead we use Xvnc from Hudson, and I was relieved to see that worked! 

= = = = = =

We do still use the file system on /shared/eclipse to "store the build repo and various results" but all execution comes from the HIPP instance. (The builds seem to run a little faster, presumably because that is a higher powered machine, or because it is simply "less shared" than build.eclipse.org, which often has a heavy load.) 

One area of the file system we use, /shared/eclipse/builds, looks like the following: 

/shared/eclipse/builds/4N
/shared/eclipse/builds/4N/siteDir
/shared/eclipse/builds/4N/tmp
/shared/eclipse/builds/4N/gitCache
/shared/eclipse/builds/4N/localMavenRepo

This is "the build" area for "4N". There are similar structures for 4I, 4P, 4M, and 4Y. 
After a build, there is also a "localMavenRepo.bak" directory, which is where we move the previous localMavenRepo before a build so that localMavenRepo is "fresh" each time. We mostly keep the "bak" in case there is some mysterious failure. We could at least "diff" localMavenRepo and localMavenRepo.bak to see if there were any unexpected changes in the localMavenRepo. [In all honesty, I have never used this ability to look at "diffs" and as long as we do not use "SNAPSHOTS" of maven prereqs, we would not expect any differences.] 
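
The rotation described above can be sketched roughly as follows. The function names rotate_local_maven_repo and diff_local_maven_repo are hypothetical and this is not the actual build script; only the directory names localMavenRepo and localMavenRepo.bak come from the description.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep the previous repo as localMavenRepo.bak and start
# the next build with an empty localMavenRepo.
rotate_local_maven_repo() {
  local build_root="$1"                 # e.g. /shared/eclipse/builds/4N
  local repo="${build_root}/localMavenRepo"
  local backup="${repo}.bak"
  rm -rf "$backup"                      # only one generation is kept
  if [ -d "$repo" ]; then
    mv "$repo" "$backup"
  fi
  mkdir -p "$repo"
}

# After a build, list the files that differ between the two (names only).
diff_local_maven_repo() {
  local build_root="$1"
  diff -rq "${build_root}/localMavenRepo.bak" "${build_root}/localMavenRepo"
}
```

For example, `rotate_local_maven_repo /shared/eclipse/builds/4N` would run before the build and `diff_local_maven_repo /shared/eclipse/builds/4N` afterwards, if a mysterious failure ever needed investigating.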

= = = = = =

The area we use for the "test results queue" is 
/shared/eclipse/testjobqueue

Its listing as of right now shows: 

$ ls -tr1
ERROR_testjobdata201606160244.txt
RAN_testjobdata201606162339.txt
RAN_testjobdata20160617013346338026881.txt
RAN_testjobdata20160617015622605463639.txt
RAN_testjobdata201606170245.txt
RAN_testjobdata20160617030330230811723.txt
RAN_testjobdata20160617065400276755927.txt

The "ERROR" one is from some previous build, I assume, before I had all changes "in". 
When they are first put in the queue, they have names such as 
testjobdata20160617065400276755927.txt 
and it is that "testjobdata*" that the cronjob looks for. 
As the cronjob processes those results, the name is changed to 
"RUNNING_testjobdata..."
and when finished, it is changed to 
RAN_testjobdata...  if successful, or 
ERROR_testjobdata... if an error occurred. 
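
The rename protocol just described can be sketched as a small loop. This is an illustration only, not the actual cronjob: process_queue is a hypothetical name and "handler" stands for whatever command processes one result file.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the rename protocol:
#   testjobdata* -> RUNNING_testjobdata* -> RAN_* or ERROR_*
# "handler" is any command or function that takes the file as its argument.
process_queue() {
  local queue_dir="$1" handler="$2"
  local entry name
  for entry in "$queue_dir"/testjobdata*; do
    [ -f "$entry" ] || continue          # glob matched nothing
    name="$(basename "$entry")"
    mv "$entry" "$queue_dir/RUNNING_$name"
    if "$handler" "$queue_dir/RUNNING_$name"; then
      mv "$queue_dir/RUNNING_$name" "$queue_dir/RAN_$name"
    else
      mv "$queue_dir/RUNNING_$name" "$queue_dir/ERROR_$name"
    fi
  done
}
```

Because the glob only matches names that start with "testjobdata", entries already marked RUNNING_, RAN_, or ERROR_ are never picked up a second time.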

Those with long timestamps are "unit tests", those with short ones are performance tests. They should all probably be long timestamps. The long timestamps were added after two unit tests, once over the past several years, finished at the "same time" and ended up with the same (short) timestamp. So, that is why I made them longer. 

This will be obvious from looking at the code, but the contents of those files look like the following: 

$ cat RAN_testjobdata20160617065400276755927.txt
ep46N-unit-win32 274 N20160616-2001 4.6.0 4533c0a4bd1db8316958d914a6f1e1b41a3d7cd7

That is the Hudson job name, Hudson build number, buildId, Stream, and "hash" of the "aggregation build" to use to "finish processing". (A shallow copy of the aggregator is also made in "siteDir" so that multiple builds and tests can be taking place and each knows which version of the aggregator to use -- just in case changes are made.) 
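
A sketch of reading one such record into named fields. The function name read_test_job_record and the variable names are hypothetical; only the order of the five fields comes from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical reader for one queue entry: five whitespace-separated fields.
read_test_job_record() {
  local file="$1"
  local job_name build_number build_id stream aggregator_hash
  # read returns non-zero on a missing trailing newline, so also accept the
  # case where the fields were still filled in.
  read -r job_name build_number build_id stream aggregator_hash < "$file" \
    || [ -n "$aggregator_hash" ] || return 1
  printf 'job=%s build=%s id=%s stream=%s hash=%s\n' \
    "$job_name" "$build_number" "$build_id" "$stream" "$aggregator_hash"
}
```

For the file shown above this prints one line with job=ep46N-unit-win32, build=274, id=N20160616-2001, stream=4.6.0 and the 40-character hash.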

In fact, that aggregator, with that hash, is converted to a "zip" and uploaded to "downloads" with the rest of the build. It is used during unit tests to "get the scripts it needs" and has the advantage that tests can be re-run at a much later time, even if the aggregator changes a lot. (These are all shallow copies of the aggregator; they do not recursively fetch submodules!) 

Next: promotion queues.
Comment 12 David Williams CLA 2016-06-17 10:19:14 EDT
The other two areas of /shared/eclipse/ we use for "data" are

/shared/eclipse/promotion/queue
and 
/shared/eclipse/equinox/promotion/queue/

These are the scripts that cronjobs (running on Releng HIPP) look for in order to "promote" what is on build machine to download server. 


The equinox one is the simplest (and, I can see a "bug" now that I look :) 

The current contents of the equinox queue are 

RAN_promote-4.6.0-N20160612-2000.sh
RAN_promote-4.6.0-N20160613-0915.sh
RAN_promote-4.6.0-N20160613-2000.sh
RAN_promote-4.6.0-N20160614-2001.sh
RAN_promote-Neon.sh
manual-promote-Neon.sh
RAN_promote-4.6.0-N20160615-2001.sh
promote-4.6.0-N20160616-2001.sh

The "bug" is that promote-4.6.0-N20160616-2001.sh had not run yet, though there would have been plenty of time. 

The manual-promote-Neon.sh is waiting for the "makeVisible" job that Sravan is going to run next Wednesday. 

(The bug *might* be related to that ... "exits loop" when it finds an unexpected file? Just a guess). 

= = = = =

Gotta run to appointment, I'll write more and look at bug this afternoon.
Comment 13 David Williams CLA 2016-06-17 13:05:47 EDT
The bug for equinox promotion turned out to be a simple oversight in the Hudson job. For all "equinox jobs" I had not yet added the step where we make a shallow clone of utilities and execute from that clone. It was still trying to execute from /shared/eclipse/sdk/promotion/...sh 
so the fix was easy, and once done both the "clean up" and "promotion" worked as expected. 

So to finish the equinox topic, the contents of those "promote-*" scripts, as an example, is a simple rsync: 

$ cat RAN_promote-4.6.0-N20160616-2001.sh
# promotion script created at 201606162113
rsync --times --omit-dir-times --recursive "/shared/eclipse/builds/4N/siteDir/equinox/drops/N20160616-2001" "/home/data/httpd/download.eclipse.org/equinox/drops/"


So that should be easy to simplify in bug 496281. 
Just find the spot where we write that rsync, and then instead of writing it to a file, simply execute it.
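
A sketch of that "simply execute it" idea. The function name promote_equinox_drop and the RSYNC override variable are hypothetical; the rsync options and the default paths are copied from the queued script above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run the rsync directly instead of writing it to a
# queue file. Set RSYNC=echo to print the arguments as a dry run.
promote_equinox_drop() {
  local build_id="$1"
  local src_root="${2:-/shared/eclipse/builds/4N/siteDir/equinox/drops}"
  local dest="${3:-/home/data/httpd/download.eclipse.org/equinox/drops/}"
  "${RSYNC:-rsync}" --times --omit-dir-times --recursive \
    "${src_root}/${build_id}" "$dest"
}
```

Calling `RSYNC=echo promote_equinox_drop N20160616-2001` prints the arguments that would be passed to rsync, which is a cheap way to check the paths before a real promotion.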
Comment 14 David Williams CLA 2016-06-17 13:34:52 EDT
To leave details about "SDK promotion": it is accomplished by writing a script to 

/shared/eclipse/promotion/queue

The scripts are named similarly to the equinox ones:
promote-4.6.0-N20160616-2001.sh

And are renamed "RUNNING_promote..." while running, and end up named RAN_promote... or ERROR_promote...

The contents are more complicated than equinox, though. It calls another script which does the rsync both to "downloads" and to "updates" (depending on build type), and then adds the "child repo" to the composite. 

$ cat RAN_promote-4.6.0-N20160616-2001.sh
#!/usr/bin/env bash
# promotion script created at 201606162113
/home/hudson/genie.releng/.hudson/jobs/N-build/workspace/utilities/production/sdk/promotion/syncDropLocation.sh 4.6.0 N20160616-2001 4533c0a4bd1db8316958d914a6f1e1b41a3d7cd7 /shared/eclipse/builds/4N/siteDir/eclipse/downloads/drops4/N20160616-2001/buildproperties.shsource

Note that in the above recent example, it "points back" to the Hudson machine, where we make the shallow clone under 'utilities', to find the script it is to run:
 
/home/hudson/genie.releng/.hudson/jobs/N-build/workspace/utilities/production/sdk/promotion/syncDropLocation.sh

The arguments are pretty obvious except for the last

Stream: 
4.6.0 

Build_id
N20160616-2001 

Hash of the aggregator used to produce the build (the promotion eventually calls some methods 
from *that* shallow copy of the aggregator, instead of the "master" version on Hudson):

4533c0a4bd1db8316958d914a6f1e1b41a3d7cd7 

And this file, produced by the build and stored under the build machine's version of the "download site", records a lot of values that are relevant during the build. I forget which ones we literally use during this "promote" step, but probably 3 or 4. 

/shared/eclipse/builds/4N/siteDir/eclipse/downloads/drops4/N20160616-2001/buildproperties.shsource
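
To make the shape of these queue entries concrete, here is a sketch of how such a promote script could be generated. The function name write_promote_script and its parameter order are hypothetical; the file name pattern, the comment line, and the four arguments mirror the example above.

```shell
#!/usr/bin/env bash
# Hypothetical generator for a queue entry such as
# promote-4.6.0-N20160616-2001.sh. Prints the path of the file it wrote.
write_promote_script() {
  local queue_dir="$1" sync_script="$2" stream="$3"
  local build_id="$4" hash="$5" props="$6"
  local out="${queue_dir}/promote-${stream}-${build_id}.sh"
  {
    echo '#!/usr/bin/env bash'
    echo "# promotion script created at $(date +%Y%m%d%H%M)"
    echo "${sync_script} ${stream} ${build_id} ${hash} ${props}"
  } > "$out"
  chmod +x "$out"
  echo "$out"
}
```

The cronjob on the Releng HIPP would then pick the file up from the queue and rename it as it runs, as described above.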
Comment 15 David Williams CLA 2016-06-17 13:59:23 EDT
I have opened bug 496316 to try the database stuff, and once that is done will consider this bug "finished".
Comment 16 David Williams CLA 2016-06-17 17:29:27 EDT
I am counting this as "fixed". All the cronjobs previously under "e4Build" have been removed, and all under my ID have been removed, except for the @reboot one mentioned in bug 496316. I will send a note to webmaster to see if he has suggestions about that one.
Comment 17 David Williams CLA 2016-06-17 18:27:18 EDT
Almost forgot about the ability to "tag" from the Releng HIPP instance so opened bug 496335. (Had previously sent email). 

If that's not approved, all this will have been for nought. :( 
[But, pretty sure it will be, since it is just the same as we are doing now!]