Bug 378303 - Auto testing failed due to "file not available" ... timing problem
Status: RESOLVED INVALID
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Releng
Version: 4.2
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: David Williams CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 377365
Reported: 2012-05-02 16:41 EDT by David Williams CLA
Modified: 2012-05-09 22:18 EDT (History)

See Also:


Attachments

Description David Williams CLA 2012-05-02 16:41:16 EDT
The recent attempt to automatically start unit testing after a build failed with "file not found" errors when getting the zips that had already been uploaded (with rsync). It appears there is just a delay between when the files are uploaded and when they can be "seen" by others. Similar to bug 378038.

I think the fix in this case is to put in a "sleep" of something like 5 minutes (10 minutes?) before starting the tests. (I recall Kim's code originally had a sleep at some point; I didn't understand why, so I removed it, since so much else was changing anyway.)


platformIndependentZips:
      [get] Getting: http://download.eclipse.org/eclipse/downloads/drops4/I20120502-1300/eclipse-Automated-Tests-I20120502-1300.zip
      [get] To: c:\hb\workspace\JUnit-win2\workarea\I20120502-1300\eclipse-Automated-Tests-I20120502-1300.zip
      [get] Error opening connection java.io.FileNotFoundException: http://download.eclipse.org/eclipse/downloads/drops4/I20120502-1300/eclipse-Automated-Tests-I20120502-1300.zip
      [get] Error opening connection java.io.FileNotFoundException: http://download.eclipse.org/eclipse/downloads/drops4/I20120502-1300/eclipse-Automated-Tests-I20120502-1300.zip
      [get] Error opening connection java.io.FileNotFoundException: http://download.eclipse.org/eclipse/downloads/drops4/I20120502-1300/eclipse-Automated-Tests-I20120502-1300.zip
      [get] Can't get http://download.eclipse.org/eclipse/downloads/drops4/I20120502-1300/eclipse-Automated-Tests-I20120502-1300.zip to c:\hb\workspace\JUnit-win2\workarea\I20120502-1300\eclipse-Automated-Tests-I20120502-1300.zip
  [antcall] Exiting c:\hb\workspace\JUnit-win2\org.eclipse.releng.eclipsebuilder\runTests2.xml.

BUILD FAILED
Comment 1 David Williams CLA 2012-05-02 16:56:51 EDT
Put a 5 minute sleep in masterBuild.sh (before invoking ant to run the tests).

Hope that's enough.
Comment 2 David Williams CLA 2012-05-02 21:41:39 EDT
5 minutes wasn't enough, I'll try 10. 

But now I'm getting concerned about "always waiting" just to submit the job.

Sometimes, for example, the job is put in the hudson job queue, and may have to wait there a long time before finally executing, so in those cases we wouldn't have to wait up front.

A better solution might be to wrap each "get" in a "retry" task: 


<retry retrycount="3">
  <get src="http://www.unreliable-server.com/unreliable.tar.gz" 
       dest="/home/retry/unreliable.tar.gz" />
</retry>


get is supposed to default to 3 retries anyway ... and I'm not sure how long it waits between retries ... I'm thinking it'd have to be 300 or something to start being equivalent to 10 or 15 minutes.

(For the record, the retries attribute on the get task is only available since Ant 1.8.0, so we might be able to use that ... will have to make sure we can use that on all hudson instances/slaves (some are set as "default", which is usually unknown), whereas the <retry> element has been available since Ant 1.7.1 ... pretty much guaranteed.)
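
For illustration, a minimal sketch of the two options compared above. It assumes Ant 1.8.0 or later for the retries attribute and uses the antversion condition to fall back to the <retry> wrapper on older installs. The project, target, and property names (get-retries-sketch, fetchTestZip, haveGetRetries) are made up for this sketch; archiveLocation, testDir, and buildId are assumed to be defined elsewhere.

```xml
<project name="get-retries-sketch" default="fetchTestZip">

    <!-- haveGetRetries is a hypothetical property name;
         it is set only when the running Ant is 1.8.0 or later. -->
    <condition property="haveGetRetries">
        <antversion atleast="1.8.0" />
    </condition>

    <target name="fetchTestZip"
            depends="fetchWithRetriesAttribute, fetchWithRetryTask" />

    <!-- Newer Ant: let the get task retry on its own. -->
    <target name="fetchWithRetriesAttribute" if="haveGetRetries">
        <get verbose="true"
             retries="20"
             src="${archiveLocation}/eclipse-Automated-Tests-${buildId}.zip"
             dest="${testDir}/eclipse-Automated-Tests-${buildId}.zip" />
    </target>

    <!-- Older Ant: wrap the get in the retry task instead. -->
    <target name="fetchWithRetryTask" unless="haveGetRetries">
        <retry retrycount="20">
            <get verbose="true"
                 src="${archiveLocation}/eclipse-Automated-Tests-${buildId}.zip"
                 dest="${testDir}/eclipse-Automated-Tests-${buildId}.zip" />
        </retry>
    </target>

</project>
```

Neither variant, as written, adds a pause between attempts. The three back-to-back "Error opening connection" lines in the log above are consistent with get retrying immediately, so a retry count alone probably cannot stand in for a 10 or 15 minute wait.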
Comment 3 David Williams CLA 2012-05-03 00:05:40 EDT
I think I've found an improved method: instead of a blanket wait, retry 20 times (currently), with 1 minute of sleep in between retries. Surely it'll be ready by then? (And we don't have to "wait longer than needed".)


        <retry retrycount="20">
            <get verbose="true" 
                src="${archiveLocation}/eclipse-Automated-Tests-${buildId}.zip"
                dest="${testDir}/eclipse-Automated-Tests-${buildId}.zip" />
            <sleep minutes="1" />
        </retry>
Comment 4 David Williams CLA 2012-05-03 01:13:14 EDT
(In reply to comment #3)
If it's not one thing, it's another. When I tried the above code, I got an error message that said "retry can only contain one element, which may be a sequential container element". So, I'll try

<retry>
 <sequential>
    <get
       verbose="true"
       src="${archiveLocation}/eclipse-Automated-Tests-${buildId}.zip"
        dest="${testDir}/eclipse-Automated-Tests-${buildId}.zip" />
    <sleep minutes="1" />
 </sequential>
</retry>
Comment 5 David Williams CLA 2012-05-03 18:16:31 EDT
This cannot be right. 20 minutes, and still not available?!

Webmasters, I'll need your help here. Why would it take so long for files on "downloads" to be available to a hudson job?

I'm beginning to think it's not "time" related at all. "Same process" related, or something? (It could not literally be the same process, of course, since these are totally different machines, but what could it be?)

I was not watching "minute by minute" as it retried, but the file was definitely available when I checked later.

And, I know the hudson jobs can find the files when I restart the job later.

20 minutes is excessive, right? Like impossibly excessive? (to be "visible" via HTTP?)




    [retry] Attempt [19]: error occurred; retrying...
      [get] Getting: http://download.eclipse.org/eclipse/downloads/drops4/I20120503-1500/eclipse-Automated-Tests-I20120503-1500.zip
      [get] To: c:\hb\workspace\JUnit-win2\workarea\I20120503-1500\eclipse-Automated-Tests-I20120503-1500.zip
      [get] Error opening connection java.io.FileNotFoundException: http://download.eclipse.org/eclipse/downloads/drops4/I20120503-1500/eclipse-Automated-Tests-I20120503-1500.zip
      [get] Error opening connection java.io.FileNotFoundException: http://download.eclipse.org/eclipse/downloads/drops4/I20120503-1500/eclipse-Automated-Tests-I20120503-1500.zip
      [get] Error opening connection java.io.FileNotFoundException: http://download.eclipse.org/eclipse/downloads/drops4/I20120503-1500/eclipse-Automated-Tests-I20120503-1500.zip
Comment 6 David Williams CLA 2012-05-03 20:57:01 EDT
Here's a theory ... while watching a log "live" it did not seem to be "waiting" for 20 minutes, but rapidly trying 20 times in a row. So, my theory is, if the "get" fails, then everything else in the container is skipped, so it doesn't really sleep in between retries. So ... I've moved the "sleep" to the first part of the container.

Also increased the wait to 2 minutes, and specified "httpusecaches" as false (which is just a hint; I'm not sure if the inner Eclipse infrastructure will cache http states or not, or honor a request not to).

It'll be a while before I can test, so will leave the bug open a bit longer. 

<retry retrycount="20">
     <sequential>
       <sleep minutes="2" />
       <get
          verbose="true" httpusecaches="false" 
          src="${archiveLocation}/eclipse-Automated-Tests-${buildId}.zip"
          dest="${testDir}/eclipse-Automated-Tests-${buildId}.zip" />
     </sequential>
</retry>
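
A possible variation on the snippet above, sketched under the assumption that the installed Ant's <retry> task supports the retrydelay attribute (a delay in milliseconds between attempts; it is not present in Ant 1.7.1, so this should be checked against the manual for the Ant version on each hudson slave). With a delay on the retry task itself, the first attempt runs immediately and the pause only happens after a failure, so no <sequential> or leading <sleep> is needed.

```xml
<!-- Assumes a <retry> that understands retrydelay (milliseconds).
     120000 ms = 2 minutes between attempts, up to 20 retries. -->
<retry retrycount="20" retrydelay="120000">
    <get verbose="true"
         httpusecaches="false"
         src="${archiveLocation}/eclipse-Automated-Tests-${buildId}.zip"
         dest="${testDir}/eclipse-Automated-Tests-${buildId}.zip" />
</retry>
```

Compared with putting the sleep first in the container, this avoids the unconditional 2 minute wait when the zip is already available.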
Comment 7 David Williams CLA 2012-05-09 22:18:07 EDT
I'm marking this as invalid, since the fundamental problem is not a problem at all, but built into the way things work. We do still need to "wait", but the main issue is that the build produces the build, with the e4Build id, but cannot promote it to the downloads server, since only a committer can. So, under my committer ID, I have a cron job that checks every 10 minutes if there is anything to promote. Once it detects there is, it takes several minutes (3 to 8?) to actually copy everything in place. The build itself "kicks off" the test jobs right at the end of the build, and it knows nothing about the promotion time, so ... duh ... it's perfectly normal for there to be a 10 minute or so delay.

So, we do need the "wait loop", but there's nothing wrong with infrastructure.
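
As an alternative way to express such a wait loop (not the retry/sleep approach used in this bug), here is a rough sketch using Ant's waitfor task with an http condition, which polls the URL until it stops returning an error status. The project, target, and property names (wait-for-promotion-sketch, fetchWhenPromoted, testZipUrl, promotionTimedOut) are hypothetical; archiveLocation, testDir, and buildId are assumed to be defined by the caller. The 30 minute ceiling is an assumption derived from the numbers above: up to 10 minutes until the cron job runs, plus up to roughly 8 minutes of copying, is about 18 minutes in the worst case, with some margin added.

```xml
<project name="wait-for-promotion-sketch" default="fetchWhenPromoted">

    <target name="fetchWhenPromoted">
        <!-- testZipUrl is a hypothetical helper property for this sketch. -->
        <property name="testZipUrl"
                  value="${archiveLocation}/eclipse-Automated-Tests-${buildId}.zip" />

        <!-- Poll once a minute, for at most 30 minutes, until the URL
             answers with a status below 400 (the http condition's default). -->
        <waitfor maxwait="30" maxwaitunit="minute"
                 checkevery="1" checkeveryunit="minute"
                 timeoutproperty="promotionTimedOut">
            <http url="${testZipUrl}" />
        </waitfor>

        <!-- Stop with a clear message if the build was never promoted. -->
        <fail if="promotionTimedOut"
              message="Build ${buildId} was not promoted within 30 minutes." />

        <get verbose="true"
             httpusecaches="false"
             src="${testZipUrl}"
             dest="${testDir}/eclipse-Automated-Tests-${buildId}.zip" />
    </target>

</project>
```

This only checks that the one zip is reachable; it says nothing about whether the rest of the promotion has finished copying, so jobs that need several files would have to wait on the last one promoted or check each URL.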