Bug 372319 - Unexpected 'The local variable i may not have been initialized' compile error
Summary: Unexpected 'The local variable i may not have been initialized' compile error
Status: VERIFIED FIXED
Alias: None
Product: JDT
Classification: Eclipse Project
Component: Core (show other bugs)
Version: 3.8   Edit
Hardware: PC Windows 7
: P3 major (vote)
Target Milestone: 3.8 M7   Edit
Assignee: Stephan Herrmann CLA
QA Contact:
URL:
Whiteboard:
Keywords:
: 372944 374651 (view as bug list)
Depends on:
Blocks:
 
Reported: 2012-02-23 07:23 EST by Dani Megert CLA
Modified: 2012-04-30 07:24 EDT (History)
20 users (show)

See Also:
srikanth_sankaran: review+


Attachments
example of parallel compilation file (9.29 KB, text/xml)
2012-03-02 02:18 EST, David Williams CLA
no flags Details
Cleanup for field finallyInfo (1.30 KB, patch)
2012-03-02 18:45 EST, Stephan Herrmann CLA
no flags Details | Diff
ResultsSummaryGenerator (37.93 KB, text/plain)
2012-03-06 14:05 EST, Carl Anderson CLA
no flags Details
ValidationConfiguration (40.17 KB, text/plain)
2012-03-06 14:14 EST, Carl Anderson CLA
no flags Details
static non-final fields in jdt.core (4.60 KB, text/plain)
2012-03-11 12:11 EDT, Stephan Herrmann CLA
no flags Details
Table of the problems (10.00 KB, application/octet-stream)
2012-03-14 04:02 EDT, Satyam Kandula CLA
no flags Details
Test project being used to reproduce the problem (6.77 KB, application/octet-stream)
2012-03-15 11:13 EDT, Satyam Kandula CLA
no flags Details
Patch for force hashCode methods to return 0 (39.47 KB, patch)
2012-03-17 04:07 EDT, Srikanth Sankaran CLA
no flags Details | Diff
proposed fix (1.39 KB, patch)
2012-03-19 15:25 EDT, Stephan Herrmann CLA
no flags Details | Diff
improved fix (3.59 KB, patch)
2012-03-19 17:55 EDT, Stephan Herrmann CLA
no flags Details | Diff
same patch with a regression test (13.43 KB, patch)
2012-03-19 19:29 EDT, Stephan Herrmann CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dani Megert CLA 2012-02-23 07:23:09 EST
JDT Core plug-in version: 3.8 M5 (org.eclipse.jdt.core_3.8.1.v20120125-1544).

The compiler reports an error in N20120222-2000 which is not expected:

http://download.eclipse.org/eclipse/downloads/drops/N20120222-2000/compilelogs/plugins/org.eclipse.ui.tests.harness_1.2.0.N20120222-2000/@dot.html

This is the code which caused the unexpected error.

	public static void pressKeyCodeCombination(Display display, int[] keyCodes,
			boolean runEventQueue) {
		for (int i = 0; i < keyCodes.length; i++)
			keyCodeDown(display, keyCodes[i], runEventQueue);
		for (int i = keyCodes.length - 1; i >= 0; i--)
			keyCodeUp(display, keyCodes[i], runEventQueue);
	}

For used compiler options and other details, see:
http://download.eclipse.org/eclipse/downloads/drops/N20120222-2000/compilelogs/plugins/org.eclipse.ui.tests.harness_1.2.0.N20120222-2000/@dot.xml


I tried to reproduce it with 3.8 M5 and N20120222-2000, but no "luck". What worries me, is that no error was reported on the same source for previous N-builds.
Comment 1 Stephan Herrmann CLA 2012-02-23 07:53:09 EST
Worrisome, indeed.

When I first saw this I couldn't find which version JDT/Core was used, so...

> JDT Core plug-in version: 3.8 M5 (org.eclipse.jdt.core_3.8.1.v20120125-1544).

... that would be 3.8M5.
But using that I have no "luck" either.

Even using the exact preferences from http://git.eclipse.org/c/platform/eclipse.platform.ui.git/plain/tests/org.eclipse.ui.tests.harness/.settings/org.eclipse.jdt.core.prefs didn't help.

"hardware failure" ;-/
Comment 2 Carl Anderson CLA 2012-03-01 13:56:02 EST
We're seeing the same sorts of issues with the WTP build, along with some CodeCannotBeReached "errors".  In every instance, it is a false positive, and does not repeat in the next build.

Some examples (which will be deleted in a few days):

http://build.eclipse.org/webtools/committers/wtp4x-R3.4.0-I/20120301032607/I-3.4.0-20120301032607/testcompilelogs/plugins/org.eclipse.wtp.releng.tools_1.2.1.v201103080400/wtpRelengTools.jar.bin.html

1. ERROR: UninitializedLocalVariable

The local variable amountRead may not have been initialized

ResultsSummaryGenerator.java :

580 : while (amountRead != -1);

The lines of code in question begin with:

int amountRead = -1;

And:

http://build.eclipse.org/webtools/committers/wtp4x-R3.4.0-I/20120301163247/I-3.4.0-20120301163247/compilelogs/plugins/org.eclipse.wst.validation_1.2.400.v201202090400/@dot.html

1. ERROR: CodeCannotBeReached

Unreachable code

ValidationConfiguration.java :

95 : return buffer.toString(); 

That is a very short method, with that line being the only return, no throws, and only a simple for loop before it.

We were using Eclipse 3.8 M4, and are now using Eclipse 3.8 I20120228-0800, but we still encounter these random issues.
Comment 3 Stephan Herrmann CLA 2012-03-01 16:51:00 EST
Random compiler errors that disappear on the next build without relevant source changes are very irritating. I haven't seen anything like that.

If there is non-determinism, that would likely hint at a concurrency issue.

The relevant information is stored in instances of UnconditionalFlowInfo. Normally, these instances are short lived and never stored in any fields, so visibility across threads is unlikely.
An exception got recently introduced: bug 359334 introduced the field BlockScope.finallyInfo, which means that this one flow info is now visible for the life time of a BlockScope, which in turn is tied to the life time of an AbstractMethodDeclaration.
Ergo, should an AbstractMethodDeclaration ever be shared across threads this could possibly affect this one flow info, too. But if we get so far, concurrency related havoc on flow info still seems much more unlikely than on any other object inside the compiler.

I have no idea, how the same AST would be accessed concurrently. Can this ever be triggered?

Other potential sources for semi-non-determinism:
- memory addresses contributing to hashCode?
  I don't see any hashtables in the neighborhood of flow analysis,
  so this isn't a very hot track either.
- JIT??

That kind of issue has been observed with 3.8 M4, M5 and the most recent integration build, where the most recent symptom is (related but) slightly different.

Maybe we should also collect data on which JVM the bug was observed...
Comment 4 Srikanth Sankaran CLA 2012-03-01 21:42:15 EST
Satyam, could you study this in detail and identify some scenarios/evolve
some hypotheses in conjunction with Stephan ? This is very worrisome.
Comment 5 Srikanth Sankaran CLA 2012-03-01 21:54:18 EST
The first instance of the problem seems to have been
with N20120110-2000, where there were some bogus unreachable
code diagnostics from 


http://git.eclipse.org/c/equinox/rt.equinox.p2.git/tree/bundles/org.eclipse.equinox.p2.repository.tools/src/org/eclipse/equinox/p2/internal/repository/comparator/java/Disassembler.java
Comment 6 Satyam Kandula CLA 2012-03-01 22:35:15 EST
(In reply to comment #3)
It is interesting to see that the spurious errors in both WTP and SDK are similar.

Stephan, nice explanation of the possible cause of errors. Here are some of my comments on this. 

- Concurrency related problems.
 I believe only the compiler component is used in both the cases, and if so there could be hardly any concurrency issues. I will need to confirm this. 

- Hashtable 
  I don't understand this.

- JIT:
  This does seem to fit. As the errors are happening probably at the same place in WTP and SDK, this is likely possible. 

Kim, Can you confirm if the build would have used Java 7 in N20120110-2000 and N20120222-2000? Do we know the versions in this case? 
Carl, Can you confirm if your build uses Java 7 and if so, which version.
Comment 7 Srikanth Sankaran CLA 2012-03-02 00:20:34 EST
(In reply to comment #6)

> Kim, Can you confirm if the build would have used Java 7 in N20120110-2000 and
> N20120222-2000? Do we know the versions in this case? 
> Carl, Can you confirm if your build uses Java 7 and if so, which version.

Just curious - why are you zeroing in on JVM7 ?
Comment 8 David Williams CLA 2012-03-02 01:49:43 EST
WTP (and, I think platform) do use "parallelCompilation" so think it could be a thread-related problem. In one of our build.properties files, we have (complete with original comments :):

# Properties introduced in 3.5 M3 (PDE base builder R35_M2). See
# http://aniefer.blogspot.com/2008/10/sorting-bundles-and-parallel.html

flattenDependencies=true
parallelCompilation=true

We (WTP) don't use VM7 to run the compiler ... we do use an IBM VM ... I'd have to dig to figure out which if 5 or 6 (I think 6) and which version ... but, my money's on the concurrency :) FYI we do use the 32 bit VM even though running on the Eclipse Foundation's 64 bit hardware, in case that'd matter.
Comment 9 Satyam Kandula CLA 2012-03-02 01:55:28 EST
Thanks David, this atleast rules out JIT problems in Java7. Will try to see if see there could be some issues with parallel compilation.
Comment 10 Srikanth Sankaran CLA 2012-03-02 02:03:57 EST
Thanks David, if you happen to remember any recent (last 3-4 months) 
build environment or test  environment changes, let us know.
Comment 11 David Williams CLA 2012-03-02 02:18:33 EST
Created attachment 211937 [details]
example of parallel compilation file

This file would not literally have been from the build that had a failure (we don't keep them around) but assume would be identical to it. One odd failure we saw was in 

org.eclipse.wtp.releng.tools

(line 21)

But, the file shows threadsPerProcessor='3' (we don't change the default) but fairly large number of parallel groups.) 

I add this file just as a concrete illustration ... doubt its of any practical value in reproducing or tracking down. 

I don't recall any changes to hardware or environment, but Carl might remember better than I ... I'm just commenting since I'm up late; to take advantage of time zone differences :)
Comment 12 Stephan Herrmann CLA 2012-03-02 08:16:24 EST
(In reply to comment #6)
> - Hashtable 
>   I don't understand this.

The semi-non-deterministic behavior could potentially be caused, e.g., by iterating over a hashtable, in which case the memory addresses of contained objects could - via hashCode() - influence the order of elements. 
Just a theory.
Comment 13 Stephan Herrmann CLA 2012-03-02 08:17:46 EST
(In reply to comment #12)
> The semi-non-deterministic behavior

I guess "quasi-non-deterministic" is the term I was searching for :)
Comment 14 Stephan Herrmann CLA 2012-03-02 08:32:17 EST
Following up on the concurrency hypothesis:

- isn't each Compiler instance uniquely connected to one ProcessTaskManager?
  => Is there any chance for any instances (of any relevant type) to be shared
  among several concurrent ProcessTaskManagers??

- is there any static field in the entire compiler that could leak relevant
  state to another thread?

From the top of my head I'd answer "no" to both questions. But if concurrency were the culprit, we'd need to find at least one "yes", hmm...
Comment 15 Stephan Herrmann CLA 2012-03-02 18:45:50 EST
Created attachment 212011 [details]
Cleanup for field finallyInfo

The field BlockScope.finallyInfo mentioned in comment 3 could indeed use some cleanup. The patch simply nulls the field when it is no longer needed.

Should concurrent access to any FlowInfo be involved in this bug, then the patch could indeed help to confine those objects to a very short life cycle.

Still we don't have any specific theory, how this could happen.

On another track, since we observe an inconsistency/data loss between
  flowInfo.markAsDefinitelyAssigned(localBinding);
and
  flowInfo.isDefinitelyAssigned(localBinding)
also any confusion of localBinding (or its .id) could cause this kind of havoc.
Comment 16 David Williams CLA 2012-03-03 16:04:52 EST
(In reply to comment #15)

> Still we don't have any specific theory, how this could happen.

I'm not exactly sure what you mean by this ("no theory"), and have not looked at the complete code (just the patch snippet), but, I'm wondering ... if you are saying multiple threads can read/write this field, if it might be a case where the "volatile" keyword is needed. I've heard conflicting things about "volatile", that's its usually not required, usually doesn't do what you think it would, and usually just slows things down ... but ... it does seem required in some threading issues? 

http://www.javamex.com/tutorials/synchronization_volatile.shtml

I myself know just enough about threading to be dangerous (to quote John D. :)
Comment 17 Satyam Kandula CLA 2012-03-05 05:27:58 EST
(In reply to comment #12)
> (In reply to comment #6)
> > - Hashtable 
> >   I don't understand this.
> 
> The semi-non-deterministic behavior could potentially be caused, e.g., by
> iterating over a hashtable, in which case the memory addresses of contained
> objects could - via hashCode() - influence the order of elements. 
> Just a theory.
Thanks for the info.
Comment 18 Satyam Kandula CLA 2012-03-05 05:32:43 EST
(In reply to comment #15)
> Created attachment 212011 [details]
> Cleanup for field finallyInfo.
I doubt if this could help, but the changes look good.
Comment 19 Stephan Herrmann CLA 2012-03-05 06:39:12 EST
(In reply to comment #16)
> (In reply to comment #15)
> 
> > Still we don't have any specific theory, how this could happen.
> 
> I'm not exactly sure what you mean by this ("no theory"), and have not looked
> at the complete code (just the patch snippet), but, I'm wondering ... if you
> are saying multiple threads can read/write this field,

I meant to say: I have no idea how *any* objects in the neighborhood of the problem could ever be shared among threads. But *even if* so, any FlowInfo instances are among the most short-lived and never-stored (with this single exception) objects.

> if it might be a case where the "volatile" keyword is needed. 

Well, I recently had one of those "impossible bugs" where synchronization looked complete - until you read about the Java memory model. But that's a different story: you know about concurrency, you have protected your code via "synchronized" and still field access gives funny results.

In this bug we have to find the trick how sharing objects across thread could become possible in the first place. If we don't find that we have no chance of understanding the problem as a concurrency issue.
Comment 20 Olivier Thomann CLA 2012-03-05 08:45:04 EST
(In reply to comment #8)
> # Properties introduced in 3.5 M3 (PDE base builder R35_M2). See
> # http://aniefer.blogspot.com/2008/10/sorting-bundles-and-parallel.html
> 
> flattenDependencies=true
> parallelCompilation=true
In order to track it down, it would be good to know if this is a new issue in the compiler or if this is an old issue that can now be visible because of these new settings during the compilation.

Would it be possible to patch the build infrastructure to use an old compiler with these settings ?
Comment 21 Srikanth Sankaran CLA 2012-03-06 07:13:15 EST
(In reply to comment #18)
> (In reply to comment #15)
> > Created attachment 212011 [details]
> > Cleanup for field finallyInfo.
> I doubt if this could help, but the changes look good.

I agree. If anything, making this field thread local would have been
a more "appropriate" fix.

Kim, we are going to need some help on this from you - could you pin
down the exact build environment compiler version used to build
N20120222-2000 and N20120110-2000.

Likewise David and Carl, we need the exact version of the build
environment compiler using which these problems were observed.

Until we pin that down the picture could be confusing, but here
is some analysis: 

In the disassembler case, we can make some assertions regarding
some intermediate steps that "must have happened":

Here is the concerned client code:

	switch (c) {
	    case '\b' :
		buffer.append("\\b"); // <<----ALL is well here 
		break;  // <<<< Unreachable code 

Here is an extract of the relevant code from
SwitchStatement#AnalyseCode:

for (int i = 0, max = this.statements.length; i < max; i++) {
    Statement statement = this.statements[i];
 
    ...
    ...

    if ((complaintLevel = statement.complainIfUnreachable(caseInits, this.scope, complaintLevel, true)) < Statement.COMPLAINED_UNREACHABLE) {
	caseInits = statement.analyseCode(this.scope, switchContext, caseInits);
	...
        ...
    }
}

The only way we would have reported a "Unreachable code" error at
the "break;" statement is if the analyseCode invocation on the
message send marked its return flowinfo value as UNREACHABLE.

Now MessageSend#analyzeCode returns its argument flowInfo (which
is known not to be UNREACHABLE as otherwise we would have complained
on the message send itself), after possibly modifying itself at 4
places.

(1) flowInfo = this.receiver.analyseCode(currentScope, flowContext, flowInfo, nonStatic).unconditionalInits();

(2) flowInfo = flowInfo.mergedWith(assertInfo.nullInfoLessUnconditionalCopy()).
addInitializationsFrom(assertWhenTrueInfo.discardInitializationInfo()); 

(3) flowInfo = this.arguments[i].analyseCode(currentScope, flowContext, flowInfo).unconditionalInits();

(4) flowInfo = FakedTrackingVariable.markPassedToOutside(currentScope, argument, flowInfo, false);

Of these 4 modifications to the input flowInfo, (1) and (3) are
very old and can be discounted. Likewise (2) was introduced very recently
well after the problems were observed.

So it would appear the smoking gun points to (4).

While I say this "must have happened", looking at HEAD as well as looking
at some past revisions of concerned files, it is unclear to me how "this
which must have happend could have happened". 

Hence it is absolutely imperative we reconstruct the build environment
compiler as a source project and then ask questions about how it
could have come to pass that flowinfo at a certain point described
it as being unreachable while analyzing a message send.
Comment 22 Srikanth Sankaran CLA 2012-03-06 08:02:32 EST
(In reply to comment #21)

> Of these 4 modifications to the input flowInfo, (1) and (3) are
> very old and can be discounted. Likewise (2) was introduced very recently
> well after the problems were observed.
> 
> So it would appear the smoking gun points to (4).

A more open line of inquiry would be to just eliminate (2) and ask
how (1) or (3) or (4) could have resulted in an unreachable status
for the break statement and what non-deterministic agent could have
caused it.
Comment 23 Carl Anderson CLA 2012-03-06 09:58:22 EST
The uninitialized local variable was built with Eclipse 3.8 M4.

The unreachable code error was built with Eclipse 3.8 I20120228-0800.

Is that specific enough?  Or do you need specific plugin versions or something else?
Comment 24 Srikanth Sankaran CLA 2012-03-06 10:59:29 EST
(In reply to comment #23)

> Is that specific enough?  Or do you need specific plugin versions or something
> else?

Thanks, this is good enough - sorry I see that you have already provided
this in comment#2.

Would it be possible to attach 

ResultsSummaryGenerator.java
ValidationConfiguration.java

here ? TIA.
Comment 25 Carl Anderson CLA 2012-03-06 14:05:05 EST
Created attachment 212154 [details]
ResultsSummaryGenerator
Comment 26 Carl Anderson CLA 2012-03-06 14:14:00 EST
Created attachment 212156 [details]
ValidationConfiguration
Comment 27 Srikanth Sankaran CLA 2012-03-06 19:55:19 EST
(In reply to comment #26)
> Created attachment 212156 [details]
> ValidationConfiguration


Thanks. Very very interestingly, we have the same buffer.append
method call here as in the disassembler case - so all preliminary
analysis done in comment#21 should hold true:

Here is the method:

public static String getEnabledElementsAsString(ValidatorMetaData[] elements) {
	StringBuffer buffer = new StringBuffer();
	for (ValidatorMetaData vmd : elements) {
		buffer.append(vmd.getValidatorUniqueName());
		buffer.append(ConfigurationConstants.ELEMENT_SEPARATOR);
	}
	return buffer.toString(); // <<<---- UNREACHABLE CODE ERROR HERE
}

Carl, one more question: what is the type of ConfigurationConstants.ELEMENT_SEPARATOR ? (in the disassembler case the
argument to append call was a string literal)
Comment 28 Srikanth Sankaran CLA 2012-03-07 00:16:04 EST
(In reply to comment #21)

> Now MessageSend#analyzeCode returns its argument flowInfo (which
> is known not to be UNREACHABLE as otherwise we would have complained
> on the message send itself), after possibly modifying itself at 4
> places.

That ignores the places where the object is address exposed, so we
need to look at all places where the argument flowinfo was passed
around. Still this looks like a promising line of inquiry to me.

Consider that before the invocation of MessageSend.analyzeCode,
things were just fine and were broken just after ...
Comment 29 Carl Anderson CLA 2012-03-07 00:48:00 EST
(In reply to comment #27)

> Carl, one more question: what is the type of
> ConfigurationConstants.ELEMENT_SEPARATOR ? (in the disassembler case the
> argument to append call was a string literal)

public interface ConfigurationConstants {
String ELEMENT_SEPARATOR = ";"; //$NON-NLS-1$
Comment 30 Srikanth Sankaran CLA 2012-03-07 03:14:35 EST
(In reply to comment #28)

> Consider that before the invocation of MessageSend.analyzeCode,
> things were just fine and were broken just after ...

I stepped through the code trying to imagine what could go wrong and
drew a blank. This one is going to be a tough nut to crack. Even as
we consider various approaches, we should also look into whether we
want to instrument the compiler to gather some data - the unreachable
code problem looks way more amenable than the uninitialized variable
problem.

Given something goes wrong in MessageSend.analyzeCode, we can check
after every use of flowinfo (13 of them) to see if it has been polluted
already (is DEAD_END) and if so log/print a message.

What other data would we want ? 

We need to carefully think about it - we have limited to enable this,
and have it running before we will have to remove it for M7 or so.
Comment 31 Krzysztof Daniel CLA 2012-03-08 02:27:48 EST
*** Bug 372944 has been marked as a duplicate of this bug. ***
Comment 32 Satyam Kandula CLA 2012-03-08 02:49:46 EST
(In reply to comment #31)
> *** Bug 372944 has been marked as a duplicate of this bug. ***
Umm.. one more unitialized bug again coming from a for loop. While Srikanth is exploring instrumentation from the 'dead code' view, I will explore instrumentation from this point of view.
Comment 33 Srikanth Sankaran CLA 2012-03-08 04:43:58 EST
@Override
	public String toString() {
		StringBuffer buf= new StringBuffer();
		for (int i= 0; i < fParts.length; i++) {
			buf.append(fParts[i].length());
			buf.append(',');
			buf.append(fParts[i]);
		}
		return buf.toString();
	}


is the concerned method from bug 372944. 

So here too, we have StringBuffer and append call, Just after the append call
sonmething has gone wrong !!!!!
Comment 34 Ayushman Jain CLA 2012-03-08 05:55:54 EST
(In reply to comment #33)
>[..]
> So here too, we have StringBuffer and append call, Just after the append call
> sonmething has gone wrong !!!!!

The point (4) in comment 21 looks closer to the culprit now, since StringBuffer is a resource, I guess.
Comment 35 Srikanth Sankaran CLA 2012-03-08 06:50:57 EST
(In reply to comment #34)
> (In reply to comment #33)
> >[..]
> > So here too, we have StringBuffer and append call, Just after the append call
> > sonmething has gone wrong !!!!!
> 
> The point (4) in comment 21 looks closer to the culprit now, since StringBuffer
> is a resource, I guess.

Actually it is not a resource and also (4) from comment#21 gets short
circuited internally for the most part for some of the failing cases.
Comment 36 Srikanth Sankaran CLA 2012-03-08 08:01:57 EST
(In reply to comment #33)
> @Override
>     public String toString() {
>         StringBuffer buf= new StringBuffer();
>         for (int i= 0; i < fParts.length; i++) {
>             buf.append(fParts[i].length());
>             buf.append(',');
>             buf.append(fParts[i]);

Since we don't complain about the use of i in buf.append(fParts[i])
it tells us that the flowinfo was kosher at the time of the call to

flowInfo = this.arguments[i].analyseCode(currentScope, flowContext, flowInfo).unconditionalInits();

in  MessageSend.analyzeCode.

Either this call "corrupted" it after it made of the kosher data and
thereby polluting the flowinfo lhs variable or some subsequent use
of flowinfo in MessageSend.analyzeCode did it.

Subsequent uses:

(1) flowInfo = FakedTrackingVariable.markPassedToOutside(currentScope, this.arguments[i], flowInfo, false);

I cannot see how this can cause a problem as most of the code is
short circuited and is effective only for single name references
or allocation expressions - fParts[i] is neither.

(2) analyseArguments(currentScope, flowContext, flowInfo, this.binding, this.arguments);

This should have a bearing only if the method parameters are annotated
with null annotations - the method is from JRE and is not annotated.

(3) manageSyntheticAccessIfNecessary(currentScope, flowInfo);

Again, I cannot see how this can have any bearing.


On another note: both this test case and the one from Results
SummaryGenerator exhibit another interesting commonality: the
last line of the loop contains a reference which we don't
complain about, but we complain about loop terminating check
references. Perhaps there is something wrong in the code that
merges flow data - though what would explain the non-determinism
is beyond comprehension at the moment.
Comment 37 Srikanth Sankaran CLA 2012-03-08 08:04:59 EST
(In reply to comment #36)

> On another note: both this test case and the one from Results
> SummaryGenerator exhibit another interesting commonality: the
> last line of the loop contains a reference which we don't
> complain about, but we complain about loop terminating check
> references. Perhaps there is something wrong in the code that
> merges flow data - though what would explain the non-determinism
> is beyond comprehension at the moment.

i.e in

int amountRead = -1;
do {
    int amountRequested = Math.max(stream.available(), DEFAULT_READING_SIZE);
    if (contentsLength + amountRequested > contents.length) {
	System.arraycopy(contents, 0, contents = new byte[contentsLength + amountRequested], 0, contentsLength);
    }

    amountRead = stream.read(contents, contentsLength, amountRequested);

    if (amountRead > 0) {
        // remember length of contents
	contentsLength += amountRead;
    }
} while (amountRead != -1);

we don't complain about the last use of amountRead inside the loop,
but we complain about the use inside the while check.
Comment 38 Srikanth Sankaran CLA 2012-03-08 10:55:57 EST
(In reply to comment #8)
> WTP (and, I think platform) do use "parallelCompilation" so think it could be a
> thread-related problem. In one of our build.properties files, we have (complete
> with original comments :):
> 
> # Properties introduced in 3.5 M3 (PDE base builder R35_M2). See
> # http://aniefer.blogspot.com/2008/10/sorting-bundles-and-parallel.html
> 
> flattenDependencies=true
> parallelCompilation=true
> 

Does somebody know what exactly these setting do ? My naive understanding
is that these flags enable ANT to launch parallel builds, if so there are
multiple instances of the compiler running on separate VMs ? This would
not raise the sort of spurious data modification problems we are seeing ?

> but, my
> money's on the concurrency :) 

So is mine - only we are stumped at the moment trying to come up with
a scenario where the data would have an unanticipated agent modifying it.

Continuing to investigate ...
Comment 39 Stephan Herrmann CLA 2012-03-08 18:59:41 EST
Sorry, I don't have much time to investigate these days, but my gut feeling says: if concurrency succeeds to creep into the compiler the bug patterns could be much more diverse than what we are seeing. The internal code is not designed for thread safety so every single object in the compiler is vulnerable as soon as some of these are shared between threads. Why only hit on the flowInfo?

I'd like to put some money on JIT.

We don't have much info in this category, only that David reported an IBM JVM 5 or 6. Doesn't sound like sufficient data to rule out a JIT bug.

A JIT bug would fit the pattern that perhaps a specific method (some flowInfo manipulation) is affected - iff JIT-optimized. Perhaps some project who observes the bug "frequently" could disable JIT for the JVM running the compiler for a while?
Comment 40 Srikanth Sankaran CLA 2012-03-08 19:22:02 EST
(In reply to comment #39)
> Sorry, I don't have much time to investigate these days, but my gut feeling
> says: if concurrency succeeds to creep into the compiler the bug patterns could
> be much more diverse than what we are seeing. The internal code is not designed
> for thread safety so every single object in the compiler is vulnerable as soon
> as some of these are shared between threads. Why only hit on the flowInfo?

Indeed, none of the methods in LookupEnvironment is synchronized or otherwise
written to ensure for serial access. AND for the record, we have a single thread
running compiler processing tasks, so these data structures that are getting
stomped on are not being accessed in a concurrent fashion at all.

> A JIT bug would fit the pattern that perhaps a specific method (some flowInfo
> manipulation) is affected - iff JIT-optimized. Perhaps some project who
> observes the bug "frequently" could disable JIT for the JVM running the
> compiler for a while?

What would this show ? Absence of sighting is not going to be good enough ?
Alternately, is there a way to force the VM to always jit optimize a class
or set of classes ? That way we can force the bug to manifest ?
Comment 41 Srikanth Sankaran CLA 2012-03-08 19:29:10 EST
(In reply to comment #8)

> We (WTP) don't use VM7 to run the compiler ... we do use an IBM VM ... I'd have
> to dig to figure out which if 5 or 6 (I think 6) and which version ... but, my
> money's on the concurrency :) FYI we do use the 32 bit VM even though running
> on the Eclipse Foundation's 64 bit hardware, in case that'd matter.

David, could you please dig up this information for us ? The exact version
of the VM + some deployment history (most recent change) would help.
Comment 42 Srikanth Sankaran CLA 2012-03-08 19:30:10 EST
Kim, likewise, could you post here the exact version of the VM used to
run the builds (not the tests) and its deployment date ? TIA,
Comment 43 Srikanth Sankaran CLA 2012-03-08 19:33:37 EST
(In reply to comment #39)

> A JIT bug would fit the pattern that perhaps a specific method (some flowInfo
> manipulation) is affected - iff JIT-optimized. 

I guess there are multiple (at least more than one) translations involved
here too - translate and optimize based on how hot the code is - need to check
with the VM documentation.
Comment 44 Markus Keller CLA 2012-03-08 20:24:26 EST
A strategy to trigger a JIT bug could be to compile a file that already failed once over and over again. JITting is usually triggered by heavy usage of a method.

For an Oracle VM, you could run with -XX:+PrintCompilation to see the JITting:
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
Comment 45 Carl Anderson CLA 2012-03-08 21:50:18 EST
(In reply to comment #41)
> (In reply to comment #8)
> 
> > We (WTP) don't use VM7 to run the compiler ... we do use an IBM VM ... I'd have
> > to dig to figure out which if 5 or 6 (I think 6) and which version ... but, my
> > money's on the concurrency :) FYI we do use the 32 bit VM even though running
> > on the Eclipse Foundation's 64 bit hardware, in case that'd matter.
> 
> David, could you please dig up this information for us ? The exact version
> of the VM + some deployment history (most recent change) would help.

I'm not David, but I am the WTP Build guy, and I can answer this question- the Java that we use can be found at /shared/webtools/apps/ibm-java2-i386-50-SR12FP5 and the -fullversion reports:
java full version "J2RE 1.5.0 IBM Linux build pxi32devifx-20110627 (SR12 FP5 )"

(Note that the BREE for the plugins containing both of these files is 1.5)
We updated our Java levels on 2011-08-13
Comment 46 David Williams CLA 2012-03-08 22:19:17 EST
(In reply to comment #38)
> (In reply to comment #8)
> > WTP (and, I think platform) do use "parallelCompilation" so think it could be a
> > thread-related problem. In one of our build.properties files, we have (complete
> > with original comments :):
> > 
> > # Properties introduced in 3.5 M3 (PDE base builder R35_M2). See
> > # http://aniefer.blogspot.com/2008/10/sorting-bundles-and-parallel.html
> > 
> > flattenDependencies=true
> > parallelCompilation=true
> > 
> 
> Does somebody know what exactly these setting do ? My naive understanding
> is that these flags enable ANT to launch parallel builds, if so there are
> multiple instances of the compiler running on separate VMs ? This would
> not raise the sort of spurious data modification problems we are seeing ?
> 

No, not separate VM's, at least from Ant's point of view. Same VM. I think they control how PDE created the compile.<featurename> like I attached to comment 11. And ... now that I actually read about the 'parallel' ant task ... 
http://ant.apache.org/manual/Tasks/parallel.html
it makes it sound like "compiling" is not something that should be parallelized?!! 

But, depends a little on the "build.jar" target. It might launch a separate VM, but doubt it ... compiling would take forever then! 

Maybe we've just "been lucky" all these years, until some recent changes broke what little thread safety you had?  Even if separate instances of the "Compiler class" are created to compile each file (which would be my guess), from some snippets of your code, it sounds like you use some class (static) variables or methods? Such as in comment 21, you have 

flowInfo = FakedTrackingVariable.markPassedToOutside(currentScope,
argument, flowInfo, false);

Is FakedTrackingVariable.markPassedToOutside a static method? Sounds like it, from naming convention. If so, even multiple instances of compiler class (in same VM) would interfere with each other. That woudln't happen to be your first use of static methods, would it?
Comment 47 David Williams CLA 2012-03-08 22:50:14 EST
For what its worth, I've skim read the "build.jar" PDE task and it doesn't appear to create a new VM, at least in no obvious way. Appears to just (eventually) call the build.xml file it creates for each bundle ... which compiles the source of that bundle, with an ant task similar to 

                <!-- compile the source code -->
                <javac destdir="${build.result.folder}/@dot" failonerror="${javacFailOnError}" verbose="${javacVerbose}" debug="${javacDebugInfo}" includeAntRuntime="no" bootclasspath="${bundleBootClasspath}" source="${bundleJavacSource}" target="${bundleJavacTarget}" errorProperty="compilation.error.occured"          >
                        <compilerarg line="${compilerArg}" compiler="${build.compiler}"/>
                        <classpath refid="@dot.classpath" />
                        <src path="src-catalog/"                        />
                        <src path="src-contentmodel/"                   />
                        <src path="src-validation/"                     />
                        <src path="src/"                        />
                        <src path="src-emfModelSynch/"                  />
                        <src path="src-search/"                 />
                        <compilerarg value="@${basedir}/javaCompiler...args" compiler="org.eclipse.jdt.core.JDTCompilerAdapter"                 />
                        <compilerarg line="-log &apos;${build.result.folder}/@dot${logExtension}&apos;" compiler="org.eclipse.jdt.core.JDTCompilerAdapter"                      />
                </javac>

(this happens to be for our xml.core bundle ... not one that's had any problems) ... I paste it in here to show how it all (sort of) comes down to 

compiler="org.eclipse.jdt.core.JDTCompilerAdapter"

figured you might have first hand knowledge of that class? :) 
I assume it creates an _instance_ of the compiler? to use for that source? 

So, pretty sure we have multiple instances of the compiler class, each running in its own thread. So, not sure "instances" have to be thread safe ... but ... if this is your first use of static methods, such as FakedTrackingVariable.markPassedToOutside, then that might explain everything? 

I'm obviously just making guesses based on names starting with capital letter. 

If you have always done that sort of thing ... maybe we have just been lucky for years?
Comment 48 Satyam Kandula CLA 2012-03-08 23:10:40 EST
(In reply to comment #47)
Thanks David for the build command. JDTCompilerAdapter is the entry point into ecj instead of using javac. The attribute fork is not set and hence it means the compilation is happening in the same process and hence there could be parallel threads.

FakedTrackingVariable.markPassedToOutside is a static method but doesn't cause any sharing of any data. They are no static,non-final fields.
Comment 49 Srikanth Sankaran CLA 2012-03-09 00:08:39 EST
(In reply to comment #44)
> A strategy to trigger a JIT bug could be to compile a file that already failed
> once over and over again. JITting is usually triggered by heavy usage of a
> method.

Good point.

Very conveniently for us, the utility class
org.eclipse.jdt.launching/launching/org/eclipse/jdt/internal/launching/CompositeId.java referenced in https://bugs.eclipse.org/bugs/show_bug.cgi?id=372944 can be compiled all by itself with no dependencies etc.

Satyam, could you create a test based on this file where we would repeatedly
compile this file a few thousand times and see if anything shows up ?

We need to use the identical VM used by  bug 372944, I have asked for it
there.

With a bit of effort, we can isolate a smaller case for Disassembler.java 
too.

Satyam, could you look into this ? Thanks.
Comment 50 Satyam Kandula CLA 2012-03-09 05:21:38 EST
(In reply to comment #49)
> Satyam, could you create a test based on this file where we would repeatedly
> compile this file a few thousand times and see if anything shows up ?

This could be a better idea. I am trying to setup the complete build and having problems. I will try out this. Thanks.
Comment 51 Eclipse Webmaster CLA 2012-03-09 10:28:26 EST
Just to answer a question David asked:

There haven't been any hardware changes to build.eclipse.org since the original PPC box failed and was replaced.  

-M.
Comment 52 Satyam Kandula CLA 2012-03-09 11:07:12 EST
Hi David/Carl,  Can you please let us know the jvm options that you think the
build uses? Thanks.
Comment 53 Kim Moir CLA 2012-03-09 11:29:32 EST
The 3.8 build doesn't actually run on Hudson.  It still runs on IBM hardware.  We are working on getting that running on foundation hardware.

The bootstrap script that actually starts the build is org.eclipse.releng.eclipsebuilder\bootstrap.sh in /cvsroot/eclipse

You can see that there are many JDKs that we use in the build to compile.  The these are the execution environments we use

linuxJdkArchive=jdks/jdk-1_5_0_22-fcs-bin-b03-linux-i586-09_oct_2009.zip
linuxJdkArchive14=jdks/jdk-1_4_2_14-fcs-bin-b05-linux-i586-14_mar_2007.zip
linuxJdkArchive16=jdks/1.6/jdk-6u17-fcs-bin-b04-linux-i586-11_oct_2009.zip
linuxppcJdkArchive=jdks/IBMJava2-SDK-1.4.2-10.0.ppc.tgz
linuxppcJdkArchive15=jdks/ibm-java2-sdk-5.0-6.0-linux-ppc.tgz
windowsJreArchive=jdks/jdk-1_4_2_16-fcs-bin-b05-windows-i586-16_sep_2007.zip
windows15JdkArchive=jdks/jdk-1_5_0_22-fcs-bin-b03-windows-i586-09_oct_2009.zip
windows16JdkArchive=jdks/1.6/jdk-6u17-fcs-bin-b04-windows-i586-11_oct_2009.zip
windows10FoundationArchive=jdks/weme-win-x86-foundation10_6.1.0.20060317-111429.zip
windows11FoundationArchive=jdks/weme-win-x86-ppro11_6.1.1.20061110-161633.zip

The bootclass paths are as follows

bootclasspath="$builderDir/jdk/win32/jdk1.4.2_16/jre/lib/rt.jar:$builderDir/jdk/win32/jdk1.4.2_16/jre/lib/jsse.jar:$builderDir/jdk/win32/jdk1.4.2_16/jre/lib/jce.jar"
bootclasspath_15="$builderDir/jdk/win32_15/jdk1.5.0_22/jre/lib/rt.jar:$builderDir/jdk/win32_15/jdk1.5.0_22/jre/lib/jsse.jar:$builderDir/jdk/win32_15/jdk1.5.0_22/jre/lib/jce.jar"
bootclasspath_16="$builderDir/jdk/win32_16/jdk6_17/jre/lib/rt.jar:$builderDir/jdk/win32_16/jdk6_17/jre/lib/jsse.jar:$builderDir/jdk/win32_16/jdk6_17/jre/lib/jce.jar"
bootclasspath_foundation="$builderDir/jdk/win32_foundation/lib/jclFoundation10/classes.zip"
bootclasspath_foundation11="$builderDir/jdk/win32_foundation11/lib/jclFoundation11/classes.zip"


The  build itself is actually invoked with this VM
jdk/linux16/jdk6_17/jre/bin/java

And the execution environments are as follows when the build is invoked, see J2SE-1.5= for example

$builderDir/jdk/linux16/jdk6_17/jre/bin/java -Xmx500m -Declipse.p2.MD5Check=false -Dorg.eclipse.update.jarprocessor.pack200=$builderDir/jdk/linux/jdk1.5.0_22/bin -jar ../org.eclipse.releng.basebuilder/plugins/org.eclipse.equinox.launcher.jar -Dosgi.os=linux -Dosgi.ws=gtk -Dosgi.arch=ppc -application org.eclipse.ant.core.antRunner -Declipse.p2.MD5Check=false" -q -buildfile buildAll.xml $mail $testBuild $compareMaps -DmapVersionTag=$mapVersionTag -DpostingDirectory=$postingDirectory -Dbootclasspath=$bootclasspath -DbuildType=$buildType -D$buildType=true -DbuildId=$buildId -Dbuildid=$buildId -DbuildLabel=$buildLabel -Dtimestamp=$timestamp -DmapCvsRoot=:ext:kmoir@dev.eclipse.org:/cvsroot/eclipse $skipPerf $skipTest $skipPack $tagMaps -DJ2SE-1.5=$bootclasspath_15 -DJ2SE-1.4=$bootclasspath -DCDC-1.0/Foundation-1.0=$bootclasspath_foundation -DCDC-1.1/Foundation-1.1=$bootclasspath_foundation11 -DOSGi/Minimum-1.2=$bootclasspath_foundation11  -DJavaSE-1.6=$bootclasspath_16 -DlogExtension=.xml $javadoc $updateSite $sign $repoCache -DgenerateFeatureVersionSuffix=true -Djava15home=$builderDir/jdk/linux/jdk1.5.0_22/jre -listener org.eclipse.releng.build.listeners.EclipseBuildListener"
Comment 54 Satyam Kandula CLA 2012-03-09 21:44:47 EST
Kim, thank you for the info.
Comment 55 Stephan Herrmann CLA 2012-03-11 12:11:41 EDT
Created attachment 212432 [details]
static non-final fields in jdt.core

(In reply to comment #48)
> FakedTrackingVariable.markPassedToOutside is a static method but doesn't cause
> any sharing of any data. They are no static,non-final fields.

This is correct for the immediate context of FakedTrackingVariable.

Out of curiosity I assembled a list of all static non-final fields in JDT/Core (attached, fields from "Messages" classes are omitted). None of these fields looked suspicious to me in terms of this bug, most are not part of the compiler actually. The closest I could see is UnconditionalFlowInfo.CoverageTestId. However, all access to this field is actually dead code (protected by the constant COVERAGE_TEST_FLAG).

On a lazy day we may want to check some of these, and make them final wherever possible.

Given the design of a single processing thread and also given the lack of static non-final fields in the vicinity of the bug, we don't have a theory how concurrency could possibly affect the compiler, right?
Comment 56 Stephan Herrmann CLA 2012-03-11 12:25:30 EDT
To follow up on the hypothesis of a JIT bug, I started thinking what kind of code change could possibly produce the observed behavior. I was flabbergasted by finding two almost identical methods in class Disassembler: escapeString and decodeStringValue, where the *second* occurrence triggered the bug, not the first.

If JIT is the culprit, a translation must have happened right while compiling this class (Disassembler.java). Or the minimal difference between both methods must have contributed to the bug. Either way there're really some odds at work!
Comment 57 Deepak Azad CLA 2012-03-11 23:14:24 EDT
http://dev.eclipse.org/mhonarc/lists/platform-releng-dev/msg19736.html

In I20120311-2000 there is a compile error in o.e.jdt.ui, but everything is fine in my workspaces (I checked in two distinct workspaces on 2 different machines). Also I do not see anything that has changed in this area for the last couple of days. 

Is this another instance of this bug, something else?
Comment 58 Srikanth Sankaran CLA 2012-03-12 01:28:30 EDT
(In reply to comment #57)
> http://dev.eclipse.org/mhonarc/lists/platform-releng-dev/msg19736.html

[...]

> Is this another instance of this bug, something else?

For the record, this was a problem in build input and not another
instance of this problem.
Comment 59 Stephan Herrmann CLA 2012-03-12 18:37:49 EDT
For completeness I followed the Hashtable track:

The compiler contains a few locations that may iterate over a Set or a Map, in which case the order may be quasi non-deterministic.

E.g.:
- CompilationResult.getClassFiles() converts a map to an array.
- StackMapFrameCodeStream.getExceptionMarkers() iterates over a set - but then sorts the result.

However, both these seem to be way too late in the process for influencing the flow analysis.

Here's another one closer to the site of crime:
- FakedTrackingVariable.pickVarForReporting(Set, BlockScope, boolean)
Here we pick the first element from a Set. 
The method was added on 2012-01-15, i.e., later than N20120110-2000 when the bug was observed for the first time. Additionally, even this method is called too late (at the end of analyzing a block, or *after* reporting unreachable) for influencing the analysis towards this bug, I should say.

I scanned all Sets/Maps I could find in the compiler, but none locked like a hot track to me. => Still looking for a root cause of the observed non-determinism.
Comment 60 Stephan Herrmann CLA 2012-03-13 05:18:32 EDT
Bug 372944 comment 11 mentions that the error occurred as a flood of errors actually, which could be seen as an indication towards a JIT bug.

Shouldn't we ask people affected by this bug to add "-proceedOnError" to their compiler command lines to check if this will produce the same kind of flood?

Is there also an ant setting controlling whether compilation should stop after the first error?
Comment 61 Srikanth Sankaran CLA 2012-03-13 05:43:16 EDT
(In reply to comment #60)

> Shouldn't we ask people affected by this bug to add "-proceedOnError" to their
> compiler command lines to check if this will produce the same kind of flood?

Satyam, please check the options used for SDK build to see what is the
setting for this ? We would expect it to be not in effect. Turning it on
to see if we see a flood of errors would help.
Comment 62 Srikanth Sankaran CLA 2012-03-13 06:11:28 EDT
While attempting to reproduce locally against a Oracle VM, we should
try both the -server mode and -client mode. Some optimizations kick in
only in server mode and so some bugs show up only in server mode: e.g
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5091921.
Comment 63 Carl Anderson CLA 2012-03-13 14:10:06 EDT
WTP just got hit with this in another build.  In the past, it was one error.  This time, it was 111,862 errors- it appears that we got into this state at the beginning, and that carried through to many, many compiles, across plugins and features.  I see:
UninitializedLocalVariable
UninitializedBlankFinalField
for the vast majority of these errors, and then other errors for files/plugins that build on top of these.  This is still using I20120228-0800 to build.
Comment 64 Stephan Herrmann CLA 2012-03-13 15:36:36 EDT
(In reply to comment #63)
> WTP just got hit with this in another build.  In the past, it was one error. 
> This time, it was 111,862 errors- it appears that we got into this state at the
> beginning, and that carried through to many, many compiles, across plugins and
> features.  I see:
> UninitializedLocalVariable
> UninitializedBlankFinalField
> for the vast majority of these errors, and then other errors for files/plugins
> that build on top of these.  This is still using I20120228-0800 to build.

This perfectly matches the observation from bug 372944 comment 11.

A JIT bug could nicely explain this: once a given method is badly "optimized", it will consistently reproduce the same bug over and over.

Could you please check bug 372944 comment 12 for requests for additional info that might possibly help us? Thanks.
Comment 65 Carl Anderson CLA 2012-03-13 15:48:26 EDT
> 
> This perfectly matches the observation from bug 372944 comment 11.
> 
> A JIT bug could nicely explain this: once a given method is badly "optimized",
> it will consistently reproduce the same bug over and over.
> 
> Could you please check bug 372944 comment 12 for requests for additional info
> that might possibly help us? Thanks.

Comment #45 has the JVM information that we are still using.  This is an IBM JVM, so I have not set the flags in question.  And our builds occur on the build.eclipse.org machine - I don't have its hardware specifications readily available, but Denis Roy could probably fill in the details.
Comment 66 Eclipse Webmaster CLA 2012-03-13 16:17:08 EDT
Build.eclipse.org:

2 Intel Xeon e5540 Quad-core @ 2.53GHz
24 GB RAM
2 Internal HDD
2 Gigabit NICs

If you want something more specific I'll have to go digging.

-M.
Comment 67 Carl Anderson CLA 2012-03-13 16:19:00 EDT
Interesting- the next build only had one UninitializedLocalVariable.  The line
is different from others being evaluated here:
for (int i=0; i < files.length; i++) {
It says that the i++ is the uninitialized local variable.
Comment 68 Ayushman Jain CLA 2012-03-13 16:29:54 EDT
Stumbled across this to diagnose JIT failures - http://publib.boulder.ibm.com/infocenter/javasdk/v5r0/topic/com.ibm.java.doc.diagnostics.50/diag/tools/jitpd.html
Comment 69 Stephan Herrmann CLA 2012-03-13 17:21:46 EDT
Satyam, have you collected a list of bug patterns in this field?

So far I can see:

(a) static final field with initializer: uninitialized blank final field

(b) increment in an old-style for loop: uninitialized local

(c) accessing any local var / arg, perhaps only after a throw/return
  (see bug 372944 comment 18): uninitialized local

(d) return or break statement: unreachable code
  (not sure what is the distinguishing feature in these cases..)

Can we match these patterns against VMs or anything?
Comment 70 Srikanth Sankaran CLA 2012-03-14 02:46:05 EDT
(In reply to comment #69)
> Satyam, have you collected a list of bug patterns in this field?

Just a quick note to inform stake holders that we are busy shipping
M6 this week. Once that is off the plate, this bug will be #1 priority
item for us and we should see faster progress on it.
Comment 71 Srikanth Sankaran CLA 2012-03-14 03:57:53 EDT
http://publib.boulder.ibm.com/infocenter/javasdk/v5r0/index.jsp?topic=%2Fcom.ibm.java.doc.diagnostics.50%2Fdiag%2Ftools%2Fjitpd.html

It would help if experimental builds of WTP could be undertaken
with ecj run on an IBM VM invoked as 

java -Xjit:verbose,optLevel=scorching 

and another time with:

java -Xjit:verbose,optLevel=noOpt

If this a translator/optimizer problem, we would expect the scorching setting
to make the failures deterministic and the noOpt setting to make the
failures go away.

Thanks for any help you are able to offer with this exercise.
We will also undertake similar experiments locally once M6 is cooked
and out of the door.
Comment 72 Srikanth Sankaran CLA 2012-03-14 03:59:37 EDT
(In reply to comment #71)

> java -Xjit:verbose,optLevel=noOpt

I don't know if the noOpt setting is the same as the -Xint option
documented here: http://publib.boulder.ibm.com/infocenter/javasdk/v5r0/index.jsp?topic=%2Fcom.ibm.java.doc.diagnostics.50%2Fdiag%2Ftools%2Fjitpd.html

In any case, the scorching setting is more important.
Comment 73 Satyam Kandula CLA 2012-03-14 04:02:33 EDT
Created attachment 212620 [details]
Table of the problems

Hi Stephan, Here is the table of the problems we have been seeing.
Comment 74 Markus Keller CLA 2012-03-15 07:56:59 EDT
The fact that this problem occurs on IBM and Oracle VMs doesn't really speak for a JIT bug. AFAIK, the JIT compilers of these VMs are completely separate implementations, so it's unlikely that they share the same bug.

I think you should have another look at hashCode()/equals() implementations in lookup environments (flow analysis, bindings, ...). A possible scenario that could lead to the observed effects:
- a hashCode() implementation is not completely deterministic (e.g. depends on Object#hashCode())
- an equals() implementation is buggy
- the equals() is only used for HashMap/Set lookups
=> Now, in rare cases where the (hashCode() % mapSize) collide, objects end up in the same bucket and can be confused. If equals() is not often used outside of HashMaps, this can go undetected for a long time.

A way to make this more likely to occur is to make hashCode() return a constant. That is legal but will kill performance by turning all O(1) operations into O(n).
Comment 75 Srikanth Sankaran CLA 2012-03-15 09:30:26 EDT
(In reply to comment #74)
> The fact that this problem occurs on IBM and Oracle VMs doesn't really speak
> for a JIT bug. AFAIK, the JIT compilers of these VMs are completely separate
> implementations, so it's unlikely that they share the same bug.

I agree that it calls for too much of a co-incidence for a very similar bug
to show up in two VMs, but stranger things have been known to happen. But
while we continue on this line of inquiry, we will not close our minds to
other parallel tracks that we can explore.

Both on this bug and its "duplicate" we have asked for some specific 
experiments to be done which we feel will decisively rule out the 
JIT angle if that is indeed not the culprit.

> I think you should have another look at hashCode()/equals() implementations in
> lookup environments (flow analysis, bindings, ...). A possible scenario that
> could lead to the observed effects:

Stephan/Ayush, does this ring a bell as far as the flow analysis code
that underwent changes foe 3.8 are concerned ?

In my debugging exercises, I found the window at which the flip from
good to bad happens to be extremely narrow and the object that gets
corrupted all of a sudden is simply being passed around to various
methods (i.e is not stored and retrieved) - Nevertheless, what you
outline is a worthy hypothesis to check out.

It would also be a worthwhile exercise to turn off resource leaks analysis
and see if the problem still shows up (assuming that all significant code
chunks have been protected by an option-check). (3.8 changed the flow analysis
in two major ways : (a) null annotations support which is default off and
(b) leaks analysis which is default on) - I would wait for the jit
thesis to be disproved first though. We do have the catch that not sighting
the problem is proof of the problem being fixed - given how sporadic it
is proving to be.

The other point is we also need to explain the flood of error messages
that we get in some of the reports - These spurious problems are
reported across several different files, several different methods.
If the flow info data for a method gets corrupted when we are analyzing
well inside a method, I would not expect cascading errors in completely 
different methods or files: This particular behavior would nicely be
explained by a JIT bug - once some heavily used method got badly
optimized, most/many/all subsequent uses of the method  could trigger
an additional instance of the problem.

All that said, the skepticism you express is healthy and justified.
Comment 76 Srikanth Sankaran CLA 2012-03-15 09:32:34 EDT
(In reply to comment #75)

> thesis to be disproved first though. We do have the catch that not sighting
> the problem is proof of the problem being fixed - given how sporadic it
> is proving to be.

Obviously I meant: "not sighting the problem is NOT proof of the problem
being fixed".
Comment 77 Stephan Herrmann CLA 2012-03-15 09:57:17 EDT
I totally agree with comment 74 and comment 75.

(In reply to comment #75)
> > I think you should have another look at hashCode()/equals() implementations in
> > lookup environments (flow analysis, bindings, ...). A possible scenario that
> > could lead to the observed effects:

This is a very good point.
 
> Stephan/Ayush, does this ring a bell as far as the flow analysis code
> that underwent changes foe 3.8 are concerned ?

No hashCode() implementations have been added lately. A Map and a Set have been introduced for the resource leak analysis, but both are only used for reporting a problem after it has been detected. This is not a good match for the observed behavior, but it must be investigated anyway...

> It would also be a worthwhile exercise to turn off resource leaks analysis
> and see if the problem still shows up (assuming that all significant code
> chunks have been protected by an option-check).

I'd be much relieved to *know* that this part is innocent. Unfortunately, as you say, we can never prove absence of this error.

I guess we're at the point where "likeliness" isn't a useful hint any more.
Unfortunately, for my part a systematic investigation of some 200
occurrences of Set/Map and implementations of hashCode() inside the compiler will have to wait till after EclipseCon.
Comment 78 Stephan Herrmann CLA 2012-03-15 10:01:44 EDT
BTW: while torturing WTP's JVM in an Object Teams build I got this in the JIT log:

....
+ (no-opt) java/util/IdentityHashMap.access$200(Ljava/util/IdentityHashMap;)I @ AF7C1F10-AF7C1F2E
+ (no-opt) java/util/IdentityHashMap$IdentityHashMapIterator.hasNext()Z @ AF7C1F50-AF7C200E
+ (no-opt) java/util/IdentityHashMap$KeyIterator.next()Ljava/lang/Object; @ AF7C2030-AF7C209C
+ (no-opt) java/util/IdentityHashMap$IdentityHashMapIterator.nextIndex()I @ AF7C20B0-AF7C21F3
+ (no-opt) java/util/IdentityHashMap.access$600(Ljava/lang/Object;)Ljava/lang/Object; @ AF7C2210-AF7C2237
+ (no-opt) java/util/IdentityHashMap.unmaskNull(Ljava/lang/Object;)Ljava/lang/Object; @ AF7C2250-AF7C2294
+ (no-opt) java/util/logging/LogManager.reset()V @ AF7C22B0-AF7C2508
+ (no-opt) java/util/logging/LogManager.getLoggerNames()Ljava/util/Enumeration; @ AF7C2530-AF7C2594
+ (no-opt) java/util/logging/LogManager.resetLogger(Ljava/lang/String;)V @ AF7C25B0-AF7C2753
+ (no-opt) java/util/logging/Logger.getHandlers()[Ljava/util/logging/Handler; @ AF7C2770-AF7C27EC
+ (no-opt) java/util/IdentityHashMap$KeySet.iterator()Ljava/util/Iterator; @ AF7C2810-AF7C289B
+ (no-opt) java/util/IdentityHashMap$KeyIterator.<init>(Ljava/util/IdentityHashMap;Ljava/util/Identi
tyHashMap$1;)V @ AF7C28B0-AF7C28DE
+ (no-opt) java/util/IdentityHashMap$KeyIterator.<init>(Ljava/util/IdentityHashMap;)V @ AF7C28F0-AF7C2928
+ (no-opt) java/util/IdentityHashMap$IdentityHashMapIterator.<init>(Ljava/util/IdentityHashMap;Ljava/util/IdentityHashMap$1;)V @ AF7C2950-AF7C297E
! java/util/IdentityHashMap$IdentityHashMapIterator.<init>(Ljava/util/IdentityHashMap;)V Error code=FFFFFFFF

This happened with -Xjit:verbose,vlog=ibmjvm.log,count=1,optLevel=noOpt and disappeared with -Xint.
Comment 79 Srikanth Sankaran CLA 2012-03-15 10:22:25 EDT
Satyam, please attach the test projects, ant scripts etc you are
experimenting with here - This will allow Stephan to critique
them as well play around with them.

(Let us err on the side of over-communicating/documenting side 
rather than on the under-communicating side)
Comment 80 Srikanth Sankaran CLA 2012-03-15 10:28:46 EDT
(In reply to comment #78)
> BTW: while torturing WTP's JVM in an Object Teams build I got this in the JIT
> log:

[...]

> java/util/IdentityHashMap$IdentityHashMapIterator.<init>(Ljava/util/IdentityHashMap;)V
> Error code=FFFFFFFF

What exactly is this showing, Stephan ?

> This happened with -Xjit:verbose,vlog=ibmjvm.log,count=1,optLevel=noOpt and
> disappeared with -Xint.

This shows noOpt != Xint (see comment#72).

(In reply to comment #77)

[...]

> > It would also be a worthwhile exercise to turn off resource leaks analysis
> > and see if the problem still shows up (assuming that all significant code
> > chunks have been protected by an option-check).
> 
> I'd be much relieved to *know* that this part is innocent. Unfortunately, as
> you say, we can never prove absence of this error.

There is one outcome that could clarify matters - if resource analysis
is turned off and the problem still shows up, we will know the problem
is certainly elsewhere (assuming all code is protected).

We can ask for this experiment to be done after the current set of
requests have been attended to.
Comment 81 Stephan Herrmann CLA 2012-03-15 11:11:56 EDT
(In reply to comment #80)
> > java/util/IdentityHashMap$IdentityHashMapIterator.<init>(Ljava/util/IdentityHashMap;)V
> > Error code=FFFFFFFF
> 
> What exactly is this showing, Stephan ?

I haven't the slightest.
Except that JIT wasn't happy (and code failed to run - a class could not be loaded).
 
> > This happened with -Xjit:verbose,vlog=ibmjvm.log,count=1,optLevel=noOpt and
> > disappeared with -Xint.

Using a slightly less drastic setting (count=9) only moved the error to:

! java/util/logging/Logger.getHandlers()[Ljava/util/logging/Handler; Error code=FFFFFFFF
Comment 82 Satyam Kandula CLA 2012-03-15 11:13:01 EDT
Created attachment 212729 [details]
Test project being used to reproduce the problem

Here is the project that I am trying to run on some linux machines to reproduce. The build.cmd is the command line that I am running to reproduce the problem. Note that the vm the build.cmd has is the previous VM. Now, I am running with the b147 Sun JDK7. 

Please let me know if you want to get some more code added.
Comment 83 Srikanth Sankaran CLA 2012-03-15 11:28:56 EDT
(In reply to comment #82)

> Please let me know if you want to get some more code added.

We may have to use full source files as opposed to skeleton versions.
Eliminating fields from a class shifts the local slot id I think.
Likewise, there could be other side effects if we try to work with
simpler cases to begin with.

Can we start with the offending source files in their entirety and
pull in one by one their dependencies to form the minimal buy
complete file set ?
Comment 84 Satyam Kandula CLA 2012-03-15 11:35:11 EDT
(In reply to comment #83)
I agree. As of now CompositeId.java is in full but not the other files. Sure, I will try to add that stuff. Thanks.
Comment 85 Stephan Herrmann CLA 2012-03-15 15:14:11 EDT
Wow, this VM (ibm-java2-i386-50-SR12FP5) is a fragile beast, playing a bit more with its jit option I frequently see the VM die like this:

     [java] +0x48c99
     [java] +0x1ec60
     [java] /opt/public/webtools/apps/ibm-java2-i386-50-SR12FP5/jre/bin/libj9jit23.so(+0x1612fb
     [java] +0x1612fb
     [java] +0x161222
     [java] +0x16117c
     [java] +0x161033
     [java] +0x160f3c
     [java] +0x13f0f4
     [java] +0x13e638
     [java] +0x13e0e8
     [java] +0x86384
     [java] +0x861a2
     [java] +0x4b554
     [java] +0x4b1c7
     [java] +0x1e50a
     [java] +0x4a996
     [java] +0x49a2b
     [java] +0x1e50a
     [java] +0x4980e
     [java] +0x3365
     [java] +0x67b9
     [java] clone+0x5e
     [java] Unhandled exception
     [java] Type=Segmentation error vmState=0x000507ff
     [java] J9Generic_Signal_Number=00000004 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
     [java] Handler1=F74EACFE Handler2=F74ABB69 InaccessibleAddress=00000014
     [java] EDI=BCBA2740 ESI=BCBA25EC EAX=00000000 EBX=BCBA2740
     [java] ECX=BC5DA128 EDX=BCBA25EC
     [java] EIP=F70D4B9F ES=002B DS=002B ESP=BCC88920
     [java] EFlags=00210246 CS=0023 SS=002B EBP=BCC88968
     [java] Module=/opt/public/webtools/apps/ibm-java2-i386-50-SR12FP5/jre/bin/libj9jit23.so
     [java] Module_base_address=F6F73000
     [java] 
     [java] Method_being_compiled=org/eclipse/jdt/internal/compiler/lookup/BlockScope.getBinding([[CILorg/eclipse/jdt/internal/compiler/lookup/InvocationSite;Z)Lorg/eclipse/jdt/internal/compiler/lookup/Binding;
     [java] Target=2_30_20110623_85457_lHdSMr (Linux 2.6.32.45-0.3-default)



It'll be hard to force JITing in the desired way if the VM simply responds with segv.
Comment 86 Srikanth Sankaran CLA 2012-03-15 20:49:09 EDT
(In reply to comment #85)

> It'll be hard to force JITing in the desired way if the VM simply responds with
> segv.

Just so, we are on the same page, could you show the full command line 
and settings ? I assume you are using the scorching mode in a blanket
fashion ? 

The documentation from comment#68, has some information on how to set
optlevel for specific methods IIRC - so we should be able to opt out
of the minefields - though it could get cumbersome.
Comment 87 Srikanth Sankaran CLA 2012-03-16 10:17:33 EDT
(In reply to comment #74)

[...]

> I think you should have another look at hashCode()/equals() implementations in
> lookup environments (flow analysis, bindings, ...). A possible scenario that
> could lead to the observed effects:

I investigated this trail some more and here are some observations:

Test set up:

    - Start with I20120228-0800 build.
    - Import JDT/Core as a source project.
    - Create a Java application launch configuration with
      org.eclipse.jdt.internal.compiler.batch.Main as the main class
      and -source 6.0 c:\CompositeId.java as arguments.
      [Need to check what source level I should actually be using]

Between CompilationUnitDeclaration.analyseCode() and
CompilationUnitDeclaration.generateCode() (i.e the window in which the
problem surfaces) I had a breakpoint in all equals methods + Object's and
I had a breakpoint in all hashCode methods without Object's - I couldn't 
plant a method entry breakpoint in the native method Object.hashCode for
some reason, but since Object.equals had a breakpoint, we should be good.

I could see equals & hashCode called in the following
call stacks only.

UnresolvedReferenceBinding(ReferenceBinding).hashCode() line: 873	
SimpleLookupTable.get(Object) line: 68	
LookupEnvironment.updateCaches(UnresolvedReferenceBinding, ReferenceBinding) line: 1501	
UnresolvedReferenceBinding.setResolvedType(ReferenceBinding, LookupEnvironment) line: 84	
LookupEnvironment.createBinaryTypeFrom(IBinaryType, PackageBinding, boolean, AccessRestriction) line: 686	
LookupEnvironment.createBinaryTypeFrom(IBinaryType, PackageBinding, AccessRestriction) line: 676	
Compiler.accept(IBinaryType, PackageBinding, AccessRestriction) line: 298	
LookupEnvironment.askForType(char[][]) line: 127	
UnresolvedReferenceBinding.resolve(LookupEnvironment, boolean) line: 54	
BinaryTypeBinding.resolveType(TypeBinding, LookupEnvironment, boolean) line: 131	
BinaryTypeBinding.superInterfaces() line: 1329	
BinaryTypeBinding.hasTypeBit(int) line: 998	
ParameterizedTypeBinding.hasTypeBit(int) line: 636	
FakedTrackingVariable.isAnyCloseable(TypeBinding) line: 502	
LocalDeclaration.analyseCode(BlockScope, FlowContext, FlowInfo) line: 82	
MethodDeclaration.analyseCode(ClassScope, FlowContext, FlowInfo) line: 110	
TypeDeclaration.internalAnalyseCode(FlowContext, FlowInfo) line: 710	
TypeDeclaration.analyseCode(CompilationUnitScope) line: 255	
CompilationUnitDeclaration.analyseCode() line: 111	
Compiler.process(CompilationUnitDeclaration, int) line: 776	
ProcessTaskManager.run() line: 137	
Thread.run() line: 736	

BinaryTypeBinding(Object).equals(Object) line: 58	
SimpleLookupTable.get(Object) line: 71	
LookupEnvironment.updateCaches(UnresolvedReferenceBinding, ReferenceBinding) line: 1501	
UnresolvedReferenceBinding.setResolvedType(ReferenceBinding, LookupEnvironment) line: 84	
LookupEnvironment.createBinaryTypeFrom(IBinaryType, PackageBinding, boolean, AccessRestriction) line: 686	
LookupEnvironment.createBinaryTypeFrom(IBinaryType, PackageBinding, AccessRestriction) line: 676	
Compiler.accept(IBinaryType, PackageBinding, AccessRestriction) line: 298	
LookupEnvironment.askForType(char[][]) line: 127	
UnresolvedReferenceBinding.resolve(LookupEnvironment, boolean) line: 54	
BinaryTypeBinding.resolveType(TypeBinding, LookupEnvironment, boolean) line: 131	
BinaryTypeBinding.superInterfaces() line: 1329	
BinaryTypeBinding.hasTypeBit(int) line: 998	
ParameterizedTypeBinding.hasTypeBit(int) line: 636	
FakedTrackingVariable.isAnyCloseable(TypeBinding) line: 502	
LocalDeclaration.analyseCode(BlockScope, FlowContext, FlowInfo) line: 82	
MethodDeclaration.analyseCode(ClassScope, FlowContext, FlowInfo) line: 110	
TypeDeclaration.internalAnalyseCode(FlowContext, FlowInfo) line: 710	
TypeDeclaration.analyseCode(CompilationUnitScope) line: 255	
CompilationUnitDeclaration.analyseCode() line: 111	
Compiler.process(CompilationUnitDeclaration, int) line: 776	
ProcessTaskManager.run() line: 137	
Thread.run() line: 736	

UnresolvedReferenceBinding(ReferenceBinding).hashCode() line: 873	
SimpleLookupTable.get(Object) line: 68	
LookupEnvironment.updateCaches(UnresolvedReferenceBinding, ReferenceBinding) line: 1501	
UnresolvedReferenceBinding.setResolvedType(ReferenceBinding, LookupEnvironment) line: 84	
LookupEnvironment.createBinaryTypeFrom(IBinaryType, PackageBinding, boolean, AccessRestriction) line: 686	
LookupEnvironment.createBinaryTypeFrom(IBinaryType, PackageBinding, AccessRestriction) line: 676	
Compiler.accept(IBinaryType, PackageBinding, AccessRestriction) line: 298	
LookupEnvironment.askForType(char[][]) line: 127	
UnresolvedReferenceBinding.resolve(LookupEnvironment, boolean) line: 54	
BinaryTypeBinding.resolveType(TypeBinding, LookupEnvironment, boolean) line: 131	
BinaryTypeBinding.superclass() line: 1303	
BinaryTypeBinding(ReferenceBinding).isUncheckedException(boolean) line: 1264	
LoopingFlowContext(FlowContext).checkExceptionHandlers(TypeBinding[], ASTNode, FlowInfo, BlockScope) line: 328	
MessageSend.analyseCode(BlockScope, FlowContext, FlowInfo) line: 117	
MessageSend(Expression).analyseCode(BlockScope, FlowContext, FlowInfo, boolean) line: 227	
MessageSend.analyseCode(BlockScope, FlowContext, FlowInfo) line: 68	
LocalDeclaration.analyseCode(BlockScope, FlowContext, FlowInfo) line: 92	
Block.analyseCode(BlockScope, FlowContext, FlowInfo) line: 39	
WhileStatement.analyseCode(BlockScope, FlowContext, FlowInfo) line: 119	
MethodDeclaration.analyseCode(ClassScope, FlowContext, FlowInfo) line: 110	
TypeDeclaration.internalAnalyseCode(FlowContext, FlowInfo) line: 710	
TypeDeclaration.analyseCode(CompilationUnitScope) line: 255	
CompilationUnitDeclaration.analyseCode() line: 111	
Compiler.process(CompilationUnitDeclaration, int) line: 776	
ProcessTaskManager.run() line: 137	
Thread.run() line: 736	

UnresolvedReferenceBinding(Object).equals(Object) line: 58	
SimpleLookupTable.get(Object) line: 71	
LookupEnvironment.updateCaches(UnresolvedReferenceBinding, ReferenceBinding) line: 1519	
UnresolvedReferenceBinding.setResolvedType(ReferenceBinding, LookupEnvironment) line: 84	
LookupEnvironment.createBinaryTypeFrom(IBinaryType, PackageBinding, boolean, AccessRestriction) line: 686	
LookupEnvironment.createBinaryTypeFrom(IBinaryType, PackageBinding, AccessRestriction) line: 676	
Compiler.accept(IBinaryType, PackageBinding, AccessRestriction) line: 298	
LookupEnvironment.askForType(char[][]) line: 127	
UnresolvedReferenceBinding.resolve(LookupEnvironment, boolean) line: 54	
BinaryTypeBinding.resolveType(TypeBinding, LookupEnvironment, boolean) line: 131	
BinaryTypeBinding.superclass() line: 1303	
BinaryTypeBinding(ReferenceBinding).isUncheckedException(boolean) line: 1264	
LoopingFlowContext(FlowContext).checkExceptionHandlers(TypeBinding[], ASTNode, FlowInfo, BlockScope) line: 328	
MessageSend.analyseCode(BlockScope, FlowContext, FlowInfo) line: 117	
MessageSend(Expression).analyseCode(BlockScope, FlowContext, FlowInfo, boolean) line: 227	
MessageSend.analyseCode(BlockScope, FlowContext, FlowInfo) line: 68	
LocalDeclaration.analyseCode(BlockScope, FlowContext, FlowInfo) line: 92	
Block.analyseCode(BlockScope, FlowContext, FlowInfo) line: 39	
WhileStatement.analyseCode(BlockScope, FlowContext, FlowInfo) line: 119	
MethodDeclaration.analyseCode(ClassScope, FlowContext, FlowInfo) line: 110	
TypeDeclaration.internalAnalyseCode(FlowContext, FlowInfo) line: 710	
TypeDeclaration.analyseCode(CompilationUnitScope) line: 255	
CompilationUnitDeclaration.analyseCode() line: 111	
Compiler.process(CompilationUnitDeclaration, int) line: 776	
ProcessTaskManager.run() line: 137	
Thread.run() line: 736	




Nothing stands out at a first glance - but a few observations/questions:

(1) So if the hashCode/equals hypothesis from comment#74 were to hold 
true, then the data structure concerned has to be the SimpleLookupTables
maintained by LookupEnvironment and updated at LookupEnvironment.updateCaches

(2) The comments in LookupEnvironment.updateCaches say

// hashCode is based on compoundName so this works - cannot be raw since type of parameterized type

this is blatantly not so - see ReferenceBinding.hashCode() - though
I don't see this contributing to a problem.

(3) Short circuiting leak analysis on option: This is not fully effective.
In several places, there is code of the form:

boolean shouldAnalyseResource = this.binding != null 
			&& flowInfo.reachMode() == FlowInfo.REACHABLE
			&& FakedTrackingVariable.isAnyCloseable(this.initialization.resolvedType)
			&& currentScope.compilerOptions().analyseResourceLeaks;

isAnyCloseable call executes a lot of code potentially, walks the
type hierarchy and causes various updates to LE etc.

(4) A bunch of types override equals without overriding hashCode,
but nothing obviously amiss:

BinaryField, BinaryType, CompilationUnit 	
ExternalJavaProject , ImportContainer 
ImportDeclaration , JarIndexLocation 	
JavaModel, SourceRefElement 
SourceType  	
StackMapFrameCodeStream$ExceptionMarker
Comment 88 Markus Keller CLA 2012-03-16 12:37:05 EDT
hashCode/equals of JavaElement and subtypes look OK to me. Only PackageFragmentRoot#equals(Object) is a bit hard to understand, since it just does an "instanceof PackageFragmentRoot" check and then proceeds even if o is of a subtype of PackageFragmentRoot. But I think it's safe, since only resource==null could yield a conflict, but this can only happen in subtypes, and both subtypes override equals(..).

For bindings, ArrayBinding#hashCode() and WildcardBinding#hashCode() look problematic, since their #swapUnresolved(..) methods change the hashCode() during the lifetime of the object in some cases. However, I don't know enough about the call sequences there to judge whether that's relevant.
Comment 89 Srikanth Sankaran CLA 2012-03-16 15:00:27 EDT
(In reply to comment #87)

> (1) So if the hashCode/equals hypothesis from comment#74 were to hold 
> true, then the data structure concerned has to be the SimpleLookupTables
> maintained by LookupEnvironment and updated at LookupEnvironment.updateCaches

Not so. The above is true as far as Object's method overrides are
concerned, but we have some other hashCode and equals methods that
differ in signature such as CharOperation.hashCode(char[]) that are
references in custom hash tables that need to be studied too.
Comment 90 Srikanth Sankaran CLA 2012-03-17 04:07:51 EDT
Created attachment 212822 [details]
Patch for force hashCode methods to return 0

(In reply to comment #74)

> A way to make this more likely to occur is to make hashCode() return a
> constant. That is legal but will kill performance by turning all O(1)
> operations into O(n).

This patch forces all hashCode methods to return 0. I ran the JDT/Core tests
with it - it triggered a bunch of "failures" where basically the order of
expected messages and actual messages differ to different discovery order
resulting from hashCode collision - None of these is a real issue.

No flow analysis related anomaly was triggered - However the test suite
is really made of numerous independent small tests and is not a good data
point for the type of exercise at hand.

We would like the SDK built with a compiler that has these changes.
Kim/David, how can we help you and make it easy for you to carry out
this experiment for us ? Some alternatives:

(1) Releng can apply this patch, build a compiler and use that to
build SDK on an experimental basis (not the official builds)

(2) We can release this change into master, wait for a nightly
build to happen and then rollback the changes - Releng can then
use the night build bits to build the SDK.

(3) We can build ecj.jar for you with this changes and hand over
the batch compiler that you can run and report back

(4) Any other alternatives ? 

Please let us know what would work best for you -
Comment 91 Stephan Herrmann CLA 2012-03-17 07:44:40 EDT
(In reply to comment #90)
> No flow analysis related anomaly was triggered - However the test suite
> is really made of numerous independent small tests and is not a good data
> point for the type of exercise at hand.
> 
> We would like the SDK built with a compiler that has these changes.
> ...

Why bother releng with this exercise? With constant hashCode() aren't we now looking for a deterministic, platform independent variant of the bug?

Anyway, I applied the patch and compiled a SDK-like set of projects in a runtime workbench. None of our suspects were sighted.

Am I missing anything?
Comment 92 Srikanth Sankaran CLA 2012-03-17 08:42:40 EDT
(In reply to comment #91)

> Why bother releng with this exercise? With constant hashCode() aren't we now
> looking for a deterministic, platform independent variant of the bug?

Yes, but there could still be some pattern in the SDK code base that 
triggers it. Every one of the reports have come from eclipse SDK builds,
plain vanilla or with additional functionality.
 
> Anyway, I applied the patch and compiled a SDK-like set of projects in a
> runtime workbench. None of our suspects were sighted.

I don't know what is in this SDK-line set of projects - Given two sightings
were from vanilla SDK builds, if we undertake one full build with the 
constant hash code compiler, it would allow us to validate some theories.
(I realize that even a full good build does not necessarily rule out the
hashCode/equals problem hypothesis - the constant hashCode only makes it
much more likely for the bug to show up - does not make it a certainty)

I agree such requests are burdensome on another project - but we have had
no success in the past recreating a full SDK build locally.
Comment 93 Satyam Kandula CLA 2012-03-19 01:58:13 EDT
(In reply to comment #82)
I also tried building the full plugin org.eclipse.equinox.p2.updatesite (Errors popped up in this plugin bug 372944 comment 11 ) for around 100,000 times without any luck in reproducing the problem.
Comment 94 Srikanth Sankaran CLA 2012-03-19 08:10:19 EDT
We have found a concurrency bug in the resource leak analysis functionality
that explains most of the problems we have been seeing. Or to put it 
conservatively,  at least one major cause for all the weird error messages
has been uncovered and whether there are similar other bugs is something 
to be investigated.

Looking at the code for FakedTrackingVariable.getNullStatusAggressively
there is a chunk that looks like this:

private int getNullStatusAggressively(LocalVariableBinding local, FlowInfo flowInfo) {
		int reachMode = flowInfo.reachMode();
		int status = 0;
		try {
			// unreachable flowInfo is too shy in reporting null-issues, temporarily forget reachability:
			if (reachMode != FlowInfo.REACHABLE)
				flowInfo.tagBits &= ~FlowInfo.UNREACHABLE;
			status = flowInfo.nullStatus(local);
		} finally {
			// reset
			flowInfo.tagBits |= reachMode;
		}

If the flowInfo parameter happens to refer to the static object 
FlowInfo.DEAD_END, then in the window of time between when we 
"temporarily forget reachability" and when we "reset" it in the
finally block, other threads would observe the static object 
FlowInfo.DEAD_END in a bad state.

org.eclipse.jdt.core.tests.compiler.regression.ResourceLeakTests.test056e()
is one test that shows that the parameter could refer to the static
object DEAD_END.

By initializing the static object DEAD_END incorrectly, i.e instead of

//---- right stuff 
	public static final UnconditionalFlowInfo DEAD_END; 
	static {
		DEAD_END = new UnconditionalFlowInfo();
		DEAD_END.tagBits = UNREACHABLE;
	}

if we now deliberately initialize 

// ---- deliberate incorrect initialization
	public static final UnconditionalFlowInfo DEAD_END; 
	static {
		DEAD_END = new UnconditionalFlowInfo();
		DEAD_END.tagBits = REACHABLE;  // WRONG INIT !!!
	} 

we are able to force most of the problems to manifest themselves.

Satyam will shortly document specific cases for each of the
four classes of bogus errors we have seen - (1) unreachable code,
(2) uninitialized local (3) Blank final already initialized
(4) Blank final not initialized.

For this to bug to manifest, there should be parallel compiles 
going on in the same VM. JDT/Core by itself does not parallellize 
compiles, but when this happens under control of ANT or other such
build manager tools, the concurrency issue could act up. (see 
comment#8 and comment#46). Otherwise there are no observers for 
the bad data. This explains why most users are not hitting this 
bug and why all the reports have come from build masters/releng 
czars.

This code was originally introduced via commit 1bf30b93f1c2f17c02d0cecfa43e877f00d01800 on 2011-09-29 and then
got refactored into its present place - From a timeline perspective
this spans the entire window - from the very first bug referred to
in comment#5 all the way to the most recent incident reports.

Satyam, please document how to reproduce each of the wrong
class of diagnostics -- TIA.

Stephan, please prepare a patch - it would be great if we could
deliver it for this week's I build -- TIA.
Comment 95 Srikanth Sankaran CLA 2012-03-19 09:27:54 EDT
(In reply to comment #94)

> Stephan, please prepare a patch - it would be great if we could
> deliver it for this week's I build -- TIA.

While we are at it, could we also make sure that all leak analysis
code is under the option check - I noticed quite a few places
around the use of FakedTrackingVariable.isAnyCloseable that cause
non-trivial amount of code to be executed checking whether an
object is a closeable only to subsequently check the option.
e.g : analyseCode in LocalDeclaration and AllocationExpression.

(Such meticulous guarding will help if and when we want to disable
an optional feature to see if a certain problem goes away and to
provide an escape hatch thus.)
Comment 96 Srikanth Sankaran CLA 2012-03-19 09:48:36 EDT
(In reply to comment #55)
> Created attachment 212432 [details]
> static non-final fields in jdt.core

> Out of curiosity I assembled a list of all static non-final fields in JDT/Core
> (attached, fields from "Messages" classes are omitted). None of these fields
> looked suspicious to me in terms of this bug, most are not part of the compiler

Satyam had closely scrutinized this whole list - the catch turns out to be
that FlowInfo.DEAD_END **is** a static final field. It is the referent that
is getting corrupted and not the reference and since there is no way to
specify in Java that an object is immutable, there is no compiler offered
guarantee - that also complicated analysis and it easily slipped through the
cracks.
Comment 97 Thomas Watson CLA 2012-03-19 10:21:46 EDT
bug374651 shows another case of unexpected compilation errors.  In this case there are several static final String constants that it thinks may not have been initialized in this class:

http://git.eclipse.org/c/equinox/rt.equinox.p2.git/tree/bundles/org.eclipse.equinox.p2.reconciler.dropins/src/org/eclipse/equinox/internal/p2/reconciler/dropins/Activator.java
Comment 98 Satyam Kandula CLA 2012-03-19 10:36:17 EDT
(In reply to comment #94)
- All bugs except the 'Unreachable code' bugs could be reproduced with this
change that Srikanth mentioned. 
######
> By initializing the static object DEAD_END incorrectly, i.e instead of
> 
> //---- right stuff 
>     public static final UnconditionalFlowInfo DEAD_END; 
>     static {
>         DEAD_END = new UnconditionalFlowInfo();
>         DEAD_END.tagBits = UNREACHABLE;
>     }
> 
> if we now deliberately initialize 
> 
> // ---- deliberate incorrect initialization
>     public static final UnconditionalFlowInfo DEAD_END; 
>     static {
>         DEAD_END = new UnconditionalFlowInfo();
>         DEAD_END.tagBits = REACHABLE;  // WRONG INIT !!!
>     } 
> 
#######
- 'Unreachable code' in the switch-case block could be reproduced using the
following mechanism. 
   i) Debug the compilation of 'Disassembler.java' case through the debugger.
  ii) Stop at SwitchStatement.java:90 (caseInits =
caseInits.mergedWith(flowInfo.unconditionalInits());)
 iii) Change the value of caseInits.tagBits to 0
  iv) Stop at SwitchStatement.java:105 for the break statement
   v) Change back the value of caseInits.tagBits to 3
  This will help to reproduce the case in comment 5.
Comment 99 Srikanth Sankaran CLA 2012-03-19 11:34:29 EDT
*** Bug 374651 has been marked as a duplicate of this bug. ***
Comment 100 Srikanth Sankaran CLA 2012-03-19 12:09:51 EDT
I looked at all locations modifying

org.eclipse.jdt.internal.compiler.flow.FlowInfo.tagBits
org.eclipse.jdt.internal.compiler.flow.UnconditionalFlowInfo.definiteInits
org.eclipse.jdt.internal.compiler.flow.UnconditionalFlowInfo.potentialInits.

The former drives the Unreachable code message, while the latter two drive
the (a) Uninitialized local (b) Blank final not initialized (c) Blank final
already initialized diagnostics. (well (a), (b) and (c) for the first 64
cobined fields + locals, but most of the cases we have been hitting have
been tiny methods really).

The only post 3.7 modifier of org.eclipse.jdt.internal.compiler.flow.FlowInfo.tagBits is
FakedTrackingVariable.getNullStatusAggressively.

The only post 3.7 modifiers of 
org.eclipse.jdt.internal.compiler.flow.UnconditionalFlowInfo.definiteInits
org.eclipse.jdt.internal.compiler.flow.UnconditionalFlowInfo.potentialInits.

are

org.eclipse.jdt.internal.compiler.flow.UnconditionalFlowInfo.addInfoFrom(FlowInfo, boolean)
org.eclipse.jdt.internal.compiler.flow.UnconditionalFlowInfo.resetAssignmentInfo(int)

Ayush, please review these two methods to the extent they change these
fields to make sure all is well. Should resetAssignmentInfo method be 
also be discarding nullness information ?
Comment 101 Stephan Herrmann CLA 2012-03-19 13:52:40 EDT
(In reply to comment #94)
> We have found a concurrency bug in the resource leak analysis functionality
> that explains most of the problems we have been seeing.

Srikanth, congratulations for this truly great catch!!
And a big sorry for putting in the bug in the first place.
 
> Stephan, please prepare a patch - it would be great if we could
> deliver it for this week's I build -- TIA.

I'll pick up work on this in an hour or so ...

(In reply to comment #96)
> Satyam had closely scrutinized this whole list - the catch turns out to be
> that FlowInfo.DEAD_END **is** a static final field. It is the referent that
> is getting corrupted and not the reference and since there is no way to
> specify in Java that an object is immutable, there is no compiler offered
> guarantee - that also complicated analysis and it easily slipped through the
> cracks.

Very true. When I saw your analysis I started to think of how to simulate immutable objects in Java.
Comment 102 Stephan Herrmann CLA 2012-03-19 15:25:22 EDT
Created attachment 212874 [details]
proposed fix

Here's the tiny proposed fix for this big bug.

If flowInfo == DEAD_END the whole code section has no effect, because the DEAD_END has no null information.

I little "if (flowInfo != FlowInfo.DEAD_END)" seems to be all it takes.

I'm currently running the tests. 
Next I will dig through the previous comments, 
and try to construct a meaningful test.
Comment 103 Stephan Herrmann CLA 2012-03-19 16:07:55 EDT
(In reply to comment #100)
> I looked at all locations modifying
> 
> org.eclipse.jdt.internal.compiler.flow.FlowInfo.tagBits
>
> [...]
> 
> The only post 3.7 modifier of
> org.eclipse.jdt.internal.compiler.flow.FlowInfo.tagBits is
> FakedTrackingVariable.getNullStatusAggressively.

Most modifiers of tagBits closely follow the pattern to exit early if (this == DEAD_END) or ((this.tagBits & UNREACHABLE_OR_DEAD) != 0). I found only one exception: UFI.unconditionalFieldLessCopy() has no such checks, which looks inconsistent to me.

Secondly, in class NullInfoRegistry the check "if (this != DEAD_END)" is ridiculous actually, DEAD_END is of type UnconditionalFlowInfo, *not* NullInfoRegistry :)

I don't think either of these causes harm in the sense of this bug, maybe just wasting a few CPU cycles.
Comment 104 Stephan Herrmann CLA 2012-03-19 16:13:47 EDT
(In reply to comment #102)
> Created attachment 212874 [details]
> proposed fix
> 
> Here's the tiny proposed fix for this big bug.
> [...]
> I'm currently running the tests. 

RunJdtCoreTests answered: PASS
Comment 105 Kim Moir CLA 2012-03-19 16:33:12 EDT
I've also  reverted the compiler for tonight's build to the one from M5.
Comment 106 Stephan Herrmann CLA 2012-03-19 17:05:40 EDT
(In reply to comment #105)
> I've also  reverted the compiler for tonight's build to the one from M5.

I'm afraid for safely avoiding this bug you'd have to go all the way back to M2. Not sure if that will trigger other issues (e.g., warnings suppressed in the code which aren't even raised -> spurious compile warnings re unused @SuppressWarnings - or code relying on some recent bug fixes?).

In order to upgrade to a compiler *with* this fix, we need a successful I-Build first? Or would it make sense to "sneak in" a hand-made ecj with this fix?
Comment 107 Stephan Herrmann CLA 2012-03-19 17:55:10 EDT
Created attachment 212880 [details]
improved fix

(In reply to comment #95)
> While we are at it, could we also make sure that all leak analysis
> code is under the option check - I noticed quite a few places
> around the use of FakedTrackingVariable.isAnyCloseable that cause
> non-trivial amount of code to be executed checking whether an
> object is a closeable only to subsequently check the option.
> e.g : analyseCode in LocalDeclaration and AllocationExpression.
> 
> (Such meticulous guarding will help if and when we want to disable
> an optional feature to see if a certain problem goes away and to
> provide an escape hatch thus.)

In bug 370639 I ensured, that with resource analysis disabled no single instance of FakedTrackingVariable would ever be created.
As you correctly note, the static method isAnyCloseable was not fully covered by checks. I fixed this by simply swapping the order of checks in three locations. With these changes I ran all JDT/Core tests with resource analysis disabled by default and with breakpoints in all ctors and static methods of FakedTrackingVariable. Only one of these breakpoints was ever triggered: FTV.cleanupAfterAssignment(). In this case adding another check looks like overkill to me.

Also the actual fix has been changed to a simple early exit to improve readability.
Comment 108 Stephan Herrmann CLA 2012-03-19 19:29:30 EDT
Created attachment 212884 [details]
same patch with a regression test

This patch includes a regression test that triggers one instance of the actual bug in its concurrency-based version. In order to increase the probability I added a guarded sleep(5) into the critical section. With this delay and with disabling the fix the bug seems to show deterministically on my machine. I even saw it once without the delay, but for a good regression test this didn't look reliable to me.

Let me know if s.o. sees an issue in this kind of test harness in the code:
    if (TEST_372319) { // see https://bugs.eclipse.org/372319
	try {
	    Thread.sleep(5); // increase probability of concurrency bug
	} catch (InterruptedException e) { /* nop */ }
    }
Comment 109 Srikanth Sankaran CLA 2012-03-19 19:34:15 EDT
(In reply to comment #106)
> (In reply to comment #105)
> > I've also  reverted the compiler for tonight's build to the one from M5.
> 
> I'm afraid for safely avoiding this bug you'd have to go all the way back to
> M2. 

Stephan, could you dig up the command line switch used to disable
the resource analysis and document it here ? It can be a workaround
until a fix is released and makes it way into the build environment.
Comment 110 Stephan Herrmann CLA 2012-03-19 19:42:37 EDT
(In reply to comment #109)
> (In reply to comment #106)
> > (In reply to comment #105)
> > > I've also  reverted the compiler for tonight's build to the one from M5.
> > 
> > I'm afraid for safely avoiding this bug you'd have to go all the way back to
> > M2. 
> 
> Stephan, could you dig up the command line switch used to disable
> the resource analysis and document it here ?

Just say:
  -warn:-resource

> It can be a workaround
> until a fix is released and makes it way into the build environment.

I was going to release the current version of the fix after another run of all tests, to be sure it's included in the next build. Any objections?
Comment 111 Stephan Herrmann CLA 2012-03-19 20:16:08 EDT
I have released the fix via http://git.eclipse.org/c/jdt/eclipse.jdt.core.git/commit/?id=8654e35d625c39f9786ab81699fccb6057b0bc72

Please let me know if more should be done.
Comment 112 Srikanth Sankaran CLA 2012-03-19 20:19:37 EDT
(In reply to comment #110)

> > It can be a workaround
> > until a fix is released and makes it way into the build environment.
> 
> I was going to release the current version of the fix after another run of all
> tests, to be sure it's included in the next build. Any objections?

I looked through the changes and they look good. Yes, please release them
so they get included for I build today.

Question: New test uses buildUniqueComplianceTestSuite 1.6, but 
TestAll adds it since 1.5 - Is this an issue ? (No idea)

I ran the test a few times on Windoze - it passed with the fix
and failed without consistently, so we seem to be good.

(In reply to comment #108)

> Let me know if s.o. sees an issue in this kind of test harness in the code:

We have precedence of such "testability improvement" code in JDT/Core.
So I don't see an issue - See the aptly named
org.eclipse.jdt.internal.core.JavaModelManager.doNotUse() :)

Thanks a lot for the quick turn around on the patch Stephan.
Comment 113 Srikanth Sankaran CLA 2012-03-19 20:22:07 EDT
(In reply to comment #111)
> I have released the fix via
> http://git.eclipse.org/c/jdt/eclipse.jdt.core.git/commit/?id=8654e35d625c39f9786ab81699fccb6057b0bc72

This patch looks good.
Comment 114 Srikanth Sankaran CLA 2012-03-19 20:29:28 EDT
(In reply to comment #112)

> I looked through the changes and they look good. Yes, please release them
> so they get included for I build today.

Jay, FYI.
Comment 115 Stephan Herrmann CLA 2012-03-19 20:42:09 EDT
(In reply to comment #112)
> Question: New test uses buildUniqueComplianceTestSuite 1.6, but 
> TestAll adds it since 1.5 - Is this an issue ? (No idea)

Oops, I mostly copied stuff from BatchCompilerTests and only experimented with switching to 1.6 once. But at a closer look the two locations you mention seem to be independent: when run from TestAll it gets invoked for all levels >= 1.5 and when run as a single test in the IDE only 1.6 is picked. Looks actually OK to me.

> Thanks a lot for the quick turn around on the patch Stephan.

Don't mention. I owe you a big beer or two for this.
Comment 116 Ayushman Jain CLA 2012-03-20 11:20:40 EDT
(In reply to comment #100)
> The only post 3.7 modifiers of 
> org.eclipse.jdt.internal.compiler.flow.UnconditionalFlowInfo.definiteInits
> org.eclipse.jdt.internal.compiler.flow.UnconditionalFlowInfo.potentialInits.
> 
> are
> 
> org.eclipse.jdt.internal.compiler.flow.UnconditionalFlowInfo.addInfoFrom(FlowInfo,
> boolean)
> org.eclipse.jdt.internal.compiler.flow.UnconditionalFlowInfo.resetAssignmentInfo(int)
> 
> Ayush, please review these two methods to the extent they change these
> fields to make sure all is well. Should resetAssignmentInfo method be 
> also be discarding nullness information ?
All looks ok here. The resetAssignmentInfo discards the info corresponding to a loop counter after the loop, so yes both nullness info and inits info can be discarded for that particular variable.
Comment 117 Srikanth Sankaran CLA 2012-03-20 11:28:26 EDT
(In reply to comment #116)

> All looks ok here. The resetAssignmentInfo discards the info corresponding to a
> loop counter after the loop, so yes both nullness info and inits info can be
> discarded for that particular variable.

OK. So that it doesn't discard the nullness info today - could this lead
to some trouble somehow ?
Comment 118 Ayushman Jain CLA 2012-03-21 02:08:16 EDT
(In reply to comment #117)
> OK. So that it doesn't discard the nullness info today - could this lead
> to some trouble somehow ?

No, it won't, since no other local is assigned the same id.
Comment 119 David Williams CLA 2012-04-11 00:32:57 EDT
We are starting to get "good builds" on our build.eclipse.org 4.2 primary builds. 

I expect us to have our first "public" build this week (in next few days) ... that is, one we feel comfortable putting on downloads.eclipse.org, even though unit tests won't be running yet. 

I've opened a "releng" bug to update the compiler in the base builder we are using, but want to wait until after that "first" new I-build. 

Can one of you JDT core members comment in bug 376450 when you think "yes, the compiler from the I-build xyz is ready for I-build production use"?
Comment 120 Ayushman Jain CLA 2012-04-11 00:40:03 EDT
(In reply to comment #119)
> Can one of you JDT core members comment in bug 376450 when you think "yes, the
> compiler from the I-build xyz is ready for I-build production use"?
Sure David.
Comment 121 Satyam Kandula CLA 2012-04-30 07:24:39 EDT
The builds didn't show up these errors for a while. Hence marking this verified.
Verified for 3.8M7.