Bug 73675 - [dom] Need AST creation pipeline
Summary: [dom] Need AST creation pipeline
Status: VERIFIED FIXED
Alias: None
Product: JDT
Classification: Eclipse Project
Component: Core (show other bugs)
Version: 3.0   Edit
Hardware: PC Windows XP
: P3 normal (vote)
Target Milestone: 3.1 M4   Edit
Assignee: Jerome Lanneluc CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-09-10 11:23 EDT by Dirk Baeumer CLA
Modified: 2004-12-14 13:33 EST (History)
5 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Dirk Baeumer CLA 2004-09-10 11:23:01 EDT
To speed up AST creation and to be able to share bindings between ASTs we 
would need a pipeline to create AST. The pipeline would work as follows:

- all CUs initially put into the pipeline are converted into AST and they
  are handed back one by one via some sort of call back interface

- when the pipeline is empty a special hook is call to have the change to
  add more CUs to the pipeline

- the pipeline can only contain CUs form the same project

- a CU can't be parsed twice. Philippe, we looked at our code and it would
  be a great help if we would be able to parse CUs a second time. Would this
  be possible if we guarantee that the content of the CU didn't change ?
Comment 1 Jerome Lanneluc CLA 2004-09-13 09:54:19 EDT
> a CU can't be parsed twice. Philippe, we looked at our code and it would
> be a great help if we would be able to parse CUs a second time. Would this
> be possible if we guarantee that the content of the CU didn't change ?
Do you need to parse, or to resolve a CU a second time ? If the content 
doesn't change, parsing it a second time would return an equivalent AST. So 
I'm not sure why you would want to do that.

If you need to resolve the CU a second time, I'm afraid this is not possible. 
All our compiler internals assume that bindings are unique. If you created a 
second set of bindings for the same CU, it would be an error for the compiler.
Comment 2 Dirk Baeumer CLA 2004-09-14 05:07:49 EDT
We need to parse the CU a second time including having bindings expecting that 
we get an equivalent AST. The use case is as follows:

- we parse a bunch of compilation units and do some data flow analysis on them
- for the analysis we neeed parts of the bindings, but no AST nodes anymore
- after the analysis we know that we can update some of the source code
- to use the AST rewriter we need an AST again for those CUs we can change

To not hold to many ASTs in memory we get rid of them as early as possible, 
but for the rewriter we need one again. We can guarantee that none of the 
source code has changed so the bindings world should be the same.
Comment 3 Philipe Mulet CLA 2004-09-14 05:23:31 EDT
Then you'd have to recreate one from scratch, since once destroyed, we cannot 
recreate an AST and hook it to existing bindings from the cache (we could, but 
we don't have support for this at the current time being).
Comment 4 Dirk Baeumer CLA 2004-09-14 06:06:04 EDT
What happens if we recreate the AST outside of the pipeline without resolving 
bindings. Will the compiler bindings still be created and kept in memory ? We 
are asking since creating an resolved AST for StyledText takes ~ 4 seconds 
which we would like to speed up if we only do source code rewriting.
Comment 5 Jerome Lanneluc CLA 2004-09-14 06:30:59 EDT
If you create an AST without asking to resolve bindings, the AST won't know 
about its bindings, and the resolveBinding() methods will return null.
Comment 6 Dirk Baeumer CLA 2004-09-14 11:50:17 EDT
But we were wondering what happens in the compiler AST. Does this one create 
the bindings and throws them right away. We wanted to avoid to even create the 
compiler bindings since we got the impression that those are expensive in 
memory and to construct. Can you clarify on this ?

That's why we asked if we can parse a CU a second time since we know that the 
bindings already exist in memory when we do so. 
Comment 7 Jerome Lanneluc CLA 2004-09-15 05:13:49 EDT
No the compiler bindings are not thrown right away, but as Philippe said, if 
you parse the CU a second time, we don't know how to bind the ASTNode in this 
CU to existing bindings.
Comment 8 Dirk Baeumer CLA 2004-09-15 06:52:20 EDT
Sorry, wasn't specific enough. The question is: if we create an AST without 
requesting bindings (outside the pipeline using the normal ASTParser) will the 
infrastructure create bindings for the AST in the compiler AST or not. If it 
doesn't we can do the following:

- having a pipeline to create all the ASTs and bindings for the analysis.
- when we want to rewrite the source code we create an AST outside of the
  pipeline without requesting bindings. For rewriting we don't need bindings.
  We want to avoid to have the bindings two times in memory. One set in the 
  pipeline and one set when we create the AST for rewriting.
Comment 9 Jerome Lanneluc CLA 2004-09-15 07:20:58 EDT
If you create an AST without requesting bindings (outside the pipeline using 
the normal ASTParser) the infrastructure will not create bindings for the AST 
(neither compiler bindings, nor DOM AST bindings). So for rewriting you can 
create another AST (without requesting bindings) and you won't have the 
bindings 2 times in memory.
Comment 10 Jerome Lanneluc CLA 2004-10-01 09:53:31 EDT
Released tentative API and implementation to HEAD:
- ASTParser#createASTs(ASTRequestor, IProgressMonitor)
- ASTRequestor
    ICompilationUnit[] getSources();
    void acceptAST(ASTNode);

The getSources() method returns the initial set of sources. When the initial set
has been prcessed it is called again. It returns another set of sources to
process , or null if none. And so on.

Please let me know if this API suits your need, or if you prefer to have the
initial set of sources be passed separately.

Also added tests in BatchASTCreationTests.
Comment 11 Dirk Baeumer CLA 2004-10-04 05:53:00 EDT
I prefer retrieving the initial set of sources through the AST requester as 
well. This makes it the same for all sources to be parsed.
Comment 12 Jerome Lanneluc CLA 2004-10-04 07:55:24 EDT
I'm not sure what to do in the case of a source injected in the loop when you
didn't request it. E.g. the initial source set contains only the following CU:
public class X {
  Y field;
}
Internaly the source for Y is diet parsed. The intent is to build and report the
corresponding DOM AST using ASTRequestor#acceptAST(...). However it is going to
be incomplete (i.e. it won't contain statements).
Will you know at this stage if you need the full AST ? I will need to add an API
so that you can tell us that a full parse need to be done for Y.java.
Comment 13 Dirk Baeumer CLA 2004-10-04 10:55:03 EDT
In the outlined example I wouldn't expect to get an AST for Y.java if I didn't 
request one. Can the API work as follows:

- getSources returns the compilation unit for which we want to get ASTs for.
- these are reported via acceptAST.

If you have to parse Y.java (as in the example above) I would only expect to 
get an AST for it if I request it via getSources. Is it possible for you to 
forget about Y.java and only give us a full AST if we request it ?

To answer your question: in the example above we don't know that we need 
Y.java if we start with X.java. Our current idea is that we first process all 
ASTs we requested, then do some post processing to compute further CUs to be 
analyzed and request the corresponding ASTs via a subsequent call of 
getSources.
Comment 14 Jerome Lanneluc CLA 2004-10-04 11:30:06 EDT
Then we have an issue: we cannot forget that we already resolved Y.java, and
attempting to resolve it a second time will result in a 'duplicate type' error.
Comment 15 Dirk Baeumer CLA 2004-10-04 13:35:12 EDT
So what are the options ?
Comment 16 Jerome Lanneluc CLA 2004-10-04 13:48:15 EDT
I'm still thinking about it, but so far the only solution I see is to give you
the  information that Y.java was resolved, and if you find that you need its
bindings, then you do it in a separate loop.
Comment 17 Dirk Baeumer CLA 2004-10-05 05:08:05 EDT
This will IMO not scale since it will "double" the bindings in memory again. 
The idea behind this pipeline was to only have bindings in memory once. We are 
already a little bit scared about the fact that we have to keep the bindings 
for each project. 

I still don't understand the underlying reason of the "duplicate type error". 
Is it to detect the case that two different source folders in a project define 
the same type (e.g. src1/pack1/A and src2/pack1/A). If so then we could add 
the source path to the binding and if the type is "recreated" from the same 
source we could reuse it. We can guarantee that we don't change the content 
while a pipeline is active. 
Comment 18 Jerome Lanneluc CLA 2004-10-06 06:18:40 EDT
Sorry but it looks like we cannot change the compiler internals that heavily
rely on the fact that a binding for a type is created only once.

All we can do is to tell you that a compilation unit that you didn't request an
AST for was resolved.
Comment 19 Dirk Baeumer CLA 2004-10-06 06:37:27 EDT
I think we need a solution for this problem. To treat 5.0 language constructs 
correctly we have to use the DOM/AST & bindings world. We can't use the Java 
Model any longer. So we need to have a scalable solution for generating ASTs 
and bindings for a larger amount of compilation units. 

Can we discuss this next week when Philippe is back ?
Comment 20 Jim des Rivieres CLA 2004-10-06 09:49:47 EDT
I believe there's an in-between possibilities that could provide much of the 
benefits of a pipeline without having to violate existing assumptions inside 
the compiler. Without the pipeline, we know that each compilation unit that 
the client gets an AST for (with bindings) entails building an entire compiler 
context from scratch, with no possibility of sharing between compilation 
units. With a "perfect" pipeline, all compilation units would be handled using 
the same compiler context. Anywhere in between these extremes is better that 
what clients are getting now. So...

The clients starts off by asking for ASTs for a list of compilation units. The 
implementation should be able to arrange that all of these are built with the 
same compiler context. If the client later determines that it wants ASTs for 
additional compilation units, there are 2 possibilities: (a) if the 
compilation unit has not been parsed before, then it is effectively added to 
the original list and compiled with the same compiler context; (b) if the 
compilation unit has already been parsed, then a new compilation context has 
to be created and used. Would this not provide most of the benefits we're 
after?
Comment 21 Dirk Baeumer CLA 2004-10-07 09:41:41 EDT
From how I understood Jerome the trick will not work for us. I will outline 
the problem using an example:

- the initial set of compilation units we computed to parse contains A.java 
  and B.java. A.java has a reference to type Y.

- when A.java is parsed the compiler recognizes that it needs the bindings for
  Y and parses Y.java, *before* the AST of A.java is handed over to the client.

Even if there is API which queries if we are interested in Y.java we will very 
likely answer false. There reason is that we know that we are interested in 
Y.java after we have inspected A.java. If the analysis envolves searching 
(which is the case for most of the type constraint refactorings) we compute 
the set of additional CUs to parse in batches to optimize searching. For 
example we parse A.java and B.java and detect that we have to parse all 
callers of the methods A#foo and B#bar as well we do a combined search at the 
end using an or query for references to A#foo and B#bar. This is much faster 
than searching for A#foo and B#bar separately.

So for all additional CUs we have to parse we will very likely get an 
additional compiler context.

Since we already have to keep the compiler context for all projects we have to 
inspect in memory (the type constraint refactorings hang onto bindings to 
unambiguously identify language constructs and to answer questions like is 
type A compatible to type B) I am a little bit worried about any additional 
compiler context we have. Especially if the number gets multiplied with the 
number of projects we inspect.

We looked into ways not to hang onto bindings but couldn't come up with a nice 
soultion either. The problems were:

- can't use IJavaElements since we don't have elements for generics (e.g.
  List<String>)

- implementing our own bindings. This is lot of work and we basically have to
  double all the type compatibility, override, overloading, ... rules which 
  isn't a trivial thing to do for J2SE 5.0.

One idea we didn't looked at is the ability to cut references between 
bindings. So for example if we hang onto the type binding A this references 
its package binding, but for the analysis we don't need this relationship.

Any thoughts or comments ?
Comment 22 Jerome Lanneluc CLA 2004-10-20 09:53:38 EDT
We propose to separate batches. The sources are passed in as one batch, and no
other sources are requested during the processing of the batch. getSources() is
removed from ASTRequestor.

Bindings from previous batches can be requested at the same time using binding
keys (remembered from a previous batch by the client).

API would be:
- ASTParser#createASTs(ICompilationUnit[] sources, String[] bindingKeys,
ASTRequestor requestor,IProgressMonitor pm)
- ASTRequestor#acceptAST(CompilationUnit ast, ICompilationUnit source)
- ASTRequestor#acceptBinding(IBinding binding, String bindingKey)

Would that suit your needs ?
Comment 23 Dirk Baeumer CLA 2004-10-21 06:04:49 EDT
One additional request: can we add a method to ASTRequestor to convert keys 
into bindings. Something like IBinding[] ASTRequestor#createBindings(String[] 
bindings).

Otherwise we have to recreate all bindings from the previous pipeline. The 
additional API would allow us to only recreate those we actually need.

Comment 24 Jerome Lanneluc CLA 2004-10-21 06:41:13 EDT
It is not possible to reenter the compiler loop, so all binding keys need to be
known in advance.
Comment 25 Jerome Lanneluc CLA 2004-10-22 04:49:13 EDT
I take that back. It should be possible to ask for the binding only during the
compiler loop.
Comment 26 Jerome Lanneluc CLA 2004-10-28 12:54:57 EDT
Released a first cut of the ast batch creation support (with the ability to
request bindings that are not in the given source).

The new API is:
- ASTParser#createASTs(ICompilationUnit[] compilationUnits, String[]
bindingKeys, ASTRequestor requestor, IProgressMonitor monitor) 
- ASTRequestor#acceptAST(CompilationUnit ast, ICompilationUnit source)
- ASTRequestor#acceptBinding(IBinding binding, String bindingKey)

The following API has been deprecated (and will be removed after 3.1 M3)
- ASTParser#createASTs(ASTRequestor requestor, IProgressMonitor monitor)
- ASTRequestor#acceptAST(ASTNode node)
- ASTRequestor#getSources()

For the binding creeation, note that only the following binding keys are
supported for now:
- packages
- toplevel types, 
- member types
- anonymous types
- local types
- base types
- array types
- fields
- methods

Generics and enums are not yet supported (and will likely not be supported
before 3.1 M3).
Comment 27 Dirk Baeumer CLA 2004-10-28 13:47:23 EDT
Jerome, thanks for the update!

I discussed the new API with Markus and we think it will satisfy our needs. The
proposed API doesn't offer a method to convert a string key into a binding when
we are in the call back of ASTRequestor#acceptAST(CompilationUnit ast,
ICompilationUnit source). Can you outline how this API will look like ?
Comment 28 Jerome Lanneluc CLA 2004-10-28 13:51:30 EDT
Sorry I forgot to say that I also added the API:
- ASTRequestor#createBindings(String[] bindingKeys)
that you can use when you are in the callback acceptAST(...)
Comment 29 Jim des Rivieres CLA 2004-10-29 14:56:25 EDT
Jerome, I've reviewed API spec and released my changes. I flagged a few 
issues, particularly around bindingKeys. The spec is unclear on bindingKeys 
are ok, and which are not. This is true for both for ASTParser.createASTs and 
ASTRequestor.createBindings. Please clarify how these are supposed to work.
Comment 30 Dirk Baeumer CLA 2004-11-02 05:06:51 EST
Jerome,

we looked into the limitations of converting keys into bindings and I don't
fully understand why we have to have it. Here is my current understanding:

- we pass a set of CUs for which we want to have ASTs for
- the compiler does a light pass over all CUs and builds the bindings for them
- now we ask to convert a key into a binding where the definition in one of
  the CUs pass into the pipeline.

Since all bindings are already "resolved" why can't the key be interpreted as a
"reference" and the already existing binding is returned. If we reference the
binding from another file this isn't a problem either.
Comment 31 Jerome Lanneluc CLA 2004-11-03 05:34:04 EST
On the first pass, the compiler builds bindings only for top level and member
elements. It doesn't build bindings for local elements. So if you pass in a
binding key for a local variable, we would have to fully parse the compilation
unit and resolve all its internals. This would reenter the compiler loop and
this is not supported.
Comment 32 Markus Keller CLA 2004-11-03 06:02:00 EST
OK then, does that mean that we can always resolve top-level and member
ITypeBindings and IMethodBindings from all CUs (from those requested as well as
for those not requested in the pipeline)?

If yes, then we shouldn't have a problem, since we won't need bindings for local
stuff.
Comment 33 Dirk Baeumer CLA 2004-11-03 06:05:18 EST
Agree with Markus. Converting back keys for local stuff almost means that we
reparse the file. In this case it doesn't make sense to convert the key. So can
we implement the method in a way that the restriction only applies for local
bindings.
Comment 34 Jerome Lanneluc CLA 2004-11-05 09:19:51 EST
Yes, having the restriction that the binding key cannot be a local element key
is possible. If it is a top level or member element, then the binding can be
returned, whether an AST for the same compilation is requested or not. 

To be consistent, we would have this restriction for binding keys passed to both
ASTRequestor#createBindings(...) and ASTParser#createASTs(...). Would that be ok ?
Comment 35 Jerome Lanneluc CLA 2004-11-22 07:45:08 EST
I changed the spec to say that local elements can be created from their binding
keys only if an ast or a binding key in the same cu is not requested.
Comment 36 Jerome Lanneluc CLA 2004-11-22 07:46:59 EST
Jim, I updated the spec to take your comments into account. Note that I still
believe that creating ASTs as a batch even if not resolving is still usefull and
more consistent than having to do several call to createAST(...).
Comment 37 Jerome Lanneluc CLA 2004-11-22 07:56:56 EST
Marking as fixed. Please open new bugs for batch ast creation from now on.
Comment 38 Frederic Fusier CLA 2004-12-14 13:33:09 EST
Verified for 3.1 M4 using build I200412140800.