Community
Participate
Working Groups
To speed up AST creation and to be able to share bindings between ASTs we would need a pipeline to create AST. The pipeline would work as follows: - all CUs initially put into the pipeline are converted into AST and they are handed back one by one via some sort of call back interface - when the pipeline is empty a special hook is call to have the change to add more CUs to the pipeline - the pipeline can only contain CUs form the same project - a CU can't be parsed twice. Philippe, we looked at our code and it would be a great help if we would be able to parse CUs a second time. Would this be possible if we guarantee that the content of the CU didn't change ?
> a CU can't be parsed twice. Philippe, we looked at our code and it would > be a great help if we would be able to parse CUs a second time. Would this > be possible if we guarantee that the content of the CU didn't change ? Do you need to parse, or to resolve a CU a second time ? If the content doesn't change, parsing it a second time would return an equivalent AST. So I'm not sure why you would want to do that. If you need to resolve the CU a second time, I'm afraid this is not possible. All our compiler internals assume that bindings are unique. If you created a second set of bindings for the same CU, it would be an error for the compiler.
We need to parse the CU a second time including having bindings expecting that we get an equivalent AST. The use case is as follows: - we parse a bunch of compilation units and do some data flow analysis on them - for the analysis we neeed parts of the bindings, but no AST nodes anymore - after the analysis we know that we can update some of the source code - to use the AST rewriter we need an AST again for those CUs we can change To not hold to many ASTs in memory we get rid of them as early as possible, but for the rewriter we need one again. We can guarantee that none of the source code has changed so the bindings world should be the same.
Then you'd have to recreate one from scratch, since once destroyed, we cannot recreate an AST and hook it to existing bindings from the cache (we could, but we don't have support for this at the current time being).
What happens if we recreate the AST outside of the pipeline without resolving bindings. Will the compiler bindings still be created and kept in memory ? We are asking since creating an resolved AST for StyledText takes ~ 4 seconds which we would like to speed up if we only do source code rewriting.
If you create an AST without asking to resolve bindings, the AST won't know about its bindings, and the resolveBinding() methods will return null.
But we were wondering what happens in the compiler AST. Does this one create the bindings and throws them right away. We wanted to avoid to even create the compiler bindings since we got the impression that those are expensive in memory and to construct. Can you clarify on this ? That's why we asked if we can parse a CU a second time since we know that the bindings already exist in memory when we do so.
No the compiler bindings are not thrown right away, but as Philippe said, if you parse the CU a second time, we don't know how to bind the ASTNode in this CU to existing bindings.
Sorry, wasn't specific enough. The question is: if we create an AST without requesting bindings (outside the pipeline using the normal ASTParser) will the infrastructure create bindings for the AST in the compiler AST or not. If it doesn't we can do the following: - having a pipeline to create all the ASTs and bindings for the analysis. - when we want to rewrite the source code we create an AST outside of the pipeline without requesting bindings. For rewriting we don't need bindings. We want to avoid to have the bindings two times in memory. One set in the pipeline and one set when we create the AST for rewriting.
If you create an AST without requesting bindings (outside the pipeline using the normal ASTParser) the infrastructure will not create bindings for the AST (neither compiler bindings, nor DOM AST bindings). So for rewriting you can create another AST (without requesting bindings) and you won't have the bindings 2 times in memory.
Released tentative API and implementation to HEAD: - ASTParser#createASTs(ASTRequestor, IProgressMonitor) - ASTRequestor ICompilationUnit[] getSources(); void acceptAST(ASTNode); The getSources() method returns the initial set of sources. When the initial set has been prcessed it is called again. It returns another set of sources to process , or null if none. And so on. Please let me know if this API suits your need, or if you prefer to have the initial set of sources be passed separately. Also added tests in BatchASTCreationTests.
I prefer retrieving the initial set of sources through the AST requester as well. This makes it the same for all sources to be parsed.
I'm not sure what to do in the case of a source injected in the loop when you didn't request it. E.g. the initial source set contains only the following CU: public class X { Y field; } Internaly the source for Y is diet parsed. The intent is to build and report the corresponding DOM AST using ASTRequestor#acceptAST(...). However it is going to be incomplete (i.e. it won't contain statements). Will you know at this stage if you need the full AST ? I will need to add an API so that you can tell us that a full parse need to be done for Y.java.
In the outlined example I wouldn't expect to get an AST for Y.java if I didn't request one. Can the API work as follows: - getSources returns the compilation unit for which we want to get ASTs for. - these are reported via acceptAST. If you have to parse Y.java (as in the example above) I would only expect to get an AST for it if I request it via getSources. Is it possible for you to forget about Y.java and only give us a full AST if we request it ? To answer your question: in the example above we don't know that we need Y.java if we start with X.java. Our current idea is that we first process all ASTs we requested, then do some post processing to compute further CUs to be analyzed and request the corresponding ASTs via a subsequent call of getSources.
Then we have an issue: we cannot forget that we already resolved Y.java, and attempting to resolve it a second time will result in a 'duplicate type' error.
So what are the options ?
I'm still thinking about it, but so far the only solution I see is to give you the information that Y.java was resolved, and if you find that you need its bindings, then you do it in a separate loop.
This will IMO not scale since it will "double" the bindings in memory again. The idea behind this pipeline was to only have bindings in memory once. We are already a little bit scared about the fact that we have to keep the bindings for each project. I still don't understand the underlying reason of the "duplicate type error". Is it to detect the case that two different source folders in a project define the same type (e.g. src1/pack1/A and src2/pack1/A). If so then we could add the source path to the binding and if the type is "recreated" from the same source we could reuse it. We can guarantee that we don't change the content while a pipeline is active.
Sorry but it looks like we cannot change the compiler internals that heavily rely on the fact that a binding for a type is created only once. All we can do is to tell you that a compilation unit that you didn't request an AST for was resolved.
I think we need a solution for this problem. To treat 5.0 language constructs correctly we have to use the DOM/AST & bindings world. We can't use the Java Model any longer. So we need to have a scalable solution for generating ASTs and bindings for a larger amount of compilation units. Can we discuss this next week when Philippe is back ?
I believe there's an in-between possibilities that could provide much of the benefits of a pipeline without having to violate existing assumptions inside the compiler. Without the pipeline, we know that each compilation unit that the client gets an AST for (with bindings) entails building an entire compiler context from scratch, with no possibility of sharing between compilation units. With a "perfect" pipeline, all compilation units would be handled using the same compiler context. Anywhere in between these extremes is better that what clients are getting now. So... The clients starts off by asking for ASTs for a list of compilation units. The implementation should be able to arrange that all of these are built with the same compiler context. If the client later determines that it wants ASTs for additional compilation units, there are 2 possibilities: (a) if the compilation unit has not been parsed before, then it is effectively added to the original list and compiled with the same compiler context; (b) if the compilation unit has already been parsed, then a new compilation context has to be created and used. Would this not provide most of the benefits we're after?
From how I understood Jerome the trick will not work for us. I will outline the problem using an example: - the initial set of compilation units we computed to parse contains A.java and B.java. A.java has a reference to type Y. - when A.java is parsed the compiler recognizes that it needs the bindings for Y and parses Y.java, *before* the AST of A.java is handed over to the client. Even if there is API which queries if we are interested in Y.java we will very likely answer false. There reason is that we know that we are interested in Y.java after we have inspected A.java. If the analysis envolves searching (which is the case for most of the type constraint refactorings) we compute the set of additional CUs to parse in batches to optimize searching. For example we parse A.java and B.java and detect that we have to parse all callers of the methods A#foo and B#bar as well we do a combined search at the end using an or query for references to A#foo and B#bar. This is much faster than searching for A#foo and B#bar separately. So for all additional CUs we have to parse we will very likely get an additional compiler context. Since we already have to keep the compiler context for all projects we have to inspect in memory (the type constraint refactorings hang onto bindings to unambiguously identify language constructs and to answer questions like is type A compatible to type B) I am a little bit worried about any additional compiler context we have. Especially if the number gets multiplied with the number of projects we inspect. We looked into ways not to hang onto bindings but couldn't come up with a nice soultion either. The problems were: - can't use IJavaElements since we don't have elements for generics (e.g. List<String>) - implementing our own bindings. This is lot of work and we basically have to double all the type compatibility, override, overloading, ... rules which isn't a trivial thing to do for J2SE 5.0. One idea we didn't looked at is the ability to cut references between bindings. So for example if we hang onto the type binding A this references its package binding, but for the analysis we don't need this relationship. Any thoughts or comments ?
We propose to separate batches. The sources are passed in as one batch, and no other sources are requested during the processing of the batch. getSources() is removed from ASTRequestor. Bindings from previous batches can be requested at the same time using binding keys (remembered from a previous batch by the client). API would be: - ASTParser#createASTs(ICompilationUnit[] sources, String[] bindingKeys, ASTRequestor requestor,IProgressMonitor pm) - ASTRequestor#acceptAST(CompilationUnit ast, ICompilationUnit source) - ASTRequestor#acceptBinding(IBinding binding, String bindingKey) Would that suit your needs ?
One additional request: can we add a method to ASTRequestor to convert keys into bindings. Something like IBinding[] ASTRequestor#createBindings(String[] bindings). Otherwise we have to recreate all bindings from the previous pipeline. The additional API would allow us to only recreate those we actually need.
It is not possible to reenter the compiler loop, so all binding keys need to be known in advance.
I take that back. It should be possible to ask for the binding only during the compiler loop.
Released a first cut of the ast batch creation support (with the ability to request bindings that are not in the given source). The new API is: - ASTParser#createASTs(ICompilationUnit[] compilationUnits, String[] bindingKeys, ASTRequestor requestor, IProgressMonitor monitor) - ASTRequestor#acceptAST(CompilationUnit ast, ICompilationUnit source) - ASTRequestor#acceptBinding(IBinding binding, String bindingKey) The following API has been deprecated (and will be removed after 3.1 M3) - ASTParser#createASTs(ASTRequestor requestor, IProgressMonitor monitor) - ASTRequestor#acceptAST(ASTNode node) - ASTRequestor#getSources() For the binding creeation, note that only the following binding keys are supported for now: - packages - toplevel types, - member types - anonymous types - local types - base types - array types - fields - methods Generics and enums are not yet supported (and will likely not be supported before 3.1 M3).
Jerome, thanks for the update! I discussed the new API with Markus and we think it will satisfy our needs. The proposed API doesn't offer a method to convert a string key into a binding when we are in the call back of ASTRequestor#acceptAST(CompilationUnit ast, ICompilationUnit source). Can you outline how this API will look like ?
Sorry I forgot to say that I also added the API: - ASTRequestor#createBindings(String[] bindingKeys) that you can use when you are in the callback acceptAST(...)
Jerome, I've reviewed API spec and released my changes. I flagged a few issues, particularly around bindingKeys. The spec is unclear on bindingKeys are ok, and which are not. This is true for both for ASTParser.createASTs and ASTRequestor.createBindings. Please clarify how these are supposed to work.
Jerome, we looked into the limitations of converting keys into bindings and I don't fully understand why we have to have it. Here is my current understanding: - we pass a set of CUs for which we want to have ASTs for - the compiler does a light pass over all CUs and builds the bindings for them - now we ask to convert a key into a binding where the definition in one of the CUs pass into the pipeline. Since all bindings are already "resolved" why can't the key be interpreted as a "reference" and the already existing binding is returned. If we reference the binding from another file this isn't a problem either.
On the first pass, the compiler builds bindings only for top level and member elements. It doesn't build bindings for local elements. So if you pass in a binding key for a local variable, we would have to fully parse the compilation unit and resolve all its internals. This would reenter the compiler loop and this is not supported.
OK then, does that mean that we can always resolve top-level and member ITypeBindings and IMethodBindings from all CUs (from those requested as well as for those not requested in the pipeline)? If yes, then we shouldn't have a problem, since we won't need bindings for local stuff.
Agree with Markus. Converting back keys for local stuff almost means that we reparse the file. In this case it doesn't make sense to convert the key. So can we implement the method in a way that the restriction only applies for local bindings.
Yes, having the restriction that the binding key cannot be a local element key is possible. If it is a top level or member element, then the binding can be returned, whether an AST for the same compilation is requested or not. To be consistent, we would have this restriction for binding keys passed to both ASTRequestor#createBindings(...) and ASTParser#createASTs(...). Would that be ok ?
I changed the spec to say that local elements can be created from their binding keys only if an ast or a binding key in the same cu is not requested.
Jim, I updated the spec to take your comments into account. Note that I still believe that creating ASTs as a batch even if not resolving is still usefull and more consistent than having to do several call to createAST(...).
Marking as fixed. Please open new bugs for batch ast creation from now on.
Verified for 3.1 M4 using build I200412140800.