392448 – [naming] make QualifiedName extensible

Bug 392448 - [naming] make QualifiedName extensible

Summary: [naming] make QualifiedName extensible

Status:	NEW

Alias:	None

Product:	TMF
Classification:	Modeling
Component:	Xtext (show other bugs)
Version:	2.4.0
Hardware:	PC Mac OS X

Importance:	P3 normal (vote)
Target Milestone:	---
Assignee:	Project Inbox
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2012-10-19 10:40 EDT by Knut Wannheden
Modified:	2013-02-17 02:48 EST (History)
CC List:	1 user (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Knut Wannheden

2012-10-19 10:40:26 EDT

The QualifiedName API is defined as a concrete Java class. As a result it is not easily possible to use a custom QualifiedName implementation.

A custom implementation could for instance be interesting in scenarios where a lot of QualifiedName objects are required as it would then be possible to provide implementations which save memory by working like linked lists, or which internally use a string pool for the name segments, or which as the internal representation use a single String for all segments (instead of an array of Strings).

To achieve this the QualifiedName API would have to be more extensible. I propose that the QualifiedName class is made abstract or even an interface and that the implementation is pushed into a separate DefaultQualifiedName class.

In scenarios where multiple languages are involved which use different QualifiedName implementation this could cause some problems. In ResourceDescriptionsData the 'lookupMap' is a Multimap using QualifiedName objects as key. This will of course use equals() to compare QualifiedNames. This would then mean that an index lookup would only work when the indexed QualifiedName is of the same type as the lookup QualifiedName. This is of course a problem which must be solved. I suppose it can be compared to the "problem" that two Collections are not necessarily equal(), even if they contain the same elements.

Comment 1 Lieven Lemiengre

2012-10-19 12:05:14 EDT

(In reply to comment #0)
> The QualifiedName API is defined as a concrete Java class. As a result it is
> not easily possible to use a custom QualifiedName implementation.
> 
> A custom implementation could for instance be interesting in scenarios where
> a lot of QualifiedName objects are required as it would then be possible to
> provide implementations which save memory by working like linked lists, or
> which internally use a string pool for the name segments, or which as the
> internal representation use a single String for all segments (instead of an
> array of Strings).
> 
> To achieve this the QualifiedName API would have to be more extensible. I
> propose that the QualifiedName class is made abstract or even an interface
> and that the implementation is pushed into a separate DefaultQualifiedName
> class.
> 
> In scenarios where multiple languages are involved which use different
> QualifiedName implementation this could cause some problems. In
> ResourceDescriptionsData the 'lookupMap' is a Multimap using QualifiedName
> objects as key. This will of course use equals() to compare QualifiedNames.
> This would then mean that an index lookup would only work when the indexed
> QualifiedName is of the same type as the lookup QualifiedName. This is of
> course a problem which must be solved. I suppose it can be compared to the
> "problem" that two Collections are not necessarily equal(), even if they
> contain the same elements.

Last year, Mark and I explored this idea.
We explored an implementation with linked qualified names to reduce memory usage (with & without String interning). 
We changed the equals method for comparing lowercase qualified names.
We tried all sorts of other tiny optimizations but with very little effect on the overall performance of the code. In our best versions the improvement was a very disappointing 10%. Later we discovered that it is best to optimize the classes that use QualifiedName (mostly in and around IScopes), these optimizations are language specific and currently QualifiedName is hardly noticeable when I profile.

With a custom QualifiedName I see no opportunity to do _language specific_ optimizations (but please correct me if I'm wrong).
The advantages of a single, final QualifiedName shouldn't be ignored:
- All calls are to a QualifiedName are always monomorphic and thus very easy to inline by the VM
- Short-lived QualifiedNames can be eliminated by escape analysis

Why did our efforts to improve QualifiedName fail?
Linked names: You memory access patterns become more complex, but you save a bit (not much) of memory. (most QualifiedNames are short (2 to 4 segments) in our impl making the memory savings very small) Some operations on QN can be heavily optimized using this method but we found that these operations aren't used a lot in the Xtext framework.
String interning: Most strings in QualifiedName are actually substrings of a bigger string (the string that represents your file), so you don't gain much. (btw this behavior will change: in Java 8 'substring' will make a copy). Interning also makes creating QualifiedNames slower while most QNs are only used once or not at all.
Optimized compare/toLowerCase: Most operations on Strings are insanely optimized by the JVM, it doesn't use the code in the String class but optimized code (see http://hg.openjdk.java.net/jdk8/awt/hotspot/file/d61761bf3050/src/share/vm/opto/library_call.cpp) While our implementations were more efficient, they weren't as nicely optimized as the regular compares/hashcode/tolowercase & there was no improvement.

Comment 2 Knut Wannheden

2012-10-19 13:14:26 EDT

Hi Lieven,

Thank you very much for all your input!

A 10% performance improvement doesn't sound too bad to me. But that begs the question what *exactly* was improved by 10%? I suppose the overall performance improvement (e.g. of a full build) was much less.

In our product we are dealing with a lot of QualifiedNames. The index alone contains 500K exported objects (amounting to 465K distinct QualifiedNames). Even a short QualifiedName quickly uses around 250 bytes. And as soon as toLowerCase() has been called (which will happen if you have case insensitive languages) you can double that to 500 bytes. So if we were to assume the 500 bytes as the average, the cache in ResourceDescriptionsData alone will use around 250 MiB of memory.

In our product all languages use '.' as the separator for QualifiedNames. So I've been playing with the idea of using the dotted String as the internal representation in the QualifiedName implementation. I.e. a QualifiedName would just have a single String field and implement all operations using String arithmetics. With this approach the memory consumption can be lowered substantially (often by over 50%). Of course some of the QualifiedName operations will be slower, but others will in fact be faster. I think it would be interesting to see how a generally applicable implementation along these lines would perform.

The equals() problem I mention is certainly something that would have to be solved. That and some of the points you bring up are certainly good points against this change. But let's keep it open for discussion. Also I would like to do some performance testing with the String based approach I described.

Comment 3 Knut Wannheden

2012-10-19 13:30:00 EDT

Also I'd like to point out that from what I can tell the '::' separator used for statics in Xbase doesn't seem to have anything directly to do with QualifiedName and IQualifiedNameConverter. That is a parsing/serialization detail only. I hope I didn't miss anything here.

Comment 4 Lieven Lemiengre

2012-10-21 16:08:30 EDT

(In reply to comment #2)
> Hi Lieven,
> 
> Thank you very much for all your input!
> 
> A 10% performance improvement doesn't sound too bad to me. But that begs the
> question what *exactly* was improved by 10%? I suppose the overall
> performance improvement (e.g. of a full build) was much less.
It was overall buildtime, but the variation in our performance testset is very big (sometimes >25%) for total buildtimes (OS & disk play a big role).
We did a few repetitions with a small project, our aim was to reduce the amount of toLowerCase calls (our language is mostly case-insensitive too). The performance improvement was noticable, but not big enough to make a real dent in the graphs. At that time we were also investigating other optimizations that were a lot more worthwhile so we dropped this.

> In our product we are dealing with a lot of QualifiedNames. The index alone
> contains 500K exported objects (amounting to 465K distinct QualifiedNames).
> Even a short QualifiedName quickly uses around 250 bytes. And as soon as
> toLowerCase() has been called (which will happen if you have case
> insensitive languages) you can double that to 500 bytes. So if we were to
> assume the 500 bytes as the average, the cache in ResourceDescriptionsData
> alone will use around 250 MiB of memory.
How much do you think you can improve here? If you simply intern the first 2 segments of a QualifiedName, would that give a serious reduction in memory consumption?

> In our product all languages use '.' as the separator for QualifiedNames. So
> I've been playing with the idea of using the dotted String as the internal
> representation in the QualifiedName implementation. I.e. a QualifiedName
> would just have a single String field and implement all operations using
> String arithmetics. With this approach the memory consumption can be lowered
> substantially (often by over 50%). Of course some of the QualifiedName
> operations will be slower, but others will in fact be faster. I think it
> would be interesting to see how a generally applicable implementation along
> these lines would perform.
That's an interesting approach. Reducing the amount of objects is something we did't think of.

Btw, I'm not against this enhancement, but I do seriously doubt that QN should be made extensible. The current QN implementation should simply be replaced if you find something better. Making QN extensible means that some part of it can be language specific and I don't think this is the case.

Comment 5 Knut Wannheden

2013-02-17 02:43:38 EST

Recently a change was made in Xpand/Xtend with the aim of substantially lowering the memory required for storing its Identifier and QualifiedName instances: http://git.eclipse.org/c/m2t/org.eclipse.xpand.git/commit/?id=2b669d9a0772e93f625b98002d51b7c82e11d594.

While things of course are different in Xpand/Xtend, possibly some of the approaches applied there could also be used in Xtext. Maybe a caching IQualifiedNameCoverter implementation or using a cache in QualifiedName#create() would be worthwhile. Possibly that could be used to lower the memory footprint of the default implementation so that there wouldn't be any need to make the QualifiedName class extensible. At least not for memory footprint reasons that I gave.

Comment 6 Knut Wannheden

2013-02-17 02:48:25 EST

For the mentioned Xpand/Xtend change please also refer to bug 385587 and the previous commits http://git.eclipse.org/c/m2t/org.eclipse.xpand.git/commit/?id=36f6b3da3877a82c625a9e79181c25c5da4c7dd4 and http://git.eclipse.org/c/m2t/org.eclipse.xpand.git/commit/?id=e460b61461a905e316d149ee21d7716c3b7b4e7b which represent initial attempts to reduce the memory footprint. But I think http://git.eclipse.org/c/m2t/org.eclipse.xpand.git/commit/?id=2b669d9a0772e93f625b98002d51b7c82e11d594 is the most interesting for Xtext.