Bug 571288 - [compiler] [performance] Improve jdt compiler CharOperation
Summary: [compiler] [performance] Improve jdt compiler CharOperation
Status: RESOLVED WONTFIX
Alias: None
Product: JDT
Classification: Eclipse Project
Component: Core
Version: 4.20
Hardware: All
OS: All
Importance: P3 normal
Target Milestone: ---
Assignee: JDT-Core-Inbox
QA Contact:
URL:
Whiteboard:
Keywords: performance
Depends on:
Blocks:
 
Reported: 2021-02-18 02:53 EST by Jörg Kubitz
Modified: 2021-04-30 05:14 EDT
CC List: 2 users

See Also:


Attachments
Screenshot of sampled hotspots (119.54 KB, image/png)
2021-02-18 02:53 EST, Jörg Kubitz

Description Jörg Kubitz 2021-02-18 02:53:53 EST
Created attachment 285589
Screenshot of sampled hotspots

Currently, CPU compile time has hotspots within CharOperation(s).
In CharOperation.hashCode() and .equals() the char array is evaluated on a char-by-char basis. This could be improved a lot by caching the hash value inside a container object.
java.lang.String, for example, does this with its cached hash field.
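For illustration, a minimal sketch of such a container (the class name CachedCharArray is hypothetical and not part of JDT; the hash formula mirrors the one java.lang.String uses):

    // Hypothetical sketch: a char[] wrapper that caches its hash code,
    // similar to how java.lang.String caches its hash field.
    public final class CachedCharArray {
        private final char[] chars; // assumed never to be mutated after construction
        private int hash;           // 0 means "not yet computed", as in java.lang.String

        public CachedCharArray(char[] chars) {
            this.chars = chars;
        }

        @Override
        public int hashCode() {
            int h = this.hash;
            if (h == 0 && this.chars.length > 0) {
                for (char c : this.chars)
                    h = 31 * h + c;
                this.hash = h; // a hash that happens to be 0 is simply recomputed next time
            }
            return h;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other)
                return true;
            if (!(other instanceof CachedCharArray))
                return false;
            return java.util.Arrays.equals(this.chars, ((CachedCharArray) other).chars);
        }
    }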

My proposal is to replace all char[] inside the compiler with java.lang.String or a custom String class.

Differences between char[] and java.lang.String:
 char[] is mutable - String is immutable.
 char[] is an array of 16-bit chars - String internally contains an 8-bit byte[].
As far as I see, char[] is not mutated in the JDT compiler.
With current Java versions a String is stored in either Latin-1 or UTF-16 encoding. This saves RAM where only Latin-1 characters are used, and most Java source code I know uses only Latin-1 characters. Less memory might improve cache hits. The two separate encoding code paths introduce a little overhead, but probably this is not noticeable.

This would be a big code change for which I want approval before contributing. Is there any deeper reason why char[] is used?
Comment 1 Eclipse Genie 2021-02-24 07:53:28 EST
New Gerrit change created: https://git.eclipse.org/r/c/jdt/eclipse.jdt.core/+/176811
Comment 2 Stephan Herrmann 2021-02-25 12:45:25 EST
(In reply to Jörg Kubitz from comment #0)
> This would be a big code change for which I want approval before contributing.
> Is there any deeper reason why char[] is used?

I don't think there's still a developer around from those days who could tell you the reasoning from back then (20+ years ago). My first guess is that org.eclipse.jdt.internal.compiler.parser.Scanner.source is the mother of all char[].

Perhaps, back then, someone found that creating an array as a slice of this array was cheaper than creating a full-blown String object (I'm guessing). Such micro-performance characteristics may of course have changed as JVMs evolved.

Before starting on that refactoring (not big as you say, but very, very huge), please have a look at class CharOperation for the wealth of operations by which char[] and char[][] etc. are processed. Next, search for references to System.arraycopy() (in my JDT workspace almost 3000 references). That should give a first picture of the amount of work you are speaking about.

Also there are API classes with char[] and char[][] in their signatures. Class Signature is the first that comes to mind.

OTOH, I know it's a PITA to do all the conversions, e.g., at the boundary between compiler and Java model. So even I would be happy to have a compiler using String throughout.

Unfortunately, it's very difficult to predict the actual impact on performance of a full replacement. Maybe not all is improvement?

Pragmatically, I would say manually switching from char[] to String is near impossible (read: we'd all be retired before you finish). So, do you have ideas how that transformation could be safely, reliably (!) automated?

If you are able to implement a fully automatic transformation that's simple enough that we can tell just from reading the code that it is 100.0000% correct, THEN I will say: GOOD JOB!

In all other cases I'd be very much scared of this. Not to mention the huge effect on git blame.


One last thought, as you started from analysing the performance of HashtableOfPackage: I wonder what would be a typical number of elements in each table? Since all these structures are very much cascaded, I wouldn't expect any of these tables to be huge. Do you have data on this? Now, if my feeling is right, then perhaps hashing isn't even the most suitable strategy for lookup? Here's a crazy idea: how would those hash tables perform if we used just the first character (or the sum of the first 2 characters ...) as the hash? Perhaps the number of hash collisions would still be tolerable?
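For illustration only, such a hash could look like this (the method name cheapHash and the indexing line are hypothetical, not taken from HashtableOfPackage):

    // Hypothetical sketch of the idea above: hash a name by its first one or
    // two characters only, instead of scanning the whole char[].
    static int cheapHash(char[] name) {
        if (name.length == 0) return 0;
        if (name.length == 1) return name[0];
        return name[0] + name[1]; // sum of the first two characters
    }

    // A (hypothetical) table of size tableSize would then index as:
    //   int index = cheapHash(key) % tableSize;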
Comment 3 Jörg Kubitz 2021-02-25 13:32:24 EST
I would probably start with a search & replace like "[%]" -> ".charAt(%)", then refine such replacements until no compile errors are left, and completely rewrite/optimize CharOperation. I guess that would take a week or a month (you retire so fast?). It won't be validatable by just looking at it! You would have to trust me - and the tests. To not touch the tests and signatures, I could add compatibility signatures containing String->char conversions.
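For illustration, one such (deliberately over-broad) replacement could be expressed as a regex; the pattern below is only an example and hits every indexed array access, not just char[], so it would need manual refinement afterwards:

    // Hypothetical illustration of the over-broad replacement rule:
    String before = "if (source[index] == '.') return source[index + 1];";
    String after  = before.replaceAll("(\\w+)\\[([\\w +]+)\\]", "$1.charAt($2)");
    // after == "if (source.charAt(index) == '.') return source.charAt(index + 1);"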
Blame is a big issue; however, you can always blame the revision before (1 extra click). I don't know how much that would affect your work, therefore I totally value your estimate/feeling.

I took your message as a vote for "not worth".

Statistics:

During compilation of JDT itself the hash maps are mostly sized <10, and only a few reach something like size 100. The entries are mostly class names or fully qualified names. Fully qualified names typically all start with the same substring, like

"org.eclipse.jdt.core.tests.performance",
"org.eclipse.jdt.core.tests.other"

So typically the first characters are always the same. And even the last character is highly dominated by the letter frequency [https://en.wikipedia.org/wiki/Letter_frequency] of the English language, which is typically used in programming. I guess 'e', 'r', 't' make up 90% of the last positions. That's a mess for hashing.
Comment 4 Jörg Kubitz 2021-02-26 05:56:27 EST
As far as I know, only a few rules would be needed (a worked example follows after the list):

1. Substitute for-each loops:
    for (char c : chars) { body }
 -> for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      body
    }

2. Convert the length field to the length() method:
    chars.length
 ->
    s.length()

3. Type conversion:
    char[]
 ->
    String

4. Substitute indexed reads:
    chars[i]
 ->
    s.charAt(i)

5. Add methods with legacy arguments:
    function(.., char[] chars, ..) { body }
 ->
    @Deprecated
    function(.., char[] chars, ..) { return function(.., String.valueOf(chars), ..); }
    function(.., String s, ..) { body } // new method from Rule 3

6. Add functions with legacy return types:
    char[] function(args) { body }
 ->
    @Deprecated
    char[] function(args) { return functionAsString(args).toCharArray(); }
    String functionAsString(args) { body } // renamed function from Rule 3
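
To illustrate the rules, here is a hypothetical before/after for a small helper in the style of CharOperation.indexOf (simplified; not the actual JDT code):

    // Before (simplified, char[]-based):
    public static int indexOf(char toBeFound, char[] array) {
        for (int i = 0; i < array.length; i++) {
            if (array[i] == toBeFound)
                return i;
        }
        return -1;
    }

    // After applying Rules 2-5: a String-based method plus a deprecated legacy overload.
    @Deprecated
    public static int indexOf(char toBeFound, char[] array) {
        return indexOf(toBeFound, String.valueOf(array)); // Rule 5: delegate to the String variant
    }

    public static int indexOf(char toBeFound, String s) {
        for (int i = 0; i < s.length(); i++) {            // Rule 2: length()
            if (s.charAt(i) == toBeFound)                 // Rule 4: charAt()
                return i;
        }
        return -1;
    }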
Comment 5 Stephan Herrmann 2021-02-28 17:53:54 EST
(In reply to Jörg Kubitz from comment #4)
> As far as I know, only a few rules would be needed:

I value your enthusiasm, but I believe this is still far from covering the issues I mentioned in comment 2.

Before you dive into more details: we know this is a huge task, and as such it has unavoidable, non-trivial risks (keep in mind: every bug we introduce in the compiler is a potential show-stopper for some users). To balance the risks, what would be the big advantage of that change? Is all this only for hashtable performance? Or is there some greater value to be gained?

Frankly, to me this feels like: "hashtable performance isn't great, oh, hashtable has an issue with char[]. Wait, char[] is used in lots of places where I didn't expect it. That must be changed, too."

That last sentence is jumping to a non-solution :)
Comment 6 Jörg Kubitz 2021-02-28 23:41:50 EST
> I believe this is still far from covering the issues I mentioned

While I believe you see issues where there are none: System.arraycopy is only used in CharOperation for char[]. Those calls would be replaced by String operations like +. Other classes only arraycopy char[][], which would still arraycopy String[].

> Or is there some greater value to be gained?

Pros:
* Cleaner interface: using an immutable type where an immutable is needed.
* More readable code: using built-in String operations instead of lengthy calls.
* Performance change: the speed-up would be at most a factor of 2 (due to half the memory) plus a factor on the order of the average String length (because the Strings need to be hashed only once). There is also a chance that the speed drops to half because of the additional checks for UTF-16/Latin-1 and the additional indirection. I don't know the outcome. I don't even have a guess. It would be an experiment. It's measurable. But I won't do the experiment if it would be rejected anyway, even with positive performance gains, due to other reasons:

Cons:
* blame
* risk
Comment 7 Stephan Herrmann 2021-03-01 06:15:29 EST
(In reply to Jörg Kubitz from comment #6)
> > I believe this is still far from covering the issues I mentioned
> 
> While I believe you see issues where there are none: System.arraycopy is
> only used in CharOperation for char[].

Wrong :)

Sorry, I will not tell which examples I found, because I believe it's your task to demonstrate that you see the full picture :)
Comment 8 Stephan Herrmann 2021-03-02 05:24:41 EST
Bug 571522 mentions reasons both pro and contra the proposed refactoring:

PRO: if we turn off batched compilation then lookup speed gains relevance because lookup structures will be populated more densely (really? measurably?).

CON: why focus on reducing memory footprint when "today (2021) ram is cheap and plenty"?
Comment 9 Jörg Kubitz 2021-03-02 05:29:29 EST
I don't focus on memory footprint. It's a matter of performance whether to use double the amount of memory. It might even make the difference whether the workload fits into the L1/L2 cache. So the performance difference may be even more than a factor of two when you double the amount of memory used within a tight loop.
Comment 10 Jörg Kubitz 2021-03-23 14:29:16 EDT
I just saw the new Vector API from JDK 16:
https://download.java.net/java/early_access/jdk16/docs/api/jdk.incubator.vector/jdk/incubator/vector/Vector.html
My first thought: great, now I can implement a vectorized CharOperation. But wait - there is no CharVector, only ByteVector/ShortVector, without any possible typecast from/to char arrays. :-(
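For illustration, under that constraint a char[] would first have to be copied into a short[]; a rough sketch of what a vectorized equals might then look like (the class name, the copy step, and the overall structure are just an illustration; it needs --add-modules jdk.incubator.vector):

    import jdk.incubator.vector.ShortVector;
    import jdk.incubator.vector.VectorSpecies;

    // Hypothetical sketch: vectorized char[] comparison via a short[] copy,
    // since the incubating Vector API has no CharVector.
    final class VectorizedCharCompare {
        private static final VectorSpecies<Short> SPECIES = ShortVector.SPECIES_PREFERRED;

        static boolean equals(char[] a, char[] b) {
            if (a.length != b.length)
                return false;
            short[] sa = toShorts(a); // this copy defeats much of the benefit
            short[] sb = toShorts(b);
            int i = 0;
            int upper = SPECIES.loopBound(sa.length);
            for (; i < upper; i += SPECIES.length()) {
                ShortVector va = ShortVector.fromArray(SPECIES, sa, i);
                ShortVector vb = ShortVector.fromArray(SPECIES, sb, i);
                if (!va.eq(vb).allTrue())
                    return false;
            }
            for (; i < sa.length; i++) { // scalar tail
                if (sa[i] != sb[i])
                    return false;
            }
            return true;
        }

        private static short[] toShorts(char[] chars) {
            short[] shorts = new short[chars.length];
            for (int j = 0; j < chars.length; j++)
                shorts[j] = (short) chars[j]; // bit-for-bit copy, so equality is preserved
            return shorts;
        }
    }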
Comment 11 Jörg Kubitz 2021-04-30 05:14:06 EDT
Closing this, as there is no support from the JDT Core team for such a change.