The Identifier index created during Pass 1 parsing currently uses a single linear array, which grows as required by allocating a new array at 1.5x the old size and copying all values across. A consequence is that, for a given footprint (e.g. a 1 GB array plus one more object), the peak parsing space during the copy is 1 GB (old array) + 1.5 GB (new array) = 2.5 GB of heap. Pass 1 parsing would use less memory if storage were instead allocated and extended in place as needed, with no copying into a larger array.
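The no-copy growth strategy described above can be sketched as a chunked array: values are stored in a list of fixed-size pages, and growth means appending a fresh page rather than reallocating and copying. This is a hypothetical illustration, not the attached patch; the class and its names are invented for the example.

```java
import java.util.ArrayList;

// Sketch of a growable long store built from fixed-size chunks.
// Appending never copies existing data, so peak memory stays close to
// N values instead of the ~2.5N needed by grow-by-1.5x-and-copy.
public class ChunkedLongArray {
    private static final int CHUNK_SIZE = 1 << 16; // 65536 longs per chunk
    private final ArrayList<long[]> chunks = new ArrayList<>();
    private int usedInLastChunk = CHUNK_SIZE; // forces allocation of the first chunk

    public void add(long value) {
        if (usedInLastChunk == CHUNK_SIZE) {
            chunks.add(new long[CHUNK_SIZE]); // grow: one new chunk, no copying
            usedInLastChunk = 0;
        }
        chunks.get(chunks.size() - 1)[usedInLastChunk++] = value;
    }

    public long get(long index) {
        return chunks.get((int) (index / CHUNK_SIZE))[(int) (index % CHUNK_SIZE)];
    }

    public long size() {
        return (long) (chunks.size() - 1) * CHUNK_SIZE + usedInLastChunk;
    }
}
```

The trade-off, as the later comments discuss, is that a single contiguous array is no longer available, so sorting either needs a more complex sort over the subarrays or a one-off copy into a flat array.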
Created attachment 285606: Identifier Index patch
Thanks for the contribution.

1. Is this a real problem, i.e. is pass 1 the limiting factor for being able to successfully parse a dump, or will a later stage always fail first? For example, the dominator tree allocates at least 6 int arrays, each of the size of the number of identifiers.
2. The parallel sort is replaced by a special Quicksort; is this significantly slower?
3. Could a pathological or malicious dump trigger O(n^2) sort time?
Andrew, this is a lead-in to some follow-up changes; the goal is that we can likely eliminate the pass 1 static footprint requirement at initial parse time. The second chunk allows pass 1 parsing output to be shifted to disk before starting pass 2. The net outcome is a reduction of the fixed pass 1 memory requirements during parsing. The main challenge is doing it without breaking the API.

I'm happy to withdraw this; I think there is probably a better way to tackle it by deferring to an alternative I/O model. This would be something like writing directly through to disk for some of the Array* models, though that needs to be balanced against I/O costs. I have yet to think about this in detail, but it seems it would allow for significantly reduced parsing memory requirements.
I made a basic change to address some of this problem: https://git.eclipse.org/c/mat/org.eclipse.mat.git/diff/plugins/org.eclipse.mat.parser/src/org/eclipse/mat/parser/index/IndexWriter.java?id=23f80cc96a622bdc1dc886fdab5cbc69c9f3c945

This uses an ArrayLongBig to accumulate the longs in a series of arrays. Before a sort, the values are copied into a single long[], so it requires approximately 2N values rather than up to 2.5N before this change. Attachment 285606 shows how it could be done in N values, but with a more complex sort of the subarrays.

I notice that the Java 17 class library's Arrays.parallelSort() makes a copy of the array, whereas the Java 11 version did not. This means that even if the conversion of ArrayLongBig to long[] were done via a file, so that only N values were kept in memory at once, the parallelSort would require another N.

If the heap space for the Identifier index is still a problem in pass 1, then we can look at this patch again.
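The accumulate-then-copy approach described in that change can be sketched as follows. This is an illustrative stand-in, not the actual ArrayLongBig or IndexWriter code; the class and method names here are invented for the example. Values are collected in fixed-size pages (no copying while growing), then flattened once into a long[] for sorting, giving a peak of roughly 2N values: N in the pages plus N in the flat copy.

```java
import java.util.ArrayList;
import java.util.Arrays;

// Sketch of accumulating longs in a series of arrays, then copying once
// into a single long[] for sorting. Peak memory is ~2N values (pages plus
// the flat copy), versus up to ~2.5N for a single 1.5x-grown array.
public class BigLongAccumulator {
    private static final int PAGE = 1 << 16; // 65536 longs per page
    private final ArrayList<long[]> pages = new ArrayList<>();
    private int usedInLastPage = PAGE; // forces allocation of the first page

    public void add(long v) {
        if (usedInLastPage == PAGE) {
            pages.add(new long[PAGE]);
            usedInLastPage = 0;
        }
        pages.get(pages.size() - 1)[usedInLastPage++] = v;
    }

    // Flattens all pages into one array and sorts it. Pages could be
    // released as they are consumed to trim the peak slightly further.
    public long[] toSortedArray() {
        int n = (pages.size() - 1) * PAGE + usedInLastPage;
        long[] all = new long[n];
        int pos = 0;
        for (int i = 0; i < pages.size(); i++) {
            int len = (i == pages.size() - 1) ? usedInLastPage : PAGE;
            System.arraycopy(pages.get(i), 0, all, pos, len);
            pos += len;
        }
        Arrays.sort(all);
        return all;
    }
}
```

Note that using Arrays.parallelSort() here instead of Arrays.sort() would, on Java 17, add another N of working space, as observed above.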
This issue has been migrated to https://github.com/eclipse-mat/org.eclipse.mat/issues/32.