Bug 572512 - Memory mapped files for parsing storage (proposal for comment)
Summary: Memory mapped files for parsing storage (proposal for comment)
Status: CLOSED MOVED
Alias: None
Product: MAT
Classification: Tools
Component: Core
Version: 1.11
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Project Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-04-01 01:18 EDT by Jason Koch CLA
Modified: 2024-05-08 15:49 EDT
CC List: 1 user

See Also:


Attachments
mmap_index_writer_base.patch (24.83 KB, patch)
2021-04-01 01:18 EDT, Jason Koch CLA
Free some space during garbage collection. (5.40 KB, patch)
2021-05-11 09:11 EDT, Andrew Johnson CLA
dominator_tree.patch (17.95 KB, patch)
2021-05-11 09:12 EDT, Jason Koch CLA
2_memory_mapped_storage_into_tmpdir.patch (1.64 KB, patch)
2021-06-02 13:16 EDT, Jason Koch CLA
3_explicit_check_for_negative_index (2.22 KB, patch)
2021-06-02 13:18 EDT, Jason Koch CLA
4_introduce_basic_heap_backed_arrays.patch (2.32 KB, patch)
2021-06-02 13:21 EDT, Jason Koch CLA

Description Jason Koch CLA 2021-04-01 01:18:26 EDT
Created attachment 286016 [details]
mmap_index_writer_base.patch

Current parsing of files requires reading large arrays into heap storage, which sets a lower bound on the memory footprint of Pass1/Pass2 parsing.

This patch moves the parsing to off-heap storage, specifically to a file-backed region. This allows parsing much larger files with a significantly smaller heap footprint.

Observations:
- There is a choice here between on-heap and off-heap. On-heap is convenient in that the JVM's usage is clearly bounded, but that fixed bound is of course also the downside. Off-heap simplifies this, especially when file-backed, as it allows the OS to spill pages to disk as needed and to handle vastly larger files within a given heap size.
- A nice feature of off-heap is that sparse regions/arrays take up no physical space. For example, if an array is allocated N entries long but only 1/4 of those are used, a filesystem with sparse-file support (most modern OSes) only allocates physical resources for what the application actually uses. For anonymous regions, only the pages that are touched are mapped. This is a distinct difference from on-heap, where the memory is allocated/paged in even if it is not required.
- There is also a choice between off-heap anonymous (using available RAM, spilled via swap) and off-heap named file (using available FS cache, or paged to a file on the filesystem). The current implementation picks a named file, but uses Java's default /tmp; there is no guarantee this is fast or appropriate for the user's device.
- Following comments on the list, I _assume_ that most users are on SSDs, which support fast random access. Spinning-disk performance will likely suffer if any spill is required; in that case we should switch to anonymous off-heap, which guarantees the storage stays in memory.

Comments:
- I haven't performed any speed-of-parsing performance testing. The difference between "cannot parse due to not enough RAM" and "can parse" is an easy win; it is less obvious to me whether this model carries a penalty for heaps that do fit in RAM. That warrants further testing if the approach looks sensible to others.
- I have specifically written this so that, I think, it should be easy to swap in different implementations (off-heap file, off-heap anonymous, on-heap, etc.) while coding to the interface. We could perhaps provide this as an option to the user, or even switch implementations dynamically. A rough sketch of the file-backed idea is below.
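
To make the proposal concrete, here is a minimal sketch of a file-backed long array in the spirit of the patch; the class and its names are illustrative only (not the patch's actual code), and a production version would map the file in chunks, since a single mapping is limited to 2GB.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;

// Sketch: a long[] equivalent backed by a memory-mapped temp file. Pages live
// in the OS file cache rather than the Java heap, and on a sparse-file FS only
// regions actually written consume physical space.
public class MappedLongArray implements AutoCloseable {
    private final RandomAccessFile file;
    private final LongBuffer buffer;

    public MappedLongArray(long entries) throws IOException {
        File f = File.createTempFile("mat-index-", ".bin");
        f.deleteOnExit();
        file = new RandomAccessFile(f, "rw");
        // NB: map() rejects sizes over Integer.MAX_VALUE bytes, so a real
        // implementation keeps an array of ~1GB chunk mappings instead.
        buffer = file.getChannel()
                     .map(FileChannel.MapMode.READ_WRITE, 0, entries * Long.BYTES)
                     .asLongBuffer();
    }

    // Single-chunk sketch: the index must fit an int here.
    public long get(long index) { return buffer.get((int) index); }
    public void set(long index, long value) { buffer.put((int) index, value); }

    public void close() throws IOException { file.close(); }
}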
Comment 1 Andrew Johnson CLA 2021-04-09 13:15:09 EDT
Interesting.
Do we have some idea of how the phases compare in memory usage:
Pass 1
Pass 2
Garbage collection
Dominator tree - this can be quite memory intensive, but operates after unreachable objects have been discarded
Comment 2 Jason Koch CLA 2021-04-10 13:30:58 EDT
> Interesting.
> Do we have some idea of how the phases compare in memory usage:

In my experience this depends on the 'shape' of the heap under analysis. I'll add example calculations for a 1-billion-object heap; this is very rough analysis. The primary determinant of memory usage is the object _count_ and, to a smaller extent, the tree structure from the GC roots. The size of the hprof file itself matters much less.

> Pass 1

For 1 billion objects, 20GB during generation, falling to 8GB steady state.

The bulk of the memory here is the 'Identifier' store, a long[] at least as long as the object count. While generating, up to newCapacity is allocated as copy/scratch space (1.5x), so we have to assume 2.5 x 8 B x N objects: for 1 billion objects that is ~20GB during generation, dropping to 8GB (1 x 8 B x N) once resizing is done.

> Pass 2

Multiple indexes are generated here, assuming 1bn objects; likely 16-28GB plus scratch space, the majority of it retained.

(1) object id -> outbound id references log [IntArray1NWriter, flushed to PosIndexStreamer/IntIndexStreamer]. Size depends on the shape of the ref tree: int[] 4GB for header, byte[] 1GB for header2, plus the IntIndexStreamer underneath, which uses int[page size] during generation, plus background compression. +12GB.
(2) object id -> class id array [IntIndexCollector] (paged ArrayIntCompressed, compression based on max class id bits): int[], something less than 4GB.
(3) object id -> position in file index [LongIndexCollector] (paged ArrayLongCompressed, compression based on max file-position bits): long[], something less than 8GB.
(4) object id -> array size index (for arrays only) [SizeIndexCollectorUncompressed] (int -> int array): int[], much less than 4GB.

> Garbage collection

Approx 25GB here during processing; most of the memory is dropped on completion.

Over and above whatever is stored previously, for 1 billion objects:
boolean[] reachable (1 B * object count) = 1GB, dropped on completion.
IntArray1NSortedWriter for the completed outbound storage: int[] 4GB for header, byte[] 1GB for header2. +12GB; dropped on completion.
InboundWriter for collating the inbound references: int[] 4GB for header, byte[] 1GB for header2, long[] segmentSizes (<1GB); writes go directly to file and are then reprocessed. +12GB; dropped on completion.

> Dominator tree - this can be quite memory intensive, but operates
> after unreachable objects have been discarded

Assuming a full 1bn-object heap, approx 28GB during processing, dropped on completion; however, up to 16GB of index is unload()ed before starting, which mitigates this quite a bit.

a2size, o2address and o2class are unload()ed here, so the effective requirement is somewhat less than 28GB; we can assume 8-16GB is saved before starting, depending on xxCompressed page effectiveness. The tree construction then allocates 7 int[] arrays sized to the number of objects in the heap: for a 1bn-object file that is 7 x 4 B x 1bn = 28GB. As you point out, if GarbageCleaner has reduced the footprint, the DominatorTree requirements scale down accordingly.

======

So, Identifier and Size certainly have an immediate impact.

If we think this strategy is sensible, we can extend the memory-mapped approach to _all_ index types and save significant amounts of memory during processing. I picked the two indexes that were least coupled in the code, and that would have an impact, to demonstrate the basic idea.

Lots of the existing code relies on the class hierarchy and cross-references between reader and writer, so I wanted some validation of the idea before pursuing it further.

There are additional benefits if we take this further. For example, rather than reading a whole chunk of the file into a SoftReference<ArrayLongCompressed> and unpacking it, ArrayLongCompressed could be a wrapper/layer over the underlying store, with the OS managing caching. I think this sort of thing can simplify the code quite a bit, as we would no longer need to manage concurrent caching to achieve performance.
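
As a rough illustration of that last point, here is a sketch of reading fixed-width packed entries straight out of a mapped buffer. This is not MAT's actual ArrayLongCompressed encoding, just the general shape of a reader where the OS page cache replaces the SoftReference page cache:

import java.nio.MappedByteBuffer;

// Sketch: read entry i of a stream of values packed big-endian, `bits` bits
// wide, starting at byte offset `base` of a mapped buffer.
final class PackedReader {
    static long get(MappedByteBuffer buf, int base, int bits, long i) {
        long firstBit = i * bits;
        int byteIdx = base + (int) (firstBit >>> 3);
        int bitInByte = (int) (firstBit & 7);
        long value = 0;
        for (int remaining = bits; remaining > 0; ) {
            int b = buf.get(byteIdx++) & 0xFF;           // absolute get: no position change
            int take = Math.min(8 - bitInByte, remaining);
            int chunk = (b >>> (8 - bitInByte - take)) & ((1 << take) - 1);
            value = (value << take) | chunk;
            remaining -= take;
            bitInByte = 0;
        }
        return value;
    }
}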
Comment 3 Jason Koch CLA 2021-04-10 17:35:23 EDT
To show the impact, I ran a quick test against a modestly sized heap. The result is a ~25% reduction in the memory required to process the heap.

Sample heap:
- 20,800 MByte heap (~20GB)
- 153 million objects
- 65 million objects removed by GarbageCleaner

Xmx/Xms settings required to parse the heap:

- upstream/master: 5400m OK; 5200m fails at GarbageCleaner createHistogramOfUnreachableObjects() during a hashmap resize. At that point Identifier takes 1.25GB (50% of heap), the histogram ArrayList 520MB (20%), and IntIndexCollector 510MB (19%). This fails before reaching DominatorTree processing.

- origin/netflix: 4000m OK; 3800m fails at DominatorTree.<init>. Inspecting the dump shows 2GB of DominatorTree int[] arrays and SnapshotFactoryImpl.parse():purgedMapping taking 600MB [nb: I'll put up a new patch for this]. These two components account for 98.1% of the live heap.

From this you can see the impact of moving the Identifier & Size indexes to mmap/off-heap. Once the other generated indexes have been unload()ed, the next biggest use of memory is DominatorTree.
Comment 4 Andrew Johnson CLA 2021-04-28 09:07:23 EDT
A simple change for IndexWriter.Identifier might be to use an ArrayLongBig internally to collect identifiers via add().

API-compatible changes:
On sort() it might have to append the new entries to any existing sorted entries, then reset the ArrayLongBig.
get() would need to go to the big array for the first entries (e.g. after a sort), then to any current ArrayLongBig.

That might limit the size needed to 2n rather than 2.5n.
[Though if m entries were added after sorting, it would need n+m longs, plus m for the toArray, minus m when the ArrayLongBig is cleared, then n+m for the new array, so that is still 2(n+m).]
That is still in-memory though.
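
A minimal sketch of the two-tier shape being described, with hypothetical names (not the actual IndexWriter.Identifier code): a sorted base array plus an append buffer, standing in for the ArrayLongBig, that sort() folds in.

import java.util.Arrays;

// Peak usage at sort() is about 2(n+m) longs, versus 2.5x for the
// grow-by-copying approach discussed earlier in the bug.
final class TwoTierIdentifier {
    private long[] sorted = new long[0];    // base entries, sorted
    private long[] pending = new long[16];  // append buffer
    private int pendingSize = 0;

    void add(long id) {
        if (pendingSize == pending.length)
            pending = Arrays.copyOf(pending, pending.length * 2);
        pending[pendingSize++] = id;
    }

    void sort() {
        long[] merged = new long[sorted.length + pendingSize];
        System.arraycopy(sorted, 0, merged, 0, sorted.length);
        System.arraycopy(pending, 0, merged, sorted.length, pendingSize);
        pending = new long[16];             // reset the append buffer
        pendingSize = 0;
        Arrays.sort(merged);
        sorted = merged;
    }

    long get(long index) {                  // sorted base first, then pending
        return index < sorted.length
                ? sorted[(int) index]
                : pending[(int) (index - sorted.length)];
    }
}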
Comment 5 Jason Koch CLA 2021-04-28 17:23:33 EDT
Yes, I initially looked at something like this, and it is viable. Specifically, you want to be able to release the heap allocation once the Identifier storage has been loaded. The biggest win would come from an API change or a sneaky trick: after a sort(), you could trigger a flush to disk and then reload the data from the mmap'd region, releasing it from the heap; or do it explicitly by exposing a flush in the API that converts it to an xxReader. However, as I built this out, going to a direct memory-mapped model made for much cleaner code, and it opens a future path to simplifying quite a bit of the storage and memory tier.

I have not yet pursued it, but I could run an experiment swapping the DominatorTree to off-heap (it should be simple to use an IntArray instead of int[], etc.) and measure how that affects overall parsing memory requirements.
Comment 6 Andrew Johnson CLA 2021-05-10 15:23:54 EDT
For GarbageCleaner we have 'IOne2LongIndex identifiers = idx.identifiers;', an index holding the old object addresses.
This is passed in from the parser and, for HPROF and DTFJ, is an in-heap IndexWriter.Identifier object. I've recently changed IndexWriter.Identifier to accumulate into an ArrayLongBig, which sort() then converts to a long[] of exactly the size needed.

It is kept around for the unreachable-objects phase; then there is the map[] phase, where the old-to-new mapping is built, plus a new array of object addresses indexed by the new object ID.
long identifiers[oldN]
boolean reachable[oldN]
there is then this part
int[] map = new int[oldNoOfObjects];
long[] id2a = new long[newNoOfObjects];
so you are right: the old identifiers at 8*oldN bytes and the new ones at 8*newN bytes will use a lot of space.

We can off-load the identifier addresses with quite a small code change.
We can write the identifiers to a temporary index file using LongIndexStreamer.writeTo at the beginning of the clean.
writeTo takes a long[], an ArrayLong or an IteratorLong. We could extract a long[] from the IOne2LongIndex via getNext, but that would take a lot of space.
We can instead create an IteratorLong that reads the supplied IOne2LongIndex, write it out with LongIndexStreamer.writeTo, and get a new index. We can then close and delete the supplied index to free the long array.
We could even close the new index to free a little more space.
We then do the GC and build the map[].
We then reopen the index file for the identifier addresses.
We then have another IteratorLong stepping through the map[], skipping unused entries, retrieving the object addresses from the newly created temporary index and writing them out to the final index.

This should save quite a bit of space at the expense of an extra file write and read, and with the memory saved the SoftReference caches should work better too.
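
A sketch of the off-load step just described; the MAT types (IOne2LongIndex, IteratorLong, LongIndexStreamer) are as named in the comment, but the exact signatures and the tempIndexFile variable are assumed for illustration.

// Stream the in-heap identifier index out to a temporary file, then drop it.
final IOne2LongIndex identifiers = idx.identifiers;
final int n = identifiers.size();

IteratorLong it = new IteratorLong() {
    int i = 0;
    public boolean hasNext() { return i < n; }
    public long next() { return identifiers.get(i++); }
};

IOne2LongIndex onDisk = new IndexWriter.LongIndexStreamer()
        .writeTo(tempIndexFile, it);  // returns a disk-backed replacement index
identifiers.close();
identifiers.delete();                 // frees the in-heap long[]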
Comment 7 Andrew Johnson CLA 2021-05-11 04:55:00 EDT
Similarly, we could store the old object to class index to disk before the garbage clean.

Example OutOfMemoryError when parsing a dump with 63 million live objects and, judging by the boolean[63169578] below, 63,169,578 total objects.
The failure is allocating map[]:

at org.eclipse.mat.parser.internal.GarbageCleaner.clean(Lorg/eclipse/mat/parser/internal/PreliminaryIndexImpl;Lorg/eclipse/mat/parser/internal/SnapshotImplBuilder;Ljava/util/Map;Lorg/eclipse/mat/util/IProgressListener;)[I (GarbageCleaner.java:149)


Object / Stack Frame                                                           |Name| Shallow Heap | Retained Heap
------------------------------------------------------------------------------------------------------------------
<local> org.eclipse.mat.parser.internal.PreliminaryIndexImpl @ 0x826fc308      |    |           48 |       545,256
<local> org.eclipse.mat.parser.internal.SnapshotImplBuilder @ 0x82abf540       |    |           40 |            40
<local> java.util.HashMap @ 0x826fc370                                         |    |           48 |            48
<local> org.eclipse.mat.ui.util.ProgressMonitorWrapper @ 0x826fc338            |    |           32 |            32
<local> org.eclipse.mat.parser.index.IndexManager @ 0x82abf568                 |    |           48 |            48
<local> boolean[63169578] @ 0xaa000000                                         |    |   63,169,600 |    63,169,600
<local> int[3703] @ 0x82abf780                                                 |    |       14,832 |        14,832
<local> org.eclipse.mat.parser.index.IndexWriter$Identifier @ 0x82cce638       |    |           24 |   505,356,664
<local> org.eclipse.mat.parser.index.IndexReader$IntIndex1NReader @ 0x82d7e2d0 |    |           32 |        24,112
<local> org.eclipse.mat.parser.index.IndexWriter$IntIndexCollector @ 0x812f7a60|    |           48 |   205,306,888
<local> org.eclipse.mat.collect.HashMapIntObject @ 0x82d1a800                  |    |           40 |       828,072
<local> java.util.HashMap @ 0x82f02800                                         |    |           48 |     4,231,568
------------------------------------------------------------------------------------------------------------------

The IntIndexCollector for the object-to-class index takes approximately log2(oldN)*oldN/8 bytes as an in-memory version: here, 26 bits * 63,169,578 / 8 is about 205MB, matching the 205,306,888 bytes retained above.
Comment 8 Andrew Johnson CLA 2021-05-11 09:11:31 EDT
Created attachment 286362 [details]
Free some space during garbage collection.

The parsers can supply the object-to-address and object-to-class indexes as in-memory versions. This patch frees that memory by converting them to disk-backed versions.
Comment 9 Jason Koch CLA 2021-05-11 09:12:00 EDT
Created attachment 286363 [details]
dominator_tree.patch

Sample patch to support using disk-backed storage during DominatorTree construction.
Comment 10 Jason Koch CLA 2021-05-11 09:15:10 EDT
Yes, there are a few ways to keep reducing the memory footprint. For example, the attached dominator tree patch removes 5 of the int[] arrays from DominatorTree construction; for a 2bn-object heap, at 5 x 4 B x 2bn, this works out to a saving of 40GB of required heap space.
Comment 11 Andrew Johnson CLA 2021-05-11 09:19:48 EDT
I've added a simple patch to free some space during garbage collection. Please give it a try on a large dump to see if it helps.
I think it could reduce usage at that point by 8*nOld (for the identifier index) + 8*nOld (for the long[] array) + log2(nOld)*nOld/8 (for the object to classId compressed index), so up to 19.75GByte for 1,000,000,000 objects (8 + 8 + 30/8 GB), though if those indexes were used by GC then some memory would be in the soft reference caches.
Comment 12 Eclipse Genie CLA 2021-05-12 11:58:28 EDT
New Gerrit change created: https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/180543
Comment 14 Eclipse Genie CLA 2021-05-12 15:53:11 EDT
New Gerrit change created: https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/180553
Comment 16 Andrew Johnson CLA 2021-05-13 03:09:30 EDT
I've merged some changes that free up some space for the object id -> address and object id -> class id indexes during GC.
It turned out that the DTFJ parser was already converting these indexes into disk versions if they were big enough and memory was short, so I made HPROF always do it, and added a conversion process for the in-memory Identifier and IntIndexCollector indexes.
I didn't bother unloading the disk indexes: if there is enough memory it is good to use any entries already cached, and the soft references allow them to be freed if memory is short.
Comment 17 Andrew Johnson CLA 2021-06-02 09:31:21 EDT
Comment on attachment 286363 [details]
dominator_tree.patch

Please do not merge this patch.
Comment 18 Andrew Johnson CLA 2021-06-02 09:31:35 EDT
Please don't take this as a vote against the idea, but I'm going to mark attachment 286363 [details] as obsolete: it looks like it includes code which is not original, so it can't be submitted under the Eclipse Contributor Agreement (https://www.eclipse.org/legal/ECA.php) as EPL code to the project. Marking it obsolete is a reminder not to merge it.
Comment 19 Jason Koch CLA 2021-06-02 12:09:31 EDT
+1, I agree with your decision to obsolete the dom-tree patch.
Comment 20 Jason Koch CLA 2021-06-02 13:16:32 EDT
Created attachment 286508 [details]
2_memory_mapped_storage_into_tmpdir.patch

Fix to place files into java tmp location instead of current working directory.
Comment 21 Jason Koch CLA 2021-06-02 13:18:00 EDT
Created attachment 286509 [details]
3_explicit_check_for_negative_index

Explicitly check for negative indexes, in case of more than 2bn entries (int overflow).
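
For context, the failure mode being guarded against is int arithmetic wrapping negative once a computed index passes 2^31 - 1. A hypothetical guard of that shape (illustrative helper, not the patch's code):

// Fail fast on a long index that does not fit an int, instead of letting it
// wrap negative and corrupt the index.
static int checkedIndex(long index) {
    if (index < 0 || index > Integer.MAX_VALUE)
        throw new IndexOutOfBoundsException("index overflows int range: " + index);
    return (int) index;
}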
Comment 22 Jason Koch CLA 2021-06-02 13:21:25 EDT
Created attachment 286510 [details]
4_introduce_basic_heap_backed_arrays.patch

Also, this basic heap-backed implementation could smooth any transition / migration work.
Comment 23 Andrew Johnson CLA 2022-05-19 04:07:28 EDT
See bug 571331 comment 4 for details of the Identifier change which reduces the memory usage to 16*nOld in generation, 8*nOld in use, so for 1 billion objects, 16GB during generation, falling to 8GB steady state.

DTFJIndexBuilder shows how to write out the Identifier to a temporary index after pass 1. The reader then uses the standard MAT reader, so gets SoftReference caching. We should do this for HPROF too. We need a fix for bug 579931 to make reverse() thread safe.

The object id -> array size index is actually 4*nOld in size.
Although this index is usually sparse, some dumps can have more arrays than plain objects, so a HashMap will not always save space.

> (3) object id -> position in file index [LongIndexCollector] (paged ArrayLongCompressed, compression based on max class id bits), long[] something less than 8GB

The compression here is based on the maximum position in the file, so for 1,000,000,000 objects and perhaps an 80GB file this should be less than 5GB.
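
As a back-of-envelope check of that estimate (hypothetical helper, not MAT code):

// Bits per entry when compressing by the maximum stored value, and the
// resulting index size: an 80e9 max file position -> 37 bits -> ~4.6GB
// for 1e9 entries, consistent with "less than 5GB" above.
static long compressedIndexBytes(long entries, long maxValue) {
    int bits = 64 - Long.numberOfLeadingZeros(maxValue);
    return entries * bits / 8;
}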
Comment 24 Jason Koch CLA 2022-05-19 12:37:14 EDT
In my view, putting the arrays behind an abstraction of some sort is the primary benefit. We have two challenges in supporting large heaps:
- the 2bn-object limit, which will remain a constraint for as long as we use int-based indexing only
- the runtime size of the index objects

By building a small abstraction over int[] and long[] we can then use our own long-based indexing and grow/shrink the backing store as needed.

Even if the backing store is simply backed by an on-heap array, we introduce flexibility into the design: users could make the tradeoff themselves, or we could even go so far as to offer implementations as an OSGi/Eclipse plugin. A sketch of such an abstraction follows.
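
A minimal sketch of what the abstraction might look like, with hypothetical names; the paged on-heap implementation already escapes the 2^31 single-array limit, and an mmap-backed variant would implement the same interface over file-backed buffers:

interface LongStore {
    long get(long index);
    void set(long index, long value);
    long size();
}

// Paged on-heap implementation: 2^20-entry pages keep get/set O(1) while
// allowing more than 2^31 total entries.
final class HeapLongStore implements LongStore {
    private static final int PAGE_BITS = 20;
    private static final int PAGE_SIZE = 1 << PAGE_BITS;
    private final long[][] pages;
    private final long size;

    HeapLongStore(long size) {
        this.size = size;
        int n = (int) ((size + PAGE_SIZE - 1) >>> PAGE_BITS);
        pages = new long[n][];
        for (int i = 0; i < n; i++)
            pages[i] = new long[PAGE_SIZE];
    }

    public long get(long i) { return pages[(int) (i >>> PAGE_BITS)][(int) (i & (PAGE_SIZE - 1))]; }
    public void set(long i, long v) { pages[(int) (i >>> PAGE_BITS)][(int) (i & (PAGE_SIZE - 1))] = v; }
    public long size() { return size; }
}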
Comment 25 Eclipse Webmaster CLA 2024-05-08 15:49:05 EDT
This issue has been migrated to https://github.com/eclipse-mat/org.eclipse.mat/issues/33.