Created attachment 286016 [details]
mmap_index_writer_base.patch

Current parsing of files requires reading large arrays into heap storage, which puts a minimum bound on the footprint of Pass 1/Pass 2 parsing. This patch moves the parsing to off-heap storage, specifically to a file-backed region. This allows parsing larger files within a significantly smaller heap footprint.

Observations:

- There is a choice here between on-heap and off-heap. On-heap is convenient in that the JVM footprint is clearly bounded, but the downside, of course, is exactly that bound. Off-heap avoids it, especially when file-backed, as it allows the OS to spill pages to disk as needed and so handle vastly larger files within a given heap size.

- A nice feature of off-heap is that sparse regions/arrays do not take up any physical space. For example, if an array is allocated that is N entries long but only 1/4 of those entries are used, a file system with sparse-file support (most modern OSes) will only allocate physical resources for what the application actually uses. For anonymous regions, only the pages that are touched will be mapped. This is a distinct difference from on-heap, where the memory is allocated/paged in even if it is never used.

- There is also a choice between off-heap anonymous (using available RAM, spilled via swap) and off-heap named file (using available FS cache, or paged to a file on the file system). The current implementation picks a named file, but uses Java's default /tmp. There is no guarantee this is fast or appropriate for the user's device.

- Following comments on the list, I _assume_ that most users are on SSDs, which support fast random access. Spinning-disk performance likely suffers if any spill is required -> in that case we should switch to anonymous off-heap, which guarantees the storage stays in memory.
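To illustrate the file-backed idea (this is a hypothetical sketch, not the patch itself): a long array stored in a memory-mapped temporary file. The class and method names here are illustrative only. The OS pages the contents in and out as needed, so the Java heap holds only the small buffer object. Note that a single MappedByteBuffer is limited to 2GB, so a real implementation would map the file in multiple regions.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical file-backed long array; the backing file is sparse on most
// modern file systems, so regions never written consume no physical space.
class MappedLongArray implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer buffer;

    MappedLongArray(int length) throws IOException {
        Path file = Files.createTempFile("mat-index", ".idx");
        channel = FileChannel.open(file, StandardOpenOption.READ,
                StandardOpenOption.WRITE, StandardOpenOption.DELETE_ON_CLOSE);
        // The mapping lives outside the Java heap; the OS manages paging.
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, (long) length * Long.BYTES);
    }

    void set(int index, long value) { buffer.putLong(index * Long.BYTES, value); }

    long get(int index) { return buffer.getLong(index * Long.BYTES); }

    @Override
    public void close() throws IOException { channel.close(); }
}
```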
Comments:

- I haven't performed any speed-of-parsing performance testing. The difference between "cannot parse due to not enough RAM" and "can parse" is an easy win; it is less obvious to me whether there will be a penalty in this model for heaps that do fit in RAM. This warrants further testing if the approach looks sensible to others.

- I have specifically written this so that, I think, it should be easy to swap in different implementations (off-heap file, off-heap anonymous, on-heap, etc.) while sticking to the interface. We could possibly provide this as an option to the user, or even switch implementations dynamically.
Interesting.

Do we have some idea of how the phases compare in memory usage:

- Pass 1
- Pass 2
- Garbage collection
- Dominator tree - this can be quite memory intensive, but operates after unreachable objects have been discarded
> Interesting.
> Do we have some idea of how the phases compare in memory usage:

In my experience this depends on the 'shape' of the heap under analysis. I'll add example calculations for a 1 billion object count - very rough analysis. The primary determinant of memory usage is the object _count_, and to a smaller extent the tree structure from the GC roots. Much less important is the size of the hprof file.

> Pass 1

For 1 billion objects, 20GB during generation, falling to 8GB steady state. The bulk of the memory here is the 'Identifier' store, which is a long[] at least as long as the count of objects. While generating, up to newCapacity is allocated as copy/scratch space (1.5x), so we have to assume 2.5x * 8B * N objects.

> Pass 2

Multiple indexes are generated here, assuming 1bn objects. Likely to use 16-28GB, plus scratch space; the majority is retained.

(1) object id -> outbound id references log [IntArray1NWriter, flushed to PosIndexStreamer/IntIndexStreamer] (depends on shape of ref tree): int[] 4GB for header, byte[] 1GB for header2, plus IntIndexStreamer underneath, which uses int[page size] during generation, plus background compression. +12GB.

(2) object id -> class id array [IntIndexCollector] (paged ArrayIntCompressed, compression based on max class id bits): int[], something less than 4GB.

(3) object id -> position in file index [LongIndexCollector] (paged ArrayLongCompressed, compression based on max class id bits): long[], something less than 8GB.

(4) object id -> array size index (for arrays only) [SizeIndexCollectorUncompressed] (int -> int array): int[], much less than 4GB.

> Garbage collection

Approx 25GB here during processing. Most memory is dropped after completion. Over and above whatever is stored previously, for 1 billion objects:

boolean[] reachable (1B * object count) = 1GB, dropped after completion.

IntArray1NSortedWriter for completed outbound storage: int[] 4GB for header, byte[] 1GB for header2. +12GB; dropped after completion.
InboundWriter for collating the inbound references: int[] 4GB for header, byte[] 1GB for header2, long[] segmentSizes (<1GB); writes occur directly to file and are then reprocessed. +12GB; dropped after completion.

> Dominator tree - this can be quite memory intensive, but operates
> after unreachable objects have been discarded

Assuming a full 1bn heap, approx 28GB during processing, dropped after completion; but up to 16GB of index is unload()ed prior to starting, which mitigates this quite a bit. a2size, o2address, and o2class are unload()ed here, so the effective requirement is somewhere less than 28GB; we can assume somewhere around 8-16GB is saved before starting, depending on xxCompressed page effectiveness.

It then allocates 7x int[] for the number of objects in the heap. For a 1bn object file, this is 7 x 4B x 1bn = 28GB. As you point out, if GarbageCleaner has reduced the footprint, the DominatorTree requirements scale down respectively.

======

So, Identifier and Size certainly have an immediate impact. If we think this strategy is sensible, we can extend the memory-mapped strategy to _all_ index types and save significant amounts of memory during processing. I picked the two that were least coupled in the code, and that would have an impact, to demonstrate the basic idea. Lots of the existing code relies on the class hierarchy and cross-references between reader and writer, so I wanted some validation of the idea before pursuing further.

There are additional benefits too if we take this further. For example, rather than reading a whole chunk of the file into a SoftReference<ArrayLongCompressed> and unpacking it, ArrayLongCompressed could be a wrapper/layer over the underlying store, and the OS then manages caching. This sort of thing can, I think, simplify the code quite a bit, as we would no longer need to manage concurrent caching to achieve performance.
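The Pass 1 numbers above can be checked with a quick back-of-the-envelope calculation: an 8-byte identifier per object at steady state, plus up to 1.5x scratch while the backing array grows.

```java
// Rough check of the Pass 1 estimate: 8B * N steady state, 2.5x * 8B * N peak.
public class Pass1Estimate {
    static long steadyBytes(long n) { return 8 * n; }                      // long[] identifier store
    static long peakBytes(long n) { return steadyBytes(n) + (3 * 8 * n) / 2; } // + 1.5x copy/scratch

    public static void main(String[] args) {
        long n = 1_000_000_000L;
        System.out.println(steadyBytes(n) / 1_000_000_000L + " GB steady, "
                + peakBytes(n) / 1_000_000_000L + " GB peak"); // 8 GB steady, 20 GB peak
    }
}
```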
To show the impact I ran a quick test against a modest sized heap. It is a ~25% reduction in the memory required to process the heap.

Sample heap:
- 20,800 MByte heap (20GB)
- 153 million objects
- 65 million removed during GarbageCleaner

Xmx/Xms settings required to process the heap:

- upstream/master: 5400m OK; 5200m fails at GarbageCleaner createHistogramOfUnreachableObjects() with a hashmap resize. At this point, Identifier takes 1.25GB (50% of heap), the histogram ArrayList 520MB (20%), and IntIndexCollector 510MB (19%). This fails before getting to DominatorTree processing.

- origin/netflix: 4000m OK; 3800m fails at DominatorTree.<init>. Inspecting the dump shows 2GB of DominatorTree int[] arrays and SnapshotFactoryImpl.parse():purgedMapping taking 600MB [nb: I'll put up a new patch for this]. These two components take up 98.1% of the active heap.

From this it is possible to see the impact of the Identifier & Size indexes moved to mmap/off-heap. Once the other generated indexes have been unload()ed, the next biggest use of memory is DominatorTree.
A simple change for IndexWriter.Identifier might be to use the ArrayLongBig inside it to collect identifiers with add().

API-compatible changes:
- On sort() it might have to append the entries to any existing ones, then reset the ArrayLongBig.
- get() would need to go to the big array for the first entries (e.g. after a sort), then to the current ArrayLongBig.

That might limit the size needed to 2n rather than 2.5n. [Though if m entries were added after sorting, then it would need: n+m, +m for the toArray, clear the ArrayLongBig for -m, then +(n+m) for the new array, so that is still 2(n+m).]

That is still in-memory though.
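A minimal sketch of this 2n scheme, under stated assumptions: PendingLongs stands in for MAT's ArrayLongBig (a growable accumulator), and all class and method names here are illustrative rather than MAT's actual API.

```java
import java.util.Arrays;

// Growable accumulator standing in for ArrayLongBig (names are hypothetical).
class PendingLongs {
    private long[] buf = new long[16];
    private int size;

    void add(long v) {
        if (size == buf.length) buf = Arrays.copyOf(buf, buf.length * 2);
        buf[size++] = v;
    }
    long get(int i) { return buf[i]; }
    long[] toArray() { return Arrays.copyOf(buf, size); }
}

class Identifier {
    private long[] sorted = new long[0];               // entries fixed by the last sort()
    private PendingLongs pending = new PendingLongs(); // entries added since

    void add(long id) { pending.add(id); }

    // First entries come from the sorted array, later ones from the accumulator.
    long get(int index) {
        return index < sorted.length ? sorted[index]
                                     : pending.get(index - sorted.length);
    }

    void sort() {
        long[] added = pending.toArray();       // +m
        pending = new PendingLongs();           // -m (cleared)
        long[] merged = Arrays.copyOf(sorted, sorted.length + added.length); // +(n+m)
        System.arraycopy(added, 0, merged, sorted.length, added.length);
        Arrays.sort(merged);
        sorted = merged;                        // peak ~ 2(n+m), as in the note above
    }
}
```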
Yes, I looked initially at something like this, and it is viable. Specifically, you want to be able to release the heap allocation after the Identifier storage has been loaded; the biggest win would come from an API change or a sneaky trick. My thinking was that, after a sort(), you could trigger a flush to disk and then reload the data from the mmap'd region, thus releasing it from the heap. Or do it explicitly, by exposing a flush in the API that converts it to a xxReader.

However, as I built this out, it looked like going to a direct memory-mapped model made for much cleaner code, and it allows a future path to simplify quite a bit of the storage and memory tier.

I have not yet pursued it, but I could run an experiment swapping the DominatorTree to off-heap (should be simple to swap to an IntArray instead of int[], etc.) and measure how that impacts overall parsing memory requirements.
For GarbageCleaner we have

IOne2LongIndex identifiers = idx.identifiers;

an index holding the old object addresses. This is passed in from the parser, and for HPROF and DTFJ it is an in-heap IndexWriter.Identifier object. I've recently changed IndexWriter.Identifier to accumulate in an ArrayLongBig, which is then converted by sort() to a long[] of the exact size needed. It is kept around for the unreachable objects phase; then there is the map[] phase, where the old-to-new mapping is built together with a new array of object addresses indexed by the new object ID.

long identifiers[oldN]
boolean reachable[oldN]

then there is this part:

int[] map = new int[oldNoOfObjects];
long[] id2a = new long[newNoOfObjects];

So you are right: the old identifiers with 8*oldN bytes and the new with 8*newN bytes will use a lot of space.

We can off-load the identifier addresses with quite a small code change. We can write the identifiers to a temporary index file using LongIndexStreamer.writeTo at the beginning of the clean. This takes a long[], ArrayLong or IteratorLong. We could extract a long[] from IOne2LongIndex.getNext, but that would take a lot of space. We can instead create an IteratorLong to read the supplied IOne2LongIndex, write it out with LongIndexStreamer.writeTo, and get a new index. We can then close and delete the supplied index to free the long array. We could even close the new index to free a little more space.

We then do the GC and build the map[]. We then reopen the index file for the identifier addresses, and have another IteratorLong stepping through the map[], skipping unused entries, retrieving each object address from the newly created temporary index, and writing them out to the final index.

This should save quite a bit of space at the expense of another file write and read, but with the memory saved the SoftReference caches should work better too.
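A minimal stand-in for the spill-and-compact scheme described above: write the old object addresses to a temporary file, free the in-heap array, then stream only the surviving entries (per map[]) into the final address array. MAT's real code would use LongIndexStreamer/IteratorLong; this sketch uses plain java.io, and all names are illustrative.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SpillIdentifiers {
    // map[oldId] gives the new object id, or -1 for unreachable objects.
    static long[] compact(long[] identifiers, int[] map, int newN) throws IOException {
        Path tmp = Files.createTempFile("identifiers", ".idx");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(tmp)))) {
            for (long addr : identifiers) out.writeLong(addr); // spill old addresses
        }
        identifiers = null; // the 8*oldN byte array can now be collected

        long[] id2a = new long[newN]; // new object id -> address
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(tmp)))) {
            for (int newId : map) {                 // step through map[] in old-id order
                long addr = in.readLong();
                if (newId >= 0) id2a[newId] = addr; // keep reachable entries only
            }
        }
        Files.delete(tmp);
        return id2a;
    }
}
```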
Similarly, we could store the old object to class index to disk before the garbage clean.

Example OutOfMemoryError when parsing a dump with 63 million live objects and presumably 63,169,600 total objects. Failure allocating map[] at org.eclipse.mat.parser.internal.GarbageCleaner.clean(Lorg/eclipse/mat/parser/internal/PreliminaryIndexImpl;Lorg/eclipse/mat/parser/internal/SnapshotImplBuilder;Ljava/util/Map;Lorg/eclipse/mat/util/IProgressListener;)[I (GarbageCleaner.java:149)

Object / Stack Frame                                                          |Name| Shallow Heap | Retained Heap
------------------------------------------------------------------------------------------------------------------
<local> org.eclipse.mat.parser.internal.PreliminaryIndexImpl @ 0x826fc308     |    |           48 |       545,256
<local> org.eclipse.mat.parser.internal.SnapshotImplBuilder @ 0x82abf540      |    |           40 |            40
<local> java.util.HashMap @ 0x826fc370                                        |    |           48 |            48
<local> org.eclipse.mat.ui.util.ProgressMonitorWrapper @ 0x826fc338           |    |           32 |            32
<local> org.eclipse.mat.parser.index.IndexManager @ 0x82abf568                |    |           48 |            48
<local> boolean[63169578] @ 0xaa000000                                        |    |   63,169,600 |    63,169,600
<local> int[3703] @ 0x82abf780                                                |    |       14,832 |        14,832
<local> org.eclipse.mat.parser.index.IndexWriter$Identifier @ 0x82cce638      |    |           24 |   505,356,664
<local> org.eclipse.mat.parser.index.IndexReader$IntIndex1NReader @ 0x82d7e2d0|    |           32 |        24,112
<local> org.eclipse.mat.parser.index.IndexWriter$IntIndexCollector @ 0x812f7a60|   |           48 |   205,306,888
<local> org.eclipse.mat.collect.HashMapIntObject @ 0x82d1a800                 |    |           40 |       828,072
<local> java.util.HashMap @ 0x82f02800                                        |    |           48 |     4,231,568
------------------------------------------------------------------------------------------------------------------

The IntIndexCollector for object to class takes approximately log2(oldN)*oldN / 8 bytes as an in-memory version.
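The estimate above can be checked against the dump: roughly log2(oldN) bits per entry, i.e. log2(oldN)*oldN/8 bytes for the compressed object -> classId index.

```java
// Checking the IntIndexCollector size estimate for the failing dump above.
public class CollectorSize {
    static int bitsPerEntry(long n) {
        return 64 - Long.numberOfLeadingZeros(n - 1); // bits to hold an id in [0, n)
    }
    static long bytes(long n) { return bitsPerEntry(n) * n / 8; }

    public static void main(String[] args) {
        long n = 63_169_600L; // object count from the dump
        // 26 bits/entry, 205,301,200 bytes - close to the 205,306,888 shallow
        // heap reported for the IntIndexCollector in the table above.
        System.out.println(bitsPerEntry(n) + " bits/entry, " + bytes(n) + " bytes");
    }
}
```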
Created attachment 286362 [details] Free some space during garbage collection. The parsers can supply the object to address and object to class indexes as in-memory versions. This frees the memory by converting them to disk-backed versions.
Created attachment 286363 [details] dominator_tree.patch Sample patch to support using disk-backed storage during DominatorTree construction.
Yes, there are a few ways to reduce the memory footprint going forward. For example, the attached dominator tree patch removes 5x int[] from the DominatorTree construction - for a 2bn object heap, this works out to a saving of 40GB of required heap space.
I've added a simple patch to free some space during garbage collection. Please give it a try on a large dump to see if it helps. I think it could reduce usage at that point by

8*nOld (for the identifier index)
+ 8*nOld (for the long[] array)
+ log2(nOld)*nOld/8 (for the object to classId compressed index)

so up to 19.75GByte for 1,000,000,000 objects, though if those indexes were used by the GC then some memory would be held in the soft reference caches.
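Verifying the figure quoted above for nOld = 1,000,000,000 objects:

```java
// 8*nOld + 8*nOld + log2(nOld)*nOld/8 bytes, expressed in GB.
public class GcSavings {
    static double totalGB(long n) {
        int bits = 64 - Long.numberOfLeadingZeros(n - 1); // ~30 bits per object id
        double identifierIndex = 8.0 * n;                 // identifier index
        double addresses = 8.0 * n;                       // long[] address array
        double o2class = bits * n / 8.0;                  // compressed object -> classId
        return (identifierIndex + addresses + o2class) / 1e9;
    }

    public static void main(String[] args) {
        System.out.println(totalGB(1_000_000_000L) + " GB"); // 19.75 GB
    }
}
```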
New Gerrit change created: https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/180543
Gerrit change https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/180543 was merged to [master]. Commit: http://git.eclipse.org/c/mat/org.eclipse.mat.git/commit/?id=039199c8beb7ad28761ad988399a4a97fb995966
New Gerrit change created: https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/180553
Gerrit change https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/180553 was merged to [master]. Commit: http://git.eclipse.org/c/mat/org.eclipse.mat.git/commit/?id=542be78a8001cc5eb106bc2e26bd42a950843695
I've merged some changes freeing up space for the object id -> address and object id -> class id indexes during GC. It turned out that the DTFJ parser was already converting these indexes to disk versions if they were big enough and memory was short, so I made HPROF always do it, and added a conversion process for an in-memory index Identifier or IntIndexCollector. I didn't bother unloading the disk indexes: if there is enough memory it is good to use any entries already cached, and the soft references allow them to be freed if memory is short.
Comment on attachment 286363 [details] dominator_tree.patch Please do not merge this patch.
Please don't take this as a vote against the idea, but I'm going to mark attachment 286363 [details] as obsolete, as it looks like it includes code which is not original and so can't be submitted under the Eclipse Contributor Agreement https://www.eclipse.org/legal/ECA.php as EPL code to the project. Marking it obsolete is a reminder not to merge it.
+1 agree with your position to obsolete the dom-tree patch
Created attachment 286508 [details] 2_memory_mapped_storage_into_tmpdir.patch Fix to place files into java tmp location instead of current working directory.
Created attachment 286509 [details] 3_explicit_check_for_negative_index Explicitly check for negative indexes in case of greater than 2bn indexes (overflow)
Created attachment 286510 [details] 4_introduce_basic_heap_backed_arrays.patch Also, this basic heap-backed implementation could smooth any transition / migration work.
See bug 571331 comment 4 for details of the Identifier change, which reduces the memory usage to 16*nOld during generation and 8*nOld in use - so for 1 billion objects, 16GB during generation, falling to 8GB steady state.

DTFJIndexBuilder shows how to write out the Identifier to a temporary index after pass 1. The reader then uses the standard MAT reader, so it gets SoftReference caching. We should do this for HPROF. We need a fix for bug 579931 to make reverse() thread safe.

The object id -> array size index is actually 4*nOld in size. Although this index is usually sparse, some dumps can have more arrays than plain objects, so a HashMap will not always save space.

> (3) object id -> position in file index [LongIndexCollector] (paged ArrayLongCompressed, compression based on max class id bits), long[] something less than 8GB

The compression here is based on the maximum position in the file, so for 1,000,000,000 objects and perhaps an 80GB file this should be less than 5GB.
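Checking the "less than 5GB" figure above: the object -> file position index compresses each entry to the number of bits needed for the maximum file offset.

```java
// For a 1 billion object heap in an ~80GB hprof file.
public class PositionIndexSize {
    static int bitsFor(long maxValue) {
        return 64 - Long.numberOfLeadingZeros(maxValue); // bits to hold maxValue
    }
    static double indexGB(long nObjects, long maxFilePos) {
        return bitsFor(maxFilePos) * nObjects / 8.0 / 1e9;
    }

    public static void main(String[] args) {
        // 37 bits/entry -> 4.625 GB, consistent with "less than 5GB"
        System.out.println(bitsFor(80_000_000_000L) + " bits/entry, "
                + indexGB(1_000_000_000L, 80_000_000_000L) + " GB");
    }
}
```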
In my view, putting the arrays behind an abstraction of some sort is the primary benefit. We have two challenges in supporting large heaps:

- the 2bn object limit, which will forever be a constraint if we only use int-based indexing
- runtime size for objects

By building a small abstraction over int[] and long[] we can use our own long-based indexing and grow/shrink the backing store as needed. Even if the backing store is "backed by an array", we introduce flexibility into the design, such that users can make the tradeoff themselves; we could even go so far as to offer it as an OSGi/Eclipse plugin, etc.
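A hypothetical sketch of such an abstraction: long-based indexing over an int store, with the backing (heap arrays, mmap, anonymous off-heap) swappable behind the interface. The names here are illustrative only; the simplest backing is shown, using chunked heap arrays so the total size can exceed 2^31 entries.

```java
// Illustrative long-indexed int store; other implementations could be
// memory-mapped or anonymous off-heap without changing callers.
interface IntStore {
    int get(long index);
    void set(long index, int value);
    long size();
}

// Heap-array backing, chunked to allow more than Integer.MAX_VALUE entries.
class HeapIntStore implements IntStore {
    private static final int CHUNK = 1 << 20; // 1M entries per chunk
    private final int[][] chunks;
    private final long size;

    HeapIntStore(long size) {
        this.size = size;
        chunks = new int[(int) ((size + CHUNK - 1) / CHUNK)][];
        for (int i = 0; i < chunks.length; i++) {
            long remaining = size - (long) i * CHUNK;
            chunks[i] = new int[(int) Math.min(CHUNK, remaining)];
        }
    }

    public int get(long i) { return chunks[(int) (i / CHUNK)][(int) (i % CHUNK)]; }

    public void set(long i, int v) { chunks[(int) (i / CHUNK)][(int) (i % CHUNK)] = v; }

    public long size() { return size; }
}
```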
This issue has been migrated to https://github.com/eclipse-mat/org.eclipse.mat/issues/33.