Bug 570670 - Optimisations for GarbageCleaner
Summary: Optimisations for GarbageCleaner
Status: CLOSED MOVED
Alias: None
Product: MAT
Classification: Tools
Component: Core
Version: 1.11
Hardware: All All
Importance: P3 minor
Target Milestone: ---
Assignee: Project Inbox CLA
 
Reported: 2021-01-26 11:54 EST by Jason Koch CLA
Modified: 2024-05-08 15:46 EDT



Attachments
1_simplify_objectmarker_marking_code.patch (34.38 KB, patch) - 2021-01-26 12:22 EST, Jason Koch CLA
2_getPage_perform_read_outside_sync_block.patch (4.52 KB, patch) - 2021-01-26 12:23 EST, Jason Koch CLA
3_remove_unneeded_sync_in_ArrayIntCompressed.patch (1.02 KB, patch) - 2021-01-26 12:23 EST, Jason Koch CLA
4_threadlocal_cache_for_indexwriter_page.patch (3.88 KB, patch) - 2021-01-26 12:23 EST, Jason Koch CLA
5_use_fj_for_GarbageCleaner.patch (21.57 KB, patch) - 2021-01-26 12:23 EST, Jason Koch CLA
6_patch_feedback.patch (2.93 KB, patch) - 2021-01-26 12:23 EST, Jason Koch CLA
2b_fix_ensure_readdirect_multithreaded.patch (1.64 KB, patch) - 2021-04-01 00:20 EDT, Jason Koch CLA
4_introduce_threadlocal_cache_for_indexwriter_getpage (2.71 KB, patch) - 2021-04-01 00:23 EDT, Jason Koch CLA
7_fix_handle_garbagecleaner_overflow_on_large_dumps (1.29 KB, patch) - 2021-04-01 00:26 EDT, Jason Koch CLA
8_enqueue_only_pointers.patch (4.76 KB, patch) - 2021-05-11 09:01 EDT, Jason Koch CLA
8_enqueue_only_pointers_proper.patch (4.31 KB, patch) - 2021-06-02 18:40 EDT, Jason Koch CLA

Description Jason Koch CLA 2021-01-26 11:54:42 EST
I have a series of patches that improve performance in the GarbageCleaner codebase. After a brief discussion on the mailing list, I propose these patches.

All existing test cases pass. The result is a 4x speedup on a 16-core machine and significantly lower overall ('real') CPU usage when parsing a large (70+ GB) HPROF file with ~1 billion objects.
Comment 1 Jason Koch CLA 2021-01-26 12:22:47 EST
Created attachment 285390
1_simplify_objectmarker_marking_code.patch
Comment 2 Jason Koch CLA 2021-01-26 12:23:02 EST
Created attachment 285391
2_getPage_perform_read_outside_sync_block.patch
Comment 3 Jason Koch CLA 2021-01-26 12:23:13 EST
Created attachment 285392
3_remove_unneeded_sync_in_ArrayIntCompressed.patch
Comment 4 Jason Koch CLA 2021-01-26 12:23:27 EST
Created attachment 285393
4_threadlocal_cache_for_indexwriter_page.patch
Comment 5 Jason Koch CLA 2021-01-26 12:23:39 EST
Created attachment 285394
5_use_fj_for_GarbageCleaner.patch
Comment 6 Jason Koch CLA 2021-01-26 12:23:50 EST
Created attachment 285395
6_patch_feedback.patch
Comment 7 Jason Koch CLA 2021-01-26 15:21:24 EST
I've performed some before/after comparisons of this patchset, and on the HPROF files I have, the GC phase uses less memory than Phase 1/Phase 2. In practice this means that if there is enough heap space allocated to complete Phase 2, the parse will run through to the end; if there is not enough heap space to finish Phase 2, we never reach the point of running the GC phase at all.

In summary, we can be confident that this patchset does not degrade the heap behaviour of MAT parsing, and it also gives some clues about future directions for potential optimisation.
Comment 8 Jason Koch CLA 2021-03-25 11:03:34 EDT
Hi Andrew / Krum - were you able to check this over and give any feedback?
I have assessed the memory footprint and it sits below the pass 1/2 requirements.
Comment 9 Jason Koch CLA 2021-03-30 23:42:18 EDT
I just realised none of these patches apply cleanly; I tried to break it up into smaller patches and made a mess of it. I'll get that fixed.

I'll clean this all up and raise a separate ticket.
Comment 10 Andrew Johnson CLA 2021-03-31 05:11:17 EDT
I'm looking at the patches. I think the first one worked for me, but if you want to keep this bug then you can mark old patches as superseded and add new ones.
Comment 11 Jason Koch CLA 2021-04-01 00:20:25 EDT
Created attachment 286012
2b_fix_ensure_readdirect_multithreaded.patch

This fixes issues with patch 2.
Comment 12 Jason Koch CLA 2021-04-01 00:23:14 EDT
Created attachment 286013
4_introduce_threadlocal_cache_for_indexwriter_getpage
Comment 13 Jason Koch CLA 2021-04-01 00:26:33 EDT
Created attachment 286014
7_fix_handle_garbagecleaner_overflow_on_large_dumps
Comment 14 Jason Koch CLA 2021-04-01 00:28:56 EDT
I have amended patch 2 with a 2b, replaced patch 4, and added a patch 7. I believe these should now apply cleanly in order and pass the tests properly: 1, 2+2b, 3, 4, 5, 6, 7.
Comment 15 Andrew Johnson CLA 2021-04-26 02:50:22 EDT
I've applied the patches to my workspace and they run. It's not a large set of patches (<400 added lines), so it isn't a 'large' contribution and no CQ is needed.

The fork/join object marker is simpler than the old way, but the old way had some performance tuning, even though its threads were sometimes idle.

The old way (sketched below):
* A depth-first search per thread, with locality.
* A local stack for objects close to the current object.
* A local queue for the remaining objects.
* The global stack for excess objects, or for when the local stack and queue are empty.
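
A rough sketch of that scheme, with illustrative names (LOCAL_LIMIT, outbound, visited are stand-ins, not the actual ObjectMarker code, and real termination handling is more involved):

    import java.util.ArrayDeque;
    import java.util.Deque;

    class DfsMarker implements Runnable {
        static final int LOCAL_LIMIT = 1024;       // cap on work kept thread-local
        final int[][] outbound;                    // outbound references per object id
        final boolean[] visited;                   // shared visited flags
        final Deque<Integer> globalStack;          // shared overflow stack

        DfsMarker(int[][] outbound, boolean[] visited, Deque<Integer> globalStack) {
            this.outbound = outbound;
            this.visited = visited;
            this.globalStack = globalStack;
        }

        public void run() {
            Deque<Integer> localStack = new ArrayDeque<>(); // objects close to the current one
            Deque<Integer> localQueue = new ArrayDeque<>(); // remaining local work
            while (true) {
                Integer obj;
                if (!localStack.isEmpty()) obj = localStack.pop();           // locality first
                else if (!localQueue.isEmpty()) obj = localQueue.poll();
                else {
                    synchronized (globalStack) { obj = globalStack.poll(); } // fall back to shared work
                    if (obj == null) return; // simplification: other threads may still produce work
                }
                for (int ref : outbound[obj]) {
                    synchronized (visited) {
                        if (visited[ref]) continue;
                        visited[ref] = true;
                    }
                    if (localStack.size() < LOCAL_LIMIT) localStack.push(ref);
                    else if (localQueue.size() < LOCAL_LIMIT) localQueue.add(ref);
                    else synchronized (globalStack) { globalStack.push(ref); } // excess goes global
                }
            }
        }
    }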

So one question is how well the new marking works with limited memory (insufficient to hold most of the outbound references in memory). For example, the CI builds are memory-constrained because of the VMs used for Jenkins; fortunately the test dumps are small. The new thread-local cache could help, but was the old system of locality in marking useful?

A minor improvement for fork/join might be to split very large object arrays: for example, if there is an object array with 1,000,000,000 elements, can only one thread process it, or can it be split? One possible split is sketched below.
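
A hedged sketch of such a split (SPLIT_THRESHOLD and mark() are assumptions, not existing MAT code):

    import java.util.concurrent.RecursiveAction;

    class MarkArrayTask extends RecursiveAction {
        static final int SPLIT_THRESHOLD = 1 << 20;  // split ranges above ~1M elements
        final int[] refs;                            // outbound refs of one huge object array
        final int from, to;

        MarkArrayTask(int[] refs, int from, int to) {
            this.refs = refs; this.from = from; this.to = to;
        }

        @Override
        protected void compute() {
            if (to - from <= SPLIT_THRESHOLD) {
                for (int i = from; i < to; i++)
                    mark(refs[i]);                   // hypothetical per-object marking
            } else {
                int mid = (from + to) >>> 1;         // halve the range so two workers can proceed
                invokeAll(new MarkArrayTask(refs, from, mid),
                          new MarkArrayTask(refs, mid, to));
            }
        }

        static void mark(int objectId) { /* set visited bit / enqueue objectId */ }
    }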

I think the IndexWriter getPage SoftReference also has the possible problem that the SoftReference is cleared between two get() calls.
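
In minimal form the hazard looks like this; PageCache, pages and readPage are hypothetical stand-ins for the real IndexWriter internals:

    import java.lang.ref.SoftReference;

    class PageCache {
        SoftReference<int[]>[] pages;

        int[] getPageRacy(int pageNo) {
            if (pages[pageNo].get() != null)   // referent looks alive here...
                return pages[pageNo].get();    // ...but a GC can clear it before this second get()
            return readPage(pageNo);
        }

        int[] getPageSafe(int pageNo) {
            int[] page = pages[pageNo].get();  // read the referent exactly once into a local
            return (page != null) ? page : readPage(pageNo);
        }

        int[] readPage(int pageNo) { /* re-read the page from disk and re-cache it */ return new int[0]; }
    }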

To get this into Eclipse 2021-06 we would need to finish a few days before the 2021-06 M3+3 date of Wednesday, May 26, 2021.
Comment 16 Jason Koch CLA 2021-05-06 22:55:58 EDT
Thanks for the feedback. Quick update: I have some ideas. We can certainly defer / lazy-load the array for FjObjectMarker into the next compute phase - I have a draft of that working locally, and it definitely reduces the heap footprint during marking.

I am still thinking about the locality question, and it is a good point. There are a few architectures I have in mind that might give superb throughput or might give terrible throughput; I have yet to really test any of them.

On the one hand, locality is beneficial for cache purposes.

In practice, however, looking at a heap I have here: 2bn objects, an outbound index 20 GB in size, and current CPUs with on the order of tens of MB of L3 cache, so the chance of a given object ID already being cached is quite small; this favours simpler code paths that help the CPU. In addition, monitoring jmap -histo on a heap of this size indicates that at peak fewer than 300,000 FjObjectMarkers are enqueued (after the lazy load mentioned above), which is only a few MB of data.

But there are certainly areas where locality is likely to matter more, and I'd need to measure in more detail to be certain. I think the only way to be sure is to build a few alternative implementations and generate some more ideas about what a better solution would look like.

Ideas bubbling around that may or may not work:
- N threads; split the heap into 1/N regions, have all bits[] reads/writes go via an owner thread, and have threads forward any non-local requests they discover during outbound.get() via queues to the owning threads.
- Revisit the work-stealing algorithm; some CPUs were under-used in the past, but maybe some heuristics or assumptions can be revisited.
- Use a bitset or similar, backed by CAS (or, with split threads, no CAS required), to manage the visited/not-visited state - see the sketch after this list.
- Within the existing FJ pool, do more work between recursions, or use iteration to simplify the code paths for the CPU (I haven't yet looked at the generated assembly).
- Split the loops so that fetch/mark/fork happen as distinct _stages_ of processing.
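
A sketch of the CAS-backed visited bitset idea from the list above, illustrative only:

    import java.util.concurrent.atomic.AtomicLongArray;

    class ConcurrentBitSet {
        private final AtomicLongArray words;       // one bit per object id

        ConcurrentBitSet(int nBits) {
            words = new AtomicLongArray((nBits + 63) >>> 6);
        }

        // Returns true only for the first caller to mark objectId, so the
        // winner knows it must process the object; no lock is taken.
        boolean markVisited(int objectId) {
            int word = objectId >>> 6;
            long mask = 1L << (objectId & 63);
            while (true) {
                long old = words.get(word);
                if ((old & mask) != 0) return false;                // already visited
                if (words.compareAndSet(word, old, old | mask)) return true;
            }
        }
    }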

I also want to take some measurements to find out exactly why the code is stalling; simple perf profiles suggest the majority of the time is spent around the outbound object get, so perhaps we can optimise that by batching reads to the IO layer. Much to think about.
Comment 17 Andrew Johnson CLA 2021-05-08 04:13:37 EDT
The locality I was considering was also reader caching for the outbound references.
Looking at the code for outbound refs:

IntIndex1NReader
  header - PositionIndexReader
    pages of ArrayIntLongCompressed, held in a soft reference cache, used to index into the body
  body - IntIndexReader
    pages of ArrayIntCompressed, held in a soft reference cache

and those all read from a SimpleBufferedRandomAccessInputStream, which also has a soft reference cache of 512-byte pages.

[I suppose that now we have CompressedRandomAccessFile in hprof we could gzip-compress the index files, but it would be slower, and most of the space saving comes from compressing the hprof itself:

file name                           uncompressed    compressed    Saving        Compression
java_pid32284.0001.hprof            5407868827      1306076258    4101792569    75.85%
java_pid32284.0001.inbound.index     262560438       191519550      71040888    27.06%
java_pid32284.0001.outbound.index    262560438       126402088     136158350    51.86%
java_pid32284.0001.domOut.index      213915350       128975789      84939561    39.71%
java_pid32284.0001.o2hprof.index     181198308        67032436     114165872    63.01%
java_pid32284.0001.o2ret.index       148956613         1724083     147232530    98.84%
java_pid32284.0001.domIn.index        71806040        45299834      26506206    36.91%
java_pid32284.0001.o2c.index          68556040          570041      67985999    99.17%
java_pid32284.0001.a2s.index          52808808          628021      52180787    98.81%
Totals                              6670230862      1868228100    4802002762    71.99%

chunked compressed java_pid32284.0001.hprof.gz 1555306556

]
Comment 18 Jason Koch CLA 2021-05-11 09:01:29 EDT
Created attachment 286359
8_enqueue_only_pointers.patch
Comment 19 Jason Koch CLA 2021-05-11 09:02:29 EDT
I've added a *draft* patch that swaps the int[] enqueue for an int enqueue, which reduces the memory footprint during marking.
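
The rough shape of the change, as a sketch only (FjMarkOne, outbound and markVisited are illustrative names, not the patch itself): each queued task carries a single object id, and the outbound int[] is only fetched inside compute().

    import java.util.concurrent.RecursiveAction;

    class FjMarkOne extends RecursiveAction {
        final int objectId;                      // one int per queued task, not an int[]

        FjMarkOne(int objectId) { this.objectId = objectId; }

        @Override
        protected void compute() {
            for (int ref : outbound(objectId))   // the array is loaded lazily, here
                if (markVisited(ref))
                    new FjMarkOne(ref).fork();   // enqueue just the pointer
            // a real marker would join or await quiescence of the pool
        }

        static int[] outbound(int id) { /* read from the outbound index */ return new int[0]; }
        static boolean markVisited(int id) { /* CAS or synchronized visited set */ return false; }
    }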
Comment 20 Jason Koch CLA 2021-06-02 18:40:16 EDT
Created attachment 286512
8_enqueue_only_pointers_proper.patch

Addresses comments (via email?) regarding the enqueueing of entire arrays instead of only pointers.
Comment 21 Yi Yang CLA 2023-06-08 03:17:53 EDT
- 2_getPage_perform_read_outside_sync_block.patch
- 2b_fix_ensure_readdirect_multithreaded.patch

These two patches reduce the synchronization overhead of IntIndexReader. I think they are correct in principle, but I have not seen any significant performance improvement from them.

- 3_remove_unneeded_sync_in_ArrayIntCompressed.patch

This looks good to me.

- 4_introduce_threadlocal_cache_for_indexwriter_getpage 
- 6_patch_feedback.patch

These two patches reduce the synchronization overhead of IntIndexCollector.set but do not completely eliminate it. Given that, do you think that if a completely lock-free ArrayIntUncompressed were provided, we could eliminate the synchronization overhead of IntIndexCollector.set/get and remove the need for these patches altogether?
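
One possible shape for that suggestion, as a sketch only (this is not MAT code): back the uncompressed page with an AtomicIntegerArray so get/set need no synchronization at all.

    import java.util.concurrent.atomic.AtomicIntegerArray;

    class ArrayIntUncompressed {
        private final AtomicIntegerArray data;   // plain 32-bit slots, accessed atomically

        ArrayIntUncompressed(int size) { data = new AtomicIntegerArray(size); }

        int get(int offset)             { return data.get(offset); }
        void set(int offset, int value) { data.set(offset, value); }
    }

The trade-off is footprint: a compressed page stores fewer than 32 bits per entry, so a fully uncompressed page buys lock-freedom at the cost of memory.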
Comment 22 Jason Koch CLA 2024-01-19 21:38:29 EST
The only change still hanging around in here is the GarbageCleaner patch(es). I'll retest those and see whether it's worth another PR.
Comment 23 Eclipse Webmaster CLA 2024-05-08 15:46:21 EDT
This issue has been migrated to https://github.com/eclipse-mat/org.eclipse.mat/issues/31.