Bug 580821 - Leak suspects report takes extremely long in getPathsByGCRoot
Summary: Leak suspects report takes extremely long in getPathsByGCRoot
Status: RESOLVED FIXED
Alias: None
Product: MAT
Classification: Tools
Component: Core (show other bugs)
Version: 1.13   Edit
Hardware: All All
: P3 normal (vote)
Target Milestone: 1.14.0   Edit
Assignee: Andrew Johnson CLA
QA Contact:
URL:
Whiteboard:
Keywords: performance
Depends on:
Blocks:
 
Reported: 2022-09-27 15:30 EDT by Kevin Grigorenko CLA
Modified: 2023-02-03 12:27 EST (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Kevin Grigorenko CLA 2022-09-27 15:30:20 EDT
The leak suspects report takes at least 6 hours to run on a dump after which it is killed by our processing system. A thread dump at that time shows:

  at org/eclipse/mat/parser/internal/snapshot/MultiplePathsFromGCRootsComputerImpl.bfs(MultiplePathsFromGCRootsComputerImpl.java:227(Compiled Code))
  at org/eclipse/mat/parser/internal/snapshot/MultiplePathsFromGCRootsComputerImpl.computePaths(MultiplePathsFromGCRootsComputerImpl.java:84)
  at org/eclipse/mat/parser/internal/snapshot/MultiplePathsFromGCRootsComputerImpl.getPathsByGCRoot(MultiplePathsFromGCRootsComputerImpl.java:109)
  at org/eclipse/mat/inspections/FindLeaksQuery.buildSuspectRecordGroupOfObjects(FindLeaksQuery.java:255)
  at org/eclipse/mat/inspections/FindLeaksQuery.buildResult(FindLeaksQuery.java:355)
  at org/eclipse/mat/inspections/FindLeaksQuery.execute(FindLeaksQuery.java:153)

Based on the stack, it found class histogram records that exceed 70% of heap usage. This suspiciousClasses list is ultimately processed in FindLeaksQuery.buildSuspectRecordGroupOfObjects which calls getPathsByGCRoot on a random sampling of 10,000 objects from the suspect classes.

Dump statistics:

Size: 3.6 GB
Classes: 14.2k
Objects: 97.7m
Class Loader: 293

It seems that the complexity of this object graph is too much for the algorithm. When I manually try to select just a single object of the suspect class and run merge shortest paths to GC roots, it is taking extremely long (still running). Unfortunately, the dump cannot be shared outside IBM. Multiple dumps from the same customer show the same behavior. Garbage collection is healthy and CPU and I/O seem to be fine. I saw this same behavior with a dump from this customer a few months ago so I doubt it is an environmental issue.

I'm not sure how much room for improvement there is algorithmically if the object graph is extremely complex. Perhaps we should just have some reasonable and configurable timeout?
Comment 1 Andrew Johnson CLA 2022-11-04 10:58:59 EDT
You could run the Find Leaks query 'leakhunter' with a smaller '-max_paths' argument - perhaps 1000 or 100 - to see if it does complete in a reasonable time.
Comment 2 Kevin Grigorenko CLA 2022-11-09 13:42:20 EST
(In reply to Andrew Johnson from comment #1)
> You could run the Find Leaks query 'leakhunter' with a smaller '-max_paths'
> argument - perhaps 1000 or 100 - to see if it does complete in a reasonable
> time.

I tried to run shortest paths to GC roots on a single suspect object and it took hours (I ended up giving up), so it seems to be the complexity of the object graph for that dump. Graphically, the user may just cancel the query; but, the use case here is automatic, headless leak suspect generation. I think a timeout would be the best option because the paths to GC roots is just a "nice-to-have" heuristic and most users would rather have any other parts of the suspect report instead of this running endlessly. However, I'm not sure how hard it would be to add a timeout option. Maybe that could be added to all queries?
Comment 3 Andrew Johnson CLA 2022-11-20 13:18:03 EST
I have a conjecture as to what is going on.

The merge paths to GC is fairly simple.
It takes the GC roots, puts them on a FIFO queue, and pulls off items and for each item gets the children, and for each child sees if it has been visited. If it hasn't then the parent of that child is set, marks the child as visited and the child is put on the queue. This is a big task, but similar to looking for unreachable objects and should be handled okay. 

One complication is that certain parent / child links are to be ignored, such as weak references. This is done for each parent / child by looking at the type of the parent, and if the parent is one of the special types then get all the fields of the parent and see if all the references from the parent to the child are only through excluded references. This shouldn't be too hard, and for a WeakReference there are only a few fields, so although for each child of a weak reference we get all the fields (as NamedReferences) each time that should not be an excessive amount of work.

However, for dumps from J9 machines the parser sets up pseudo weak references from the java.lang.Runtime object to each finalizable object which has not been finalized, as a <Unfinalized> Thread to locals GC root type (from the java.lang.Runtime object). Expanding the java.lang.Runtime object shows pseudo field references to each unfinalized object, as the object can't be garbage collected until the runtime has completed finalization.

If there are millions of finalizable objects then there could be millions of references from the java.lang.Runtime object to those objects. The breadth first search could get slow in this case, as there would be millions of outbounds from the java.lang.Runtime object, and as each outbound is inspected then all the millions of named references are generated and scanned through to see if they refer to the child.
Comment 4 Andrew Johnson CLA 2022-11-21 06:23:59 EST
I find that merge shortest paths can take a very long time if I have an exclude:

-exclude java.lang.Object\[\]:[0]

You wouldn't normally do that, but when the bfs gets to a large array it calls and searches getOutboundReferences for every outbound.

I have a fix which I hope helps this problem.
Comment 5 Eclipse Genie CLA 2022-11-21 06:25:19 EST
New Gerrit change created: https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/197166
Comment 7 Eclipse Genie CLA 2022-11-21 07:33:42 EST
New Gerrit change created: https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/197168
Comment 9 Kevin Grigorenko CLA 2022-11-21 14:48:49 EST
(In reply to Andrew Johnson from comment #3)
> If there are millions of finalizable objects

I restored the original customer dump from archive and I'm not sure how to count finalizable objects although the Finalizer Overview query doesn't show anything happening related to finalizers. From my understanding of the stack product, it primarily uses PhantomReferences rather than finalizers.

If you want to look at the dump, it's available for loading in MAT on Citrix in S:\sf\TS006\121\TS006121054\2022-04-26\26042022.zip_unpack\26042022\core.20220426.163223.3148.0002.zip_unpack\core.20220426.163223.3148.0002.dmp

The archive only lasts for 7 days.
Comment 10 Eclipse Genie CLA 2022-11-21 15:34:41 EST
New Gerrit change created: https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/197188
Comment 12 Kevin Grigorenko CLA 2022-11-22 11:31:55 EST
Verified with the latest commits on the aforementioned dump that the leak suspects report is now very fast. Thanks!