Community
Participate
Working Groups
The leak suspects report takes at least 6 hours to run on a dump after which it is killed by our processing system. A thread dump at that time shows: at org/eclipse/mat/parser/internal/snapshot/MultiplePathsFromGCRootsComputerImpl.bfs(MultiplePathsFromGCRootsComputerImpl.java:227(Compiled Code)) at org/eclipse/mat/parser/internal/snapshot/MultiplePathsFromGCRootsComputerImpl.computePaths(MultiplePathsFromGCRootsComputerImpl.java:84) at org/eclipse/mat/parser/internal/snapshot/MultiplePathsFromGCRootsComputerImpl.getPathsByGCRoot(MultiplePathsFromGCRootsComputerImpl.java:109) at org/eclipse/mat/inspections/FindLeaksQuery.buildSuspectRecordGroupOfObjects(FindLeaksQuery.java:255) at org/eclipse/mat/inspections/FindLeaksQuery.buildResult(FindLeaksQuery.java:355) at org/eclipse/mat/inspections/FindLeaksQuery.execute(FindLeaksQuery.java:153) Based on the stack, it found class histogram records that exceed 70% of heap usage. This suspiciousClasses list is ultimately processed in FindLeaksQuery.buildSuspectRecordGroupOfObjects which calls getPathsByGCRoot on a random sampling of 10,000 objects from the suspect classes. Dump statistics: Size: 3.6 GB Classes: 14.2k Objects: 97.7m Class Loader: 293 It seems that the complexity of this object graph is too much for the algorithm. When I manually try to select just a single object of the suspect class and run merge shortest paths to GC roots, it is taking extremely long (still running). Unfortunately, the dump cannot be shared outside IBM. Multiple dumps from the same customer show the same behavior. Garbage collection is healthy and CPU and I/O seem to be fine. I saw this same behavior with a dump from this customer a few months ago so I doubt it is an environmental issue. I'm not sure how much room for improvement there is algorithmically if the object graph is extremely complex. Perhaps we should just have some reasonable and configurable timeout?
You could run the Find Leaks query 'leakhunter' with a smaller '-max_paths' argument - perhaps 1000 or 100 - to see if it does complete in a reasonable time.
(In reply to Andrew Johnson from comment #1) > You could run the Find Leaks query 'leakhunter' with a smaller '-max_paths' > argument - perhaps 1000 or 100 - to see if it does complete in a reasonable > time. I tried to run shortest paths to GC roots on a single suspect object and it took hours (I ended up giving up), so it seems to be the complexity of the object graph for that dump. Graphically, the user may just cancel the query; but, the use case here is automatic, headless leak suspect generation. I think a timeout would be the best option because the paths to GC roots is just a "nice-to-have" heuristic and most users would rather have any other parts of the suspect report instead of this running endlessly. However, I'm not sure how hard it would be to add a timeout option. Maybe that could be added to all queries?
I have a conjecture as to what is going on. The merge paths to GC is fairly simple. It takes the GC roots, puts them on a FIFO queue, and pulls off items and for each item gets the children, and for each child sees if it has been visited. If it hasn't then the parent of that child is set, marks the child as visited and the child is put on the queue. This is a big task, but similar to looking for unreachable objects and should be handled okay. One complication is that certain parent / child links are to be ignored, such as weak references. This is done for each parent / child by looking at the type of the parent, and if the parent is one of the special types then get all the fields of the parent and see if all the references from the parent to the child are only through excluded references. This shouldn't be too hard, and for a WeakReference there are only a few fields, so although for each child of a weak reference we get all the fields (as NamedReferences) each time that should not be an excessive amount of work. However, for dumps from J9 machines the parser sets up pseudo weak references from the java.lang.Runtime object to each finalizable object which has not been finalized, as a <Unfinalized> Thread to locals GC root type (from the java.lang.Runtime object). Expanding the java.lang.Runtime object shows pseudo field references to each unfinalized object, as the object can't be garbage collected until the runtime has completed finalization. If there are millions of finalizable objects then there could be millions of references from the java.lang.Runtime object to those objects. The breadth first search could get slow in this case, as there would be millions of outbounds from the java.lang.Runtime object, and as each outbound is inspected then all the millions of named references are generated and scanned through to see if they refer to the child.
I find that merge shortest paths can take a very long time if I have an exclude: -exclude java.lang.Object\[\]:[0] You wouldn't normally do that, but when the bfs gets to a large array it calls and searches getOutboundReferences for every outbound. I have a fix which I hope helps this problem.
New Gerrit change created: https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/197166
Gerrit change https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/197166 was merged to [master]. Commit: http://git.eclipse.org/c/mat/org.eclipse.mat.git/commit/?id=69078e03df26793ddef893ea7130e7fd8b94eb56
New Gerrit change created: https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/197168
Gerrit change https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/197168 was merged to [master]. Commit: http://git.eclipse.org/c/mat/org.eclipse.mat.git/commit/?id=a491349d3775ae56b1817a15b655a3c6d7c0ab64
(In reply to Andrew Johnson from comment #3) > If there are millions of finalizable objects I restored the original customer dump from archive and I'm not sure how to count finalizable objects although the Finalizer Overview query doesn't show anything happening related to finalizers. From my understanding of the stack product, it primarily uses PhantomReferences rather than finalizers. If you want to look at the dump, it's available for loading in MAT on Citrix in S:\sf\TS006\121\TS006121054\2022-04-26\26042022.zip_unpack\26042022\core.20220426.163223.3148.0002.zip_unpack\core.20220426.163223.3148.0002.dmp The archive only lasts for 7 days.
New Gerrit change created: https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/197188
Gerrit change https://git.eclipse.org/r/c/mat/org.eclipse.mat/+/197188 was merged to [master]. Commit: http://git.eclipse.org/c/mat/org.eclipse.mat.git/commit/?id=77ad12eda0f924d1ef85e299584eed01e3cf0a3f
Verified with the latest commits on the aforementioned dump that the leak suspects report is now very fast. Thanks!