Problem: The current implementation of CFlowStack (versions 1.1 through 1.2 rc1) stores all threads in a hashtable. These threads are only removed after several calls to the method getThreadStack() (in our case, more than 70 calls). For our project, this cycle is much too long. We have several long-running threads, each of which grabs quite a lot of memory. This memory can only be freed after the threads have been removed from the CFlowStack. In our production code, more than 300 threads accumulate across different CFlowStack variables, which often results in OutOfMemoryError. The attached patch provides an improved freeing scheme: it checks whether the size of the table has grown, and if so, checks whether some threads can be removed.
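To illustrate the idea behind the patch, here is a minimal, hypothetical sketch (not the actual AspectJ source): remember the table size after the last clean-up, and only scan for dead threads when the table is about to grow past that size. Class, field, and method names are invented for illustration, and a plain Object stands in for the real per-thread stack.

```java
import java.util.Enumeration;
import java.util.Hashtable;

class GrowTriggeredCleanup {
    private final Hashtable<Thread, Object> stacks = new Hashtable<Thread, Object>();
    private int oldsize = 0; // table size recorded after the last clean-up

    public Object getThreadStack(Thread t) {
        Object s = stacks.get(t);
        if (s == null) {
            // The table is about to grow past the size it had after the last
            // clean-up, so scan for dead threads before adding the new entry.
            if (stacks.size() + 1 > oldsize) {
                removeDeadThreads();
            }
            s = new Object(); // stands in for a per-thread cflow stack
            stacks.put(t, s);
            oldsize = stacks.size();
        }
        return s;
    }

    private void removeDeadThreads() {
        Enumeration<Thread> e = stacks.keys();
        while (e.hasMoreElements()) {
            Thread t = e.nextElement();
            if (!t.isAlive()) {
                stacks.remove(t); // frees whatever the dead thread's stack was pinning
            }
        }
    }

    public int size() {
        return stacks.size();
    }
}
```

Under this scheme a steady-state system (no new threads) never pays for a liveness scan, while a system that keeps creating threads cleans up on each growth step.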
Created attachment 9938 [details] Patch which solves the problem
Created attachment 9939 [details] Sample. A short sample which demonstrates the problem. If run with the patch, the number of threads never exceeded 5 (in most runs it didn't even reach that). If run without the patch, the number of threads easily exceeds 100.
I've verified that the test program behaves as described - I see the threads getting up to ~140 before being cleaned up; with the patch it never goes above 4 for me. It gets to ~140 because when it reaches that level, the line

  if (change_count > Math.max(MIN_COLLECT_AT, COLLECT_AT/size)) {

returns true, which causes a tidy-up (since COLLECT_AT is 20000).

It seems the static fields COLLECT_AT and MIN_COLLECT_AT are for configuring the frequency of tidying up (although they can't currently be 'set' by a user). The patch adds another variable, oldsize, and the behavior then becomes: tidy up as soon as the stack grows by even one element. If I trace the tidying-up code, it seems to run almost constantly against the test program. I think I could achieve the same result by setting both of the COLLECT_AT fields to 1.

That doesn't seem to be quite what we want. It is a bit awkward because different users/applications are likely to have different requirements on the space/performance trade-off.
I knew I'd realise something as soon as I appended to the bug ;) You can't quite replicate what the patch does by setting COLLECT_AT and MIN_COLLECT_AT to 1 - if they were set to 1, we would frequently be checking whether all the threads we currently know about are alive. With the patch we only go through all the threads checking for liveness if a new one is added to the system, or if the current test triggers it:

  (change_count > Math.max(MIN_COLLECT_AT, COLLECT_AT/size))

This means that if the system reaches a steady state (no new threads are being started) then we behave as we do today, but if the system is constantly seeing new threads start then it trims the hashtable more frequently.

So I think what I'm saying is that the proposal in the patch is a good one. The reason it looks a bit extreme on first inspection is that the sample program is a little contrived: it just fires off threads, never reaching a steady state. Anyone want to comment?
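A small demonstration of why the existing trigger fires at ~140 threads: in a table that only ever grows, change_count and size march up together, so the condition first holds at roughly sqrt(20000) ≈ 141. COLLECT_AT = 20000 is from the comments above; MIN_COLLECT_AT = 2 is an illustrative guess, and the simulation below is mine, not AspectJ code.

```java
class TriggerDemo {
    static final int COLLECT_AT = 20000;
    static final int MIN_COLLECT_AT = 2; // illustrative value, not from the source

    // Simulate a stack that only ever grows: each new thread bumps both the
    // table size and change_count, until the tidy-up condition first fires.
    static int firstTidySize() {
        int changeCount = 0;
        int size = 0;
        while (true) {
            size++;        // a new thread is added to the hashtable...
            changeCount++; // ...and counted as a change
            if (changeCount > Math.max(MIN_COLLECT_AT, COLLECT_AT / size)) {
                return size; // first tidy-up happens here
            }
        }
    }
}
```

With integer division, size = 141 gives 20000/141 = 141 (not exceeded), and size = 142 gives 20000/142 = 140 (exceeded), so the first tidy-up lands at 142 - matching the ~140 threads observed in the test program.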
The current implementation of CFlowStack was built by a PARC summer intern as a one-week project many, many years ago (~version 0.7). I don't believe the implementation has been carefully evaluated or changed since then. Last week Matthew asked me about using ThreadLocals here for performance reasons, and I replied:

  I think that it would be good to switch the cflow per-thread implementation in rt.jar to use ThreadLocal when run on a VM that supports this. This might have some performance benefit, but I'm doubtful it would be very large right away. The main benefit I see is that the current version has some memory leak issues with holding on to thread objects after they've terminated. I've never heard a report of this being a real problem in the field, but it is a nagging issue.

Well, now I've heard about this being a real problem in the field ;-) The only reason not to just use a ThreadLocal is that it is a 1.2 feature; currently rt.jar only uses 1.1 features. I see several reasonable technical solutions, all of which have different trade-offs.

1. Make Arno's proposed change. This is the simplest change to the code and clearly fixes Arno's problem. However, you'd need to spend some time understanding the performance issues where there are a large number of "stable" threads and a small number being consistently created and destroyed. For example, if you have 300 threads in the system (and the hashtable), then every time the table grows all 300 threads will be checked for liveness. This could be expensive for some use cases.

2. Switch the design of CFlowStack to use a ThreadLocal instead of the hashtable. This is obviously the most efficient and precise way of mapping a per-thread field, and would also be a fairly simple implementation. The major downside of this approach is that AspectJ programs that use cflow would then only run under 1.2 or later VMs.

3. Build a dual implementation of CFlowStack that uses ThreadLocal if it is available and falls back to a simple hashtable when running under 1.1. This is a slightly more challenging implementation, because the 1.2 code must be kept well separate from the 1.1 code. However, it would give all of the benefits of #2 while still maintaining compatibility with 1.1 VMs.

There are additional compromise solutions, such as building two different rt.jars, one of which will run on 1.1; or making a CFlowStack11 class and requiring a compile-time flag to select its use.

I'd recommend implementing solution #3 and spinning one more release candidate with this change and a couple of others. At this stage, solution #3 should ideally be implemented by a pair, or should have a thorough code review process with people looking at the solution; this change clearly has the potential to cause stability problems. However, Arno's bug looks serious enough to me that it should be solved for 1.2 (and I've raised the severity to note that). This kind of bug can be very frustrating to developers because it only manifests itself as an OutOfMemoryError. All of the possible solutions have the potential to break existing AspectJ programs if implemented incorrectly; however, #3 is the only solution that I can say with confidence won't break anything, so long as it's implemented correctly.
Created attachment 10255 [details] ThreadLocal fix for this bug. Arno, please can you try this version of aspectjrt.jar? It is an implementation of Jim's idea to use a ThreadLocal implementation for the stack when on a JVM that supports it. I've done some testing and it seems OK, and it passes all the tests in the harness - but I'm not sure how much they validate the cflow behavior I have changed. In case you aren't sure whether it is doing anything at all, you can invoke your program with the system property "debugcflow=yes" and it will tell you through System.err whether it is trying to use the ThreadLocal implementation. Are you at all able to run your program on a 1.1 JVM to see if it works 'as before'?
Fix checked in. If there is one piece of the implementation that needs a review, it is this one method in CFlowStack:

  private static void selectFactoryForVMVersion() {
      String override = System.getProperty("aspectj.runtime.cflowstack.usethreadlocal", "unspecified");
      boolean useThreadLocalImplementation = false;
      if (override.equals("unspecified")) {
          String v = System.getProperty("java.class.version", "0.0");
          // Java 1.2 is version 46.0 and above
          useThreadLocalImplementation = (v.compareTo("46.0") >= 0);
      } else {
          useThreadLocalImplementation = override.equals("yes") || override.equals("true");
      }
      // System.err.println("Trying to use thread local implementation? " + useThreadLocalImplementation);
      if (useThreadLocalImplementation) {
          tsFactory = getThreadLocalStackFactory();
      } else {
          tsFactory = getThreadLocalStackFactoryFor11();
      }
  }

The area I am concerned about is that I've decided to introduce a system property override which *might* prove useful for debugging in the field:

  aspectj.runtime.cflowstack.usethreadlocal

This effectively allows you to turn off the ThreadLocal implementation if you don't want it. I've mainly done this because I've yet to have it confirmed by anyone other than my own test program that all this new code is working as expected. I don't plan on publicising this setting beyond this bug report...
Created attachment 10331 [details] Variant of the submitted test program. This is a modified test program that actually measures something (at least on my Sun 1.4.2 JVM it does). It creates 1000 threads which each grab a load of memory and then terminate, and it checks whether the use of thread-local storage for managing the CFlowStacks makes a difference. Here are my (reproducible) measurements:

  >java TestThreadRunner
  Using ThreadLocal mgmt = true
  Max memory used through the run = 1005544
  Time to execute the program = 16078

  >java -Daspectj.runtime.cflowstack.usethreadlocal=false TestThreadRunner
  Using ThreadLocal mgmt = false
  Max memory used through the run = 11926560
  Time to execute the program = 36359

So it does appear to help:
- Time to execute the program is reduced from 36 seconds down to 16 seconds.
- Storage is reduced from ~11MB to ~1MB.

However, I'd still like proof that the thread stacks we are now managing in a different way are completely correct.
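For reference, the kind of measurement the modified test program makes can be sketched like this: run threads that each allocate a buffer and terminate, sampling peak used heap along the way. This is a rough, hypothetical sketch (class name, thread count, and buffer size are mine, not the attachment's), and Runtime-based heap readings are only approximate.

```java
class PeakMemoryProbe {
    // Run `threads` short-lived threads, each allocating `bytesPerThread`,
    // and return the peak used-heap estimate observed between runs.
    public static long peakUsedBytes(int threads, final int bytesPerThread)
            throws InterruptedException {
        Runtime rt = Runtime.getRuntime();
        long peak = 0;
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    byte[] load = new byte[bytesPerThread]; // simulate a memory-hungry thread
                    load[0] = 1; // touch the buffer so it isn't trivially optimised away
                }
            });
            t.start();
            t.join();
            // Approximate used heap = total heap minus free heap.
            peak = Math.max(peak, rt.totalMemory() - rt.freeMemory());
        }
        return peak;
    }
}
```

With the hashtable implementation, dead threads' buffers stay reachable through the CFlowStack table, so a probe like this would report a much higher peak than with the ThreadLocal implementation, mirroring the 11MB vs 1MB numbers above.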
Added Noel to CC.
Fixed - though it could do with verification by Arno...