Problem: The current implementation of CFlowStack (versions 1.1 through 1.2 rc1) stores all threads in a hashtable. These threads are only removed after several calls to the method getThreadStack() (in our case, more than 70 calls). For our project, this cycle is much too long. We have several long-running threads, each of which grabs quite a lot of memory. This memory can only be freed after the threads have been removed from the CFlowStack. In our production code, more than 300 threads accumulate across different CFlowStack variables, which often results in OutOfMemoryError. The attached patch provides an improved freeing scheme: it checks whether the size of the table has grown, and if so, checks whether some threads can be removed.
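To illustrate the idea behind the patch, here is a minimal, hypothetical sketch (not the actual AspectJ source): remember the table size after the last clean-up, and only scan for dead threads when the table is about to grow past that size. Class, field, and method names are invented for illustration, and a plain Object stands in for the real per-thread stack.

```java
import java.util.Enumeration;
import java.util.Hashtable;

class GrowTriggeredCleanup {
    private final Hashtable<Thread, Object> stacks = new Hashtable<Thread, Object>();
    private int oldsize = 0; // table size recorded after the last clean-up

    public Object getThreadStack(Thread t) {
        Object s = stacks.get(t);
        if (s == null) {
            // The table is about to grow past the size it had after the last
            // clean-up, so scan for dead threads before adding the new entry.
            if (stacks.size() + 1 > oldsize) {
                removeDeadThreads();
            }
            s = new Object(); // stands in for a per-thread cflow stack
            stacks.put(t, s);
            oldsize = stacks.size();
        }
        return s;
    }

    private void removeDeadThreads() {
        Enumeration<Thread> e = stacks.keys();
        while (e.hasMoreElements()) {
            Thread t = e.nextElement();
            if (!t.isAlive()) {
                stacks.remove(t); // frees whatever the dead thread's stack was pinning
            }
        }
    }

    public int size() {
        return stacks.size();
    }
}
```

Under this scheme a steady-state system (no new threads) never pays for a liveness scan, while a system that keeps creating threads cleans up on each growth step.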
Created attachment 9938 [details] Patch which solves the problem
Created attachment 9939 [details] Sample. A short sample which demonstrates the problem. If run with the patch, the number of threads never exceeded 5 (in most runs it didn't even reach that). If run without the patch, the number of threads easily exceeds 100.
I've verified that the test program behaves as described - I see the threads getting up to ~140 before being cleaned up; with the patch it never goes above 4 for me. It gets to ~140 because when it reaches that level, the line

  if (change_count > Math.max(MIN_COLLECT_AT, COLLECT_AT/size)) {

returns true, which causes a tidy-up (since COLLECT_AT is 20000).

It seems the static fields COLLECT_AT and MIN_COLLECT_AT are for configuring the frequency of tidying up (although they can't currently be 'set' by a user). The patch adds another variable, oldsize, and the behavior then becomes: tidy up as soon as the stack grows by even one element. If I trace the tidying-up code, it seems to run almost constantly against the test program. I think I could achieve the same result by setting both of the COLLECT_AT fields to 1.

That doesn't seem to be quite what we want. It is a bit awkward because different users/applications are likely to have different requirements on the space/performance trade-off.
I knew I'd realise something as soon as I appended to the bug ;) You can't quite replicate what the patch does by setting COLLECT_AT and MIN_COLLECT_AT to 1 - if they were set to 1, we would frequently be checking whether all the threads we currently know about are alive. With the patch we only go through all the threads checking for liveness if a new one is added to the system, or if the current test triggers it:

  (change_count > Math.max(MIN_COLLECT_AT, COLLECT_AT/size))

This means that if the system reaches a steady state (no new threads are being started) then we behave as we do today, but if the system is constantly seeing new threads start then it trims the hashtable more frequently.

So I think what I'm saying is that the proposal in the patch is a good one. The reason it looks a bit extreme on first inspection is that the sample program is a little contrived: it just fires off threads, never reaching a steady state. Anyone want to comment?
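A small demonstration of why the existing trigger fires at ~140 threads: in a table that only ever grows, change_count and size march up together, so the condition first holds at roughly sqrt(20000) ≈ 141. COLLECT_AT = 20000 is from the comments above; MIN_COLLECT_AT = 2 is an illustrative guess, and the simulation below is mine, not AspectJ code.

```java
class TriggerDemo {
    static final int COLLECT_AT = 20000;
    static final int MIN_COLLECT_AT = 2; // illustrative value, not from the source

    // Simulate a stack that only ever grows: each new thread bumps both the
    // table size and change_count, until the tidy-up condition first fires.
    static int firstTidySize() {
        int changeCount = 0;
        int size = 0;
        while (true) {
            size++;        // a new thread is added to the hashtable...
            changeCount++; // ...and counted as a change
            if (changeCount > Math.max(MIN_COLLECT_AT, COLLECT_AT / size)) {
                return size; // first tidy-up happens here
            }
        }
    }
}
```

With integer division, size = 141 gives 20000/141 = 141 (not exceeded), and size = 142 gives 20000/142 = 140 (exceeded), so the first tidy-up lands at 142 - matching the ~140 threads observed in the test program.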
The current implementation of CFlowStack was built by a PARC summer intern as a one-week project many, many years ago (~version 0.7). I don't believe the implementation has been carefully evaluated or changed since then. Last week Matthew asked me about using ThreadLocals here for performance reasons, and I replied:

  I think that it would be good to switch the cflow per-thread implementation in rt.jar to use ThreadLocal when run on a VM that supports this. This might have some performance benefit, but I'm doubtful it would be very large right away. The main benefit I see is that the current version has some memory leak issues with holding on to thread objects after they've terminated. I've never heard a report of this being a real problem in the field, but it is a nagging issue.

Well, now I've heard about this being a real problem in the field ;-) The only reason not to just use a ThreadLocal is that it is a 1.2 feature; currently rt.jar only uses 1.1 features. I see several reasonable technical solutions, all of which have different trade-offs.

1. Make Arno's proposed change. This is the simplest change to the code and clearly fixes Arno's problem. However, you'd need to spend some time understanding the performance issues where there are a large number of "stable" threads and a small number being consistently created and destroyed. For example, if you have 300 threads in the system (and the hashtable), then every time the table grows all 300 threads will be checked for liveness. This could be expensive for some use cases.

2. Switch the design of CFlowStack to use a ThreadLocal instead of the hashtable. This is obviously the most efficient and precise way of mapping a per-thread field, and would also be a fairly simple implementation. The major downside of this approach is that AspectJ programs that use cflow would then only run under 1.2 or later VMs.

3. Build a dual implementation of CFlowStack that uses ThreadLocal if it is available and falls back to a simple hashtable when running under 1.1. This is a slightly more challenging implementation, because the 1.2 code must be kept well separate from the 1.1 code. However, it would give all of the benefits of #2 while still maintaining compatibility with 1.1 VMs.

There are additional compromise solutions, such as building two different rt.jars, one of which will run on 1.1; or making a CFlowStack11 class and requiring a compile-time flag to select its use.

I'd recommend implementing solution #3 and spinning one more release candidate with this change and a couple of others. At this stage, solution #3 should ideally be implemented by a pair, or should have a thorough code review process with people looking at the solution; this change clearly has the potential to cause stability problems. However, Arno's bug looks serious enough to me that it should be solved for 1.2 (and I've raised the severity to note that). This kind of bug can be very frustrating to developers because it only manifests itself as an OutOfMemoryError. All of the possible solutions have the potential to break existing AspectJ programs if implemented incorrectly; however, #3 is the only solution that I can say with confidence won't break anything, so long as it's implemented correctly.
Created attachment 10255 [details] ThreadLocal fix for this bug. Arno, please can you try this version of aspectjrt.jar? It is an implementation of Jim's idea to use a ThreadLocal implementation for the stack when on a JVM that supports it. I've done some testing and it seems OK, and it passes all the tests in the harness - but I'm not sure how much they validate the cflow behavior I have changed. In case you aren't sure whether it is doing anything at all, you can invoke your program with the system property "debugcflow=yes" and it will tell you through System.err whether it is trying to use the ThreadLocal implementation. Are you at all able to run your program on a 1.1 JVM to see if it works 'as before'?
Fix checked in. If there is one piece of the implementation that needs a review, it is this one method in CFlowStack:

  private static void selectFactoryForVMVersion() {
      String override = System.getProperty("aspectj.runtime.cflowstack.usethreadlocal", "unspecified");
      boolean useThreadLocalImplementation = false;
      if (override.equals("unspecified")) {
          String v = System.getProperty("java.class.version", "0.0");
          // Java 1.2 is version 46.0 and above
          useThreadLocalImplementation = (v.compareTo("46.0") >= 0);
      } else {
          useThreadLocalImplementation = override.equals("yes") || override.equals("true");
      }
      // System.err.println("Trying to use thread local implementation? " + useThreadLocalImplementation);
      if (useThreadLocalImplementation) {
          tsFactory = getThreadLocalStackFactory();
      } else {
          tsFactory = getThreadLocalStackFactoryFor11();
      }
  }

The area I am concerned about is that I've decided to introduce a system property override which *might* prove useful for debugging in the field:

  aspectj.runtime.cflowstack.usethreadlocal

This effectively allows you to turn off the ThreadLocal implementation if you don't want it. I've mainly done this because I've yet to have it confirmed by anyone other than my own test program that all this new code is working as expected. I don't plan on publicising this setting beyond this bug report...
Created attachment 10331 [details] Variant of the submitted test program. This is a modified test program that actually measures something (at least on my Sun 1.4.2 JVM it does). It creates 1000 threads which each grab a load of memory and then terminate, and it checks whether the use of thread-local storage for managing the CFlowStacks makes a difference. Here are my (reproducible) measurements:

  >java TestThreadRunner
  Using ThreadLocal mgmt = true
  Max memory used through the run = 1005544
  Time to execute the program = 16078

  >java -Daspectj.runtime.cflowstack.usethreadlocal=false TestThreadRunner
  Using ThreadLocal mgmt = false
  Max memory used through the run = 11926560
  Time to execute the program = 36359

So it does appear to help:
- Time to execute the program is reduced from 36 seconds down to 16 seconds.
- Storage is reduced from ~11MB to ~1MB.

However, I'd still like proof that the thread stacks we are now managing in a different way are completely correct.
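For reference, the kind of measurement the modified test program makes can be sketched like this: run threads that each allocate a buffer and terminate, sampling peak used heap along the way. This is a rough, hypothetical sketch (class name, thread count, and buffer size are mine, not the attachment's), and Runtime-based heap readings are only approximate.

```java
class PeakMemoryProbe {
    // Run `threads` short-lived threads, each allocating `bytesPerThread`,
    // and return the peak used-heap estimate observed between runs.
    public static long peakUsedBytes(int threads, final int bytesPerThread)
            throws InterruptedException {
        Runtime rt = Runtime.getRuntime();
        long peak = 0;
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    byte[] load = new byte[bytesPerThread]; // simulate a memory-hungry thread
                    load[0] = 1; // touch the buffer so it isn't trivially optimised away
                }
            });
            t.start();
            t.join();
            // Approximate used heap = total heap minus free heap.
            peak = Math.max(peak, rt.totalMemory() - rt.freeMemory());
        }
        return peak;
    }
}
```

With the hashtable implementation, dead threads' buffers stay reachable through the CFlowStack table, so a probe like this would report a much higher peak than with the ThreadLocal implementation, mirroring the 11MB vs 1MB numbers above.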
Added Noel to CC.
Fixed - though it could do with verification by Arno...