[jetty-dev] Statistics - non synchronized standard deviation?

Michael & Simone,

I've been looking at the work Michael has done on

 https://bugs.eclipse.org/bugs/show_bug.cgi?id=302018

and relating it to the work previously done by Simone on
evaluating the benefits of making stats collection non
blocking.

Simone found that even with a 2 CPU machine, there
was significant benefit in using AtomicLongs etc
rather than synchronizing to keep the stats correct.
Simone - can you remember how significant the
performance increase was?

The problem is that the latest patch to include std deviation
support re-introduces synchronization points.  Thus we are
going to lose the performance improvements that were
the motivation for the stats changes in the first place.


So I'd like to consider how we might achieve non synchronized
stats that can include std deviation.

Previously, the statistics accumulated a total and a count,
each as a separate atomic operation.  So whenever the
mean value was obtained, there was a chance that you were
observing an increased total, but not an increased count.


However, for the values that we are considering, such snapshot
issues are not very numerically sensitive.  For example,
if after 1000 requests, each approx 50ms long, you looked
just after the 1001st total increment but before the count
had been incremented, you would see a mean of 50050/1000 = 50.05ms
instead of 50050/1001 = 50.00ms.  Even if you had 8 CPUs
and 7 of them had just added their totals without yet adding
their counts, the snapshot mean could be 50350/1000 = 50.35ms.

So the approx max error in the mean is (mean*CPUs/requests),
which gets vanishingly small with any significant number of
requests.


But the problem is now how to calculate a rolling std deviation
without synchronization.  The basic formulas for this are:

  long total = _total.addAndGet(sample);     // atomic
  long count = _count.incrementAndGet();     // atomic

  double mean = ((double)total)/count;
  double delta = sample - mean;
  totalVariance += delta*delta;              // plain double +=, not atomic


Then to sample the std deviation, we do

  std_dev = sqrt(totalVariance/count)
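
To make that concrete, pulled together the pieces would look
roughly like this (the class shape and names are just my
illustration, not the actual jetty code):

  import java.util.concurrent.atomic.AtomicLong;

  // illustrative sketch only, invented names
  class SampleStatistic
  {
      private final AtomicLong _total = new AtomicLong();  // sum of samples (atomic)
      private final AtomicLong _count = new AtomicLong();  // number of samples (atomic)
      private double totalVariance;                        // sum of squared deltas (plain double)

      public void record(long sample)
      {
          long total = _total.addAndGet(sample);
          long count = _count.incrementAndGet();

          double mean = ((double)total)/count;
          double delta = sample - mean;
          totalVariance += delta*delta;        // the non-atomic += where updates can be lost
      }

      public double getStdDev()
      {
          long count = _count.get();
          return count == 0 ? 0.0 : Math.sqrt(totalVariance/count);
      }
  }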


So we can see that this algorithm is vulnerable
to errors in the mean (discussed above), plus
its own similar numeric sensitivity from the
division by count.  But as we saw, that error
becomes vanishingly small as count gets larger.


The problem is with the totalVariance += operation.

There is no AtomicDouble class and += is not an
atomic operation.  So the problem is not that
an individual snapshot might miss a few increments (which
will self-correct over time), but that some increments
may be overwritten by others and lost for all time.

So errors will accumulate and get larger.  More importantly,
those errors will be more likely when the server is busy,
which is exactly the time we want stats like std deviation.

Has anybody worked around such an issue before with
the desire to atomically update a double without
synchronization?
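
For reference, the kind of thing I mean would be the usual
trick of holding the double's bit pattern in an AtomicLong
and spinning on compareAndSet, roughly (untested sketch):

  import java.util.concurrent.atomic.AtomicLong;

  // sketch only: a lock-free double accumulator over an AtomicLong bit pattern
  class AtomicDoubleAdder
  {
      private final AtomicLong _bits = new AtomicLong(Double.doubleToLongBits(0.0));

      public void add(double value)
      {
          long oldBits;
          long newBits;
          do
          {
              oldBits = _bits.get();
              newBits = Double.doubleToLongBits(Double.longBitsToDouble(oldBits) + value);
          }
          while (!_bits.compareAndSet(oldBits, newBits));  // retry if another thread got in first
      }

      public double get()
      {
          return Double.longBitsToDouble(_bits.get());
      }
  }

But the retry loop spins hardest exactly when the server is
busy, so I'm not sure it would actually beat a synchronized
block in practice.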


My immediate thought is to do it all with long maths,
so we have

  long total = _total.addAndGet(sample);
  long count = _count.incrementAndGet();
  long mean100 = total*100/count;            // mean scaled by 100 (2 decimal places)
  long delta100 = sample*100 - mean100;      // delta scaled by 100

  totalVariance10000.addAndGet(delta100*delta100);   // variance sum scaled by 100*100

and

  std_dev = sqrt(totalVariance10000.get()/10000.0/_count.get())
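
Pulled together, that fixed-point version would look something
like this (again just a sketch with invented names):

  import java.util.concurrent.atomic.AtomicLong;

  // sketch of the all-long, fixed-point (2 decimal place) variant, invented names
  class LongSampleStatistic
  {
      private final AtomicLong _total = new AtomicLong();                  // sum of samples
      private final AtomicLong _count = new AtomicLong();                  // number of samples
      private final AtomicLong totalVariance10000 = new AtomicLong();      // sum of (delta*100)^2

      public void record(long sample)
      {
          long total = _total.addAndGet(sample);
          long count = _count.incrementAndGet();
          long mean100 = total*100/count;
          long delta100 = sample*100 - mean100;
          totalVariance10000.addAndGet(delta100*delta100);   // can overflow Long.MAX_VALUE (see below)
      }

      public double getStdDev()
      {
          long count = _count.get();
          return count == 0 ? 0.0 : Math.sqrt(totalVariance10000.get()/10000.0/count);
      }
  }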

But the problem with that is that the totalVariance10000 value
can hit Long.MAX_VALUE.  If, for example, the deltas in
a connection time were 30s, then a million connections
would blow Long.MAX_VALUE: a 30s delta is 3,000,000 hundredths
of a ms, squared that is 9x10^12, and a million of those is
about 9x10^18, which is right at Long.MAX_VALUE (~9.2x10^18).

A million is pretty high, but not far enough out of sight
to feel safe.  A temporary workaround could be to measure
connection time internally in 10ths of seconds rather
than ms, which would give 100x100 = 10,000 times more samples
before overflow.

But for request stats, I was considering going to more
precision (nanosecs) rather than less.  Perhaps we should
just go to 10ths of a ms instead?   I could also reduce
the accuracy of std deviation to 1 decimal place (and
thus have mean10, delta10 and variance100)?


So what are people's thoughts?

is an approx std deviation better than no std deviation?
is an exact std deviation worth slowing down for?

is there a less numerically sensitive way to gather
a rolling std deviation?


cheers