127264 – performance testing: add significance information to performance graphs

Bug 127264 - performance testing: add significance information to performance graphs

Summary: performance testing: add significance information to performance graphs

Status:	CLOSED WONTFIX

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	Releng (show other bugs)
Version:	3.2
Hardware:	PC All

Importance:	P3 normal (vote)
Target Milestone:	---
Assignee:	Sonia Dimitrov
QA Contact:

URL:
Whiteboard:	stalebug
Keywords:

Depends on:
Blocks:

Reported:	2006-02-10 09:59 EST by Tom Hofmann
Modified:	2019-09-04 03:00 EDT (History)
CC List:	0 users

See Also:

Attachments
FP_org.eclipse.jdt.text_I20060208-0848_I20060208-0853.eclipseperflin2_R3.2.gif (7.21 KB, image/gif) 2006-02-10 10:05 EST, Tom Hofmann	no flags	Details
performance.diff (55.04 KB, patch) 2006-02-10 10:50 EST, Tom Hofmann	no flags	Details \| Diff
performance.diff (55.54 KB, patch) 2006-02-13 06:22 EST, Tom Hofmann	no flags	Details \| Diff
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Tom Hofmann

2006-02-10 09:59:57 EST

Hi Sonia,

I will attach two patches that allow to augment the performance test reports with information about the statistical significance of the displayed differences. Could you review them? They worked fine for me but would certainly need testing before inclusion in the testing process.

On the fingerprint bars, we currently display the quotient of the means of two samples (one current and one reference sample) as a percentage. However, if one or both samples are not robust (i.e. have a high standard deviation), even a large difference of the two means may not be statistically significant. That is, the difference may have occured by chance and it cannot be concluded from the data that the real average performance is different.

The statistic being tested for significance is the "difference of the means" of the two measurements (current and reference measurement). Significance is tested using the Student's t-test on that statistic. The signifcance level is currently set to 10%, but can be changed easily (see StatisticsUtil.Percentile). A "significance level of 10%" means: the probability that we measured the observed difference on the same code base is less than 10%.

The implemented solution grays out the percentage bars of the fingerprint graphs and the check/failure marks for insignificant tests.

-- implementation notes

- in order to compute the standard error of the difference of the means of a certain test when generating the graphs, we need more information about the collected measurements than just the mean. We also need the standard deviation and sample size. This is implemented by storing them analogous to how the mean was stored up to now: as a DataPoint with a step id identified by the STDEV and SIZE constanst in InternalPerformanceMonitor. See DB.internalStore for the details. There is no database change needed, and the change only modestly increases the amount of stored data.

Since these values will not be available until the next run of the baseline tests, there won't be any visible change on the performance pages until then.

- the significance computation is done in FingerPrint.add (for the graphs) and ScenarioStatusTable.toString (for graying out check/failure marks). The implementation is delegated to StatisticsUtil.

Comment 1 Tom Hofmann

2006-02-10 10:05:24 EST

Created attachment 34492 [details]
FP_org.eclipse.jdt.text_I20060208-0848_I20060208-0853.eclipseperflin2_R3.2.gif

An example fingerprint bar showing two tests, of which one is significant but not the other.

Comment 2 Tom Hofmann

2006-02-10 10:50:07 EST

Created attachment 34500 [details]
performance.diff

Said patch - only one actually, but containing changes to both org.eclipse.test.performance and org.eclipse.test.performance.ui.

Comment 3 Sonia Dimitrov

2006-02-10 11:43:58 EST

Thanks Tom, I will give it a spin and send you some sample results.

Comment 4 Tom Hofmann

2006-02-13 06:22:57 EST

Created attachment 34562 [details]
performance.diff

almost the same - fixes an incorrect assumption in StatisticsSession.

Comment 5 Sonia Dimitrov

2006-02-17 14:50:38 EST

Patch applied, changes released to HEAD.

Comment 6 Sonia Dimitrov

2006-02-20 13:46:25 EST

Patch applied and used in 3.2M5 performance testing (though not released for 3.2M5).  I don't know how to verify that this works since the scenario in the attached gif does not appear to be part of results.  Please reopen if further changes required.

Comment 7 Tom Hofmann

2006-02-21 01:23:31 EST

Cool, thanks!

No change in the results will be visible until there is a reference build available with the additional data. Hm, I assumed that also the reference test runs would be run with the updated version of the performance plug-ins - is this correct? If yes, there should be a visible difference after the next reference test run has collected data.

Comment 8 Sonia Dimitrov

2006-02-21 08:57:46 EST

Will run the next reference test with the change (this Thursday).

Comment 9 Tom Hofmann

2006-03-01 04:54:09 EST

Ok, we start seeing some results with I20060228-1207:

- some tests are greyed due to their statistical insignificance
- the detail view shows a note about the test not being significant

Here are the things that we'll have polish:

- add the OK_greyed.gif icon for insignificant positive measurements (its missing on the details pages)

 > should be easy to correct on your side, right?

- choose a color for insignificance different from the grey used for "explainable" degradations. Yellow was suggested, but may be too much alerting.

 > what do you think?

- add hovers for insignificant fingerprint bars (probably also for explained degradations).

 > this would require adding more HTML output code to the generator - do you think this is feasible?

Also, I noticed that some measurements appear not-greyed out but still carry the note when looking at the details page. I have no idea why this is happening and will investigate.

Comment 10 Sonia Dimitrov

2006-03-12 11:14:54 EST

Reopening.

Comment 11 Sonia Dimitrov

2006-03-12 11:47:44 EST

- add hovers for insignificant fingerprint bars (probably also for explained
degradations).

 > this would require adding more HTML output code to the generator - do you
think this is feasible?

I had investigated this previously (see bug 89805#c20) - help wanted in this area.

Comment 12 Sonia Dimitrov

2006-03-12 16:21:08 EST

Changes for yellow cautionary markings released for next performance test run (i.e this Tuesday's I build).

Comment 13 Lars Vogel

2019-09-04 03:00:52 EDT

This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

If you have further information on the current state of the bug, please add it and remove the stalebug whiteboard tag. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

--