185580 – [results] Standard error should be used instead of Student's t-test to identify unreliable performance tests

Bug 185580 - [results] Standard error should be used instead of Student's t-test to identify unreliable performance tests

Summary: [results] Standard error should be used instead of Student's t-test to identi...

Status:	VERIFIED FIXED

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	Releng (show other bugs)
Version:	3.3
Hardware:	PC Windows XP

Importance:	P3 normal (vote)
Target Milestone:	3.3 RC4
Assignee:	Frederic Fusier
QA Contact:

URL:
Whiteboard:
Keywords:	performance, test

Depends on:
Blocks:	183338
	Show dependency tree

Reported:	2007-05-04 13:01 EDT by Frederic Fusier
Modified:	2008-09-19 04:23 EDT (History)
CC List:	7 users (show)

See Also:

Flags:	jerome_lanneluc: review+ daniel_megert: review+ philippe_mulet: review+ kim.moir: review+

Attachments
Proposed patch (37.48 KB, patch) 2007-05-04 13:05 EDT, Frederic Fusier	no flags	Details \| Diff
Preview of new table look (18.14 KB, application/octet-stream) 2007-05-04 13:28 EDT, Frederic Fusier	no flags	Details
Preview of new table look for all components (562.49 KB, application/octet-stream) 2007-05-09 08:05 EDT, Frederic Fusier	no flags	Details
New proposed patch (39.79 KB, patch) 2007-05-15 14:42 EDT, Frederic Fusier	no flags	Details \| Diff
Corresponding preview of new table (26.72 KB, application/octet-stream) 2007-05-15 14:46 EDT, Frederic Fusier	no flags	Details
Patch finalizing Scenario Status Table look (49.39 KB, patch) 2007-05-30 13:13 EDT, Frederic Fusier	no flags	Details \| Diff
Preview of table final look (74.16 KB, application/octet-stream) 2007-05-30 13:17 EDT, Frederic Fusier	no flags	Details
update manifest with ee (593 bytes, patch) 2007-05-30 14:38 EDT, Kim Moir	no flags	Details \| Diff
New patch finalizing Scenario Status Table look (47.54 KB, patch) 2007-05-31 12:51 EDT, Frederic Fusier	no flags	Details \| Diff
Final patch for new Scenario Status Table look (50.16 KB, patch) 2007-06-01 03:43 EDT, Frederic Fusier	no flags	Details \| Diff
Updated preview of table final look (545.09 KB, application/octet-stream) 2007-06-01 03:46 EDT, Frederic Fusier	no flags	Details
org.eclipse.jdt.text.php preview (24.08 KB, application/octet-stream) 2007-06-01 10:53 EDT, Frederic Fusier	no flags	Details
Patch for Scenario Status Table proposed at comment 30 (53.18 KB, patch) 2007-06-01 12:33 EDT, Frederic Fusier	no flags	Details \| Diff
Needed additional icon with previous patch (840 bytes, image/gif) 2007-06-01 12:38 EDT, Frederic Fusier	no flags	Details
Final released patch (55.32 KB, patch) 2007-06-05 09:53 EDT, Frederic Fusier	no flags	Details \| Diff
Additional patch (1.64 KB, patch) 2007-06-05 13:06 EDT, Frederic Fusier	no flags	Details \| Diff
Show Obsolete (11) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Frederic Fusier

2007-05-04 13:01:06 EDT

I20070502-0010 build performance results (http://fullmoon.ottawa.ibm.com/downloads/drops/I20070502-0010/performance/performance.php) show both t-test and standard error in component's Scenario Status Table (using different colors explained in the legend).

The decision was made to use only the standard error as criterion to decide whether a test is reliable or not. Performance tests results generation should be modified to take into account this...

Comment 1 Frederic Fusier

2007-05-04 13:05:40 EDT

Created attachment 65941 [details]
Proposed patch

This patch back to standard green, red, gray and yellow colors for images. Yellow tests are those which have a standard error greater than 3%. To make the information more readable, the deviation value and its standard error are put in small and gray characters inside the table...

Comment 2 Frederic Fusier

2007-05-04 13:28:27 EDT

Created attachment 65944 [details]
Preview of new table look

Unzip this attached file somewhere on your disk and open the file org.eclipse.osgi.php in a browser...

Comment 3 Kim Moir

2007-05-07 16:12:26 EDT

Hi Frederic

Sonia and I looked at this. We thought the tables looked a bit crowded with the additonal information. Sonia's comment was 

I wonder if the standard error values could go in the raw data tables instead?

Comment 4 Philipe Mulet

2007-05-07 17:04:16 EDT

Problem if moving it elsewhere, is that it requires a lot of clicking to find the same data again... how do you see what's going on for all machines at once ?

The idea is to ease the consumption, and considering the bottom section is kind of for our own eyes only anyway, we can present it how we find it useful.

In the past, no one usually worried about the bottom test results as long as they were green... but you don't know if they are marginally better or much better (since too painful to click through them all). We could remove the standard error for stable tests, but it is also interesting too see it to understand how to improve the bad ones.

Comment 5 Frederic Fusier

2007-05-09 08:05:05 EDT

Created attachment 66443 [details]
Preview of new table look for all components

To have a more general view, here's another preview of all components tables generated locally with this patch for build I20070501-0010 ...

Unzip this attached file somewhere on your disk and open the file global.php in a browser...

Dani, Kent, as perf team members, what's your mind about this?

Comment 6 Jerome Lanneluc

2007-05-10 04:48:10 EDT

(In reply to comment #5)
I like the new look. Using gray for the numbers doesn't make it crowed.

What is the meaning of the green check mark followed by the question mark '?' ?

Comment 7 Frederic Fusier

2007-05-10 06:56:45 EDT

(In reply to comment #6)
> (In reply to comment #5)
> I like the new look. Using gray for the numbers doesn't make it crowed.
> 
> What is the meaning of the green check mark followed by the question mark '?' ?
> 
These are tests which haven't results for the baseline specified through parameters of the generation application. Unfortunately, the "design" of this application does not allow to get other baseline results information easily in this case.

So, I have to make more changes to fix this issue but I'm definitely not convinced that it's really interesting to add another "hack" on the current "design" rather than rewrite it completely (especially to fix performance problems)...

Comment 8 Martin Aeschlimann

2007-05-11 09:02:41 EDT

I think the information in the table is valuable. But we should find a better way to present it, so that the table isn't so crowded.
Maybe leave away the '%'s and right align the derivations?   +3.4    +-1.0

Comment 9 Dani Megert

2007-05-14 09:38:13 EDT

Going back to green, red, yellow and gray is definitely what we want and the preview is very nice. We should make this happen for RC1.

Some additions/corrections:
- explain what the values in the table are
- explain what is compared to get the standard error (so people don't ask again
  next year ;-)
- on the legend replace "Grayed" ==> "Gray"
- center the text compared to the image (table and legend)
- the gray table text could be confused with the gray X. Maybe we can use a 
  different color for the table text (blue?)
- I would remove the '%' in the table and add the deviation into the fingerprint
  to get a consistent story

Comment 10 Kim Moir

2007-05-14 13:49:05 EDT

Frédéric, do you still want me to apply the original patch or will you be incorporating changes from the recent comments?

Comment 11 Frederic Fusier

2007-05-15 14:42:41 EDT

Created attachment 67299 [details]
New proposed patch

This patch includes following changes:
 - explain deviation and standard error computing in the table comment
 - put error into brackets instead of parenthesis
 - do not display the error if it's equals to 0 (test with only one iteration)
 - set space between image and values
 - use blue color instead of gray for values
 - remove the % sign for the error

Hope this proposal is a good compromise on all feedbacks I received. Thanks to all reviewers :-)

Comment 12 Frederic Fusier

2007-05-15 14:46:49 EDT

Created attachment 67300 [details]
Corresponding preview of new table

This preview is only on org.eclipse.ogsi component...

Comment 13 Kim Moir

2007-05-15 15:40:08 EDT

Released.

Comment 14 Kim Moir

2007-05-18 10:47:01 EDT

The performance tests from I20070516-1300, I20070517-0010 and I20070517-1700 are available and use this patch.  Are there additional changes you would like to make or should this bug be closed?

Comment 15 Frederic Fusier

2007-05-21 04:03:14 EDT

Philippe already asked me to increase the size of the font for the numbers in the table... Perhaps there will be other feedbacks during the Wednesday arch' call.

So, I think we can wait until then and finalize this bug by the end of the week to definitely close it for RC2...

Comment 16 Dani Megert

2007-05-21 04:12:11 EDT

At least on my browser it looks nice and I would vote -1 for bigger fonts.

Comment 17 Philipe Mulet

2007-05-21 06:21:31 EDT

Fair enough, I'll check my browser settings.

Only minor suggestions:
- instead of [+/-1], display [+/-1.0] to be consistent
- when only one iteration, use [n/a] instead of nothing

Need to improve the detail message when yellow test, all info is in sequence with no punctuation to separate.

Comment 18 Dani Megert

2007-05-21 07:13:37 EDT

- instead of [+/-1], display [+/-1.0] to be consistent
+1

- when only one iteration, use [n/a] instead of nothing
+1. If enough space we could write "only 1 iteration".

Comment 19 Frederic Fusier

2007-05-30 13:13:39 EDT

Created attachment 69335 [details]
Patch finalizing Scenario Status Table look

This patch addresses Philippe's remarks. There was not enough space to replace [n/a] with 'only 1 iteration', however hints explicitly explain the reason of it. Note that this n/a has been added using yellow color to warn user that test with only one measure is not really appropriate...

This fix also:
 - sets a minimum size for boxes columns to avoid split of numbers in the cells
 - emphasizes names of scenarios which are already displayed in the fingerprint
 - identifies scenario which do not have any results in last baseline run with
   a '*' before their name and the id of the real reference after.

Comment 20 Frederic Fusier

2007-05-30 13:17:56 EDT

Created attachment 69336 [details]
Preview of table final look

This preview is only on org.eclipse.ui component...

Comment 21 Frederic Fusier

2007-05-30 13:20:18 EDT

Reset previous review

Comment 22 Kim Moir

2007-05-30 14:37:41 EDT

Thanks for the patch. I really appreciate it.

I have a compile error in my workspace with at line 291
of ScenarioStatusTable.java

stream.append("</table>");

If I change the project to use a 1.5 vm, the error goes away

You wouldn't know this but our build runs using a 1.4.2 vm for various reasons. I've added an execution environment in the manifest to indicate this.  I'll attach a patch.

In any case, this area needs to be fixed so it works on on 1.4.2 vm.

Comment 23 Kim Moir

2007-05-30 14:38:25 EDT

Created attachment 69354 [details]
update manifest with ee

Comment 24 Frederic Fusier

2007-05-31 07:07:04 EDT

(In reply to comment #22)
> Thanks for the patch. I really appreciate it.
> 
> I have a compile error in my workspace with at line 291
> of ScenarioStatusTable.java
> 
> stream.append("</table>");
> 
> If I change the project to use a 1.5 vm, the error goes away
> 
> You wouldn't know this but our build runs using a 1.4.2 vm for various reasons.
> I've added an execution environment in the manifest to indicate this.  I'll
> attach a patch.
> 
> In any case, this area needs to be fixed so it works on on 1.4.2 vm.
> 
Sorry for that... I'll change this and also fix a minor issue I observed when reference is not the last run baseline.

I should post a new patch soon...

Comment 25 Frederic Fusier

2007-05-31 12:51:14 EDT

Created attachment 69557 [details]
New patch finalizing Scenario Status Table look

This patch fixes the compilation error + some minor issues I noticed in previous one. I hope it can be released for tomorrow integration build...

Comment 26 Frederic Fusier

2007-05-31 13:03:33 EDT

Comment on attachment 69557 [details]
New patch finalizing Scenario Status Table look

Sorry, I discovered a typo while reviewing my patch! I fixed, but, I'd rather prefer to wait tomorrow that my complete local tests pass before posting the definite final patch...

Comment 27 Frederic Fusier

2007-06-01 03:43:31 EDT

Created attachment 69664 [details]
Final patch for new Scenario Status Table look

This patch passes my local tests, all components html page look OK.
So, ready now for the review :-)

Comment 28 Frederic Fusier

2007-06-01 03:46:34 EDT

Created attachment 69665 [details]
Updated preview of table final look

Here's the results of my local tests, you can have an idea of the proposed final look. As usual, unzip the attached file somewhere on your disk and open the global.php file in your favorite internet browser...

Comment 29 Dani Megert

2007-06-01 04:06:39 EDT

Some remarks to the new look (will check the code once we the new look is agreed upon).

1.
>scenario name may be emphasized when the results are also displayed in the >fingerprint
Why do you write "may"? I assume it *will be* emphasized when it's in the fingerprint.

2. at first I was confused to see some bold entries in the table and then seeing bold in the legend (above). It took a while until I saw the explanation of bold under "Hints". I would move the two last items that explain what can bee seen in the table to the upper section.

3. why not use "bold" instead of "emphasized"? I guess you don't use <em> as
   if so, it would be italic and not bold.

4. in the preview of JDT Text I see two different n/a: one where it just
   displays n/a and one where it shows a yellow check mark and in red  [n/a].
   What's that supposed to mean?

Comment 30 Frederic Fusier

2007-06-01 10:53:55 EDT

Created attachment 69708 [details]
org.eclipse.jdt.text.php preview

New preview of org.eclipse.jdt.text component including remarks of comment 29.
Dani, what's your mind about this proposal?

Comment 31 Dani Megert

2007-06-01 11:52:05 EDT

Nice.

Comment 32 Frederic Fusier

2007-06-01 12:33:59 EDT

Created attachment 69726 [details]
Patch for Scenario Status Table proposed at comment 30

Comment 33 Frederic Fusier

2007-06-01 12:38:23 EDT

Created attachment 69728 [details]
Needed additional icon with previous patch

Kim,

This icon *must* be put in images folder of org.eclipse.test.performance.ui project, otherwise the generation would fail!

Note that, while applying the previous patch, you can also remove following unused icons in this foler:
- FAIL_err.gif
- FAIL_ttest.gif
- OK_err.gif
- OK_ttest.gif
- OK_greyed.gif

TIA

Comment 34 Frederic Fusier

2007-06-01 12:42:15 EDT

I'll completely generate the performance result pages during this week-end using previous patch + additional icon to be 100% that these changes will not break the current results pages generation.

If everything is OK on Monday, I'll start the process to include the proposed changes in 3.3 RC4...

Comment 35 Dani Megert

2007-06-01 15:29:07 EDT

I did not run the patch but finally found time to review the code which looks good to me. Thanks Frédéric!

Comment 36 Frederic Fusier

2007-06-04 09:47:41 EDT

The test launched on Friday failed but on a part of the code unchanged with this fix. I'm currently investigating what happened but I will not changed the patch to avoid restart review process...

I relaunched partial test which covers all the generation of pages affected by the patch and everything was fine :-)

I also added another committer to make this bug accepted for RC4...

Comment 37 Jerome Lanneluc

2007-06-04 10:37:02 EDT

+1 for 3.3RC4

Comment 38 Philipe Mulet

2007-06-05 06:57:08 EDT

+1 for 3.3rc4

Comment 39 Kim Moir

2007-06-05 09:11:38 EDT

Released for next build.

Comment 40 Kim Moir

2007-06-05 09:41:07 EDT

Okay, I have now released the patch that you sent me in an email, added the new icon and removed the five unused icons.  Please attach a patch for these changes to the bug, so they are all in one place :-)

Comment 41 Frederic Fusier

2007-06-05 09:53:07 EDT

Created attachment 70131 [details]
Final released patch

This patch contains comment 33 changes + 2 additional safe changes coming from review feedbacks:
1) fix a typo in scenario status table explanation: "bolded" -> in "bold"
2) add some try/catch + printStackTrace to have more information in case of generation failure

New and obsolete icons are unchanged...

Comment 42 Frederic Fusier

2007-06-05 13:06:47 EDT

Created attachment 70174 [details]
Additional patch

I discovered a problem in previous patch: failing icons were never displayed, instead there were always OK... :-(

This additional patch to be applied on top on previous one fixes this problem introduced in comment 27 patch. This is safe as it is only move of two lines to take into account the failure message and use the correct icon...

Sorry for that

Comment 43 Kim Moir

2007-06-05 13:51:45 EDT

thanks, patch released

Comment 44 Frederic Fusier

2007-06-07 01:36:17 EDT

I verified results for I20070606-0010 and everything is OK :-)
We can consider this bug as FIXED now.