Bug 292831 - Compare files claims that identical lines are different.
Summary: Compare files claims that identical lines are different.
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Compare (show other bugs)
Version: 3.4.2   Edit
Hardware: PC Windows XP
: P3 normal (vote)
Target Milestone: 3.6 M7   Edit
Assignee: Pawel Pogorzelski CLA
QA Contact:
URL:
Whiteboard:
Keywords:
: 303533 (view as bug list)
Depends on:
Blocks: 304332 306522
  Show dependency tree
 
Reported: 2009-10-20 18:56 EDT by David Gross CLA
Modified: 2010-05-12 11:05 EDT (History)
8 users (show)

See Also:


Attachments
COBOL programs (222.19 KB, text/plain)
2009-10-20 19:00 EDT, David Gross CLA
no flags Details
Cobol program (185.85 KB, text/plain)
2009-10-20 19:01 EDT, David Gross CLA
no flags Details
Snap (28.26 KB, image/png)
2010-03-25 11:26 EDT, Pawel Pogorzelski CLA
no flags Details
Patch_v01 (9.01 KB, patch)
2010-03-27 05:52 EDT, Pawel Pogorzelski CLA
no flags Details | Diff
Warning image (324 bytes, application/octet-stream)
2010-03-27 05:53 EDT, Pawel Pogorzelski CLA
no flags Details
Patch_v02 (2.44 KB, patch)
2010-04-12 10:14 EDT, Pawel Pogorzelski CLA
no flags Details | Diff
Patch_v03 (1.30 KB, patch)
2010-04-12 10:49 EDT, Pawel Pogorzelski CLA
no flags Details | Diff
Snap (20.47 KB, image/png)
2010-05-12 10:59 EDT, Pawel Pogorzelski CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description David Gross CLA 2009-10-20 18:56:54 EDT
User-Agent:       Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
Build Identifier: 3.4.2R342_20090122

The compare editor appears to loose its place and wrongly marks
lines that are identical as being different.

Reproducible: Always

Steps to Reproduce:
Scroll down in the BA154T.CBL file to line 4626. The right column    
of the BA668T.CBL will scroll with it.  Line 4626 pairs with line    
3972.  That is correct.  The problem is the shaded lines that follow.
RD/z by shading, claims that the next few lines are different too.   
Not so.  Lines 4627 - 4639 which pair with 3973 - 3987 are the same  
and should not be shaded.                                            
The preference setting of "Ignore white space" makes no difference.  
Two files are attached (I hope, because this panel gives me no opportunity to add an attachment).
Comment 1 David Gross CLA 2009-10-20 19:00:14 EDT
Created attachment 150037 [details]
COBOL programs

Compare this file BA154T.cbl to the other attachement BA668T.cbl
Comment 2 David Gross CLA 2009-10-20 19:01:53 EDT
Created attachment 150038 [details]
Cobol program
Comment 3 Tomasz Zarna CLA 2009-10-22 05:32:15 EDT
You're absolutely right. I compared both snippets, checked the lines you mentioned and they do look the same but compare claims they differ.
Comment 4 Pawel Pogorzelski CLA 2010-02-04 10:28:56 EST
It could be a bug in compare algorithm implementation since both Unix diff and CVS server return a correct result.

I'll provide more information after further investigation.
Comment 5 Pawel Pogorzelski CLA 2010-02-05 08:14:57 EST
The result accuracy is capped if the product of files' lengths (in lines) is bigger than 10 million. For the files attached the product is > 23M, that's why the results are not as expected.

The reason for limiting the accuracy is non linear increase in computation time for long files. After removing the limit comparing 5K lines files takes ~1 s (previously ~330 ms). For files 25K lines long the time increases to ~20 s (from ~2.5 s).

The accuracy capping I mention doesn't prevent many big files from being compared correctly. The problem happens when there are multiple changes all over the files.

Not sure if the current algorithm is as good as we could get but it's not very probable to have it changed in the current release. Other solutions are an option to rise/remove limit or doing it without a switch.

CC'ing Dani, maybe he could give some advice.
Comment 6 Dani Megert CLA 2010-02-08 07:10:04 EST
We need to be careful if we're going to change a known algorithm (Myers's LCS in this case): it might not only hit performance but also increase memory consumption to point where it fails with OOMEs. 

What we should definitely do is surface the cutting to the user either directly in the compare editor or via dialog. In the latter case we could allow the user remove the capping.
Comment 7 Szymon Brandys CLA 2010-02-08 07:54:39 EST
I share Dani's concerns about removing the capping. 

We should surface it to users, however most of them would set the maximum value for the capping since they mostly care about the accuracy. Thus, I would place it somewhere in the properties dialog with a warning about consequences of changing the value.
Comment 8 Dani Megert CLA 2010-02-10 03:20:37 EST
>We should surface it to users, however most of them would set the maximum value
>for the capping
It can take minutes in some cases. How about another approach: we move the comparison in another thread and don't cap. If more than e.g. 3 seconds pass we automatically kill the thread and compare using the capped approach. You could surface the timeout as a preference.

Note that we do something similar already in the SDK, e.g. when fetching parameter names from attached Javadoc.
Comment 9 Tomasz Zarna CLA 2010-02-10 17:24:44 EST
I like the idea with the timeout. IIUC under the most pessimistic scenario the user will need to wait 3 secs (the timeout) + the time need to run the comparison again (capped), is that correct?

Another solution could be to do it the other way round: run capped comparison first (as it takes place now) and when done run it again without the cap (in a background thread). Then check if the second comparison brings more detailed result, if so prompt to reload.
Comment 10 Dani Megert CLA 2010-02-11 04:07:47 EST
>...is that correct?
Yes.

Your suggestion at least duplicates the load on the machine for every comparison and it has the risk that the non-capped thread runs for minutes ;-)
Comment 11 Pawel Pogorzelski CLA 2010-02-11 05:13:23 EST
(In reply to comment #8)
> >We should surface it to users, however most of them would set the maximum value
> >for the capping
> It can take minutes in some cases. How about another approach: we move the
> comparison in another thread and don't cap. If more than e.g. 3 seconds pass we
> automatically kill the thread and compare using the capped approach. You could
> surface the timeout as a preference.

We can approximately tell how big file we can compare in 3 seconds. It depends on machine and line lengths of course but the rough approximation shouldn't deviate in a major way. That's why I don't see the point of launching the comparison in a separate thread if in most cases we know a priori if it will be killed.

Couldn't we simply add an option like "Optimize compare for speed (reduces accuracy for large files)"? This would be set to true by default.

I'm in favor of the option since the capped results are pretty much accurate too and this is a first bug that mentions the problem (to my knowledge) since algorithm introduction in 2006.
Comment 12 Dani Megert CLA 2010-02-11 05:32:34 EST
>Couldn't we simply add an option like "Optimize compare for speed (reduces
>accuracy for large files)"? This would be set to true by default.
It could take minutes when disabled and in that case the user has no decent way out of it i.e. he has to remember that he disabled that setting and then he has to go to the preferences, change it and compare again.

> and this is a first bug that mentions the problem (to my knowledge) since
>algorithm introduction in 2006.
Well, that bug is hard to detect since it happens with big files and then you have to carefully look at the diff to see that it's not 100% correct.
Comment 13 Tomasz Zarna CLA 2010-02-16 05:42:05 EST
Another approach could be to run the capped comparison as we have been doing so far, but for large files schedule non-capped comparison in the background thread. When done and the result is different than the one from capped comparison we could display a dialog saying that a more accurate result is available. I'm aware that the user could be in the middle of copying changes from one side to the other so the dialog is a necessity.
Comment 14 Szymon Brandys CLA 2010-02-16 06:02:13 EST
Another option being discussed already is to run the non-capped comparison always, with an option to cancel and fall back to the capped comparison. This could be parameterized (in preferences), so user could decide to have this fallback or not.

This could work this way:
1. A user triggers a compare, he gets a progress dialog
2. The comparisons lasts for minutes, so he wants to cancel it (now he actually can't do it)
3. He cancels and depending on the fallback preference, the comparison ends or falls back to the capped version
4. If the comparison falls back to the capped version, we should perhaps see a message in the dialog about accuracy

For short files, running the non-capped comparison should not be an issue. 
What do you think?
Comment 15 Susan McCourt CLA 2010-02-16 13:16:01 EST
I like the ideas behind comment #9 and comment #13.  I think we should try to make it work the way it does now (speed and algorithm) for cases where the accuracy is good enough.  If the problem only happens for files >10 million lines of code, then I think only users with those large files should pay...

The problem is how to let the user know that the results may not be good, and what should they be able to do while the results are preliminary?  If we are simply running a background non-capped comparison without saying anything, then they could be making changes based on bogus data.  And it sounds as if the cases where it matters could take minutes to correct, right?

What about this idea?

- Run the capped comparison as we always have.
- When we are ready to show results, we know at that time if the file has exceeded the accuracy cap.  Show a dialog.
------------
The files being compared exceed 10 million lines and require special analysis which may take several minutes.  Preliminary results may not be accurate.

(*) Show preliminary results and notify me when final results are available
( ) Wait for final results

     [ ] Remember my decision
                                                         [OK]
------------------
I tend to prefer radios over a yes/no question for a complicated question.

If the first option is chosen, then the results view opens with the capped results, the non-capped comparison is run in the background.  Once it's done, the user gets a dialog telling them either that the preliminary results have been verified, or that the final results are available and will now be shown in the view.

If the second option is chosen, then a jobs progress dialog is shown, the user can dismiss to background and work on other stuff, or wait.  Results are shown when job is done.

I could imagine choosing the first option if I'm just doing compares to find a certain revision (when did this part of the code change?), or when I plan to wait anyway, but want to see how accurate the early results were.  Is the failure case always a "false positive differences" as reported here?  That doesn't seem too bad.  Over time I'd probably come to trust the preliminary results.

One could argue that the "wait for results" is a stupid choice to offer, but it seems even dumber not to offer a choice at all and simply say, "We aren't sure if these results are correct, but wanted to show them to you while we do more work."  And not telling the user anything just opens them up for spending time on data that is bad, and then surprising them later with a dialog that tells them we found different results.

I think it's important to state the specific size cap where this dialog kicks in, so users will understand why it appears "all of the sudden" for files they've been comparing in previous releases without incident.
Comment 16 Dani Megert CLA 2010-02-17 09:08:49 EST
>- When we are ready to show results, we know at that time if the file has
>exceeded the accuracy cap.  Show a dialog.
AFAIK, even though we capped, the result is still correct in most such cases hence we'd show a dialog too often.

I prefer Szymon's solution (comment 14) which is actually what I discussed with Tomasz via chat a few days ago. I always hate to introduce new dialogs if that can be avoided.
Comment 17 Dani Megert CLA 2010-02-17 09:09:56 EST
-1 for comment 13 ;-)
Comment 18 Susan McCourt CLA 2010-02-17 10:30:27 EST
(In reply to comment #16)
> >- When we are ready to show results, we know at that time if the file has
> >exceeded the accuracy cap.  Show a dialog.
> AFAIK, even though we capped, the result is still correct in most such cases
> hence we'd show a dialog too often.
> 
> I prefer Szymon's solution (comment 14) which is actually what I discussed with
> Tomasz via chat a few days ago. I always hate to introduce new dialogs if that
> can be avoided.

As I understand, it though, you are penalizing users for whom the capped comparison is known to be sufficient.  (ie, a 25K line of code is 20 seconds now instead of 2.5).  I completely agree with the principle of avoiding new dialogs, but what you are suggesting introduces more waiting and a progress dialog for files somewhere between 5K and 25K lines of code, and the other approach only introduces a dialog when the line is >10M lines of code.

It just seems wrong to introduce that penalty when there is such a huge difference in where the penalty is introduced and where any bug is actually fixed.
Comment 19 Pawel Pogorzelski CLA 2010-02-17 10:39:27 EST
(In reply to comment #18)

In comment 5 I mention a product of two files' lengths. If the files are around 4K lines each then the 10M limit is exceeded and we cap.
Comment 20 Susan McCourt CLA 2010-02-17 10:47:59 EST
(In reply to comment #19)
> (In reply to comment #18)
> 
> In comment 5 I mention a product of two files' lengths. If the files are around
> 4K lines each then the 10M limit is exceeded and we cap.

my bad, I didn't read that as a mathematical product.  I thought you meant the problem was a "product of the file's length".

Now I understand much better the proposed solution and agree that we would never want to introduce the dialog so often.
Comment 21 Dani Megert CLA 2010-02-17 10:50:56 EST
>  (ie, a 25K line of code is 20 seconds now instead of 2.5)
>It just seems wrong to introduce that penalty when there is such a huge
>difference in where the penalty is introduced and where any bug is actually
>fixed.
Agree. But if that's the case the dialog approach doesn't help much either because it causes more load one every compare (separate thread for each cap case).

How about the approach from comment 8?
Comment 22 Pawel Pogorzelski CLA 2010-02-17 11:01:21 EST
(In reply to comment #21)
> How about the approach from comment 8?

Simply bumping the limit would give a very similar result. An we would't have to launch the task for the second time discarding the uncapped computation done.
Comment 23 Dani Megert CLA 2010-02-17 11:04:35 EST
>Simply bumping the limit would give a very similar result.
Similar to what?
Comment 24 Pawel Pogorzelski CLA 2010-02-17 11:11:47 EST
(In reply to comment #23)

Similar to the approach from comment 8. The only advantage I see is the timeout is explainable to the end user better than a "product of files lengths" parameter. But this is a valid argument only if we want to surface the preference in the UI.

As I wrote in comment 11, we can approximately tell how big file we can compare in 3 seconds. So why to bother to start a thread if we have a very good chance to say in advance if it will be killed.
Comment 25 Dani Megert CLA 2010-02-17 11:20:25 EST
>We can approximately tell how big file we can compare in 3 seconds.
Is that a gut feeling or do you have a mathematical formula? If it's not an exact computation then how would you avoid that someone runs into this same bug again?
Comment 26 Pawel Pogorzelski CLA 2010-02-17 11:27:17 EST
(In reply to comment #25)
> Is that a gut feeling or do you have a mathematical formula?

It's the same that introduced the capping in the first place. It depends on things such a similarity of documents, similarity of lines, their lengths and the machine. So it cannot be known exactly a priori.
Comment 27 Dani Megert CLA 2010-02-17 11:28:56 EST
> So it cannot be known exactly a priori.
OK, so it can't be used fix this bug for all cases, right? It just makes the hole smaller.
Comment 28 Pawel Pogorzelski CLA 2010-02-18 05:20:39 EST
I got following concerns with the thread approach:
 - It adds an extra time for each comparison exceeding the threshold;
 - It's hardly explainable to the end user ("kill the comparison process if it  takes longer than x seconds and start an optimized compare" option?);
 - It gives the user no clue if the results are capped (could be added to the  extra information in the status line but it's not visible by default, moreover it would look odd there);
 - It complicates the flow since the decision should be made per compare operation (I want the check if the files differ vs. I want to examine carefully the differences) and in order to change this the user would have to go into preferences.

Susan, you're right, only false positives are possible. If capping kicks in two subsequent hunks could be merged and the lines between them treated as a change.

I'm not a fan of replacing the document content after the initial optimized comparison is made. It adds unnecessary complexity.

What I think is the most appropriate is a dialog show before first capped comparison with following options:
--------------------------------------------------------
For long files time-optimized comparing algorithm may be used. It could decrese changes granularity and result with false positives for lines between the actual changes.
 (*) Optimize
 ( ) Show precise results
    [] Remember the decision
--------------------------------------------------------
This approach has none of the disadvantages mentioned at the beginning.
Comment 29 Dani Megert CLA 2010-02-18 09:00:10 EST
I see, you are afraid of thread programming ;-)

>I'm not a fan of replacing the document content after the initial optimized
>comparison is made. It adds unnecessary complexity.
I agree on that.


Let's step back for a moment:
- fact is that unless we run the non-capped version we don't know whether our 
  result is accurate or not.
- when someone compares something the main goal is to find the differences hence
  I'd bet that 99.9% prefer the accurate result over time optimization
- there is one small scenario where it really takes too long and one would rather
  get rough comparison than waiting for minutes

Therefore the timeout solution or the one that allows to fallback on cancel is the right choice here.
Comment 30 Pawel Pogorzelski CLA 2010-02-18 11:32:31 EST
(In reply to comment #29)

I'm still not sure about overriding Cancel button's behavior. See bug 287887, I can imagine a similar bug if we implement such feature.

Susan, what do you think about it?
Comment 31 Dani Megert CLA 2010-02-18 11:37:36 EST
>I'm still not sure about overriding Cancel button's behavior. See bug 287887, I
>can imagine a similar bug if we implement such feature.
Pawel, we had a long discussion about pros and cons and a possible suggested solutions. Why just mention that part (which obviously you don't like ;-) and not the issues we discussed with the proposed solution?

Note that in contrast to the wizard it's not easy to sneak in another button here.

I open for better solutions. What I don't want is:
- always get dialogs
- have no way except going to the preference to get a result in case the user
  chooses to get the uncapped version and it takes minutes
Comment 32 Pawel Pogorzelski CLA 2010-02-18 11:51:46 EST
(In reply to comment #31)

I know the pros redefining the button give. I'm fully in favor of them. Just redefining the Cancel doesn't seem the best idea to achieve it. That's what I meant.

If the task could be split into two there could be a more advanced progress dialog shown that enables cancelling one of the tasks (which would result in an optimized compare). That's more compliant with UI rules.
Comment 33 Dani Megert CLA 2010-02-18 11:54:22 EST
What if I've chosen to run the jobs in background?
Comment 34 Susan McCourt CLA 2010-02-18 20:39:19 EST
This is starting to get weird.  
I don't like the timeout because it basically says we would encourage use of the capped/optimized algorithm for the larger files that can't complete in the timeout...but these larger ones are the very ones more likely to have the bug.  And we would always add computation time to the smaller files (the ones that can complete in under 3 seconds) which really don't benefit from the non-capped comparison in the first place.  Yikes! 

What about this idea?  Not ideal but it doesn't penalize anyone except those who have the bug.

- compare as today (capped)
- the code already processes and shows each difference in the view (that's the hard part).  Add code that recognizes that we didn't find any differences.  In this case (only) we put up a dialog:

"There are no differences found in the selection.  This can happen when comparing very large files.  A more accurate (but slower) comparison can be used, which would not show this section of the file as being different.  Would you like to use the more accurate comparison?"

I'm betting that most users would *not* want to compare again, they just want affirmation that they aren't crazy.

If they *do* choose to use the slower comparison, we could offer to remember the choice, but if we are remembering the choice it might be better to remember it per file rather than switch it globally.
Comment 35 Dani Megert CLA 2010-02-19 06:03:43 EST
When I discussed this yesterday with Pawel, we already came to the conclusion that timeout is not the best approach. It's also kind of fuzzy.

Let's state what we know and what the solution must offer:
a) I searched through bugzilla and so far this is the first complaint we got 
   about this, so we should not shoot with cannons on birds ;-)

b) In case where the capping happens the result can either be correct or it can
   contain too much changes - it does not miss any changes. So there's just too
   much noise.

c) Currently cancel does not work. Before we even think about
   using the uncapped algorithm for large files/diffs, this has to be fixed.

d) I agree with Susan that an important part of the fix must be to inform the
   user about what's going on.

e) In case it is decided to offer a solution where the user can choose in a 
   preference or via "remember my choice" to always use the uncapped algorithm
   we have to offer an easy way to fallback to the fast algorithm. Having to go
   to the preference, change it, and compare again is not considered easy.


Given those constraints, how about the following pragmatic solution:
1. We behave the same as now when compare is started.
2. If we cap, we show this inside the compare editor (no dialog!).
3. Once we have fixed item c) we can enrich the message with either a link
   or button that allows to run the slower algorithm. To simplify things
   we could disable that option as soon as the user made changes and make the
   compare editor read-only as soon as the button gets pressed.
Comment 36 Pawel Pogorzelski CLA 2010-02-19 06:19:42 EST
(In reply to comment #35)
> Given those constraints, how about the following pragmatic solution:

Sounds good. It's less invasive than the dialog and makes users aware of the optimization.
Comment 37 Susan McCourt CLA 2010-02-19 16:14:27 EST
(In reply to comment #35)

> Given those constraints, how about the following pragmatic solution:
> 1. We behave the same as now when compare is started.
> 2. If we cap, we show this inside the compare editor (no dialog!).
> 3. Once we have fixed item c) we can enrich the message with either a link
>    or button that allows to run the slower algorithm. To simplify things
>    we could disable that option as soon as the user made changes and make the
>    compare editor read-only as soon as the button gets pressed.

I was suggesting being even more selective in #2.
Because my understanding is we cap much more often than we have incorrect results.

We only show the message when the user moves to a "difference" and we determine that there is actually no difference to show.  At that time we show the message in the editor.  I think this is better because we don't really want to pollute the user's headspace with this problem until we've discovered they actually have it.  The only case where I can imagine this is not a better solution is if the user were wanting only to know "are there any differences at all?" and we incorrectly showed differences and the user never bothered to examine them.

So if the user is comparing the files, gets the noise, and thinks, "I didn't think there were this many changes," as soon as they hit a difference that is not real, then we tell them about the issue, but not before.
Comment 38 Dani Megert CLA 2010-02-22 03:17:00 EST
>We only show the message when the user moves to a "difference" and we determine
>that there is actually no difference to show.
Oh, I see now Susan, good idea! In that case we could probably even get away with running the whole algorithm again and don't even require c) from comment 35 to get fixed. Instead we simply recompute the diff on each 'Next/Previous Diff' and adjust the UI if necessary. Tomek, what do you think about that?
Comment 39 Dani Megert CLA 2010-02-22 03:30:21 EST
>Tomek, what do you think about that?
Sorry, I meant Pawel here.
Comment 40 Pawel Pogorzelski CLA 2010-02-22 04:11:11 EST
The user could edit the file before we find out some of the diffs are not computed precisely. Moreover I'm not sure if we could limit computation to the neighborhood of the change.

I prefer solution from comment 35. It's simpler, results in a predictable flow (maybe less convenient than Susan's proposal) and leaves less questions open.
Comment 41 Dani Megert CLA 2010-02-22 04:26:58 EST
>The user could edit the file before
The compare editor already updates the regions when typing, but I see your point: if we follow Susan's approach then we have to compare those diffs which are currently shown in the viewport - doing it on next/previous diff is not enough.
Comment 42 Susan McCourt CLA 2010-02-22 13:08:32 EST
(In reply to comment #41)
> >The user could edit the file before
> The compare editor already updates the regions when typing, but I see your
> point: if we follow Susan's approach then we have to compare those diffs which
> are currently shown in the viewport - doing it on next/previous diff is not
> enough.

If the complexity introduced in the code is manageable, I still think it's worth doing.  Not doing so is "simpler" for the implementation, but introduces a bunch of implementation-related noise to the user that in most cases is unnecessary anyway.

But I don't live in the code, just trying to provide the outsider point of view, so I trust you guys will do what seems best...
Comment 43 Dani Megert CLA 2010-02-23 03:54:11 EST
Computing the diffs for the viewport is non-trivial, especially when the text already changed (e.g. in a Java editor). I suggest we first go ahead with the simple solution. If we see that the message appears too intrusive we can discuss more complex approaches or maybe tweak the cap a little bit.
Comment 44 Pawel Pogorzelski CLA 2010-02-23 04:09:35 EST
*** Bug 303533 has been marked as a duplicate of this bug. ***
Comment 45 Szymon Brandys CLA 2010-02-23 06:37:27 EST
We should provide a small and safe fix first, since we want to backport it to 3.5.2+ and maybe 3.4.2+. Then we can discuss fancier solutions and improvements in this area.

Thus I agree with Dani and vote for adding the label (message) first.
Comment 46 David Gross CLA 2010-03-03 12:37:03 EST
Quote from the customer in PMR 80179,999,724:
1.)  In case the file size has the critical size RDz should bring up a warning saying that the result may be inaccurate due to the files size. Together with this the user should be asked if he wants to cancel or proceed with the compare.  
I think this is implementable quickly, isn't it ?

2.) As a final solution later on it needs to be possible to compare such large files with accurate results. As I understood from the bugzilla record the most critical point is the time and memory needed for this. Therefore the compare of large files could be run in an long running separate (asynchronous) thread so that  RDz is not blocked and the long running thread can probably be canceled.  (either by the user or in case of memory problem etc.)
Comment 47 Dani Megert CLA 2010-03-03 12:48:19 EST
> Together with
>this the user should be asked if he wants to cancel or proceed with the
>compare.  
To offer Cancel makes no sense since the result might be accurate even with the capping and given the user indeed wanted to compare the files. The result will at least give a coarse grained information about the changes - better than having no comparison at all (or why was compare started in the first place if zero information is desired ;-). The workflow would be to simply show the warning in the compare dialog.

>2.) As a final solution later on it needs to be possible to compare such large
>files with accurate results
Agreed, that this should be the final solution, but most likely not something we'd want to backport to an older release.
Comment 48 Szymon Brandys CLA 2010-03-04 04:10:45 EST
As stated in comment 45, we may provide a label or a message in Compare Dialog for 3.4.2+. 
Cancel and other improvements may be applied to the current stream only and as Dani said, this work will not be backported.
Comment 49 Pawel Pogorzelski CLA 2010-03-25 11:26:39 EDT
Created attachment 163000 [details]
Snap

Please comment on the wording used. I'll attach a patch witch code shortly.
Comment 50 David Gross CLA 2010-03-25 13:43:27 EDT
(In reply to comment #49)
> Created an attachment (id=163000) [details]
> Snap
> Please comment on the wording used. I'll attach a patch witch code shortly.

The wording is fine.  I don't know if the customer will like the effect, 
but the wording is fine.
Comment 51 Pawel Pogorzelski CLA 2010-03-27 05:52:28 EDT
Created attachment 163140 [details]
Patch_v01
Comment 52 Pawel Pogorzelski CLA 2010-03-27 05:53:33 EDT
Created attachment 163141 [details]
Warning image
Comment 53 Pawel Pogorzelski CLA 2010-03-27 06:12:04 EDT
The code is in HEAD. There will be no further changes in this area in 3.6.

Szymon, I've open bug 307280 to allow the user to disable the capping (change the message to a link/button). Should we keep this one open?
Comment 54 Pawel Pogorzelski CLA 2010-03-29 07:06:38 EDT
After discussion with Szymon I decided to close the bug. Changes introduced on this bug will be backported to 3.4.2+ (bug 304332) and 3.5.2+ (bug 306522) streams. Adding a way to disable capping is tracked by bug 307280.
Comment 55 Markus Keller CLA 2010-04-12 07:06:30 EDT
Bug in DocumentMerger#isCapped(..): The test fails if the multiplication causes an integer overflow, see this example:

public class Try {
    public static final double TOO_LONG = 10000000.0;
    
    public static void main(String[] args) {
        int a= 50000;
        int b= 50000;
        System.out.println(a * b);
        System.out.println(a * b > TOO_LONG);
        System.out.println((double) a * (double) b);
        System.out.println((double) a * (double) b > TOO_LONG);
    }
}

Please always get a review from another committer before you release a patch to maintenance branches.
Comment 56 Markus Keller CLA 2010-04-12 07:13:00 EDT
(In reply to comment #50)
> The wording is fine.

The wording is not fine. CompareMessages.properties currently contains:

CompareContentViewerSwitchingPane_optimized=Optimized algorithm used
CompareContentViewerSwitchingPane_optimizedTooltip=To avoid long computation time an optimized comparison algorithm has been used. As a result a non-optimal matching could be found.

- "[..] optimized comparison algorithm [..] non-optimal matching" contains a logical error.
- There should be a comma after the introductory phrase "To avoid long computation time" and after "As a result"
- The tense of the last sentence is wrong.

Proposed wordings:

CompareContentViewerSwitchingPane_optimized=Faster algorithm used

For CompareContentViewerSwitchingPane_optimizedTooltip, e.g. (I prefer A):

A) A faster comparison algorithm has been used to avoid long computation time. As a result, a non-optimal matching could have been found.

B) To avoid long computation time, a faster comparison algorithm has been used. As a result, a non-optimal matching could have been found.

C) To avoid long computation time, a faster comparison algorithm has been used. This could have resulted in a non-optimal matching.
Comment 57 Pawel Pogorzelski CLA 2010-04-12 08:20:44 EDT
(In reply to comment #55)

Good catch Markus, thank you.
Comment 58 Pawel Pogorzelski CLA 2010-04-12 08:33:28 EDT
(In reply to comment #56)
> - "[..] optimized comparison algorithm [..] non-optimal matching" contains a
> logical error.

Agree.

> - There should be a comma after the introductory phrase "To avoid long
> computation time" and after "As a result"

I agree with putting a coma after "As a result", the first sentence sounds better without it.

> - The tense of the last sentence is wrong.

Your proposition ("could have been found") means that there was a probability to find a non-optimal result but in the end it didn't happen.

After a chat with one of our information developers seems like correct messages would are: 

1) "To avoid long computation time a faster comparison algorithm has been used. As a result, a non-optimal matching could be found.".

2) "To avoid long computation time a faster comparison algorithm has been used. As a result, the matching might not be optimal.".
Comment 59 Szymon Brandys CLA 2010-04-12 08:48:55 EDT
(In reply to comment #58)
> 1) "To avoid long computation time a faster comparison algorithm has been used.
> As a result, a non-optimal matching could be found.".
> 
> 2) "To avoid long computation time a faster comparison algorithm has been used.
> As a result, the matching might not be optimal.".

I like 2.
Comment 60 Markus Keller CLA 2010-04-12 10:05:35 EDT
(In reply to comment #58)
> 2) "To avoid long computation time a faster comparison algorithm has been
> used. As a result, the matching might not be optimal.".

That's better, though it's still hard to read (at least on WinXP) in the short time the OS gives the reader until the tooltip goes away again. 

The label will already say "Faster algorithm used". How about putting the additional information to the beginning:

"The matching might not be optimal. A faster comparison algorithm has been used to avoid long computation time."
Comment 61 Pawel Pogorzelski CLA 2010-04-12 10:14:34 EDT
Created attachment 164560 [details]
Patch_v02

(In reply to comment #60)
> "The matching might not be optimal. A faster comparison algorithm has been used
> to avoid long computation time."

I prefer to reflect cause and result relation in the ordering. The message will be: "To avoid long computation time a faster comparison algorithm has been used. As a result, the matching might not be optimal.".
Comment 62 Markus Keller CLA 2010-04-12 10:35:04 EDT
(In reply to comment #61)
OK, but please also fix:
CompareContentViewerSwitchingPane_optimized=Optimized algorithm used

It's strange to show a warning icon and then only talk about "Optimized".

"Faster algorithm used" would better tell what's going on.
Or even spell out the result: "Matching might not be optimal"
Comment 63 Pawel Pogorzelski CLA 2010-04-12 10:49:49 EDT
Created attachment 164569 [details]
Patch_v03

David, do you find following wording appropriate?

Label: Matching might not be optimal
Tooltip: To avoid long computation time a faster comparison algorithm has been used. As a result, the matching might not be optimal.
Comment 64 David Gross CLA 2010-04-12 12:09:34 EDT
(In reply to comment #63)
> Created an attachment (id=164569) [details]
> Patch_v03
> 
> David, do you find following wording appropriate?
> 
> Label: Matching might not be optimal
> Tooltip: To avoid long computation time a faster comparison algorithm has been
> used. As a result, the matching might not be optimal.

The message is fine but fear the customer may not understand the word OPTIMAL.  
We all know that OPTIMAL really means WRONG.  It's probably too late to
change the wording.  The customer's example files show that the result is
wrong, rather than not optimal.  The customer may say this fix is unacceptable,
then all of the work that you did will be wasted. But maybe I'm reading too
much into this and the customer will be placated after all.
Comment 65 Pawel Pogorzelski CLA 2010-04-12 14:50:04 EDT
(In reply to comment #64)
> The customer's example files show that the result is wrong, rather than not optimal.

The algorithm manages to find all the changes. There exists optimal matching
which results in finer grained differences.

There has been a comprehensive discussion on this bug and given the technical
possibilities this is the only solution we can offer for backporting. Solution
for future releases is tracked by bug 307280.

Patch_v02 and Patch_v03 are in HEAD. Marking the bug as FIXED.
Comment 66 David Gross CLA 2010-04-13 14:40:23 EDT
(In reply to comment #65)
> (In reply to comment #64)
> > The customer's example files show that the result is wrong, rather than not optimal.
> 
> The algorithm manages to find all the changes. There exists optimal matching
> which results in finer grained differences.
> 
> There has been a comprehensive discussion on this bug and given the technical
> possibilities this is the only solution we can offer for backporting. Solution
> for future releases is tracked by bug 307280.
> 
> Patch_v02 and Patch_v03 are in HEAD. Marking the bug as FIXED.

For comment 64, the customer confirmed that a warning message is good
enough for now.  A full fix would be nice, he said, but you could do that later.
From PMR 80179,999,724 page 33  April 13, 2010.
Comment 67 David Gross CLA 2010-04-14 18:18:12 EDT
(In reply to comment #56)
> (In reply to comment #50)

> CompareContentViewerSwitchingPane_optimizedTooltip=To avoid long computation
> time an optimized comparison algorithm has been used. As a result a non-optimal
> matching could be found.
The customer now said to David Gross via email on April 14:
From my understanding a tool tip pop up happens  only when you hover a link or icon or something like that. This is strange a little because the system should proactively give the info about the result to the user.
Comment 68 Susan McCourt CLA 2010-04-14 21:16:16 EDT
(In reply to comment #43)
> Computing the diffs for the viewport is non-trivial, especially when the text
> already changed (e.g. in a Java editor). I suggest we first go ahead with the
> simple solution. If we see that the message appears too intrusive we can
> discuss more complex approaches or maybe tweak the cap a little bit.

I've finally encountered this warning while doing real work (comparing changes made to the Workbench class).

I'm running 20100413-1143.
I find the warning quite obtrusive and confusing.
I honestly think that not doing anything is a better solution than what we have, though obviously the customer gets the final say.

- The I-build has the older label "Optimized algorithm used" and I agree that this was a poor label.  In some ways it sounds like a good thing (oh great, things are optimized!) yet there is a warning icon.
- the newer message "Matching might not be optimal" is more negative (which I suppose is good), but it provides no information.  What in the world does this mean?  (It means we don't know what is going on!)
Could we at least be very specific, such as "False differences are possible"
- I suggest the tooltip say
"To avoid long computation time a faster comparison algorithm has been used. As a result, differences might be shown in the viewer where there are none."

I know given the constraints of backporting there is little we can do but I feel strongly that we have made things *much worse* for the average user and not much better for the case where the error actually occurs.   I don't think that showing the warning whenever we cap is the right answer.  Can we guess when the error is more likely to occur?  I'm not against hacking in a specific check that we know fixes the customer's case!  Because I think doing nothing is the right thing in 99% of the cases.
Comment 69 Dani Megert CLA 2010-04-15 02:53:41 EDT
>I'm running 20100413-1143.
Susan, please see previous comments from Markus and Pawel. The wording got improved but didn't make it into 20100413-1143. Please use I20100414-1200 or newer.
Comment 70 Pawel Pogorzelski CLA 2010-04-15 04:13:40 EDT
(In reply to comment #68)
> I find the warning quite obtrusive and confusing.

Dani's right. See comment 63 for the wording currently used.

> Can we guess when the error is more likely to occur?

We certainly can't be sure about it.
Comment 71 Markus Keller CLA 2010-04-15 05:29:13 EDT
Susan talks about the new messages, so she has definitely seen them.

I also ran into this (with the latest messages) and found the warning icon too alarming. Could we replace it with the info icon?

If we change the label to "Faster comparison algorithm used", the user is informed about what's going on but is not immediately alarmed.

(In reply to comment #68)
> - I suggest the tooltip say
> "To avoid long computation time a faster comparison algorithm has been used. As
> a result, differences might be shown in the viewer where there are none."

I like that.
Comment 72 Pawel Pogorzelski CLA 2010-04-15 06:03:15 EDT
(In reply to comment #64)
> The customer's example files show that the result is wrong, rather than not optimal.

These are reporter's feelings about the result. That's why a warning icon was used. I don't mind changing it if all the people involved agree though.
Comment 73 Susan McCourt CLA 2010-04-15 12:17:26 EDT
(In reply to comment #71)
> Susan talks about the new messages, so she has definitely seen them.
Yes, I checked the patches before commenting in this bug and my understanding was that only the wording had changed since the first I-build.

> I also ran into this (with the latest messages) and found the warning icon too
> alarming. Could we replace it with the info icon?

I agree with this.  The warning jumped out at me right away, and in my case, there was nothing wrong with the differences.  So my eye was immediately pulled to something that was described rather mysteriously.

> If we change the label to "Faster comparison algorithm used", the user is
> informed about what's going on but is not immediately alarmed.
Agree.  It still seems kind of weird, but seems the best of all the possibilities.

We can see how this plays.

I still think that increasing the threshhold at which we show the message (ie, not when we use the capped algorithm, but when we think it's more likely to have failed) could help, too, but as I've said before, I don't live in the code and so I don't know if this is a big deal.  What I'm trying to point out is that less precision and a little guessing in the code might really benefit the user.
Comment 74 Markus Keller CLA 2010-05-11 12:59:29 EDT
Even after using this for a few weeks now, I still feel quite alarmed whenever the warning icon and message show up (and I never had wrong results, probably since changes that I review in big files are usually not that big).

There was no opposition and even another +1 for changing the icon to "info" and the message to "Faster comparison algorithm used". Could you do that for RC1?
Comment 75 Pawel Pogorzelski CLA 2010-05-12 10:59:57 EDT
Created attachment 168160 [details]
Snap
Comment 76 Pawel Pogorzelski CLA 2010-05-12 11:05:33 EDT
(In reply to comment #74)
> Could you do that for RC1?

It's been forked to bug 312641 which is targeted for RC1.