199846 – [Patch] Misuse of the Fuzz Factor

Bug 199846 - [Patch] Misuse of the Fuzz Factor

Summary: [Patch] Misuse of the Fuzz Factor

Status:	VERIFIED FIXED

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	Compare (show other bugs)
Version:	3.3
Hardware:	PC Windows XP

Importance:	P2 normal (vote)
Target Milestone:	3.4 M3
Assignee:	Tomasz Zarna
QA Contact:

URL:	http://polishineclipse.blogspot.com/2...
Whiteboard:	hasPatch
Keywords:

Depends on:
Blocks:	205789 205761 206062
	Show dependency tree

Reported:	2007-08-14 06:41 EDT by Tomasz Zarna
Modified:	2007-12-10 04:39 EST (History)
CC List:	6 users (show)

See Also:

Flags:	Szymon.Brandys: review+

Attachments
Testcase for the bug (39.13 KB, patch) 2007-10-01 10:29 EDT, Tomasz Zarna	no flags	Details \| Diff
Patch #1 (16.37 KB, patch) 2007-10-01 10:47 EDT, Tomasz Zarna	no flags	Details \| Diff
Patch #2 (17.86 KB, patch) 2007-10-12 07:10 EDT, Tomasz Zarna	no flags	Details \| Diff
More tests for the bug (48.05 KB, patch) 2007-10-12 07:10 EDT, Tomasz Zarna	no flags	Details \| Diff
Show Obsolete (2) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Tomasz Zarna

2007-08-14 06:41:08 EDT

Build ID:  I20070809-1105

While looking for the cause of bug 196847, I got suspicious about a usage of the so called fuzz factor. I thought that I knew what it means and what is it for even though I hadn't use it before. However, after taking a closer look at the code I got confused. 

For example, please, correct me if I'm wrong, in HunkResult#calculateFuzz[1] method we are calculating the most appropriate shift value which has nothing to do with the fuzz factor. Looking further, in Hunk#tryPatch[2], when we came across a context line (' ' char) we should use the fuzz factor. I can see no sign of such an action there. I could give you more example, but that's not my point. 

I think we should take a second look at the patcher algorithm: name some methods more accurately, add some javadocs or maybe even fix some things. But first of all, we should add plenty of unit tests just to make sure after "refactoring" is done, everything still works as expected :)

[1] Compare project: org.eclipse.compare.internal.patch.HunkResult
[2] Compare project: org.eclipse.compare.internal.patch.Hunk

Comment 1 Szymon Brandys

2007-08-14 06:48:30 EDT

Right. While I was looking at the code, I became confused as well.
I agree that we have to take a good look at it.

Comment 2 Tomasz Zarna

2007-08-16 04:08:14 EDT

Also, it would be great if the patching could implement the algorithm described in bug 196847, comment 6.

Comment 3 Stefan Xenos

2007-09-10 10:41:37 EDT

Note: this bug is important for our RCP app.

Comment 4 Tomasz Zarna

2007-10-01 10:29:42 EDT

Created attachment 79481 [details]
Testcase for the bug

This is a testcase showing that the fuzz factor is not working as expected, actually is not working at all. I've modified slightly the PatchTest class, so now one can use the fuzz factor as a parameter. But this is not what the patch is mainly about: the major part of it are the additional tests located in "patchdata" subfolder. They assume that the fuzz factor works and that the patches from the subfolders can be applied. Unfortunately, most of the will fail as the fuzz factor doesn't work. To make the tests pass we will need to apply the patch from the next comment.

The idea behind the "testPatchdataSubfolder" method from the PatchTest class is that now one can easily add a new testcase by creating 3 files: a context, a patch and an expected result. Therefore I encourage all of you to try write a failing testcase. I'm sure that there are still some cases which haven't been covered by the tests I provided.

Comment 5 Tomasz Zarna

2007-10-01 10:47:26 EDT

Created attachment 79484 [details]
Patch #1

This is a working version of the patch for the fuzz factor issue. As stated in the previous comment, I'm aware that there can be still some cases which when discovered will need addressing in a next version of the patch. Nevertheless, I decided to show you the patch so anyone interested can comment on it. I'd be happy to hear your opinion.

There are some limitations in the fuzz factor mechanism I decided to introduce:

1) The default maximum fuzz factor is 2 (which is a consequence of the fact that there are usually 3 lines of context "around" a hunk). So one one choose to calculate the fuzz factor[1] there will be up to 3 iterations when looking for the fuzz (i.e. 0, 1, 2).

2) At this moment I also decided to not match context lines inside inside doPatch[2] method as they are already checked in the tryPatch[3] method in the same class.

[1] org.eclipse.compare.internal.patch.HunkResult#calculateFuzz(List)
[2] org.eclipse.compare.internal.patch.Hunk#doPatch(PatchConfiguration, List, int)
[3] org.eclipse.compare.internal.patch.Hunk#tryPatch(PatchConfiguration, List, int, int) 

Waiting for your reply...

Comment 6 Michael Valenta

2007-10-01 12:54:55 EDT

I've looked at the patch and I'll do my best to provide comments. Given the complexity of the algorithm, I think the most important thing is to have a good set of tests to ensure the proper behavior. Or, in other words, I have a hard time verifying the code via inspection).

Here are my comments:

1) In the HunkResult#calculateFuzz, it seems to scan the whole file looking for possible matches. To me, this implies that it would be possible to get hunks matching out of order. Is this the case?

2) The maximum fuzz seems to be hardcoded at 2 with the assumption that there are 3 lines of context. What if there are less content lines? What if there are more?

3) In Hunk#tryPatch I must admit that I can't really figure out what you are doing there (e.g I don't get the 2+2, 3+2 and 2+3 splits you are doing). The best approach here may be to sit down with Szymon and go over the algorithm with him (i.e. I don't have that much knowledge about the algorithm anyway and going over it with someone there would probably be faster then doing it though a bug report or instant messaging).

4) In regards to ignoring the context lines during the patch application, how is the case handle were there is an addition following the context lines? Wouldn't you need to know how many context lines to skip in order to place the addition properly? Or am I missing something?

Comment 7 Tomasz Zarna

2007-10-05 11:31:31 EDT

 (In reply to comment #6)
> 1) In the HunkResult#calculateFuzz, it seems to scan the whole file looking for
> possible matches. To me, this implies that it would be possible to get hunks
> matching out of order. Is this the case?

Yes, it's possible in some cases. Do you think we should prevent the patcher from allowing such cases? I do agree that it can be confusing and I don't mind if we won't allow it.
 
> 2) The maximum fuzz seems to be hardcoded at 2 with the assumption that there
> are 3 lines of context. What if there are less content lines? What if there are
> more?
 
I was afraid you will ask that question. I know that hardcoding the max fuzz is not enough. I would change current 2 to lines.size() which is number of lines in a file. That would work, wouldn't it? But I've a different question: what should happen when we have fuzz factor greater or equal to number of lines of a hunk? Should we ignore all of the context lines or should we match at least one? What if we're in the middle of "calculating fuzz" and the fuzz goes up? I would allow ignoring all context lines, but that's just my opinion. I know that default values are 3 for context lines and 2 for maximum fuzz factor, but we could inform a user that he/she exceed the maximum value.

> 3) In Hunk#tryPatch I must admit that I can't really figure out what you are
> doing there (e.g I don't get the 2+2, 3+2 and 2+3 splits you are doing). [...]

Well, I will throw it away. My idea was to apply the fuzz factor to any context lines, not only to the ones at the end or at the beginning of a hunk. But I had a discussion with Szymon about that and we decided to keep the context lines in the middle as they are - they must match.

> 4) In regards to ignoring the context lines during the patch application, how is
> the case handle were there is an addition following the context lines? Wouldn't
> you need to know how many context lines to skip in order to place the addition
> properly? Or am I missing something?

Yep, this is how it works. I do collect context lines before processing an addition/change/deletion. The same thing happens with context lines following a change. The only problem is that I need to recognize which context lines are preceding a hunk and which follows it. If I get that information right I can than try to match the hunk ignoring some of the context lines. I'm not sure if that answer your question (as I'm not sure if I understood your question).

Comment 8 Michael Valenta

2007-10-05 12:48:43 EDT

Regarding Hunk ordering, I can see that there might be some cases where changing the order of hunks would find matches that wouldn't otherwise be found. Ideally, I think we should try to match in order first and only try out-of-order matches if an in-order match couldn't be found.

Regarding the default maximum fuzz factor, I'm OK with using a hardcoded value to start with but lets use a constant for it so we can easily find references to it easily. As to whether we can ignore all content lines, I think that we need to be careful. It really depends on the nature of the change. For instance, if the patch is a straight addition, it may be dangerous to ignore even one content line, especially if one or more of the content lines are whitespace. However, if the before state of the patch has many lines in it, you could probably get away with ignoring all the context lines. It would be good to have a set of tests that cover some of these corner cases to ensure that the algorithm behaves as the user would expect.

Comment 9 Tomasz Zarna

2007-10-12 07:10:15 EDT

Created attachment 80217 [details]
Patch #2

Second version of the patch. I cleaned up the code and added back context lines validation in Hunk.doPatch method.

Comment 10 Tomasz Zarna

2007-10-12 07:10:47 EDT

Created attachment 80218 [details]
More tests for the bug

Comment 11 Szymon Brandys

2007-10-16 11:47:27 EDT

I raised a bug for an issue I found in tests. I will raise another one for all things that should be considered, but anyway the patch looks good. People should finally play with it :-)

Comment 12 Tomasz Zarna

2007-10-17 17:25:33 EDT

Both patches (the fuzz factor and tests) released to HEAD.

Comment 13 Szymon Brandys

2007-10-31 08:14:57 EDT

Verified in I20071029-1800 and by a code inspection.