214085 – Create Patch produces bad patch if encodings don't match

Bug 214085 - Create Patch produces bad patch if encodings don't match

Summary: Create Patch produces bad patch if encodings don't match

Status:	CLOSED WONTFIX

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	CVS (show other bugs)
Version:	3.3.1
Hardware:	PC Windows XP

Importance:	P3 major with 8 votes (vote)
Target Milestone:	---
Assignee:	platform-cvs-inbox
QA Contact:

URL:
Whiteboard:	stalebug
Keywords:

Duplicates (3):	215360 295036 333339 (view as bug list)
Depends on:
Blocks:

Reported:	2008-01-02 05:19 EST by Daniel Schneller
Modified:	2021-03-12 06:18 EST (History)
CC List:	16 users (show)

See Also:

Attachments
Screenshot of Apply Patch dialog showing conflict (40.52 KB, image/png) 2008-01-02 05:19 EST, Daniel Schneller	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Daniel Schneller

2008-01-02 05:19:01 EST

Created attachment 85981 [details]
Screenshot of Apply Patch dialog showing conflict

Build ID: M20071023-1652

Steps To Reproduce:
1. Create a new class called Test with the following contents:
(the comment simply says "I am a German comment. I may even contain umlauts, e. g. 'ä'")
---------
public class Test {

    /* Ich bin ein deutscher Kommentar.
     * Ich darf sogar Umlaute enthalten, wie zum
     * Beispiel "ä".
     */

    private int dummy = 0;

}
---------

2. Commit to CVS
3. Change the value of the dummy variable to 1.
4. Select Team->Create Patch and leave the defaults (clipboard)
5. Restore from CVS
6. Select Team->Apply Patch from clipboard

Eclipse will report an unmatching patch, because the German umlaut "ä" was not correctly put into the diff output, but replaced with a question mark.

This is of course not limited to block comments. 

I have attached a screenshot of the conflict view.

Comment 1 Tomasz Zarna

2008-01-02 09:20:54 EST

Eclipse's Create Patch operation is based on the diff command called on your CVS repository. I would check (encoding) configuration of the repository first if I were you. Patches with both German Umlauts and Polish tails work fine for me. Let me know if you need some additional info.

Comment 2 Daniel Schneller

2008-01-02 11:07:11 EST

I would indeed like some hints on where to look on the CVS server. What would I have to check/modify? I did not find anything helpful in the CVS manual.

Apart from that I tried to get a better understanding and had a look at the CVS command Eclipse issues upon "Create Patch" and used the same arguments for a command line cvs call (cvs.exe from TortoiseCVS). 

Taking the output from that (redirected to a file) results in a working patch.

Running "cvs diff -N -u -l" (taken from the Eclipse CVS console) on the Linux machine and inspecting the output in hex shows E4 for the German umlaut which is the correct code for codepage 1252 (Windows default) and is also what the file properties dialog in Eclipse shows for "Test.java".

This output is identical to the cvs.exe output apart from the line terminators. After running unix2dos on the Linux side to correct that, both diff files result in the same MD5 hash.

What encoding does Eclipse use when it tries to apply the patch?

Comment 3 Tomasz Zarna

2008-01-04 11:24:05 EST

 (In reply to comment #2)
> What encoding does Eclipse use when it tries to apply the patch?

I belive that the Compare Editor uses an encoding specified in the workspace for the current local resource, but I'm not sure how it determines the encoding of a remote resource. I would need to take a look in the code to check that. Anyway, this case realized me that addressing bug 206586 could save us some effort here.

Comment 4 Stephan Herrmann

2008-10-08 16:35:08 EDT

I have Eclipse configured to use UTF-8 by default. I have checked out
org.eclipse.jdt.core.tests.model from eclipse cvs.

Then I made some modifications to file org.eclipse.jdt.core.tests.model.JavaSearchTests which contains some
non-ascii characters (several occurrences of '§', to be precise).

In this situation I found it very difficult to create a correct
workspace patch containing the differences of this file.

It seems that the cvs checkout operation converts the file to UTF-8
(with no option to select the char encoding).

At the first attempt this particular file was simply dropped (!!) from
the patch while other changes in my workspace were correctly recorded.

Next I changed the project preferences to use ISO-8859-1.
Now the special characters were of course no longer correctly displayed.
At this point I could create a patch with the fake differences discussed
in this bug.

Then I used the compare editor to *individually* fetch a fresh copy of each
chunk containing funny chars into my workspace file, which changed these
characters into ISO-8859-1, so in the editor they would look OK again.
[IMHO This operation is not feasible if you have a large workspace with
many local changes mixed with many differences caused by the encoding mess]

Only then I could create the proper patch file.

The special pain in this scenario stems from the inconsistency between
automatically translating to the local default encoding upon checkout
(which I generally like) and not being able to proceed with this encoding
during "create patch" (which I consider a critical error actually).

Two suggestions:
1. "create patch" should be investigated as it seems to silently die
when encountering two file versions (remote/local) with incompatible
char encoding.
2. the checkout wizard could offer an option to select the char encoding
instead of forcing a recoding to the local default.

Of course it would be great if the user would be informed when checking
out a project with an encoding that differs from the workspace default.

Comment 5 Stephan Herrmann

2009-01-20 11:16:28 EST

Could this issue please be considered for 3.5?

The current status simply means that a machine with default UTF-8
encoding cannot create patches for a CVS repository with ISO encoding.

This is a critical bug.

Comment 6 Stephan Herrmann

2009-08-27 15:37:41 EDT

OK, how does this relate to the fixed bug 223857?

Also, what is the suggested way of working with different encodings
locally/remotely?

On a UTF-8 machine, recent checkouts from eclipse cvs look garbled
(non-ascii characters replaced by inverted "?") 
but compare seems to work, hm?

Comment 7 Stephan Herrmann

2009-10-06 08:47:14 EDT

It hit me once again just a minute ago using the latest integration build
(I20090929-0800):

1. I have an eclipse project checked out for preparing a patch.
2. I do an update on the project -> OK
3. I do a "Synchronize with Repository": I see exactly my changes, nothing more
4. I create a patch:
   It contains garbage because a non-ascii char has changed (to itself)

To reproduce:
1. checkout project org.eclipse.jdt.core.tests.model
   into a workspace with default encoding set to UTF-8.
2. Edit class JavaSearchTest.
3. Create a patch and observer fake changes regarding each occurrence of '§'

Using a binary editor I see that the patch file indeed contains 
null-changes: the extra before and after lines are identical!!
Note that the editor cannot correctly display the special char, because
the checkout operation does not perform any conversion.
So we do have unprintable characters on both sides, which seems to
be fine for compare but not for create patch.

Can someone please at least point out where in the code this is handled,
so we have a chance to get this embarrassing bug off the table?

On a different note: why doesn't the checkout wizard allow to select
the encoding used in the checked out project?

Comment 8 Stephan Herrmann

2009-10-06 09:19:55 EDT

FWIW, the '§' char is consistently converted to 0xef 0xbf 0xbd,
which is a legal UTF-8 sequence but not mapped to any symbol, it seems.
This sequence is not printable in any encoding that I know of.

Comment 9 Stephan Herrmann

2009-11-24 08:59:44 EST

Please see the attachment from bug 295894 comment 12 for a patch
containing the garbage described earlier.

Both the Eclipse install (3.6 M3) and the workspace were fresh.
Not a single preference was changed.

The linux environment contains:
  LANG=en_US.UTF-8

Comment 10 Dani Megert

2009-11-24 09:36:35 EST

Make sure not to mix things:
- if the checked out project (or resource in general) uses special characters
  without specifying the encoding then it's pure luck whether it works or not and
  starting to edit and save such a file will corrupt it unless you're lucky and
  the special character cannot be mapped with the invalid encoding. Of course
  afterwards making a patch on top of the corrupted file results in a corrupted
  patch. This can only be fixed by the project that uses special characters: it
  has to set the project specific encoding accordingly and store it in CVS.
- creating the patch: here the mistake can be made in the code that the wrong
  encoding is used: the resulting patch will normally have the default OS encoding
  and hence this needs to be taken into account
- applying the patch: here the same thing: the patch read from file or clipboard
  needs to be read with OS default encoding

Comment 11 Daniel Schneller

2009-11-24 11:28:48 EST

Just to make sure I did not miss anything I tried my original testcase again with 3.5.1 and it still fails.

Files are stored in Windows default Cp1252 locally, the CVS repository location is configured for UTF-8. From what I understand this should only affect filename and commit comments however.

Creating a patch as described results in the same unusable patch - even when just put into to local keyboard, no files involved.

I did observe however, that when the CVS repository encoding and the file encoding in the workspace - two completely separate things IMHO - match, the patch can be created correctly.

Apparently there seems to be something wrong there. The only option - and I would be reluctant to do so, because I would have to check in the majority of our several 100.000 source files - would be to convert them all to UTF-8 locally. I believe there should be a better, more user-friendly solution.

Comment 12 Dani Megert

2009-11-30 11:57:36 EST

I quickly looked at the code (org.eclipse.team.internal.ccvs.core.client.Diff and org.eclipse.team.internal.ccvs.core.client.listeners.DiffListener) that reads the patch from the server and then writes it to disk or the clipboard: it has two fundamental design issues:

1) The patch which is created on the server is wrongly converted into a string on the client using the encoding specified on the repository properties page. This is doomed to fail because the patch should be kept as bytes until it is compared/applied to the local file (at which point we can use each file's encoding). A patch can patch several files each of them having a different encoding hence converting the bytes into a string using a single encoding can corrupt parts of the patch.

2) When writing the patch out to a file or the clipboard using the platform default encoding more data can be lost. The patch should be written to a (binary) file as is i.e. as byte stream received from the server. The clipboard scenario is harder to fix (would need consistent encoding of the bytes).


This also explains why binary patches (see bug 257263) can't work at this point.

Comment 13 Pawel Pogorzelski

2009-12-08 08:36:42 EST

*** Bug 295036 has been marked as a duplicate of this bug. ***

Comment 14 Pawel Pogorzelski

2009-12-09 04:51:59 EST

(In reply to comment #12)
> 2) When writing the patch out to a file or the clipboard using the platform
> default encoding more data can be lost. The patch should be written to a
> (binary) file as is i.e. as byte stream received from the server.

That's one way of addressing the problem. It should be pretty straightforward to implement. But what if encoding is changed after the patch has been generated? We would misinterpret stored bytes. And in case both UTF-8 and UTF-16 encodings are used (not sure if it's possible) mixing them in one patch file would be a mess.

Wouldn't it be better to enforce one encoding per patch file? This would allow to handle them as text files and that's basically the whole point of creating a patch. I'm thinking of a prompt saying something like: "The patch contains characters that can't be stored using default encoding. Please select different encoding, otherwise a data loss will occur". This could be hard to implement though since all the diffs have to be processed to determine if the default encoding causes data loss.

Comment 15 Dani Megert

2009-12-09 05:13:42 EST

>But what if encoding is changed after the patch has been generated? 
Bad luck.

>Wouldn't it be better to enforce one encoding per patch file?
Definitely no go. The SDK has projects with different encodings. This would prevent me from making a single patch covering changes in such projects. And of course it would be pretty hard to implement since the patch is generated on the server: you would have to do conversions per file on the client which seems wrong to me. Also, it would definitely break binary patching.

>This would allow
>to handle them as text files and that's basically the whole point of creating a
>patch. 
No. There are also requests to support binary stuff in a patch.

Comment 16 Daniel Schneller

2009-12-09 05:19:53 EST

(In reply to comment #15)
> >But what if encoding is changed after the patch has been generated? 
> Bad luck.
> [...]
> >This would allow
> >to handle them as text files and that's basically the whole point of creating a
> >patch. 
> No. There are also requests to support binary stuff in a patch.

Full Ack.

Comment 17 Pawel Pogorzelski

2009-12-09 06:09:04 EST

(In reply to comment #15)
> >Wouldn't it be better to enforce one encoding per patch file?
> Definitely no go. The SDK has projects with different encodings. This would
> prevent me from making a single patch covering changes in such projects.

It would simply work for majority of cases. However for a fraction of patches you would have to specify more comprehensive encoding like UTF-8 for the patch to cover all the characters. I don't see how enforcing single encoding prevents from making one patch in this case.

Supporting binaries in patches is a different thing than the encoding we use for storing text differences. I don't see how they are related.

Anyway, I'm just exploring different approaches. I like the approach of storing diffs as returned from the server. Just want to make sure we discuss other options.

Comment 18 Pawel Pogorzelski

2010-03-05 08:17:21 EST

*** Bug 269247 has been marked as a duplicate of this bug. ***

Comment 19 Pawel Pogorzelski

2010-03-05 08:21:46 EST

Bumping severity based on duplicates.

Comment 20 Dani Megert

2010-05-25 06:33:28 EDT

See bug for 309430 the Apply Patch issue.

Comment 21 Dani Megert

2010-05-25 06:33:51 EDT

>See bug for 309430 the Apply Patch issue.
Should read: "See bug 309430 for the Apply Patch issue.

Comment 22 Tomasz Zarna

2011-01-04 05:16:25 EST

*** Bug 333339 has been marked as a duplicate of this bug. ***

Comment 23 Axel Mueller

2012-01-19 11:00:25 EST

Bug 215360 looks like a duplicate.

Comment 24 Ortwin Glück

2012-01-19 11:16:10 EST

Look at org.eclipse.team.internal.ccvs.core.client.listeners.DiffListener. It treats the diff response as strings. See bug 215360 for why that's wrong.

Comment 25 Dani Megert

2012-01-20 04:18:49 EST

*** Bug 215360 has been marked as a duplicate of this bug. ***

Comment 26 Eclipse Genie

2019-03-14 14:22:21 EDT

This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

If you have further information on the current state of the bug, please add it. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

--
The automated Eclipse Genie.

Comment 27 Stephan Herrmann

2019-03-14 14:48:55 EDT

For me this may no longer be relevant due to bug 438875, i.e., by avoiding any encodings other than UTF-8. Not a solution to this bug, though.

Comment 28 Dani Megert

2019-03-22 11:37:30 EDT

This is probably still an issue, but CVS support is no longer under active development.

Comment 29 Eclipse Genie

2021-03-12 06:18:59 EST

This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're closing this bug.

If you have further information on the current state of the bug, please add it and reopen this bug. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

--
The automated Eclipse Genie.