215360 – [Patch] File encoding not respected when creating a patch

Status:	CLOSED DUPLICATE of bug 214085

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	CVS
Version:	3.4
Hardware:	PC Windows Vista

Importance:	P4 normal
Target Milestone:	---
Assignee:	platform-cvs-inbox
QA Contact:

URL:
Whiteboard:
Keywords:	helpwanted

Depends on:
Blocks:

Reported:	2008-01-15 11:06 EST by Pascal Filion
Modified:	2012-01-20 04:18 EST
CC List:	5 users

See Also:

Description Pascal Filion

2008-01-15 11:06:57 EST

I have Java files that use the UTF-8 file encoding, and some of the characters they contain are above \u007F, for instance é (\u00E9) or ┌ (\u250C).

When I create a patch, the file encoding is changed to something else, maybe Cp1252, and those characters are converted to â”Œâ” for ┌ (\u250C).
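
The garbling described above is what you get when UTF-8 bytes are decoded as Cp1252. The following is a minimal sketch of that effect only; it does not use any Eclipse code, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode the text as UTF-8, then decode the very same bytes as
    // windows-1252 (Cp1252), which is the suspected wrong encoding.
    static String utf8ReadAsCp1252(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+250C is E2 94 8C in UTF-8; each byte becomes one Cp1252 character.
        System.out.println(utf8ReadAsCp1252("\u250C"));
        // U+00E9 is C3 A9 in UTF-8.
        System.out.println(utf8ReadAsCp1252("\u00E9"));
    }
}
```

What the console actually displays depends on the console's own encoding and font, so the returned strings are more reliable to inspect than the printed output.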

Even the diff in the CVS console shows those weird characters, even after I changed the font to Lucida Console, which has those characters (note: Tahoma does not have ┌).

Comment 1 Tomasz Zarna

2008-01-17 04:49:21 EST

Pascal, can you confirm that the encoding on both sides is the same (i.e. that the files in the repository are also encoded with UTF-8)?

Comment 2 Pascal Filion

2008-01-18 15:59:01 EST

One of the problems was seen with the Eclipse CVS repository: I believe the files on the server were encoded with ISO-8859-1 but the local file was UTF-8.

One thing I saw is that if the patch is opened with Microsoft Word, it asks which file encoding to use; if I select UTF-8, I see the extended set of characters correctly.

The other test I did was on another CVS repository where both files were most probably UTF-8, and the e acute (é) was shown incorrectly in the CVS console.

Comment 3 Tomasz Zarna

2008-01-21 08:27:19 EST

First I tested it using Polish characters with tails (ogonki) and it worked fine -- the patch was composed correctly. Then I gave it a second try using French characters like Pascal did, but this time it didn't go so smoothly. In other words, I can confirm that having both files[1] encoded in UTF-8 results in a corrupted patch file generated by Eclipse, whereas the files look fine when diffed using the cvs command line.

Unfortunately, I'm afraid we don't have the manpower to address it at this time. Patches will be accepted and I can assist if you need my help.

[1] local and remote

Comment 4 Mauro Molinari

2009-03-05 06:46:20 EST

I may have a different scenario.

I have two machines, a Linux station (UserA) with the default o.s. text file encoding set to UTF-8, and a Windows station (UserB) with the default o.s. text file encoding set to cp1252.

Files in CVS are in ISO-8859-1.

UserA has configured his workspace to use ISO-8859-1 file encoding (Window | Preferences | General | Workspace | Text file encoding).
So, when he checks out Class1.java from the CVS, he sees accented letters correctly.

UserB has left his workspace to use cp1252 (which is a superset of ISO-8859-1). When he checks out Class1.java from the CVS, he sees accented letters correctly.

UserA edits Class1.java and creates a patch (patch1.txt) with Eclipse, saving it either in the workspace or outside. However, this file (patch1.txt) contains the accented letters corrupted. If he opens Class1.java and patch1.txt in Eclipse side-by-side, he sees accented letters ok in Class1.java, corrupted in patch1.txt.
Of course, if UserA sends patch1.txt to UserB, UserB has difficulty applying it, because segments do not match, and he also risks corrupting his own local copy of the file with the corrupted accented letters coming from the patch.

UserB can also try to open patch1.txt with an editor that lets you change the encoding in which the file is read: even if UserB sets the encoding to UTF-8, the accented letters are corrupted (you don't see garbage characters, but you see empty squares in their place). My suspicion is that when creating the patch for UserA, Eclipse (or the diff command?) assumed the source file Class1.java was in UTF-8 (the o.s. default), so it read it the wrong way and corrupted the accented letters; then it wrote patch1.txt in UTF-8 (again the o.s. default). So, if UserB opens patch1.txt as UTF-8, he sees empty squares instead of the accented letters (because of the corruption introduced when the source file was read during patch creation). However, since UserB usually works with ISO-8859-1 (the same encoding as the Java files to which he wants to apply the patch), he doesn't even see the empty squares, but garbage characters (because of the further corruption introduced when the patch file was written).
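
The suspected read-as-UTF-8, write-as-UTF-8 sequence can be reproduced outside Eclipse. The sketch below only models that hypothesis with plain JDK calls; it is not the Eclipse implementation, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class DoubleCorruptionDemo {
    // Hypothesised step 1: ISO-8859-1 source bytes are decoded as UTF-8.
    // Hypothesised step 2: the resulting string is written out as UTF-8.
    static byte[] patchBytesFor(String sourceText) {
        byte[] sourceBytes = sourceText.getBytes(StandardCharsets.ISO_8859_1);
        String misread = new String(sourceBytes, StandardCharsets.UTF_8);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] patch = patchBytesFor("perch\u00E9");

        // Opened as UTF-8: the accented letter is now U+FFFD, the
        // replacement character (often drawn as an empty square).
        String asUtf8 = new String(patch, StandardCharsets.UTF_8);
        System.out.println(asUtf8.endsWith("\uFFFD"));

        // Opened as ISO-8859-1: the three bytes of U+FFFD (EF BF BD)
        // show up as three unrelated characters.
        String asLatin1 = new String(patch, StandardCharsets.ISO_8859_1);
        System.out.println(asLatin1.endsWith("\u00EF\u00BF\u00BD"));
    }
}
```

Both views match the symptoms reported here: squares under UTF-8, garbage under ISO-8859-1, and in neither case can the original byte E9 be recovered from the patch.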

In any case, the patch can't be applied as it should.

So, I think that Eclipse, when creating a patch, should:
- read the source file respecting the resource encoding (set in the workspace properties or overridden in the project properties or overridden in the file properties)
- write the patch file using the same encoding used to read the source file OR let the user choose the encoding to use to write the patch file, assuming the default would be the same as the one used to read the source file
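
As a rough illustration of the two points above, the sketch below decodes a source line with a given charset and encodes the patch line with that same charset. The helper name and parameter are hypothetical; in Eclipse the charset would presumably come from the resource itself (for example via IFile.getCharset()), which is an assumption and is not exercised here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SameCharsetPatchDemo {
    // Decode the source line with the resource's charset, then encode the
    // "added" patch line with that same charset (hypothetical helper).
    static byte[] addedLine(byte[] sourceLine, Charset resourceCharset) {
        String text = new String(sourceLine, resourceCharset);
        return ("+" + text).getBytes(resourceCharset);
    }

    public static void main(String[] args) {
        byte[] source = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] line = addedLine(source, StandardCharsets.ISO_8859_1);
        // The accented letter is still the single byte E9 in the patch line.
        System.out.println(Integer.toHexString(line[line.length - 1] & 0xFF));
    }
}
```

This only stays lossless when the charset passed in really is the one the file was written with, which is why the proposal ties it to the resource encoding rather than the o.s. default.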

I hope this can help to address the problem.

I think it's a major problem because the interoperability of such an important feature is compromised for non-US people (those using languages with accented characters).

Comment 5 Mauro Molinari

2009-07-08 05:27:58 EDT

Any news on this?
I would like to stress the fact that in the scenario depicted in comment #4 the patch feature is unusable, so this bug is totally breaking the functionality.

Comment 6 Mauro Molinari

2010-03-16 05:01:44 EDT

I would like to stress the importance of this bug. I know you've worked in 3.6 to better handle patches. This bug makes the patch features of Eclipse almost useless in mixed encoding environments.

Could you please have a look at it?

Comment 7 Mauro Molinari

2010-11-12 04:43:26 EST

Is there any plan to address this problem?

Comment 8 Ortwin Glück

2011-04-15 04:47:06 EDT

When Eclipse saves the diff to a file, it uses the platform encoding, trying to interpret the diff content as characters. And that is so totally broken.

The Unix diff command as well as CVS diff are encoding agnostic. And it has to be like that! You can have 3 files in 3 different encodings, and create a diff for all 3 of them and store it in a single patch file. Which encoding does the patch file have? Unspecified! Each patch contained in the file contains the differences in the exact same encoding as the original files. Actually diff operates on raw bytes and not on characters. So you can even diff two files whose encoding doesn't match. Diff doesn't care and does the right thing on the byte level.
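
A minimal sketch of the byte-level idea, assuming a simplified patch that consists only of "added" lines: content from files in different encodings is copied into one patch without ever being decoded. The class and method names are hypothetical, and this is not how diff or Eclipse is actually implemented.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class ByteLevelPatchDemo {
    // Append one "added" line; the content bytes are copied untouched.
    static void appendAddedLine(ByteArrayOutputStream patch, byte[] rawLine) {
        patch.write('+');
        patch.write(rawLine, 0, rawLine.length);
        patch.write('\n');
    }

    static byte[] buildPatch(byte[]... rawLines) {
        ByteArrayOutputStream patch = new ByteArrayOutputStream();
        for (byte[] line : rawLines) {
            appendAddedLine(patch, line);
        }
        return patch.toByteArray();
    }

    public static void main(String[] args) {
        // The same word taken from two files with different encodings.
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1); // 4 bytes
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);        // 5 bytes
        byte[] patch = buildPatch(latin1, utf8);
        System.out.println(patch.length);
    }
}
```

The resulting patch has no single encoding, yet each hunk can be applied to its own file because the bytes are exactly the ones that file contains.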

Please make Eclipse diff compatible with Unix/CVS diff. Eclipse diff is totally unusable on Windows with UTF-8 XML files. And this bug has been open for 3 years!!!

Comment 9 Mauro Molinari

2011-10-17 08:18:35 EDT

Any news on this? Could this be targeted for 3.8?

Comment 10 Axel Mueller

2012-01-19 04:30:39 EST

I encounter the same problem as described in comment #4. I found a solution (at least for this scenario) which was inspired by bug 214085.
You should change the server encoding to ISO-8859-1. You can do this in the CVS Repository Exploring perspective. Select the repository and choose Properties from the context menu. Then set the server encoding to ISO-8859-1.

Comment 11 Ortwin Glück

2012-01-19 08:39:57 EST

(In reply to comment #10)
That deals with encoding of file names. However this bug report is about file *content*.

Comment 12 Axel Mueller

2012-01-19 08:50:52 EST

(In reply to comment #11)
> (In reply to comment #10)
> That deals with encoding of file names. However this bug report is about file
> *content*.
Yes, I know. I referred specifically to the scenario described in comment #4 about wrong encoding of the content. As far as I understand the encoding option for the CVS server, Eclipse uses this encoding for all messages that it gets from the server. If you have a look at the CVS console you can see that Eclipse sends a command to the server to get the diff from which it generates the patch. If you change the server encoding you will see different results for the content. Try it yourself. I got a usable patch when I changed the encoding for the server from UTF-8 to ISO-8859-1 (my file has ISO-8859-1 encoding and contains some German umlauts).

Comment 13 Ortwin Glück

2012-01-19 10:14:44 EST

(In reply to comment #12)
> If you change the server encoding you will see different results for the
> content.

OK, I see! And that's so totally wrong. As noted in comment 8 the diff must use bytes, not characters. Assuming any specific encoding of file content is always wrong for diffs. When a file is checked out of CVS its content encoding is not converted to a local encoding. Only line endings are (which in itself is debatable, but CVS works like this). So why should a diff depend on any encoding. It doesn't make sense at all.

Comment 14 Axel Mueller

2012-01-19 10:59:52 EST

(In reply to comment #13)
> (In reply to comment #12)
> > If you change the server encoding you will see different results for the
> > content.
> 
> OK, I see! And that's so totally wrong. As noted in comment 8 the diff must use
> bytes, not characters. Assuming any specific encoding of file content is always
> wrong for diffs. When a file is checked out of CVS its content encoding is not
> converted to a local encoding. Only line endings are (which in itself is
> debatable, but CVS works like this). So why should a diff depend on any
> encoding. It doesn't make sense at all.
Well, I never said that my solution is correct. It's just a workaround. You are completely right that the diff should operate on bytes (just imagine your patch consists of several files, each with its own encoding!).
I think they came to the same conclusion in bug 214085.

Comment 15 Ortwin Glück

2012-01-19 11:13:58 EST

Just look at org.eclipse.team.internal.ccvs.core.client.listeners.DiffListener. As I guessed, it treats the diff response as strings. These have been destroyed by applying the server encoding:

public IStatus messageLine(
 	String line,
 	ICVSRepositoryLocation location,
 	ICVSFolder commandRoot,
 	IProgressMonitor monitor) {
...
 	patchStream.println(line);
...

I say "destroyed" because that "conversion" is not reversible. The encoding of that diff is unknown (it depends on the encoding of the diffed files). Let's say you set the server encoding to UTF-8 and the file encoding is some Chinese one: you will for sure have lots of question mark characters in the line string.
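
The irreversibility is easy to show with plain JDK calls. In this sketch the two bytes D6 D0 (the GBK encoding of U+4E2D) stand in for Chinese file content and UTF-8 stands in for the configured server encoding; the class and method names are hypothetical and nothing from DiffListener is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LossyDecodeDemo {
    // Decode raw bytes with an assumed charset, then encode them again.
    static byte[] roundTrip(byte[] raw, Charset assumed) {
        return new String(raw, assumed).getBytes(assumed);
    }

    public static void main(String[] args) {
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0}; // GBK bytes for U+4E2D

        // Not valid UTF-8, so the decoder substitutes U+FFFD and the
        // original bytes cannot be recovered.
        System.out.println(Arrays.equals(roundTrip(gbk, StandardCharsets.UTF_8), gbk));

        // ISO-8859-1 maps every byte to exactly one character, so this
        // particular round trip happens to preserve the bytes.
        System.out.println(Arrays.equals(roundTrip(gbk, StandardCharsets.ISO_8859_1), gbk));
    }
}
```

Java substitutes U+FFFD rather than a literal question mark at decode time; the question marks typically appear later, when the string is encoded to or displayed in a charset that has no mapping for U+FFFD.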

Comment 16 Dani Megert

2012-01-20 04:18:49 EST


*** This bug has been marked as a duplicate of bug 214085 ***