72995 – When comparing with CVS we see a wrong encoding

Summary: When comparing with CVS we see a wrong encoding

Status:	RESOLVED FIXED

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	Compare
Version:	3.0
Hardware:	PC Windows 2000

Importance:	P3 normal with 2 votes
Target Milestone:	3.3 M3
Assignee:	Michael Valenta
QA Contact:

URL:
Whiteboard:
Keywords:

Duplicates (2):	81637 87300
Depends on:
Blocks:

Reported:	2004-08-31 14:21 EDT by Shai Bentin
Modified:	2006-10-12 15:59 EDT
CC List:	12 users

See Also:

Attachments
Asked file: the file from the cvs server and the local version (522 bytes, application/zip) 2004-11-30 09:05 EST, Marc
Patch correcting bad encoding when comparing (1.64 KB, patch) 2006-01-19 11:37 EST, Gilles Querret
New patch fixing encodings (2.37 KB, patch) 2006-01-20 04:02 EST, Gilles Querret

Description Shai Bentin

2004-08-31 14:21:31 EDT

We use UTF-8 as our encoding. We have a properties file which has Hebrew values.
When comparing with CVS, the remote side, although correctly compared, shows 
gibberish instead of Hebrew, whereas the local side is OK.

I again stress that this does not affect the compare itself and we can sync 
without a problem; it only looks wrong.

Comment 1 Andre Weinand

2004-08-31 14:27:15 EDT

What encoding do you use for the property file?

Since property files are by definition ISO-8859-1, I assume you encode the hebrew characters in 
\uxxxx notation?
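
For readers unfamiliar with the convention Andre refers to, here is a minimal sketch of how a \uXXXX-escaped value is read back through the standard java.util.Properties loader. The class name, the key "greeting" and the sample value are illustrative assumptions, not taken from the reporter's project.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Illustrative only: class name, key and value are made up for this sketch.
final class PropertiesEscapeDemo {

    // Loads a one-line properties file whose value uses backslash-u escapes.
    static String loadGreeting() {
        // The doubled backslashes make this the literal escaped text,
        // i.e. what the .properties file would contain on disk.
        String fileText = "greeting=\\u05E9\\u05DC\\u05D5\\u05DD\n";
        byte[] onDisk = fileText.getBytes(StandardCharsets.ISO_8859_1);

        Properties props = new Properties();
        try {
            // Properties.load(InputStream) reads ISO-8859-1 and decodes the escapes.
            props.load(new ByteArrayInputStream(onDisk));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("greeting");
    }
}
```

Because the file on disk holds only ASCII characters, a file written this way looks the same under ISO-8859-1 and UTF-8, so it should not be affected by the display problem reported here.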

Comment 2 Shai Bentin

2004-08-31 15:22:16 EDT

We use UTF-8, as I said.

It's true, the actual file uses \uxxxx, but we use a Hebrew character file to 
insert our properties and then transcode it. This file is also synchronized in 
CVS and it has the problem.

Thanks

Comment 3 Andre Weinand

2004-08-31 15:40:05 EDT

From your comment I'm not able to understand how you deal with property files.
But anyway, I assume that the file that shows "gibberish instead of hebrew" is stored in UTF-8 encoding 
in CVS?
Is UTF-8 the platform or workbench encoding?
If not, does the problem disappear if you explicitly set the workbench encoding to UTF-8?

Comment 4 Shai Bentin

2004-09-03 19:10:19 EDT

Both the workbench and the platform are set to UTF-8 and the problem still 
exists.

To explain my other comment again: since \uxxxx is not human readable, we keep 
a file with readable Hebrew characters. We then run a special parser/transcoder 
on that file, which generates the actual properties file with \uxxxx chars 
in it.
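
The transcoding step described above could look roughly like the following sketch. It is an assumption about what such a tool does (similar in spirit to the JDK's native2ascii), not the reporter's actual transcoder; the class and method names are hypothetical.

```java
// Hypothetical stand-in for the reporter's transcoder; not their actual tool.
final class UnicodeEscaper {

    // Replaces every non-ASCII UTF-16 code unit with a backslash-u escape.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);                                   // plain ASCII passes through
            } else {
                out.append(String.format("\\u%04X", (int) c));   // four hex digits per code unit
            }
        }
        return out.toString();
    }
}
```

The human-readable Hebrew source file is the one that needs a correct content encoding in the compare editor; the generated output is pure ASCII.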

Comment 5 Marc

2004-11-22 12:11:24 EST

My computer uses UTF-8 as file encoding and the cvs server is configured as
ISO-8859-1. When performing a "compare to", the remote file is retrieved as
if it were UTF-8 encoded instead of ISO-8859-1.
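
The effect described here can be reproduced outside Eclipse. The sketch below assumes the stored bytes are ISO-8859-1 and that the reader decodes them as UTF-8; the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: ISO-8859-1 bytes decoded with the wrong charset.
final class Latin1AsUtf8Demo {

    // Encodes the text as ISO-8859-1 (as stored in the repository in this scenario)
    // and then decodes those bytes as UTF-8 (as the viewer does in this scenario).
    static String misread(String original) {
        byte[] stored = original.getBytes(StandardCharsets.ISO_8859_1);
        return new String(stored, StandardCharsets.UTF_8);
    }
}
```

A lone ISO-8859-1 byte such as 0xFC (ü) is not a valid UTF-8 sequence, so the decoder substitutes the replacement character U+FFFD. The substitution is lossy: the original character cannot be recovered from the decoded text.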

Comment 6 Andre Weinand

2004-11-23 03:54:28 EST

Marc, you'll have to set the encoding of the local project to match the encoding used on the CVS server, 
so in your case you'll have to set the project to ISO-8859-1 (which overrides the UTF-8 platform 
encoding).

Comment 7 Andre Weinand

2004-11-23 03:58:08 EST

Shai, what encoding does the CVS server use?
Make sure that both the local resources as well as the server have the same encoding.
Eclipse's CVS support does not support automatic translation of encodings.

Comment 8 Marc

2004-11-23 04:05:07 EST

Andre,
the encoding for the project as well as the one for the server are already set
to ISO-8859-1. It just seems that Eclipse doesn't look at this setting when
performing a diff.

Comment 9 Andre Weinand

2004-11-23 04:28:15 EST

Where do you "set the server encoding to ISO-8859-1" ?

Comment 10 Marc

2004-11-23 04:34:33 EST

Andre,
- in the CVS Repository Exploring perspective, the server encoding is set to
Other: ISO-8859-1
- in the Java perspective, the project encoding is set to Other: ISO-8859-1
- each file I want to compare with its previous version has: Default (inherited
from container: ISO-8859-1)

It doesn't change anything if I force a file's encoding explicitly to Other:
ISO-8859-1 (and it would be quite tedious to do on all files).

Comment 11 Andre Weinand

2004-11-23 05:05:44 EST

Be aware that the "server encoding" that you can specify for a CVS server does not affect the encoding 
of the file's content. It is only used for file names and comments.

Currently there is no way to specify the encoding for the file's content.

Comment 12 Marc

2004-11-23 05:10:54 EST

This means that when performing a diff, Eclipse determines the content encoding of:
- the local file according to project/file settings
- the other version received from CVS according to the computer's default encoding

Not really good, is it?

Comment 13 Andre Weinand

2004-11-23 06:16:24 EST

When performing a diff, Compare uses the IEncodedStorage to determine the encoding for both (or all 
three) sides. For local resources, this does exactly what you would expect.
The implementation for the CVS resources uses the same algorithm as for local files. This is probably 
not what you would expect because it means that you have to set the encoding for any project shared 
via CVS to the encoding of the CVS server.
I agree, this is suboptimal.
I'm moving this bug to Platform/CVS because Compare doesn't control the implementation of 
IEncodedStorage for CVS resources.

Comment 14 Marc

2004-11-23 06:19:22 EST

I'm not sure that I fully understand your explanation because I currently have
the same encoding for the project and for the cvs server.

Comment 15 Andre Weinand

2004-11-23 06:26:17 EST

And what is the problem you are seeing?

Comment 16 Marc

2004-11-23 06:53:58 EST

for instance ä, ö, ü in the remote file are represented as &#65533; (not the normal "?"
but bold and white on a black point) and therefore a difference is seen.

When comparing 2 versions from the history with each other, both have ? instead
of ö, ä, ü, etc and therefore no diff is detected (what is correct) although the
text is not correctly represented.

If the local file is replaced with a previous version then its content is
correctly displayed.

Comment 17 Andre Weinand

2004-11-23 07:05:57 EST

Without exact steps it is impossible for me to reproduce your problem.
Could you please send me (or attach) both the unmodified local file and the unmodified file stored in 
CVS repository as a zip archive.
Thanks.

Comment 18 Andre Weinand

2004-11-23 07:09:24 EST

Two more questions:
- if you open the local file in a text editor, does it look right?
- if you navigate to the remote file in the CVS repository view and then open it in an editor,
  does it display correctly?

Comment 19 Andre Weinand

2004-11-30 06:07:02 EST

Marc, do you have the requested additional information?

Comment 20 Marc

2004-11-30 09:05:49 EST

Created attachment 16234
Asked file: the file from the cvs server and the local version

Comment 21 Marc

2004-11-30 09:06:40 EST

Andre,

- the local file opened in a text editor looks right (as long as the text editor
knows it has to read it as ISO-8859-1).
- the file content is not correctly displayed when navigating in the CVS
repository view and clicking on Open for the file 

I've attached the files you asked for to the bug.

Comment 22 Andre Weinand

2004-11-30 09:18:07 EST

Michael or Jean-Michel,
So if the file content is not correctly displayed when navigating in the CVS
repository view and opening it, this probably means that the problem is not in Compare?

Comment 23 Michael Valenta

2004-11-30 11:47:56 EST

This is the behavior I saw when I set my platform encoding to UTF-8 and opened 
the file in the repository explorer.

1) The IEncodedStorage from CVS returned null as an encoding for the remote 
file could not be determined from the contents (i.e. there is no link between 
a file in the repository and what is in the workspace).

2) The StorageDocumentProvider (from the texteditor framework) assumed the 
workspace encoding and the file was not displayable

3) A button appeared in the editor which allowed me to change the encoding for 
that file.

4) I changed the encoding and the file was displayed properly.

When I opened a compare editor, the remote file was just blank. I think the 
above behavior would be beneficial to have in the compare editor as well. 
Also, in the case where the remote files return null as the charset and the 
local file has a charset, it may be better for compare to use the encoding 
from the local file as the default instead of the platform encoding. Arguably, 
this could be done by the editor input provided by CVS instead but I think we 
would still want the fallback behavior that allowed the user to set the 
encoding manually.
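
The fallback order proposed in this comment can be sketched as below. This is not the Eclipse Compare or CVS implementation: the class and parameter names are hypothetical, and plain charset-name strings stand in for whatever the remote storage, the local resource and the user would supply.

```java
import java.nio.charset.Charset;

// Hypothetical helper illustrating the proposed order of precedence.
final class CharsetFallback {

    // Each argument is a charset name, or null when that source has no answer.
    static Charset resolve(String userOverride, String remoteCharset, String localCharset) {
        if (userOverride != null) {
            return Charset.forName(userOverride);   // a manual choice always wins
        }
        if (remoteCharset != null) {
            return Charset.forName(remoteCharset);  // the remote side knows its own encoding
        }
        if (localCharset != null) {
            return Charset.forName(localCharset);   // borrow the local file's encoding
        }
        return Charset.defaultCharset();            // last resort: platform default
    }
}
```

For the scenario in this bug, resolve(null, null, "ISO-8859-1") would pick ISO-8859-1 for the remote file instead of the UTF-8 platform default.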

Comment 24 Michael Valenta

2004-12-01 11:26:48 EST

After some more reflection, I think the best solution to this is for Compare 
to allow the user to manually set the encoding for the remote file. Any other 
solution is a guess. Even if we were right most of the time, we would still 
need the ability for the user to manually set the encoding when the guess was 
wrong. Moving to Compare for comment.

Comment 25 Marc

2004-12-01 11:30:18 EST

Not really practicable: this would mean manually setting the encoding for the remote
file each time we perform a compare! Quite tedious.

Comment 26 Andre Weinand

2004-12-01 11:36:35 EST

Yes, I can provide UI for overriding the encoding that I get through IEncodedStorage. However, this 
should be the exception not the rule. If every file stored in a repository uses the specific encoding of 
that repository, it would make more sense to define this encoding once via the repository location 
property dialog. This would be similar to the already existing encoding property for file paths and 
comments.

Comment 27 Michael Valenta

2004-12-01 13:19:30 EST

CVS servers do not support file content encodings. We could try to fake this 
in Eclipse by supporting an encoding on the repository but this makes the 
assumption that all files stored in the repository use the same file 
content encoding. Although this may be true most of the time, it will not be 
true all the time. For those cases, we would need the ability to specify 
encodings for individual files or folders, just as the workspace provides. 
This is not something we want to do.

As I said in a previous comment, I think the other part of this solution is to 
pick the default encoding for a remote file based on the encoding found for 
the local file if a remote encoding could not be determined. This may not be 
correct 100% of the time but if the support to manually set the encoding was 
added then that would be enough.

The question then becomes whether Compare should default to the workspace 
encoding if a remote encoding is not available or if CVS should. I think it 
should be in Compare so that other clients that do not support encodings will 
get the added feature as well. However, it could be done in CVS (or, more 
likely Team), if there are compelling reasons not to do this in Compare.

Comment 28 Andre Weinand

2005-01-06 12:06:21 EST

*** Bug 81637 has been marked as a duplicate of this bug. ***

Comment 29 Michael Valenta

2005-03-07 13:55:42 EST

*** Bug 87300 has been marked as a duplicate of this bug. ***

Comment 30 Michael Valenta

2005-05-12 09:03:46 EDT

*** Bug 94914 has been marked as a duplicate of this bug. ***

Comment 31 Philippe Ombredanne

2005-07-13 17:49:01 EDT

Just adding my 2 cents, slightly off-topic: IDEA has a feature to guess
character encoding that is fairly good.
Guessing encoding should IMHO be a platform core or platform resource service.
The code at http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding has been
integrated in both Groovy and IDEA.
Guillaume, the author, has already confirmed that it is available for free with a
copyright attribution. The version integrated in Groovy is under an Apache
license (Groovy is an official JSR by the way) and is most likely more up to date.
May be worth a look? The problem is there with Eclipse inside and outside of CVS
use and compare.
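
To give an idea of what encoding guessing involves, here is a deliberately simple heuristic. It is not the GuessEncoding code linked above and not an Eclipse service; it only checks whether the bytes form valid UTF-8 and otherwise returns a caller-supplied fallback. The class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical minimal guesser: strict UTF-8 check with a fallback.
final class CharsetGuesser {

    static Charset guess(byte[] content, Charset fallback) {
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;   // not valid UTF-8, so trust the caller's default
        }
    }
}
```

This remains a guess: pure ASCII content is reported as UTF-8 (harmless), and some byte sequences are valid in more than one encoding, which is why a manual override would still be needed.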

Comment 32 kai

2005-08-29 22:54:40 EDT

I just wanna say that I have the same problem on linux using 3.1RC4.

The java source files are encoded using shift jis (even though the cvs server
and development environment are all linux - go figure), and when I use the
compare editor the remote view of the file has "mojibake" or corruption of the
japanese text.

I read through the discussion, but don't understand the problem... why wouldn't
the compare editor just use the same encoding for the remote file that the local
file uses? There is no translation of encoding happening anywhere, and I don't
know when this would even be desirable.

Anyway, I hope we can get a fix soon, cvs compare is an important part of
development, and it's crazy to use it when most of the "changes" are bogus due
to encoding problems.

Thanks.

Comment 33 Hendrik Maryns

2005-11-26 09:17:29 EST

I also have this problem, when synchronising UTF-8 encoded files, ë, ï, é and the like get replaced by Ã«, Ã¯ and Ã© respectively in the remote file, thus differences are shown.  When creating a patch, these are ignored though.  Strange thing is, I only get these differences if there is another change in the file too.  I.e., when there is no real change, but the file contains special characters, it does not appear in the Team Synchronizing perspective.

The file says in its comment header it is in UTF-8, and my local file is set to UTF-8 too.
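
The substitutions listed in this comment match what happens when UTF-8 bytes are decoded with a single-byte Western encoding. The sketch below assumes ISO-8859-1 as that encoding (the comment does not say which one was in effect); the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: UTF-8 bytes decoded as ISO-8859-1, and the reverse repair.
final class Utf8AsLatin1Demo {

    // Each two-byte UTF-8 sequence becomes two separate Latin-1 characters.
    static String misread(String original) {
        byte[] stored = original.getBytes(StandardCharsets.UTF_8);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    // ISO-8859-1 maps every byte to a character, so the original bytes can be
    // recovered from the garbled text and decoded again with the right charset.
    static String repair(String garbled) {
        byte[] stored = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(stored, StandardCharsets.UTF_8);
    }
}
```

Here misread("ë") yields "Ã«", because ë is stored as the bytes C3 AB. Unlike the replacement-character case, this direction is reversible, which is consistent with the remote content being intact and only displayed wrongly.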

Comment 34 Oldrich Jedlicka

2006-01-11 05:32:05 EST

(In reply to comment #32)
> I read through the discussion, but don't understand the problem... why wouldn't
> the compare editor just use the same encoding for the remote file that the local
> file uses? There is no translation of encoding happening anywhere, and I don't
> know when this would even be desirable.

I have the same opinion. The contents of a file in CVS should be handled in the same way as if it were stored locally. 

Current behaviour (Eclipse 3.1) when you try to make a diff with the latest version from HEAD is: the local file, translated with the local file encoding, is compared with the CVS file with no translation, so there is an encoding mismatch. It would be better to translate the CVS file with the local file encoding and then make the comparison. Another solution is to compare the local file with the CVS version without any encoding translation and then use the local file encoding for displaying both files.
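
The second alternative (compare without translation, decode only for display) might be sketched as follows. This is an illustration of the idea under the assumption that both sides are available as byte arrays; it does not reflect how the Compare plug-in is structured, and the names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper: equality on raw bytes, decoding only for display.
final class ByteLevelCompare {

    // Content equality is decided on the raw bytes, so no charset is involved.
    static boolean sameContent(byte[] local, byte[] remote) {
        return Arrays.equals(local, remote);
    }

    // Both sides are decoded with the local file's charset, so they render alike.
    static String[] forDisplay(byte[] local, byte[] remote, Charset localCharset) {
        return new String[] {
            new String(local, localCharset),
            new String(remote, localCharset)
        };
    }
}
```

The trade-off is that two files holding the same text in different encodings would be reported as different, since their bytes differ.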

I do not see any problem here. The only problem can be with displaying the file directly in CVS Repository Exploring perspective, where the remote file encoding is unknown. But when the encoding is known from the local file encoding, it should be used.

Maybe I used bad wording, sorry for my english. If it is not clear, please ask me for details.

Comment 35 Michael Valenta

2006-01-11 08:47:49 EST

I agree that the best solution is to use the local encoding for the remote contents. The problem is really just a manpower issue. It may be the case that we can find time to address this in 3.2 but it is just as likely that we will not have time. If someone were to provide a patch, that would be helpful.

Comment 36 Gilles Querret

2006-01-19 11:37:03 EST

Created attachment 33272
Patch correcting bad encoding when comparing

This is a short patch, to correct the wrong encoding behavior. This is my very first patch to an Eclipse plugin, so I hope it's not too crude.
Basically, there's no encoding defined in the stream coming from CVS server. So when creating the TextMergeViewer, I'm using the left frame's encoding to decode the right frame's stream. And this will be the same when comparing two revisions.
I hope it's the right location to correct this bug. I'm also looking into whether it would be better to include the encoding at a lower level (i.e. when getting the stream from the CVS server).

Regards,

Gilles QUERRET

Comment 37 Andre Weinand

2006-01-19 12:34:48 EST

Thanks for the patch.
I'll try and verify it later today.

Comment 38 Gilles Querret

2006-01-19 13:13:14 EST

On my way home, I realized I missed something in this patch (comparing two streams with encodings both known). I'll send another one tomorrow morning.

Comment 39 Gilles Querret

2006-01-20 04:02:21 EST

Created attachment 33341
New patch fixing encodings

New patch fixing bad encoding when comparing files. Cleaned up the previous one, and took into account comparing two files with encodings both known. This means two identical files encoded using different code pages will be shown as identical in the compare view.
I've made this patch on R3_1_1 tag of org.eclipse.compare module. 
If someone wants to try this patch in his own environment, a JAR file is available at pct.sourceforge.net/org.eclipse.compare_3.1.1.jar. Congrats to every Eclipse guys, it's a real pleasure to code/test/debug with the PDE :-)
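
The "both encodings known" case mentioned above can be illustrated with a small sketch. It is not the attached patch; it only shows why decoding each side with its own charset makes equal text compare equal even when the bytes differ. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical helper: compare decoded text rather than raw bytes.
final class DecodedCompare {

    // Each side is decoded with its own known charset before comparing.
    static boolean sameText(byte[] left, Charset leftCharset,
                            byte[] right, Charset rightCharset) {
        String leftText = new String(left, leftCharset);
        String rightText = new String(right, rightCharset);
        return leftText.equals(rightText);
    }
}
```

For example, "für" stored as ISO-8859-1 on one side and as UTF-8 on the other has different bytes but the same decoded text, so sameText returns true.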

Comment 40 Gilles Querret

2006-02-11 07:06:16 EST

Just for information, has anybody tried the patch from my previous comment ( http://pct.sourceforge.net/org.eclipse.compare_3.1.1.jar )? And found correct/incorrect behavior?

Comment 41 Oldrich Jedlicka

2006-02-11 11:41:08 EST

I can confirm that the file from sourceforge works correctly for at least the Text Compare window, when the Local File has a known encoding and the Remote File has an unknown encoding.

Comment 42 Andre Weinand

2006-03-27 05:40:17 EST

fixed for 3.2M6.
Thanks Gilles!

Comment 43 Bart Vanhaute

2006-10-02 05:04:15 EDT

I am currently using 3.3M2, and I still see CVS compare using different encoding for the local and remote file content. Is this really fixed?

My default encoding is UTF-8 but some files have specific encoding, set to latin-1 (ISO-8859-1).

Comment 44 Michael Valenta

2006-10-02 09:11:04 EDT

If you have a scenario that fails, please log a bug and describe the steps required to reproduce the failure. It's possible that it is a scenario that was not covered by this fix. It is also possible that recent changes have caused a regression.

Comment 45 Bart Vanhaute

2006-10-02 16:04:16 EDT

It must be some kind of regression, because the scenario is pretty straightforward:

1. create a module in a CVS repository that contains a file in latin-1 encoding. make sure the file really contains non-ASCII characters.
2. checkout that module using eclipse
3. right-click on the latin-1 file, and change encoding from default UTF-8 to ISO-8859-1
4. open the file; the contents should look ok
5. change something in the file.
6. right-click the file and do a CVS compare with HEAD. 
7. notice left pane contents looks ok, right pane contents is wrong (latin-1 characters are interpreted as UTF-8)

Comment 46 Michael Valenta

2006-10-02 16:16:50 EDT

Alright, we'll have a look

Comment 47 Michael Valenta

2006-10-12 15:29:05 EDT

I've found the most likely cause and should have a fix for 3.3 M3.

Comment 48 Michael Valenta

2006-10-12 15:59:29 EDT

Fix released to HEAD