We use UTF-8 as our encoding. We have a properties file which has Hebrew values. When comparing with CVS, the remote side, although correctly compared, shows gibberish instead of Hebrew, whereas the local side is OK. I stress again that this does not affect the compare itself and we can sync without a problem; it only looks wrong.
What encoding do you use for the property file? Since property files are by definition ISO-8859-1, I assume you encode the Hebrew characters in \uxxxx notation?
We use UTF-8, as I said. It's true, the actual file uses \uxxxx, but we use a Hebrew character file to enter our properties and then transcode it. This file is also synchronized in CVS and it has the problem. Thanks
From your comment I'm not able to understand how you deal with property files. But anyway, I assume that the file that shows "gibberish instead of hebrew" is stored in UTF-8 encoding in CVS? Is UTF-8 the platform or workbench encoding? If not, does the problem disappear if you explicitly set the workbench encoding to UTF-8?
Both the workbench and the platform are set to UTF-8 and the problem still exists. To explain my other comment again: since \uxxxx is not human readable, we keep a file with readable Hebrew characters. We then run a special parser/transcoder on that file, which generates the actual properties file with \uxxxx characters in it.
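For readers unfamiliar with this kind of transcoding step, here is a minimal sketch of what such a tool might do. It is not the reporter's actual transcoder; the class and method names are hypothetical. It replaces every character outside printable ASCII with a \uXXXX escape, so the result can be stored as ISO-8859-1.

```java
class PropertiesEscaper {
    /**
     * Replaces every character outside printable ASCII (0x20..0x7E)
     * with a backslash-u escape of four hex digits.
     * Control characters such as tabs are escaped as well.
     */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x20 && c <= 0x7E) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A sample key with a Hebrew value, written with escapes
        // so that this source file itself stays pure ASCII.
        String line = "greeting=\u05E9\u05DC\u05D5\u05DD";
        System.out.println(escape(line));
    }
}
```

The human-readable Hebrew file is still a UTF-8 file in CVS, which is why it is affected by the display problem while the generated properties file is not.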
My computer uses UTF-8 as file encoding and the CVS server is configured as ISO-8859-1. When performing a "compare to", the remote file is retrieved as if it were UTF-8 encoded instead of ISO-8859-1.
Marc, you'll have to set the encoding of the local project to match the encoding used on the CVS server, so in your case you'll have to set the project to ISO-8859-1 (which overrides the UTF-8 platform encoding).
Shai, what encoding does the CVS server use? Make sure that both the local resources as well as the server have the same encoding. Eclipse's CVS support does not support automatic translation of encodings.
Andre, the encoding for the project as well as the one for the server are already set to ISO-8859-1. It just seems that Eclipse doesn't look at this setting when performing a diff.
Where do you "set the server encoding to ISO-8859-1" ?
Andre,
- in the CVS Repository Exploring perspective, the server encoding is set to Other: ISO-8859-1
- in the Java perspective, the project encoding is set to Other: ISO-8859-1
- each file I want to compare with its previous version has: Default (inherited from container: ISO-8859-1)
It doesn't change anything if I explicitly force a file's encoding to Other: ISO-8859-1 (and that would be quite tedious to do on all files).
Be aware that the "server encoding" that you can specify for a CVS server does not affect the encoding of the file's content. It is only used for file names and comments. Currently there is no way to specify the encoding for the file's content.
This means that when performing a diff, Eclipse determines the content encoding of:
- the local file according to the project/file settings
- the other version received from CVS according to the computer's default encoding
Not really good, is it?
When performing a diff, Compare uses the IEncodedStorage to determine the encoding for both (or all three) sides. For local resources, this does exactly what you would expect. The implementation for the CVS resources uses the same algorithm as for local files. This is probably not what you would expect because it means that you have to set the encoding for any project shared via CVS to the encoding of the CVS server. I agree, this is suboptimal. I'm moving this bug to Platform/CVS because Compare doesn't control the implementation of IEncodedStorage for CVS resources.
I'm not sure that I fully understand your explanation, because I currently have the same encoding for the project and for the CVS server.
And what is the problem you are seeing?
For instance, ä, ö, ü in the remote file are represented as � (not the normal "?" but bold and white on a black point) and therefore a difference is seen. When comparing 2 versions from the history with each other, both have ? instead of ö, ä, ü, etc. and therefore no diff is detected (which is correct), although the text is not correctly represented. If the local file is replaced with a previous version, then its content is correctly displayed.
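The symptom described here can be reproduced outside Eclipse. The following is a minimal sketch, assuming the remote bytes are ISO-8859-1 and are decoded as UTF-8; the class name is hypothetical. Bytes such as 0xE4 are not valid UTF-8 on their own, so the decoder substitutes U+FFFD, the replacement character.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    /** Encodes text as ISO-8859-1, then decodes those bytes as if they were UTF-8. */
    static String misread(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // a-umlaut, o-umlaut, u-umlaut, written as escapes to keep this source ASCII
        String original = "\u00E4, \u00F6, \u00FC";
        String shown = misread(original);
        System.out.println("contains U+FFFD: " + shown.contains("\uFFFD"));
        System.out.println("equal to original: " + shown.equals(original));
    }
}
```

Because the local side is decoded correctly and the remote side is not, the two strings differ even when the bytes are identical, which matches the bogus difference reported above.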
Without exact steps it is impossible for me to reproduce your problem. Could you please send me (or attach) both the unmodified local file and the unmodified file stored in CVS repository as a zip archive. Thanks.
Two more questions:
- if you open the local file in a text editor, does it look right?
- if you navigate to the remote file in the CVS repository view and then open it in an editor, does it display correctly?
Marc, do you have the requested additional information?
Created attachment 16234 [details] Asked file: the file from the cvs server and the local version
Andre,
- the local file opened in a text editor looks right (as long as the text editor knows it has to read it as ISO-8859-1).
- the file content is not correctly displayed when navigating in the CVS repository view and clicking on Open for the file
I've attached the files you asked for to the bug.
Michael or Jean-Michel, So if the file content is not correctly displayed when navigating in the CVS repository view and opening it, this probably means that the problem is not in Compare?
This is the behavior I saw when I set my platform encoding to UTF-8 and opened the file in the repository explorer.
1) The IEncodedStorage from CVS returned null, as an encoding for the remote file could not be determined from the contents (i.e. there is no link between a file in the repository and what is in the workspace).
2) The StorageDocumentProvider (from the texteditor framework) assumed the workspace encoding and the file was not displayable.
3) A button appeared in the editor which allowed me to change the encoding for that file.
4) I changed the encoding and the file was displayed properly.
When I opened a compare editor, the remote file was just blank. I think the above behavior would be beneficial to have in the compare editor as well. Also, in the case where the remote file returns null as the charset and the local file has a charset, it may be better for Compare to use the encoding from the local file as the default instead of the platform encoding. Arguably, this could be done by the editor input provided by CVS instead, but I think we would still want the fallback behavior that allows the user to set the encoding manually.
After some more reflection, I think the best solution to this is for Compare to allow the user to manually set the encoding for the remote file. Any other solution is a guess. Even if we were right most of the time, we would still need the ability for the user to manually set the encoding when the guess was wrong. Moving to Compare for comment.
Not really practicable: this would mean manually setting the encoding for the remote file each time we perform a compare! Quite tedious.
Yes, I can provide UI for overriding the encoding that I get through IEncodedStorage. However, this should be the exception not the rule. If every file stored in a repository uses the specific encoding of that repository, it would make more sense to define this encoding once via the repository location property dialog. This would be similar to the already existing encoding property for file paths and comments.
CVS servers do not support file content encodings. We could try to fake this in Eclipse by supporting an encoding on the repository, but this makes the assumption that all files stored in the repository use the same file content encoding. Although this may be true most of the time, it will not be true all the time. For those cases, we would need the ability to specify encodings for individual files or folders, just as the workspace provides. This is not something we want to do. As I said in a previous comment, I think the other part of this solution is to pick the default encoding for a remote file based on the encoding found for the local file if a remote encoding could not be determined. This may not be correct 100% of the time, but if the support to manually set the encoding was added then that would be enough. The question then becomes whether Compare should default to the workspace encoding if a remote encoding is not available or if CVS should. I think it should be in Compare so that other clients that do not support encodings will get the added feature as well. However, it could be done in CVS (or, more likely, Team) if there are compelling reasons not to do this in Compare.
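The fallback order proposed in this comment can be sketched as follows. This is an illustration only, not Eclipse code: the class and method names are hypothetical, and plain strings stand in for whatever IEncodedStorage would report for each side (null meaning "unknown").

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RemoteCharsetFallback {
    /**
     * Picks the charset used to decode the remote side:
     * 1. the remote charset, if one is known;
     * 2. otherwise the local file's charset, if one is known;
     * 3. otherwise the platform default.
     */
    static Charset chooseRemoteCharset(String remoteCharset, String localCharset,
                                       Charset platformDefault) {
        if (remoteCharset != null) {
            return Charset.forName(remoteCharset);
        }
        if (localCharset != null) {
            return Charset.forName(localCharset);
        }
        return platformDefault;
    }

    public static void main(String[] args) {
        // Remote encoding unknown, local file set to ISO-8859-1, platform is UTF-8.
        Charset chosen = chooseRemoteCharset(null, "ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(chosen.name());
    }
}
```

A manual override, as discussed above, would sit in front of step 1 for the cases where this guess is wrong.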
*** Bug 81637 has been marked as a duplicate of this bug. ***
*** Bug 87300 has been marked as a duplicate of this bug. ***
*** Bug 94914 has been marked as a duplicate of this bug. ***
Just adding my 2 cents, slightly off-topic: IDEA has a feature to guess character encoding that is fairly good. Guessing encoding should IMHO be a platform core or platform resource service. The code at http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding has been integrated in both Groovy and IDEA. Guillaume, the author, has already confirmed that it is available for free with a copyright attribution. The version integrated in Groovy is under an Apache license (Groovy is an official JSR, by the way) and is most likely more up to date. May be worth a look? The problem is there with Eclipse inside and outside of CVS use and compare.
I just wanna say that I have the same problem on Linux using 3.1RC4. The Java source files are encoded using Shift JIS (even though the CVS server and development environment are all Linux - go figure), and when I use the compare editor the remote view of the file has "mojibake" or corruption of the Japanese text. I read through the discussion, but don't understand the problem... why wouldn't the compare editor just use the same encoding for the remote file that the local file uses? There is no translation of encoding happening anywhere, and I don't know when this would even be desirable. Anyway, I hope we can get a fix soon; CVS compare is an important part of development, and it's crazy to use it when most of the "changes" are bogus due to encoding problems. Thanks.
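As a rough illustration of what encoding guessing can look like, here is one simple heuristic: try a strict UTF-8 decode and see whether it fails. This is not the algorithm of the GuessEncoding code mentioned above (its details are not given in this thread); the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    /**
     * Returns true if the bytes form a well-formed UTF-8 sequence.
     * Pure ASCII input also returns true, so a "true" result
     * does not prove that the author intended UTF-8.
     */
    static boolean looksLikeUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // "cafe" with e-acute
        System.out.println(looksLikeUtf8(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println(looksLikeUtf8(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A check like this can only rule UTF-8 out; it cannot tell ISO-8859-1 from other single-byte encodings, which is one reason a manual override remains necessary.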
I also have this problem, when synchronising UTF-8 encoded files, ë, ï, é and the like get replaced by ë, ï and é respectively in the remote file, thus differences are shown. When creating a patch, these are ignored though. Strange thing is, I only get these differences if there is another change in the file too. I.e., when there is no real change, but the file contains special characters, it does not appear in the Team Synchronizing perspective. The file says in its comment header it is in UTF-8, and my local file is set to UTF-8 too.
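For reference, the reverse mix-up (UTF-8 bytes decoded as ISO-8859-1) produces a different kind of garbling: each accented character turns into two characters, the first usually being a capital A with tilde. A minimal sketch, assuming that is what happens to the remote side here; the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Encodes text as UTF-8, then decodes those bytes as if they were ISO-8859-1. */
    static String misread(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // e-diaeresis (U+00EB) is the two bytes C3 AB in UTF-8.
        String garbled = misread("\u00EB");
        // Read as ISO-8859-1, those bytes become U+00C3 followed by U+00AB.
        System.out.println(garbled.length());
        System.out.println(garbled.equals("\u00C3\u00AB"));
    }
}
```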
(In reply to comment #32)
> I read through the discussion, but don't understand the problem... why wouldn't
> the compare editor just use the same encoding for the remote file that the local
> file uses? There is no translation of encoding happening anywhere, and I don't
> know when this would even be desirable.

I have the same opinion. The contents of a file in CVS should be handled in the same way as if it were stored locally. The current behaviour (Eclipse 3.1) when you try to make a diff with the latest version from HEAD is: the local file, translated with the local file encoding, is compared with the CVS file with no translation, so there is an encoding mismatch. It would be better to translate the CVS file with the local file encoding and then make the comparison. Another solution is to compare the local file without encoding translation with the CVS version and then use the local file encoding for both files for display. I do not see any problem here. The only problem can be with displaying the file directly in the CVS Repository Exploring perspective, where the remote file encoding is unknown. But when the encoding is known from the local file encoding, it should be used. Maybe I used bad wording, sorry for my English. If it is not clear, please ask me for details.
I agree that the best solution is to use the local encoding for the remote contents. The problem is really just a manpower issue. It may be the case that we can find time to address this in 3.2 but it is just as likely that we will not have time. If someone were to provide a patch, that would be helpful.
Created attachment 33272 [details] Patch correcting bad encoding when comparing This is a short patch to correct the wrong encoding behavior. This is my very first patch to an Eclipse plugin, so I hope it's not too crude. Basically, there's no encoding defined in the stream coming from the CVS server. So when creating the TextMergeViewer, I'm using the left frame's encoding to decode the right frame's stream. And this will be the same when comparing two revisions. I hope it's the right location to correct this bug. By the way, I'm looking into whether it would be better to include the encoding at a lower level (i.e. when getting the stream from the CVS server). Regards, Gilles QUERRET
Thanks for the patch. I'll try and verify it later today.
On my way home, I realized I missed something in this patch (comparing two streams with encodings both known). I'll send another one tomorrow morning.
Created attachment 33341 [details] New patch fixing encodings New patch fixing bad encoding when comparing files. Cleaned up the previous one, and took into account comparing two files with encodings both known. This means two identical files encoded using different code pages will be shown as identical in the compare view. I've made this patch on the R3_1_1 tag of the org.eclipse.compare module. If someone wants to try this patch in their own environment, a JAR file is available at pct.sourceforge.net/org.eclipse.compare_3.1.1.jar. Congrats to all the Eclipse guys, it's a real pleasure to code/test/debug with the PDE :-)
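The idea that two files with the same text but different code pages should compare as identical can be sketched like this. It is an illustration under assumptions, not the attached patch: the class and method names are hypothetical, and each side is simply decoded with its own known charset before the text is compared.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DecodedCompare {
    /** Decodes each side with its own charset and compares the resulting text. */
    static boolean sameText(byte[] left, Charset leftCharset,
                            byte[] right, Charset rightCharset) {
        String leftText = new String(left, leftCharset);
        String rightText = new String(right, rightCharset);
        return leftText.equals(rightText);
    }

    public static void main(String[] args) {
        String text = "gr\u00FC\u00DFe"; // contains u-umlaut and sharp s
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // The raw bytes differ, but the decoded text is the same.
        System.out.println(Arrays.equals(asUtf8, asLatin1));
        System.out.println(sameText(asUtf8, StandardCharsets.UTF_8,
                                    asLatin1, StandardCharsets.ISO_8859_1));
    }
}
```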
Just for information, has anybody tried the patch in my previous comment ( http://pct.sourceforge.net/org.eclipse.compare_3.1.1.jar ) ? And found correct/incorrect behavior ?
I can confirm that the file from sourceforge works correctly for at least the Text Compare window, when the Local File has a known encoding and the Remote File has an unknown encoding.
fixed for 3.2M6. Thanks Gilles!
I am currently using 3.3M2, and I still see CVS compare using different encoding for the local and remote file content. Is this really fixed? My default encoding is UTF-8 but some files have specific encoding, set to latin-1 (ISO-8859-1).
If you have a scenario that fails, please log a bug and describe the steps required to reproduce the failure. It's possible that it is a scenario that was not covered by this fix. It is also possible that recent changes have caused a regression.
It must be some kind of regression, because the scenario is pretty straightforward:
1. Create a module in a CVS repository that contains a file in latin-1 encoding. Make sure the file really contains non-ASCII characters.
2. Check out that module using Eclipse.
3. Right-click on the latin-1 file, and change the encoding from the default UTF-8 to ISO-8859-1.
4. Open the file; the contents should look OK.
5. Change something in the file.
6. Right-click the file and do a CVS compare with HEAD.
7. Notice that the left pane contents look OK, while the right pane contents are wrong (latin-1 characters are interpreted as UTF-8).
Alright, we'll have a look
I've found the most likely cause and should have a fix for 3.3 M3.
Fix released to HEAD