159516 – [encoding] Improve text editor support when wrong encoding is used

Status:	RESOLVED DUPLICATE of bug 145754

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	Text
Version:	3.2
Hardware:	All All

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Platform-Text-Inbox
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2006-10-02 16:29 EDT by Olivier Thomann
Modified:	2006-10-26 12:11 EDT
CC List:	2 users

See Also:

Attachments
Proposed fix (4.78 KB, patch) 2006-10-25 14:29 EDT, Olivier Thomann

Description Olivier Thomann

2006-10-02 16:29:19 EDT

Using latest, I think the feedback to the user is pretty weak when the wrong encoding is used to read a source file.
I simply get an error saying that this file is unreadable using "...." character encoding. I would expect the editor to still read the file but to create markers on the characters that could not be properly mapped using the given encoding.
You can have a look at the code in the project:
org.eclipse.jdt.compiler.tool/src/org/eclipse/jdt/compiler/tool/Util.java

Providing more feedback to the user would improve the user experience. At least the source could be displayed with '?' or other characters that would replace the character that could not be mapped.

This problem appears mostly on Linux. It seems that on Windows the libraries are handling this case better.

Comment 1 Dani Megert

2006-10-09 03:42:00 EDT

What Eclipse version and VM are you using? In general we do open the file and show the '?' but then warn when saving, which is a problem too; see bug 145754.

Anything in the .log?

Comment 2 Olivier Thomann

2006-10-09 10:00:38 EDT

On Linux it seems to fail. I'll check tomorrow with Boris.

Comment 3 Olivier Thomann

2006-10-25 14:26:53 EDT

On Linux, this is not the case.
I suggest improving the file buffer code to use CharsetDecoder instead of Readers.
I will provide a patch for replacing unmappable characters with the default character for the encoding used to read the file. This is not ideal since there is no feedback to the user. We might want to collect all the locations for the unmappable characters through the file and provide a button at the top of the editor that would be used to change the encoding of the file and another button beside the first one to open a window that would show all the locations where an unmappable character has been found.
I also have code that collects all these positions. Let me know if you are interested.
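
For illustration, the replacement approach described in this comment can be sketched as follows. This is a minimal sketch and not the attached patch: the class name LenientDecode and the helper decodeLeniently are hypothetical, and it assumes the whole file has already been read into a byte array. It uses java.nio.charset.CharsetDecoder with CodingErrorAction.REPLACE so that undecodable input becomes the decoder's replacement string instead of aborting the read.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Eclipse or of the attached patch.
final class LenientDecode {

    // Decodes the bytes with the given charset, substituting the decoder's
    // replacement string for malformed or unmappable input.
    static String decodeLeniently(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both error actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xD6 is an O umlaut in Cp1252 but is not valid on its own in UTF-8.
        byte[] cp1252Bytes = { 'B', (byte) 0xD6, 'B' };
        String text = decodeLeniently(cp1252Bytes, StandardCharsets.UTF_8);
        System.out.println(text.length() + " chars decoded");
    }
}
```

As the comment notes, this alone gives the user no feedback: the replacement characters are silently mixed into the text.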

Comment 4 Olivier Thomann

2006-10-25 14:29:56 EDT

Created attachment 52687
Proposed fix

If you need the positions of the wrong characters, the patch has to be modified.
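
One possible way to collect such positions, sketched under assumptions: the class BadByteLocator and the method findUndecodableOffsets are hypothetical names and are not taken from the attachment. The sketch runs the decoder with CodingErrorAction.REPORT, records the byte offset of each error, skips the offending bytes, and continues. Note that it reports byte offsets in the file, not character offsets in the editor document.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Eclipse or of the attached patch.
final class BadByteLocator {

    // Returns the byte offsets at which decoding with the given charset fails.
    static List<Integer> findUndecodableOffsets(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024); // scratch space, content is discarded
        List<Integer> offsets = new ArrayList<>();
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // The input position points at the start of the bad sequence.
                offsets.add(in.position());
                in.position(in.position() + result.length());
            } else if (result.isOverflow()) {
                out.clear(); // make room and keep decoding
            } else {
                break; // underflow: all input has been consumed
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] data = { 'a', (byte) 0xD6, 'b' };
        System.out.println(findUndecodableOffsets(data, Charset.forName("UTF-8")));
    }
}
```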

Comment 5 David Williams

2006-10-25 14:58:47 EDT

I haven't looked at this bug in detail, so, sorry in advance if I'm reading it wrong, but ... seems to me this solution might help one case ... reading contents into file buffers, but, doesn't that imply then that clients (or different clients) would get different contents, depending on how they read it in? Java Editor versus Java Compiler, for example? 

Perhaps any encoding specific functionality should be abstracted out to a class anyone can use? We have a CodedIO class in WTP that attempts to do some of this (obviously of limited scope, since in WTP). 

Just thought I'd document this here, to see if anyone else sees merit in this abstraction. 

BTW, most users I'm sure would like a chance to "fix" encoding, before seeing the question marks, but then just get the question marks if they didn't know how to fix.

Comment 6 Olivier Thomann

2006-10-25 15:07:13 EDT

The Java compiler outside of Eclipse would still be broken since it cannot use the CharsetDecoder class. It is limited to the Foundation 1.0 library.
However the same kind of fix can go in the org.eclipse.jdt.internal.core.builder.SourceFile code used by the java builder. Then the builder and therefore the compiler would be able to compile the file instead of reporting that the project cannot be built since a compilation unit could not be read. This would work when working inside Eclipse.
The handling of the encoding should not be done by WTP. All plugins that are using text files would still be broken. So this needs to be a low level operation and this is why I thought it has to be changed at the Platform/Text level.
I could not find a good mechanism to report encoding errors to the user.

Comment 7 Dani Megert

2006-10-26 02:30:27 EDT

Before we go deeper: can you repost the scenario, i.e. was the file not opened at all? If so, that's good ;-) But I guess your problem was that it opened the file but didn't allow saving later, i.e. dup of bug 145754.

*** This bug has been marked as a duplicate of 145754 ***

Comment 8 Boris Bokowski

2006-10-26 08:52:32 EDT

(In reply to comment #7)
> Before we go deeper: can you repost the scenario i.e. was the file not opened
> at all?

See Bug 162216.  The problem was caused by an O Umlaut in a file. No problem under Windows with Cp1252, but after checking out the file on a Linux box (default encoding UTF-8), the compiler wouldn't compile the file, and the editor does not open the file. It shows an error message (Character Encoding Problems/The file is unreadable using the "UTF-8" character encoding) and allows you to set the encoding.

When this happens, you can only guess what the correct encoding is, and you have no help as to which character(s) caused this problem.  It happened to us twice in the last couple of weeks, and it is hard to figure out what's going on for the developer who uses Linux because he didn't check in that file in the first place.
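
The scenario can be reproduced outside the editor with a strict decode. The following is a minimal sketch: the class StrictDecodeCheck and the method isReadableAs are hypothetical names, and the three-byte input stands in for the real file. A lone 0xD6 byte (O umlaut in Cp1252 and in ISO-8859-1) is rejected by a strict UTF-8 decoder, while ISO-8859-1 accepts it. Since ISO-8859-1 accepts every byte, a successful decode only shows that an encoding is possible, not that it is the right one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Eclipse.
final class StrictDecodeCheck {

    // Returns true if the bytes decode without any error in the given charset.
    static boolean isReadableAs(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] fileBytes = { 'B', (byte) 0xD6, 'B' }; // stand-in for the checked-out file
        System.out.println("UTF-8: " + isReadableAs(fileBytes, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + isReadableAs(fileBytes, StandardCharsets.ISO_8859_1));
    }
}
```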

Comment 9 Dani Megert

2006-10-26 09:03:41 EDT


*** This bug has been marked as a duplicate of 145754 ***

Comment 10 Dani Megert

2006-10-26 09:07:49 EDT

That's exactly what bug 145754 is about - so please leave this one closed as duplicate.

Stupid question: if your team works with different platforms why don't you specify the default encoding in the project and store this in the repository? Just wondering.

Comment 11 Dani Megert

2006-10-26 09:12:20 EDT

Also note that there might be corrupt files in the repository from earlier days where we didn't warn if invalid characters got stored into a file.

Comment 12 Boris Bokowski

2006-10-26 11:08:58 EDT

(In reply to comment #10)
> That's exactly what bug 145754 is about - so please leave this one closed as
> duplicate.

Sorry about that - I must have missed the comment about the "deluxe solution" which is probably what Olivier and I are suggesting.

> Stupid question: if your team works with different platforms why don't you
> specify the default encoding in the project and store this in the repository?

Because the team was happy with 7-bit-ASCII until we started getting patches from contributors with Umlauts in their names.  :)

Comment 13 David Williams

2006-10-26 12:11:19 EDT

> 
> Because the team was happy with 7-bit-ASCII until we started getting patches
> from contributors with Umlauts in their names.  :)
> 

Yes, we in WTP ran into that as well ... and we set each project to have its default setting. We used ISO-8859-1, since that's what the Eclipse Foundation infrastructure uses on its CVS server. (Which then might require you to specify ISO-8859-1 when you build, since, I've heard, your Eclipse project build machines use UTF-8.)