Bug 116841 - [typing] Need much smart encoding error detection for text save
Summary: [typing] Need much smart encoding error detection for text save
Status: VERIFIED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Text (show other bugs)
Version: 3.1   Edit
Hardware: PC Windows XP
: P2 critical (vote)
Target Milestone: 3.2 M6   Edit
Assignee: Dani Megert CLA
QA Contact:
URL:
Whiteboard:
Keywords: api
: 128026 (view as bug list)
Depends on:
Blocks: 127735
  Show dependency tree
 
Reported: 2005-11-17 05:18 EST by Hirotaka Matsumoto CLA
Modified: 2006-06-02 07:12 EDT (History)
6 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Hirotaka Matsumoto CLA 2005-11-17 05:18:44 EST
When the document has the characters which can't be converted to
the specified encoding and it is saved, no error is reported so that
it gives the users that the save is ok. However, when the editor is
closed and the same document is opened again, such characters are all
changed to '?'. ( This is because java converter changes all such
characters to '?' ). For example, if Window > Preferences > 
Workbench > Editors > Text file encoding is set to US-ASCII, and
the text file is opened by TextEditor and the non US-ASCII characters
are entered ( all they are doable now ), and finally that file is
saved, no error is reported. However, after that document is closed
and is opened again, there are '?' characters.

This is a surpirse for the users so that it would be best if
the save detected this kind of invalid character errors and reported
this to the users. For example, OutputStreamWriter#write throws
UnmappableCharacterException, so we can detect this kind of
encoding errors.
Comment 1 Dani Megert CLA 2005-11-17 05:33:05 EST
>the non US-ASCII characters
>are entered ( all they are doable now )
How do you do it? Copy/Paste?

The only way to detect this is to re-read the saved file and then compare it to
the editor content. This significantly slows down save performance.
Comment 2 Hirotaka Matsumoto CLA 2005-11-17 20:02:43 EST
> How do you do it? Copy/Paste?

If you are using non US locale OS, you can do it. In my
environment, I'm always using Japanese Win OS locale and
I can set up US-ASCII to the preference though TextEditor
can accept Japanese characters from IM.

> The only way to detect this is to re-read the saved file

OutputStreamWriter can throw the exception. There is no need
to re-read the file.
Comment 3 Dani Megert CLA 2005-11-18 05:24:33 EST
What do other editors do in that situation (e.g. Notepad)?

>OutputStreamWriter can throw the exception. There is no need
>to re-read the file.
AFAIK it
- throws an exception upon creation if the encoding itself is not supported
- IOException if the writing fails
Can you point me to the method (and spec) that says the writer is checking
whether a character is legal in the specified encoding.
Comment 4 Hirotaka Matsumoto CLA 2005-11-20 21:37:22 EST
- Notepad warns this mis-match encoding and suggests us to use the
  unicode. Also other commercial editors check this encoding error.
- OutputStream<init>(OutputStream out, CharsetEncoder enc) accepts
  CharsetEncoder. According to CharsetEncoder, its doc says :
    How an encoding error is handled depends upon the action requested
    for that type of error, which is described by an instance of the   
    CodingErrorAction class
  And CodingErrorAction#REPORT doc says :
    Action indicating that a coding error is to be reported, either
    by returning a CoderResult object or by throwing a 
    CharacterCodingException, whichever is appropriate for the method
    implementing the coding process.

  I think they are good starting point for you to investigate
  this problem.

.

Comment 5 David Williams CLA 2005-11-20 21:50:35 EST
Just FYI, I agree that saving text, and "losing data" due to charset mismatch, without warning user really is "losing data". We used to handle/gaurd against this in SSE layer ... but, since "the platform" now handles encoding/charset "the platform" is the right place to handle this. I'd suggest a user choice "cancel save" or "save anyway" sort of choice. (And, of course, its really that "model layer" that needs to throw an exception, and let the UI layer decide how to handle it). 




Comment 6 Hirotaka Matsumoto CLA 2005-11-20 22:39:28 EST
(In reply to comment #5)
> I'd suggest a user choice
> "cancel save" or "save anyway" sort of choice.

I think this choice is enough for users to see there are some
encoding error so that this 'save' may cause some data-loss
error.

.
Comment 7 Dani Megert CLA 2006-02-16 06:02:14 EST
*** Bug 128026 has been marked as a duplicate of this bug. ***
Comment 8 Dani Megert CLA 2006-02-24 10:56:02 EST
This is now correctly handled by file buffers and document providers: they throw a CoreException with a custom status code if the mapping fails.
  - for the file buffers: IFileBufferStatusCodes.CHARSET_MAPPING_FAILED
  - for document providers: EditorsUI.CHARSET_MAPPING_FAILED

The editor reports this to the user and suggests to either change the encoding or the offending characters.

Fixed in HEAD.
Available in builds > N20060224-0010.
Comment 9 John Arthorne CLA 2006-02-24 15:25:01 EST
Out of curiosity, do you need to define the status constant as API in two places? Presumably any client of EditorsUI will also have the filebuffers plugin available?
Comment 10 Dani Megert CLA 2006-02-27 04:25:05 EST
John, as you can see from IStatus.getCode():
	/**
	 * Returns the plug-in-specific status code describing the outcome.
	 *
	 * @return plug-in-specific status code
	 */
	public int getCode();

the value space for the code is per plug-in. If we used the same value in file buffers and Editors UI we'd have to make sure the codes don't overlap and that's a burden we don't want to take.
Comment 11 John Arthorne CLA 2006-02-27 12:20:32 EST
Ok, makes sense.  CCing Mcq for PMC approval of this new API.  The new API is two status code constants (ints).  The constants allow a client to distinguish a generic failure from failure to save due to presence of characters that can not be saved in the chosen encoding.
Comment 12 Dani Megert CLA 2006-02-27 12:29:31 EST
>Ok, makes sense.  CCing Mcq for PMC approval of this new API.
Note that one constant is added to a final class and the other constant is added to a constant interface which is marked as:
 * <p>
 * Clients are not supposed to implement that interface.
 * </p>
and also correctly listed in the component.xml. That's why I did not raise this as being an API change.
Comment 13 Mike Wilson CLA 2006-02-27 14:23:09 EST
Makes sense to me.
Comment 14 Tobias Widmer CLA 2006-03-28 06:01:34 EST
Verified using I20060328-0010