Bug 237567 - [document] html editor mishandles encoding - it may destroy file
Status: RESOLVED FIXED
Alias: None
Product: WTP Source Editing
Classification: WebTools
Component: wst.html
Version: 2.0.2
Hardware: PC Windows XP
Importance: P2 major with 4 votes
Target Milestone: 3.5 M3
Assignee: Nick Sandonato CLA
QA Contact: Nitin Dahyabhai CLA
URL:
Whiteboard:
Keywords: helpwanted
Depends on:
Blocks:
 
Reported: 2008-06-17 21:42 EDT by Toshihiro Izumi CLA
Modified: 2012-10-24 15:02 EDT

See Also:


Attachments
Preferences - workspace (23.35 KB, image/png)
2008-06-17 21:43 EDT, Toshihiro Izumi CLA
Properties - file (18.43 KB, image/png)
2008-06-17 21:44 EDT, Toshihiro Izumi CLA
First - create > edit (12.73 KB, image/png)
2008-06-17 21:44 EDT, Toshihiro Izumi CLA
Second - save > close > open (12.64 KB, image/png)
2008-06-17 21:45 EDT, Toshihiro Izumi CLA
Third - save > close > open (12.72 KB, image/png)
2008-06-17 21:46 EDT, Toshihiro Izumi CLA
sample code (EncodingGuesser.java) (8.54 KB, text/plain)
2010-04-29 02:20 EDT, Toshihiro Izumi CLA

Description Toshihiro Izumi CLA 2008-06-17 21:42:29 EDT
Environment:
WindowsXP SP3 Japanese - Native encoding is MS932(Shift JIS)
Eclipse 3.3.2
WST 2.0.2

Steps to reproduce:
0. Create new workspace (for initialization)
1. Open Preferences > General > Workspace, set Text file encoding to 'UTF-8' (default is MS932)
2. Create Static Web Project
3. Create html file (New > File)
4. Input Japanese characters in the editor
5. Save it and close editor
6. Reopen it
The characters are now garbled (mojibake).

Problematic conditions:
a) The workspace encoding is set to something other than the OS default
b) No explicit file encoding is set (default -> determined from content)
c) The file has no meta tag (charset)

The editor opens the file using the encoding 'determined from content', which is Shift_JIS.
But it saves the file using the workspace encoding, which is UTF-8.
This can damage files beyond recovery.
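
The open/save mismatch described above can be reproduced outside Eclipse. The following is a minimal sketch, not the editor's actual code: the class and method names are hypothetical, and it assumes the JDK in use ships the Shift_JIS charset (full JDKs normally do). It decodes UTF-8 bytes as Shift_JIS, as the editor does on open, and re-encodes the result as UTF-8, as the editor does on save.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: mimics one "open, then save" cycle with mismatched charsets.
final class MojibakeRoundTrip {

    // Decode with the charset the editor guessed, then encode with the
    // charset the workspace is configured to save with.
    static byte[] openAndSave(byte[] fileBytes, Charset guessed, Charset workspace) {
        String shownInEditor = new String(fileBytes, guessed);
        return shownInEditor.getBytes(workspace);
    }

    public static void main(String[] args) {
        // Assumption: Shift_JIS is available in this JDK.
        Charset sjis = Charset.forName("Shift_JIS");
        // Three kanji ("Nihongo"), written as escapes to keep the source ASCII-only.
        String typed = "\u65E5\u672C\u8A9E";

        byte[] firstSave = typed.getBytes(StandardCharsets.UTF_8);
        byte[] secondSave = openAndSave(firstSave, sjis, StandardCharsets.UTF_8);
        byte[] thirdSave = openAndSave(secondSave, sjis, StandardCharsets.UTF_8);

        System.out.println("first save:  " + firstSave.length + " bytes");
        System.out.println("second save: " + secondSave.length + " bytes");
        System.out.println("third save:  " + thirdSave.length + " bytes");
        System.out.println("unchanged after one cycle: "
                + Arrays.equals(firstSave, secondSave));
    }
}
```

Because `new String(bytes, charset)` silently substitutes a replacement character for byte sequences it cannot decode, the original bytes cannot in general be reconstructed from the re-saved file.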
Comment 1 Toshihiro Izumi CLA 2008-06-17 21:43:24 EDT
Created attachment 105234 [details]
Preferences - workspace
Comment 2 Toshihiro Izumi CLA 2008-06-17 21:44:08 EDT
Created attachment 105235 [details]
Properties - file
Comment 3 Toshihiro Izumi CLA 2008-06-17 21:44:49 EDT
Created attachment 105236 [details]
First - create > edit
Comment 4 Toshihiro Izumi CLA 2008-06-17 21:45:34 EDT
Created attachment 105237 [details]
Second - save > close > open

(file is UTF-8 but editor shows it as Shift_JIS)
Comment 5 Toshihiro Izumi CLA 2008-06-17 21:46:21 EDT
Created attachment 105238 [details]
Third - save > close > open

(the editor saved the garbled (misconverted) characters; the damage grows with each save)
Comment 6 Toshihiro Izumi CLA 2008-06-23 02:57:36 EDT
The direct cause is that org.eclipse.wst.html.core.internal.contenttype.EncodingGuesser does not support UTF-8.
I don't know the whole logic, though.
Comment 7 David Williams CLA 2008-07-05 00:54:56 EDT
pretty sure this isn't intended for "3.0 Patch" ... I'm guessing you meant 3.0.1, Nitin. 

Comment 8 Nick Sandonato CLA 2010-04-28 16:06:25 EDT
I'm definitely not familiar with the logic behind these tables. I assume it's some kind of mapping between ASCII and kanji?
Comment 9 Toshihiro Izumi CLA 2010-04-29 02:20:07 EDT
Created attachment 166430 [details]
sample code (EncodingGuesser.java)

UTF-8 and MS932 cannot be reliably told apart automatically because their byte ranges overlap.
Reading all of the characters and searching for the best-matching encoding may be a 'best effort'. (org.eclipse.wst.html.core.internal.contenttype.EncodingGuesser returns only the first encoding it detects.)
It is a 'best effort' but not perfect. No perfect method exists because of the overlap.
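
One way to read the whole content, as suggested above, is to decode it strictly with each candidate charset and keep only the candidates that decode without error. The sketch below is an illustration under stated assumptions, not the attached sample code and not what EncodingGuesser does: the class and method names are hypothetical, and it assumes the JDK provides Shift_JIS. When more than one candidate survives (plain ASCII content is the simplest case), the overlap problem remains and some other rule still has to pick one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: filters candidate charsets by strict decoding.
final class StrictDecodeCheck {

    // True if every byte sequence in content is well-formed and mappable
    // in the given charset (errors are reported instead of replaced).
    static boolean decodesCleanly(byte[] content, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns the candidates that survive strict decoding, in input order.
    static List<Charset> plausibleCharsets(byte[] content, List<Charset> candidates) {
        List<Charset> result = new ArrayList<>();
        for (Charset candidate : candidates) {
            if (decodesCleanly(content, candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: Shift_JIS is available in this JDK.
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.UTF_8, Charset.forName("Shift_JIS"));
        byte[] content = "\u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_8);
        System.out.println(plausibleCharsets(content, candidates));
    }
}
```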

I'm sorry that I cannot contribute a patch. I don't need the auto-detection feature, so *my* hope is to have an option that enables/disables auto-detection. I will disable it forever.
Or it should be possible to choose between 'determined from contents', 'inherited from container', 'inherited from content-type', or...
Comment 10 Toshihiro Izumi CLA 2010-06-12 22:26:23 EDT
juniversalchardet - Project Hosting on Google Code
http://code.google.com/p/juniversalchardet/

Is it possible to use this?
Comment 11 David Williams CLA 2010-06-12 23:56:52 EDT
I'm not sure what's going on here (I have only glanced at the bug, and it has been a while since I looked at the code), but I can describe a bit of how it's supposed to work, which might at least clarify things.

You say "The direct cause is because org.eclipse.wst.html.core.internal.contenttype.EncodingGuesser does not support UTF-8" but I think that EncodingGuesser is supposed to detect only a few old cases of Kanji and such based on the bit patterns, and if it does not find a positive "hit", then the other, more modern ways of determining encoding are used.

So, are you saying EncodingGuesser is returning the wrong value? Or ... just returning no value (which is what I'd expect)?

Next, ideally, the Eclipse code detects a <meta> tag with a charset set in the file itself (it is best to declare the file's encoding in the file itself). It sounds like you are not using that?

If it finds no <meta> tag in the HTML file ... well, here my memory gets fuzzy, and I guess there's no reason for me to guess. I suggest using the <meta> tag, and I would be interested to hear if/why that wouldn't work for you.

I'm not sure if anyone will be motivated to investigate/use the Mozilla code to guess the encoding. It could be great. But there is some danger of subtly changing what is detected, which might "break" clients from previous releases. (I don't mean to be negative ... whether it can be used depends on the license, etc., and if anyone wanted to investigate the functionality, that'd be great.)

So, let us know if using charset in meta tag doesn't work for you.
Comment 12 Ed Wright CLA 2010-06-14 02:29:26 EDT
Using linux (basically Debian Lenny)
Eclipse 3.5.2
LANG=ja_JP.utf8

I have set default content type to "UTF-8" in Preferences:General:Content types:Text:HTML

I have set Encoding to UTF-8 in the containing directory.

However, when creating a completely empty anyname.html file, the encoding is set to Shift_JIS.

I use many include files, and I cannot use a <meta...> tag in those files.

I'm not familiar with Eclipse internals, but seems to me if the EncodingGuesser is unable to determine the encoding, as it surely cannot on an empty file, it would be better to fall back to one of the above, or enable the options Toshihiro suggested (2010-04-29) rather than defaulting to Shift_JIS.
Comment 13 David Williams CLA 2010-06-14 03:20:41 EDT
> 
> However, when creating a completely empty anyname.html file, the encoding is
> set to Shift_JIS.
> 
> I use many include files, and I cannot use a <meta...> tag in those files.
> 
> I'm not familiar with Eclipse internals, but seems to me if the EncodingGuesser
> is unable to determine the encoding, as it surely cannot on an empty file, it
> would be better to fall back to one of the above, or enable the options
> Toshihiro suggested (2010-04-29) rather than defaulting to Shift_JIS.


Ok, vague memory round two ... I seem to recall some old requests (like 5 or 6 years ago) that for HTML, if the native machine had a Japanese charset and the content type was not in the file, then the native encoding be used, since many people had thousands of existing HTML files that depended on that (HTML was late in coming to standards, and the HTML files were created before Eclipse). Bottom line: have you tried starting Eclipse with a system property of -Dfile.encoding=utf8? I realize that in your last post you said you had LANG=ja_JP.utf8 set on your system ... but I simply don't know what that means, or if Java uses it ... whereas the file.encoding property might be enough to work around this old funky behavior?

Hope this helps ... and btw, I'm not saying this is the right way it should work, or that it couldn't be improved ... but if this works for you, that would help us understand the problem better, and maybe get you one step closer to a workable system?
Comment 14 Ed Wright CLA 2010-06-14 05:01:43 EDT
(In reply to comment #13)

> Ok, vague memory round two ... I seem to recall some old requests (like 5, 6
> years ago) that for html, if the native machine had a japanese charset, and 
> the content type was not in the file, then the native encoding be used,

Linux doesn't really have a "native encoding" per se.... it uses the concept of locale for various language related settings including encoding. Since I've set my locale to ja_JP.utf8 then I guess the "native encoding" should be UTF-8 Japanese.

> bottom line, have you tried starting eclipse with a system property
> of -Dfile.encoding=utf8?  

I have now... it has no effect... files are still created with encoding set to Shift_JIS.

Dunno if it's relevant, but I did some googling and came up with:

http://bugs.sun.com/view_bug.do?bug_id=4163515

which states in part:

 "The "file.encoding" property is ... an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only...

The preferred way to change the default encoding used by the VM and the runtime
system is to change the locale of the underlying platform before starting your
Java program."

> I realize on your last post you said you had
> LANG=ja_JP.utf8 set on your system 

This is one of the locale settings. I have also tested with LC_ALL=ja_JP.utf8 which should provide a complete Japanese UTF-8 environment. But this also has no effect.

Not sure where Eclipse takes over from Java, but it seems likely that the fallback encoding (Shift_JIS) would be an Eclipse thing rather than Java?
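
To see what the JVM itself reports for the settings discussed in this comment, a small diagnostic can be run with the same JVM and the same launch options as Eclipse. This is only a sketch with a hypothetical class name; it shows the JVM's view (the `file.encoding` and `user.language` system properties, the default charset, and the default locale) and says nothing about which fallback the HTML editor applies.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Illustrative only: prints the JVM's own view of encoding and locale.
final class JvmEncodingReport {

    // Collects the encoding- and locale-related values into one line.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name()
                + ", user.language=" + System.getProperty("user.language")
                + ", defaultLocale=" + Locale.getDefault();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```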
Comment 15 Motoi Washida CLA 2012-09-02 14:20:11 EDT
I also have the same issue.

System
--------------------------------

OS: Mac OS X 10.7.4
Java Runtime: 1.6.0_33
Eclipse Platform: 4.2.0.I20120608-1400
WST: 3.4.0.v201203141800


Language Settings
--------------------------------

- Logged in as Japanese user.
- Workspace encoding is set to UTF-8.
- file.encoding is SJIS.
    - I set file.encoding (by appending to eclipse.ini) to UTF-8, but the result was the same.
- user.language is ja.


I'm using Play Framework (2.0.3), which looks for app/views/*.scala.html files as HTML templates (many of them are fragments of HTML). When I open UTF-8 encoded templates, they are opened as SJIS files.

I know this auto-detection can be disabled by setting the "user.language" system property to "en" (it seems that EncodingGuesser#canGuess() returns true only if the system locale is Japanese). But I think it would be better to provide an option to disable EncodingGuesser, as Toshihiro suggested.
Comment 16 Nick Sandonato CLA 2012-10-24 15:02:28 EDT
I've pushed changes based on the suggestions made in the comments of this defect. In the case that it can't reasonably guess the encoding of a file, it falls back to checking the preferences found under Loading Files on the HTML Files preference page, and the Workspace defaults. And, as an added precaution, I added a check for a system property "org.eclipse.wst.html.encoding.guess", which can be set to "false" if one does not want any encoding guessing done at all, as requested.

As was pointed out in comment 15, the encoding guesser will only run if the system locale is Japanese.
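
The decision order described in this comment can be sketched roughly as follows. This is not the committed code: the class, the method names, and the parameters are hypothetical, and only the property name "org.eclipse.wst.html.encoding.guess" comes from this comment. Comparing the property value case-insensitively and testing the locale by its language code "ja" are also assumptions. The property would typically be passed as a JVM argument, for example `-Dorg.eclipse.wst.html.encoding.guess=false` after `-vmargs` in eclipse.ini.

```java
import java.util.Locale;

// Illustrative only: the fallback order described in the comment above.
final class EncodingFallbackSketch {

    // Property name taken from the comment; everything else is hypothetical.
    static final String GUESS_PROPERTY = "org.eclipse.wst.html.encoding.guess";

    // Guessing runs only if it is not switched off and the locale is Japanese.
    static boolean guessingAllowed(String propertyValue, Locale systemLocale) {
        if ("false".equalsIgnoreCase(propertyValue)) {
            return false;
        }
        return "ja".equals(systemLocale.getLanguage());
    }

    // Pass guessed as null when guessing was skipped or found nothing.
    // Falls back to the HTML Files "Loading Files" preference, then to
    // the workspace default.
    static String chooseEncoding(String guessed, String htmlLoadingPreference,
                                 String workspaceDefault) {
        if (guessed != null) {
            return guessed;
        }
        if (htmlLoadingPreference != null && !htmlLoadingPreference.isEmpty()) {
            return htmlLoadingPreference;
        }
        return workspaceDefault;
    }

    public static void main(String[] args) {
        boolean allowed = guessingAllowed(System.getProperty(GUESS_PROPERTY),
                Locale.getDefault());
        System.out.println("guessing allowed: " + allowed);
        System.out.println("encoding: " + chooseEncoding(null, "", "UTF-8"));
    }
}
```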

http://git.eclipse.org/c/sourceediting/webtools.sourceediting.git/commit/?id=1c861a2e4abcdeeddb228c3bbcefe80d8f72abbe

http://git.eclipse.org/c/sourceediting/webtools.sourceediting.git/commit/?id=1e464cc79cc1b0ae733ca7822893ef6bbcf0c046