Bug 136854

Summary:	[api][encoding] Autodetect files encoding and force BOM save to files
Product:	[Eclipse Project] Platform	Reporter:	Marc Bauer <marc.bau>
Component:	Resources	Assignee:	Platform-Resources-Inbox <platform-resources-inbox>
Status:	NEW ---	QA Contact:
Severity:	enhancement
Priority:	P4	CC:	daniel_megert, eclipse, kashihara, pawel.pogorzelski1, sebastianzartner
Version:	3.1.2
Target Milestone:	---
Hardware:	PC
OS:	Windows XP
Whiteboard:

Description Marc Bauer

2006-04-14 11:57:34 EDT

Daniel Megert forced me in https://bugs.eclipse.org/bugs/show_bug.cgi?id=78455 to issue a new bug for this.

Primary Issue:
We have BIG trouble with missing BOMs in UTF-8 files.

Cause of Problem:
We come from Dreamweaver and are developing ColdFusion apps... All old
development was done in Homesite <=5 and Dreamweaver <=6.1 and therefore the file encoding is "windows 1252".

Problem description:
So we tried to change the workspace to UTF-8 so that new files are created only in UTF-8, to get the apps internationalized in future. After we changed this setting, all files in the workspace get destroyed when they contain German umlauts like ö ä ü or other non-ASCII characters.

If I change the workspace back to windows-1252 and create new files, they are not UTF-8, which results in corrupted characters in the ColdFusion web app (website).

What we expect and will solve the problem for future:
1. we must create new files only in UTF-8 *with* BOM (ColdFusion requires this - Dreamweaver and others, too).

2. Encoding autodetection should be in place for older windows-1252 files, and they should be opened/saved as windows-1252 regardless of the workspace encoding they would otherwise inherit.

If detection is not possible in this way, we must keep the workspace at windows-1252 until all files are converted, and new files must be created as UTF-8 and detected as UTF-8 when opened/saved, not treated as the workspace encoding (windows-1252).

(*Critical* - Eclipse destroys all our files without our knowledge if the workspace is currently UTF-8).

3. An additional feature should be a conversion popup that appears if the opened file does not have the same encoding as the current workspace, and that converts the file to the current workspace encoding once the user confirms.

4. New files must be saved as UTF-8 with a BOM by default, every time! The specification (http://publib.boulder.ibm.com/infocenter/wasinfo/v6r0/index.jsp?topic=/com.ibm.websphere.base.doc/info/aes/ae/cwbs_wsiprofile.html) tells us the BOM in UTF-8 is optional, but these problems show that a BOM is a must-have and should always be in the files. Marking the BOM as only "optional" for UTF-8 looks to me like a backward-compatibility concession for Java and nothing more, so we should move forward and always save it. Who likes to use 10-year-old software? I don't!

5. Additionally, I found the same problem with UTF-16 in Eclipse. IBM
wrote in http://publib.boulder.ibm.com/infocenter/wasinfo/v6r0/index.jsp?topic=/com.ibm.websphere.base.doc/info/aes/ae/cwbs_wsiprofile.html
that there must be a BOM in UTF-16, but Eclipse does not always write it. Only the
UTF-16 setting saves a BOM to the files; UTF-16LE and UTF-16BE do not!

What we must do today:
Wait until this bug is fixed and go back to Dreamweaver, which supports BOMs in UTF-8 by default.

Regards
Marc

Comment 1 Dani Megert

2006-04-17 15:12:35 EDT

>Daniel Megert forced me in https://bugs.eclipse.org/bugs/show_bug.cgi?id=78455
>to issue a new bug for this.
Not true. I didn't force anybody; I just indicated that you might raise this again. So please stick to the facts.

Moving to Platform Resource since encoding detection is there.

Comment 2 Rafael Chaves

2006-04-17 17:34:27 EDT

The support for encoding in the workspace is based on what is available from Java. For any given resource in the workspace, it is possible to obtain a charset string that can be used with any Java APIs that take charset strings. Examples are: 'US-ASCII', 'UTF-8', 'Cp1252', 'UTF-16' (Big Endian, BOM inserted automatically), 'UTF-16BE' (Big Endian, BOM not inserted automatically), 'UTF-16LE' (Little Endian, BOM not inserted automatically).

With this mindset, I will now try to address your points one by one:

1) For Java encodings, except for the 'UTF-16' encoding, BOMs are not inserted (when writing) or discarded (when reading) for free. Even if this is puzzling to end users, this is how all Java applications work. If applications want to support creating UTF-8 files with BOMs to match their users' expectations, they need to provide such capability on their own (as neither Java nor the Resources model will help with that). Eclipse does provide some improvements towards detecting BOMs, but not with generating or skipping them.

2) I don't know of any way to reliably detect the encoding of a file unless it has a BOM or declares it explicitly (as in the <?xml ...?> declaration).

3) This does not belong here. You might want to raise this as a separate bug against Platform/Text, but my personal opinion is that this is bogus. Users have files with different encodings in their workspace. Different file formats and applications demand the use of different encodings.

4) see point #1 above. You might want to open an issue with Sun.

5) I mentioned this in point #1 above. UTF-16LE for Java apps (such as Eclipse) explicitly means: UTF-16 little endian *without* a BOM. This is how any Java app taking charset names works; we cannot be different.
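
The behaviour described in points 1 and 5 can be observed with a small standalone Java sketch. It is only an illustration, not Eclipse code: the class name `BomWriteDemo` and its helper methods are hypothetical, and it uses `java.nio.charset.StandardCharsets`, which is only available in Java 7 and later. It encodes the single character ö with several charset names and then shows one way an application could prepend the UTF-8 BOM itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomWriteDemo {
    // The three bytes of the UTF-8 byte order mark.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Encode text with a charset looked up by name, exactly as Java does it.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Hypothetical helper: the application adds the BOM, because the
    // "UTF-8" charset will not do it.
    static byte[] encodeUtf8WithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        return out;
    }

    // Format bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00f6"; // the umlaut ö
        System.out.println("UTF-8:    " + hex(encode(text, "UTF-8")));    // C3 B6
        System.out.println("UTF-16:   " + hex(encode(text, "UTF-16")));   // FE FF 00 F6
        System.out.println("UTF-16BE: " + hex(encode(text, "UTF-16BE"))); // 00 F6
        System.out.println("UTF-16LE: " + hex(encode(text, "UTF-16LE"))); // F6 00
        System.out.println("UTF-8 + manual BOM: " + hex(encodeUtf8WithBom(text))); // EF BB BF C3 B6
    }
}
```

Only the 'UTF-16' name produces a BOM on its own; for every other name shown here the BOM has to be written by the application.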

Comment 3 Marc Bauer

2006-04-18 02:55:56 EDT

hi

I'm not that deep into encodings and what Java does out of the box. What I do know is that ColdFusion and some other apps require a BOM, and ColdFusion is certainly based on JRun, which is pure Sun Java v1.4.2. So it looks like Java knows about BOMs very well!?

IBM's specification additionally lists, in the link:

Bytes        Encoding
EF BB BF     UTF-8
FF FE        UTF-16, Little-Endian
FE FF        UTF-16, Big-Endian
00 00 FE FF  UTF-32, Big-Endian
FF FE 00 00  UTF-32, Little-Endian

They explain that the BOM is optional for UTF-8 and obligatory for UTF-16. Optional means, as I understand it, that it may or may not be present. From my point of view, "optional" sounds like there are compatibility issues with older apps, and is therefore a workaround for apps that do not support BOMs today but will in future. In other words, they don't want to revoke the legitimacy of not using BOMs, for the sake of older apps that don't use them :-). Isn't that so? Do you have a different specification?

regarding 2)
If we use the above list of BOMs and this IBM specification, we have this problem fixed for now and the future. If I then change the workspace to "Windows 1252" and every UTF-8, UTF-16 Little-Endian, UTF-16 Big-Endian, UTF-32 Big-Endian, and UTF-32 Little-Endian file has the correct BOM, you are able to detect all these file encodings alongside "windows 1252" 100% correctly, whatever the workspace says, aren't you? And this is what Dreamweaver has been doing for years...
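
The detection proposed here could be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not Eclipse's implementation: `BomSniffer`, `detect` and `startsWith` are hypothetical names, and the caller is assumed to pass in the first few bytes of the file plus a fallback (for example the workspace encoding). The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks because FF FE 00 00 also begins with FF FE.

```java
class BomSniffer {
    // Return the encoding implied by a leading BOM, or the fallback
    // (e.g. the workspace default such as "windows-1252") if there is none.
    static String detect(byte[] head, String fallback) {
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return fallback;
    }

    // True if data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Two caveats apply to this sketch: a file without a BOM still cannot be classified, so the fallback is only a guess, and a UTF-16 Little-Endian file whose first character is U+0000 would be reported as UTF-32LE.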


Regards
Marc

Comment 4 Marc Bauer

2006-04-18 03:11:21 EDT

Additional Info: http://www.unicode.org/unicode/faq/utf_bom.html#22

snipped out of the text:
1. a BOM can be used as a signature no matter how the Unicode text is transformed

2. BOM can be used as a signature. If there is no BOM, the encoding could be anything

Comment 5 John Arthorne

2006-04-18 11:09:35 EDT

Note that the "IBM specification" you reference is actually a WS-I (web services interoperability) specification. I don't see how this is relevant to the behaviour of Eclipse. I.e., the existence of a specification in some application domain requiring a certain encoding scheme does not necessarily hold true in other domains for which Eclipse is used.  I suspect there are other tools that would choke on additional BOMs in UTF-8 files, so there is likely no solution that will satisfy all possible readers of files developed in Eclipse.

Comment 6 Rafael Chaves

2006-04-18 12:42:51 EDT

Note that if a file has a BOM, Eclipse applications are expected to preserve it when saving the file back. Not doing so is wrong, and a bug should be opened against the provider of the plug-in/application.
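
As an illustration of what "preserving the BOM" can mean for a UTF-8 file, here is a minimal Java sketch. It is not taken from any Eclipse plug-in: the class `Utf8Document` and its members are hypothetical, and it assumes Java 7 or later for `StandardCharsets`. The loader remembers whether a BOM was present and strips it from the text; the saver writes it back exactly once, only if the file had one.

```java
import java.nio.charset.StandardCharsets;

class Utf8Document {
    private static final byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    final boolean hadBom; // whether the file on disk started with a BOM
    String text;          // decoded content, never containing the BOM

    private Utf8Document(boolean hadBom, String text) {
        this.hadBom = hadBom;
        this.text = text;
    }

    // Decode raw file bytes, skipping a leading BOM if there is one.
    static Utf8Document load(byte[] raw) {
        boolean hadBom = raw.length >= 3
                && (raw[0] & 0xFF) == 0xEF
                && (raw[1] & 0xFF) == 0xBB
                && (raw[2] & 0xFF) == 0xBF;
        int offset = hadBom ? BOM.length : 0;
        String text = new String(raw, offset, raw.length - offset, StandardCharsets.UTF_8);
        return new Utf8Document(hadBom, text);
    }

    // Encode the current text, restoring the BOM only if it was there before.
    byte[] save() {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        if (!hadBom) return body;
        byte[] out = new byte[BOM.length + body.length];
        System.arraycopy(BOM, 0, out, 0, BOM.length);
        System.arraycopy(body, 0, out, BOM.length, body.length);
        return out;
    }
}
```

Because the BOM is kept out of `text`, editing and saving repeatedly cannot produce a file with two BOMs.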

Also, I am not denying that UTF-16 mandates a BOM to be present. I was just explaining what the UTF-16LE and UTF-16BE charset names mean in the context of the Java platform. You are right when you claim that this goes against the spec, and if you search Sun's bug database you will see that there are lots of issues in this area. For Java 6 (Mustang), they considered supporting skipping UTF-8 BOMs, but the change had to be backed out because it broke all existing applications that were working around that flaw themselves:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058

Also, for Mustang, expect a new charset to be supported: UTF_16LE_BOM

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6230129

That will give you the ability to have UTF-16 Little Endian files with BOMs (you already have that for Big Endian - just use UTF-16).

Comment 7 Marc Bauer

2006-04-18 14:29:34 EDT

Thank you for the links. I know the IBM document is only for web services, but the Unicode page is general, and the IBM table was convenient to copy and paste here :-).

I learned today that these bugs have been known since Sep 27, 2001, which only makes me more worried... It is good to know this is partly fixed in Java 6, but that is currently in beta and I don't see the day when Adobe and others will have moved their apps to this new runtime. So there is no fix in the foreseeable future.

Currently we don't need UTF_16LE_BOM; what we need today is UTF_8_BOM :-). So if you see a way to implement this in Eclipse as a special user-selectable encoding, that may be a good workaround, mightn't it? Later, if Sun fixes this, you can detect the Java runtime version and switch the underlying function to the standard way... but until then we would have a well-working solution!

I'm not that deep into Java, but it sounds very strange to have the same encodings across different programming languages sometimes with a BOM and sometimes without. It is scary to learn that UTF-8 is not UTF-8. So as a programmer you can never trust anything; this is really annoying and won't help any "standard"! Until now I really thought UTF-8 was a standard. It looks like it is not.

In the end: what will happen now?

Should I beg *every* plugin maintainer, "PLEASE implement a checkbox for BOM support"? Or should I create blank UTF-8 files in Dreamweaver and open them in Eclipse... that sounds a little naive.

Or will you implement such an encoding type named "UTF_8_BOM", which sounds like a very good idea... if I may say so!

Comment 8 Marc Bauer

2006-04-18 14:36:23 EDT

About this "please every maintainer": in the last few weeks I learned that some of them implemented such a feature and then removed it after a short time, because they ended up with two BOMs inside one file and similar problems... It would really be better to have one way that works for all plugins and the whole app. So far I haven't found such a checkbox in any plugin I'm using (e.g. CFEclipse, Eclipse Web Tools Platform Project).

this UTF_8_BOM sounds here as "the" solution...

Comment 9 Sebastian Zartner

2009-04-18 16:42:10 EDT

I am also programming in ColdFusion at work and had the same problems as Marc. It is very annoying when file encodings are not saved correctly and thereby translation files are destroyed.
Besides that I am also programming a Firefox plugin (Regular Expressions Tester) and had problems with the BOM (even in Firefox 3.0.8). Normally I am saving all my UTF-8 files including the BOM. This works well for most of the scripts for Firefox plugins, but some seem to have to be encoded without BOM, so that the translations are displayed correctly.
So I agree with Marc that an additional encoding "UTF_8_BOM" would be the best and easiest solution and would give the best control over the files created in Eclipse. At the moment I have to use special tools that either add or remove the BOM. So it would be very handy to have that option integrated in Eclipse.