Bug 78455 - [api][encoding] Provide an option to force writing a BOM to UTF-8 files
Status: CLOSED WONTFIX
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Text
Version: 3.1
Hardware: PC Windows XP
Importance: P3 normal with 4 votes
Target Milestone: ---
Assignee: Platform-Text-Inbox
QA Contact:
URL:
Whiteboard: stalebug
Keywords:
Depends on:
Blocks: 78446
Reported: 2004-11-11 18:00 EST by Nitin Dahyabhai
Modified: 2022-02-28 03:58 EST
Description Nitin Dahyabhai 2004-11-11 18:00:12 EST
WebTools had an option to save its UTF-8 encoded files with the correct BOM. 
Now that it is using FileBuffers for opening and saving files, the previous
setting and preference no longer applies.  It would be nice if the platform
could provide this itself for all text file types.
Comment 1 Dani Megert 2004-11-15 08:52:50 EST
The following scenario does work for me:
- open a UTF-8 file with BOM
- change it
- save it
==> BOM written back to file

You are asking for a way to specify the BOM when creating a new file, correct?
How did you do this in WebTools before switching to file buffers?

Possible solutions could be:
- introduce a "UTF-8 BOM" encoding
  This would allow specifying the BOM for UTF-8 in the UI without modifications
  to the API or the UI.

- add Core API which tells whether a resource (workspace, container, file) wants
  to force a BOM, and change the current BOM indication in the properties dialog
  to show a checkbox which, depending on the encoding, is enabled (e.g. for
  UTF-8) or disabled.

Moving to Platform Resources for comment. Once the API is in place we can adapt
the file buffers.
Comment 2 Rafael Chaves 2004-11-15 10:32:49 EST
I am not convinced that Resources is the right place to provide such a thing.

It seems to me that the use case is different from the encoding case. In the
encoding case, users (not plugins) are making choices that need to be preserved
for the life of the resource. Here, it looks like it is either a one-time
user choice (like in a save as... dialog) or a tool's choice (e.g. one may
always want to create UTF-8 files with BOMs), so I can't see the value of having
a per-file/container setting of whether BOMs should be created.

Why can't ITextFileBuffer allow clients to programmatically say if they want to
save a BOM or not?
Comment 3 Dani Megert 2004-11-15 11:30:50 EST
>Why can't ITextFileBuffer allow clients to programmatically say if they want to
>save a BOM or not?
I did not say they can't, I actually wrote that we'll have to adapt the file
buffers based on the BOM info that we get from Platform Resources: I think the
BOM info corresponds to the encoding which can be set for resources, containers
and the workspace. Since the encoding and detecting the BOM is already handled
by Platform Resources I think it is also the right place to offer this UTF-8 BOM
setting.

It would allow different plug-ins (including those not relying on file buffers)
to access the UTF-8 BOM information and attach the BOM when creating a file in a
container that has that flag set.
Comment 4 Rafael Chaves 2004-11-15 11:45:41 EST
I just don't think there is a need for an extra setting. As you said before, an
existing BOM is automatically preserved, and that is cool. So, for existing
files (which is usually the common case), there is no need for a setting. When
creating a new file, it is just a matter of the client saying what it wants for
that file. Tools may always want to create UTF-8 files with BOMs. I see BOM
enablement as too infrequent a use case to justify the overhead of having a
scheme similar to the one provided for encoding. Actually, the originator is not
requesting that much flexibility. They have their own preference, they just
don't have the means to make it effective.
Comment 5 Dani Megert 2004-11-15 11:55:52 EST
Nitin, can you confirm that this flexibility is not needed and having API on the
file buffers to force the BOM fits your needs?
Comment 6 Nitin Dahyabhai 2004-11-15 14:04:15 EST
Yes, I'm not requesting the level of flexibility that Daniel discusses in
comment 3, it is exactly like Rafael says.  SSE only needs to be able to force
the addition of the BOM when it otherwise would not be written out.
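
As an illustration of the request, here is a minimal sketch of what "forcing the BOM" amounts to at the byte level. It is plain Java, not the file buffers API: the class name `Utf8BomWriter`, the method `write`, and the `forceBom` flag are all hypothetical and only show the intended behavior (write EF BB BF first, unless the text already starts with U+FEFF).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Eclipse: illustrates a "force BOM" save.
final class Utf8BomWriter {
    /** UTF-8 encoding of U+FEFF, the byte order mark. */
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    private Utf8BomWriter() {}

    /**
     * Writes text to file as UTF-8. When forceBom is true, the BOM is
     * written first unless the text already begins with U+FEFF (which
     * encodes to the same three bytes), so it is never written twice.
     */
    static void write(Path file, String text, boolean forceBom) throws IOException {
        try (OutputStream out = Files.newOutputStream(file)) {
            if (forceBom && !text.startsWith("\uFEFF")) {
                out.write(UTF8_BOM);
            }
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

A client that always wants a BOM on new files would call `Utf8BomWriter.write(path, content, true)`; with `false`, the output is whatever the UTF-8 encoder produces on its own.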
Comment 7 Dani Megert 2004-11-16 03:24:20 EST
OK, then.
Comment 8 Marc Bauer 2006-04-01 08:03:59 EST
Hi,

Are you working on this issue? We have BIG trouble with the missing BOM.

For example, we come from Dreamweaver and develop in ColdFusion. All our old development was done in Homesite <=5 and Dreamweaver <=6.1, and therefore the encoding is windows-1252.

So I changed the workspace to UTF-8 so that new files are created only in UTF-8. After this change, all files in the workspace look destroyed wherever they contain German umlauts like öäü and so on. If I change back to windows-1252 and create new files, they are not UTF-8.

1. We must create new files only in UTF-8 *with* BOM (ColdFusion requires this, and Dreamweaver too).

2. We need autodetection of older windows-1252 files, and they should be opened as windows-1252 whatever the workspace config says they should inherit. (Critical: Eclipse destroys our files without our knowledge.)

3. New files must be saved as UTF-8 with a BOM by default, every time.

4. Additionally, I found that there is the same problem with UTF-16 in Eclipse. IBM wrote in http://publib.boulder.ibm.com/infocenter/wasinfo/v6r0/index.jsp?topic=/com.ibm.websphere.base.doc/info/aes/ae/cwbs_wsiprofile.html that there must be a BOM in UTF-16, but Eclipse does not write it every time. Only the UTF-16 setting saves a BOM to the files; UTF-16LE and UTF-16BE do not!
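
The UTF-16 observation in item 4 matches how the JDK's own charsets behave. The sketch below only demonstrates the JDK encoders; that Eclipse hands encoding over to them is an assumption here, not something this report confirms. The class name `Utf16BomDemo` and the helper `hex` are made up for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: shows which JDK UTF-16 charsets emit a BOM when encoding.
final class Utf16BomDemo {
    private Utf16BomDemo() {}

    /** Encodes text with the given charset and returns the bytes as spaced hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "UTF-16" encodes big-endian and prepends the BOM FE FF.
        System.out.println("UTF-16   : " + hex("A", StandardCharsets.UTF_16));   // FE FF 00 41
        // The LE/BE variants fix the byte order by name and write no BOM.
        System.out.println("UTF-16LE : " + hex("A", StandardCharsets.UTF_16LE)); // 41 00
        System.out.println("UTF-16BE : " + hex("A", StandardCharsets.UTF_16BE)); // 00 41
    }
}
```

So a file saved as "UTF-16LE" or "UTF-16BE" without a BOM is what these encoders produce by design; a BOM on such files would have to be added explicitly by the caller.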


Is there any timeframe in which we can expect a fix? This is a critical issue and it has not been fixed in more than 2 years.

PLEASE change the priority to P1 and start working on this ASAP.

Regards
Marc
Comment 9 Dani Megert 2006-04-02 11:55:01 EDT
If I understand you correctly, you're looking for what I outlined in comment 3 and which I was told is not needed/requested: the ability in the UI to specify that new files are created with a BOM. If so, you should try again to raise this in a separate bug logged against Platform Resources (please add me to the cc-list if you do so). This bug is about adding API to text file buffers.

>2. we need a autodetection of older windows-1252 files
Sorry, this is not possible.
Comment 10 Marc Bauer 2006-04-14 11:19:05 EDT
(In reply to comment #9)
> If I understand you correctly you're looking for what I outlined in comment 3
> and which I was told is not needed/requested.: the ability in the UI to specify
> that new files are created with BOM. 

I cannot really understand why this is not "needed"!?

> >2. we need a autodetection of older windows-1252 files
> Sorry, this is not possible.

Why? If this were done, we would never run into an encoding detection problem again, and this should be *the* goal.
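
For context on why detection can only ever be a guess: a windows-1252 file carries no signature, and almost any byte sequence decodes under a single-byte encoding, so a tool can at best apply heuristics. One common heuristic, sketched below, is to check whether the bytes are strictly valid UTF-8 and fall back to a legacy encoding otherwise. This is not what Eclipse does (that is not stated in this report); the class name `EncodingGuess` and the method `isValidUtf8` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical heuristic, not Eclipse code: strict UTF-8 validation.
final class EncodingGuess {
    private EncodingGuess() {}

    /**
     * Returns true if the bytes decode as UTF-8 without any malformed
     * sequence. A true result does not prove the file is UTF-8: pure
     * ASCII passes, and some windows-1252 text is also valid UTF-8.
     */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A lone byte 0xFC (the u-umlaut in windows-1252) fails this check, which is the useful case; an all-ASCII file passes under both encodings, which is why the result stays a guess rather than a detection.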
Comment 11 Eric Hildum 2007-06-07 19:05:49 EDT
Note that if the "BOM" is actually required by ColdFusion and Dreamweaver, then these applications are not processing UTF-8 correctly. From the Unicode reference:

Because the UTF-8 encoding form already deals in ordered byte sequences, the UTF-8 encoding scheme is trivial. The byte ordering is already obvious and completely defined by the UTF-8 code unit sequence itself. The UTF-8 encoding scheme is defined merely for completeness of the Unicode character encoding model. 

While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data containing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the <EF BB BF> byte sequence at the beginning of a data stream can, however, be taken as near-certain indication that the data stream is using the UTF-8 encoding scheme.
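
The last sentence of the quoted passage can be sketched in a few lines of Java: a check for the <EF BB BF> signature at the start of a byte array. The class name `Utf8BomSniffer` and the method `startsWithUtf8Bom` are hypothetical, chosen only for this illustration.

```java
// Hypothetical helper: looks for the UTF-8 signature EF BB BF.
final class Utf8BomSniffer {
    private Utf8BomSniffer() {}

    /** Returns true if data begins with the three-byte UTF-8 BOM. */
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }
}
```

The `& 0xFF` masks are needed because Java bytes are signed; without them the comparison against 0xEF would never match.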
Comment 12 Marc Bauer 2007-06-07 19:23:12 EDT
I found out that I can put a UTF-8 encoded file with a BOM from Dreamweaver into an Eclipse project. After I add the file, it is WRONGLY detected by Eclipse as cp1252.

1. If I open the file, the German "Sonderzeichen" (special characters) are destroyed.
2. Close the file.
3. Change the encoding to UTF-8 in the settings of this one file (container is not inherited!!!).
4. Open the file; everything works, and the German Sonderzeichen are displayed correctly.
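
The "destroyed" characters in step 1 are what any decoder produces when UTF-8 bytes are read as cp1252. The sketch below reproduces that effect in plain Java; it says nothing about where Eclipse's detection goes wrong, and the class name `MisdetectionDemo` and the method `decodeAsCp1252` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: reading UTF-8 bytes (with BOM) as windows-1252.
final class MisdetectionDemo {
    private MisdetectionDemo() {}

    /** Encodes text as UTF-8 with a leading BOM, then decodes it as windows-1252. */
    static String decodeAsCp1252(String text) {
        byte[] utf8WithBom = ("\uFEFF" + text).getBytes(StandardCharsets.UTF_8);
        return new String(utf8WithBom, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+00FC is the German u-umlaut; its UTF-8 bytes are C3 BC.
        String garbled = decodeAsCp1252("\u00FC");
        // Print code points instead of the raw string to stay console-independent.
        garbled.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

The one umlaut comes back as five characters: U+00EF U+00BB U+00BF (the BOM bytes shown as text) followed by U+00C3 U+00BC, which is the typical two-character garbage seen in place of each umlaut.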

Do you think that UTF-8 detection in Eclipse is bugfree? I think NOT!!! Eclipse's encoding handling sucks for everyone working with UTF-8 files, be assured; I know 10+ of them sitting next to me.
Comment 13 Nitin Dahyabhai 2009-11-30 14:36:50 EST
(In reply to comment #12)
> Do you think that UTF-8 detection in Eclipse is bugfree? I think NOT!!!

It can also depend on the kind of file you were working with, as some files that declare their encoding internally expect to be read in that way as well.
Comment 14 Eclipse Webmaster 2019-09-06 16:14:37 EDT
This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

If you have further information on the current state of the bug, please add it. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
Comment 15 Eclipse Genie 2022-02-28 03:58:34 EST
This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're closing this bug.

If you have further information on the current state of the bug, please add it and reopen this bug. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

--
The automated Eclipse Genie.