Summary: | [api][encoding] Autodetect files encoding and force BOM save to files | ||
---|---|---|---|
Product: | [Eclipse Project] Platform | Reporter: | Marc Bauer <marc.bau> |
Component: | Resources | Assignee: | Platform-Resources-Inbox <platform-resources-inbox> |
Status: | NEW --- | QA Contact: | |
Severity: | enhancement | ||
Priority: | P4 | CC: | daniel_megert, eclipse, kashihara, pawel.pogorzelski1, sebastianzartner |
Version: | 3.1.2 | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | Windows XP | ||
Whiteboard: |
Description
Marc Bauer
2006-04-14 11:57:34 EDT
>Daniel Megert forced me in https://bugs.eclipse.org/bugs/show_bug.cgi?id=78455
>to issue a new bug for this.
Not true. I didn't force anybody; I just indicated that you might raise this again. So please stick to the facts.
Moving to Platform Resource since encoding detection is there.
The support for encoding in the workspace is based on what is available from Java. For any given resource in the workspace, it is possible to obtain a charset string that can be used with any Java API that takes charset strings. Examples are: 'US-ASCII', 'UTF-8', 'Cp1252', 'UTF-16' (Big Endian, BOM inserted automatically), 'UTF-16BE' (Big Endian, BOM not inserted automatically), 'UTF-16LE' (Little Endian, BOM not inserted automatically).

With this mindset, I will now try to address your points one by one:

1) For Java encodings, except for the 'UTF-16' encoding, BOMs are not inserted (when writing) or discarded (when reading) for free. Even if this is puzzling to end users, this is how all Java applications work. If applications want to support creating UTF-8 files with BOMs to match their users' expectations, they need to provide such capability on their own (as neither Java nor the Resources model will help with that). Eclipse does provide some improvements towards detecting BOMs, but not with generating or skipping them.

2) I don't know any way of reliably detecting the encoding of a file unless it has a BOM or declares it explicitly (as in the <?xml ...?> declaration).

3) This does not belong here. You might want to raise this as a separate bug against Platform/Text, but my personal opinion is that this is bogus. Users have files with different encodings in their workspace. Different file formats and applications demand the use of different encodings.

4) See point #1 above. You might want to open an issue with Sun.

5) I mentioned this in point #1 above. UTF-16LE for Java apps (such as Eclipse) explicitly means: UTF-16 little endian *without* a BOM. This is how any Java app taking charset names works; we cannot be different.

Hi,

I'm not so deep into encodings and what Java does out of the box. I know that ColdFusion and some other apps require a BOM, and ColdFusion is for sure based on JRun, which is pure Sun Java v1.4.2.
So it looks like Java knows about BOMs very well!? IBM's specification additionally says, in the link:

Bytes | Encoding
---|---
EF BB BF | UTF-8
FF FE | UTF-16, Little-Endian
FE FF | UTF-16, Big-Endian
00 00 FE FF | UTF-32, Big-Endian
FF FE 00 00 | UTF-32, Little-Endian

And they explain that the BOM is optional for UTF-8 and obligatory for UTF-16. Optional means, as I understand it, that it may or may not be present. From my point of view, "optional" sounds like there are compatibility issues with older apps, so it is a workaround for apps that do not support BOMs today but will in the future. In other words, they don't want to revoke the legitimacy of omitting the BOM, for the sake of older apps that don't use one :-). Isn't it? Do you have a different specification?

Regarding 2): If we use the above list of BOMs and this IBM specification, we have this problem fixed for now and for the future. If I then change the workspace to "Windows 1252", and every UTF-8, UTF-16 Little-Endian, UTF-16 Big-Endian, UTF-32 Big-Endian and UTF-32 Little-Endian file has the correct BOM, you are able to detect all these file encodings alongside "Windows 1252" 100% correctly, whatever the workspace says, isn't it? And this is what Dreamweaver has been doing for years...

Regards
Marc

Additional Info:
http://www.unicode.org/unicode/faq/utf_bom.html#22

Snipped out of the text:
1. a BOM can be used as a signature no matter how the Unicode text is transformed
2. BOM can be used as a signature. If there is no BOM, the encoding could be anything

Note that the "IBM specification" you reference is actually a WS-I (web services interoperability) specification. I don't see how this is relevant to the behaviour of Eclipse. I.e., the existence of a specification in some application domain requiring a certain encoding scheme does not necessarily hold true in other domains for which Eclipse is used.
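
For illustration, the BOM table quoted above could be checked along the following lines. This is only a minimal sketch, not how Eclipse's content-type detection is actually implemented; the class name `BomSniffer` and the method `detect` are hypothetical. It assumes a current JDK, and it returns `null` when no BOM is found, in which case a caller would presumably fall back to the workspace or file setting (for example Cp1252).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Eclipse or the JDK.
final class BomSniffer {
    // Byte signatures from the table quoted in this thread.
    // Longer signatures come first so that FF FE 00 00 (UTF-32LE)
    // is not reported as FF FE (UTF-16LE). Note the inherent ambiguity:
    // a UTF-16LE file whose first character is U+0000 looks the same.
    private static final int[][] SIGNATURES = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF},
        {0xFF, 0xFE},
        {0xFE, 0xFF},
    };
    private static final String[] NAMES = {
        "UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16LE", "UTF-16BE",
    };

    /** Returns the encoding implied by a leading BOM, or null if there is none. */
    static String detect(byte[] head) {
        for (int i = 0; i < SIGNATURES.length; i++) {
            if (startsWith(head, SIGNATURES[i])) {
                return NAMES[i];
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            // Mask to compare the byte as an unsigned value.
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] noBom = "hi".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(withBom)); // prints "UTF-8"
        System.out.println(detect(noBom));   // prints "null"
    }
}
```

A file without a BOM stays undetectable by this approach, which matches point 2) earlier in the thread: without a BOM or an explicit declaration, the encoding could be anything.
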
I suspect there are other tools that would choke on additional BOMs in UTF-8 files, so there is likely no solution that will satisfy all possible readers of files developed in Eclipse.

Note that if a file has a BOM, Eclipse applications are expected to preserve it when saving the file back. Not doing so is wrong, and a bug should be opened against the provider of the plug-in/application.

Also, I am not denying that UTF-16 mandates a BOM to be present. I was just explaining what the UTF-16LE and UTF-16BE charset names mean in the context of the Java platform. You are right when you claim that this goes against the spec, and if you search Sun's bug database you will see that there are lots of issues in this area. For Java 6 (Mustang), they considered supporting skipping UTF-8 BOMs, but that had to be backed out because it breaks all existing applications that were working around that flaw themselves:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058

Also, for Mustang, expect a new charset to be supported: UTF_16LE_BOM
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6230129

That will give you the ability to have UTF-16 Little Endian files with BOMs (you already have that for Big Endian - just use UTF-16).

Thank you for the links. I know the IBM one is only for web services, but the Unicode page is general, and the IBM table was nice to copy and paste here :-). I learned today that these bugs have been known since Sep 27, 2001, which only makes me more worried... It is good to know this is partly fixed in Java 6, but that is currently in beta and I don't see the day when Adobe and others will have moved their apps to this new runtime. So no fix is in sight. Currently we don't need UTF_16LE_BOM; what we need today is UTF_8_BOM :-). So if you see a way to implement this in Eclipse as a special user-selectable encoding, this may be a good workaround, isn't it?
Later, if Sun fixes this, you can detect the Java runtime version and switch the underlying function to the standard way... but until that time comes we have a well-working solution!

I'm not so deep into Java, but it sounds very strange to have the same encodings in different programming languages sometimes with a BOM and sometimes without. It is scary to learn that UTF-8 is not UTF-8. So from the programmer's side you can never trust anything; this is really annoying and doesn't help any "standard"! I really thought UTF-8 was a standard until now. It looks like it isn't.

In the end - what will happen now? Should I ask *every* plugin maintainer, "PLEASE implement a checkbox for BOM support"? Or should I create blank UTF-8 files in Dreamweaver and open them in Eclipse... that sounds a little bit naive. Or will you implement such an encoding type named "UTF_8_BOM", which sounds like a very good idea... if I may say so!

About asking every maintainer - in the last weeks I learned that some of them had implemented such a feature and then removed it after a short time, because they ended up with two BOMs inside one file and similar problems... It would really be better to have one way that works for all plugins and the whole app. I haven't found such a checkbox in any plugin I'm using so far (e.g. CFEclipse, Eclipse Web Tools Platform Project). This UTF_8_BOM sounds like "the" solution here...

I also program in ColdFusion at work and have had the same problems as Marc. It is very annoying when file encodings are not saved correctly and translation files are destroyed as a result. Besides that, I am also writing a Firefox plugin (Regular Expressions Tester) and had problems with the BOM (even in Firefox 3.0.8). Normally I save all my UTF-8 files with the BOM. This works well for most of the scripts for Firefox plugins, but some seem to have to be encoded without a BOM so that the translations are displayed correctly.
So I agree with Marc that an additional encoding "UTF_8_BOM" would be the best and easiest solution for that and would give the best control over the files created in Eclipse. At the moment I have to use special tools that either add the BOM or remove it. So it would be very handy to have that option integrated in Eclipse.
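
Until something like the requested "UTF_8_BOM" encoding exists, an application has to add or skip the UTF-8 BOM itself, as noted earlier in the thread. The following is a minimal sketch of that idea, assuming a current JDK; the class `Utf8Bom` and its method names are hypothetical and are not an Eclipse or JDK API. It also guards against the double-BOM problem mentioned above by never emitting more than one BOM.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Eclipse or the JDK.
final class Utf8Bom {
    private static final byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** True if the data starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasBom(byte[] data) {
        if (data.length < BOM.length) {
            return false;
        }
        for (int i = 0; i < BOM.length; i++) {
            if (data[i] != BOM[i]) {
                return false;
            }
        }
        return true;
    }

    /** Encodes text as UTF-8 and prepends exactly one BOM. */
    static byte[] encodeWithBom(String text) {
        // Java's UTF-8 encoder does not add a BOM by itself. If the text
        // already starts with U+FEFF, drop it to avoid writing two BOMs.
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[BOM.length + body.length];
        System.arraycopy(BOM, 0, out, 0, BOM.length);
        System.arraycopy(body, 0, out, BOM.length, body.length);
        return out;
    }

    /** Decodes UTF-8 bytes, skipping a leading BOM if one is present. */
    static String decodeSkippingBom(byte[] data) {
        // Java's UTF-8 decoder would otherwise keep the BOM as U+FEFF.
        int offset = hasBom(data) ? BOM.length : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeWithBom("hi");
        System.out.println(bytes.length);             // prints 5 (3 BOM bytes + 2 text bytes)
        System.out.println(decodeSkippingBom(bytes)); // prints "hi"
    }
}
```

Whether a BOM should be written at all still depends on the consumer of the file, since some tools require it and others choke on it, so a real implementation would presumably make this a per-file or per-project choice.
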