The ICU4J project (http://oss.software.ibm.com/icu4j/) has some of its source data text files UTF-8 encoded with a BOM. When these files are opened and/or compared, the BOM is ignored and the UTF-8 file is displayed incorrectly. One workaround I found is to explicitly set the "Text file encoding" option in Preferences->Workbench->Editor to UTF-8, but I think the BOM should be taken into account when opening text files. After all, that's what it is there for.
You can now set the file encoding per resource (project, folder, file).
Did you change this bug to "resolved" and "worksforme" based on the fact that the user can manually set the encoding? If so, I don't think that's right. The idea behind the BOM is that it's automatically detected. Imagine you had 1000 files, 200 of them with a UTF-8 BOM, 200 of them with a UTF-16 BOM, and 600 with no BOM. Then it's simply not reasonable to ask a user to set the encoding correctly for those 400 files. BOM detection is standard Unicode processing, which the new platform encoding support should be able to handle easily now.
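The automatic detection argued for above boils down to sniffing the first few bytes of the file. A minimal sketch (the `BomSniffer` name is illustrative, not Eclipse API):

```java
import java.nio.charset.Charset;

public class BomSniffer {
    /** Returns the charset implied by a leading byte order mark, or null if none. */
    public static Charset detect(byte[] head) {
        if (head.length >= 3
                && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF)
            return Charset.forName("UTF-8");
        if (head.length >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF)
            return Charset.forName("UTF-16BE");
        if (head.length >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE)
            return Charset.forName("UTF-16LE");
        return null; // no BOM: fall back to a user- or container-level setting
    }
}
```

With this shape, only BOM-less files (the 600 in the example) ever need a manually configured encoding.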
I was closing too fast - I thought it was a content type issue, but what you're asking for is to detect the BOM for text files being opened with the text editor.
Actually my fast decision was correct: we now get the file's encoding from Core (see IEncodedStorage) and use this to open the file. The exception to this rule is files where the user already set the encoding manually (using 2.1.x). If it is not working for you using I20040427, then please file a bug report against Platform Core.
Daniel, would you mind moving bug 58859 to Platform/Core (or shall I move it)? It is definitely our limitation here, and I was about to open a BOM-related bug against us... Thanks, Rafael
Thanks Daniel. I fixed this at the Core level (ended up creating bug 60588 for Core... go figure...). So for clients not re-writing contents/marks, everything should be fine now. But for you guys the encoding story is a little bit more complicated. IFile#getCharset will return:

1) a charset forced by the user, or
2) a charset corresponding to the BOM if there is one, or
3) the one embedded in the contents (as XML allows), or
4) the default for the parent.

Right now, the text editor:

a) ignores the existence of BOMs (does not rewrite them)
b) uses the same encoding used to read when writing (opened bug 60636)

So imagine the text editor is used to open an XML file that has the following XML declaration *and* a UTF-8 BOM:

#UTF-8-BOM#<?xml version="1.0" encoding="ISO-8859-1"?>

In this case, IFile#getCharset will return "UTF-8" as the encoding (determined using the BOM). When saving, the same UTF-8 encoding will be used, but no BOM will be written. The next time the file is opened, Core will say its encoding is actually ISO-8859-1, and so any non-ASCII contents of that file will be corrupted. This could be avoided if the BOM were rewritten, or if the BOM were not rewritten but the encoding used when writing were re-computed (bug 60636). To be able to figure out whether there was a BOM in the contents, IFile#getContentDescription should be used instead of #getCharset.

Also, a minor glitch is that Java readers do not properly recognize and ignore UTF-8 BOMs, so the BOM will actually appear as part of the contents (an extraneous non-printable char). Clients may want to skip the first char when a UTF-8 BOM is found.
see last section of comment 6
Rafael, could you clarify how and when to replace getCharset() with getContentDescription()? Currently, getContentDescription() returns null even when getCharset() returns correct information. So, will the UTF-8-BOM-specific charset be stored in the content description in a future release? If so, which property name will be used for it? When do you plan to integrate these changes?
These changes went in for this week's i-build, Frédéric. IFile#getCharset uses content type information if available, but it will default to the parent charset if no content description can be obtained. This is why it will return a valid charset even when IFile#getContentDescription returns null. The reason you are getting null content descriptions for Java files is that there is no content type currently associated with the .java extension (I forgot this detail, sorry about that). JDT/Core should be providing an additional content type for Java compilation units. For example (feel free to change the content type id and name key):

<extension point="org.eclipse.core.runtime.contentTypes">
   <!-- declares a content type for Java Source files -->
   <content-type id="javaSource" name="%javaSourceName"
      base-type="org.eclipse.core.runtime.text"
      priority="high"
      file-extensions="java"/>
</extension>

You already have an extension to the contentTypes extension point in JDT/Core, so you might just add the new content type there. If you are not going to provide a workaround for Java readers' lack of support for UTF-8 BOMs (Sun javac does not support them either, see http://developer.java.sun.com/developer/bugParade/bugs/4508058.html), just having the content type described above will allow you to use IFile#getCharset and not worry about BOMs; getCharset takes care of that (the Java compiler would work fine with UTF-16 BOMs). An easy workaround to support the UTF-8 BOM would be to skip the first char if it is 0xFEFF and the encoding is UTF-8. You can also check the content description to see if there is a BOM (property name: IContentDescription.BYTE_ORDER_MARK).
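The "skip the first char if it is 0xFEFF" workaround can be sketched as follows. This is a minimal standalone illustration; the `BomStripper` name is hypothetical and not part of JDT/Core:

```java
import java.nio.charset.StandardCharsets;

public class BomStripper {
    /** Decodes UTF-8 bytes, dropping the extraneous char that a UTF-8 BOM becomes. */
    public static String decodeUtf8(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        // Java's UTF-8 decoder does not consume the BOM; it surfaces as a leading U+FEFF
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }
}
```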
My comments above were regarding reading text files. If you are going to rewrite existing files (are you?), see comment 6 above and also bug 60636.
Fixed. jdt-core does not write ".java" files, only ".class" files, so there's nothing to do for us on the writing side... While reading Java files, we now skip the first char for files with a UTF-8 BOM. [jdt-core-internal] Changes made in method getInputStreamAsCharArray(InputStream,int,String) of the jdt.internal.compiler.util.Util class. Also modified plugin.xml and plugin.properties to add a content type for Java files as specified by Rafael. Test case added in EncodingTests.
Verified in 200405180816. It fails in M8 and passes with 200405180816.
Platform/Text improved file buffers and document providers to detect UTF-8 BOM, remove it from the document and later on write it back to the file. This is in builds > 200405271600.