Bug 58859 - [encoding] Editor does not detect BOM on .txt files
Status: VERIFIED FIXED
Alias: None
Product: JDT
Classification: Eclipse Project
Component: Core
Version: 3.0
Hardware: PC
OS: Windows 2000
Importance: P3 normal
Target Milestone: 3.0 M9
Assignee: Frederic Fusier CLA
Depends on: 60588
Reported: 2004-04-16 11:44 EDT by Vladimir Weinstein CLA
Modified: 2004-05-27 12:56 EDT
CC: 3 users

Description Vladimir Weinstein CLA 2004-04-16 11:44:08 EDT
The ICU4J project (http://oss.software.ibm.com/icu4j/) has some of its source
data text files encoded as UTF-8 with a BOM. When these files are opened and/or
compared, the BOM is ignored and the UTF-8 file is displayed incorrectly.

One workaround I found is to explicitly set the "Text file encoding" option in
Preferences->Workbench->Editor to UTF-8, but I think the BOM should be taken
into account when opening text files. After all, that's what it is there for.
Comment 1 Dani Megert CLA 2004-04-27 06:18:19 EDT
You can now set the file encoding per resource (project, folder, file).
Comment 2 David Williams CLA 2004-04-27 09:46:53 EDT
Did you change this bug to "resolved" and "worksforme" based on the fact that
the user can manually set the encoding? If so, I don't think that's right. The
idea behind the BOM is that it's automatically detected. Imagine you had 1000
files, 200 of them with a UTF-8 BOM, 200 of them with a UTF-16 BOM, and 600
with no BOM. Then it's simply not reasonable to ask a user to set the encoding
correctly for those 400 files. BOM detection is standard Unicode processing,
which the new platform encoding support should be able to handle easily now.
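(For illustration, a minimal sketch of that standard detection in plain Java:
it maps the leading bytes of a file to a charset name. The helper name is
hypothetical, not Platform API.)

// Returns the charset implied by a Unicode BOM signature, or null if none.
static String detectBomCharset(byte[] bytes, int length) {
    if (length >= 3 && (bytes[0] & 0xFF) == 0xEF
            && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF)
        return "UTF-8";
    if (length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF)
        return "UTF-16BE";
    if (length >= 2 && (bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xFF) == 0xFE)
        return "UTF-16LE";
    return null; // no BOM: fall back to the configured default encoding
}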
Comment 3 Dani Megert CLA 2004-04-27 09:56:36 EDT
I was closing too fast - I thought it was a content type issue, but what you're
asking is to detect the BOM for text files being opened with the text editor.
Comment 4 Dani Megert CLA 2004-04-27 10:29:39 EDT
Actually my fast decision was correct: we now get the file's encoding from Core
(see IEncodedStorage) and use this to open the file. The exception to this rule
is files where the user already set the encoding manually (using 2.1.x).

If it is not working for you using I20040427, then please file a bug report
against Platform Core.
Comment 5 Rafael Chaves CLA 2004-04-29 02:21:46 EDT
Daniel, 

Would you mind moving (or if I move) bug 58859 to Platform/Core? It is
definitely our limitation here, and I was about to open a BOM-related bug
against us...

Thanks,

Rafael
Comment 6 Rafael Chaves CLA 2004-04-30 16:25:01 EDT
Thanks Daniel. I fixed this at the Core level (ended up creating bug 60588 for
Core... go figure...). So for clients that do not rewrite contents (and thus
BOMs), everything should be fine now. But for you guys the encoding story is a
little bit more complicated...

IFile#getCharset will return 1) a charset forced by the user, or 2) a charset
corresponding to the BOM if there is one, or 3) the one embedded in the contents
(as XML allows), or 4) the default for the parent. 

Right now, the text editor:
a) ignores the existence of BOMs (does not rewrite them)
b) uses the same encoding for writing that was used for reading (opened bug 60636)

So imagine the text editor is used to open an XML file that has the following
XML decl *and* a UTF-8 BOM:

#UTF-8-BOM#<?xml version="1.0" encoding="ISO-8859-1"?>

In this case, IFile#getCharset will return "UTF-8" as the encoding (determined
using the BOM). When saving, the same UTF-8 encoding will be used, but no BOM
will be written. The next time the file is opened, Core will say its encoding
is actually ISO-8859-1, and so any non-ASCII contents of that file will be
corrupted. This could be avoided if the BOM were rewritten, or if the BOM were
not rewritten but the encoding used when writing were re-computed (bug 60636).
To be able to figure out whether there was a BOM in the contents,
IFile#getContentDescription should be used instead of #getCharset.

Also, a minor glitch is that Java readers do not properly recognize and ignore
UTF-8 BOMs, so the BOM will actually appear as part of the contents (an
extraneous non-printable char). Clients may want to skip the first char when a
UTF-8 BOM is found.
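(For illustration, a minimal sketch of the BOM check described above, against
the Core resource APIs mentioned in this bug; the helper name is hypothetical.)

import org.eclipse.core.resources.IFile;
import org.eclipse.core.runtime.CoreException;
import org.eclipse.core.runtime.content.IContentDescription;

// Hypothetical helper: true if Core detected a BOM in the file's contents.
static boolean hasByteOrderMark(IFile file) throws CoreException {
    IContentDescription description = file.getContentDescription();
    if (description == null)
        return false; // no content description available for this file
    return description.getProperty(IContentDescription.BYTE_ORDER_MARK) != null;
}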
Comment 7 Dani Megert CLA 2004-05-03 05:59:28 EDT
See the last section of comment 6.
Comment 8 Frederic Fusier CLA 2004-05-06 07:19:05 EDT
Rafael,
May I have some clarification on how and when to replace getCharset() with
getContentDescription()?
Currently, getContentDescription() returns null while getCharset() returns
correct information. So, will the UTF-8 BOM-specific charset be stored in the
content description in a future release? If so, which property name will be
used for it? When do you plan to integrate these changes?
Comment 9 Rafael Chaves CLA 2004-05-06 12:04:19 EDT
These changes went in for this week's i-build, Frédéric.

IFile#getCharset uses content type information if available, but it will
default to the parent charset if no content description can be obtained. This
is why it will return a valid charset even when IFile#getContentDescription
returns null. The fact that you are getting null content descriptions for Java
files is because there is no content type currently associated with the .java
extension (I forgot this detail, sorry about that). JDT/Core should be
providing an additional content type for Java compilation units. For example
(feel free to change the content type id and name key):

<extension point="org.eclipse.core.runtime.contentTypes">
  <!-- declares a content type for Java source files -->
  <content-type id="javaSource" name="%javaSourceName"
      base-type="org.eclipse.core.runtime.text"
      priority="high"
      file-extensions="java"/>
</extension>

You already have an extension to the contentTypes extension point in JDT/Core,
so you might just add the new content type there.

So if you are not going to provide a workaround for Java readers' lack of
support for UTF-8 BOMs (Sun javac does not support them either):

http://developer.java.sun.com/developer/bugParade/bugs/4508058.html

then just having the content type described above will allow you to use
IFile#getCharset and not worry about BOMs; getCharset takes care of that (the
Java compiler would work fine with UTF-16 BOMs). An easy workaround to support
the UTF-8 BOM would be to skip the first char if it is 0xFEFF and the encoding
is UTF-8. You can also check the content description to see whether there is a
BOM (property name: IContentDescription.BYTE_ORDER_MARK).
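(For illustration, a minimal sketch of that skip-the-first-char workaround in
plain java.io; the method name is hypothetical.)

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

// Hypothetical reader: drops a leading U+FEFF when the charset is UTF-8.
static String readSkippingUtf8Bom(InputStream stream, String charset)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(stream, charset));
    StringBuffer contents = new StringBuffer();
    int first = reader.read();
    if (first != -1 && !("UTF-8".equals(charset) && first == 0xFEFF))
        contents.append((char) first); // keep the first char unless it is a BOM
    for (int c = reader.read(); c != -1; c = reader.read())
        contents.append((char) c);
    return contents.toString();
}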
Comment 10 Rafael Chaves CLA 2004-05-06 12:08:51 EDT
My comments above were regarding reading text files. If you are going to rewrite
existing files (are you?), see comment 6 above and also bug 60636.
Comment 11 Frederic Fusier CLA 2004-05-10 06:50:09 EDT
Fixed.

jdt-core does not write ".java" files, only ".class" files, so there's nothing
to do for us on the writing side...
While reading Java files, we now skip the first char for files with "UTF-8 BOM"
encoding.

[jdt-core-internal]
Changes made in method getInputStreamAsCharArray(InputStream,int,String) of
the jdt.internal.compiler.util.Util class.
Also modified plugin.xml and plugin.properties to add a content type for Java
files as specified by Rafael.
Test case added in EncodingTests.
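(For illustration only, not the actual JDT/Core change: a sketch of the kind
of post-decoding strip described above, with a hypothetical method name.)

// Drops a leading U+FEFF after decoding when the declared encoding is UTF-8.
static char[] stripLeadingUtf8Bom(char[] contents, String encoding) {
    if ("UTF-8".equals(encoding) && contents.length > 0 && contents[0] == '\uFEFF') {
        char[] trimmed = new char[contents.length - 1];
        System.arraycopy(contents, 1, trimmed, 0, trimmed.length);
        return trimmed;
    }
    return contents;
}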
Comment 12 Olivier Thomann CLA 2004-05-18 13:02:44 EDT
Verified in 200405180816.
It fails in M8 and passes with 200405180816.
Comment 13 Dani Megert CLA 2004-05-27 12:56:00 EDT
Platform/Text improved file buffers and document providers to detect the UTF-8
BOM, remove it from the document, and later write it back to the file. This is
in builds > 200405271600.
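(For illustration, a minimal sketch of the write-back half of that round trip,
assuming the presence of the BOM was remembered when the file was loaded;
names are hypothetical.)

import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical save path: re-emit the UTF-8 BOM bytes before the encoded text.
static void saveDocument(OutputStream out, String text, boolean hadUtf8Bom)
        throws IOException {
    if (hadUtf8Bom)
        out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF}); // BOM bytes
    Writer writer = new OutputStreamWriter(out, "UTF-8");
    writer.write(text);
    writer.flush();
}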