Bug 37933 - [plan item] Improve file encoding support
Summary: [plan item] Improve file encoding support
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Resources
Version: 2.1
Hardware: All
OS: All
Importance: P4 enhancement
Target Milestone: 3.0
Assignee: Rafael Chaves CLA
QA Contact:
URL:
Whiteboard:
Keywords: plan
Duplicates: 36950
Depends on:
Blocks: 39068
 
Reported: 2003-05-21 12:31 EDT by Jim des Rivieres CLA
Modified: 2022-04-14 08:31 EDT
CC List: 10 users

See Also:


Attachments
Proposed Contribution to handle encoding based on a file's content (655.64 KB, application/octet-stream)
2004-03-08 02:00 EST, David Williams CLA
sample java program encoded by EUC-JP (1.70 KB, application/octet-stream)
2004-04-20 06:48 EDT, Masayuki Fuse CLA
screen shot of Compare (116.08 KB, image/jpeg)
2004-04-20 06:54 EDT, Masayuki Fuse CLA

Description Jim des Rivieres CLA 2003-05-21 12:31:28 EDT
Improve file encoding support. Eclipse 2.1 uses a single global file encoding 
setting for reading and writing files in the workspace. This is problematic; 
for example, when Java source files in the workspace use OS default file 
encoding while XML files in the workspace use UTF-8 file encoding. The 
Platform should support non-uniform file encodings. [Platform Core, Platform 
UI, Text, Search, Compare, JDT UI, JDT Core] [Theme: User experience]
Comment 1 Jim des Rivieres CLA 2003-05-21 12:32:33 EDT
*** Bug 36950 has been marked as a duplicate of this bug. ***
Comment 2 Rafael Chaves CLA 2003-06-06 10:38:07 EDT
Original PR: bug 5399.
Comment 3 Bob Foster CLA 2003-06-12 04:49:37 EDT
Re: Non-uniform file encodings in the Eclipse Platform

Many worthwhile ideas here. Other comments...

1. I assume that in the "basic algorithm" the steps are performed in the order listed. In
that case, steps 2 and 3 must be interchanged: the encoding interpreter must
always be consulted first, since multiple encodings are possible with the same BOM.
The result of (current) step 3 should be final. Otherwise, the BOM test should
be ignored unless it is inconsistent with the result of step 4 or 5.
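For illustration only, here is a minimal sketch of that ordering, written against a
modern JDK for brevity; the EncodingInterpreter interface and helper names are
assumptions for the example, not the spec's actual API:

	import java.io.*;
	import java.nio.charset.Charset;

	// Hypothetical interface standing in for the spec's encoding interpreter.
	interface EncodingInterpreter {
	    String detect(byte[] head, int length); // charset name, or null if undetermined
	}

	class EncodingResolver {
	    static String resolve(File file, EncodingInterpreter interpreter,
	            String resourceSetting, String workspaceDefault) throws IOException {
	        byte[] head = new byte[1024];
	        int n;
	        try (InputStream in = new FileInputStream(file)) {
	            n = in.read(head);
	        }
	        // 1. Content-sensitive interpreter first; its answer is final.
	        String detected = interpreter == null ? null : interpreter.detect(head, n);
	        if (detected != null) return detected;
	        // 2. Otherwise fall back to the BOM, if any.
	        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
	                && (head[2] & 0xFF) == 0xBF) return "UTF-8";
	        if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) return "UTF-16BE";
	        if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) return "UTF-16LE";
	        // 3. Then explicit settings, then the workspace default.
	        if (resourceSetting != null) return resourceSetting;
	        return workspaceDefault != null ? workspaceDefault : Charset.defaultCharset().name();
	    }
	}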

2. Encoding must be determined upon save as well as open. This determination may
require calling an output encoding interpreter, which you do not have in your
scheme. (Use case: User has an <?xml encoding declaration in an XML file and
changes text of the encoding attribute.) The editor should not be required to
track these changes character-by-character and blast off encoding change
notifications. In fact, the editor may not be aware of encoding at all. (Use
case: Rick Jellife has proposed an encoding declaration that would appear in
comments at the beginning of a file.) Instead, an output encoding interpreter
should be called at save time. IOW, the "basic algorithm" should be applied at
save time, too, using an encoding interpreter that operates on the Unicode text
instead of a byte stream.
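As a purely illustrative example of such an output-side interpreter, one that works on
the already-decoded Unicode text and pulls the declared encoding out of an XML prolog
at save time might look roughly like this (the class and method names are assumptions):

	import java.util.regex.Matcher;
	import java.util.regex.Pattern;

	// Hypothetical output interpreter: inspects the document's characters, not bytes.
	class XmlOutputEncodingInterpreter {
	    private static final Pattern DECL =
	            Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([^\"']+)[\"']");

	    // Returns the encoding named in the XML declaration, or null if there is none.
	    static String detectOnSave(CharSequence documentText) {
	        // Only the prolog matters; limit the scan to the first few hundred characters.
	        CharSequence head = documentText.subSequence(0, Math.min(documentText.length(), 512));
	        Matcher m = DECL.matcher(head);
	        return m.find() ? m.group(1) : null;
	    }
	}

The name returned (if any) would then drive the encoder chosen for the save, rather than
whatever encoding the file happened to be opened with.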

3. In light of the above, notifying of encoding changes seems of limited value,
since the encoding may be re-determined at open/save time. Encoding should be discovered when
it is needed. Notification may be counter-productive, leading editors to take
actions they should not be taking, like calling setCharset().

4. setEncoding() should be removed and the basic algorithm should be the
description of how getEncoding() works. setEncoding() is a potential source of
problems. For example, if setEncoding() is called on an open resource and the
resource is then saved and closed, the resource cannot be re-opened successfully
unless the encoding set is remembered. This makes it a resource property, but
there is already a resource property that may contain an encoding and the two
may be in conflict. What is a valid use of setEncoding()?

5. It should be possible for an editor to have associated encoding
interpreter(s), so that the user is not forced to set the encoding interpreter
and the editor separately. It is highly likely that the user will not be aware
of the encoding interpreter feature and will not correctly set it in advance of
having encoding problems. In fact, users seem to have problems learning how to
set editors associated with extensions, and they already know what an editor is.
Likewise, editors should not have to establish their own encoding interpreters
programmatically.

6. What is the use/purpose of isDefaultEncoding()? There may be several
"defaults". If anyone cares that a resource is not using the workspace-level
encoding, they should stop caring.

7. Workspace-level, resource-level and interpreter-determined are requirements,
but I am not convinced there are use cases to support directory-level encoding,
and they do add overhead. If the feature exists, someone may find it useful, if
that's the threshold.
Comment 4 David Williams CLA 2003-06-16 11:17:06 EDT
I'll give a couple of high-level comments, though not an exhaustive list (I just 
wanted to document some main things first). 

1. I agree with Bob's comment that the BOM step should be done first, but my 
memory of the standards is that step should be final, if there is a BOM. That 
is, I thought the BOM was definitive. I'm not aware of cases where "Multiple 
encodings are possible with the same BOM". Bob, perhaps you can explain?

2. Encoding (interpreters) definitely need to be associated with content type, 
not file extension. That seems to be assumed and understood by everyone, but I just 
wanted to add my voice to the importance of that. 

3. While an "output" interpreter would also be necessary, I think some token 
remembering the encoding used during input (or the last output) is required 
too. For one thing, if a 3-byte BOM (for UTF-8) is detected during read, it's 
only polite to maintain that when writing. For a second thing, which does 
depend on the token being "kept up to date" with notifications, it is possible 
that someone can paste text into a document that "violates" the encoding. Some 
well-behaved editors might want to give some warning about that. 

4. I should be explicit that the above comment assumes the EncodingMemento (my name 
for the token :) should be associated not only with the IFile/IResource, but also 
the IDocument. It is, after all, possible to save/copy a document independently 
of its original resource. In those cases the encoding token should ride along. 
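A minimal sketch of what such an EncodingMemento could carry (the name is the suggestion
above; the fields are illustrative, not an existing API):

	// Remembers how the bytes were decoded so the same choices can be honored on write.
	final class EncodingMemento {
	    final String javaCharsetName;   // e.g. "UTF-8", as used to decode
	    final String declaredIanaName;  // name found in the content, if any, else null
	    final boolean hadUtf8Bom;       // whether a 3-byte BOM was present on read

	    EncodingMemento(String javaCharsetName, String declaredIanaName, boolean hadUtf8Bom) {
	        this.javaCharsetName = javaCharsetName;
	        this.declaredIanaName = declaredIanaName;
	        this.hadUtf8Bom = hadUtf8Bom;
	    }
	}

Attaching an instance both to the IFile and to the IDocument is what would let a copy
or a Save As carry the information along.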

5. Something that seems missing from the spec is the association of IANA 
encoding names and Java encoding names. I suggest this be provided at a "base" 
level since 1) there are some ambiguities, 2) it's dependent on the VM and platform, 
and 3) there should be a "base" preference that allows users to control that 
association, when needed. (Most users don't need to do this, but some do, and 
for those that do, there's no alternative workaround.)
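For the mapping itself, a rough sketch of resolving a declared (IANA-style) name against
whatever the running VM supports, which is where the platform dependence shows up:

	import java.nio.charset.Charset;

	class CharsetNames {
	    // Returns the VM's canonical charset for a declared name, or null if unknown or illegal.
	    static Charset toJavaCharset(String declaredName) {
	        try {
	            return Charset.isSupported(declaredName) ? Charset.forName(declaredName) : null;
	        } catch (IllegalArgumentException e) { // syntactically illegal charset name
	            return null;
	        }
	    }
	}

Charset.aliases() on the result lists the alias names that particular VM recognizes,
which is exactly the VM-dependent part a base-level preference could let users override.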

6. I suspect there are a few well-known interpreters that should be included as 
part of the base support: XML, at least. HTML, JSP, CSS also come to mind. 
Others?

7. The spec also doesn't mention how conflict resolution between interpreters 
is handled. I suspect that if the "well known" ones were included as base 
support, there'd be little need beyond a warning message in the log file, but 
if, for example, everyone needs to re-invent an XML interpreter, there'd be 
plenty of opportunity for conflict and users may desire a choice as to which 
was used (which would be unfortunate). 

8. Also not well covered in the spec is exception handling. From experience, 
it's easy for a file (e.g. an XML file) to specify an encoding while some 
character(s) in the file don't actually conform to it, and a MalformedInput 
exception will be raised. Some mechanism is then needed to allow users and/or 
client code to "override" whatever the "default" behavior should be. For 
example, an editor might want to give the user a choice to "use default" or 
pick another encoding to try. 

9. The above point reminds me that in Java 1.4 there's an encoding setting that 
allows different behavior on encoding errors during input. One option throws an 
exception, the other substitutes '?' for unreadable characters (well, actually, 
they say they make an attempt to "guess" what the character is, but I'm not 
sure that's very accurate). The point being that some parallel setting should 
exist with base Eclipse support, which would then "pass through" to the 
underlying Java support. There's a similar situation with encoding errors on 
output, but a little worse. Even with 1.3, invalid characters are written 
as '?' instead of automatically throwing an exception, and some care is needed 
to handle invalid characters on output. The degree of care, I suggest, should 
also be "settable" by client code. 
Comment 5 Bob Foster CLA 2003-06-16 20:41:45 EDT
David -

This is a longish response, though I agree with most of your points.

1/3a. I have a general concern that Eclipse doesn't quite "get it" when it comes
to plug-ins. In case after case, we find super-privileged plug-ins that are
allowed to get in first and establish default behaviors by means that do not
follow extension point rules and are difficult or impossible for other plug-ins
to override. Much of 3.0 is an effort to correct this problem for menus and
toolbars; the 2.1 cycle tried to do the same for keyboard shortcuts. Whenever I
see some "default" behavior making irrevocable decisions without consulting
plug-ins, I get nervous.

For example, according to Unicode, the BOM is definitive and since the BOM was
written by the last writer or agent, it probably reflects the current encoding.
So it would be useful to see the encoding interpreter plug-in called with an
argument that indicates the BOM encoding (or none), as well as the length of the
BOM.

There is no harm, either, in recording the BOM-determined encoding in a memento,
provided the memento clearly indicates the origin of the information. To carry
this information around as some sort of vague "preferred encoding" would be
counter-productive.

When a document is modified and saved, or created and saved for the first time,
the encoding indicated within the document, if any, must always be consulted.
The precedence for determining the write encoding should be:

1. In-document encoding, if any.
2. Resource property encoding, if any.
3. Directory default encoding(s), if any.
4. Platform encoding

If any of these is different than the input BOM encoding, the plug-in should ask
the user to confirm or switch to the input encoding. The input encoding really
has no privileged relationship to the output encoding, but there are two good
reasons to ask:

- One of the default encodings may lose information.
- Users most often don't know what the platform default encoding is (at least,
until they are screwed the first time).
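Read as a straight fall-through, that precedence amounts to something like the following
sketch (class, method and parameter names are illustrative only), with the BOM comparison
and user confirmation layered on top:

	// Each source returns null when it has nothing to say about the encoding.
	class WriteEncodingResolver {
	    static String resolve(String inDocument, String resourceProperty,
	            String directoryDefault, String platformDefault) {
	        if (inDocument != null) return inDocument;             // 1. in-document declaration
	        if (resourceProperty != null) return resourceProperty; // 2. resource property
	        if (directoryDefault != null) return directoryDefault; // 3. directory default(s)
	        return platformDefault;                                // 4. platform encoding
	    }
	}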

2. Having done it this way in another product because I thought it was the right
answer, I am keenly aware of two issues:

- Users really don't understand content types and don't appreciate the added
level of indirection (extension -> content type -> encoding
interpreter/editor/whatever). Especially note the editor/whatever part.

- If by content type you mean MIME type, it doesn't mean and doesn't uniquely
identify "document type". E.g., text/xml and application/xml are the same
document type with allegedly different encoding (but don't count on it). If you
don't mean MIME type, then I don't know what you mean, but inventing a new
document type naming convention is, as they say, fraught.

3a. Can't say that ignoring all of the user's encoding preferences and using
instead an encoding the user can't see is "polite". Seems downright rude.

3b. The editor is going to dynamically track changes to the many encoding
preferences, which requires resolving to the single relevant encoding preference
each time, so that it can check the entire document and every document change
between preference changes to make sure it is within that encoding? I don't
think so. Or at least, I hope not. It might be valuable to check the document on
save to ensure that information won't be lost - not needed very often, but
certainly handy when it is. However, that check can be non-trivial (and unfortunately
is most useful when it is non-trivial) and is quite beyond the resources of most
editor-writers.

4. I'm not sure that copying the contents of a document, which are
encoding-neutral, has anything to do with how the copy is treated afterward.

5. I don't recall that Java is so perverse that it doesn't recognize any IANA
names, nor that it gives a different interpretation to any. I certainly could be
wrong, but that seems more like material for a bug report to Sun. What I do is
present choices only in terms of IANA names, but attempt to map any name provided
first to a Java name and otherwise to an IANA name, using maps that are not
case-sensitive and accommodate common variations in punctuation; if that fails in
a dialog I flag an error, and if a possibly invalid name appears in a document,
I try the name anyway. I don't consider it possible to predict what names Java
will accept in any given release.

Then what to do if Java rejects a name? I could probably do this better, but
here's what I currently do: If Java rejects a user-specified encoding on save, I
save in UTF-8 and depend on the platform to write a BOM. If the platform
co-operates, then no information is lost, the document is always readable and
save always succeeds. I felt that save succeeding was more important than user
notification (or trying to choose another preference, which might again result
in user notification). To allow a user to do a Save All and walk away without
seeing some dialog pop up, and later experience a power failure, seemed to me to
be a hanging offense. That said, some user notification after the fact, and some
assurance that the BOM is actually written, would improve matters.
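A sketch of that fallback, using the modern java.nio charset API for brevity and writing
the UTF-8 BOM explicitly rather than trusting the platform to add one (the class and
method names are assumptions for the example):

	import java.io.*;
	import java.nio.charset.Charset;
	import java.nio.charset.StandardCharsets;

	class FallbackSaver {
	    static void save(OutputStream out, String text, String requestedEncoding) throws IOException {
	        Charset cs;
	        try {
	            cs = Charset.forName(requestedEncoding);
	        } catch (IllegalArgumentException rejected) {
	            // Java does not know the name: fall back to UTF-8 and mark it with a BOM.
	            cs = StandardCharsets.UTF_8;
	            out.write(new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF });
	        }
	        Writer w = new OutputStreamWriter(out, cs);
	        w.write(text);
	        w.flush();
	    }
	}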

6/7. All encoding interpreters should be treated equally; the
platform-contributed ones should be just like any other. There needs to be a
resolution algorithm, which was the source of my previous comment that if a user
selects an editor for a document and the editor contributes an encoding
interpreter, that interpreter ought to be used. It is really asking a bit much
for users to deal with this as a separate issue, and I don't care that files can
be opened by plug-ins that are not editors, this use case is important enough to
get special treatment.

As always, the resolution algorithm should take into account the important use
cases. Open in editor and Search come readily to mind. (Maybe Search is simple:
the contents should be obtained from already open documents, if possible,
otherwise apply the algorithm.)

8. I agree for Open in editor the user should be given a choice on input if a
specified encoding throws; I'm not so sure for Search; probably this should just
go into the status messages and let the search continue without the file. Use
cases rule.

9. For encoding, the default should be try/catch/recover for both input and
output. Font/code page selection is also involved. The most common report I see
is that the file is read correctly but the user sees boxes or garbage, and
that's often a presentation problem. The current one-font-fits-all behavior
doesn't really cut it, any more than one-encoding-fits-all does. It seems
obvious that the correct presentation can, and should, be selected based on the
encoding if it is, say, ASCII or SHIFT-JIS, but not obvious if it is UTF-8. But
all this is platform-dependent and not one of my areas of expertise.
Comment 6 Bob Foster CLA 2003-06-16 20:46:31 EDT
Actually, WRT 3b, the determination is trivial enough if you try the translation
and it throws.
Comment 7 David Williams CLA 2003-06-25 21:17:41 EDT
Thought I should add some comments on Bob's comments. Hopefully, we'll approach 
some clear statements of principles or use cases, and then we can decide whether the new 
encoding support can/will support those use cases. 

First, on 3b. ... I hate to be thought of as rude :) so I'll clarify what I 
originally meant: given that a resource was read in as UTF-8 (and it had a 3-byte 
BOM), then if the resource is written as UTF-8, it's only polite to also include 
the 3-byte BOM when writing. But if it was read in as UTF-8 (and did not have a 3-byte 
BOM), then if the resource is written as UTF-8, it is only appropriate not to 
include the 3-byte BOM. And my point was that the only way to know what to do about 
the 3-byte BOM is to "carry along" that "how read" info, such as in an 
EncodingMemento. 

At more of a "principle" level, I do think it important to carry along the whole 
encoding info that a resource was read with. And, I think there should be a rule 
that says "in the absence of over-riding information, a resource should be 
written as it was read. The "over-riding" part would be rules 1 and 2 in Bob's 
list, so I guess this would be a "rule 2.1", coming before "directory setting". 
The "use cases" in support of this principle are easy to find: a. If a resource 
is read in with unicode encoding due to some BOM (3 or 4 (or more?) bytes) AND 
it is not otherwise specified, then I think it should be written out the same 
way. I think that's the intent of the unicode standard BOM, though I don't know 
in practice how many people use this technique (since most 'modern files' would 
contain the encoding in the file itself (e.g. for XML, etc.). [Guess it would 
apply to Java files! :) b. Another use case, I've personally seen, is that there 
are some cases for HTML files (which is not so standard on encoding spec's) 
there are some Japanese systems which "peek" inside the file to determine 
encoding (peek in the sense of looking at byte patterns) and if one of those 
Japanese encodings is found, those users would expect (require!) it to be 
written the same way. 

And the above use cases are the reason why it's important to "carry along" the 
encoding info through to IDocument (or similar), and not just resources, so that 
if a resource from one of the above cases results in an IDocument, and that 
IDocument is cloned/copied/saved as some other resource, then it would have 
the same characteristics as the original. 

Just a quick note on the importance of associating encoding rules with "content 
type" ... the prima donna use case for this is JSPs. JSPs typically use a file 
extension of .jsp, but there's no reason a user can't change the settings on 
their web server and want 'jst' to also be interpreted as a JSP file. They 
should then have an easy way to let the tool/platform know that 'jst' should be 
in the JSP "family" (is that a better term?) and then have everything in the 
platform that in some way handles .jsp files handle .jst files in the same way 
(not just editor association). 

Lastly, and what may be a different view between Bob and me, is that I do see 
encoding/decoding only working correctly, from a platform point of view, if the 
resolution algorithm always results in the same interpreter/rule being used to 
do encoding/decoding. The reason for this is that there are so many functions 
(compilers, builders, validators, databases, search indices, fixups, code 
generators, etc.) that all depend on the resource being "interpreted" the same 
way. Of course, this should not preclude some functions (e.g. editors) from 
changing the encoding/decoding, but I think this should be a separate "concept" 
from the "interpreter resolution". In fact, if I can end with a "wild idea" off 
the top of my head, maybe even editors should always assume the same 
encoding/decoding, but there be a convenient "encoding explorer" as part of the 
platform that would give an easy way to view files, determine effects of 
changing encodings (and fonts!), and save new interpretation. Well, just a 
thought. 

Thanks, hope these ramblings are clear enough so they can be distilled to the 
use-case or "encoding principle" level. (Glad I'm not writing the spec :)
Comment 8 Nick Edgar CLA 2003-06-27 11:52:16 EDT
See bug 3970 for a request to allow configurable line delimiters.
If we allow this, it should be supported at the same granularity as the 
encoding settings (see Kai's comment).
Comment 9 Rafael Chaves CLA 2004-02-23 13:54:42 EST
A new revision of the improved file encoding proposal has been made available
off the Platform/Core web page:

http://dev.eclipse.org/viewcvs/index.cgi/%7Echeckout%7E/platform-core-home/dev.html#plan_current

Comments are welcome and should be made on the Platform/Core development list
(platform-core-dev@eclipse.org) or this PR.
Comment 10 Nick Crossley CLA 2004-02-23 14:54:24 EST
While I appreciate the desire to simplify usage, I feel the suggested new plan 
(per-project encodings, with some attempt to discover encodings automatically) 
is a step backwards from the previous one (per-project, folder, and file 
encoding settings).

First, there are good reasons to have documents with multiple encodings in a 
single project.  For example, suppose I want to ship a product on Windows, Mac, 
and UNIX.  I have README or other files with some requirements for non-ASCII 
characters.  For Windows, I might want to use the Windows variant of Latin-1, 
CP 1252.  For Mac, I might want to use Mac Roman.  For UNIX, I might use ISO 
Latin-1.  I might also have Japanese variants in Shift-JIS, Chinese in Big-5, 
and other Asian variants.  These variants are necessary, because a large number 
of sites do not have Unicode system locales - many do not even have Unicode 
locales installed.

Second, note that discovering an encoding by reading a file is difficult, and 
potentially expensive (for files not marked with Unicode BOMs).  How much of a 
file would one read before deciding one knew the encoding?  What if my README 
file contained 2K or more of 7-bit ASCII text before a section that contained 
some symbols, or some accented or Asian characters?

Third, the proposal seems to assume that certain file types imply certain 
encodings - notably, that XML should be in UTF-8.  This is not always the 
case.  There are times when we want to use Latin-1 or even 7-bit ASCII for XML 
(using XML character references for all characters outside those ranges), for 
compatibility with transport mechanisms and older code that cannot handle UTF-
8.  As an extreme case, consider sending an XML document through a legacy 
protocol that is not 8-bit clean.
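As a small illustration of that last point (not from the proposal; it just shows the
technique), the JDK's XML serializer will typically fall back to numeric character
references when the requested output encoding cannot represent a character, so non-ASCII
content can travel as pure 7-bit ASCII:

	import java.io.ByteArrayOutputStream;
	import javax.xml.parsers.DocumentBuilderFactory;
	import javax.xml.transform.OutputKeys;
	import javax.xml.transform.Transformer;
	import javax.xml.transform.TransformerFactory;
	import javax.xml.transform.dom.DOMSource;
	import javax.xml.transform.stream.StreamResult;
	import org.w3c.dom.Document;

	public class AsciiXml {
	    public static void main(String[] args) throws Exception {
	        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
	        doc.appendChild(doc.createElement("note"))
	           .appendChild(doc.createTextNode("caf\u00e9")); // contains a non-ASCII character
	        Transformer t = TransformerFactory.newInstance().newTransformer();
	        t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
	        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
	        t.transform(new DOMSource(doc), new StreamResult(bytes));
	        // Typically prints ...<note>caf&#233;</note>, i.e. pure 7-bit ASCII output.
	        System.out.println(bytes.toString("US-ASCII"));
	    }
	}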
Comment 11 David Williams CLA 2004-03-08 01:52:05 EST
Sorry for the long 'comment' ... this is the description of a contribution I'd 
like to propose, to see if it is helpful. It represents (after much refactoring) some 
code that we've been using to do encoding/decoding in previous products based on 
the Eclipse 2.x stream. I've tried to keep it similar to the current specs and 
proposals I've read, but I suspect much work still needs to be done there, so 
this is something of a "stand alone" version. 





Extensible Content Sensitive Encoding
Contact: David M. Williams
david_williams@us.ibm.com
919-254-0362

The attached contribution (to follow) provides the ability to "peek" inside a 
file to determine its appropriate decoding. This is required for files for which 
there are common and "industry standard" rules ... such as Unicode streams, XML, 
JSP, DTDs, HTML and CSS. (This 'readme' is also contained in the zip file, 
as the package.html file under the primary project, com.ibm.encoding.resource.)


It complements some of the work that's been going on with the Platform and Text 
teams, since I don't think the "detector" part of that spec has been 
implemented, so I hope this code can save them some effort, as well as provide a 
good "test case" of whether the ideas really work. This contribution focuses entirely on 
the content-sensitive part of the requirements ... there are other cases and other 
file types for which the content does not even give a hint as to the encoding. 
There are "hooks" in my code where that work can be tied in so that, when 
appropriate, the algorithms can go and look up the settings according to the user's 
settings.





For anyone taking a look at this code, here's an outline of the places to start 
... the primary packages and classes.

com.ibm.encoding.resource 

     CodedReaderCreator -- creates a Reader with the correct encoding set to read 
characters. 

     CodedStreamCreator -- creates a ByteArrayOutputStream, the bytes of which are 
correctly encoded for storing. 

com.ibm.encoding.resource.contentspecific 

contains "detectors" and infrastructure for XML, JSPs, HTML, CSS, and DTDs. The 
infrastructure simply means the mechanisms to associate the right encoding rules 
(detectors) with the right content type. This infrastructure makes use of 
"contentTypeIdentifers" which I contributed as an append to another bugzilla, 
but have re-included in this package for convenience. This means it should work 
even for .project files which contain NL characters. NOTE: the base Eclipse has 
said they might provide one for XML, and I'm sure there might be hesitancy to 
include the others, for JSP, HTML, CSS, and DTDs, but I felt obligated to 
include them for two reasons: 1) just in case there is a desire (from the community 
or others) to include them, and 2) you can't adequately test the design with 
just one case, so they will be helpful at least for that.

While there are many unit tests which might prove helpful in understanding and 
verifying the code, it's really hard to "see" and appreciate the results without 
an editor, or some other way to visually confirm the correct characters are there.





How to use with an editor

An outline of how to change the basic text editor (via file buffers) to have the 
content read and written correctly. Given the following sorts of changes, the 
basic text editor can open XML, JSP, HTML, CSS, and DTD files no matter what 
their "internal" encoding is (well, as long as it's fairly well formed and is 
supported by the VM). [And, yes, you heard right, that's the same editor; the 
editor shouldn't have much to do with encoding/decoding, since that's a "model 
level" responsibility.]


ResourceTextFileBuffer

commitFileBufferContent

	CodedStreamCreator codedStreamCreator = new CodedStreamCreator();
	codedStreamCreator.set(fFile.getName(), fDocument.get());
	ByteArrayOutputStream byteStream = codedStreamCreator.getCodedByteArrayOutputStream();
	InputStream stream = new ByteArrayInputStream(byteStream.toByteArray());
	//InputStream stream = new ByteArrayInputStream(fDocument.get().getBytes(encoding));


initializeFileBufferContent

	CodedReaderCreator codedReaderCreator = new CodedReaderCreator(fFile);
	fEncoding = codedReaderCreator.getEncodingMemento().getJavaCharsetName();
	//fEncoding = fFile.getPersistentProperty(ENCODING_KEY);


There are actually simpler ways than the above code indicates, but they would 
require more modification of the existing code, so I fit into what's there as 
easily as possible. (Also, I might note, these simple changes don't begin to 
cover error conditions or eliminate incorrect messages. They do not update the 
file property or file states, or anything like that.)

Note, the code as contributed was based on I20040304. I know of some bugs still 
in it, and want to do more cleanup and documentation, but have finally gotten it 
to the point that I think others could study it and give any comments they'd 
find constructive.

Possible Issues

Maybe it's the state of the current code, but it seems lots of objects need to 
know and have the file encoding set and synchronized, all at just the right 
times. This seems very confusing to me. I've tried to "centralize" all the 
encoding "intelligence" and algorithms in just a few classes. AND NOTE: these 
current classes could be "left separate" or encapsulated under IFile/IStorage, 
but I am not sure of the advantage of that, exactly ... maybe I've just gotten too 
used to thinking of IFile and IStorage as binary-providing objects.

I've argued elsewhere, and still believe, that a simple 'charset' string is not 
enough to know how to re-write a file. In many well-known cases, it depends on 
how it was read. For example, a file can be UTF-8 with or without the 3-byte 
BOM. So I advocate that a simple "encodingMemento" be available that can provide 
detailed information about how a stream was decoded; in a good system, this 
information would influence how it was later encoded again. Even if the base 
didn't want to support such elaborate strategies, if getCharsetMemento() returned 
a memento with one method, getCharset(), then this would seem to leave the way 
open for others to implement more elaborate strategies as required by their 
products.

ContentTypeIdentifier. I haven't heard any direct feedback about my original 
proposal, but have seen discussions of even more complicated collections of 
arbitrary information about the contents of a file. That might be a nice thing 
to have, and might work, but it's hard for me to envision how that works without 
a low-level identification first. There's a certain order that, it seems to me, is 
required. It might be summarized as: before you can determine complex type 
information, you need to know how to decode the file, and before you can decode 
the file you need to know what type the file is [and sometimes the file's 
extension is not enough to determine its type].

Maybe I'm over-reacting, but I fear that if there are lots of different, uncoordinated 
pieces of information about files that all need to be saved and modified, all in just 
the right order, then Eclipse will end up appearing like some "proprietary" IDE 
... meaning that things work fine when in Eclipse, but not when exported. As a 
simple example, if a user (or program) sets a document's encoding 
to UTF-16, but due to project settings or some timing problem it actually gets 
saved as UTF-8, then that file would be unreadable. I prefer a central object 
that knows, rather than many objects that all have to stay in sync. So ... just 
a "possible issue". 
Comment 12 David Williams CLA 2004-03-08 02:00:30 EST
Created attachment 8384 [details]
Proposed Contribution to handle encoding based on a file's content

This zip file contains 7 projects, but only one is the 'primary' one,
com.ibm.encoding.resource. I20040304 is the last build I've looked at, so I hope
I'm not way behind :) and I hope some use can be made of this contribution.
Thanks.
Comment 13 DJ Houghton CLA 2004-04-15 09:33:17 EDT
New file encoding support is now available in the latest builds. 
Moving to Rafael for comment/closure.
Comment 14 Masayuki Fuse CLA 2004-04-20 06:47:44 EDT
I've looked at I200404131323. Compile, run and debug look fine, but I got errors 
in the Compare and Search functions. I've tried the following scenario:
1 - create a Java program on RHEL 3.0WS with a Japanese locale (workbench default 
encoding is EUC-JP).
2 - create a Java project on Windows 2003 (workbench default encoding is MS932), and 
change the encoding to EUC-JP on the project properties > Info page
3 - import the Java programs into the project
4 - try run, debug, search, edit and compare from local history

Run and debug look fine. Search with Japanese text pops up an error; the .log is 
attached below. Compare indicated the Japanese text as unmodified, since the Japanese 
text is garbled in the original file; it looks like it is encoded in EUC-JP.

!SESSION 4 20, 2004 19:16:10.500 -----------------------------------------------
java.fullversion=J2RE 1.4.2 IBM Windows 32 build cndev-20040322 (JIT enabled: jitc)
BootLoader constants: OS=win32, ARCH=x86, WS=win32, NL=ja_JP
!ENTRY org.eclipse.core.runtime 4 2 4 20, 2004 19:16:10.516
!MESSAGE An internal error occurred during: "Search for References".
!STACK 0
java.lang.NullPointerException
	at org.eclipse.search2.internal.ui.SearchView.queryFinished(SearchView.java:449)
	at org.eclipse.search2.internal.ui.QueryManager.fireFinished(QueryManager.java:108)
	at org.eclipse.search2.internal.ui.QueryManager.queryFinished(QueryManager.java:126)
	at org.eclipse.search2.internal.ui.InternalSearchUI.searchJobFinished(InternalSearchUI.java:151)
	at org.eclipse.search2.internal.ui.InternalSearchUI.access$1(InternalSearchUI.java:149)
	at org.eclipse.search2.internal.ui.InternalSearchUI$InternalSearchJob.run(InternalSearchUI.java:133)
	at org.eclipse.core.internal.jobs.Worker.run(Worker.java:62)

 
Comment 15 Masayuki Fuse CLA 2004-04-20 06:48:43 EDT
Created attachment 9688 [details]
sample java program encoded by EUC-JP
Comment 16 Masayuki Fuse CLA 2004-04-20 06:54:48 EDT
Created attachment 9689 [details]
screen shot of Compare
Comment 17 Andre Weinand CLA 2004-04-20 06:56:46 EDT
Please file separate bug reports against the search and compare components and list detailed steps for 
how to reproduce the problem.
Comment 18 Masayuki Fuse CLA 2004-04-20 08:09:46 EDT
OK, I've just filed bug 59228 for the Search problem and bug 59232 for the Compare problem.
Comment 19 Rafael Chaves CLA 2004-04-23 11:57:10 EDT
Plan items must target a release.
Comment 20 Rafael Chaves CLA 2004-04-30 16:43:23 EDT
Many of the issues raised in this bug or on the platform-core-dev list have been
addressed (at least at the Core level). The encoding doc will be updated accordingly
soon. Please open a separate bug against Platform/Core for any encoding-related
issues that may arise.
 
Interesting starting points:
- IContentTypeManager#getDescriptionFor
- IEncodedStorage#getCharset
- IFile#getCharset([boolean])
- IFile#getContentDescription
- IContentDescription#BYTE_ORDER_MARK/CHARSET

Thanks for the great feedback.
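For anyone landing here later, a minimal sketch of how a client might exercise those
entry points (an IFile obtained elsewhere is assumed; see the 3.0 docs for the exact
contracts of each call):

	import org.eclipse.core.resources.IFile;
	import org.eclipse.core.runtime.CoreException;
	import org.eclipse.core.runtime.content.IContentDescription;

	public class EncodingQueries {
	    static void describe(IFile file) throws CoreException {
	        // true = fall back to the inherited project/workspace default when the file
	        // has no explicit setting and no content-derived charset.
	        String charset = file.getCharset(true);
	        IContentDescription description = file.getContentDescription(); // may be null
	        Object bom = description == null
	                ? null : description.getProperty(IContentDescription.BYTE_ORDER_MARK);
	        System.out.println(file.getFullPath() + ": charset=" + charset
	                + (bom != null ? " (BOM present)" : ""));
	    }
	}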
Comment 21 Rafael Chaves CLA 2004-05-07 16:57:42 EDT
Closing.