126293 – Non-8 bit characters are garbled in in-editor validation

Bug 126293 - Non-8 bit characters are garbled in in-editor validation

Summary: Non-8 bit characters are garbled in in-editor validation

Status:	CLOSED FIXED

Alias:	None

Product:	WTP Source Editing
Classification:	WebTools
Component:	wst.xml (show other bugs)
Version:	1.5
Hardware:	PC Windows 2000

Importance:	P3 major (vote)
Target Milestone:	1.5 RC2
Assignee:	Lawrence Mandel
QA Contact:

URL:
Whiteboard:
Keywords:	greatbug

Duplicates (1):	126604 (view as bug list)
Depends on:
Blocks:

Reported:	2006-02-02 18:35 EST by Jesper Moller
Modified:	2006-06-19 22:01 EDT (History)
CC List:	2 users (show)

See Also:

Attachments
XSD containing Chinese and Korean elements and patterns. (1.02 KB, application/octet-stream) 2006-02-02 18:38 EST, Jesper Moller	no flags	Details
Instance document for use with the first attachment. (669 bytes, application/octet-stream) 2006-02-02 18:39 EST, Jesper Moller	no flags	Details
Patch for org.eclipse.wst.xml.core and .ui which uses UTF-8 (2.40 KB, patch) 2006-02-02 18:59 EST, Jesper Moller	no flags	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jesper Moller

2006-02-02 18:35:42 EST

When validating a XML or XSD in an open editor, the Unicode string from the SSE document is converted to bytes using thesystem's default charset and back and then fed into the XML parser/validator.

The error is in: /org.eclipse.wst.xml.ui/src/org/eclipse/wst/xml/ui/internal/validation/DelegatingSourceValidator.java line 176 (HEAD). The error is the call to String.getBytes() without encoding name. UTF-8 should be used (if bytes are what we really want)

On most systems, this will convert all characters outside Unicode codepoint 0-255 into questionmarks, and the validation will fail.

This byte array is then read back into a StringReader (again using the system default charset) and fed to the validator, in line 155 of path /org.eclipse.wst.xml.ui/src/org/eclipse/wst/xml/ui/internal/validation/DelegatingSourceValidator.java

The culprit on this end is the default constructor on java.io.InputStreamReader.

I cannot grasp why these haven't been made deprecated in modern Java:
- java.io.InputStreamReader.<ctor>(InputStream)
- java.lang.String.<ctor>(byte[])
- java.lang.String.getBytes()

Caveat: If you just happen to be on a system where UTF-8 is the default charset (as I guess a Linux box could have), this bug won't show.

I've attached a schema file and an xml instance document which references the schema file. The schema uses Korean and Chinese text (for elements and enumerations)

To reproduce:
* Add the two files to a project in the same folder
* Right click and validate them (one at a time) from the Navigator or Package Explorer. Should be valid.
* Open 'international.xsd' in the XSD Editor. Should show some squiggly lines in the elements with asian names.
* Open 'international-instance.xsd' in the XSD Editor. Should show some squiggly lines in the elements with asian names. Shows an error (in right column) for a problem with a processing instruction, because the characters of the element are turned into '?' while makes it look to the parser as a PI.

Also find the patch attached. There are quite possibly MANY of these out there.

Two unrelated bugs:
* On my machine, the Korean fonts are fine in the editor, but just boxes in the outline and content assist. This is a rendering issue, not a problem with the content model (the chinese characters are fine).
* The text lines in the SSE editor have unequal lines, which messes up some text calculations. I don't know if this is in fact the SSE editor or something higher up, like SWT. It also appears in the Java editor.

Comment 1 Jesper Moller

2006-02-02 18:38:15 EST

Created attachment 34052 [details]
XSD containing Chinese and Korean elements and patterns.

Note: This is UTF-8, attached as binary. Save it before you try to open it.

Comment 2 Jesper Moller

2006-02-02 18:39:37 EST

Created attachment 34053 [details]
Instance document for use with the first attachment.

Note: This is UTF-8, attached as binary. Save it before you try to open it.

Comment 3 Jesper Moller

2006-02-02 18:59:51 EST

Created attachment 34054 [details]
Patch for org.eclipse.wst.xml.core and .ui which uses UTF-8

This is the easy fix which just uses UTF-8 for the XML validation.

A better solution may be to just pass the string from the SSE, since it will be fed into a charactersource on the Input to the XML parser anyway, but that would require code signature changes.

Comment 4 David Williams

2006-02-06 10:22:58 EST

I haven't looked at this code in detail, but ... seems that your suggestion of 

InputStreamReader inputReader = new InputStreamReader(inputStream, "UTF-8");

assume's the input streaam is UTF-8? Admittedly common for XML, but not required. 

In general, any thing in Eclipse that needs to know the charset of an xml input stream can/should use the API's around "ContentType". 

To roughly(from memory) outline what it loos like, is something similar to 

IFile file = [getIFile()]

description = file.getContentDescription();

charset = description.getContentProperty(IContentDescription.CHAR_SET);

Naturally, error and sanity checks should be used to make sure the 
retrieved contentType of 'file' isKindOf("platform.xml")

and that the charset actually returns a valid value (could be empty string, or just a nonsensical value). 

Hope this helps.

Comment 5 David Williams

2006-02-06 10:24:28 EST

Lawrence, are you still resolving validator bugs?

Comment 6 David Williams

2006-02-06 15:16:12 EST

*** Bug 126604 has been marked as a duplicate of this bug. ***

Comment 7 Jesper Moller

2006-02-06 15:42:22 EST

(In reply to comment #4)
> I haven't looked at this code in detail, but ... seems that your suggestion of 
> 
> InputStreamReader inputReader = new InputStreamReader(inputStream, "UTF-8");
> 
> assume's the input streaam is UTF-8? Admittedly common for XML, but not
> required. 

Short version:

Confusing perhaps, but not lossy, and independent of encoding of the on-disk file.
The UTF-8 byte array is just an internal representation which is used to transfer the file contents from a String in org.eclipse.wst.xml.ui.internal.validation.DelegatingSourceValidator to 
a different string in org.eclipse.wst.xml.core.internal.validation.XMLValidator, by convoluted means, indeed. The data is never interpreted as an XML file, only as a string carrying XML (if that makes sense).

The long version:

I should have described the bug and fix in greater detail. It is used in the situation where an open editor has its (string) contents validated by the generic XML validator (by Xerces, I guess), to show squiggly lines and markers.

In the special case of the validator being called on a file opened in an SSE editor, the IFile approach is not used. What happens is this:

When the editor is open, the StructuredRegionProcessor runs the ValidatorStrategy.reconcile which activated the ReconcileStepValidator which in the end (for an XML file) delegates to DelegatingSourceValidatorForXML for the given "delta" which is the file being edited (an IFile).
This is where the fun begins, most of it in DelegatingSourceValidator (the superclass).

The DelegatingSourceValidator get the SSE model from the StructuredModelManager. This is the in-memory model of the IFile. This model is turned into a string (which is fine for XML, in that parsing a string disregards the encoding information in the header - since strings are already unambiguous UTF-16), but this string is then turned into a byte array encoded in the platform's native character set. This is the bug, since this is where information is dropped.
What happens next is the tricky part.

The DelegatingSourceValidator implements IValidatorContext (class MyHelper) where the one and only URI of the resource being reconciled is used as the array of URIs, and the one and only model to load is an InputStream off of the converted bytes from above.

The validator runs off an IValidatorContext which can provide a bunch of URIs to validate and a "loadModel" method which returns a model using a well known name (e.g. "getFile" or "inputStream"). It tries the "inputStream" as well as the "getFile", and passes the "inputStream" on to the ValidateAction if found. This in turn calls XMLValidator (in ui) which calls XMLValidator (core) which is where the InputStream is turned back into a string and fed into the InputSource to the XML parser along with the URI of the resource.

That's the problem in a nutshell. I'd say that a bit of refactoring is in order.

My take is that a better patch would leave the string a string and pass that along in place of a InputStream, making the point of encoding moot.
Not only does it save CPU and memory, but is easier to follow as well, isn't it?
Also, the many different classes to do with validation (DelegatingSourceValidator (ui), XMLValidator (ui), ValidateAction(ui) and XMLValidator (core)) are duplicated for XSD, WSDL and XML, and do the same thing, basically.

> In general, any thing in Eclipse that needs to know the charset of an xml input
> stream can/should use the API's around "ContentType". 
> 
> To roughly(from memory) outline what it loos like, is something similar to 
> 
> IFile file = [getIFile()]
> 
> description = file.getContentDescription();
> 
> charset = description.getContentProperty(IContentDescription.CHAR_SET);
> 
> Naturally, error and sanity checks should be used to make sure the 
> retrieved contentType of 'file' isKindOf("platform.xml")
> 
> and that the charset actually returns a valid value (could be empty string, or
> just a nonsensical value). 
> 
> Hope this helps. 

I haven't looked at the rest of the validator stuff, but I think the XML validator uses the URI and the entityresolver to load the XML as binary, since it must be self-describing in most cases.

In conclusion: 
* The patch is safe, since it doen't assume anything about the XML file's encoding.

* A slightly better solution is to use a string in place of the InputStream. I can do that.

* IMO, an even better solution is to rethink the interface between the different callers of the validators and the validators themselves. I don't know if I should be doing that, others probably grasp the code a lot better (Lawrence, is it?)

Comment 8 Lawrence Mandel

2006-02-06 17:11:29 EST

Thanks for the detailed feedback Jesper.

>IMO, an even better solution is to rethink the interface between the
>different callers of the validators and the validators themselves. 

Absolutely. Some of this rethinking has to be done to cleanly separate the various validator UI and core components (the validators have unnecessary UI dependencies). Part of this is due to the validators age and the various different people that have worked on them. I'm aiming to get a lot of this work complete in 1.5 for maintainability and performance reasons. Please keep the feedback coming.

Comment 9 Nicolas Mailhot

2006-02-06 17:30:42 EST

I'll test the patch whenever a patched binary is available BTW

Comment 10 Lawrence Mandel

2006-05-02 10:06:29 EDT

I was able to reproduce this problem and confirmed that the supplied fix solves the problem and does not break any existing test cases.

I've applied the fix (although due to changes in the XML validator I was unable to literally apply the patch) and added the provided files to the XML UI automated test suite.

I've released the changes to 20060502 with the version v200605020740.

Thanks for the detailed description and fix Jesper.

Comment 11 Lawrence Mandel

2006-05-02 10:39:29 EDT

I'm nominating this as a great bug. Jesper identified the bug and provided two test cases that allowed me to reproduce the problem. Jesper also provided a suggested patch and has been very responsive with comments and has offered to test the resolution at any time. This bug showcases the type of involvement we (Eclipse) want to encourage from our community.

Comment 12 Lawrence Mandel

2006-05-02 12:54:16 EDT

>* On my machine, the Korean fonts are fine in the editor, but just boxes in
>the outline and content assist. This is a rendering issue, not a problem with
>the content model (the chinese characters are fine).

I've opened bug 139751 for this problem.

* The text lines in the SSE editor have unequal lines, which messes up some
text calculations. I don't know if this is in fact the SSE editor or something
higher up, like SWT. It also appears in the Java editor.

I'm not sure how to reproduce this problem. If this is still a problem on current WTP 1.5 drivers please open a separate bug against the wst.xml component.

Comment 13 Jesper Moller

2006-06-19 19:21:25 EDT

Verified in WTP1.5 RC5

Comment 14 Lawrence Mandel

2006-06-19 22:01:46 EDT

Thanks for verifying. Closing bug.