Bug 189192 - [Help] Named HTML entities cause errors in help viewer if doctype is xhtml
Summary: [Help] Named HTML entities cause errors in help viewer if doctype is xhtml
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: User Assistance
Version: 3.3
Hardware: All Windows XP
Importance: P3 major
Target Milestone: 3.4 M1
Assignee: Adam Archer CLA
QA Contact:
URL:
Whiteboard:
Keywords: contributed
Depends on:
Blocks: 204281 206367 245984
 
Reported: 2007-05-25 14:06 EDT by Douglas Dirks CLA
Modified: 2008-09-11 19:53 EDT (History)

See Also:


Attachments
Example doc plugin to demonstrate html entity problem (2.95 KB, application/octet-stream)
2007-05-25 17:27 EDT, Douglas Dirks CLA
patch (7.65 KB, patch)
2007-06-27 13:12 EDT, Adam Archer CLA
patch (9.05 KB, patch)
2007-07-10 11:10 EDT, Adam Archer CLA
patch (10.26 KB, patch)
2007-07-10 12:34 EDT, Adam Archer CLA

Description Douglas Dirks CLA 2007-05-25 14:06:14 EDT
If a documentation plugin has content stored in an HTML file for which 
the DOCTYPE is:

  XHTML 1.0 Transitional

named entities such as &amp;nbsp; or &amp;rarr; will either cause a parsing
error or simply fail to display.

When an error occurs, it looks like this:

  org.xml.sax.SAXParseException: Reference to undefined entity "&amp;rarr;"

I have seen this behavior on Windows XP and Linux (Red Hat) systems.
I have also seen the page in question rendered properly *except* for the
named entities; this happens on a Windows XP system with IE 7 installed.

If the DOCTYPE is set to html instead:

  HTML 4.01 Transitional

the file displays properly in all cases.

Note that the named entity references display properly in Eclipse 3.2.
Although there is a workaround (using the decimal entity numbers rather
than the entity names), I've marked this as major because files that
rendered properly in 3.2 will now cause errors or display incorrectly.
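
The workaround mentioned above can be sketched as a small converter that rewrites named entity references as decimal ones. This is only an illustration: the class name, the method name, and the three-entry table are assumptions made for the example, and a real converter would need the full XHTML 1.0 entity sets.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityNames {
    // Illustrative subset only; the XHTML 1.0 entity sets define many more names.
    private static final Map<String, Integer> CODE_POINTS = Map.of(
        "nbsp", 160,
        "copy", 169,
        "rarr", 8594);

    private static final Pattern NAMED = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Replaces known named entity references with decimal references. */
    public static String toNumeric(String xhtml) {
        Matcher m = NAMED.matcher(xhtml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            // Names not in the table (for example the XML built-ins such as
            // &amp;) are left exactly as they were.
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumeric("Next&nbsp;&rarr; &amp; done"));
    }
}
```

Numeric references do not depend on any DTD, which is why they survive the XML parsing step regardless of whether the doctype's entity declarations can be loaded.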

I have a simple doc plugin that demonstrates the problem, but I'm not
sure how to attach it to this bug report. I'm happy to supply it if
someone contacts me.
Comment 1 Douglas Dirks CLA 2007-05-25 14:09:03 EDT
In the initial description I inserted "&amp;" for the ampersand character, thinking it might be intercepted by the bugzilla web interface. It looks like
it's not. The entity names in question are really "&nbsp;" and "&rarr;" and
the like. Sorry for the confusion.
Comment 2 Chris Goldthorpe CLA 2007-05-25 14:15:15 EDT
Can I go ahead and close this bug?
Comment 3 Chris Goldthorpe CLA 2007-05-25 14:16:15 EDT
Or is the bug still valid but the description has changed?
Comment 4 Douglas Dirks CLA 2007-05-25 14:24:45 EDT
My comment #1 had only to do with the formatting of the text of the initial bug report. None of the particulars of the bug are changed. Thanks...
Comment 5 Chris Goldthorpe CLA 2007-05-25 17:17:36 EDT
If you can attach the plugin that would be very helpful. 

One way to do this is File/Export/Plug-in Development/Deployable Plugins and Fragments; the default settings will create a single jar file which can be attached.
Comment 6 Douglas Dirks CLA 2007-05-25 17:27:55 EDT
Created attachment 68836 [details]
Example doc plugin to demonstrate html entity problem

To see the behavior described in this bug, put the attached html_entities.jar file in the plugins directory and launch the help system. The top-level TOC entry "HTML Entities Test Plugin" contains example HTML and XHTML.

With this plugin, both the HTML and XHTML pages will display (even though the XHTML page displays incorrectly) on my WinXP SP2 machine with IE7 installed. (The behavior is the same if I use Firefox as the help system browser.) On some other WinXP machines and on Linux (RedHat Enterprise 3) machines, I see the SAX parser errors described above.
Comment 7 Chris Goldthorpe CLA 2007-05-25 17:40:56 EDT
I can see the bad rendering (using IE6) but not the parse error - can you paste a complete stack trace in?
Comment 8 Douglas Dirks CLA 2007-05-25 18:23:34 EDT
Here's the stack trace from a Windows XP SP2 box:

 
An error occured while processing the requested document: 

org.xml.sax.SAXParseException: Reference to undefined entity "&rarr;".
	at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3376)
	at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3370)
	at org.apache.crimson.parser.Parser2.expandEntityInContent(Parser2.java:2697)
	at org.apache.crimson.parser.Parser2.maybeReferenceInContent(Parser2.java:2606)
	at org.apache.crimson.parser.Parser2.content(Parser2.java:2017)
	at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1691)
	at org.apache.crimson.parser.Parser2.content(Parser2.java:1963)
	at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1691)
	at org.apache.crimson.parser.Parser2.content(Parser2.java:1963)
	at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1691)
	at org.apache.crimson.parser.Parser2.content(Parser2.java:1963)
	at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1691)
	at org.apache.crimson.parser.Parser2.content(Parser2.java:1963)
	at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1691)
	at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:667)
	at org.apache.crimson.parser.Parser2.parse(Parser2.java:337)
	at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:448)
	at org.apache.crimson.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:185)
	at org.eclipse.help.internal.dynamic.DocumentReader.read(DocumentReader.java:56)
	at org.eclipse.help.internal.dynamic.XMLProcessor.process(XMLProcessor.java:49)
	at org.eclipse.help.internal.xhtml.DynamicXHTMLProcessor.process(DynamicXHTMLProcessor.java:66)
	at org.eclipse.help.internal.webapp.servlet.DynamicXHTMLFilter$1.close(DynamicXHTMLFilter.java:79)
	at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
	at org.eclipse.help.internal.webapp.servlet.FilterHTMLHeadAndBodyOutputStream.close(FilterHTMLHeadAndBodyOutputStream.java:290)
	at org.eclipse.help.internal.webapp.servlet.EclipseConnector.transfer(EclipseConnector.java:136)
	at org.eclipse.help.internal.webapp.servlet.ContentServlet.doGet(ContentServlet.java:42)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:596)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
	at org.eclipse.equinox.http.registry.internal.ServletManager$ServletWrapper.service(ServletManager.java:177)
	at org.eclipse.equinox.http.servlet.internal.ServletRegistration.handleRequest(ServletRegistration.java:91)
	at org.eclipse.equinox.http.servlet.internal.ProxyServlet.processAlias(ProxyServlet.java:110)
	at org.eclipse.equinox.http.servlet.internal.ProxyServlet.service(ProxyServlet.java:68)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
	at org.eclipse.equinox.http.jetty.internal.HttpServerManager$InternalHttpServiceServlet.service(HttpServerManager.java:277)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
	at org.mortbay.jetty.servlet.ServletHandler.dispatch(ServletHandler.java:677)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
	at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
	at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
	at org.mortbay.http.HttpServer.service(HttpServer.java:909)
	at org.mortbay.http.HttpConnection.service(HttpConnection.java:820)
	at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:986)
	at org.mortbay.http.HttpConnection.handle(HttpConnection.java:837)
	at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:245)
	at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
	at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

Here's the trace from a Linux (RedHat 3 Enterprise) box:

An error occured while processing the requested document:

org.xml.sax.SAXParseException: Reference to undefined entity "&rarr;".
	at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3339)
	at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3333)
	at org.apache.crimson.parser.Parser2.expandEntityInContent(Parser2.java:2660)
	at org.apache.crimson.parser.Parser2.maybeReferenceInContent(Parser2.java:2569)
	at org.apache.crimson.parser.Parser2.content(Parser2.java:1980)
	at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1654)
	at org.apache.crimson.parser.Parser2.content(Parser2.java:1926)
	at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1654)
	at org.apache.crimson.parser.Parser2.content(Parser2.java:1926)
	at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1654)
	at org.apache.crimson.parser.Parser2.content(Parser2.java:1926)
	at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1654)
	at org.apache.crimson.parser.Parser2.content(Parser2.java:1926)
	at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1654)
	at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:634)
	at org.apache.crimson.parser.Parser2.parse(Parser2.java:333)
	at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:448)
	at org.apache.crimson.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:185)
	at org.eclipse.help.internal.dynamic.DocumentReader.read(DocumentReader.java:56)
	at org.eclipse.help.internal.dynamic.XMLProcessor.process(XMLProcessor.java:49)
	at org.eclipse.help.internal.xhtml.DynamicXHTMLProcessor.process(DynamicXHTMLProcessor.java:66)
	at org.eclipse.help.internal.webapp.servlet.DynamicXHTMLFilter$1.close(DynamicXHTMLFilter.java:79)
	at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
	at org.eclipse.help.internal.webapp.servlet.FilterHTMLHeadAndBodyOutputStream.close(FilterHTMLHeadAndBodyOutputStream.java:290)
	at org.eclipse.help.internal.webapp.servlet.EclipseConnector.transfer(EclipseConnector.java:136)
	at org.eclipse.help.internal.webapp.servlet.ContentServlet.doGet(ContentServlet.java:42)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:596)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
	at org.eclipse.equinox.http.registry.internal.ServletManager$ServletWrapper.service(ServletManager.java:177)
	at org.eclipse.equinox.http.servlet.internal.ServletRegistration.handleRequest(ServletRegistration.java:91)
	at org.eclipse.equinox.http.servlet.internal.ProxyServlet.processAlias(ProxyServlet.java:110)
	at org.eclipse.equinox.http.servlet.internal.ProxyServlet.service(ProxyServlet.java:68)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
	at org.eclipse.equinox.http.jetty.internal.HttpServerManager$InternalHttpServiceServlet.service(HttpServerManager.java:277)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
	at org.mortbay.jetty.servlet.ServletHandler.dispatch(ServletHandler.java:677)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
	at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
	at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
	at org.mortbay.http.HttpServer.service(HttpServer.java:909)
	at org.mortbay.http.HttpConnection.service(HttpConnection.java:820)
	at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:986)
	at org.mortbay.http.HttpConnection.handle(HttpConnection.java:837)
	at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:245)
	at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
	at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

Comment 9 Chris Goldthorpe CLA 2007-05-25 18:30:37 EDT
Hi Adam, can you take a look into this?
Comment 10 Adam Archer CLA 2007-05-28 14:31:25 EDT
I've reproduced both the org.xml.sax.SAXParseException as described and the behaviour that Chris was seeing, in which no exception is thrown but the escape characters do not render correctly.

The SAXParseException can only be seen on the Apache Crimson parser, which is no longer in use on more recent Sun JREs (it is used by Sun JRE v1.4.2). On all newer parsers, the escape characters are still not rendered correctly, but no exception is thrown.

The reason for this behaviour is that the help system runs XHTML documents through an XML parser before it passes the HTML on to the browser, in order to determine whether any dynamic content has been included that needs to be resolved (see http://help.eclipse.org/help32/topic/org.eclipse.platform.doc.isv/guide/ua_dynamic.htm). The XML parser tries to resolve every entity it sees to determine whether it is valid. To do this it first checks the XML DTD (included with the parser), and if the entity is not found there it attempts to go to the DTD for the specified doctype. This requires an excessive number of network calls, which are now being suppressed in 3.3 for performance reasons. This is why things like "&gt;" still work: they are included in the XML DTD and are therefore found by the parser.
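
The difference between built-in and doctype-defined entities can be seen with a few lines against the JAXP parser that ships with the JRE. This is a simplified sketch, not the help system's code: the class and method names are made up for the example, and the test document deliberately has no DOCTYPE at all, so the parser has no declaration for "rarr" to fall back on. With a DOCTYPE whose DTD cannot be loaded, the outcome depends on the parser, as described above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class EntityParseDemo {
    /** Returns true if the body parses as XML, false if the parser rejects it. */
    public static boolean parses(String body) {
        String doc = "<html><body><p>" + body + "</p></body></html>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(doc)));
            return true;
        } catch (SAXParseException e) {
            // Raised for well-formedness errors such as an undeclared entity.
            return false;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("&gt;    parses: " + parses("a &gt; b"));    // predefined by XML itself
        System.out.println("&#8594; parses: " + parses("a &#8594; b")); // numeric reference
        System.out.println("&rarr;  parses: " + parses("a &rarr; b"));  // needs a declaration
    }
}
```

Only the third call should be rejected, since "rarr" is declared in the XHTML entity sets rather than by XML itself.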

Fixing this would require significant changes to the code and we are now too late in the cycle to address this for 3.3. For now XHTML docs will need to be written (or updated) to use the entity numbers rather than the entity names.

Here is a list of the most common entities: http://www.w3schools.com/tags/ref_entities.asp
Comment 11 Adam Archer CLA 2007-06-27 13:12:38 EDT
Created attachment 72626 [details]
patch

We stopped shipping the DTDs due to legal concerns and we started suppressing the excessive number of network calls to them for performance reasons.

This patch works around both of those issues by retrieving DTDs the first time they are requested (with a network call) and caching them in the Eclipse configuration directory (under "<configuration>/org.eclipse.help/DTDs"). It then uses the cached copies for all subsequent requests.
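
The general technique can be sketched with a SAX EntityResolver that checks a local directory before going to the network. This is not the code from the attached patch: the class name, the file-naming scheme, and the parseText helper are assumptions made for illustration, and the cache directory is simply whatever path the caller passes in.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CachingDtdResolver implements EntityResolver {
    private final Path cacheDir;

    public CachingDtdResolver(Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    /** Hypothetical naming scheme: flatten the system ID into a safe file name. */
    public static String cacheFileName(String systemId) {
        return systemId.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        Path cached = cacheDir.resolve(cacheFileName(systemId));
        if (!Files.exists(cached)) {
            // First request: fetch over the network, then move into place so a
            // failed download never leaves a truncated file in the cache.
            Files.createDirectories(cacheDir);
            Path partial = Files.createTempFile(cacheDir, "dtd", ".part");
            try (InputStream in = URI.create(systemId).toURL().openStream()) {
                Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
            }
            Files.move(partial, cached, StandardCopyOption.REPLACE_EXISTING);
        }
        InputSource source = new InputSource(Files.newInputStream(cached));
        // Keep the original IDs so relative references inside the DTD
        // (such as its entity files) are resolved against the right base.
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    /** Parses the XML with this resolver and returns the root element's text. */
    public static String parseText(String xml, Path cacheDir) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(new CachingDtdResolver(cacheDir));
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Once a DTD is in the cache directory, later parses need no network access; if the first download fails, the exception propagates to the caller and nothing is cached.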
Comment 12 Chris Goldthorpe CLA 2007-07-09 14:56:17 EDT
The patch is good and has been applied to HEAD. The patch relies on an internet connection to read the DTDs but this is an improvement over being unable to resolve the entities. Without an internet connection the original problem still exists.
Comment 13 Chris Goldthorpe CLA 2007-07-09 16:59:27 EDT
Backing out the patch and reopening as the JUnit test "testXMLProcessor" is failing. I should have run the JUnits before I committed this patch.
Comment 14 Adam Archer CLA 2007-07-10 10:07:44 EDT
The test is failing because the DocumentReader is now adding the following to the script tag for live help:

xml:space="preserve"

This must have to do with the presence of the DTD, but I'm not yet sure why the parser would be arbitrarily inserting this attribute.
Comment 15 Adam Archer CLA 2007-07-10 11:10:32 EDT
Created attachment 73432 [details]
patch

After some investigation it seems that the behaviour of the parser is to insert the default value whenever it finds a node that is missing a required attribute. Since it now has access to the DTDs, it is able to find these nodes. This does not affect the functionality of the parsing; it just means that the outputted document model does not exactly match the inputted xhtml.
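
The effect is easy to reproduce in isolation. The sketch below is an illustration only, with made-up class and method names: it uses a one-line internal DTD subset that gives the script element's xml:space attribute a fixed value of "preserve", standing in for the full XHTML DTD, and then checks whether the parsed DOM contains the attribute.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DefaultAttributeDemo {
    // Stand-in for the real DTD: declares a fixed value for xml:space on <script>.
    public static final String WITH_DTD =
        "<!DOCTYPE html [<!ATTLIST script xml:space (preserve) #FIXED 'preserve'>]>"
        + "<html><script/></html>";

    // Same document with no DTD information available to the parser.
    public static final String WITHOUT_DTD = "<html><script/></html>";

    /** Returns true if the parsed <script> element carries an xml:space attribute. */
    public static boolean scriptHasXmlSpace(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element script = (Element) doc.getElementsByTagName("script").item(0);
            return script.hasAttribute("xml:space");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("with DTD:    " + scriptHasXmlSpace(WITH_DTD));
        System.out.println("without DTD: " + scriptHasXmlSpace(WITHOUT_DTD));
    }
}
```

The source text never mentions xml:space in either case; the attribute appears in the DOM only when the parser has seen a declaration that supplies a value for it.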

To work around this problem, the <script> and <a> nodes in "xhtml_expected.txt" have been updated in this new patch to include the missing required attributes ("xml:space" and "shape", respectively).

The alternative would be to implement our own XML parser that does not insert defaults. This seems like overkill.
Comment 16 Chris Goldthorpe CLA 2007-07-10 12:14:02 EDT
With the new patch will the tests pass with or without an internet connection?
Comment 17 Adam Archer CLA 2007-07-10 12:34:22 EDT
Created attachment 73443 [details]
patch

With the DTDs not yet in the config directory and no internet connection, the test would fail since it would revert to the old behaviour and would not add the required attributes.

In this version of the patch, the xhtml input file for the test also contains the default values. This ensures that even with the old behaviour (no DTD) they will be included in the output.

The test should now pass under all circumstances.
Comment 18 Chris Goldthorpe CLA 2007-07-10 13:23:52 EDT
Patch committed to HEAD (with copyright statements added).
Comment 19 Chris Goldthorpe CLA 2008-01-14 19:15:21 EST
*** Bug 214376 has been marked as a duplicate of this bug. ***
Comment 20 Rick Sapir CLA 2008-09-02 14:59:49 EDT
Reopening. This was re-discovered in https://bugs.eclipse.org/bugs/show_bug.cgi?id=245984  
Comment 21 Chris Goldthorpe CLA 2008-09-03 16:51:24 EDT
What version of Eclipse are you using?

I just tried this using I20080812-0800 with the example plugin which is attached to this bug and it is working fine. Can you test using the example attached to this bug?

It is possible that the other bug you referred to is using an entity which is not in one of the DTDs which gets included in Eclipse.
Comment 22 Rick Sapir CLA 2008-09-08 13:50:51 EDT
It was (re)reported in https://bugs.eclipse.org/bugs/show_bug.cgi?id=245984
The text uses &nbsp; and &copy; entities:

Copyright&nbsp;&copy;&nbsp;2006, 2008,&nbsp;Oracle.&nbsp;All&nbsp;rights&nbsp;reserved.
Comment 23 Chris Goldthorpe CLA 2008-09-09 15:53:10 EDT
I was able to use the &copy; and &nbsp; entities without problem in an xhtml document and have them render correctly using Eclipse 3.4. It seems that Bug 245984 is a different problem, and one which does not have a test case. If Bug 245984 is a problem with the help system, then a new bug with a test case should be opened. Since the example in this bug works fine and continues to work when I add a copy or nbsp entity, I plan to set the state of this bug back to fixed.
Comment 24 Chris Goldthorpe CLA 2008-09-11 19:53:48 EDT
I'm going to set the state back to FIXED because the test case attached works fine. If you have a different test case or a different scenario that is failing then please open a new bug.