Re: [platform-help-dev] Is UTF-8 encoding assumed for all languages?

Here are my results from testing the default configurations of various HTTP
servers:

Apache HTTPd 1.3.26: BAD -- passes the charset=iso-8859-1 HTTP header; NL
content corrupted through proxy
Apache HTTPd 1.3.27: GOOD -- does not pass a charset HTTP header; NL content
works through proxy
Apache HTTPd 2.0.45: BAD -- passes the charset=iso-8859-1 HTTP header; NL
content corrupted through proxy

IBM HTTP Server 1.3.19.2: BAD -- passes the charset=iso-8859-1 HTTP header;
NL content corrupted through proxy
IBM HTTP Server 1.3.26.1: GOOD -- passes the charset=UTF-8 HTTP header; NL
content works through proxy

I also tested some NL documentation that has been converted to UTF-8 and
which contains the correct <meta> element in the content. Unfortunately,
the iso-8859-1 statement passed by the "BAD" HTTP servers for the proxied
URL even corrupts that content.
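
A rough sketch of why the <meta> element does not rescue the content:
browsers generally let a charset declared in the HTTP Content-Type header
take precedence over one declared in the document. The helper names below
are hypothetical, and the <meta> scan is a simplified assumption, not the
prescan a real browser performs.

```python
import re
from email.message import Message
from typing import Optional

# Simplified pattern for a charset inside a <meta> element (an assumption;
# real browsers use a more elaborate prescan).
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def header_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(body: bytes) -> Optional[str]:
    """Return the charset named in a <meta> element near the top of the page."""
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type: str, body: bytes) -> Optional[str]:
    """The HTTP header wins; the <meta> element is only a fallback."""
    return header_charset(content_type) or meta_charset(body)


page = b'<html><head><meta http-equiv="Content-Type" ' \
       b'content="text/html; charset=UTF-8"></head></html>'

# Header sent by a "BAD" server versus no charset in the header at all.
print(effective_charset("text/html; charset=iso-8859-1", page))
print(effective_charset("text/html", page))
```

With the charset in the header, the UTF-8 declaration in the page is never
consulted, which matches what I saw with the converted documentation.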

Note that Apache 1.3.12 and up and Apache 2.0.x have an httpd.conf
directive called AddDefaultCharset, which, if turned off or commented out,
will prevent the proxy from adding the charset statement to the headers
(and thereby enable NL content to be served up from a proxied URL
correctly).

In Apache 1.3.12+ (this also appears to work for IBM HTTP Server), set the
configuration directive to:
AddDefaultCharset Off

In Apache 2.0.x, comment out the configuration directive:
#AddDefaultCharset ISO-8859-1

Note that I have not tested HTTP servers that are not Apache-based; results
from anyone with easy access to IIS or other typical HTTP servers would
help us nicely round out the documentation for InfoCenter mode.


Dan




                                                                                                                                              
Dan Scott/Toronto/IBM@IBMCA
Sent by:  platform-help-dev-admin@eclipse.org
08/05/2003 08:56 AM
To:       platform-help-dev@xxxxxxxxxxx
cc:
Subject:  Re: [platform-help-dev] Is UTF-8 encoding assumed for all languages?
Please respond to platform-help-dev



Hi Konrad:

The files I am viewing do contain the expected <meta HTTP-EQUIV blah>
element specifying the code page.

Another data point to consider: the only time I see corrupted characters is
when I'm viewing the help system through a proxied URL, rather than viewing
the help system directly through the port (e.g.
http://<hostname>:<port>/help/ works fine, but http://<hostname>/infocenter
shows corrupted characters for non-latin1 encodings).
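
To illustrate the kind of corruption involved, here is a minimal sketch of
what happens when windows-1251 bytes are interpreted as iso-8859-1. The
sample word is an assumption chosen for illustration and is not taken from
the help content.

```python
# Illustrative sample text (an assumption, not from the actual help files).
text = "Справка"

# Bytes as they would be stored in a windows-1251 encoded document.
raw = text.encode("windows-1251")

# What a browser shows if it is told the bytes are iso-8859-1.
mangled = raw.decode("iso-8859-1")

# The bytes themselves are intact; decoding with the right charset recovers
# the original text.
restored = mangled.encode("iso-8859-1").decode("windows-1251")

print(mangled)
print(restored == text)
```

The bytes on the wire are the same in both cases; only the charset label
used to interpret them differs.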

This probably explains why a colleague of mine couldn't reproduce the
problem (and why he thought I was crazy, heh).

I'm running the Eclipse help system on a Linux machine proxied through
Apache 1.3.26. Just noticed that Apache 1.3.27 has been released with the
following bug fix:



<quote>


The following bugs were found in Apache 1.3.26 and have been fixed in
Apache 1.3.27:
      mod_proxy fixes:
            The cache in mod_proxy was incorrectly updating the
            Content-Length value from 304 responses when doing validation.
            Fix a problem in proxy where headers from other modules were
            added to the response headers when this was already done in the
            core already.
</quote>

I wondered whether Apache 1.3.26 was adding a charset header to the
returned document in the proxied help system, so I played with wget asking
for a Russian document (which is encoded in windows-1251). The wget output is
below; but you can clearly see that in the first case (proxied URL) the web
server is adding a "charset=iso-8859-1" header, which we don't see in the
second case (connecting directly to help system port).

I'll see if I can upgrade to Apache 1.3.27 to reproduce the test (but
hopefully see better test results!). If it turns out that Apache 1.3.27
solves the problem, this will probably be a useful warning to document in
the 'Installing the help system as an infocenter' topic.

wget output:

dan@daniels:~$ wget -S --header='Accept-Language: ru'
http://daniels.hostname.com/prod/infocenter/topic/com.prod.doc/core/filename.htm

--08:45:53--
http://daniels.hostname.com/prod/infocenter/topic/com.prod.doc/core/filename.htm

           => `filename.htm.1'
Resolving daniels.hostname.com... done.
Connecting to daniels.hostname.com[9.26.162.217]:80... connected.
HTTP request sent, awaiting response...
 1 HTTP/1.1 200 OK
 2 Date: Thu, 08 May 2003 12:45:53 GMT
 3 Server: Apache Tomcat/4.0.6 (HTTP/1.1 Connector)
 4 Content-Type: text/html; charset=iso-8859-1
 5 Cache-Control: max-age=10000
 6 X-Cache: MISS from daniels.hostname.com
 7 Connection: close

    [ <=>
] 5,581          5.32M/s

08:45:53 (5.32 MB/s) - `filename.htm.1' saved [5581]

dan@daniels:~$ vim filename.htm.1
dan@daniels:~$ wget -S --header='Accept-Language: ru'
http://daniels.hostname.com:8084/help/topic/com.prod.doc/core/filename.htm
--08:46:38--
http://daniels.hostname.com:8084/help/topic/com.prod.doc/core/filename.htm
           => `filename.htm.2'
Resolving daniels.hostname.com... done.
Connecting to daniels.hostname.com[9.26.162.217]:8084... connected.
HTTP request sent, awaiting response...
 1 HTTP/1.1 200 OK
 2 Content-Type: text/html
 3 Date: Thu, 08 May 2003 12:46:38 GMT
 4 Server: Apache Tomcat/4.0.6 (HTTP/1.1 Connector)
 5 Cache-Control: max-age=10000
 6 Connection: close

    [ <=>
] 5,581          5.32M/s

08:46:38 (5.32 MB/s) - `filename.htm.2' saved [5581]
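
For anyone who wants to repeat the comparison on saved wget -S output, a
small sketch along these lines pulls the declared charset out of the header
lines. The function name is hypothetical, and the sample lines are copied
from the two responses above.

```python
import re
from email.message import Message
from typing import Iterable, Optional


def charset_from_wget_headers(lines: Iterable[str]) -> Optional[str]:
    """Return the charset declared in the Content-Type line, or None."""
    for line in lines:
        # This wget -S output prefixes each header line with a number.
        header = re.sub(r"^\s*\d+\s+", "", line)
        name, sep, value = header.partition(":")
        if sep and name.strip().lower() == "content-type":
            msg = Message()
            msg["Content-Type"] = value.strip()
            return msg.get_content_charset()
    return None


# Header lines from the proxied request (port 80).
proxied = [
    " 1 HTTP/1.1 200 OK",
    " 2 Date: Thu, 08 May 2003 12:45:53 GMT",
    " 4 Content-Type: text/html; charset=iso-8859-1",
]

# Header lines from the direct request (port 8084).
direct = [
    " 1 HTTP/1.1 200 OK",
    " 2 Content-Type: text/html",
    " 3 Date: Thu, 08 May 2003 12:46:38 GMT",
]

print("proxied:", charset_from_wget_headers(proxied))
print("direct: ", charset_from_wget_headers(direct))
```

A charset reported for the proxied response but not for the direct one
points at the proxy as the component adding it.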


Dan

--
Dan Scott




Konrad Kolosowski/Toronto/IBM@IBMCA
Sent by:  platform-help-dev-admin@eclipse.org
07/05/2003 05:45 PM
To:       platform-help-dev@xxxxxxxxxxx
cc:
Subject:  Re: [platform-help-dev] Is UTF-8 encoding assumed for all languages?
Please respond to platform-help-dev






Hi Dan.

There is no assumption on which encoding documents come in.  I think your
problem might be that some documents do not specify encoding correctly (for
example, <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=big5">, which the Eclipse translations have in the head for Chinese),
and the browser has to resort to auto detection.  The auto detection part
of a particular browser may look at the containing frameset document to
guess.

If the charset is specified as above and you still see the problem, open a
bug against help and we will investigate it.

Konrad Kolosowski
Eclipse Help System




Dan Scott/Toronto/IBM@IBMCA
Sent by:  platform-help-dev-admin@eclipse.org
05/07/2003 05:13 PM
To:       platform-help-dev@xxxxxxxxxxx
cc:
Subject:  [platform-help-dev] Is UTF-8 encoding assumed for all languages?
Please respond to platform-help-dev






Hi:

I'm experiencing some strangeness with NL content in the help system. I
have Russian documents (navigation and help files) encoded in windows-1251
code page that sometimes display as gibberish.

It looks to me like the frameset document (index.jsp) encoding of UTF-8 is
interfering with the browser's interpretation of the encodings in the
individual frames.

This problem occurs in both Mozilla 1.3.1 and Internet Explorer 6. Is this
a known limitation of the help system or of browsers?

I suppose a workaround would be to convert all of our help content to UTF-8
before generating the doc plugins... yikes.

Dan
--
Dan Scott

_______________________________________________
platform-help-dev mailing list
platform-help-dev@xxxxxxxxxxx
http://dev.eclipse.org/mailman/listinfo/platform-help-dev


