Re: [platform-help-dev] Lucene analyzers for double-byte languages?

Erik,

In Eclipse < 3.0, search requests are encoded using the JavaScript escape()
function to support older browsers.  What is worse, escape() produces
different results on different browsers.  On IE and Netscape it does not
encode all characters and results in nonstandard encoding.  Try
searching help from Mozilla (which encodes correctly); I think the search
should work.  You can use a TCP/IP monitor to record the URLs that the different
browsers are sending to the infocenter.
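
To make the encoding difference concrete, here is a minimal Java sketch. The
class and method names are hypothetical. legacyEscape() emulates the behavior
that ECMAScript Annex B defines for escape(); it is not code from Eclipse or
from any browser. utf8Encode() uses java.net.URLEncoder, which produces form
encoding (a space becomes "+"), so it is close to, but not identical to,
encodeURIComponent().

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class QueryEncodingDemo {

    // Characters that escape() leaves untouched (ECMAScript Annex B).
    private static final String UNESCAPED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./";

    /** Emulates escape(): %XX for code units below 256, %uXXXX for the rest. */
    static String legacyEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (UNESCAPED.indexOf(c) >= 0) {
                out.append(c);
            } else if (c < 256) {
                out.append(String.format("%%%02X", (int) c));
            } else {
                out.append(String.format("%%u%04X", (int) c));
            }
        }
        return out.toString();
    }

    /** Standard form: the UTF-8 bytes of the text, each percent-encoded. */
    static String utf8Encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Two kanji (U+691C U+7D22), written as escapes so the source stays ASCII.
        String query = "\u691C\u7D22";
        System.out.println(legacyEscape(query)); // %u691C%u7D22
        System.out.println(utf8Encode(query));   // %E6%A4%9C%E7%B4%A2
    }
}
```

The %uXXXX form is the nonstandard part: a decoder that expects two hex
digits after every "%" will reject it.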

Since some browsers use nonstandard encoding, help does not call the server
API to obtain URL parameters, but contains custom code for parsing the
requests.  We observed no problems with that approach when running the
internal Tomcat or running the infocenter on Tomcat 4.0.x.  You are running
the infocenter on Tomcat 4.1 with the Coyote connector, which might parse the
request even when not asked to do so, and fail.  Try setting up the infocenter
on Tomcat 4.0.x and see if it eliminates the exception for Japanese searches.
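
As an illustration of what "custom code for parsing the requests" could look
like, here is a minimal sketch of a lenient decoder that accepts both the
standard %XX form (interpreted as UTF-8 bytes) and the nonstandard %uXXXX
form. This is an assumption about the general technique, not the actual
Eclipse help code; the class name LenientQueryDecoder is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class LenientQueryDecoder {

    /** Decodes %XX runs as UTF-8 bytes and %uXXXX sequences as UTF-16 code units. */
    static String decode(String raw) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream(); // bytes of a %XX run
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (raw.startsWith("%u", i) && isHex(raw, i + 2, 4)) {
                flush(pending, out);
                out.append((char) Integer.parseInt(raw.substring(i + 2, i + 6), 16));
                i += 6;
            } else if (c == '%' && isHex(raw, i + 1, 2)) {
                pending.write(Integer.parseInt(raw.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                flush(pending, out);
                out.append(c == '+' ? ' ' : c); // '+' means space in form encoding
                i++;
            }
        }
        flush(pending, out);
        return out.toString();
    }

    /** True if 'count' hex digits start at index 'start'. */
    private static boolean isHex(String s, int start, int count) {
        if (start + count > s.length()) return false;
        for (int i = start; i < start + count; i++) {
            if (Character.digit(s.charAt(i), 16) < 0) return false;
        }
        return true;
    }

    /** Converts the collected bytes to text and appends them. */
    private static void flush(ByteArrayOutputStream pending, StringBuilder out) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), StandardCharsets.UTF_8));
            pending.reset();
        }
    }

    public static void main(String[] args) {
        // Both calls yield the same two kanji (U+691C U+7D22).
        System.out.println(decode("%u691C%u7D22").equals(decode("%E6%A4%9C%E7%B4%A2")));
    }
}
```

A decoder like this only helps if it sees the raw query string first. If the
container parses the parameters on its own beforehand, the strict parser gets
the %uXXXX input and fails.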

Konrad Kolosowski
Eclipse Help System



From:     Dorian Birsan/Toronto/IBM@IBMCA
Sent by:  platform-help-dev-admin@eclipse.org
Date:     07/17/2003 03:48 PM
To:       platform-help-dev@xxxxxxxxxxx
cc:
Subject:  Re: [platform-help-dev] Lucene analyzers for double-byte languages?
Please respond to platform-help-dev

Erik,

You may be hitting a number of bugs fixed in 2.1, this one being the most
likely candidate:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=25935.
Basically, the machine locale has to match the document locale, unless the
documents are encoded in UTF-8.

Also https://bugs.eclipse.org/bugs/show_bug.cgi?id=30138 could affect the
results you see.

-Dorian


                                                                          
From:     Erik Hennum/Oakland/IBM@IBMUS
Sent by:  platform-help-dev-admin@eclipse.org
Date:     07/17/2003 02:55 PM
To:       platform-help-dev@xxxxxxxxxxx
cc:
Subject:  Re: [platform-help-dev] Lucene analyzers for double-byte languages?
Please respond to platform-help-dev

Hi, Dorian:

Part of the problem seems to have been in the encoding, which was
ISO-8859-1, so all of the Japanese characters were represented as text
entities (suboptimal, to put it mildly).  I understand that text entities
aren't indexed.
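
To illustrate why ISO-8859-1 is a poor fit here, the following sketch shows
what happens to Japanese text when a document is saved in that encoding: each
character the charset cannot represent has to be written as a numeric
character reference. The class and method names are hypothetical, and this is
not how any particular authoring tool works internally.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class EntityFallbackDemo {

    /** Replaces every character ISO-8859-1 cannot encode with a numeric character reference. */
    static String toLatin1WithEntities(String text) {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            if (latin1.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Eclipse" followed by two kanji (U+691C U+7D22).
        System.out.println(toLatin1WithEntities("Eclipse \u691C\u7D22"));
        // prints: Eclipse &#26908;&#32034;
    }
}
```

An indexer that does not resolve such references sees the literal text
"&#26908;" rather than the character, so a query containing the character
itself cannot match.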

I tried the Shift-JIS encoding, but the Tomcat console reported parse
errors during indexing.

So, I switched to UTF-8 encoding, which had some benefit.  Searching for a
Japanese string now creates the fulltext search index without errors.
However, searching on a Japanese string still doesn't match anything, while
searching on an embedded English string does.

Regarding declaring the locale, I've been testing as follows:

*  The order of locale preferences in my browser is:  ja, en-us, en.

*  The only documents are localized Japanese documents in
...\eclipse\plugins\our.plugin.doc\nl\ja
  (I removed the English default documents from the plugin on my latest
tests to eliminate any potential for matching the wrong language.)

*  The localized Japanese pages for this plugin are displaying correctly
when I navigate within InfoCenter Eclipse 2.0.2

*  If I search on an English string embedded in the Japanese documents (and
now on a Japanese string, too), the search indexes are created in

   ...\jakarta-tomcat-4.1.10\work\Standalone\localhost\help\.metadata\.plugins\org.eclipse.help\nl\ja

Doesn't this imply that Eclipse is receiving the correct locale?  If not,
what else needs to be done to select the correct locale?
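
For reference, here is a minimal sketch of how a browser language preference
of "ja, en-us, en" can be turned into a locale and then into an nl directory.
The Accept-Language value with its q weights and the directory lookup order
are assumptions for illustration; the class name is hypothetical and this is
not the infocenter's code. It needs Java 8 or later for
Locale.LanguageRange.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LocaleLookupDemo {

    /** Returns the most preferred locale from an Accept-Language header value. */
    static Locale preferredLocale(String acceptLanguage) {
        // parse() sorts the ranges by descending weight.
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
        return Locale.forLanguageTag(ranges.get(0).getRange());
    }

    /** Lists nl directories to try, most specific first (assumed lookup order). */
    static List<String> candidateDirs(Locale locale) {
        List<String> dirs = new ArrayList<>();
        if (!locale.getCountry().isEmpty()) {
            dirs.add("nl/" + locale.getLanguage() + "/" + locale.getCountry());
        }
        dirs.add("nl/" + locale.getLanguage());
        return dirs;
    }

    public static void main(String[] args) {
        Locale locale = preferredLocale("ja,en-us;q=0.7,en;q=0.3");
        System.out.println(locale + " -> " + candidateDirs(locale)); // ja -> [nl/ja]
    }
}
```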

By the way, I was mistaken in reporting that
"java.io.CharConversionException: isHexDigit" is thrown by the search.
Watching more carefully, I see that it was thrown earlier when I select the
"book" for the localized plugin in the table of contents.  Despite the
exception, the table of contents for the plugin displays correctly.

Because we'd like the user to be able to select the locale via the browser
preference, I don't think we would want to hard-code the locale in a proxy
web application.

Do you have any suggestions of other things to try?


Thanks in advance,


Erik Hennum
ehennum AT us.ibm.com




From:     "Dorian Birsan" <birsan@xxxxxxxxxx>
Sent by:  platform-help-dev-admin@eclipse.org
Date:     07/17/2003 05:34 AM
To:       platform-help-dev@xxxxxxxxxxx
cc:
Subject:  Re: [platform-help-dev] Lucene analyzers for double-byte languages?
Please respond to platform-help-dev

Erik, other groups have successfully tested the 2.0.2 infocenter on many
languages, including those you mentioned, so something must be different
in the setup.
Since things work fine in the stand-alone, it appears that the locale
passed to the infocenter is not correct. Unlike the stand-alone, the locale
is picked up from the request, not from the host machine. So you must
ensure your browser sends the appropriate locale.
An alternative to detecting the browser locale is to proxy the infocenter
with another webapp that has a dispatcher servlet which wraps the incoming
request, changes the locale to the desired locale, and then delegates to the
real infocenter.
3.0 includes a fix for some locale-related issues, but in your case
the problem is likely caused by your browser not sending the expected
locale.
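
The request-wrapping idea can be sketched as follows. To keep the example
compilable without the servlet API, it uses a small stand-in interface
(Request) instead of the real request type; in an actual webapp one would
extend javax.servlet.http.HttpServletRequestWrapper and override getLocale()
and getLocales(). All class names, the "searchWord" parameter name, and the
handle() method are hypothetical.

```java
import java.util.Locale;

final class LocaleOverrideDemo {

    /** Stand-in for the two request methods this sketch needs. */
    interface Request {
        Locale getLocale();
        String getParameter(String name);
    }

    /** Wraps an incoming request and reports a fixed locale instead of the browser's. */
    static final class FixedLocaleRequest implements Request {
        private final Request delegate;
        private final Locale forced;

        FixedLocaleRequest(Request delegate, Locale forced) {
            this.delegate = delegate;
            this.forced = forced;
        }

        @Override
        public Locale getLocale() {
            return forced; // the browser's locale is ignored
        }

        @Override
        public String getParameter(String name) {
            return delegate.getParameter(name); // everything else passes through
        }
    }

    /** Stand-in for the real infocenter: it only reads the locale and one parameter. */
    static String handle(Request request) {
        return "search '" + request.getParameter("searchWord")
            + "' in locale " + request.getLocale();
    }

    public static void main(String[] args) {
        Request fromBrowser = new Request() {
            @Override public Locale getLocale() { return Locale.US; }
            @Override public String getParameter(String name) { return "widget"; }
        };
        // The dispatcher would wrap the request and then forward it.
        System.out.println(handle(new FixedLocaleRequest(fromBrowser, Locale.JAPANESE)));
        // prints: search 'widget' in locale ja
    }
}
```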

-Dorian



From:     Erik Hennum/Oakland/IBM@IBMUS
Sent by:  platform-help-dev-admin@eclipse.org
Date:     07/16/2003 10:19 PM
To:       platform-help-dev@xxxxxxxxxxx
cc:
Subject:  Re: [platform-help-dev] Lucene analyzers for double-byte languages?
Please respond to platform-help-dev

Hi, Dorian:

That's reassuring!

And, in fact, when I try the search using the standalone version of Eclipse
2.0.2, I can select a Japanese string from a topic, copy it into the search
field, run the search, and get back a list of search results.  When I
select a search result, the matched Japanese string is correctly
highlighted in the displayed topic.

However, if I do the same thing in the InfoCenter version of Eclipse 2.0.2,
the query string doesn't seem to get to the backend Eclipse web
application. The status bar shows a correct-looking URL for a time, but
nothing comes back to change the prompt in the search results frame.  The
Tomcat console reports the error "java.io.CharConversionException:
isHexDigit" during conversion of parameters (I've appended the full
exception).

The Java topics do contain some untranslated English strings.  If I query
on an English string in the InfoCenter version, the search succeeds,
listing the Japanese topics in the search results frame and highlighting
the matched string in the displayed topic.

Could there be an issue with encoding, transmitting, and decoding the query
string via the web server / servlet container?

I noticed that escape is deprecated in ECMAScript v3 (equivalent to
Netscape 6 JavaScript 1.5 or IE 5.5 JScript 5.5) and so tried the
encodeURIComponent() JavaScript function in search.jsp but got the same
result.

I'm trying to confirm that InfoCenter Eclipse 2.0.2 on WebSphere
Application Server is also failing to pass the query string through to the
Eclipse web application.  Our tester reports the following message:

  There was an error in your action:
  Java.lang.IllegalArgumentException

Any suggestions on how we might fix this problem?  We're quite late in a
release cycle.


Thanks,


Erik Hennum
ehennum AT us.ibm.com


java.io.CharConversionException: isHexDigit
      at org.apache.tomcat.util.buf.UDecoder.convert(UDecoder.java:124)
      at org.apache.tomcat.util.buf.UDecoder.convert(UDecoder.java:87)
      at org.apache.tomcat.util.http.Parameters.processParameters(Parameters.java:408)
      at org.apache.tomcat.util.http.Parameters.processParameters(Parameters.java:495)
      at org.apache.tomcat.util.http.Parameters.handleQueryParameters(Parameters.java:278)
      at org.apache.coyote.tomcat4.CoyoteRequest.parseRequestParameters(CoyoteRequest.java:1920)
      at org.apache.coyote.tomcat4.CoyoteRequest.getParameterNames(CoyoteRequest.java:942)
      at org.apache.coyote.tomcat4.CoyoteRequest.getParameterMap(CoyoteRequest.java:922)
      at org.apache.coyote.tomcat4.CoyoteRequestFacade.getParameterMap(CoyoteRequestFacade.java:193)
      at org.apache.catalina.core.ApplicationHttpRequest.setRequest(ApplicationHttpRequest.java:525)
      at org.apache.catalina.core.ApplicationHttpRequest.<init>(ApplicationHttpRequest.java:125)
      at org.apache.catalina.core.ApplicationDispatcher.wrapRequest(ApplicationDispatcher.java:921)
      at org.apache.catalina.core.ApplicationDispatcher.doInclude(ApplicationDispatcher.java:547)
      at org.apache.catalina.core.ApplicationDispatcher.include(ApplicationDispatcher.java:498)
      at org.apache.jsp.search_results_jsp._jspService(search_results_jsp.java:48)
      at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:136)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
      at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:202)
      at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:289)
      at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:240)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
      at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:260)
      at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
      at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
      at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
      at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
      at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
      at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
      at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
      at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2397)
      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
      at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
      at org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:170)
      at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:171)
      at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
      at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
      at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
      at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
      at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
      at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
      at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:223)
      at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:405)
      at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:380)
      at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:508)
      at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:533)
      at java.lang.Thread.run(Thread.java:513)
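
The top frames of the trace suggest the failure happens while the container
percent-decodes the query string. The following is a simplified sketch of a
strict decoder that fails in the same way on a %uXXXX sequence. It is an
illustration under that assumption, not Tomcat's source; the class name is
hypothetical and only the exception message mirrors the trace.

```java
import java.io.ByteArrayOutputStream;
import java.io.CharConversionException;

final class StrictDecoderDemo {

    /** Decodes %XX into bytes; anything other than two hex digits after '%' is an error. */
    static byte[] decodeToBytes(String raw) throws CharConversionException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c != '%') {
                out.write(c); // plain ASCII passes through unchanged
                continue;
            }
            if (i + 2 >= raw.length()) {
                throw new CharConversionException("EOF");
            }
            int hi = Character.digit(raw.charAt(i + 1), 16);
            int lo = Character.digit(raw.charAt(i + 2), 16);
            if (hi < 0 || lo < 0) {
                // "%u691C": the 'u' after '%' is not a hex digit
                throw new CharConversionException("isHexDigit");
            }
            out.write(hi * 16 + lo);
            i += 2; // skip the two hex digits just consumed
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        try {
            System.out.println(decodeToBytes("%E6%A4%9C").length); // 3 bytes, accepted
            decodeToBytes("%u691C");                                // rejected
        } catch (CharConversionException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```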




From:     "Dorian Birsan" <birsan@xxxxxxxxxx>
Sent by:  platform-help-dev-admin@eclipse.org
Date:     07/16/2003 10:25 AM
To:       platform-help-dev@xxxxxxxxxxx
cc:
Subject:  Re: [platform-help-dev] Lucene analyzers for double-byte languages?
Please respond to platform-help-dev

Erik,

Search should work in the languages you listed, as Eclipse provides a
default analyzer.
The English and German analyzers are a bit smarter, as they deal with
stemming, stop words, etc.
You could certainly pick up third-party analyzers and contribute them as
plugins in your product (that's the intention of the analyzer extension
point).
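
As background on what a language-specific analyzer might add for CJK text,
which has no spaces between words: one common technique, used for example by
Lucene's CJKAnalyzer, is to index overlapping two-character tokens (bigrams).
The sketch below shows only that tokenization idea in plain Java. It is not
the Eclipse default analyzer and not a Lucene Analyzer implementation; the
class name is hypothetical, and characters outside the Basic Multilingual
Plane are ignored for brevity.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramTokenizerDemo {

    /** Splits a run of CJK text into overlapping two-character tokens. */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text); // a single character is its own token
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four kanji (U+5168 U+6587 U+691C U+7D22) give three bigrams.
        List<String> indexed = bigrams("\u5168\u6587\u691C\u7D22");
        // A two-character query yields one bigram, which is among the indexed tokens.
        List<String> query = bigrams("\u691C\u7D22");
        System.out.println(indexed.size() + " tokens, query matches: "
            + indexed.containsAll(query));
    }
}
```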

-Dorian



From:     Erik Hennum/Oakland/IBM@IBMUS
Sent by:  platform-help-dev-admin@eclipse.org
Date:     07/16/2003 12:51 PM
To:       platform-help-dev@xxxxxxxxxxx
cc:
Subject:  [platform-help-dev] Lucene analyzers for double-byte languages?
Please respond to platform-help-dev

Help Developers:

With regard to

 http://dev.eclipse.org/mhonarc/lists/platform-help-dev/msg00082.html

has there been any success in creating Lucene analyzers for Japanese,
Traditional Chinese, Simplified Chinese, and Korean?

Our need is to enable search in these languages on Eclipse 2.0.2.

If there aren't any analyzers, would we need to hack the JSPs to disable or
hide the search UI?  (Merely to confirm.)

I did notice in looking at the Lucene site that an analyzer is available
for Simplified Chinese:

 http://marc.theaimsgroup.com/?l=lucene-dev&m=100705753831746&q=p3

I presume that would need a wrapper to plug into the
org.eclipse.help.luceneAnalyzer extension point.


Thanks,


Erik Hennum
ehennum AT us.ibm.com


_______________________________________________
platform-help-dev mailing list
platform-help-dev@xxxxxxxxxxx
http://dev.eclipse.org/mailman/listinfo/platform-help-dev


