Re: [platform-help-dev] Lucene analyzers for double-byte languages?


Erik,

You may be hitting one of several bugs fixed in 2.1. The most likely candidate is
https://bugs.eclipse.org/bugs/show_bug.cgi?id=25935
Basically, the machine locale has to match the document locale, unless the documents are encoded in UTF-8.

https://bugs.eclipse.org/bugs/show_bug.cgi?id=30138 could also affect the results you see.

-Dorian



Erik Hennum/Oakland/IBM@IBMUS
Sent by: platform-help-dev-admin@xxxxxxxxxxx
07/17/2003 02:55 PM
Please respond to platform-help-dev

To:      platform-help-dev@xxxxxxxxxxx
cc:
Subject: Re: [platform-help-dev] Lucene analyzers for double-byte languages?

Hi, Dorian:

Part of the problem seems to have been in the encoding, which was
ISO-8859-1, so all of the Japanese characters were represented as text
entities (suboptimal, to put it mildly).  I understand that text entities
aren't indexed.
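
To illustrate why ISO-8859-1 files end up full of entities: that charset has no code points for Japanese characters, so anything outside Latin-1 has to be written as a numeric character reference. The following is a minimal sketch only; the class name `EncodingCheckSketch` and the sample string are assumptions, not taken from the actual documents or tooling.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class EncodingCheckSketch {

    // True if every character of the text has a representation in the charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Replaces each character the charset cannot represent with a decimal
    // numeric character reference such as &#26085;.
    static String toEntities(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "\u65E5\u672C\u8A9E help"; // "Japanese" in kanji, plus an English word
        System.out.println(canEncode(sample, StandardCharsets.ISO_8859_1));
        System.out.println(canEncode(sample, StandardCharsets.UTF_8));
        System.out.println(toEntities(sample, StandardCharsets.ISO_8859_1));
    }
}
```

In the ISO-8859-1 case only the embedded English word survives as literal text, which would be consistent with English searches matching while Japanese ones do not.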

I tried the Shift-JIS encoding, but the Tomcat console reported parse
errors during indexing.

So, I switched to UTF-8 encoding, which had some benefit.  Searching for a
Japanese string now creates the fulltext search index without errors.
However, searching on a Japanese string still doesn't match anything, while
searching on an embedded English string does.

Regarding declaring the locale, I've been testing as follows:

*  The order of locale preferences in my browser is:  ja, en-us, en.

*  The only documents are localized Japanese documents in
...\eclipse\plugins\our.plugin.doc\nl\ja
  (I removed the English default documents from the plugin on my latest
tests to eliminate any potential for matching the wrong language.)

*  The localized Japanese pages for this plugin display correctly
when I navigate within InfoCenter Eclipse 2.0.2.

*  If I search on an English string embedded in the Japanese documents (and
now on a Japanese string, too), the search indexes are created in

   ...\jakarta-tomcat-4.1.10\work\Standalone\localhost\help\.metadata\.plugins\org.eclipse.help\nl\ja

Doesn't this imply that Eclipse is receiving the correct locale?  If not,
what else needs to be done to select the correct locale?
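
For reference, a browser preference order like the one above reaches the server as an Accept-Language header, and servlet containers generally derive the request locale from it. The sketch below shows that resolution using only the JDK. The class name `AcceptLanguageSketch`, the q-values in the header, and the list of available locales are assumptions for illustration; it does not show how the Eclipse help webapp itself picks the locale.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only.
class AcceptLanguageSketch {

    // Parses an Accept-Language value and returns the best match among the
    // locales for which content exists, or null if nothing matches.
    static Locale preferred(String acceptLanguage, List<Locale> available) {
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
        return Locale.lookup(ranges, available);
    }

    public static void main(String[] args) {
        // Assumed header for the preference order ja, en-us, en.
        String header = "ja,en-us;q=0.8,en;q=0.5";
        List<Locale> available = Arrays.asList(Locale.ENGLISH, Locale.JAPANESE);
        System.out.println(preferred(header, available));
    }
}
```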

By the way, I was mistaken in reporting that
"java.io.CharConversionException: isHexDigit" is thrown by the search.
Watching more carefully, I see that it was thrown earlier, when I selected the
"book" for the localized plugin in the table of contents.  Despite the
exception, the table of contents for the plugin displays correctly.

Because we'd like the user to be able to select the locale via the browser
preference, I don't think we would want to hard-code the locale in a proxy
web application.

Do you have any suggestions of other things to try?


Thanks in advance,


Erik Hennum
ehennum AT us.ibm.com



                                                                                                                                   
                     "Dorian Birsan"                                                                                                
<birsan@xxxxxxxxxx>
Sent by: platform-help-dev-admin@eclipse.org
07/17/2003 05:34 AM
Please respond to platform-help-dev

To:      platform-help-dev@xxxxxxxxxxx
cc:
Subject: Re: [platform-help-dev] Lucene analyzers for double-byte languages?

Erik, other groups have successfully tested the 2.0.2 infocenter on many
languages, including those you mentioned, so something must be different
in your setup.
Since things work fine in the stand-alone, it appears that the locale
passed to the infocenter is not correct. Unlike the stand-alone, the infocenter
picks up the locale from the request, not from the host machine. So you must
ensure your browser sends the appropriate locale.
An alternative to detecting the browser locale is to proxy the infocenter
with another webapp, and have a dispatcher servlet that wraps the incoming
request, changes the locale to the desired locale, and then delegates to the
real infocenter.
3.0 includes a fix for some locale-related issues, but in your case
the problem is likely caused by your browser not sending the expected
locale.
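
The wrapping idea can be sketched as follows. This is an illustration under stated assumptions, not infocenter code: `LocaleRequest` is a hypothetical stand-in for the small part of the servlet request interface used here, so that the sketch compiles with the JDK alone. In a real webapp the wrapper would more likely extend `javax.servlet.http.HttpServletRequestWrapper` and override `getLocale()`, and the delegation would go through a request dispatcher.

```java
import java.util.Locale;

// Hypothetical stand-in for the servlet request; not part of any real API.
interface LocaleRequest {
    Locale getLocale();
    String getParameter(String name);
}

// Wraps an incoming request and reports a fixed locale instead of the
// one the browser sent. Everything else is passed through unchanged.
class FixedLocaleRequest implements LocaleRequest {
    private final LocaleRequest delegate;
    private final Locale forced;

    FixedLocaleRequest(LocaleRequest delegate, Locale forced) {
        this.delegate = delegate;
        this.forced = forced;
    }

    public Locale getLocale() {
        return forced;
    }

    public String getParameter(String name) {
        return delegate.getParameter(name);
    }
}

class LocaleProxySketch {

    // Stand-in for the real infocenter: reports the locale it was handed.
    static String infocenter(LocaleRequest request) {
        return "serving help for locale " + request.getLocale();
    }

    // The dispatcher step: wrap the request, then delegate.
    static String dispatch(LocaleRequest incoming, Locale forced) {
        return infocenter(new FixedLocaleRequest(incoming, forced));
    }

    // Simulates a request whose browser locale is the given one.
    static LocaleRequest browserRequest(final Locale browserLocale) {
        return new LocaleRequest() {
            public Locale getLocale() { return browserLocale; }
            public String getParameter(String name) { return null; }
        };
    }

    public static void main(String[] args) {
        System.out.println(dispatch(browserRequest(Locale.US), Locale.JAPANESE));
    }
}
```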

-Dorian


                                                                         
Erik Hennum/Oakland/IBM@IBMUS
Sent by: platform-help-dev-admin@eclipse.org
07/16/2003 10:19 PM
Please respond to platform-help-dev

To:      platform-help-dev@xxxxxxxxxxx
cc:
Subject: Re: [platform-help-dev] Lucene analyzers for double-byte languages?


Hi, Dorian:

That's reassuring!

And, in fact, when I try the search using the standalone version of Eclipse
2.0.2, I can select a Japanese string from a topic, copy it into the search
field, run the search, and get back a list of search results.  When I
select a search result, the matched Japanese string is correctly
highlighted in the displayed topic.

However, if I do the same thing in the InfoCenter version of Eclipse 2.0.2,
the query string doesn't seem to get to the backend Eclipse web
application. The status bar shows a correct-looking URL for a time, but
nothing comes back to change the prompt in the search results frame.  The
Tomcat console reports the error "java.io.CharConversionException:
isHexDigit" during conversion of parameters (I've appended the full
exception).

The Java topics do contain some untranslated English strings.  If I query
on an English string in the InfoCenter version, the search succeeds,
listing the Japanese topics in the search results frame and highlighting
the matched string in the displayed topic.

Could there be an issue with encoding, transmitting, and decoding the query
string via the web server / servlet container?

I noticed that escape() is deprecated in ECMAScript v3 (equivalent to
Netscape 6 JavaScript 1.5 or IE 5.5 JScript 5.5) and so tried the
encodeURIComponent() JavaScript function in search.jsp but got the same
result.
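
One possible explanation, offered as an assumption rather than a diagnosis: JavaScript's escape() writes characters above U+00FF as %uXXXX, which is not valid percent-encoding, and a decoder that expects two hex digits after each % will reject it. The sketch below contrasts that form with UTF-8 percent-encoding. It uses `java.net.URLDecoder` as a stand-in for the container's decoder (Tomcat's `UDecoder` is not called here), and the class name `QueryEncodingSketch` is hypothetical. It would not by itself explain why encodeURIComponent() gave the same result.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, for illustration only.
class QueryEncodingSketch {

    // Percent-encodes the UTF-8 bytes of the text, as encodeURIComponent()
    // does for non-ASCII characters.
    static String encodeUtf8(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // True if the value is well-formed percent-encoding.
    static boolean decodes(String encoded) {
        try {
            URLDecoder.decode(encoded, "UTF-8");
            return true;
        } catch (IllegalArgumentException e) {
            return false; // malformed escape, e.g. a non-hex digit after '%'
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String query = "\u65E5\u672C\u8A9E"; // sample Japanese search term
        System.out.println(encodeUtf8(query));
        System.out.println(decodes(encodeUtf8(query)));
        // The form escape() produces for the same three characters.
        System.out.println(decodes("%u65E5%u672C%u8A9E"));
    }
}
```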

I'm trying to confirm that InfoCenter Eclipse 2.0.2 on WebSphere
Application Server is also failing to pass the query string through to the
Eclipse web application.  Our tester reports the following message:

  There was an error in your action:
  Java.lang.IllegalArgumentException

Any suggestions on how we might fix this problem?  We're quite late in a
release cycle.


Thanks,


Erik Hennum
ehennum AT us.ibm.com


java.io.CharConversionException: isHexDigit
      at org.apache.tomcat.util.buf.UDecoder.convert(UDecoder.java:124)
      at org.apache.tomcat.util.buf.UDecoder.convert(UDecoder.java:87)
      at org.apache.tomcat.util.http.Parameters.processParameters(Parameters.java:408)
      at org.apache.tomcat.util.http.Parameters.processParameters(Parameters.java:495)
      at org.apache.tomcat.util.http.Parameters.handleQueryParameters(Parameters.java:278)
      at org.apache.coyote.tomcat4.CoyoteRequest.parseRequestParameters(CoyoteRequest.java:1920)
      at org.apache.coyote.tomcat4.CoyoteRequest.getParameterNames(CoyoteRequest.java:942)
      at org.apache.coyote.tomcat4.CoyoteRequest.getParameterMap(CoyoteRequest.java:922)
      at org.apache.coyote.tomcat4.CoyoteRequestFacade.getParameterMap(CoyoteRequestFacade.java:193)
      at org.apache.catalina.core.ApplicationHttpRequest.setRequest(ApplicationHttpRequest.java:525)
      at org.apache.catalina.core.ApplicationHttpRequest.<init>(ApplicationHttpRequest.java:125)
      at org.apache.catalina.core.ApplicationDispatcher.wrapRequest(ApplicationDispatcher.java:921)
      at org.apache.catalina.core.ApplicationDispatcher.doInclude(ApplicationDispatcher.java:547)
      at org.apache.catalina.core.ApplicationDispatcher.include(ApplicationDispatcher.java:498)
      at org.apache.jsp.search_results_jsp._jspService(search_results_jsp.java:48)
      at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:136)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
      at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:202)
      at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:289)
      at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:240)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
      at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:260)
      at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
      at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
      at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
      at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
      at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
      at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
      at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
      at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2397)
      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
      at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
      at org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:170)
      at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:171)
      at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
      at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
      at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
      at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
      at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
      at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
      at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:223)
      at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:405)
      at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:380)
      at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:508)
      at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:533)
      at java.lang.Thread.run(Thread.java:513)




                    "Dorian Birsan"

<birsan@xxxxxxxxxx>
Sent by: platform-help-dev-admin@eclipse.org
07/16/2003 10:25 AM
Please respond to platform-help-dev

To:      platform-help-dev@xxxxxxxxxxx
cc:
Subject: Re: [platform-help-dev] Lucene analyzers for double-byte languages?

Erik,

Search should work in the languages you listed, as Eclipse provides a
default analyzer.
The English and German analyzers are a bit smarter, as they deal with
stemming, stop words, etc.
You could certainly pick up third-party analyzers and contribute them as
plugins in your product (that's the intention of the analyzer extension
point).

-Dorian



Erik Hennum/Oakland/IBM@IBMUS
Sent by: platform-help-dev-admin@eclipse.org
07/16/2003 12:51 PM
Please respond to platform-help-dev

To:      platform-help-dev@xxxxxxxxxxx
cc:
Subject: [platform-help-dev] Lucene analyzers for double-byte languages?

Help Developers:

With regard to

 http://dev.eclipse.org/mhonarc/lists/platform-help-dev/msg00082.html

has there been any success in creating Lucene analyzers for Japanese,
Traditional Chinese, Simplified Chinese, and Korean?

Our need is to enable search in these languages on Eclipse 2.0.2.

If there aren't any analyzers, would we need to hack the JSPs to disable or
hide the search UI?  (Merely to confirm.)

I did notice in looking at the Lucene site that an analyzer is available
for Simplified Chinese:

 http://marc.theaimsgroup.com/?l=lucene-dev&m=100705753831746&q=p3

I presume that would need a wrapper to plug into the
org.eclipse.help.luceneAnalyzer extension point.


Thanks,


Erik Hennum
ehennum AT us.ibm.com


_______________________________________________
platform-help-dev mailing list
platform-help-dev@xxxxxxxxxxx
http://dev.eclipse.org/mailman/listinfo/platform-help-dev


