[platform-help-dev] Contributors needed for NL search support

We are looking for developers who can help us enable documentation
searching in a number of languages. Japanese, Traditional Chinese,
Simplified Chinese and Korean are the ones that require the most focus.

The help search engine in Eclipse is based on Apache's Lucene search
framework (http://jakarta.apache.org/lucene/docs/index.html). Search
currently works well on English and German documents, and would likely
work to some degree (we haven't tested this) on other languages such as
French, Italian, Spanish or Portuguese.

Lucene uses language analyzers to extract index terms from documents, i.e.
to break the text into tokens and then apply a number of filters (lower
case, stemming, stopwords, or any other kind of lexical analysis) to obtain
the desired search functionality. For English and German, Lucene provides
proper analyzers that do stemming, stopwords, etc., and that's what Eclipse
uses. For other languages, we default to the StandardAnalyzer class, which
does not appear to work well on the languages mentioned above.

So, this is where we need your help: creating Lucene analyzers for
- first priority: Japanese, Traditional Chinese, Simplified Chinese, Korean
- second priority: French, Italian, Spanish, Portuguese and Brazilian
Portuguese.
- any other languages
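
As a rough illustration of the tokenize-then-filter pipeline described
above, here is a minimal sketch in plain Java. It does not use any Lucene
classes; the class name AnalysisPipelineSketch, its methods and the tiny
stopword list are made up for this example, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: plain Java, no Lucene classes involved.
class AnalysisPipelineSketch {

    // Tiny illustrative stopword list (an assumption, not Lucene's own list).
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "to", "in"));

    // Step 1: break the text into tokens at anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Steps 2 and 3: lower-case each token, then drop stopwords.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : tokenize(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOPWORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Searching the Help of Eclipse"));
        // Japanese for "search the help", written with Unicode escapes.
        System.out.println(analyze("\u30d8\u30eb\u30d7\u3092\u691c\u7d22\u3059\u308b"));
    }
}
```

With a simple letter-based tokenizer like this one, the English sentence
yields separate terms, while the Japanese sentence, which contains no
spaces, comes out as one long token. This is only a simplified stand-in
and not a reproduction of what StandardAnalyzer does.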

In Eclipse, we have created an extension point
org.eclipse.help.luceneAnalyzer that allows plugging in analyzers for
specific languages. An analyzer needs to extend
org.apache.lucene.analysis.Analyzer (
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/Analyzer.html
), which basically means you need to implement one method that takes a
java.io.Reader and returns a stream of tokens.
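
To give a feel for the "Reader in, tokens out" shape of that method, here
is a minimal sketch in plain Java that reads from a java.io.Reader and
returns overlapping two-character tokens. Character bigrams are one common
way to index text that has no spaces between words; using them here is our
assumption for illustration, not a decision about how the analyzers should
work. The class BigramTokenizerSketch is hypothetical and does not extend
org.apache.lucene.analysis.Analyzer, so wiring it into Lucene is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: plain Java, not a Lucene Analyzer.
class BigramTokenizerSketch {

    // Reads everything from the reader and returns the tokens in order.
    // Limitation: characters outside the Basic Multilingual Plane are
    // treated as separators in this sketch.
    static List<String> tokens(Reader reader) {
        List<String> result = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isLetterOrDigit((char) c)) {
                    run.append((char) c);
                } else {
                    flush(run, result);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(run, result);
        return result;
    }

    // Emits the tokens for one run of letters/digits, then clears the run.
    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() == 1) {
            // A lone character becomes its own token.
            out.add(run.toString());
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            // Overlapping pairs: "abcd" gives "ab", "bc", "cd".
            out.add(run.substring(i, i + 2));
        }
        run.setLength(0);
    }

    public static void main(String[] args) {
        // Japanese for "to search", written with Unicode escapes.
        System.out.println(tokens(new StringReader("\u691c\u7d22\u3059\u308b")));
    }
}
```

A real analyzer would return these tokens through Lucene's token stream
type instead of a List, and would probably treat Latin words and CJK text
differently; the sketch applies bigrams to every run for simplicity.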

There is also a need for an HTML parser. We're currently using the one that
comes with the Lucene demo, but we are not confident that it handles certain
languages correctly (we can't test, because we don't have the proper
analyzers).

So, if anyone can help, please let us know and we'll provide more details
and work with you to get things going.

Thanks!
- the eclipse help system team


