Re: [platform-help-dev] Lucene & Search

This seems to be a very cool search framework.

Immediate comments:

-- I see from the Lucene FAQ and from Greg's comments that, because Lucene
is a framework, you could plug in support for XML, SGML and PDF doc types
(or any other doc type) in addition to HTML--provided you have 1) proper
doc. parsers and 2) valid docs.

-- The FAQ mentions that their framework allows for other analyzers /
filters for advanced search queries. So you can write not only language
extensions per se (like Spanish or Italian) but also indexing for advanced
token analysis within a language, to roll out additional natural language
query support like stemming or morphological changes (plurals, etc.). I
imagine you could also write thesaurus (synonym) support for terminology
with this framework--the example in the FAQ calls this "term aliasing", and
only mentions word pairs, but I wonder if the alias could point to an
indexed list of terms instead of just a single target word.
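
To make the "alias points to a list of terms" idea concrete, here is a minimal sketch in plain Java. It does not use Lucene's API: the class name AliasExpander and its expand method are hypothetical, and the sketch assumes the alias table's keys and values are already lower-case.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: expands each query term into
// the term itself plus every alias registered for it.
class AliasExpander {
    // Assumption: keys and alias values are stored lower-case.
    private final Map<String, List<String>> aliases;

    AliasExpander(Map<String, List<String>> aliases) {
        this.aliases = aliases;
    }

    // Returns each lower-cased query term followed by its aliases,
    // in first-seen order and without duplicates.
    List<String> expand(List<String> queryTerms) {
        Set<String> expanded = new LinkedHashSet<>();
        for (String term : queryTerms) {
            String key = term.toLowerCase(Locale.ROOT);
            expanded.add(key);
            expanded.addAll(aliases.getOrDefault(key, List.of()));
        }
        return new ArrayList<>(expanded);
    }
}
```

With an alias table mapping "car" to ["automobile", "vehicle"], expanding the query terms ["Car", "repair"] yields [car, automobile, vehicle, repair]. In a real integration this expansion would presumably live in a custom analyzer or filter rather than in a standalone class.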

-- I notice their incremental indexing can build new indexes while the old
one is in place, so the old one stays "current" until the new one is ready
to merge. Very nice, since over large doc sets you don't want to lose search
while a whole new index is generated! So if you replace one plug-in out of
many plug-ins, search could keep working while the update takes place in the
background. It looks like Lucene takes care of garbage collection of old
index segments as well; also very helpful for managing doc. changes.
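
The "old index stays current until the new one is ready" behaviour can be pictured with a toy in-memory index. This is only a sketch of the general build-aside-then-swap pattern, not Lucene's segment implementation; SwappableIndex, rebuild and search are hypothetical names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicReference;

// Toy in-memory index, NOT Lucene's implementation: searches always read
// the currently published snapshot while a replacement is built aside.
class SwappableIndex {
    private final AtomicReference<Map<String, Set<String>>> current =
            new AtomicReference<>(Map.of());

    // Builds a complete new term -> document-id map, then publishes it in
    // one atomic step; the old map stays searchable until that moment.
    void rebuild(Map<String, String> textByDocId) {
        Map<String, Set<String>> fresh = new HashMap<>();
        for (Map.Entry<String, String> doc : textByDocId.entrySet()) {
            for (String token : doc.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    fresh.computeIfAbsent(token, t -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        // Once swapped out and unreferenced, the old snapshot is left to
        // the JVM's garbage collector.
        current.set(fresh);
    }

    // Returns the ids of documents containing the term (case-insensitive).
    Set<String> search(String term) {
        return current.get().getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }
}
```

A search that runs while rebuild is still looping simply sees the previous snapshot, which is the property that matters when one plug-in's docs are being re-indexed in the background.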

- Jamie



"Greg Adams/OTT/OTI" <Greg_Adams@xxxxxxx>@eclipse.org on 22/11/2001
12:06:31 PM

Please respond to platform-help-dev@xxxxxxxxxxx

Sent by:  platform-help-dev-admin@xxxxxxxxxxx


To:   platform-help-dev@xxxxxxxxxxx
cc:
Subject:  [platform-help-dev] Lucene & Search



Problem
==========
Currently the Eclipse platform does not provide a built-in mechanism for
searching the online documentation.

Proposal
==========

The Eclipse PMC and the help component lead have recommended pursuing the
Lucene open source search framework as our help search for the V2 release.
For more information on Lucene, visit
http://jakarta.apache.org/lucene/docs/index.html


Prototype
===========

We have taken an initial look at Lucene and done a quick proof-of-concept
integration to confirm the feasibility/suitability of the Lucene option. We
are also aware of other places within Eclipse that could use the search
framework; however, for the moment we will limit ourselves to the help
search issue. If that proceeds well, additional usages may be suggested.

Brief observations on Lucene:
=====================================


*        Lucene is 100% Java and will work on all platforms supported by
        Eclipse (GTK, SUSE, Photon, HP, Solaris, AIX, Windows etc.).

*        Small code base & well documented

*        Index/search speeds are reasonable

*        Index files can be persisted/pre-built

*        Incremental indexing as new files become available from plug-ins

*        Supports content tagging (e.g. author)

*        Supports heading filtering

*        Good range of query facilities (and, or, not etc.)

*        Extensible to allow support for other file formats.
        -        Easily made to work over zip files - our prototype did this

*        Ranks results

*        Designed as a toolkit - intended to be grown & added to

        Lucene is a search framework.
        This means it does not include explicit knowledge about searching
        specific domains (e.g. HTML) nor support for specific languages
        (e.g. French). However:

        -        Language-oriented searching can be added as extensions.
                The Lucene build includes a German extension - unclear how
                good it is.

        -        HTML domain searching can be added. The demo accompanying
                Lucene provides a sample HTML search engine which we can
                make use of and extend as needed; however, even the basic
                demo one is useful.

        *        This framework/extensibility approach means that various
                locale groups or search/domain experts can contribute their
                skills to the Lucene open source effort and also have it
                benefit Eclipse.

        *        Open question: DBCS and BiDi language support ... the
                framework clearly supports Latin-1 type languages in terms
                of the ability to add analyser modules, but since we are
                not locale language experts it is hard to comment on
                issues - however this, again, is an opportunity for
                contribution from others.

*        Lucene is not externalized. That is, its strings have not been
        taken out. However, most of the strings are not likely to be exposed
        to the user - they are primarily programming error cases, or very
        unlikely cases.


*        Our intent is to not create a modified version of Lucene. If we
        see possibilities for improving/enhancing Lucene we will work with
        their open source community.
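
As an illustration of the "works over zip files" observation above, here is a rough sketch of pulling HTML documents out of a zip stream with the JDK's java.util.zip classes so they can be handed to an indexer. It is not the prototype's code: ZipDocReader and readHtmlEntries are hypothetical names, and UTF-8 content is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: collects the text of every .html/.htm entry in a
// zip stream, keyed by entry name, ready to pass to whatever indexer is
// in use.
class ZipDocReader {
    static Map<String, String> readHtmlEntries(InputStream zipped) throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(zipped)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                boolean isHtml = name.endsWith(".html") || name.endsWith(".htm");
                if (entry.isDirectory() || !isHtml) {
                    continue; // skip directories and non-HTML files
                }
                // Read the current entry fully; read() returns -1 at the
                // end of the entry, not the end of the archive.
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, read);
                }
                // Assumption: documents are UTF-8 encoded.
                docs.put(name, new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return docs;
    }
}
```

Each returned entry could then be parsed and indexed just like a file read from disk, so the documentation would not need to be unpacked first.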


Your Turn
=============

*        If you have comments on this proposal please let us know via the
        mailing list.

*        If you have in-depth technical knowledge of Lucene we would be
        interested in additional pros/cons/risks/limitations that you are
        aware of. In addition, let us know if you'd be willing to help if
        needed. We don't need help at the moment, but it's useful to know
        who is available.





