Hi all,
please find below the
answers to your questions about the usage of Lucene in SMILA.
1. How does SMILA handle German special
characters like ö, ä, ü, ß?
I tried requests with “Schueler” / “Schüler” and the
results were nearly the same.
But when I tried “über” / “ueber”, the second request
did not return any result.
So please tell me why Schueler and Schüler seem to be identical as part of a
request, but über and ueber are not.
2. Is the Lucene StandardAnalyzer configurable
in a way that allows altering, adding, or deleting stop words?
3. Does the Lucene StandardAnalyzer provide
normalization?
4. Using “\n”, “\r” or
“\t” as a search request leads to a search result which is not
empty. Could this be disabled?
The handling of
umlauts depends on the analyzer. The standard analyzer does not change
umlauts itself; it only converts uppercase characters to the
corresponding lowercase characters. Does the text contain both
“Schüler” and “Schueler”? If so, that would explain why both requests
return nearly the same result.
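The behaviour described above can be illustrated with a small Java sketch. It is a simplified model and not Lucene code: the class LowercaseOnlyDemo and its methods are hypothetical names, and the only normalization applied is lowercasing, as stated for the standard analyzer.

```java
import java.util.Locale;

// Simplified model (not Lucene code): terms are only lowercased,
// umlauts are left untouched.
final class LowercaseOnlyDemo {

    // Hypothetical helper: the only normalization step is lowercasing.
    static String normalize(String term) {
        return term.toLowerCase(Locale.GERMAN);
    }

    // Two terms hit the same index entry only if their normalized forms are equal.
    static boolean sameTerm(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        // \u00fc is "ü", \u00dc is "Ü" (escaped to avoid source-encoding issues).
        System.out.println(sameTerm("Sch\u00fcler", "SCH\u00dcLER")); // prints true
        System.out.println(sameTerm("Sch\u00fcler", "Schueler"));     // prints false
    }
}
```

Under this model, “Schüler” and “Schueler” are two different terms, so a similar result for both requests would suggest that both spellings occur in the indexed text.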
A list of stop words
can be defined in the declaration of the analyzer in the following way:
<IndexField FieldNo="0" IndexValue="true" Name="search-article_content"
            StoreText="true" Tokenize="true" Type="Text">
  <Analyzer ClassName="org.apache.lucene.analysis.de.GermanAnalyzer">
    <ParameterSet xmlns="http://www.brox.de/ParameterSet">
      <Parameter xsi:type="StringList" Name="stopWords">
        <Value>der</Value>
        <Value>die</Value>
        <Value>das</Value>
      </Parameter>
    </ParameterSet>
  </Analyzer>
</IndexField>
This declaration is
used during index creation.
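To illustrate what such a stop word list does at indexing time, here is a minimal Java sketch. It is not the SMILA or Lucene implementation; the class StopWordDemo and the method filter are hypothetical, and the list simply mirrors the three values from the declaration above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model (not Lucene code): tokens found in the stop word
// list are dropped before they would reach the index.
final class StopWordDemo {

    // Same values as in the stopWords parameter of the declaration.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("der", "die", "das"));

    // Returns the tokens that would remain after stop word removal.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [schüler, liest, buch]
        System.out.println(filter(Arrays.asList("der", "sch\u00fcler", "liest", "das", "buch")));
    }
}
```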
What do you mean by
“normalization”? Non-searchable characters like “@”,
“/”, “\” are skipped and searchable characters are not
changed (apart from the conversion of uppercase to lowercase).
Language-dependent stemming is not provided by the standard analyzer.
Again, non-searchable
characters are skipped, therefore using “\n” as a search request
will lead to the request “n”. Please check the indexed data for the
existence of “n”, “r”, etc. Note that indexing “N
E W Y O R K” will lead to a hit when searching for “n” or
“r”. Another possible cause of such search results is
“trash” produced by the file conversion process (e.g. PDF into UTF-8).
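The effect described above can be reproduced with a rough Java sketch. It is a deliberately simplified model: the class SimpleTokenizerDemo is hypothetical, and it only assumes that every character that is not a letter or digit is skipped and that the rest is lowercased. The real standard analyzer applies more elaborate rules.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model (not Lucene code): non-searchable characters act as
// separators and are skipped; the remaining characters are lowercased.
final class SimpleTokenizerDemo {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A skipped character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\\n" is a backslash followed by the letter n, as typed into a search field.
        System.out.println(tokenize("\\n"));             // prints [n]
        System.out.println(tokenize("N E W Y O R K"));   // prints [n, e, w, y, o, r, k]
    }
}
```

In this model the request “\n” is reduced to the single term “n”, which matches the first token of “N E W Y O R K”.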
A nice tool for browsing
the Lucene index is “Luke”, which can be found here: http://www.getopt.org/luke/. The index
of SMILA is located in
“.\workspace\.metadata\.plugins\com.brox.anyfinder.lucene\<INDEX_NAME>”.
Furthermore, the documentation at http://lucene.apache.org/java/docs/
may be of interest.
Hope this helps…
Best regards
Michael
Dr. Michael Hagström
brox IT-Solutions GmbH
Chief Technology Officer
An der Breiten Wiese 9
30625 HANNOVER
(Germany)