[smila-user] Re: Funny questions about SMILA

Hi all,

 

Please find below the answers to your questions about the use of Lucene in SMILA.

 

1.       How does SMILA handle German special characters such as ö, ä, ü, ß?
I tried requests with “Schueler” and “Schüler”, and the results were nearly the same.
But when I tried “über” and “ueber”, the second request did not return any results.
So please tell me why Schueler and Schüler seem to be treated as identical in a request, but über and ueber are not.

2.       Can the Lucene StandardAnalyzer be configured so that stop words can be changed, added or deleted?

3.       Does the Lucene StandardAnalyzer provide normalization?

4.       Using “\n”, “\r” or “\t” as a search request returns a non-empty search result. Can this be disabled?

 

The handling of umlauts depends on the analyzer. The standard analyzer does not change umlauts itself; it only converts uppercase characters to lowercase. Does the indexed text contain both “Schüler” and “Schueler”? If so, that would explain why both requests return similar results, whereas “ueber” finds nothing if only “über” occurs in the text.
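To illustrate the difference between lowercasing and umlaut rewriting, here is a minimal Java sketch. It does not use Lucene; the class name `UmlautSketch` and both methods are made up for this example, and `foldUmlauts` is a hypothetical extra step that the standard analyzer, as described above, does not perform.

```java
import java.util.Locale;

public class UmlautSketch {
    // Lowercasing only, as described for the standard analyzer: umlauts stay untouched.
    static String lowercaseOnly(String term) {
        return term.toLowerCase(Locale.GERMAN);
    }

    // Hypothetical extra step (NOT done by the standard analyzer): transliterate umlauts.
    static String foldUmlauts(String term) {
        return lowercaseOnly(term)
                .replace("\u00e4", "ae")   // a-umlaut
                .replace("\u00f6", "oe")   // o-umlaut
                .replace("\u00fc", "ue")   // u-umlaut
                .replace("\u00df", "ss");  // sharp s
    }

    public static void main(String[] args) {
        String withUmlaut = "Sch\u00fcler"; // spelled with u-umlaut
        String withoutUmlaut = "Schueler";

        // Lowercasing alone leaves two different terms.
        System.out.println(lowercaseOnly(withUmlaut).equals(lowercaseOnly(withoutUmlaut)));
        // Only with the assumed folding step do both spellings become the same term.
        System.out.println(foldUmlauts(withUmlaut).equals(foldUmlauts(withoutUmlaut)));
    }
}
```

With lowercasing only, the two spellings remain distinct terms, so both can only match if both occur in the indexed text.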

 

A list of stop words can be defined in the declaration of the analyzer in the following way:

<IndexField FieldNo="0" IndexValue="true" Name="search-article_content" StoreText="true" Tokenize="true" Type="Text">
    <Analyzer ClassName="org.apache.lucene.analysis.de.GermanAnalyzer">
        <ParameterSet xmlns="http://www.brox.de/ParameterSet">
            <Parameter xsi:type="StringList" Name="stopWords">
                <Value>der</Value>
                <Value>die</Value>
                <Value>das</Value>
            </Parameter>
        </ParameterSet>
    </Analyzer>
</IndexField>

This declaration is used during index creation.

 

What do you mean by “normalization”? Non-searchable characters such as “@”, “/” and “\” are skipped, and searchable characters are left unchanged (apart from the conversion from uppercase to lowercase). The standard analyzer does not provide language-dependent stemming.

 

Again, non-searchable characters are skipped, so using “\n” as a search request results in the request “n”. Please check whether the indexed data contains “n”, “r”, etc. Note that indexing “N E W  Y O R K” will produce a hit when searching for “n” or “r”. Another possible reason for such search results is “trash” left over from the file conversion process (e.g. PDF to UTF-8).
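The effect can be reproduced with a rough approximation in plain Java. This sketch is not Lucene's actual tokenizer grammar: it simply assumes that every character that is not a letter or digit acts as a separator and that the remaining tokens are lowercased. The class name `TokenizeSketch` and the method `tokenize` are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenizeSketch {
    // Assumption: anything that is not a letter or digit separates tokens;
    // the remaining tokens are lowercased. Real analyzers have more rules.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The two typed characters backslash and n, not a newline character.
        System.out.println(tokenize("\\n"));
        // Spaced-out letters are indexed as single-letter tokens.
        System.out.println(tokenize("N E W  Y O R K"));
    }
}
```

Under these assumptions the request “\n” is reduced to the single token “n”, which matches any indexed single-letter token “n”, such as the one produced from “N E W  Y O R K”.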

 

A nice tool for browsing the Lucene index is “Luke”, which can be found here: http://www.getopt.org/luke/. The SMILA index is located in “.\workspace\.metadata\.plugins\com.brox.anyfinder.lucene\<INDEX_NAME>”. The documentation at http://lucene.apache.org/java/docs/ may also be of interest.

 

Hope this helps…

 

Best regards

 

Michael

 

 

Dr. Michael Hagström

 

brox IT-Solutions GmbH
Chief Technology Officer                                                                                               

An der Breiten Wiese 9                             

30625      HANNOVER (Germany)             

 

