[smila-user] Re: Funny questions about SMILA


From: Michael Hagström <mhagstroem@xxxxxxx>
Date: Wed, 3 Feb 2010 10:55:53 +0100

Hi all,

Please find below the answers to your questions about the use of Lucene in SMILA.

1. How does SMILA handle German special characters such as ö, ä, ü and ß?
I tried requests with “Schueler” and “Schüler” and the results were nearly the same.
But when I tried “über” and “ueber”, the second request did not return any results.
So why do “Schueler” and “Schüler” seem to be treated as identical in a request, while “über” and “ueber” are not?

2. Can the Lucene StandardAnalyzer be configured so that stop words can be changed, added or deleted?

3. Does the Lucene StandardAnalyzer provide normalization?

4. Using “\n”, “\r” or “\t” as a search request returns a non-empty search result. Can this be disabled?

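The handling of umlauts depends on the analyzer. The standard analyzer does not change umlauts itself; it only converts uppercase characters to their lowercase equivalents. “Schüler” and “Schueler” are therefore indexed as two different terms, and so are “über” and “ueber”. Does the indexed text contain both “Schüler” and “Schueler”? That would explain why both requests return similar results.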
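The behaviour described above can be illustrated with a small Python sketch. This is not Lucene code: `analyze` is a hypothetical helper that only approximates the description given here (split on non-searchable characters, lowercase, leave umlauts alone).

```python
import re

def analyze(text):
    """Rough approximation of the described analyzer behaviour:
    lowercase the text and keep runs of letters/digits as terms.
    Umlauts are NOT rewritten (no 'ü' -> 'ue' mapping)."""
    return re.findall(r"[^\W_]+", text.lower())

# Lowercasing works for umlauts too, but 'ü' and 'ue' stay different terms.
print(analyze("Schüler"))   # ['schüler']
print(analyze("Schueler"))  # ['schueler']
print(analyze("ÜBER"))      # ['über']
print(analyze("Schüler") == analyze("Schueler"))  # False
```

Under this assumption a request for “ueber” can only match if the indexed text itself contains “ueber”.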

A list of stop words can be defined in the declaration of the analyzer in the following way:

      </Parameter>
    </ParameterSet>
  </Analyzer>
</IndexField>

This declaration is used during index creation.
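As a rough illustration of what a stop word list does when documents are indexed, here is a minimal Python sketch. It is not the SMILA configuration or the Lucene API; `analyze` and the sample `STOP_WORDS` set are assumptions made up for this example.

```python
import re

# Hypothetical stop word list, chosen only for illustration.
STOP_WORDS = {"der", "die", "das", "und"}

def analyze(text, stop_words=frozenset()):
    """Lowercase, split into letter/digit terms, then drop stop words."""
    terms = re.findall(r"[^\W_]+", text.lower())
    return [t for t in terms if t not in stop_words]

# Stop words never reach the index, so they cannot be searched for.
print(analyze("Der Schüler und die Lehrerin", STOP_WORDS))
# ['schüler', 'lehrerin']
```

Because the list takes effect when the index is built, it seems likely that a changed list only applies to documents indexed afterwards.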

What do you mean by “normalization”? Non-searchable characters like “@”, “/” and “\” are skipped, and searchable characters are not changed (apart from the conversion of uppercase to lowercase). Language-dependent stemming is not provided by the standard analyzer.

Again, non-searchable characters are skipped, so using “\n” as a search request results in the request “n”. Please check the indexed data for the terms “n”, “r”, etc. Note that indexing “N E W Y O R K” produces single-letter terms, so a search for “n” or “r” will return a hit. Another possible cause of such search results is “trash” left behind by the file conversion process (e.g. PDF to UTF-8).
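The effect can be reproduced with a short Python sketch. It only approximates the behaviour described above and is not Lucene code; `analyze` and `matches` are hypothetical helpers, and the request strings are the literal characters backslash plus letter, as typed into a search field.

```python
import re

def analyze(text):
    """Lowercase and keep only runs of letters/digits; everything
    else (backslash, '@', '/', whitespace) is skipped."""
    return re.findall(r"[^\W_]+", text.lower())

def matches(query, document):
    """True if any term of the query occurs as a term in the document."""
    doc_terms = set(analyze(document))
    return any(term in doc_terms for term in analyze(query))

print(analyze(r"\n"))                     # ['n']
print(analyze("N E W Y O R K"))           # ['n', 'e', 'w', 'y', 'o', 'r', 'k']
print(matches(r"\n", "N E W Y O R K"))    # True
print(matches(r"\t", "N E W Y O R K"))    # False, no term 't' in the document
```

A non-empty result for “\n” therefore points to single-letter terms in the index rather than to a match on a line break.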

A nice tool for browsing the Lucene index is “Luke”, which can be found here: http://www.getopt.org/luke/. The SMILA index is located in “.\workspace\.metadata\.plugins\com.brox.anyfinder.lucene\<INDEX_NAME>”. The documentation at http://lucene.apache.org/java/docs/ may also be useful.

Hope this helps…

Best regards

Michael

Dr. Michael Hagström

brox IT-Solutions GmbH
Chief Technology Officer

An der Breiten Wiese 9

30625 HANNOVER (Germany)
