AW: [smila-user] Re: Funny questions about SMILA

The GermanAnalyzer is doing German stemming.

 

Here is the relevant part of the Lucene documentation:

 

Supports an external list of stopwords (words that will not be indexed at all) and an external list of exclusions (words that will not be stemmed, but indexed). A default set of stopwords is used unless an alternative list is specified, but the exclusion list is empty by default.
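
To make the difference between the two lists concrete, here is a minimal sketch in plain Java. It does not use the Lucene API; the class name, the method names and the toy stemmer (which just strips a trailing "en") are assumptions made up for illustration and are not the German stemmer Lucene uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopAndExclusionDemo {
    // Hypothetical toy stemmer: strips a trailing "en" from longer words.
    // This is NOT the stemming algorithm of the GermanAnalyzer.
    static String toyStem(String term) {
        if (term.length() > 4 && term.endsWith("en")) {
            return term.substring(0, term.length() - 2);
        }
        return term;
    }

    // Stop words are dropped entirely; exclusions are indexed unstemmed;
    // everything else is stemmed and indexed.
    static List<String> analyze(List<String> tokens, Set<String> stopWords, Set<String> exclusions) {
        List<String> indexed = new ArrayList<>();
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                continue; // stop word: not indexed at all
            }
            indexed.add(exclusions.contains(token) ? token : toyStem(token));
        }
        return indexed;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("der", "die", "das"));
        Set<String> exclusions = new HashSet<>(Arrays.asList("wagen"));
        // prints [schul, wagen]
        System.out.println(analyze(Arrays.asList("die", "schulen", "wagen"), stopWords, exclusions));
    }
}
```

With an empty exclusion list (the default mentioned above), "wagen" would be stemmed like any other term.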

 

 

 

 

From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On behalf of Andreas.Schultz@xxxxxxxxxxx
Sent: Wednesday, 3 February 2010 11:49
To: smila-user@xxxxxxxxxxx
Subject: AW: [smila-user] Re: Funny questions about SMILA

 

Hi Michael,

 

1)       I think the text contains only “Schüler” and _not_ “Schueler”, but I will check this.

2)       Do you know of an analyzer for German that includes language-dependent stemming?

 

 

Thanks

 

Andreas Schultz
Senior Software Developer

- - - - Please note my new contact details - - - -


Empolis GmbH  |  Meisenstr. 90 | 33607 Bielefeld  |  Germany
AN ATTENSITY GROUP COMPANY
Phone +49 (0)521 55 785 413|  Fax +49 (0)521 55 785 121
andreas.schultz@xxxxxxxxxxx

 

www.empolis.com
Registered office: Kaiserslautern  |  District Court Kaiserslautern HRB 30711  |  Managing Directors: Dr. Stefan Wess, Dr. Peter Tepassé

 


From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On behalf of Michael Hagström
Sent: Wednesday, 3 February 2010 10:56
To: smila-user@xxxxxxxxxxx
Subject: [smila-user] Re: Funny questions about SMILA

 

Hi all,

 

Please find below the answers to your questions about the use of Lucene in SMILA.

 

1.       How does SMILA handle German special characters such as ö, ä, ü, ß?
I tried requests with “Schueler” and “Schüler”, and the results were nearly the same.
But when I tried “über” and “ueber”, the second request did not return any results.
So please tell me why “Schueler” and “Schüler” seem to be treated as identical in a request, but “über” and “ueber” are not.

2.       Can the Lucene StandardAnalyzer be configured so that stop words can be altered, added, deleted, etc.?

3.       Does the Lucene StandardAnalyzer provide normalization?

4.       Using “\n”, “\r” or “\t” as a search request leads to a non-empty search result. Can this be disabled?

 

The handling of umlauts depends on the analyzer. The standard analyzer does not change umlauts itself; it only converts uppercase characters to their lowercase equivalents. Does the text contain both “Schüler” and “Schueler”?
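
As a rough illustration of why lowercasing alone does not make the two spellings match, here is a small sketch in plain Java. It only mimics the behaviour described above and does not call Lucene; the class and method names are made up for this example. Umlauts are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.Locale;

class LowercaseOnlyDemo {
    // Mimics only the behaviour described above:
    // case is folded, umlauts are left untouched.
    static String normalize(String term) {
        return term.toLowerCase(Locale.GERMAN);
    }

    public static void main(String[] args) {
        String schueler1 = "Sch\u00FCler"; // "Schüler"
        String schueler2 = "Schueler";
        // prints false: "schüler" and "schueler" remain different terms
        System.out.println(normalize(schueler1).equals(normalize(schueler2)));
        // prints true: "Über" and "über" differ only in case
        System.out.println(normalize("\u00DCber").equals(normalize("\u00FCber")));
    }
}
```

So if both spellings return hits, the most likely explanation under this assumption is that both spellings occur in the indexed text.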

 

A list of stop words can be defined in the declaration of the analyzer in the following way:

<IndexField FieldNo="0" IndexValue="true" Name="search-article_content" StoreText="true" Tokenize="true" Type="Text">
    <Analyzer ClassName="org.apache.lucene.analysis.de.GermanAnalyzer">
        <ParameterSet xmlns="http://www.brox.de/ParameterSet">
            <Parameter xsi:type="StringList" Name="stopWords">
                <Value>der</Value>
                <Value>die</Value>
                <Value>das</Value>
            </Parameter>
        </ParameterSet>
    </Analyzer>
</IndexField>

This declaration is used during index creation.

 

What do you mean by “normalization”? Non-searchable characters such as “@”, “/” and “\” are skipped, and searchable characters are not changed (apart from upper- to lowercase conversion). Language-dependent stemming is not provided by the standard analyzer.

 

Again, non-searchable characters are skipped, so using “\n” as a search request results in the request “n”. Please check whether the indexed data contains “n”, “r”, etc. Note that indexing “N E W  Y O R K” will produce a hit when searching for “n” or “r”. Another possible reason for such search results is “trash” left over from the file conversion process (e.g. PDF to UTF-8).
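
The effect can be sketched with a very simplified tokenizer in plain Java. This is only a stand-in for the behaviour described above (skip everything that is not a letter or digit, then lowercase); it is not the Lucene tokenizer, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SimpleTokenizerDemo {
    // Simplified assumption: every character that is not a letter or digit
    // separates tokens and is dropped; the remaining tokens are lowercased.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The request consists of the two characters backslash and 'n'.
        System.out.println(tokenize("\\n"));            // prints [n]
        // Spaced-out text is indexed as single-letter terms.
        System.out.println(tokenize("N E W  Y O R K")); // prints [n, e, w, y, o, r, k]
    }
}
```

Under this assumption the request “\n” becomes the term “n”, which matches any document containing an isolated “n”.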

 

A nice tool for browsing the Lucene index is “Luke”, which can be found here: http://www.getopt.org/luke/. The SMILA index is located in “.\workspace\.metadata\.plugins\com.brox.anyfinder.lucene\<INDEX_NAME>”. The documentation at http://lucene.apache.org/java/docs/ may also be of interest.

 

Hope this helps…

 

Best regards

 

Michael

 

 

Dr. Michael Hagström

 

brox IT-Solutions GmbH
Chief Technology Officer                                                                                               

An der Breiten Wiese 9                             

30625      HANNOVER (Germany)