| AW: [smila-user] Re: Funny questions about SMILA |
|
The GermanAnalyzer does German stemming. Here is the relevant part of the Lucene documentation:

Supports an external list of stopwords (words that will not be indexed at all) and an external list of exclusions (words that will not be stemmed, but indexed). A default set of stopwords is used unless an alternative list is specified, but the exclusion list is empty by default.

From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On behalf of Andreas.Schultz@xxxxxxxxxxx

Hi Michael,

1) I think the text contains only “Schüler” and _not_ “Schueler”, but I will verify this.

2) Do you know of an analyzer for German that includes language-dependent stemming?

Thanks,
Andreas Schultz

- - - - Please note my new contact details - - - -
www.empolis.com

From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On behalf of Michael Hagström

Hi all,

please find the answers to your questions about the usage of Lucene in SMILA below.

1. How does SMILA handle German special characters like ö, ä, ü, ß?
2. Is the Lucene StandardAnalyzer configurable in a way that allows stop words to be altered, added, deleted, etc.?
3. Does the Lucene StandardAnalyzer provide normalization?
4. Using “\n”, “\r” or “\t” as a search request leads to a search result which is not empty. Could this be disabled?

The handling of umlauts depends on the analyzer. The standard analyzer does not change umlauts; it only converts uppercase characters to the corresponding lowercase characters. Does the text contain both “Schüler” and “Schueler”?

A list of stop words can be defined in the declaration of the analyzer in the following way:

    <IndexField FieldNo="0" IndexValue="true" Name="search-article_content"
                StoreText="true" Tokenize="true" Type="Text">
      <Analyzer ClassName="org.apache.lucene.analysis.de.GermanAnalyzer">
        <ParameterSet xmlns="http://www.brox.de/ParameterSet">
          <Parameter xsi:type="StringList" Name="stopWords">
            <Value>der</Value>
            <Value>die</Value>
            <Value>das</Value>
          </Parameter>
        </ParameterSet>
      </Analyzer>
    </IndexField>

This declaration is used during index creation.

What do you mean by “normalization”? Non-searchable characters like “@”, “/” and “\” are skipped, and searchable characters are not changed (apart from the conversion from upper- to lowercase). Language-dependent stemming is not provided by the standard analyzer.

Again, non-searchable characters are skipped, so using “\n” as a search request results in the request “n”. Please check whether the indexed data contains “n”, “r”, etc. Note that indexing “N E W Y O R K” will produce a hit when searching for “n” or “r”. Another possible reason for such search results is “trash” left over from the file conversion process (e.g. PDF to UTF-8).

A nice tool for browsing the Lucene index is “luke”, which can be found here: http://www.getopt.org/luke/. The index of SMILA is located in “.\workspace\.metadata\.plugins\com.brox.anyfinder.lucene\<INDEX_NAME>”.
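
The effect described above for “\n” and “N E W Y O R K” can be reproduced with a minimal sketch in plain Java. It does not call Lucene: the class TokenizeSketch and its tokenize method are hypothetical names, and the method only approximates what the answers describe (lowercasing, skipping non-searchable characters, leaving umlauts untouched). The real StandardAnalyzer grammar is more elaborate, so treat this as an illustration under those assumptions, not as the analyzer's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizeSketch {

    // Hypothetical stand-in for an analyzer: keeps letters and digits,
    // lowercases them, and treats every other character as a separator.
    // This is an assumption made for illustration, not Lucene code.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A skipped character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Text as it might come out of a sloppy file conversion.
        List<String> indexed = tokenize("N E W Y O R K");
        // The search request typed as backslash followed by 'n'.
        List<String> query = tokenize("\\n");

        System.out.println(indexed); // prints [n, e, w, y, o, r, k]
        System.out.println(query);   // prints [n]
        System.out.println(indexed.containsAll(query)); // prints true

        // "Sch\u00fcler" is "Schüler": the umlaut is kept, only 'S' is lowercased.
        System.out.println(tokenize("Sch\u00fcler"));
    }
}
```

Under these assumptions the backslash is dropped, the request collapses to the single token “n”, and that token matches any document in which “n” was indexed as a word of its own.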

The documentation at http://lucene.apache.org/java/docs/ may also be of interest.

Hope this helps…

Best regards
Michael

Dr. Michael Hagström
brox IT-Solutions GmbH
An der Breiten Wiese 9
30625 HANNOVER
(Germany)
|