I have seen this in other projects too; it is normal, especially when Lucene
is used in a "write/flush after each change" way.
I don't know a full solution, we are also struggling with it, but possible
ways to go are:
* use one IndexWriter, flush/write only very seldom, and never close it.
* if you run a big batch job - create the new documents in a separate index,
then merge this new index into the main index using Lucene's merge
functionality.
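The first bullet can be sketched with a plain-Java analogy. This deliberately uses stdlib file writing rather than Lucene's API (the class and method names below are mine, not from the SMILA code): opening and closing a writer per record pays the full flush/close cost every time, while one long-lived writer amortizes it, which is the same trade-off as per-document IndexWriter churn.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

public class BatchedWrites {

    // Anti-pattern analogous to creating a new IndexWriter per document:
    // open, write one record, flush, close -- for every single record.
    static void writePerRecord(Path file, List<String> records) throws IOException {
        for (String r : records) {
            try (BufferedWriter w = Files.newBufferedWriter(file,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                w.write(r);
                w.newLine();
            } // close() flushes and releases the handle each time
        }
    }

    // The pattern suggested above: keep one writer open, close it once.
    static void writeBatched(Path file, List<String> records) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(file,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String r : records) {
                w.write(r);
                w.newLine();
            }
        } // single flush+close at the end of the batch
    }

    public static void main(String[] args) throws IOException {
        List<String> docs = Arrays.asList("doc1", "doc2", "doc3");
        Path a = Files.createTempFile("per-record", ".txt");
        Path b = Files.createTempFile("batched", ".txt");
        writePerRecord(a, docs);
        writeBatched(b, docs);
        // Both produce the same result; only the open/close churn differs.
        System.out.println(Files.readAllLines(a).equals(Files.readAllLines(b)));
    }
}
```

The same idea applies to an IndexWriter: the output is identical either way, but the per-record variant repeats the fixed open/flush/close cost for every document.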
Whatever you do, this is NOT the real problem.
The real problem will be updating/deleting one or many documents in
an index, which takes AGES.
For that, a possible performance-oriented solution is NOT to
delete/update, but to set an external "deleted" flag in a table for
those documents,
add new versions on top, and then filter out the "deleted" ones when
searching.
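A minimal sketch of that flag-based approach, using in-memory maps in place of a real Lucene index (all class and method names here are made up for illustration; a real implementation would keep the deleted-flag table outside the index and apply it as a filter at search time):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SoftDeleteIndex {

    // Latest "indexed" version per document id; new versions shadow old ones.
    private final Map<String, String> latest = new HashMap<>();

    // External table of ids flagged as deleted -- nothing is ever removed
    // from the index itself, so no expensive delete/update is needed.
    private final Set<String> deleted = new HashSet<>();

    public void add(String id, String content) {
        latest.put(id, content);  // new version on top of the old one
        deleted.remove(id);       // re-adding revives a flagged id
    }

    public void delete(String id) {
        deleted.add(id);          // cheap flag write instead of index rewrite
    }

    // "Search": return current content, filtering out flagged documents.
    public List<String> search() {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : latest.entrySet()) {
            if (!deleted.contains(e.getKey())) {
                hits.add(e.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SoftDeleteIndex idx = new SoftDeleteIndex();
        idx.add("a", "v1");
        idx.add("a", "v2");   // update = add new version, no delete
        idx.add("b", "w1");
        idx.delete("b");      // delete = set flag, index untouched
        System.out.println(idx.search());
    }
}
```

The trade-off: deletes and updates become O(1) flag writes, at the cost of filtering at query time and a growing index that needs occasional compaction.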
But I don't really know this from experience; it is only what I would try in
that situation... maybe ask on the Lucene mailing lists...
best
Leo
On 13.05.2009 16:15, Andreas.Weber@xxxxxxxxxxx wrote:
FYI, here's a short update on the still-running Lucene indexing test with SMILA:
In the first hour, 85,000 docs were indexed.
In the second hour, approx. 65,000 were indexed, making 150,000 in total.
Now, after 7 hours, 380,000 docs are indexed, which is about 55,000/hour.
Not sure how this will go on, but I think we have to do something...
BTW, in a test scenario without SMILA, it took 175 hours to index the 25 million docs with Lucene.
(That's about 140,000 docs/hour.)
Best regards,
Andreas
-----Original Message-----
From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of
Daniel.Stucky@xxxxxxxxxxx
Sent: Wednesday, 13 May 2009 13:31
To: smila-dev@xxxxxxxxxxx
Subject: [smila-dev] Lucene indexing performance
Hi all,
during an index build (over 150,000 documents) we noticed that indexing
speed gets slower as the index grows. Compared to the first
hour of execution, the second hour indexed only 80% of the
volume that was indexed in the first hour.
I took a look at the Lucene integration code (by brox) and found that
for each index update (add or delete) a new IndexWriter is created and
closed. This ensures that the document is committed for IndexReaders and
that the index is flushed, but I guess it's bad for performance.
What were the reasons for implementing it that way? Wouldn't it be
possible to reuse an IndexWriter, flushing the index either by memory
usage or by the number of documents added/deleted?
Bye,
Daniel
_______________________________________________
smila-dev mailing list
smila-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/smila-dev
--
____________________________________________________
DI Leo Sauermann http://www.dfki.de/~sauermann
Deutsches Forschungszentrum fuer
Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080 Fon: +49 631 20575-116
D-67663 Kaiserslautern Fax: +49 631 20575-102
Germany Mail: leo.sauermann@xxxxxxx
Management Board:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Chairman)
Dr. Walter Olthoff
Chairman of the Supervisory Board:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
____________________________________________________