AW: AW: [smila-dev] Lucene indexing performance

Hi Leo,

 

thanks for your feedback.

We are currently testing a “hotfix” and hope we can provide a solution for this issue with M3.

 

Bye,

Daniel

 

From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On behalf of Leo Sauermann
Sent: Monday, 18 May 2009 15:40
To: Smila project developer mailing list
Subject: Re: AW: [smila-dev] Lucene indexing performance

 

I have seen this in other projects and consider it normal, especially when Lucene is used in a "write/flush after each change" way.

I don't know a solution, we are also struggling with it, but ways to go are:
* use one IndexWriter, flush/write only very rarely, and never close it.
* if you batch a big job, create the new documents in a separate index and later merge it into the main index using Lucene's merge functionality.

whatever you do, this is NOT the real problem.

the real problem will be updating/deleting one or many documents in an index, which takes AGES.
for that, a possible performance-oriented solution is NOT to delete/update, but to set an external "deleted" flag in a table for the affected documents,
add the new versions on top, and then filter out the "deleted" ones when searching.
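
To make the "deleted flag" idea concrete, here is a minimal sketch in plain Python. It does not use the Lucene API; `SoftDeleteIndex` and all of its methods are hypothetical names, and the "index" is just an append-only list. It only illustrates the pattern: never rewrite old entries, flag them as deleted in an external set, and filter at search time.

```python
class SoftDeleteIndex:
    """Toy append-only index with an external 'deleted' flag set (hypothetical)."""

    def __init__(self):
        self._entries = []     # append-only list of (doc_id, text)
        self._deleted = set()  # positions in _entries flagged as deleted
        self._live = {}        # doc_id -> position of its current version

    def add(self, doc_id, text):
        # An update is "flag the old version, append the new one".
        self.delete(doc_id)
        self._live[doc_id] = len(self._entries)
        self._entries.append((doc_id, text))

    def delete(self, doc_id):
        # Nothing is removed from _entries; only the flag is set.
        pos = self._live.pop(doc_id, None)
        if pos is not None:
            self._deleted.add(pos)

    def search(self, term):
        # Filter out flagged entries while matching.
        return [doc_id
                for pos, (doc_id, text) in enumerate(self._entries)
                if pos not in self._deleted and term in text.split()]


idx = SoftDeleteIndex()
idx.add("a", "lucene index")
idx.add("b", "lucene merge")
idx.add("a", "smila index")   # new version of "a"; the old one is only flagged
```

The price of this approach is that flagged entries keep taking up space and are still scanned on every search, so some occasional cleanup pass would presumably still be needed.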

but I don't really know this from experience, it is only what I try in that situation... maybe ask on the Lucene mailing lists...

best
Leo

It was Andreas.Weber@xxxxxxxxxxx who said at the right time 13.05.2009 16:15 the following words:

FYI, here's a short update of the still running Lucene indexing test with SMILA:
 
In the first hour, 85,000 docs were indexed.
In the second hour, approx. 65,000 were indexed, which makes 150,000 in total.
Now, after 7 hours, 380,000 docs are indexed, which is about 55,000/hour.
 
Not sure how this will go on, but I think we have to do something...
 
BTW, in a test scenario without SMILA, it took 175 h to index the 25 million docs with Lucene.
(That's about 140,000 docs/hour.)
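
For reference, the rates quoted above can be recomputed with a few lines of Python. This is only a sketch of the arithmetic; the function names are made up for illustration, and the projection assumes that the SMILA run covers the same 25 million documents and that the rate stays at about 55,000 docs/hour, which is optimistic given that it has been falling.

```python
def docs_per_hour(total_docs, hours):
    """Average indexing rate over a run."""
    return total_docs / hours


def hours_needed(total_docs, rate):
    """Projected duration at a constant rate (an assumption, the rate is falling)."""
    return total_docs / rate


smila_avg = docs_per_hour(380_000, 7)        # SMILA run, first 7 hours
plain_avg = docs_per_hour(25_000_000, 175)   # Lucene test without SMILA
projected = hours_needed(25_000_000, 55_000) # SMILA run at a constant 55,000/hour

print(f"SMILA average:      {smila_avg:,.0f} docs/hour")
print(f"Lucene-only average: {plain_avg:,.0f} docs/hour")
print(f"Projected SMILA run: {projected:,.0f} hours")
```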
 
Best regards,
 Andreas
 
  
-----Original Message-----
From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On behalf of
Daniel.Stucky@xxxxxxxxxxx
Sent: Wednesday, 13 May 2009 13:31
To: smila-dev@xxxxxxxxxxx
Subject: [smila-dev] Lucene indexing performance
 
Hi all,
 
during an index build (over 150,000 documents) we noticed that indexing
speed gets slower as the index increases in size. Compared to the first
hour of execution, the 2nd hour was only capable of indexing 80% of the
load that was indexed in the first hour.
 
I took a look at the Lucene integration code (by brox) and found that
for each index update (add or delete) a new IndexWriter is created and
closed. This ensures that the document is committed for IndexReaders and
the index is flushed, but I guess that it's bad for performance.
 
What were the reasons for implementing it that way? Wouldn't it be
possible to reuse an IndexWriter, flushing the index either by memory
usage or by number of documents added/deleted?
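
A minimal sketch of the "reuse one writer, flush by document count" idea, in plain Python rather than against the Lucene API. `BatchingWriter`, its `commit` callback and `max_buffered_docs` are hypothetical names; the callback stands in for whatever actually makes a batch durable and visible to readers.

```python
class BatchingWriter:
    """Hypothetical long-lived writer that commits in batches, not per document."""

    def __init__(self, commit, max_buffered_docs=1000):
        self._commit = commit          # callable that persists a list of docs
        self._max = max_buffered_docs  # flush threshold (number of documents)
        self._buffer = []
        self.commits = 0               # how many flushes actually happened

    def add(self, doc):
        self._buffer.append(doc)
        if len(self._buffer) >= self._max:
            self.flush()

    def flush(self):
        if not self._buffer:
            return                     # nothing buffered, skip the commit
        self._commit(list(self._buffer))
        self._buffer.clear()
        self.commits += 1


def index_all(docs, max_buffered_docs):
    """Index docs through one writer; return the stored docs and the commit count."""
    stored = []
    writer = BatchingWriter(stored.extend, max_buffered_docs)
    for doc in docs:
        writer.add(doc)
    writer.flush()                     # final flush so the tail is not lost
    return stored, writer.commits


stored, commits = index_all(range(2500), max_buffered_docs=1000)
```

With a threshold of 1 this degenerates into the current "commit after every update" behaviour. The trade-off is the one mentioned above: documents still sitting in the buffer are not yet visible to readers until the next flush.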
 
Bye,
Daniel
_______________________________________________
smila-dev mailing list
smila-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/smila-dev
    
 
  




-- 
____________________________________________________
DI Leo Sauermann       http://www.dfki.de/~sauermann 
 
Deutsches Forschungszentrum fuer 
Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080           Fon:   +49 631 20575-116
D-67663 Kaiserslautern  Fax:   +49 631 20575-102
Germany                 Mail:  leo.sauermann@xxxxxxx
 
Management:
Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Chairman)
Dr. Walter Olthoff
Chairman of the Supervisory Board:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern (district court), HRB 2313
____________________________________________________
