Bug 26958 - Performance: indexing of documents for search should be fast
Summary: Performance: indexing of documents for search should be fast
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: User Assistance
Version: 2.1
Hardware: PC All
Importance: P3 normal
Target Milestone: 2.1 M5
Assignee: Konrad Kolosowski CLA
QA Contact:
URL:
Whiteboard:
Keywords: performance
Depends on: 16194
Blocks:
Reported: 2002-11-22 10:42 EST by Dorian Birsan CLA
Modified: 2003-01-15 16:19 EST

See Also:


Attachments

Description Dorian Birsan CLA 2002-11-22 10:42:05 EST
Currently, when help search is run from the workbench, the progress monitor 
adds too much overhead. We should update the progress less often.

Investigate speeding up the indexing of large document sets.

Investigate search filtering for large document sets and see if indexing 
requires changes.
Comment 1 Dorian Birsan CLA 2002-11-22 10:49:29 EST
Indexing large documents should be reasonably fast and should not hang the 
system (no out of memory errors, etc.).
Comment 2 Konrad Kolosowski CLA 2002-11-23 00:52:28 EST
When indexing the Eclipse docs on my machine, 7.7% of cumulative time within 
IndexingOperation was spent in the progress monitor.
I released code that no longer displays the name of every document indexed and 
updates progress less often. This reduced the progress monitor's share to 0.20% 
of cumulative time.
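A minimal sketch of that kind of throttling, assuming the standard Eclipse 
IProgressMonitor API; the class name and the update interval below are 
illustrative, not the actual help indexing code:

import org.eclipse.core.runtime.IProgressMonitor;

// Illustrative only: report progress in batches instead of once per document.
public class ThrottledIndexProgress {
    private static final int UPDATE_INTERVAL = 100; // documents per monitor update
    private final IProgressMonitor monitor;
    private int pending = 0;

    public ThrottledIndexProgress(IProgressMonitor monitor, int totalDocs) {
        this.monitor = monitor;
        monitor.beginTask("Indexing documentation", totalDocs);
    }

    public void documentIndexed() {
        if (++pending >= UPDATE_INTERVAL) {
            monitor.worked(pending); // one call covers the whole batch
            pending = 0;
        }
    }

    public void done() {
        if (pending > 0)
            monitor.worked(pending);
        monitor.done();
    }
}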
Comment 3 Konrad Kolosowski CLA 2002-11-25 11:34:10 EST
For indexing large documents fast, I have made the following changes to 
WordTokenStream:

made the methods final (methods are called about 160 times per document, and 
making them final decreases method call overhead);

changed tokenizing of the document to be done on demand in the next() method 
using a fixed-size buffer, instead of tokenizing the whole document in the 
constructor (see the sketch after this list); this ensures that large documents 
can be indexed without the huge memory requirement previously imposed by this 
class;

cached all relevant objects instead of creating new ones when tokenizing 
strings; the only exception is a StringBuffer that is allocated for each 
(few KB) buffer of characters;

allocated the StringBuffer and ArrayList with sizes that minimize the chance of 
internal resizing;

used ArrayList.get() instead of iterating with Iterator.next().
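A self-contained sketch of that on-demand tokenizing, not the actual 
WordTokenStream code; the buffer size and the simple letter-or-digit word rule 
are assumptions:

import java.io.IOException;
import java.io.Reader;

// Illustrative only: produce one token per next() call from a fixed-size
// character buffer that is refilled from the Reader as needed, so the whole
// document is never held in memory at once.
public class BufferedWordTokenizer {
    private static final int BUF_SIZE = 4096;
    private final Reader reader;
    private final char[] buf = new char[BUF_SIZE];
    private int len = 0;   // number of valid chars in buf
    private int pos = 0;   // current read position in buf

    public BufferedWordTokenizer(Reader reader) {
        this.reader = reader;
    }

    /** Returns the next word, or null at end of input. */
    public String next() throws IOException {
        StringBuffer word = new StringBuffer(16);
        while (true) {
            if (pos >= len) {            // refill the buffer on demand
                len = reader.read(buf);
                pos = 0;
                if (len <= 0)
                    return word.length() > 0 ? word.toString() : null;
            }
            char c = buf[pos++];
            if (Character.isLetterOrDigit(c)) {
                word.append(Character.toLowerCase(c));
            } else if (word.length() > 0) {
                return word.toString();  // a delimiter ends the current word
            }
        }
    }
}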
Comment 4 Konrad Kolosowski CLA 2002-11-25 14:55:29 EST
4.5% of indexing time was spent in ResourceLocator locating and opening URLs 
(not the actual reading of contents). Released code to decrease this to 1.6% by 
caching URLs of doc.zip files.
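A minimal sketch of that kind of caching, with assumed class, method, and 
layout details rather than the actual ResourceLocator code:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: resolve the doc.zip URL for a plugin once and reuse it,
// instead of locating and opening it for every document indexed.
public class DocZipUrlCache {
    private static final Object NOT_FOUND = new Object();
    private final Map docZipUrls = new HashMap(); // pluginId -> URL (or NOT_FOUND)

    public URL getDocZipUrl(String pluginId, String installLocation) {
        Object cached = docZipUrls.get(pluginId);
        if (cached == NOT_FOUND)
            return null;
        if (cached != null)
            return (URL) cached;
        URL url = locateDocZip(pluginId, installLocation); // expensive lookup, done once
        docZipUrls.put(pluginId, url != null ? (Object) url : NOT_FOUND);
        return url;
    }

    private URL locateDocZip(String pluginId, String installLocation) {
        try {
            // Assumed layout for illustration; the real code resolves this
            // through the plugin registry.
            return new URL("file:" + installLocation + "/plugins/" + pluginId + "/doc.zip");
        } catch (MalformedURLException e) {
            return null;
        }
    }
}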
Comment 5 Dorian Birsan CLA 2002-11-27 22:22:23 EST
For products with lots of plugins we need a faster way to detect changes (which 
plugins have been indexed). Similarly, when tens of thousands of topics are 
present, we keep a table of what has been indexed; this table must be lean and 
fast to compare.
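A minimal sketch of such a table, assuming it is persisted as a properties file 
mapping plugin id to version; the names and file format are assumptions, not 
the actual help index code:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

// Illustrative only: a small persisted table of pluginId -> version used to
// decide quickly which plugins need (re)indexing.
public class IndexedPluginsTable {
    private final File file;
    private final Properties indexed = new Properties();

    public IndexedPluginsTable(File file) throws IOException {
        this.file = file;
        if (file.exists()) {
            FileInputStream in = new FileInputStream(file);
            try {
                indexed.load(in);
            } finally {
                in.close();
            }
        }
    }

    /** True if the plugin is missing from the table or its version changed. */
    public boolean needsIndexing(String pluginId, String version) {
        return !version.equals(indexed.getProperty(pluginId));
    }

    public void markIndexed(String pluginId, String version) {
        indexed.setProperty(pluginId, version);
    }

    public void save() throws IOException {
        FileOutputStream out = new FileOutputStream(file);
        try {
            indexed.store(out, "indexed plugins");
        } finally {
            out.close();
        }
    }
}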
Comment 6 Konrad Kolosowski CLA 2002-11-29 14:30:28 EST
For large documents, there are various limits on the size being indexed in 
different parts of our code. We should ensure that the limit is consistent 
among how much of the document is read, how much is parsed, and how many 
keywords are indexed. If the numbers diverge a lot, there will either be an 
unnecessary limitation on the size of document indexed, or time will be wasted 
reading documents beyond how much will be indexed.
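One way to keep these limits from diverging is to define them in a single 
place. A minimal sketch with assumed names; only the 1M figures come from 
comment 7 below:

// Illustrative only: one class owning the indexing limits, so the code that
// reads, parses, and indexes a document all refers to the same numbers.
public final class IndexingLimits {

    // Maximum number of characters of parsed text kept per document.
    public static final int MAX_PARSED_CHARS = 1000000;

    // Maximum number of tokens handed to the index writer per document;
    // sized so the character limit above always takes effect first.
    public static final int MAX_INDEXED_TOKENS = 1000000;

    private IndexingLimits() {
        // not instantiable
    }
}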
Comment 7 Konrad Kolosowski CLA 2002-12-02 13:03:31 EST
To improve indexing and search speed we have a limit of 1M characters on the 
parsed text of the document.  No more will be indexed.  Ideally, documents 
would not be read past this point, but the Lucene HTML parser we are using does 
not behave nicely when the reader it is provided with is closed before reaching 
the end of the file.  For now we will leave the help code unchanged: it will 
read the input from the HTML parser to the end, but discard everything past 1M 
characters.  This can be changed when a different parser is used.
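A self-contained sketch of that workaround, assuming the parsed text arrives 
through a java.io.Reader (the class and constant names are illustrative): the 
stream is drained to the end so the parser is not cut off, but only the first 
1M characters are kept.

import java.io.IOException;
import java.io.Reader;

// Illustrative only: read the parsed text to the end of the stream, but keep
// only the first MAX_PARSED_CHARS characters for indexing.
public class CappedTextCollector {
    private static final int MAX_PARSED_CHARS = 1000000;

    public static String collect(Reader parsedText) throws IOException {
        StringBuffer kept = new StringBuffer(8192);
        char[] buf = new char[4096];
        int read;
        while ((read = parsedText.read(buf)) > 0) {
            int room = MAX_PARSED_CHARS - kept.length();
            if (room > 0) {
                kept.append(buf, 0, Math.min(read, room));
            }
            // Past the limit, keep reading to EOF but discard the characters,
            // so the HTML parser is never cut off mid-stream.
        }
        return kept.toString();
    }
}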

The other limit is the maximum number of tokens (words) in a field (document 
contents) that the Lucene IndexWriter will handle.  It is used to prevent out 
of memory problems within Lucene.  We have set this limit to 1M tokens, which 
should never take effect given our first limit of 1M characters.  In effect, we 
are safe, and no adjustments to the limits seem necessary.
Comment 8 Dorian Birsan CLA 2003-01-14 15:19:14 EST
We also have a pre-built index solution that can significantly speed up 
indexing (this requires build support).