Currently, the progress monitor used when a help search is run from the workbench adds too much overhead; we should update the progress less often. Investigate speeding up the indexing of large document sets. Investigate search filtering for large document sets and see whether indexing requires changes.
Indexing large documents should be reasonably fast and should not hang the system (no out-of-memory errors, etc.).
When indexing the Eclipse docs on my machine, 7.7% of the cumulative time within IndexingOperation was spent in the progress monitor. I released code that does not display the name of every document indexed and updates progress less often. As a result, progress monitors now use 0.20% of cumulative time.
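To illustrate the kind of change involved, here is a minimal sketch of batched progress reporting against org.eclipse.core.runtime.IProgressMonitor. The class, the indexDocument() helper, and the update interval of 100 documents are all assumptions for illustration, not the actual IndexingOperation code.

```java
import java.net.URL;
import java.util.List;

import org.eclipse.core.runtime.IProgressMonitor;

/**
 * Hypothetical sketch: report progress in coarse increments instead of
 * once per document, and skip the per-document subTask() label.
 */
public class ThrottledIndexingReporter {
    private static final int UPDATE_INTERVAL = 100; // documents per monitor update (assumed)

    public void indexAll(List<URL> docs, IProgressMonitor monitor) {
        monitor.beginTask("Indexing documentation", docs.size());
        int pending = 0;
        for (URL doc : docs) {
            indexDocument(doc); // assumed indexing routine, not shown here
            pending++;
            if (pending == UPDATE_INTERVAL) {
                // One worked() call per batch keeps monitor overhead negligible.
                monitor.worked(pending);
                pending = 0;
            }
        }
        if (pending > 0) {
            monitor.worked(pending);
        }
        monitor.done();
    }

    private void indexDocument(URL doc) {
        // Placeholder for the actual per-document indexing call.
    }
}
```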
To index large documents fast, I have made the following changes to WordTokenStream: changed method signatures to final (the methods are called about 160 times per document, and making them final decreases method call overhead); changed tokenizing of the document to be done on demand in the next() method using a fixed-size buffer, instead of tokenizing the whole document in the constructor, which ensures that large documents can be indexed without the huge memory requirement previously imposed by this class; cached all relevant objects instead of creating new ones when tokenizing strings, the only exception being a StringBuffer that is allocated for each (few KB) buffer of characters; allocated the StringBuffer and ArrayList with sizes that minimize the chance of internal resizing; used ArrayList.get() instead of ArrayList.Iterator.next().
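The following is a rough sketch of the on-demand, buffered tokenization approach; the class name, buffer size, and word-splitting rule are illustrative assumptions, and this is not the real WordTokenStream (for example, it ignores words split across buffer boundaries).

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;

/**
 * Rough sketch of on-demand, buffered tokenization. Characters are read in
 * fixed-size chunks inside next() rather than tokenizing the whole document
 * in the constructor, and the buffers are reused across calls.
 */
public final class BufferedWordStream {
    private static final int BUF_SIZE = 4096;           // few KB read at a time (assumed)
    private final Reader reader;
    private final char[] buf = new char[BUF_SIZE];      // reused for every read
    private final ArrayList<String> tokens = new ArrayList<String>(512);
    private int nextToken = 0;

    public BufferedWordStream(Reader reader) {
        this.reader = reader;                            // no up-front tokenization
    }

    /** Returns the next word, or null at end of input. Declared final to
     *  reduce per-call overhead, as in the change described above. */
    public final String next() throws IOException {
        while (nextToken >= tokens.size()) {
            if (!fillTokens()) {
                return null;
            }
        }
        return tokens.get(nextToken++);                  // ArrayList.get(), no Iterator
    }

    /** Reads one buffer of characters and splits it into words. */
    private boolean fillTokens() throws IOException {
        int read = reader.read(buf, 0, BUF_SIZE);
        if (read < 0) {
            return false;
        }
        tokens.clear();
        nextToken = 0;
        StringBuffer word = new StringBuffer(32);        // the only per-buffer allocation
        for (int i = 0; i < read; i++) {
            char c = buf[i];
            if (Character.isLetterOrDigit(c)) {
                word.append(Character.toLowerCase(c));
            } else if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());                 // simplification: may split a word at a buffer boundary
        }
        return true;
    }
}
```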
4.5% of indexing time was spent in ResourceLocator for locating and opening URLs (not the actual reading of contents). Released code that decreases this to 1.6% by caching URLs of doc.zip files.
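A minimal sketch of the caching idea, assuming a hypothetical DocZipCache keyed by plug-in id and locale; the real ResourceLocator code differs, but the point is that the expensive lookup happens once per plug-in rather than once per document.

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch (not the actual ResourceLocator): cache the resolved
 * doc.zip URL per plug-in and locale so it is located only once, instead of
 * on every document open.
 */
public class DocZipCache {
    // Key is "pluginId/locale"; value is the resolved doc.zip URL, or null if absent.
    private final Map<String, URL> cache = new HashMap<String, URL>();

    public synchronized URL getDocZip(String pluginId, String locale) {
        String key = pluginId + '/' + locale;
        if (cache.containsKey(key)) {
            return cache.get(key);                       // hit: no bundle or file-system lookup
        }
        URL resolved = locateDocZip(pluginId, locale);   // expensive lookup, done once
        cache.put(key, resolved);                        // negative results are cached too
        return resolved;
    }

    private URL locateDocZip(String pluginId, String locale) {
        // Placeholder for the search performed by the real locator.
        return null;
    }
}
```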
For products with lots of plug-ins we need a faster way to detect changes (which plug-ins have been indexed). Similarly, when there are tens of thousands of topics, we keep a table of what has been indexed; this table must be lean and fast to compare.
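One way such a lean comparison could look, as a hypothetical sketch: keep only plug-in ids and version strings with the index and diff them against the installed plug-ins. The class and method names here are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch: detect which plug-ins need (re)indexing by comparing
 * a lightweight map of plug-in id -> version saved with the index against
 * the currently installed plug-ins. Only ids and version strings are kept,
 * so the comparison stays cheap even with thousands of plug-ins.
 */
public class IndexedPluginsTable {

    /** Returns the ids of plug-ins that are new or whose version changed. */
    public static Set<String> findStale(Map<String, String> indexed,
                                        Map<String, String> installed) {
        Set<String> stale = new HashSet<String>();
        for (Map.Entry<String, String> e : installed.entrySet()) {
            String indexedVersion = indexed.get(e.getKey());
            if (indexedVersion == null || !indexedVersion.equals(e.getValue())) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }

    /** Returns the ids of plug-ins that were removed and should be purged. */
    public static Set<String> findRemoved(Map<String, String> indexed,
                                          Map<String, String> installed) {
        Set<String> removed = new HashSet<String>(indexed.keySet());
        removed.removeAll(installed.keySet());
        return removed;
    }
}
```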
For large documents, there are various limits on the size being indexed in different parts of our code. We should ensure that the limits are consistent among how much of the document is read, how much is parsed, and how many keywords are indexed. If the numbers diverge a lot, there will either be an unnecessary limitation on the size of the document indexed, or time will be wasted reading documents beyond what will be indexed.
To improve indexing and search speed we have a limit of 1M characters on the parsed text of a document; no more will be indexed. It would be desirable for documents not to be read past this point, but the Lucene HTML parser we are using does not behave nicely when the reader it is provided is closed before reaching the end of the file. For now we will leave the help code unchanged: it will read the input from the HTML parser to the end, but discard everything past 1M characters. This can be changed when a different parser is used. The other limit is the maximum number of tokens (words) in a field (the document contents) that the Lucene IndexWriter will handle; it exists to prevent out-of-memory problems within Lucene. We have set this limit to 1M tokens, which should never take effect given our first limit of 1M characters. In effect, we are safe, and no adjustments to the limits seem necessary.
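A minimal sketch of the character cap described above, with hypothetical names: it reads the parser output to the end (so the Lucene HTML parser is not upset by an early close) but keeps only the first 1M characters for indexing. The separate 1M token cap would be configured on the Lucene IndexWriter and, as noted, should never be reached.

```java
import java.io.IOException;
import java.io.Reader;

/**
 * Illustrative sketch of the 1M-character limit: the parser's output is read
 * to EOF, but characters past the limit are discarded rather than indexed.
 * Names are hypothetical, not the actual help code.
 */
public class ParsedTextCollector {
    private static final int MAX_INDEXED_CHARS = 1000000; // 1M character limit

    public static String collect(Reader parsedText) throws IOException {
        StringBuffer kept = new StringBuffer();
        char[] buf = new char[4096];
        int read;
        while ((read = parsedText.read(buf)) >= 0) {          // always read to the end
            int room = MAX_INDEXED_CHARS - kept.length();
            if (room > 0) {
                kept.append(buf, 0, Math.min(read, room));    // keep only up to the limit
            }
            // past the limit: characters are read and thrown away
        }
        return kept.toString();
    }
}
```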
We also have a pre-built index solution that can significantly speed up indexing (this requires build support).