Currently, the progress monitor used when a help search is run from the workbench adds too much overhead; we should update the progress less often. Investigate speeding up the indexing of large document sets. Investigate search filtering for large document sets and see whether indexing requires changes.
Indexing large documents should be reasonably fast and should not hang the system (no out-of-memory errors, etc.).
When indexing the Eclipse docs on my machine, 7.7% of the cumulative time within IndexingOperation was spent in the progress monitor. I released code that does not display the name of every document indexed and updates progress less often. As a result, progress monitors now use 0.20% of cumulative time.
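To illustrate the kind of change involved, here is a minimal sketch of batched progress reporting against org.eclipse.core.runtime.IProgressMonitor. The class, the indexDocument() helper, and the update interval of 100 documents are all assumptions for illustration, not the actual IndexingOperation code.

```java
import java.net.URL;
import java.util.List;

import org.eclipse.core.runtime.IProgressMonitor;

/**
 * Hypothetical sketch: report progress in coarse increments instead of
 * once per document, and skip the per-document subTask() label.
 */
public class ThrottledIndexingReporter {
    private static final int UPDATE_INTERVAL = 100; // documents per monitor update (assumed)

    public void indexAll(List<URL> docs, IProgressMonitor monitor) {
        monitor.beginTask("Indexing documentation", docs.size());
        int pending = 0;
        for (URL doc : docs) {
            indexDocument(doc); // assumed indexing routine, not shown here
            pending++;
            if (pending == UPDATE_INTERVAL) {
                // One worked() call per batch keeps monitor overhead negligible.
                monitor.worked(pending);
                pending = 0;
            }
        }
        if (pending > 0) {
            monitor.worked(pending);
        }
        monitor.done();
    }

    private void indexDocument(URL doc) {
        // Placeholder for the actual per-document indexing call.
    }
}
```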
To index large documents fast, I have made the following changes to WordTokenStream: changed method signatures to final (the methods are called about 160 times per document, and making them final decreases method call overhead); changed tokenizing of the document to be done on demand in the next() method using a fixed-size buffer, instead of tokenizing the whole document in the constructor, which ensures that large documents can be indexed without the huge memory requirement previously imposed by this class; cached all relevant objects instead of creating new ones when tokenizing strings, the only exception being a StringBuffer that is allocated for each (few KB) buffer of characters; allocated the StringBuffer and ArrayList with sizes that minimize the chance of internal resizing; used ArrayList.get() instead of ArrayList.Iterator.next().
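The following is a rough sketch of the on-demand, buffered tokenization approach; the class name, buffer size, and word-splitting rule are illustrative assumptions, and this is not the real WordTokenStream (for example, it ignores words split across buffer boundaries).

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;

/**
 * Rough sketch of on-demand, buffered tokenization. Characters are read in
 * fixed-size chunks inside next() rather than tokenizing the whole document
 * in the constructor, and the buffers are reused across calls.
 */
public final class BufferedWordStream {
    private static final int BUF_SIZE = 4096;           // few KB read at a time (assumed)
    private final Reader reader;
    private final char[] buf = new char[BUF_SIZE];      // reused for every read
    private final ArrayList<String> tokens = new ArrayList<String>(512);
    private int nextToken = 0;

    public BufferedWordStream(Reader reader) {
        this.reader = reader;                            // no up-front tokenization
    }

    /** Returns the next word, or null at end of input. Declared final to
     *  reduce per-call overhead, as in the change described above. */
    public final String next() throws IOException {
        while (nextToken >= tokens.size()) {
            if (!fillTokens()) {
                return null;
            }
        }
        return tokens.get(nextToken++);                  // ArrayList.get(), no Iterator
    }

    /** Reads one buffer of characters and splits it into words. */
    private boolean fillTokens() throws IOException {
        int read = reader.read(buf, 0, BUF_SIZE);
        if (read < 0) {
            return false;
        }
        tokens.clear();
        nextToken = 0;
        StringBuffer word = new StringBuffer(32);        // the only per-buffer allocation
        for (int i = 0; i < read; i++) {
            char c = buf[i];
            if (Character.isLetterOrDigit(c)) {
                word.append(Character.toLowerCase(c));
            } else if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());                 // simplification: may split a word at a buffer boundary
        }
        return true;
    }
}
```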
4.5% of indexing time was spent in ResourceLocator for locating and opening URLs (not the actual reading of contents). Released code that decreases this to 1.6% by caching URLs of doc.zip files.
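A minimal sketch of the caching idea, assuming a hypothetical DocZipCache keyed by plug-in id and locale; the real ResourceLocator code differs, but the point is that the expensive lookup happens once per plug-in rather than once per document.

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch (not the actual ResourceLocator): cache the resolved
 * doc.zip URL per plug-in and locale so it is located only once, instead of
 * on every document open.
 */
public class DocZipCache {
    // Key is "pluginId/locale"; value is the resolved doc.zip URL, or null if absent.
    private final Map<String, URL> cache = new HashMap<String, URL>();

    public synchronized URL getDocZip(String pluginId, String locale) {
        String key = pluginId + '/' + locale;
        if (cache.containsKey(key)) {
            return cache.get(key);                       // hit: no bundle or file-system lookup
        }
        URL resolved = locateDocZip(pluginId, locale);   // expensive lookup, done once
        cache.put(key, resolved);                        // negative results are cached too
        return resolved;
    }

    private URL locateDocZip(String pluginId, String locale) {
        // Placeholder for the search performed by the real locator.
        return null;
    }
}
```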
For products with lots of plug-ins we need a faster way to detect changes (which plug-ins have been indexed). Similarly, when there are tens of thousands of topics, we keep a table of what has been indexed; this table must be lean and fast to compare.
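One way such a lean comparison could look, as a hypothetical sketch: keep only plug-in ids and version strings with the index and diff them against the installed plug-ins. The class and method names here are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch: detect which plug-ins need (re)indexing by comparing
 * a lightweight map of plug-in id -> version saved with the index against
 * the currently installed plug-ins. Only ids and version strings are kept,
 * so the comparison stays cheap even with thousands of plug-ins.
 */
public class IndexedPluginsTable {

    /** Returns the ids of plug-ins that are new or whose version changed. */
    public static Set<String> findStale(Map<String, String> indexed,
                                        Map<String, String> installed) {
        Set<String> stale = new HashSet<String>();
        for (Map.Entry<String, String> e : installed.entrySet()) {
            String indexedVersion = indexed.get(e.getKey());
            if (indexedVersion == null || !indexedVersion.equals(e.getValue())) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }

    /** Returns the ids of plug-ins that were removed and should be purged. */
    public static Set<String> findRemoved(Map<String, String> indexed,
                                          Map<String, String> installed) {
        Set<String> removed = new HashSet<String>(indexed.keySet());
        removed.removeAll(installed.keySet());
        return removed;
    }
}
```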
For large documents, there are various limits on the size being indexed in different parts of our code. We should ensure that the limits are consistent among how much of the document is read, how much is parsed, and how many keywords are indexed. If the numbers diverge a lot, there will either be an unnecessary limitation on the size of the document indexed, or time will be wasted reading documents beyond what will be indexed.
To improve indexing and search speed we have a limit of 1M characters on the parsed text of a document; no more will be indexed. It would be desirable for documents not to be read past this point, but the Lucene HTML parser we are using does not behave nicely when the reader it is provided is closed before reaching the end of the file. For now we will leave the help code unchanged: it will read the input from the HTML parser to the end, but discard everything past 1M characters. This can be changed when a different parser is used. The other limit is the maximum number of tokens (words) in a field (the document contents) that the Lucene IndexWriter will handle; it exists to prevent out-of-memory problems within Lucene. We have set this limit to 1M tokens, which should never take effect given our first limit of 1M characters. In effect, we are safe, and no adjustments to the limits seem necessary.
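A minimal sketch of the character cap described above, with hypothetical names: it reads the parser output to the end (so the Lucene HTML parser is not upset by an early close) but keeps only the first 1M characters for indexing. The separate 1M token cap would be configured on the Lucene IndexWriter and, as noted, should never be reached.

```java
import java.io.IOException;
import java.io.Reader;

/**
 * Illustrative sketch of the 1M-character limit: the parser's output is read
 * to EOF, but characters past the limit are discarded rather than indexed.
 * Names are hypothetical, not the actual help code.
 */
public class ParsedTextCollector {
    private static final int MAX_INDEXED_CHARS = 1000000; // 1M character limit

    public static String collect(Reader parsedText) throws IOException {
        StringBuffer kept = new StringBuffer();
        char[] buf = new char[4096];
        int read;
        while ((read = parsedText.read(buf)) >= 0) {          // always read to the end
            int room = MAX_INDEXED_CHARS - kept.length();
            if (room > 0) {
                kept.append(buf, 0, Math.min(read, room));    // keep only up to the limit
            }
            // past the limit: characters are read and thrown away
        }
        return kept.toString();
    }
}
```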
We also have a pre-built index solution that can significantly speed up indexing (this requires build support).