Bug 57455 - Aggregated plugin search indexes
Summary: Aggregated plugin search indexes
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: User Assistance
Version: 3.0
Hardware: All All
Importance: P1 enhancement
Target Milestone: 3.1 M7
Assignee: Dejan Glozic CLA
QA Contact:
URL:
Whiteboard:
Keywords: performance
Depends on:
Blocks:
 
Reported: 2004-04-05 10:28 EDT by Paul Hardiman CLA
Modified: 2005-05-25 11:45 EDT
Description Paul Hardiman CLA 2004-04-05 10:28:32 EDT
Allow search indexes to be located within a doc plugin. Allow Eclipse Help to
aggregate plugins' indexes when developing the master index.
Comment 1 Konrad Kolosowski CLA 2004-04-05 10:47:58 EDT
Paul,
Do you mean that the pre-built text search index should be contributed in 
multiple pieces instead of one per product, or are you concerned about the 
keyword index or the alphabetical index?
Comment 2 Konrad Kolosowski CLA 2004-04-05 13:27:46 EDT
Reply from Paul:

Hi,
Thanks for responding.

I mean that each doc plugin should have the option to include a pre-built,
Eclipse Help compatible search index. When Eclipse Help builds the master
index for each doc plugin, it will either use the pre-existing indexes in
that plugin or it will generate the indexes on the fly. The builder of the
doc plugin takes responsibility for the currentness of the index if it
exists.

This may mean that Eclipse provides a utility to build search indexes for a
doc plugin. Perhaps that entity could be referenced in the plugin.xml file.
Comment 3 Konrad Kolosowski CLA 2004-04-05 13:30:37 EDT
Yes, we considered this alternative when developing support for the pre-built 
index, but we did not have time to develop both.
Right now the index can be contributed per product; what you suggest has 
certain advantages (if a product is built incrementally on top of other 
products, or plug-ins are updated frequently).
Comment 4 Mike Beltzner CLA 2005-04-04 14:34:24 EDT
This would adequately resolve the problem stated in bug 77328, as well. That is
to say that the plugin providers could ship with a pre-built index which would
drastically reduce the initial indexing time upon first search. 

Please note that in terms of user expectations, this first indexing process
should take no longer than 15-20s (assuming all the plugins have pre-built indices).
Comment 5 Dejan Glozic CLA 2005-04-18 13:17:47 EDT
The first part of the implementation finished as follows:

1) An element 'index' is added to the org.eclipse.help.toc extension:
   <index path="relative_path_to_index_directory"/>

   This element is an indication for help that the plug-in has a prebuilt index 
and also a pointer to the location of the index, starting from the 
plug-in/fragment directory.

2) An API class 'HelpIndexBuilder' is added to org.eclipse.help.base. This 
class parses the plug-in manifest provided as the input and creates index in 
the destination directory.

3) For fragments, one index is created for each locale directory found under 
the nl/ directory. Since fragments typically only serve files for documentation 
plug-ins, HelpIndexBuilder must be given the manifest of the plug-in the 
fragment belongs to, not of the fragment itself.

4) PDE UI has an action under 'PDE Tools', registered for plugin.xml and 
fragment.xml, called 'Create Help Index'. It produces the index as defined 
above. The help documents must not be zipped.

5) A custom Ant task has been defined in org.eclipse.help.base 
called 'help.createIndex' that accepts 'manifest' and 'destination' 
attributes. 

6) SDK documentation plug-in build.xml files have been modified to hook the 
Ant task and build the search index as part of the regular build.

Moving to Konrad for adding code in Help that takes advantage of prebuilt 
indexes if found.
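
Point 3 of comment 5 (one index per locale directory under nl/) can be pictured with a small Java sketch. It is not the HelpIndexBuilder code. The class name LocaleIndexDirs is hypothetical, and the layout it assumes (nl/<language>/ and nl/<language>/<COUNTRY>/, recognised by directory name) is an assumption made for illustration. The indexPath argument stands for the value of the 'path' attribute of the 'index' element.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of org.eclipse.help.base.
final class LocaleIndexDirs {

    /** Lists the directories where a per-locale index would be written. */
    static List<Path> indexDestinations(Path fragmentRoot, String indexPath) {
        List<Path> result = new ArrayList<>();
        Path nl = fragmentRoot.resolve("nl");
        if (!Files.isDirectory(nl)) {
            return result; // no translations, so no per-locale indexes
        }
        // Assumption: language directories have 2-3 lowercase letters,
        // country directories have 2 uppercase letters.
        for (Path language : subdirs(nl, "[a-z]{2,3}")) {
            result.add(language.resolve(indexPath));
            for (Path country : subdirs(language, "[A-Z]{2}")) {
                result.add(country.resolve(indexPath));
            }
        }
        return result;
    }

    /** Returns the sorted child directories of dir whose names match namePattern. */
    private static List<Path> subdirs(Path dir, String namePattern) {
        try (Stream<Path> children = Files.list(dir)) {
            return children
                    .filter(Files::isDirectory)
                    .filter(p -> p.getFileName().toString().matches(namePattern))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a fragment containing nl/ja and nl/pt/BR, and an index path of "index", the sketch yields nl/ja/index, nl/pt/index and nl/pt/BR/index.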
Comment 6 Dejan Glozic CLA 2005-04-18 13:19:47 EDT
(In reply to comment #5)
> 5) A custom Ant task has been defined in org.eclipse.help.base 
> called 'help.createIndex' that accepts 'manifest' and 'destination' 
> attributes. 

Correction: the name of the ant task is 'help.buildHelpIndex'.

Comment 7 Konrad Kolosowski CLA 2005-04-21 00:53:16 EDT
I released the first cut implementation of help indexing using prebuilt 
indexes.  Coincidentally :-) the index produced by the 'Create Help Index' 
action is compatible with what we are expecting.  Thanks Dejan.

Details like error handling, consistency, and progress reporting during index 
merging are still being looked at, but most functionality is there, so doc 
plugins can start providing their own index any time.

If a plug-in declares a prebuilt index, help checks for prebuilt indexes in the 
root, nl, os, and ws directories of the plug-in and its fragments.  Plug-ins 
are allowed to be jarred; the index is then extracted before being merged.
All these indexes are merged into the master index, and then duplicates are 
removed.  This ensures that documents in indexes in nl directories take 
precedence over documents prebuilt into the root index, while allowing the nl 
index to contain a subset of all documents.

When a plug-in with prebuilt index, or its fragments, contains translations, 
it must provide prebuilt index of translated files in the appropriate nl 
subdirectories in order to search non default language files.  I am just 
reiterating the need to run the tool as the tool already creates indexes in 
all appropriate directories.

It is highly recommended that all plug-ins in a product provide a prebuilt 
index.  After merging, the index is adjusted to contain precisely those 
documents, and only those, that are in the built set of TOCs.  Any topics in a 
TOC that did not make it into the master index from prebuilt indexes will be 
parsed and added at runtime as before.
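
The merge order described in comment 7 can be sketched in a few lines of Java. This is an in-memory illustration only, not the help system's implementation. The class PrebuiltIndexMerge and the representation of an index as a map from document href to indexed content are assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model: an index is a map from document href to its indexed content.
final class PrebuiltIndexMerge {

    /** Merges root and nl indexes, lets nl win on duplicates, then keeps only TOC documents. */
    static Map<String, String> merge(Map<String, String> rootIndex,
                                     Map<String, String> nlIndex,
                                     Set<String> tocHrefs) {
        Map<String, String> master = new TreeMap<>(rootIndex);
        master.putAll(nlIndex);               // nl entries replace root duplicates
        master.keySet().retainAll(tocHrefs);  // drop documents that are not in any TOC
        return master;
    }

    /** Documents in the TOCs that no prebuilt index supplied; these are parsed at runtime. */
    static Set<String> toIndexAtRuntime(Set<String> tocHrefs, Map<String, String> master) {
        Set<String> missing = new TreeSet<>(tocHrefs);
        missing.removeAll(master.keySet());
        return missing;
    }
}
```

With a root index covering a.html, b.html and old.html, an nl index covering only a.html, and TOCs listing a.html, b.html and c.html, the master index ends up with the translated a.html and the English b.html, and c.html is left to be indexed at runtime.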
Comment 8 Dejan Glozic CLA 2005-04-21 09:13:32 EDT
Konrad, that's great news! Did you get a chance to time the current 
implementation:

1) Total elapsed time without prebuilt indexes
2) Time with prebuilt indexes
3) Time with prebuilt indexes in JARd plug-ins
Comment 9 Konrad Kolosowski CLA 2005-04-21 11:52:09 EDT
On Thinkpad T40 running from the battery, using 5 eclipse doc plug-ins:
1. 46s  (unjarred plug-ins with    doc.zip, without index)
2. 1.4s (unjarred plug-ins with    doc.zip, with    index) 33 times speed-up
3. 2.2s (  jarred plug-ins without doc.zip, with    index) 21 times speed-up
Do I have a bug, or what? :-)  I am thinking of injecting a 
Thread.sleep(10000) so users will still notice how cool we are and that we 
create some kind of index at runtime, and we can improve to Thread.sleep(5000) 
in the next release.

The search results page in the help browser refreshes every 2 seconds while 
the index is updating, so the user sees the first results in about 46.3s, 
2.3s, 4.3s (in these scenarios).
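
The quoted speed-up factors follow directly from the three timings above. A minimal Java check of the arithmetic, with a hypothetical class name and rounding to the nearest whole number as an assumption about how the figures were derived:

```java
// Hypothetical helper that reproduces the rounded speed-up factors from the timings.
final class SpeedUp {

    /** Ratio of the baseline time to the measured time, rounded to the nearest integer. */
    static long factor(double baselineSeconds, double measuredSeconds) {
        return Math.round(baselineSeconds / measuredSeconds);
    }
}
```

SpeedUp.factor(46, 1.4) gives 33 and SpeedUp.factor(46, 2.2) gives 21, matching scenarios 2 and 3.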


I have not measured, but adding fragments with a large translation delta, 
let's say with all documents translatable, will double the time for 2. and 3.  
What the code does in this case is merge the English and NL indexes (double 
the work), and then remove English duplicates (which could mean all English 
documents).  So the speed-up for an English-only product is 33 times, but with 
translations into a non-English language the speed-up will be only about 10-15 
times if all or most documents are translated.
The following idea allows NL to be sped up to the same level as English:
- the action that creates the index in the NL directory can mark it as 
"complete" if the number of translated HTML files in the NL directory is 
95%-100% of the number of English files.
- the help system would merge such an index and ignore other indexes, for 
example at the root of the plug-in;  no additional index to merge, no 
duplicates to worry about.
Comment 10 Dejan Glozic CLA 2005-04-21 11:59:02 EDT
These are all great numbers. I suggest we focus on bringing the function to 
the polished/GA level while I investigate detecting the 'complete' status for 
the index. BTW, where would I write it if true?
Comment 11 Konrad Kolosowski CLA 2005-04-21 12:13:46 EDT
The existence of an empty file "indexed_complete" in the index directory would 
suffice.

The complete index idea adds to the complexity of the build index action; the 
impact on the help system is small (if the file exists, do less work).

If we implement this, we have two options:
a) when most documents are translated, mark it as complete, and the remaining 
documents will be indexed at runtime.  The magic number for when to mark the 
index as complete would be about 98% for maximized performance.
b) when creating the nl index, add all NL files from the nl directory, but also 
add English for the missing documents.  No documents would be parsed at runtime 
and the magic number for the tool could be much lower, for example 80%.  The 
only cost would be a little bit of wasted installation size, as two indexes, 
English and nl, would contain some of the same English files.
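
Option b) amounts to choosing, for every document in the TOCs, either the translated file or the English fallback. A rough Java sketch under stated assumptions: the class NlIndexPlan is hypothetical, translated files are assumed to live under nl/<locale>/ with the same relative href, and the 80% figure from the comment is left to the caller.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical planner for option b); not the build index action itself.
final class NlIndexPlan {

    /** Maps each TOC href to the file that would be indexed for the given locale. */
    static Map<String, String> filesToIndex(Collection<String> tocHrefs,
                                            Set<String> translatedHrefs,
                                            String locale) {
        Map<String, String> plan = new LinkedHashMap<>();
        for (String href : tocHrefs) {
            // Use the translation when it exists, otherwise fall back to English.
            String file = translatedHrefs.contains(href) ? "nl/" + locale + "/" + href : href;
            plan.put(href, file);
        }
        return plan;
    }

    /** Fraction of TOC documents that have a translation (0.0 for an empty TOC). */
    static double translatedFraction(Collection<String> tocHrefs, Set<String> translatedHrefs) {
        if (tocHrefs.isEmpty()) {
            return 0.0;
        }
        long translated = tocHrefs.stream().filter(translatedHrefs::contains).count();
        return (double) translated / tocHrefs.size();
    }
}
```

A tool following option b) could build the combined index only when translatedFraction reaches its chosen threshold, and fall back to the plain nl index otherwise.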
Comment 12 Konrad Kolosowski CLA 2005-04-24 18:16:31 EDT
I released final implementation of index merging, good to go for 3.1.
Comment 13 Dejan Glozic CLA 2005-04-24 19:21:45 EDT
Konrad, do you still want me to produce the 'index_complete' file in the index?
Comment 14 Konrad Kolosowski CLA 2005-04-25 11:16:47 EDT
Yes.  It is not critical, but if we use new performance results as our new 
base, doing this will give 2 times speedup for NLed plugins.  I think it is 
worth it.
I released the code that checks for 'index_complete' file and does not merge 
more indexes from this plugin.

I suggest option a) from comment 11 when building the plug-in index.  Compare 
the number of documents in the prebuilt index and in the TOCs of the given 
plug-in.  Write a file if the numbers are close.

Option b) is more complicated and is only guaranteed to give consistent 
results if the fragment specifies an exact match for the plug-in version.
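
Option a) and the matching runtime check can be sketched as follows. This is an illustration, not the released code. The class IndexCompleteMarker and the minRatio parameter are hypothetical, and the marker name 'index_complete' is taken from this comment (comment 11 spelled it "indexed_complete").

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper showing both sides of the marker file idea.
final class IndexCompleteMarker {

    // Name as written in comments 13-16; an assumption about the final spelling.
    static final String MARKER = "index_complete";

    /** Build side: writes an empty marker when the indexed count is close to the TOC count. */
    static boolean markIfComplete(Path indexDir, int indexedDocs, int tocDocs, double minRatio) {
        if (tocDocs <= 0 || (double) indexedDocs / tocDocs < minRatio) {
            return false; // too many documents missing, leave the index unmarked
        }
        try {
            Path marker = indexDir.resolve(MARKER);
            if (Files.notExists(marker)) {
                Files.createFile(marker);
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runtime side: a marked index means no further indexes from this plug-in need merging. */
    static boolean isComplete(Path indexDir) {
        return Files.isRegularFile(indexDir.resolve(MARKER));
    }
}
```

With minRatio set to 0.98, an index holding 97 of 100 TOC documents stays unmarked, while one holding 99 of 100 gets the marker.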
Comment 15 Konrad Kolosowski CLA 2005-05-04 17:17:00 EDT
Dejan,
This bug should be resolved.  The feature is in M7.  If the detail 
with 'index_complete' marker file is not added by M7, it can be tracked in a 
separate report.
Comment 16 Dejan Glozic CLA 2005-05-04 17:49:45 EDT
Fixed.

I opened bug 93729 to track the remaining 'index_complete' item.
Comment 17 Mark Melvin CLA 2005-05-25 10:23:28 EDT
This is very cool - but what would be involved in making this pre-index operation
work with doc.zip archives?  I want to use this as part of my build process, but
I do not ship the raw html source.  I actually create a doc.zip from a bunch of
smaller plugins, and then I would like to pre-index the whole zipfile at the end
- but I can't.  It looks like what I will have to do is unzip it again just to
pre-index it, then delete the html folder.

Since the Help system already knows how to index doc.zip archives, this
capability can't be that hard to add - can it?
Comment 18 Dejan Glozic CLA 2005-05-25 10:28:10 EDT
Eclipse SDK also zips the docs but we run the indexer just before the zip 
operation. It is strange to have the zip created from HTMLs that are not in 
the same plug-in. I can see how that would be a problem to you. Konrad?
Comment 19 Konrad Kolosowski CLA 2005-05-25 11:45:48 EDT
Agree with Dejan, doc.zip is the equivalent of a binary for documentation.
Indexing from doc.zip could be implemented, but I don't think it is 
necessary.  It is much easier to unzip, and later delete the folder.

With doc.zip, we would need to worry about enumerating files inside the zip, 
and picking between documents that exist both in doc.zip and flat.  doc.zip 
should be built after the index.  If it exists already, it might be from the 
previous run of the build, and we should not use it.  No time to look at these 
before 3.1.  We can consider patches post 3.1.
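
The workaround suggested here (unzip, pre-index, delete the folder) could look roughly like the Java sketch below. It uses only java.util.zip and java.nio.file. The class DocZipPreIndex is hypothetical, the indexer callback is a stand-in for whatever actually builds the index (for example the Ant task or HelpIndexBuilder, neither of which is modeled), and workDir is assumed to be a scratch folder that can be deleted entirely afterwards.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical build helper; the indexing step itself is supplied by the caller.
final class DocZipPreIndex {

    /** Unzips docZip into workDir, runs the indexer on it, then deletes workDir. */
    static void withUnzippedDocs(Path docZip, Path workDir, Consumer<Path> indexer) throws IOException {
        Path target = workDir.toAbsolutePath().normalize();
        Files.createDirectories(target);
        try {
            try (ZipInputStream in = new ZipInputStream(Files.newInputStream(docZip))) {
                for (ZipEntry entry; (entry = in.getNextEntry()) != null; ) {
                    Path out = target.resolve(entry.getName()).normalize();
                    if (!out.startsWith(target)) {
                        throw new IOException("Entry escapes target directory: " + entry.getName());
                    }
                    if (entry.isDirectory()) {
                        Files.createDirectories(out);
                    } else {
                        Files.createDirectories(out.getParent());
                        Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
            indexer.accept(target); // the index should be written outside workDir
        } finally {
            deleteTree(target);
        }
    }

    /** Deletes a directory tree, children before parents. */
    static void deleteTree(Path root) throws IOException {
        if (Files.notExists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

A build script would pass the path of the assembled doc.zip, a scratch directory, and a callback that runs the real index build against the unzipped files.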