platform-help-home/proposals/xmlsearch/xmlsearch.htm


Revision 1.1

1 : dejan 1.1 <p></p>
2 :     <h1>XML Content Search in Eclipse Help </h1>
3 :     <p>Konrad Kolosowski and Dejan Glozic,
4 :     08/08/2005</p>
6 :     <h2>Background</h2>
7 :     <p>Although Eclipse help requires HTML as the final presentation format, there
8 :     is no technical reason why documentation cannot be delivered in XML form and
9 :     transformed into HTML on demand. This approach allows conditional
10 :     transformation that can produce different results depending on context
11 :     attributes such as platform or target audience. The visual attributes
12 :     can also be controlled outside the content.</p>
13 :     <p>The purpose of this document is to address the problem of information search
14 :     in the Eclipse help system when XML is used as a content format. Search indexing
15 :     will have to be performed on the original XML documents, while the search hits
16 :     obtained from the subsequent searches will point at the transformed HTML
17 :     documents. This document analyses problems that result from this separation and
18 :     offers possible solutions.</p>
19 :     <p>This document only lists ideas and is by no means an indication of the final
20 :     implementation. For this reason, please do not depend on these ideas or
21 :     otherwise factor them into your product plans.</p>
22 :     <p>In the following text, the acronym 'UA' will be used repeatedly to denote
23 :     'User Assistance'.</p>
24 :     <h2>The problem</h2>
25 :     <p>The Eclipse 3.1 help system uses one Lucene index instance per locale for
26 :     indexing all documentation. Multiple fields are used for the title, the summary,
27 :     and the content text analyzed in two ways (stemmed and exact search). The name field
28 :     relates each document to the document identifiers understood by the help system: URLs.
29 :     More than one index instance exists only when more than one locale is used. In
30 :     an Infocenter installation, multiple indexes are in use simultaneously, but are
31 :     completely independent. The only two exceptions are that multiple locales are
32 :     synchronized to prevent memory spikes, and that the shared configuration is locked
33 :     to prevent index corruption by multiple Infocenters.</p>
34 :     <p>When indexing help content contributed as XML, keeping the 3.1 search features
35 :     would be fairly easy. The only change required would be to delegate to a
36 :     different parser or document producer when encountering an XML document instead
37 :     of an HTML one. An enhancement would be to allow other UA components
38 :     (e.g. intro, cheat sheets) to be indexed, and to allow displayable documents to be
39 :     built from XML, rather than requiring a one-to-one relationship between XML files and topics.</p>
40 :     <p>The following thoughts assume we will allow XML as a format for Eclipse help,
41 :     but will not completely unify the format of content contribution across UA components.
42 :     Content and API will be reusable, thus opening the possibility of search beyond
43 :     classic help system topics, across UA.</p>
44 :     <h3>The Lucene index structure</h3>
45 :     <p>Indexing information from components other than the help system can
46 :     be designed using multiple indexes, one for the information from each component, or
47 :     a single index for all information. Since the information is of a similar type,
48 :     will likely be searched together, and must be presented as a unified result set,
49 :     it makes more sense to index all types of documentation using one index for
50 :     all components.</p>
51 :     <p>Since Lucene will serve multiple components, not just the help system, it needs
52 :     to be separated from the tight control of the help system. The life cycle of
53 :     IndexWriter, IndexReader and IndexSearcher, and the stability of the index, need to be
54 :     the responsibility of the Index component: Lucene plus a manager layer with APIs. Access to
55 :     the index should be through service-like APIs, with participants registered
56 :     through an extension point. This will allow the index to manage concurrency and solicit
57 :     indexing from all components when a search call is initiated, a change in
58 :     contributions occurs, or recovery from corruption is needed.</p>
59 :     <p>Lucene makes no assumption about document format. A search result is a document
60 :     artifact that consists of any number of text fields containing text tokens.
61 :     A basic search query searches for one term within one field. Searching
62 :     across fields is accomplished using complex queries built with knowledge of
63 :     the relevant fields. We have the flexibility of using additional fields for storage or
64 :     meta information, in text format. There is no built-in relationship between
65 :     fields, no structure, and no relationship between documents, which makes indexing
66 :     structured information less than straightforward if preserving structure or
67 :     relationships is important.</p>
68 :     <p>We must keep the number of searchable fields finite and low to preserve good
69 :     performance. There are two natural approaches to indexing an XML document that
70 :     use a single field that will later be searched. One is to extract all text from
71 :     the document and index it in one field (another field is needed to store the document
72 :     identifier). This is a very simple, well-performing solution, but it treats all text in
73 :     the document equally. Another approach is to treat each element's text and each
74 :     element's attribute values as units, and index every one as a dedicated document.
75 :     For this approach to be beneficial, the element's context (its
76 :     containment within the document and other elements) must be stored, or an identifier
77 :     assigned that can be used to look up the corresponding XML document outside of
78 :     Lucene.</p>
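    <p>The two approaches above can be contrasted with a minimal sketch. It uses only the JDK's DOM parser, not Lucene; the class name XmlTextExtractor, the Unit type and the simplified path notation (not a full XPath) are assumptions made for illustration.</p>

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlTextExtractor {
    /** Hypothetical unit for the second approach: a text value plus its element context. */
    static final class Unit {
        final String path;
        final String text;
        Unit(String path, String text) { this.path = path; this.text = text; }
    }

    private static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
    }

    private static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    /** Approach 1: all text of the document, destined for one searchable field. */
    static String allText(String xml) {
        StringBuilder sb = new StringBuilder();
        appendText(root(xml), sb);
        return normalize(sb.toString());
    }

    private static void appendText(Node n, StringBuilder sb) {
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) sb.append(c.getNodeValue()).append(' ');
            else appendText(c, sb);
        }
    }

    /** Approach 2: one unit per attribute value and per element's own text. */
    static List<Unit> units(String xml) {
        List<Unit> out = new ArrayList<>();
        collect(root(xml), "", out);
        return out;
    }

    private static void collect(Element e, String parentPath, List<Unit> out) {
        String path = parentPath + "/" + e.getTagName();
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            out.add(new Unit(path + "/@" + a.getNodeName(), a.getNodeValue()));
        }
        // The element's own text, excluding text that belongs to child elements.
        StringBuilder own = new StringBuilder();
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) own.append(c.getNodeValue()).append(' ');
        }
        String text = normalize(own.toString());
        if (!text.isEmpty()) out.add(new Unit(path, text));
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) collect((Element) c, path, out);
        }
    }
}
```

    <p>In the second approach every Unit would become its own Lucene document, so the stored path (or a similar identifier) is what allows a hit to be related back to its place in the source XML.</p>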
79 :     <h3>Abstracting the indexable document</h3>
80 :     <p>Setting aside the format of the source document for a moment, what makes sense
81 :     in Eclipse UA is to be able to retrieve document fragments that can be presented
82 :     on their own, or that are reusable pieces that can be displayed as part of different
83 :     pages or in different places. The fact that the content originated from an XML file
84 :     should be irrelevant. Therefore XML content needs to be abstracted into a
85 :     fragment that corresponds to a Lucene document that can be indexed. A sample
86 :     abstract IndexableDocument can be designed with the following methods: </p>
87 :     <ul>
88 :     <li>String getContributor() - id of the index contributor, for example org.eclipse.ui.intro</li>
89 :     <li>String getName() - unique document identifier understood by a component, for example a URL,
90 :     an XPath, or anything else needed to retrieve the document for display</li>
91 :     <li>Reader getContent() - main text content for search</li>
92 :     <li>String getTitle() - higher importance text for search</li>
93 :     <li>String getKeywords() - optional keywords or synonyms for search, present or not in the original document</li>
94 :     <li>String getRawTitle() - one-line human readable name/title that can later be retrieved from the index</li>
95 :     <li>String getSummary() - optional, multiple-line human readable description of the content</li>
96 :     <li>String getConstraints() - ids of the constraints, like OS or roles, that this content applies to</li>
97 :     </ul>
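    <p>The method list above can be written down as a Java interface. This is only a sketch of one possible shape: the method names come from the list, while SimpleIndexableDocument and its choice of defaults are hypothetical and exist only to show the interface in use.</p>

```java
import java.io.Reader;
import java.io.StringReader;

interface IndexableDocument {
    String getContributor();  // id of the index contributor
    String getName();         // unique identifier understood by the component
    Reader getContent();      // main text content for search
    String getTitle();        // higher importance text for search
    String getKeywords();     // optional keywords and synonyms
    String getRawTitle();     // one-line human readable title, retrievable from the index
    String getSummary();      // optional multi-line description
    String getConstraints();  // ids of constraints (OS, roles) the content applies to
}

/** Hypothetical in-memory implementation; optional values default to empty strings. */
class SimpleIndexableDocument implements IndexableDocument {
    private final String contributor;
    private final String name;
    private final String title;
    private final String content;

    SimpleIndexableDocument(String contributor, String name, String title, String content) {
        this.contributor = contributor;
        this.name = name;
        this.title = title;
        this.content = content;
    }

    public String getContributor() { return contributor; }
    public String getName() { return name; }
    public Reader getContent() { return new StringReader(content); }
    public String getTitle() { return title; }
    public String getKeywords() { return ""; }
    public String getRawTitle() { return title; }
    public String getSummary() { return ""; }
    public String getConstraints() { return ""; }
}
```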
98 :     <p>More methods can be added if necessary, but allowing a variable number of
99 :     fields based on XML schema would take away the benefits of using a common index
100 :     between components. Fields private to a component and dependent on each XML definition
101 :     would practically equal having separate indexes and would require a different search
102 :     query for each document type, even though all documents would nominally be kept in
103 :     the same Lucene index/file. The benefits and performance of such a design are
104 :     questionable. </p>
105 :     <p>Each index participant will register its own IndexableDocument factory
106 :     (parser, text extractor, digester, whatever the name) with the index manager.
107 :     Upon request, the factory will produce a document that can
108 :     easily be indexed. Assuming we will not be able to unify the contribution format
109 :     across all UA components (the old help, intro, and cheat sheet formats may need to
110 :     be supported), each UA component may have one or more factories of its own, or
111 :     even allow pluggable factories for undefined source document formats. A
112 :     factory may exist for DITA XML, XHTML, HTML, PDF or dynamically generated
113 :     content. From the implementation point of view, factories will mainly be
114 :     parsers. If there is a need, the parsers can be parameterized and work off the
115 :     schema that is also used for displaying documents, but the challenge is to make
116 :     them fast and able to withstand long documents without large memory consumption. </p>
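    <p>Factory registration could look roughly like the sketch below. All names here (Doc, IndexableDocumentFactory, IndexManager, TagStrippingFactory) are hypothetical, factories are keyed by source format purely for illustration, and the regex-based tag stripping stands in for a real parser.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the document abstraction, reduced to two values. */
final class Doc {
    final String name;
    final String content;
    Doc(String name, String content) { this.name = name; this.content = content; }
}

/** Hypothetical factory contract: turn one source document into an indexable one. */
interface IndexableDocumentFactory {
    Doc createDocument(String name, String source);
}

/** Naive example factory: strips markup with a regex instead of parsing properly. */
class TagStrippingFactory implements IndexableDocumentFactory {
    public Doc createDocument(String name, String source) {
        String text = source.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
        return new Doc(name, text);
    }
}

/** Hypothetical manager layer that owns the registry of factories. */
class IndexManager {
    private final Map<String, IndexableDocumentFactory> factories = new LinkedHashMap<>();

    void registerFactory(String format, IndexableDocumentFactory factory) {
        factories.put(format, factory);
    }

    Doc produce(String format, String name, String source) {
        IndexableDocumentFactory factory = factories.get(format);
        if (factory == null) {
            throw new IllegalArgumentException("No factory registered for format: " + format);
        }
        return factory.createDocument(name, source);
    }
}
```

    <p>In Eclipse the registration would more plausibly happen through an extension point than through a direct method call; the map only illustrates the lookup.</p>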
117 :     <h3>Collating and filtering results</h3>
118 :     <p>If kept separate, each UA component needs to keep track of its own contributions,
119 :     be able to answer whether it requires indexing (either addition or deletion) of
120 :     documents, and produce a list of document IDs to add or remove when indexing is
121 :     triggered by any of the components. This will allow for a consistent index and
122 :     consistent searches irrespective of which components the user has activated.</p>
123 :     <p>Merging of search results across components occurs automatically, as all
124 :     documents exist in the same Lucene index. However, search results will need to be
125 :     passed through each indexing participant for filtering and for converting the
126 :     IndexableDocument identifier to a handle to a document that can be presented to
127 :     the user. Each component would be responsible for filtering out search results that
128 :     should be hidden from the user. For example, enabling an activity is a frequent
129 :     operation that should not result in an index change. Filtering based on
130 :     activities should occur post search, on the search results. </p>
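    <p>Post-search filtering by participants might be sketched as follows. Hit, IndexParticipant and ResultCollator are hypothetical names, and using a plain string as the presentable handle is a simplification. The decision to hide hits whose contributor has no registered participant is also an assumption.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical raw hit as it comes out of the shared index. */
final class Hit {
    final String contributor;
    final String name;
    Hit(String contributor, String name) { this.contributor = contributor; this.name = name; }
}

/** Hypothetical participant callback: filter a hit and convert it to a handle. */
interface IndexParticipant {
    /** Returns a presentable handle, or null when the hit must be hidden from the user. */
    String toHandle(Hit hit);
}

class ResultCollator {
    private final Map<String, IndexParticipant> participants = new HashMap<>();

    void register(String contributorId, IndexParticipant participant) {
        participants.put(contributorId, participant);
    }

    /** Passes each hit to the participant that contributed it, keeping index order. */
    List<String> collate(List<Hit> hits) {
        List<String> handles = new ArrayList<>();
        for (Hit hit : hits) {
            IndexParticipant participant = participants.get(hit.contributor);
            if (participant == null) {
                continue; // assumption: hits without a registered participant are hidden
            }
            String handle = participant.toHandle(hit);
            if (handle != null) {
                handles.add(handle);
            }
        }
        return handles;
    }
}
```

    <p>Because the filtering happens on the result list, enabling or disabling an activity changes only what toHandle returns and leaves the index untouched.</p>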
131 :     <p>Some filtering constraints with a well-defined set of values, for example OS,
132 :     WS and ARCH, can be indexed with minimal overhead. If indexing occurred on the
133 :     client machine, documents not satisfying the constraints could be skipped and
134 :     not indexed at all, but taking advantage of prebuilt indexes requires such
135 :     content to be indexed and then filtered on the client. Keeping such content
136 :     identified by indexed constraints will allow automatic filtering by
137 :     complementing the search with an additional boolean query on the constraints field.
138 :     </p>
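    <p>One way to complement the user's search with constraint clauses is to compose a query string in Lucene's classic query parser syntax, where a leading + marks a required clause. The sketch below makes several assumptions not stated in this document: the field is named "constraints", constraint tokens look like os_win32, and unconstrained documents are indexed with marker tokens such as os_any so that a required clause does not exclude them.</p>

```java
class ConstraintQuery {
    /** Builds one required clause that accepts the given value or the "any" marker. */
    private static String clause(String kind, String value) {
        return " +constraints:(" + kind + "_" + value + " OR " + kind + "_any)";
    }

    /**
     * Wraps the user's query as a required clause and appends one required
     * constraint clause each for OS, windowing system and architecture.
     */
    static String withConstraints(String userQuery, String os, String ws, String arch) {
        return "+(" + userQuery + ")"
                + clause("os", os)
                + clause("ws", ws)
                + clause("arch", arch);
    }
}
```

    <p>For example, ConstraintQuery.withConstraints("index search", "win32", "win32", "x86") yields a string that a query parser could turn into a boolean query matching only content applicable to that platform.</p>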
139 :     <p>Filtering based on NL should not occur. Given that UA content is human
140 :     readable, it is almost 100% translatable, and only a very small number of documents is
141 :     expected to be common between languages. For this reason the index should contain
142 :     text for one locale only, as in the 3.1 help system.</p>
143 :     <p>Optionally, we can add an additional filtering field for the UA component's own
144 :     use. The information there would not be further analyzed, but indexed as is.
145 :     When performing a search, each UA component would have a chance to narrow the
146 :     search to its own contributions, and to narrow it further by providing a query to apply to
147 :     the additional filtering field. For example, when invoked from a wizard, help system
148 :     search may provide a query requiring the additional filtering field of the results
149 :     to contain “task”. The information can be partitioned this way into subsets not
150 :     necessarily understood by all parts of UA. </p>
151 :     <p>One of the things expected in future versions of help is dynamically
152 :     revealed links to related documents, shown only when the targets are present, instead of showing
153 :     broken links all the time. Since we are not trying to support searches for references, it
154 :     does not look necessary to index links or their descriptions. Omitting links from
155 :     indexing will ensure that the target document is found when it exists, and not
156 :     the document that refers to it.</p>
157 :     <h3>Handling prebuilt indexes</h3>
158 :     <p>Since the index will be shared among components, producing prebuilt indexes will
159 :     be more challenging. For easiest maintenance and consistency, it is best for the
160 :     Index component to be responsible for generating and merging prebuilt indexes.
161 :     This requires that each component and its document factories be able
162 :     to work with documents in the workspace, or in another installation of Eclipse.</p>
163 :     <p>It is impossible to come up with a perfect ranking algorithm. The more
164 :     structural differences there are between documents, the more complicated the index is and the more
165 :     difficult it is to design an optimal ranking algorithm. Help system search up to 3.1
166 :     suffers from document length having too great an influence on the ranking.
167 :     Shorter documents with N matches rank much higher than long documents with the
168 :     same number N of matches in the text. It may be necessary to implement a custom Scorer
169 :     for UA searches.</p>
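    <p>The length effect can be illustrated with a little arithmetic. Lucene's classic default similarity scales a field's score by 1/sqrt(number of terms in the field), so a 100-term document receives ten times the length factor of a 10,000-term document. The dampened variant below, with its plateau parameter, is a hypothetical alternative shown only to make the trade-off concrete. It is not a proposal for the final scorer.</p>

```java
class LengthNormDemo {
    /** Classic length normalization: 1 / sqrt(number of terms in the field). */
    static double classicNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /**
     * Hypothetical dampened variant: every document shorter than the plateau is
     * treated as if it had exactly 'plateau' terms, so short documents stop
     * gaining an ever larger advantage.
     */
    static double dampenedNorm(int numTerms, int plateau) {
        return 1.0 / Math.sqrt(Math.max(numTerms, plateau));
    }

    /** How many times larger the short document's length factor is. */
    static double classicAdvantage(int shortTerms, int longTerms) {
        return classicNorm(shortTerms) / classicNorm(longTerms);
    }
}
```

    <p>With a plateau of 2,500 terms, the same two documents differ by a factor of two instead of ten, which reduces the influence of document length without removing it.</p>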
169 :     for UA searches.</p>