platform-help-home/proposals/xmlsearch/xmlsearch.htm
| 1 : | dejan | 1.1 | <p></p> |
| 2 : | <h1>XML Content Search in Eclipse Help </h1> | ||
| 3 : | <p>Konrad Kolosowski and Dejan Glozic, | ||
| 4 : | 08/08/2005</p> | ||
| 6 : | <h2>Background</h2> | ||
| 7 : | <p>Although Eclipse help requires HTML as the final presentation format, there | ||
| 8 : | is no technical reason why documentation cannot be delivered in XML form and | ||
| 9 : | transformed into HTML on demand. This approach provides for conditional | ||
| 10 : | transformation that can produce different results depending on the context | ||
| 11 : | attributes such as platform, target audience, context etc. The visual attributes | ||
| 12 : | can also be controlled outside the content.</p> | ||
| 13 : | <p>The purpose of this document is to address the problem of information search | ||
| 14 : | in the Eclipse help system when XML is used as a content format. Search indexing | ||
| 15 : | will have to be performed on the original XML documents, while the search hits | ||
| 16 : | obtained from the subsequent searches will point at the transformed HTML | ||
| 17 : | documents. This document analyses problems that result from this separation and | ||
| 18 : | offers possible solutions.</p> | ||
| 19 : | <p>This document only lists ideas and is by no means an indication of the final | ||
| 20 : | implementation. For this reason, please do not depend on these ideas or | ||
| 21 : | otherwise factor them into your product plans.</p> | ||
| 22 : | <p>In the following text, the acronym 'UA' will be used repeatedly to denote | ||
| 23 : | 'User Assistance'.</p> | ||
| 24 : | <h2>The problem</h2> | ||
| 25 : | <p>The Eclipse 3.1 help system uses one Lucene index instance per locale for | ||
| 26 : | indexing all documentation. Multiple fields are used for title, summary, and content | ||
| 27 : | text analyzed in two ways (stemmed and exact search). The name field is used to | ||
| 28 : | relate a document to the document identifiers understood by the help system: URLs. | ||
| 29 : | More than one index instance exists only when more than one locale is used. In | ||
| 30 : | the Infocenter installation, multiple indexes are in use simultaneously, but are | ||
| 31 : | completely independent. The only two exceptions are that multiple locales are | ||
| 32 : | synchronized to prevent memory spikes, and that shared configuration is locked to | ||
| 33 : | prevent index corruption by multiple Infocenters.</p> | ||
| 34 : | <p>When indexing help content contributed as XML, keeping the 3.1 search features | ||
| 35 : | would be fairly easy. The only change required would be to delegate to a | ||
| 36 : | different parser or document producer when encountering an XML document instead | ||
| 37 : | of an HTML one. An enhancement would be to allow other UA components | ||
| 38 : | (e.g. intro, cheat sheets) to be indexed, and to allow displayable documents to be built from | ||
| 39 : | XML, rather than requiring a one-to-one relationship between XML files and topics.</p> | ||
| 40 : | <p>The following thoughts assume we will allow XML as a format for Eclipse help, | ||
| 41 : | but will not completely unify the format of content contribution across UA components. | ||
| 42 : | Content and API will be reusable, thus opening the possibility of searching beyond | ||
| 43 : | classic help system topics, across UA.</p> | ||
| 44 : | <h3>The Lucene index structure</h3> | ||
| 45 : | <p>Indexing information from components other than the help system can | ||
| 46 : | be designed using multiple indexes, one for the information from each component, or | ||
| 47 : | a single index for all information. Since the information is of a similar type and | ||
| 48 : | will likely be searched together, the results must be presented as a unified set, | ||
| 49 : | and it makes more sense to index all types of documentation using one index for | ||
| 50 : | all components.</p> | ||
| 51 : | <p>Since Lucene will serve multiple components, not just the help system, it needs | ||
| 52 : | to be separated from the tight control of the help system. The life cycle of | ||
| 53 : | IndexWriter, IndexReader, and IndexSearcher and the stability of the index need to be | ||
| 54 : | part of the Index component: Lucene plus a manager layer with APIs. Access to | ||
| 55 : | the index should be through service-like APIs, with participants registered | ||
| 56 : | through an extension point. This will allow the index to manage concurrency and solicit | ||
| 57 : | indexing from all components when a search call is initiated, a change in | ||
| 58 : | contributions occurs, or recovery from corruption is needed. </p> | ||
| 59 : | <p>Lucene makes no assumptions about document format. A search result is a document | ||
| 60 : | artifact that consists of any number of text fields containing text tokens. | ||
| 61 : | A basic search query searches for one term within one field. Searching | ||
| 62 : | across fields is accomplished using complex queries built with knowledge of | ||
| 63 : | the relevant fields. We have the flexibility of using additional fields for storing | ||
| 64 : | data or meta information, in text format. There is no built-in relationship between | ||
| 65 : | fields, no structure, and no relationship between documents, which makes indexing | ||
| 66 : | structured information less than straightforward if preserving structure or | ||
| 67 : | relationships is important. </p> | ||
| 68 : | <p>We must keep the number of searchable fields finite and low, to preserve good | ||
| 69 : | performance. There are two natural approaches for indexing an XML document that | ||
| 70 : | use a single field that will later be searched. One is to extract all text from | ||
| 71 : | the document and index it in one field (another field is needed to store the document | ||
| 72 : | identifier). This is a very simple, well-performing solution, but it treats all text from | ||
| 73 : | the document equally. Another approach is to treat each element's text and each | ||
| 74 : | element's attribute values as units, and index every one as a dedicated document. | ||
| 75 : | For this approach to be beneficial, it requires the element context (its | ||
| 76 : | containment within the document and other elements) to be stored, or an identifier | ||
| 77 : | to be assigned that can be used to look up the corresponding XML document outside of | ||
| 78 : | Lucene. </p> | ||
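<p>The first approach (all text in one field) can be sketched with the JDK's streaming SAX parser, which avoids building the whole document in memory. This is a minimal illustration only: the class name XmlTextExtractorSketch is hypothetical, attribute values are ignored, and the resulting string is what would be handed to the single searchable field.</p>

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of Eclipse help or Lucene.
final class XmlTextExtractorSketch {

    // Streams the XML with SAX and returns all element text as one
    // whitespace-normalized string (attribute values are not collected).
    static String extractText(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes attributes) {
                text.append(' '); // element boundaries separate words
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                text.append(' ');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per text node; just accumulate.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

<p>Because every element boundary is treated as a word break, inline markup in the middle of a word would split that word; a real extractor would need to know which elements are inline.</p>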
| 79 : | <h3>Abstracting the indexable document</h3> | ||
| 80 : | <p>Forgetting for a while about the format of the source document, what makes sense | ||
| 81 : | in Eclipse UA is to be able to retrieve document fragments that can be presented | ||
| 82 : | on their own or are reusable pieces that can be displayed as part of different | ||
| 83 : | pages or in different places. The fact that the content originated from an XML file | ||
| 84 : | should be irrelevant. Therefore XML content needs to be abstracted into a | ||
| 85 : | fragment that corresponds to a Lucene document that can be indexed. A sample | ||
| 86 : | abstract IndexableDocument can be designed with the following methods: </p> | ||
| 87 : | <p>String getContributor() - the id of the index contributor, for example | ||
| 88 : | org.eclipse.ui.intro<br> | ||
| 89 : | String getName() - a unique document identifier understood by the component, for example a URL, an XPath, or | ||
| 90 : | anything else needed to retrieve the document for display<br> | ||
| 91 : | Reader getContent() - the main text content for search<br> | ||
| 92 : | String getTitle() - higher-importance text for search<br> | ||
| 93 : | String getKeywords() - optional keywords and synonyms for search, whether or not present in the original document<br> | ||
| 94 : | String getRawTitle() - a one-line human-readable name/title that can later be retrieved from the index<br> | ||
| 95 : | String getSummary() - an optional, multi-line human-readable description of the content<br> | ||
| 96 : | String getConstraints() - the ids of the constraints, such as OS or roles, that this content | ||
| 97 : | applies to </p> | ||
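<p>Written out as Java, the methods listed above could look like the interface below. The method names and return types come from the list; the implementing class SimpleIndexableDocument and its constructor are assumptions added only to make the sketch usable, and are not a proposed API.</p>

```java
import java.io.Reader;
import java.io.StringReader;

// The abstraction described in the text; one instance maps to one Lucene document.
interface IndexableDocument {
    String getContributor();  // id of the index contributor
    String getName();         // unique identifier understood by the contributor
    Reader getContent();      // main text content for search
    String getTitle();        // higher-importance text for search
    String getKeywords();     // optional keywords and synonyms
    String getRawTitle();     // one-line title retrievable from the index
    String getSummary();      // optional multi-line description
    String getConstraints();  // ids of constraints such as OS or roles
}

// Hypothetical minimal implementation backed by plain strings.
final class SimpleIndexableDocument implements IndexableDocument {
    private final String contributor;
    private final String name;
    private final String title;
    private final String content;

    SimpleIndexableDocument(String contributor, String name, String title, String content) {
        this.contributor = contributor;
        this.name = name;
        this.title = title;
        this.content = content;
    }

    public String getContributor() { return contributor; }
    public String getName() { return name; }
    public Reader getContent() { return new StringReader(content); }
    public String getTitle() { return title; }
    public String getKeywords() { return ""; }     // none in this sketch
    public String getRawTitle() { return title; }  // same text, kept for display
    public String getSummary() { return ""; }      // none in this sketch
    public String getConstraints() { return ""; }  // applies everywhere
}
```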
| 98 : | <p>More methods can be added if necessary, but allowing a variable number of | ||
| 99 : | fields based on the XML schema would take away the benefits of using a common index | ||
| 100 : | between components. Fields private to a component and dependent on each XML definition | ||
| 101 : | would practically equal having separate indexes and would require a different search | ||
| 102 : | query for each document type, even though theoretically all documents would be kept in | ||
| 103 : | the same Lucene index/file. The benefits and performance of such a design are | ||
| 104 : | questionable. </p> | ||
| 105 : | <p>Each index participant will register its own IndexableDocument factory | ||
| 106 : | (parser, text extractor, digester, whatever the name) with the index manager. | ||
| 107 : | Upon a request for a document, the factory will produce a document that can | ||
| 108 : | easily be indexed. Assuming we will not be able to unify the contribution format | ||
| 109 : | across all UA components (the old help, intro, and cheat sheet formats may need to | ||
| 110 : | be supported), each UA component may have one or more factories of its own, or | ||
| 111 : | even allow for pluggable factories for undefined source document formats. A | ||
| 112 : | factory may exist for DITA XML, XHTML, HTML, PDF or dynamically generated | ||
| 113 : | content. From the implementation point of view, factories will mainly be | ||
| 114 : | parsers. If there is a need, the parsers can be parameterized and work off the | ||
| 115 : | schema that is also used for displaying documents, but the challenge is to make | ||
| 116 : | them fast and able to withstand long documents without large memory consumption. </p> | ||
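<p>The registration idea can be sketched as a small registry that keeps several factories per contributor and asks the first one that accepts a document to produce its indexable text. All names here (DocumentFactorySketch, ExtensionFactorySketch, IndexManagerSketch) are hypothetical, and choosing a factory by file extension is only an assumption for the example; in Eclipse the registration would go through an extension point rather than a direct method call.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical factory contract: decides whether it understands a document
// and, if so, produces the text to index.
interface DocumentFactorySketch {
    boolean canHandle(String documentName);
    String produceText(String documentName);
}

// Toy factory that selects documents by file extension and returns a
// placeholder text instead of really parsing anything.
final class ExtensionFactorySketch implements DocumentFactorySketch {
    private final String extension;
    private final String label;

    ExtensionFactorySketch(String extension, String label) {
        this.extension = extension;
        this.label = label;
    }

    public boolean canHandle(String documentName) {
        return documentName.endsWith(extension);
    }

    public String produceText(String documentName) {
        return label + ":" + documentName;
    }
}

// Toy index manager: each contributor may register any number of factories.
final class IndexManagerSketch {
    private final Map<String, List<DocumentFactorySketch>> factories = new LinkedHashMap<>();

    void register(String contributorId, DocumentFactorySketch factory) {
        factories.computeIfAbsent(contributorId, id -> new ArrayList<>()).add(factory);
    }

    // Returns the text from the first registered factory that accepts the
    // document, or empty if the contributor has no suitable factory.
    Optional<String> produce(String contributorId, String documentName) {
        List<DocumentFactorySketch> candidates =
                factories.getOrDefault(contributorId, Collections.emptyList());
        for (DocumentFactorySketch factory : candidates) {
            if (factory.canHandle(documentName)) {
                return Optional.of(factory.produceText(documentName));
            }
        }
        return Optional.empty();
    }
}
```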
| 117 : | <h3>Collating and filtering results</h3> | ||
| 118 : | <p>If separate, each UA component needs to keep track of its own contributions, | ||
| 119 : | be able to answer whether it requires indexing (either addition or deletion) of | ||
| 120 : | documents, and produce a list of document IDs to add/remove upon indexing | ||
| 121 : | triggered by any of the components. This will allow for a consistent index and | ||
| 122 : | consistent searches irrespective of component activation by the user.</p> | ||
| 123 : | <p>Merging of search results from across components occurs automatically, as all | ||
| 124 : | documents exist in the same Lucene index. However, search results will need to be | ||
| 125 : | passed through each indexing participant for filtering and for converting the | ||
| 126 : | IndexableDocument identifier to a handle to a document that can be presented to | ||
| 127 : | the user. Each component would be responsible for filtering search results that | ||
| 128 : | should be hidden from the user. For example, enabling an activity is a frequent | ||
| 129 : | operation that should not result in an index change. Filtering based on | ||
| 130 : | activities should occur post-search, on the search results. </p> | ||
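<p>A rough sketch of this post-search pass follows. Each raw hit carries the contributor id and document name stored in the index; the owning participant decides whether the hit is visible and turns the name into a presentable handle. The types HitSketch, SearchParticipantSketch, ActivityFilteringParticipantSketch and ResultCollatorSketch are hypothetical, and a fixed set of hidden names stands in for real activity filtering.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A raw hit as it might come back from the shared index.
final class HitSketch {
    final String contributor;
    final String name;

    HitSketch(String contributor, String name) {
        this.contributor = contributor;
        this.name = name;
    }
}

// Hypothetical participant contract for the post-search pass.
interface SearchParticipantSketch {
    boolean isVisible(String documentName);  // e.g. activity-based filtering
    String toHandle(String documentName);    // identifier to presentable handle
}

// Toy participant: hides a fixed set of names and prefixes the rest.
final class ActivityFilteringParticipantSketch implements SearchParticipantSketch {
    private final String prefix;
    private final Set<String> hidden;

    ActivityFilteringParticipantSketch(String prefix, Set<String> hidden) {
        this.prefix = prefix;
        this.hidden = hidden;
    }

    public boolean isVisible(String documentName) {
        return !hidden.contains(documentName);
    }

    public String toHandle(String documentName) {
        return prefix + documentName;
    }
}

final class ResultCollatorSketch {
    // Keeps index order; drops hits whose participant is unknown or hides them.
    static List<String> collate(List<HitSketch> hits,
            Map<String, SearchParticipantSketch> participants) {
        List<String> handles = new ArrayList<>();
        for (HitSketch hit : hits) {
            SearchParticipantSketch participant = participants.get(hit.contributor);
            if (participant != null && participant.isVisible(hit.name)) {
                handles.add(participant.toHandle(hit.name));
            }
        }
        return handles;
    }
}
```

<p>Since the index itself is untouched, enabling or disabling an activity only changes what the participant reports as visible on the next search.</p>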
| 131 : | <p>Some filtering constraints with a well-defined set of values, for example OS, | ||
| 132 : | WS, ARCH, can be indexed with minimal overhead. If indexing occurred on the | ||
| 133 : | client machine, documents not satisfying the constraints could be skipped and | ||
| 134 : | not indexed at all, but taking advantage of prebuilt indexes requires such | ||
| 135 : | content to be indexed and filtered on the client. Keeping such content | ||
| 136 : | identified by indexed constraints will allow automatic filtering by | ||
| 137 : | complementing the search with an additional boolean query on the constraints field. | ||
| 138 : | </p> | ||
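<p>The effect of adding a boolean clause on the constraints field can be imitated with a toy in-memory index. This does not use the Lucene API; the classes ConstrainedDocSketch and ConstraintQuerySketch, the "os=linux" token format, and the rule that a document without constraints applies everywhere are all assumptions made for the example.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy stand-in for an indexed document with a content field and a
// constraints field.
final class ConstrainedDocSketch {
    final String name;
    final Set<String> contentTokens;
    final Set<String> constraints;

    ConstrainedDocSketch(String name, String content, String... constraints) {
        this.name = name;
        this.contentTokens = new HashSet<>(Arrays.asList(content.toLowerCase().split("\\s+")));
        this.constraints = new HashSet<>(Arrays.asList(constraints));
    }
}

final class ConstraintQuerySketch {
    // Equivalent of: content:term AND (no constraints OR constraints:clientConstraint)
    static List<String> search(List<ConstrainedDocSketch> index, String term,
            String clientConstraint) {
        List<String> names = new ArrayList<>();
        for (ConstrainedDocSketch doc : index) {
            boolean termMatches = doc.contentTokens.contains(term.toLowerCase());
            boolean constraintMatches = doc.constraints.isEmpty()
                    || doc.constraints.contains(clientConstraint);
            if (termMatches && constraintMatches) {
                names.add(doc.name);
            }
        }
        return names;
    }
}
```

<p>With a prebuilt index containing both a Linux-only and a Windows-only variant of a topic, a search from a Linux client would then return only the Linux variant plus any unconstrained documents.</p>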
| 139 : | <p>Filtering based on NL should not occur. Given that UA content is human | ||
| 140 : | readable, it is almost 100% translatable, and a very small number of documents is | ||
| 141 : | expected to be common between languages. For this reason the index should contain | ||
| 142 : | text for one locale only, as in the 3.1 help system.</p> | ||
| 143 : | <p>Optionally, we can add an additional filtering field for the UA component's own | ||
| 144 : | use. The information there would not be further analyzed, but indexed as is. | ||
| 145 : | When performing a search, each UA component would have a chance of narrowing the | ||
| 146 : | search to its own contributions, and further by providing a query to apply to the | ||
| 147 : | additional filtering field. For example, when invoked from a wizard, the help system | ||
| 148 : | search may provide a query requiring the additional filtering field of the results | ||
| 149 : | to contain “task”. The information can be partitioned this way into subsets not | ||
| 150 : | necessarily understood by all parts of UA. </p> | ||
| 151 : | <p>One of the things expected in future versions of help is dynamically | ||
| 152 : | revealed links to related documents, shown when those documents are present, without showing | ||
| 153 : | broken links all the time. Since we are not trying to support searches for references, it | ||
| 154 : | does not seem necessary to index links or their descriptions. Omitting links from | ||
| 155 : | indexing will ensure that the target document is found when it exists, and not | ||
| 156 : | the document that refers to it.</p> | ||
| 157 : | <h3>Handling prebuilt indexes</h3> | ||
| 158 : | <p>Since the index will be shared among components, producing prebuilt indexes will | ||
| 159 : | be more challenging. For easiest maintenance and consistency, it is best for the | ||
| 160 : | Index component to be responsible for generating and merging prebuilt indexes. | ||
| 161 : | This will require that each component and its document factories be able | ||
| 162 : | to work with documents in the workspace, or in another installation of Eclipse.</p> | ||
| 163 : | <p>It is impossible to come up with a perfect ranking algorithm. The more | ||
| 164 : | structural differences there are between documents, the more complicated the index is and the more | ||
| 165 : | difficult it is to design an optimal ranking algorithm. Help system search up to 3.1 | ||
| 166 : | suffers from document length having too great an influence on the ranking. | ||
| 167 : | Shorter documents with N matches rank much higher than long documents with the | ||
| 168 : | same number N of matches in the text. It may be necessary to implement a custom Scorer | ||
| 169 : | for UA searches.</p> |
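<p>The length effect can be illustrated numerically. The sketch below is loosely modeled on the classic Lucene default similarity, in which term frequency contributes sqrt(freq) and field length contributes 1/sqrt(number of terms); inverse document frequency, boosts and the lossy encoding of norms are left out. The damped variant, which divides by a logarithm of the length, is purely an assumption meant to show what a custom scorer might change, not a recommendation.</p>

```java
// Hypothetical scoring functions for comparing length normalization schemes.
final class LengthNormSketch {

    // sqrt(matches) * 1/sqrt(docLength): length has a strong influence.
    static double classicScore(int matches, int docLength) {
        return Math.sqrt(matches) / Math.sqrt(docLength);
    }

    // sqrt(matches) / ln(e + docLength): length still counts, but far less.
    static double dampedScore(int matches, int docLength) {
        return Math.sqrt(matches) / Math.log(Math.E + docLength);
    }

    // How many times higher the short document scores than the long one.
    static double classicRatio(int matches, int shortLength, int longLength) {
        return classicScore(matches, shortLength) / classicScore(matches, longLength);
    }

    static double dampedRatio(int matches, int shortLength, int longLength) {
        return dampedScore(matches, shortLength) / dampedScore(matches, longLength);
    }
}
```

<p>For 4 matches in a 100-term document and in a 10,000-term document, classicScore gives 0.2 and 0.02, so the short document scores ten times higher; with the damped variant the gap is much smaller.</p>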
