Bug 135107 - [Help] Allow users to use .html/.htm file extension for XHTML content
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: User Assistance
Version: 3.2
Hardware: All All
Importance: P3 normal
Target Milestone: 3.2 RC2
Assignee: Curtis d'Entremont CLA
QA Contact:
URL:
Whiteboard:
Keywords: greatbug
Depends on:
Blocks:
 
Reported: 2006-04-05 14:53 EDT by Curtis d'Entremont CLA
Modified: 2006-04-20 14:09 EDT

Description Curtis d'Entremont CLA 2006-04-05 14:53:04 EDT
We are finding that many doc writers are using the .html/.htm file extension for their XHTML content because there is a lot of work involved in updating the links, toc, context help, cheat sheet links, etc. At the moment, help requires the correct file extension in order for content producers and searching to work.

The fact that they cannot do this is affecting the adoption of the new dynamic content feature negatively (it is only available for XHTML). We have an HTML -> XHTML conversion tool but it does not convert everything, like links.

We seem to have the following options to address this:

1. Improve the conversion tool to handle all these updates (potentially a lot of work for PDE)

2. Tweak the HTML handlers to peek into the DOCTYPE and handle this case, at the cost of some performance. i.e. allow XHTML with .html/.htm file extension (a bit slower, but good for adoption)

The original assumption that the common case is to use the correct extension is turning out to not be true.
Comment 1 Curtis d'Entremont CLA 2006-04-05 14:56:28 EDT
John, could we get some input from the PMC on this one? We'd like to see if something can be done in 3.2 to alleviate this. Dejan prefers the second option.
Comment 2 John Arthorne CLA 2006-04-05 18:05:26 EDT
I'm not PMC -- CCing some PMC instances for input.
Comment 3 Dejan Glozic CLA 2006-04-05 19:56:15 EDT
There are two places where we would need to add content detection:

1) Dynamic content producer
2) XHTML Lucene search participant

For 1), the implementation is straightforward because the producer is given a first crack at serving the resource. We are now simply checking the resource extension, but for htm and html we can also quickly peek into the file and check if DOCTYPE is for XHTML.

For 2), Lucene participants currently register extensions for which they should be used. We would need to add an optional alternative mechanism to determine the content type. Search participants are not large classes themselves but we would like to avoid loading them just to determine if the content type applies to them (hence the 'extensions' attribute :-). We are open to suggestions as to how to add the dynamic content detection capability without loading the world. John, are you aware of similar efforts elsewhere? We would like to be consistent if possible.

One possible and fairly generic option is to add an additional attribute with a regular expression that the file content should match in order to qualify. We would not want to apply the expression against the whole file, just the first N bytes (so that we could, for example, look inside an XML file with *.xml extension and determine if we should care about it by checking the root element name). In our particular case, we would write an expression that matches the following:

<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 

(the rest is not important).
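
[Editor's sketch] The "match a regular expression against the first N bytes" idea could look roughly like the following. This is a minimal illustration in plain Java, not the Help implementation: the class name, the method names, and the 1024-byte peek size are all assumptions, and only the DOCTYPE prefix comes from the comment above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper; none of these names exist in org.eclipse.help.
final class DoctypeSniffer {
    // Assumed peek size; the comment only says "the first N bytes".
    static final int PEEK_BYTES = 1024;

    // Matches the DOCTYPE prefix quoted above, tolerating any whitespace between tokens.
    private static final Pattern XHTML_DOCTYPE =
            Pattern.compile("<!DOCTYPE\\s+html\\s+PUBLIC\\s+\"-//W3C//DTD XHTML");

    /** Returns true if the given head-of-file text contains an XHTML DOCTYPE. */
    static boolean matchesXhtmlDoctype(String head) {
        return XHTML_DOCTYPE.matcher(head).find();
    }

    /** Reads at most PEEK_BYTES from the stream and checks only that prefix. */
    static boolean looksLikeXhtml(InputStream in) throws IOException {
        byte[] buf = new byte[PEEK_BYTES];
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break; // end of stream before the buffer filled up
            }
            total += n;
        }
        // ISO-8859-1 maps bytes 1:1 to chars, which is enough for an ASCII pattern.
        String head = new String(buf, 0, total, StandardCharsets.ISO_8859_1);
        return matchesXhtmlDoctype(head);
    }
}
```

Capping the read keeps the check cheap for large topics, at the price of missing a DOCTYPE that appears after the first PEEK_BYTES bytes.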
Comment 4 Mike Wilson CLA 2006-04-06 10:04:44 EDT
I'm not sure where you're heading with this. The pattern "first look at the extension, then allow for code to distinguish based on the first chunk of the file" is exactly what is implemented by the existing content type support. Is there no way to make use of that?
Comment 5 Dejan Glozic CLA 2006-04-06 10:39:25 EDT
That's why we asked for feedback (see my question to John about similar efforts elsewhere :-). Apparently you are saying that the content support is what we need. We will study it and see how to take advantage of it.
Comment 6 Dejan Glozic CLA 2006-04-06 10:42:58 EDT
Curtis, can you please look up the runtime support for content types and recommend the way to use it in both the cases described above.
Comment 7 Curtis d'Entremont CLA 2006-04-06 12:51:32 EDT
Content types seem like the way to go. Each content type has an ID, which we would specify for the content producer and search participant. For the content type definition of XHTML, we list out .htm, .html, and .xhtml as the file extensions, and we plug in a content describer that will read the doctype. The content producer would simply ask the platform what the content type of the input stream is rather than looking at the extension, and for the search participant we deprecate the attribute where you specify file extensions, and introduce a new one to specify the content type id.

Is the PMC ok with the addition of a new API attribute to the lucene search participants extension point? This attribute, contentType, is to specify the id of the content type. For example, here's the current XHTML search participant:

   <extension
         point="org.eclipse.help.base.luceneSearchParticipants">
      <searchParticipant
            extensions="xhtml"
            id="org.eclipse.help.base.xhtml"
            participant="org.eclipse.help.internal.search.XHTMLSearchParticipant"
            headless="true"/>
   </extension>

And with the new attribute it would become something like:

   <extension
         point="org.eclipse.help.base.luceneSearchParticipants">
      <searchParticipant
            contentType="xhtml"
            id="org.eclipse.help.base.xhtml"
            participant="org.eclipse.help.internal.search.XHTMLSearchParticipant"
            headless="true"/>
   </extension>

where "xhtml" here is the id for the XHTML content type.
Comment 8 Mike Wilson CLA 2006-04-06 13:20:06 EDT
I believe the org.eclipse.core.runtime.contentTypes extension uses "content-type", rather than "contentType". Is there some reason why this should be different?
Comment 9 Curtis d'Entremont CLA 2006-04-06 13:53:43 EDT
Confirmed, let's be consistent and go with content-type. Thanks for the suggestion.
Comment 10 Mike Wilson CLA 2006-04-06 14:25:24 EDT
This seems like a reasonable approach to me then. However, I don't have a feel for how much impact this has on existing consumers. Who makes use of the lucene search participants extension point? Is this new in 3.2? (apologies for being so uninformed).
Comment 11 Curtis d'Entremont CLA 2006-04-06 15:15:36 EDT
In order to search documents like XHTML, welcome content, and cheat sheets, lucene must be able to parse these documents to find all the words to index, etc. This is what search participant support was added for (you can plug in your code that reads your document type and extracts the words to index). It was introduced in 3.2.

The help implementation provides participants for the formats mentioned above, so the only external consumers of this extension point are people who want to plug in their own formats that need to be searchable.

Existing consumers who are just using HTML will not be affected by this change.
Comment 12 Mike Wilson CLA 2006-04-07 13:24:46 EDT
This seems fine to me then. +1.
Comment 13 Curtis d'Entremont CLA 2006-04-07 17:14:25 EDT
Done.
Comment 14 Amy Wu CLA 2006-04-17 16:26:20 EDT
Isn't introducing a new HTML and XHTML Content type somewhat of an API change?

This introduction has hit WTP (Web Tools Platform) since it was already declaring its own HTML content type.  Now there is a conflict between the two and it looks like org.eclipse.help's html content type wins, making WTP's HTML editor rather difficult to use because it needs its html content type.
Comment 15 Curtis d'Entremont CLA 2006-04-17 16:49:52 EDT
Seems like this would be a problem in general when two components/products need the same content type and try to coexist. Jeff, is there a best practice for handling such conflicts? It seems in this case we could move it next to the XML content type in core (maybe not for 3.2 but eventually), but in general we can't push every content type into core.
Comment 16 Gunnar Wagenknecht CLA 2006-04-17 16:54:08 EDT
(In reply to comment #15)
> but in general we can't push every content type into core.

Well not in core but there could be at least one common content types bundle. I'm pretty sure that there are a lot of adopters out there that would prefer relying on such a shared bundle instead of defining their own definitions again and again.

Comment 17 David Williams CLA 2006-04-17 17:00:32 EDT
I'd like to re-open to request reverting this and suggest a radical re-thinking. 

True, the content type is a conflict that could maybe be overcome.

But, it seems to me the original "problem" this bug set out to fix is out of scope for the Eclipse Project. 

The authoring, searching, producing of help documents, since they are HTML or XHTML could belong in the WTP project. We of course are glad to consider contributions :) 

I suggest the platform help system be defined to mean displaying and associating help, not producing or authoring it. 

Radical, eh? Please consider. 
Comment 18 Curtis d'Entremont CLA 2006-04-17 17:17:39 EDT
David, I'm not sure I follow - content types are not just for editors. What we were trying to do here is to determine whether a document is html or xhtml and then follow the appropriate procedure for *displaying* that document (i.e. do we need to run it through the dynamic content logic, etc). Unless I'm mistaken, this is what content types are all about.
Comment 19 David Williams CLA 2006-04-17 17:23:33 EDT
Oh yeah, I'm all for content types ... I just thought this bug was originally related to something going on during the authoring of HTML and XHTML documents. 

So, you're saying that's not the case, it's actually something going on after the documents are authored, and during their display. Sorry, guess I misread. 

Comment 20 Dejan Glozic CLA 2006-04-17 17:50:12 EDT
Are there any conflict resolution mechanisms in core for handling this type of problem? It seems as if both Help and WTP use content type mechanism to solve legitimate problems and exactly as the spec intended. Help would be happy to reduce scope of its content type definition to avoid conflicts. Is there any mechanism for this?
Comment 21 Dejan Glozic CLA 2006-04-17 18:10:40 EDT
Ok, I did a bit of reading (which is a thing I do sometimes :-) and this is what I found about contributing content types:

http://help.eclipse.org/help31/topic/org.eclipse.platform.doc.isv/guide/runtime_content_contributing.htm

Apparently there is something about defining a 'base' content type and then creating other types that extend it. We can create a simple content type for 'htm' and 'html' files and then let both WTP and Help extend and tweak these for their own purposes.

One thing I didn't find, and that made me curious, is the lack of support for something that platform UI has for key bindings (the notion of 'contexts' or similar). Since key bindings can clash depending on the context in which they are used, it is possible to define key bindings for dialogs only, or for a particular editor only etc.

In the case of content types, Help declared a content type to help it detect if a file is HTML or XHTML based on the DTD specification. We had no intention of extending this content type to editors, but it got picked up nevertheless. We would gladly restrict our content type to programmatic use only if we could.
Comment 22 David Williams CLA 2006-04-17 18:18:34 EDT
I'd love to provide a good community case of using conflict resolution rules, but will have to refresh my memory if they apply here. (The twists involve per project resolution, priority attribute, and aliases). 

But the timing of "doing new work" isn't great .. and it really would be a LOT of new work ... we are supposed to be shutting down, you know :) (and, as Amy said, a new content type is a new API). 

So, I want to persist, Here's the part that made me think the original, fundamental problem is "out of scope" for the Eclipse Project, per se, from comment #0. 

= = = = 
At the moment, help requires
the correct file extension in order for content producers and searching to
work.

The fact that they cannot do this is affecting the adoption of the new dynamic
content feature negatively (it is only available for XHTML). We have an HTML ->
XHTML conversion tool but it does not convert everything, like links.

= = = = 


The "convert HTML to XHTML" would appear to be an obvious web tools function. 
And, still seems to me the original problem to be solved wasn't with help per se, but was to help authors convert their help. 

So, still seems to me could be reverted, and the "new work" deferred till next release. 

And, I'm not really talking about "ownership" or "management" ... just the correct architecture. As Eclipse gets larger and larger, many things that would have seemed "obvious" in the past, should be re-thought. Such as "Help" ... yes, help is in scope of platform, and, in past, anything related to Help would therefore be "in scope" ... but, as we are all overworked, to "save time and money", if there is some other Eclipse Foundation project that provides some or all the required functionality, it should be leveraged as the right thing to do. 

Case in point ... I haven't looked at how your HTML Content Type Describer figures out charset ... but, I know ours is not simple, and has a huge population of testers and advisors. Wouldn't you want to pick that sort of stuff up? Also, even if the content type is "resolved" by these advanced mechanisms, I'd be pleasantly surprised if the "encoding rules" did not end up different (and, hence, a huge problem for users "copying" HTML from one project to another). 

Seems to me, by the time help is in "production" its location should be well understood? Why would content type be needed then, exactly? I still get the sense you are solving a problem that could more easily be solved in another project ... and let's leave the advanced, complicated stuff till later? 



Comment 23 David Williams CLA 2006-04-17 18:21:37 EDT
BTW, the definitive "design doc" for content types is 
http://dev.eclipse.org/viewcvs/index.cgi/%7Echeckout%7E/platform-core-home/documents/content_types.html

If that helps .. either provides a quick and easy solution, or helps to 
convince you this is not a quick and easy problem :) 

Comment 24 Amy Wu CLA 2006-04-17 18:25:04 EDT
(Note: while writing this comment #21-comment #23 got added, so this was submitted without reading those comments)

There is a priority attribute that is supposed to help in conflicts, but I'm not sure if that will help this situation if there really are help contributors that work in WTP.

You can extend content types or provide aliases for them according to the org.eclipse.core.runtime.contentTypes extension point documentation.  This may work.  Especially if the html content type is declared in a core plugin like stated in comment #16.  I think there is a level of maintenance involved though, because if you are defining a content type, you should support it (like figuring out the encoding)

I think the easiest (but not ideal) solution may be for the help plugin to at least set the priority of their html and xhtml content type to be "low" so that others that build on top of eclipse may override it.  This will still lock help's definition of the html and xhtml content type into API though, won't it?  So that if there's a better solution moved to core, help will still need to keep their definition of html and xhtml content type around?
Comment 25 Dejan Glozic CLA 2006-04-17 18:46:36 EDT
OK guys, lets regroup. Help has the following problem:

1) We need to tell if a topic with an *.htm or *.html extension is a valid XHTML file
2) We need to somehow associate the description of this content type with our Lucene search participant
3) We need to be able to make this call both programmatically and declaratively

For a while, it seemed that content type support solved all our problems. Now it does look as perfect. We need to find a way out short of 'don't use content types'. There must be something we are missing.

And yes, we are aware that content types are API, but note that the entire defect is about adding something after API freeze in order to avoid undue hardship of the ID community i.e. it is worth it.
Comment 26 Dejan Glozic CLA 2006-04-17 18:47:25 EDT
it does look as perfect -> it does not look as perfect
Comment 27 Jeff McAffer CLA 2006-04-17 22:08:21 EDT
Adding Rafael to the fun since he actually knows how content types work...
Comment 28 Curtis d'Entremont CLA 2006-04-18 11:41:07 EDT
Ok, here's the proposed plan for 3.2.

We will leave our content types in but make them "placeholders" for the WTP types by using the alias-for attribute. To be clear, yes, we will refer to WTP from platform (for now). Our definitions will be used only when WTP is not there. When WTP is there, it will behave as if ours were not there. WTP should not have to change anything. This way we keep our content-types API, and WTP editors still work.

Referring to WTP from platform is no good long term, so in the future, we need to think about a general solution for conflicting content types.

But for now, is WTP ok with this approach?
Comment 29 Dejan Glozic CLA 2006-04-18 11:46:54 EDT
Curtis, just to be clear, we will remove our HTML content type definition and only keep a placeholder definition for XHTML, right?
Comment 30 Curtis d'Entremont CLA 2006-04-18 12:25:34 EDT
Upon further investigation, the proposed plan won't work because WTP defines a
single content type that includes both html and xhtml. Since we need a way to
specify xhtml only (not html), this won't work.

For Dejan's question, actually now I remember why we need both html and xhtml
content types. Since xhtml can be in a .html file, we need to declare this
extension for the content type so that it will even be considered. From what
I've seen, the content type resolver will stop at the file extension if there's
only one type declaring it. By including the html content type we are telling
the platform that an .html file could be either html or xhtml, forcing it to
consult the describer, which will give the final decision.
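
[Editor's sketch] The behaviour described in this comment (a lone type claiming an extension wins without its describer being consulted, while two types claiming the same extension force a look at the content) can be modelled with a toy resolver. This is not the platform's content type manager; the class, the ids "html" and "xhtml", and the first-match tie-break are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

// Toy model of "extension first, describer only on ambiguity"; not the Eclipse API.
final class ExtensionFirstResolver {
    /** Hypothetical content type: an id, a content check, and the extensions it claims. */
    static final class Type {
        final String id;
        final Predicate<String> describer;
        final List<String> extensions;

        Type(String id, Predicate<String> describer, String... extensions) {
            this.id = id;
            this.describer = describer;
            this.extensions = Arrays.asList(extensions);
        }
    }

    private final List<Type> types = new ArrayList<>();
    int describerCalls = 0; // exposed so the short-circuit is observable

    ExtensionFirstResolver register(Type type) {
        types.add(type);
        return this;
    }

    /** Returns the id of the resolved type, or null if no type claims the extension. */
    String resolve(String fileName, String content) {
        String ext = fileName.substring(fileName.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
        List<Type> candidates = new ArrayList<>();
        for (Type t : types) {
            if (t.extensions.contains(ext)) {
                candidates.add(t);
            }
        }
        if (candidates.isEmpty()) {
            return null;
        }
        if (candidates.size() == 1) {
            return candidates.get(0).id; // only one claimant: content is never inspected
        }
        for (Type t : candidates) {
            describerCalls++;
            if (t.describer.test(content)) {
                return t.id; // assumed tie-break: first accepting describer wins
            }
        }
        return null;
    }
}
```

In this model, registering only an "xhtml" type for .html would label every .html file as XHTML without reading it, which is why a second "html" type claiming the same extension is needed to force the describer to run.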
Comment 31 Rafael Chaves CLA 2006-04-18 13:12:34 EDT
Two related but different content types declared as a single content type in WTP sounds fishy to me. I opened bug 137301 against WTP to discuss that.
Comment 32 Amy Wu CLA 2006-04-18 13:25:57 EDT
WTP has been meaning to create a separate XHTML content type.  There's already a bug for this: bug 91838.  We just haven't got around to fixing it.
Comment 33 Amy Wu CLA 2006-04-18 13:28:34 EDT
Oh, and I want to add that fixing it now in WTP's 1.5 RC1 would be a pretty major change since we'd have to update all the places that are currently checking for html content type and evaluate whether or not we want to add xhtml content type as well.
Comment 34 Rafael Chaves CLA 2006-04-18 13:55:38 EDT
My two cents:

All WTP has to do now is to agree on an ID and a place in the content type hierarchy. UA can then define an XHTML content type that is a placeholder to the future WTP XHTML.

For instance, let's say WTP plans to declare XHTML as a subtype of HTML after Callisto. UA can then declare now a placeholder for the HTML (pointing to the existing WTP's) content type and a second placeholder for the XHTML (pointing to the future WTP's) content type. In standalone mode, UA will work fine. With WTP in callisto, WTP's HTML content type will replace UA's, but UA's XHTML content type will be preserved. When WTP starts providing its own XHTML content type, both of UA's content types will be replaced (when running with WTP).
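
[Editor's sketch] The three scenarios in this comment can be traced with a small model of placeholder resolution: a placeholder yields to its target when the target is installed and stands on its own otherwise. The ids "ua.html", "ua.xhtml", "wtp.html" and "wtp.xhtml" are invented for the example, and the lookup below is a simplification, not the platform's alias handling.

```java
import java.util.Map;
import java.util.Set;

// Toy model of placeholder (alias) content types; all ids are hypothetical.
final class AliasPlaceholders {
    /**
     * Returns the id of the type that is effectively used when `requested` is looked up:
     * the alias target if it is installed, otherwise the placeholder itself.
     */
    static String effective(String requested, Map<String, String> aliasFor, Set<String> installed) {
        String target = aliasFor.get(requested);
        if (target != null && installed.contains(target)) {
            return target; // the real definition replaces the placeholder
        }
        return requested; // target missing: the placeholder's own definition is used
    }
}
```

With aliases ua.html -> wtp.html and ua.xhtml -> wtp.xhtml, a standalone install keeps both UA types, an install with only wtp.html replaces just the HTML placeholder, and an install with both WTP types replaces both.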
Comment 35 Dejan Glozic CLA 2006-04-18 14:05:02 EDT
Can we agree on this resolution then?
Comment 36 Amy Wu CLA 2006-04-18 14:13:26 EDT
But won't this break WTP's assumption that HTML and XHTML content type are currently the same?  Currently WTP's HTML editor works on WTP's HTML Content type, which includes XHTML.  If UA adds a separate XHTML content type, then in Callisto, xhtml files will still be UA's XHTML content type, meaning users will not be able to automatically use the HTML editor with XHTML files.
Comment 37 Rafael Chaves CLA 2006-04-18 14:26:27 EDT
That will not be a problem if UA's content type is strict enough (i.e. its describer returns VALID only if all the requirements are satisfied - for instance, one of the mandatory DOCTYPEs is present).

So:

1) unless the file is really XHTML, the content type will be WTP's HTML (when matching is VALID, the most specific wins, if it is INDETERMINATE, the most general wins)
2) if the file is deemed XHTML, any behavior currently associated to WTP's HTML content type will be available to UA's XHTML content type as well (because XHTML will be a subtype of HTML). At least that is what is expected from any plug-in that allows contributors to specify content type associations.
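
[Editor's sketch] The selection rule stated in 1) can be written down as a comparison over describer verdicts. The model below is only an illustration: the enum mirrors the VALID/INDETERMINATE/INVALID outcomes mentioned in this thread, "depth" stands in for how specific a type is in the hierarchy, and the assumption that a VALID candidate outranks an INDETERMINATE one is the editor's, not something stated in the comment.

```java
import java.util.List;

// Toy model of the "most specific wins if VALID, most general wins if INDETERMINATE" rule.
final class SpecificityRule {
    enum Verdict { VALID, INDETERMINATE, INVALID }

    /** Hypothetical candidate: type id, depth in the type hierarchy, describer verdict. */
    static final class Candidate {
        final String id;
        final int depth; // 0 = most general; larger = more specific subtype
        final Verdict verdict;

        Candidate(String id, int depth, Verdict verdict) {
            this.id = id;
            this.depth = depth;
            this.verdict = verdict;
        }
    }

    /** Returns the id of the winning candidate, or null if every describer said INVALID. */
    static String pick(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (c.verdict == Verdict.INVALID) {
                continue; // rejected outright
            }
            if (best == null || better(c, best)) {
                best = c;
            }
        }
        return best == null ? null : best.id;
    }

    private static boolean better(Candidate a, Candidate b) {
        if (a.verdict != b.verdict) {
            return a.verdict == Verdict.VALID; // assumption: VALID outranks INDETERMINATE
        }
        // Same verdict: deeper wins among VALID, shallower wins among INDETERMINATE.
        return a.verdict == Verdict.VALID ? a.depth > b.depth : a.depth < b.depth;
    }
}
```

Under this rule a strict XHTML describer that returns INVALID for plain HTML leaves the general HTML type as the only candidate, which is the outcome described in 1).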
Comment 38 Mike Wilson CLA 2006-04-18 14:29:32 EDT
I'm still confused... If they try to open an XHTML file in this case, won't it get handed to UA instead of WTP?
Comment 39 Rafael Chaves CLA 2006-04-18 14:39:15 EDT
Only if UA is associating an editor to XHTML. I thought that this had nothing to do with editors, only with search participants. Is that right?
Comment 40 Curtis d'Entremont CLA 2006-04-18 14:56:46 EDT
I would like to propose another, simpler plan. For 3.2, don't support content types for lucene search participants yet; basically rollback the new API we added as McQ originally suggested, bring it back to how it was before, no content types declared (WTP is happy). The general case of having two search participants for files with the same extension won't be supported (I've never heard any requests for this except from ourselves). So we will handle this all internally - the HTML participant will be made aware that it can receive xhtml content and it has to delegate properly to the xhtml participant.

I've discussed with Dejan and he is ok with it.
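
[Editor's sketch] The internal delegation proposed here might be structured along these lines. This is a guess at the shape, not the code that was committed: the Participant interface, both class names, the returned labels, and the 1024-character peek are hypothetical stand-ins for the real search participant types.

```java
// Toy model of an HTML participant that hands XHTML content to the XHTML participant.
final class DelegatingParticipants {
    /** Hypothetical stand-in for a search participant; returns a label naming the path taken. */
    interface Participant {
        String index(String document);
    }

    static final class XhtmlParticipant implements Participant {
        @Override
        public String index(String document) {
            return "indexed-as-xhtml";
        }
    }

    static final class HtmlParticipant implements Participant {
        private static final int PEEK_CHARS = 1024; // assumed peek size
        private final Participant xhtmlDelegate;

        HtmlParticipant(Participant xhtmlDelegate) {
            this.xhtmlDelegate = xhtmlDelegate;
        }

        @Override
        public String index(String document) {
            // Registered for .htm/.html, but aware that the content may really be XHTML.
            if (isXhtml(document)) {
                return xhtmlDelegate.index(document);
            }
            return "indexed-as-html";
        }

        private static boolean isXhtml(String document) {
            String head = document.substring(0, Math.min(document.length(), PEEK_CHARS));
            return head.contains("-//W3C//DTD XHTML");
        }
    }
}
```

Keeping the check inside the HTML participant means the extension point stays keyed on file extensions, so no content type has to be declared and nothing outside Help is affected.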
Comment 41 Mike Wilson CLA 2006-04-18 15:09:22 EDT
I believe this is the best answer at this point, as well. 

Let's all keep in mind that this is one we'd like to do right though, post R3.2 ship.
Comment 42 David Williams CLA 2006-04-18 15:12:27 EDT
Agreed, this is a good area for a little "cross project" work post 3.2. 

BTW, sorry if I missed this, or if this is the wrong forum, but what exactly is the significance of type to the search participant? Are certain tags ignored? Are certain tags flagged as searchable? I'm partially just interested in how HTML and XHTML are being treated differently by search ... but I'm mostly interested from the point of view of if/when we in WTP have search participants for HTML/XHTML. What would the "overlap" then be? Or, is your use of this search entirely from the "search box" of the help dialog .. and would not have anything to do with searching HTML/XHTML under development in workspace projects? 

Comment 43 Amy Wu CLA 2006-04-18 15:17:48 EDT
I like this new/simpler solution plan (especially at this stage in release). 
Thanks so much for the quick response and resolution to the issues with WTP.
Comment 44 Dejan Glozic CLA 2006-04-18 15:20:37 EDT
Help component has added new dynamic capabilities for help content in 3.2. If the document is XHTML, it is possible to use help-specific elements to filter content, include external content and place anchors for content injection. XHTML file with these elements is transformed into another XHTML file by dynamic content producer.

We needed to have a pluggable search participant that will be sensitive to these dynamic capabilities. XHTML as content is not interesting to us (a standard XHTML file is indexed as an HTML one would be). However, an XHTML file with dynamic capabilities needs to be indexed differently.

This is all for information search (Help>Search or Help>Help Contents>Search) and has nothing to do with workspace search that works on resources.
Comment 45 Curtis d'Entremont CLA 2006-04-18 18:59:16 EDT
Done.

Just to reiterate, we are no longer defining content types, so WTP should be fine. Could you confirm with the next nightly build?

Future: We should rework this next release to handle content types and do it properly; perhaps have the content types available in a place where both we and WTP can access them. I've opened bug 137397 to track this.
Comment 46 David Williams CLA 2006-04-18 23:50:26 EDT
Well, Curtis is on a roll ... but, I think this too is a "great bug" ... in the investigations that Curtis did for the conflict that Amy discovered. While not "fixed" in the original sense of the bug, the "early testing" and cross-project collaboration done here avoided a much worse problem that would not have been possible if we were not doing a "simultaneous" release. (so, to be explicit, it's Curtis and Amy that I think get the credit for the "great bug" part here ... some of this 'contentType' stuff is outside their normal realm ... the rest of us on this bug are just lurkers giving armchair advice :) 

Comment 47 Curtis d'Entremont CLA 2006-04-19 16:24:59 EDT
FYI, 3.2 RC1 is being rebuilt as RC1a, the only change being this fix. When finished, WTP can start using RC1a instead of RC1.
Comment 48 Kim Moir CLA 2006-04-20 13:10:05 EDT
Could WTP and Platform UA please confirm that in build
http://download.eclipse.org/eclipse/downloads/drops/I20060419-1640/index.php

this issue is resolved and that the build can be promoted to RC1a.
Comment 49 Jesse Kuhnert CLA 2006-04-20 13:11:52 EDT
Works for me.
Comment 50 Curtis d'Entremont CLA 2006-04-20 13:38:37 EDT
I can confirm the build is good, the fix is in there, and the conflict should no longer exist. Can you double check, David or Amy?
Comment 51 Amy Wu CLA 2006-04-20 14:05:33 EDT
Verified there is no longer conflict of html/xhtml content type.  Thanks so much.
Comment 52 Kim Moir CLA 2006-04-20 14:09:28 EDT
Curtis, please sign off on platform-releng-dev mailing list too so the entire team is aware of the status. thanks.