[news.eclipse.technology.epf] Importing Word content

Hi again,

I have been experimenting with ways to automate or improve the import of Word content into EPF. One issue is how to clean up the HTML that Word produces.
As a first step, I have put together a small converter that runs HTML Tidy in batch mode with a configuration file:


tidy -config configWordClean.txt -f errors.txt -m [filename].htm

---
// sample config file for HTML tidy
indent: auto
indent-spaces: 2
wrap: 72
word-2000: yes
clean: yes
markup: yes
output-xml: yes
input-xml: no
doctype: omit
show-warnings: yes
numeric-entities: yes
quote-marks: yes
quote-nbsp: yes
quote-ampersand: no
break-before-br: no
uppercase-tags: no
uppercase-attributes: no
char-encoding: latin1
---
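
To run the same Tidy command over a whole folder of Word exports, something like the following could work. It is only a rough sketch: it assumes Python is available and that the `tidy` executable is on the PATH, and the function names `tidy_command` and `tidy_folder` are made up for illustration.

```python
import subprocess
from pathlib import Path


def tidy_command(html_file, config="configWordClean.txt", errors="errors.txt"):
    """Build the argument list for one in-place Tidy run (same flags as above)."""
    return ["tidy", "-config", str(config), "-f", str(errors), "-m", str(html_file)]


def tidy_folder(folder, config="configWordClean.txt"):
    """Run Tidy on every .htm file in `folder` and return the exit code per file.

    Tidy exits with 0 when all is well, 1 when there were warnings and
    2 when there were errors, so the result shows which files need a look.
    """
    results = {}
    for html_file in sorted(Path(folder).glob("*.htm")):
        # One error log per document, written next to it (naming is an assumption).
        errors = html_file.with_suffix(".errors.txt")
        completed = subprocess.run(tidy_command(html_file, config, errors))
        results[html_file.name] = completed.returncode
    return results
```

Calling `tidy_folder("exported")` would then modify each `.htm` file in place, exactly as the `-m` flag does for a single file.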

The result is only a starting point. In a second step, I use a custom-made WordTidy.xslt to filter what remains so that it suits EPF more specifically.

---
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions" exclude-result-prefixes="fn xsl xs">
<xsl:output method="xhtml" encoding="ISO-8859-1" indent="yes"/>
<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="html">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="head">
</xsl:template>
<xsl:template match="body">
<body>
<xsl:apply-templates/>
</body>
</xsl:template>
<xsl:template match="div">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="table">
<table width="{@width}" border="{@border}" cellspacing="{@cellspacing}" cellpadding="{@cellpadding}">
<xsl:apply-templates/>
</table>
</xsl:template>
<xsl:template match="tbody">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="*[text() = '&#160;' ]"/>
<xsl:template match="span[@class='c1']"/>
<xsl:template match="tr">
<tr>
<xsl:apply-templates/>
</tr>
</xsl:template>
<xsl:template match="td | th">
<td width="{@width}" valign="{@valign}">
<xsl:apply-templates/>
</td>
</xsl:template>
<xsl:template match="h1 | h2 | h3">
<h3>
<xsl:apply-templates/>
</h3>
</xsl:template>
<xsl:template match="h4">
<h4>
<xsl:apply-templates/>
</h4>
</xsl:template>
<xsl:template match="h5">
<h5>
<xsl:apply-templates/>
</h5>
</xsl:template>
<xsl:template match="span">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="p">
<p>
<xsl:apply-templates/>
</p>
</xsl:template>
<xsl:template match="img">
<img width="{@width}" height="{@height}">
<xsl:attribute name="src" select=" concat('resources/', substring-after(@src, '/')) "/>
<xsl:apply-templates/>
</img>
</xsl:template>
<xsl:template match="br"/>
<xsl:template match="ul">
<ul>
<xsl:apply-templates/>
</ul>
</xsl:template>
<xsl:template match="li">
<li>
<xsl:apply-templates/>
</li>
</xsl:template>
</xsl:stylesheet>
---


Notice how I enforce a rule whereby all images are located in a resources folder relative to the HTML (within the EPF project structure).
A further step is then to copy the images to that folder so the references are valid!
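
The copying step could be scripted along these lines. This is a minimal sketch, assuming Python is available; `rewrite_src` and `copy_images` are hypothetical helper names, and the script should be run on the Tidy output before the stylesheet rewrites the `src` attributes. `rewrite_src` mirrors the `concat('resources/', substring-after(@src, '/'))` rule from the stylesheet, including the fact that `substring-after` returns an empty string when `@src` contains no slash.

```python
import shutil
from html.parser import HTMLParser
from pathlib import Path


def rewrite_src(src):
    """Mirror concat('resources/', substring-after(@src, '/')) from the XSLT."""
    # str.partition gives '' as the tail when '/' is absent, like substring-after.
    return "resources/" + src.partition("/")[2]


class _ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> element."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def copy_images(word_html, target_dir):
    """Copy each referenced image to where the rewritten src will point.

    `target_dir` is the folder that will hold the final HTML. Images whose
    src has no '/' are skipped, because the rule maps them to 'resources/'.
    """
    word_html = Path(word_html)
    collector = _ImgSrcCollector()
    collector.feed(word_html.read_text(encoding="latin-1"))
    copied = []
    for src in collector.sources:
        source_file = word_html.parent / src
        if "/" in src and source_file.is_file():
            destination = Path(target_dir) / rewrite_src(src)
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(source_file, destination)
            copied.append(destination)
    return copied
```

For a Word export that references `doc_files/image001.gif`, the image would end up as `resources/image001.gif` under the target folder, which is the path the stylesheet writes into the `src` attribute.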
---


The above technique really helps a lot, but I would like to automate it even further. Why should I have to insert the HTML into the RTE manually each time? Why not use the EPF API to do this more directly?

Has any work been done towards this goal already?

Kristian