[news.eclipse.platform.ua] Re: Add pdf files to search index
Mohamed Hussein wrote:
> Can I then just delegate adding to the index to the default indexer, or do I
> need to use Lucene APIs directly to parse and index the documents?
I think the Eclipse help system parses the HTML files as well, though I'm
not sure at which point in the indexing process. I do know that there is an
HTMLParser.java class in the org.eclipse.help.base.source JAR, and I
presume that is there because the HTML files have to be parsed at some
point in the process.
So I would imagine that to get PDF files into the Lucene index, the PDF
files would have to be parsed.
Chris Goldthorpe wrote:
This sounds like the sort of feature that others in the community may
have already implemented, and if so it would be great to get this
contributed to Eclipse. I'll ask around at IBM to see if we have anyone
looking into this. Meanwhile if anyone else on the newsgroup has
implemented this or thought about implementing this I'd be interested to
know what approach you used.
One thought is that the PDF document would need to be parsed. I just
went over to lucene.apache.org and the FAQ has this about indexing PDF:
http://wiki.apache.org/lucene-java/LuceneFAQ#head-c45f8b25d786f4e384936fa93ce1137a23b7e422
"In order to index PDF documents you need to first parse them to extract
text that you want to index from them. Here are some PDF parsers that
can help you with that:
PDFBox is a Java API from Ben Litchfield that will let you access the
contents of a PDF document. It comes with integration classes for Lucene
to translate a PDF into a Lucene document.
XPDF is an open source tool that is licensed under the GPL. It's not a
Java tool, but there is a utility called pdftotext that can translate
PDF files into text files on most platforms from the command line.
Based on xpdf, there is a utility called pdftohtml that can translate
PDF files into HTML files. This is also not a Java application.
JPedal is a Java API for extracting text and images from PDF documents."
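
As a rough illustration of the pdftotext route mentioned in the FAQ, here is a minimal Java sketch that shells out to the tool and reads back the extracted text, which could then be handed to whatever does the indexing. It assumes pdftotext is installed and on the PATH; the class and method names (PdfToTextSketch, buildCommand, extractText) are made up for this example and are not part of Eclipse, Lucene, or xpdf.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: not an Eclipse or Lucene class.
final class PdfToTextSketch {

    // Builds the command line "pdftotext -enc UTF-8 <input.pdf> <output.txt>".
    // Assumes the pdftotext executable can be found on the PATH.
    static List<String> buildCommand(Path pdf, Path txt) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), txt.toString());
    }

    // Runs pdftotext on the given PDF and returns the extracted text.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Path txt = Files.createTempFile("pdf-extract", ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(pdf, txt))
                    .redirectErrorStream(true)
                    .start();
            // Drain the tool's console output so it cannot block on a full pipe.
            process.getInputStream().readAllBytes();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("pdftotext exited with code " + exitCode);
            }
            return Files.readString(txt, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(txt);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfToTextSketch <file.pdf>");
            return;
        }
        // The resulting string is what would be passed on for indexing.
        System.out.println(extractText(Paths.get(args[0])));
    }
}
```

Going through an external process keeps GPL-licensed code out of the Java classpath, but it means the tool has to be installed separately on every machine that builds the index.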
----------------------------------------
A link about PDFBox to extract the text from a PDF:
http://www.pdfbox.org/userguide/text_extraction.html
The PDFBox site says that it is licensed under the BSD License. I don't
know if that is compatible with the Eclipse license, such that PDFBox
would be a viable solution to ship with the Eclipse Platform itself.
XPDF and JPedal seem to be GPL or LGPL.
Hope that helps,
Lee Anne