[news.eclipse.platform.ua] Re: Add pdf files to search index

Mohamed Hussein wrote:
> Can I then just delegate adding to the index to the default indexer, or do I
> need to use Lucene APIs directly to parse and index the documents?


I think the Eclipse help system parses the HTML files as well, although I'm not sure at which point in the indexing process. I do know that there is an HTMLParser.java class in the org.eclipse.help.base.source JAR, and I presume it is there because the HTML files have to be parsed at some point.

So I would imagine that to get PDF files into the Lucene index, the PDF files would have to be parsed.

Chris Goldthorpe wrote:
This sounds like the sort of feature that others in the community may have already implemented, and if so it would be great to get this contributed to Eclipse. I'll ask around at IBM to see if we have anyone looking into this. Meanwhile if anyone else on the newsgroup has implemented this or thought about implementing this I'd be interested to know what approach you used.

One thought is that the PDF document would need to be parsed. I just went over to lucene.apache.org and the FAQ has this about indexing PDF:
http://wiki.apache.org/lucene-java/LuceneFAQ#head-c45f8b25d786f4e384936fa93ce1137a23b7e422


"In order to index PDF documents you need to first parse them to extract text that you want to index from them. Here are some PDF parsers that can help you with that:

PDFBox is a Java API from Ben Litchfield that will let you access the contents of a PDF document. It comes with integration classes for Lucene to translate a PDF into a Lucene document.

XPDF is an open source tool that is licensed under the GPL. It's not a Java tool, but there is a utility called pdftotext that can translate PDF files into text files on most platforms from the command line.

Based on xpdf, there is a utility called pdftohtml that can translate PDF files into HTML files. This is also not a Java application.

JPedal is a Java API for extracting text and images from PDF documents."
----------------------------------------
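
To make the "parse first, then index" step a bit more concrete, here is a rough sketch of the pdftotext route from the FAQ, driven from Java. It is only an illustration under some assumptions: pdftotext is installed and on the PATH, passing "-" as the output file sends the text to stdout, and the class and method names (PdfTextExtractor, buildCommand, extractText) are made up for this example. They are not part of Eclipse help or Lucene.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not an Eclipse or Lucene class.
final class PdfTextExtractor {

    // Builds the command line: pdftotext -enc UTF-8 <pdf-file> -
    // The trailing "-" asks pdftotext to write the text to stdout.
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs pdftotext (assumed to be on the PATH) and returns the extracted text.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                // Discard stderr so a chatty tool cannot block on a full pipe.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode + " for " + pdf);
        }
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

The string returned by extractText(...) is what would then be handed to whatever builds the Lucene document. How that hooks into the Eclipse help indexer is exactly the open question in this thread, so the sketch stops at text extraction.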

A link about PDFBox to extract the text from a PDF:
http://www.pdfbox.org/userguide/text_extraction.html

The PDFBox site says that it is licensed under the BSD License. I don't know if that is compatible with the Eclipse license, such that PDFBox would be a viable solution to ship with the Eclipse Platform itself.

XPDF and JPedal seem to be GPL or LGPL.

Hope that helps,
Lee Anne