[smila-dev] Re: [Aperture-devel] usage of extractors
It was Daniel.Stucky@xxxxxxxxxxx who said at the right time 05.12.2008
14:08 the following words:
Hi aperture team,
I have another question concerning the usage of extractors:
As I understand - Extractor implementations register their factory at
the ExtractorRegistry. Each ExtractorFactory provides a list of
supported mimetypes it can extract. The ExtractorRegistry returns
available Extractors for a specified mimetype or a set of all registered
Extractors.
uh, nearly.
The ExtractorFactories are registered at the ExtractorRegistry.
The registry returns Factories, not Extractors.
A factory returns an extractor.
(=the usual Elfish art of frameworking)
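To make the chain registry -> factory -> extractor concrete, here is a minimal sketch. The types below are simplified stand-ins written for this mail, not the real Aperture interfaces (the real Extractor writes into an RDF container; here it just returns a String), and SimpleExtractorRegistry, RegistryDemo, extractWithFirst and factoryFor are made-up names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-ins for illustration only, not the Aperture API.
interface Extractor {
    String extract(String content);
}

interface ExtractorFactory {
    Set<String> getSupportedMimeTypes();
    Extractor get();
}

class SimpleExtractorRegistry {
    private final Map<String, Set<ExtractorFactory>> byMimeType = new HashMap<>();

    // Factories (not extractors) are what gets registered.
    void add(ExtractorFactory factory) {
        for (String mimeType : factory.getSupportedMimeTypes()) {
            byMimeType.computeIfAbsent(mimeType, k -> new LinkedHashSet<>()).add(factory);
        }
    }

    // The registry hands back factories; it may return more than one per MIME type.
    Set<ExtractorFactory> get(String mimeType) {
        return byMimeType.getOrDefault(mimeType, Collections.emptySet());
    }
}

class RegistryDemo {
    // Hypothetical helper: takes whichever factory comes first, with no ranking at all.
    static String extractWithFirst(SimpleExtractorRegistry registry, String mimeType, String content) {
        Set<ExtractorFactory> factories = registry.get(mimeType);
        if (factories.isEmpty()) {
            return null;
        }
        ExtractorFactory factory = factories.iterator().next();
        return factory.get().extract(content); // the factory creates the extractor
    }

    // Hypothetical helper that builds a trivial factory for one MIME type.
    static ExtractorFactory factoryFor(String mimeType, String label) {
        return new ExtractorFactory() {
            public Set<String> getSupportedMimeTypes() {
                return Collections.singleton(mimeType);
            }
            public Extractor get() {
                return content -> label + ":" + content;
            }
        };
    }

    public static void main(String[] args) {
        SimpleExtractorRegistry registry = new SimpleExtractorRegistry();
        registry.add(factoryFor("text/plain", "plain"));
        System.out.println(extractWithFirst(registry, "text/plain", "hello"));
    }
}
```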
How is it possible to select a certain Extractor to use for document
extraction if multiple extractors are available for the specified
mimetype? Is such a selection logic part of the implementation of a
smila pipelet using the aperture extractor ?
It is not possible, and it's a bit of a mumbo-jumbo.
When two extractors are registered for a mime-type, there is no ranking
or rating of them.
basically, there are three ways you can solve it in SMILA (=in aperture we
will probably leave it open):
* do not allow multiple extractors (=the best solution !)
* rank them from generic - to - specific: an "any file" extractor has a
specificity rating of "0.1", a PDF extractor that knows how to scan the text
for titles and whatever is "1.0", but "1.0" only for "PDFs generated
from Latex and using titles". This is TRICKY
* have both extractors run, mix the results (easy)
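The ranking option from the list above could look roughly like the sketch below. Nothing like this exists in aperture; RankedExtractor, its specificity field and ExtractorRanking are hypothetical names, and the scores are fixed per extractor, which is exactly the simplification that makes the real problem tricky (a score of "1.0" may only hold for some documents of a type).

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.Optional;

// Hypothetical pairing of an extractor name with a hand-assigned specificity score.
class RankedExtractor {
    final String name;
    final double specificity; // 0.1 = generic "any file", 1.0 = very specific

    RankedExtractor(String name, double specificity) {
        this.name = name;
        this.specificity = specificity;
    }
}

class ExtractorRanking {
    // Picks the candidate with the highest score; empty if there are no candidates.
    static Optional<RankedExtractor> mostSpecific(Collection<RankedExtractor> candidates) {
        return candidates.stream()
                .max(Comparator.comparingDouble((RankedExtractor c) -> c.specificity));
    }

    public static void main(String[] args) {
        Optional<RankedExtractor> best = mostSpecific(Arrays.asList(
                new RankedExtractor("any-file", 0.1),
                new RankedExtractor("pdf-with-titles", 1.0)));
        System.out.println(best.map(c -> c.name).orElse("none"));
    }
}
```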
for the last: "have both extractors run, mix the results"
this is implemented in nepomuk, mixing the result is easy - because we
use RDF.
the magic lines of code are:
http://dev.nepomuk.semanticdesktop.org/browser/trunk/java/org.semanticdesktop.nepomuk.comp.datawrapper.aperture/src/org/semanticdesktop/nepomuk/comp/datawrapper/aperture/impl/ApertureDataWrapperCrawlerHandler.java#L512
http://dev.nepomuk.semanticdesktop.org/browser/trunk/java/org.semanticdesktop.nepomuk.comp.datawrapper.aperture/src/org/semanticdesktop/nepomuk/comp/datawrapper/aperture/impl/ApertureDataWrapperCrawlerHandler.java#L550
in the second part, we can safely run multiple extractors on one stream,
because we buffered it before (into a file).
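As a rough illustration of that scheme (this is not the nepomuk code behind the links above): buffer the stream into a temp file once, open a fresh stream per extractor, and put everything into one result set. StreamExtractor and MultiExtractorRunner are made-up names, and plain strings stand in for RDF statements, so "mixing" is just a set union here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in: an extractor that turns a stream into a set of statements.
interface StreamExtractor {
    Set<String> extract(InputStream in) throws IOException;
}

class MultiExtractorRunner {
    // Buffers the source into a temp file, then runs every extractor on its own fresh stream.
    static Set<String> extractAll(InputStream source, List<StreamExtractor> extractors) throws IOException {
        Path buffer = Files.createTempFile("extract-buffer", ".tmp");
        try {
            Files.copy(source, buffer, StandardCopyOption.REPLACE_EXISTING);
            Set<String> merged = new LinkedHashSet<>();
            for (StreamExtractor extractor : extractors) {
                try (InputStream in = Files.newInputStream(buffer)) {
                    merged.addAll(extractor.extract(in)); // union of all results
                }
            }
            return merged;
        } finally {
            Files.deleteIfExists(buffer);
        }
    }

    // Reads a stream to the end; used by the toy extractor in demo().
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Two toy extractors over the same five-byte input.
    static Set<String> demo() {
        List<StreamExtractor> extractors = Arrays.asList(
                in -> Collections.singleton("bytes=" + readAll(in).length),
                in -> Collections.singleton("first=" + (char) in.read()));
        try {
            return extractAll(new ByteArrayInputStream("hello".getBytes(StandardCharsets.US_ASCII)), extractors);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Without the buffering step the first extractor would consume the stream and the second one would see nothing, which is why the copy to a file comes first.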
you can think of better schemes, of course...
best
Leo
Bye,
Daniel
--
____________________________________________________
DI Leo Sauermann http://www.dfki.de/~sauermann
Deutsches Forschungszentrum fuer
Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080 Fon: +49 631 20575-116
D-67663 Kaiserslautern Fax: +49 631 20575-102
Germany Mail: leo.sauermann@xxxxxxx
Geschaeftsfuehrung:
Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
____________________________________________________