[smila-dev] Re: [Aperture-devel] usage of extractors

It was Daniel.Stucky@xxxxxxxxxxx who said at the right time 05.12.2008 14:08 the following words:
Hi aperture team,

I have another question concerning the usage of extractors:

As I understand - Extractor implementations register their factory at
the ExtractorRegistry. Each ExtractorFactory provides a list of
supported mimetypes it can extract. The ExtractorRegistry returns
available Extractors for a specified mimetype or a set of all registered
Extractors.
Uh, nearly.
The ExtractorFactories are registered at the ExtractorRegistry.
The registry returns factories, not extractors.
A factory returns an extractor.
(= the usual Elfish art of frameworking)
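
To make the registry / factory / extractor chain concrete, here is a minimal sketch. The nested types below are hypothetical stand-ins written for this example, not the real Aperture interfaces; only the shape (the registry hands out factories, and each factory hands out an extractor) is taken from the explanation above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins, NOT the real Aperture API.
class RegistrySketch {
    interface Extractor {
        String extract(String content);
    }

    interface ExtractorFactory {
        Set<String> getSupportedMimeTypes();
        Extractor get();
    }

    static class ExtractorRegistry {
        private final List<ExtractorFactory> factories = new ArrayList<>();

        void add(ExtractorFactory factory) {
            factories.add(factory);
        }

        // Returns factories, not extractors.
        Set<ExtractorFactory> get(String mimeType) {
            Set<ExtractorFactory> result = new LinkedHashSet<>();
            for (ExtractorFactory f : factories) {
                if (f.getSupportedMimeTypes().contains(mimeType)) {
                    result.add(f);
                }
            }
            return result;
        }
    }

    // Helper for the demo: a factory for one MIME type whose extractor
    // just prefixes the content with a label.
    static ExtractorFactory factory(final String mimeType, final String label) {
        return new ExtractorFactory() {
            public Set<String> getSupportedMimeTypes() {
                return Collections.singleton(mimeType);
            }
            public Extractor get() {
                return content -> label + ":" + content;
            }
        };
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.add(factory("text/plain", "plain"));
        registry.add(factory("application/pdf", "pdf"));

        // Step 1: ask the registry for factories. Step 2: ask a factory for an extractor.
        for (ExtractorFactory f : registry.get("application/pdf")) {
            Extractor extractor = f.get();
            System.out.println(extractor.extract("hello"));
        }
    }
}
```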

How is it possible to select a certain Extractor to use for document
extraction if multiple extractors are available for the specified
mimetype? Is such a selection logic part of the implementation of a
smila pipelet using the aperture extractor ?
It is not possible, and it's mumbo-jumbo.
When two extractors are registered for a MIME type, there is no ranking or rating of them. Basically, there are three ways you can solve it in SMILA (in Aperture we will probably leave it open):
* do not allow multiple extractors (= the best solution!)
* rank them from generic to specific: an "any file" extractor has a specificity rating of "0.1"; a PDF extractor that knows how to scan the text for titles and so on is "1.0", but "1.0" only for "PDFs generated from LaTeX and using titles". This is TRICKY.
* have both extractors run, mix the results (easy)
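
The ranking option could look roughly like the sketch below. Everything in it is an assumption made for illustration: `RankedFactory`, its `score` method, and the `*/*` wildcard do not exist in Aperture. The score here depends only on the MIME type; a score that depends on the document content (the "PDFs generated from LaTeX" case) is the part that makes this tricky and is not attempted.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class RankingSketch {
    // Hypothetical: a factory that reports how specific it is for a MIME type.
    static class RankedFactory {
        final String name;
        final String mimeType;     // "*/*" means "any file" in this sketch
        final double specificity;  // 0.1 = generic, 1.0 = very specific

        RankedFactory(String name, String mimeType, double specificity) {
            this.name = name;
            this.mimeType = mimeType;
            this.specificity = specificity;
        }

        // 0.0 means "cannot handle this MIME type at all".
        double score(String requested) {
            if (mimeType.equals("*/*") || mimeType.equals(requested)) {
                return specificity;
            }
            return 0.0;
        }
    }

    // Pick the candidate with the highest score for the requested MIME type.
    static Optional<RankedFactory> select(List<RankedFactory> candidates, final String mimeType) {
        return candidates.stream()
                .filter(f -> f.score(mimeType) > 0.0)
                .max(Comparator.comparingDouble((RankedFactory f) -> f.score(mimeType)));
    }

    public static void main(String[] args) {
        List<RankedFactory> candidates = Arrays.asList(
                new RankedFactory("generic", "*/*", 0.1),
                new RankedFactory("pdf", "application/pdf", 1.0));

        System.out.println(select(candidates, "application/pdf").get().name);
        System.out.println(select(candidates, "text/plain").get().name);
    }
}
```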

For the last one, "have both extractors run, mix the results":
this is implemented in Nepomuk. Mixing the results is easy because we use RDF.
The magic lines of code are:
http://dev.nepomuk.semanticdesktop.org/browser/trunk/java/org.semanticdesktop.nepomuk.comp.datawrapper.aperture/src/org/semanticdesktop/nepomuk/comp/datawrapper/aperture/impl/ApertureDataWrapperCrawlerHandler.java#L512
http://dev.nepomuk.semanticdesktop.org/browser/trunk/java/org.semanticdesktop.nepomuk.comp.datawrapper.aperture/src/org/semanticdesktop/nepomuk/comp/datawrapper/aperture/impl/ApertureDataWrapperCrawlerHandler.java#L550

In the second part, we can safely run multiple extractors on one stream
because we buffered it beforehand (into a file).
You can think of better schemes, of course...
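
A rough sketch of the "buffer into a file, run every extractor, mix the results" scheme follows. It is not the Nepomuk code linked above: the `Extractor` interface is a hypothetical stand-in, and a plain map from property name to a set of values is used in place of an RDF model, purely to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class MergeSketch {
    // Hypothetical extractor: reads a stream, returns property -> values.
    interface Extractor {
        Map<String, Set<String>> extract(InputStream in) throws IOException;
    }

    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    static Map<String, Set<String>> property(String key, String value) {
        Set<String> values = new TreeSet<>();
        values.add(value);
        Map<String, Set<String>> result = new TreeMap<>();
        result.put(key, values);
        return result;
    }

    // Two toy extractors for the demo.
    static Extractor textExtractor() {
        return in -> property("fullText", new String(readAll(in), StandardCharsets.UTF_8));
    }

    static Extractor sizeExtractor() {
        return in -> property("byteSize", String.valueOf(readAll(in).length));
    }

    // Buffer the one-shot source stream into a temp file, then give each
    // extractor its own fresh stream and union the results per property.
    static Map<String, Set<String>> runAll(InputStream source, List<Extractor> extractors) {
        Path buffer = null;
        try {
            buffer = Files.createTempFile("extract-", ".tmp");
            Files.copy(source, buffer, StandardCopyOption.REPLACE_EXISTING);
            Map<String, Set<String>> merged = new TreeMap<>();
            for (Extractor extractor : extractors) {
                try (InputStream in = Files.newInputStream(buffer)) {
                    for (Map.Entry<String, Set<String>> e : extractor.extract(in).entrySet()) {
                        merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
                    }
                }
            }
            return merged;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        } finally {
            if (buffer != null) {
                try {
                    Files.deleteIfExists(buffer);
                } catch (IOException ignored) {
                    // best-effort cleanup of the temp file
                }
            }
        }
    }

    public static void main(String[] args) {
        InputStream source = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(runAll(source, Arrays.asList(textExtractor(), sizeExtractor())));
    }
}
```

Without the buffering step, the second extractor would find the stream already consumed by the first.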

best
Leo

Bye,
Daniel

--
____________________________________________________
DI Leo Sauermann       http://www.dfki.de/~sauermann
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080           Fon:   +49 631 20575-116
D-67663 Kaiserslautern  Fax:   +49 631 20575-102
Germany                 Mail:  leo.sauermann@xxxxxxx

Geschaeftsfuehrung:
Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
____________________________________________________


