[smila-dev] Re: [Aperture-devel] usage of extractors


From: Leo Sauermann <leo.sauermann@xxxxxxx>
Date: Fri, 05 Dec 2008 18:55:04 +0100
Organization: DFKI GmbH
User-agent: Thunderbird 2.0.0.18 (Windows/20081105)

It was Daniel.Stucky@xxxxxxxxxxx who said at the right time 05.12.2008 14:08 the following words:

Hi aperture team,

I have another question concerning the usage of extractors:

As I understand - Extractor implementations register their factory at
the ExtractorRegistry. Each ExtractorFactory provides a list of
supported mimetypes it can extract. The ExtractorRegistry returns
available Extractors for a specified mimetype or a set of all registered
Extractors.

uh, nearly.
The ExtractorFactories are registered at the ExtractorRegistry.
The registry returns Factories, not Extractors.
A factory returns an extractor.
(=the usual Elfish art of frameworking)
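
To make the registry -> factory -> extractor chain concrete, here is a minimal sketch. The three types below are simplified stand-ins written only for this mail, not the real Aperture interfaces (in particular the one-argument extract() is an assumption for illustration), and RegistrySketch with its helpers is hypothetical:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-ins for illustration only; not the real Aperture API.
interface Extractor {
    String extract(String content);
}

interface ExtractorFactory {
    Set<String> getSupportedMimeTypes();
    Extractor get();
}

class ExtractorRegistry {
    private final Map<String, Set<ExtractorFactory>> byMimeType = new HashMap<>();

    // Factories (not extractors) are what gets registered.
    void add(ExtractorFactory factory) {
        for (String mimeType : factory.getSupportedMimeTypes()) {
            byMimeType.computeIfAbsent(mimeType, k -> new LinkedHashSet<>()).add(factory);
        }
    }

    // The registry hands back factories; it does not pick one for you.
    Set<ExtractorFactory> get(String mimeType) {
        return byMimeType.getOrDefault(mimeType, Collections.emptySet());
    }
}

class RegistrySketch {
    // Hypothetical helper: a factory whose extractor just tags the content with a label.
    static ExtractorFactory factoryFor(String mimeType, String label) {
        return new ExtractorFactory() {
            public Set<String> getSupportedMimeTypes() {
                return Collections.singleton(mimeType);
            }
            public Extractor get() {
                return content -> label + ":" + content;
            }
        };
    }

    static ExtractorRegistry demoRegistry() {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.add(factoryFor("application/pdf", "pdf-a"));
        registry.add(factoryFor("application/pdf", "pdf-b"));
        return registry;
    }

    public static void main(String[] args) {
        // Two factories answer for the same mime type; each one yields its own extractor.
        for (ExtractorFactory factory : demoRegistry().get("application/pdf")) {
            System.out.println(factory.get().extract("doc"));
        }
    }
}
```

Note that the lookup returns every matching factory, so the caller has to decide what to do when there is more than one.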

How is it possible to select a certain Extractor to use for document
extraction if multiple extractors are available for the specified
mimetype? Is such a selection logic part of the implementation of a
smila pipelet using the aperture extractor ?

It is not possible, and it's mumbo-jumbo.

When two extractors are registered for a mime-type, there is no ranking or rating of them.
Basically, there are three ways you can solve it in SMILA (in Aperture we will probably leave it open):

* do not allow multiple extractors (=the best solution !)

* rank them from generic to specific: an "any file" extractor has a specificity rating of "0.1", a PDF extractor that knows how to scan the text for titles and whatever is "1.0", but "1.0" only for "PDFs generated from LaTeX and using titles". This is TRICKY

* have both extractors run, mix the results (easy)
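
For the second option, a rough sketch of what the ranking could look like in its simplest form, assuming each candidate carries a fixed specificity score. Everything here (RankedExtractor, RankingSketch, the "*/*" wildcard for the "any file" extractor) is hypothetical and not part of Aperture or SMILA. It also skips the tricky part, which is that the real score depends on the document and not only on the mime type:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical: an extractor entry annotated with a specificity score in [0, 1].
class RankedExtractor {
    final String name;
    final String mimeType;     // "*/*" is an assumed wildcard meaning "any file"
    final double specificity;

    RankedExtractor(String name, String mimeType, double specificity) {
        this.name = name;
        this.mimeType = mimeType;
        this.specificity = specificity;
    }
}

class RankingSketch {
    // Among the candidates that accept the mime type, pick the most specific one.
    static Optional<RankedExtractor> mostSpecific(List<RankedExtractor> candidates, String mimeType) {
        return candidates.stream()
                .filter(c -> c.mimeType.equals(mimeType) || c.mimeType.equals("*/*"))
                .max(Comparator.comparingDouble((RankedExtractor c) -> c.specificity));
    }

    static List<RankedExtractor> demoCandidates() {
        List<RankedExtractor> candidates = new ArrayList<>();
        candidates.add(new RankedExtractor("any-file", "*/*", 0.1));
        candidates.add(new RankedExtractor("pdf-titles", "application/pdf", 1.0));
        return candidates;
    }

    public static void main(String[] args) {
        mostSpecific(demoCandidates(), "application/pdf")
                .ifPresent(best -> System.out.println("chosen: " + best.name));
    }
}
```

With static scores like these the generic extractor only wins when nothing more specific is registered for the type.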

for the last: "have both extractors run, mix the results"

this is implemented in nepomuk, mixing the result is easy - because we use RDF.

the magic lines of code are:
http://dev.nepomuk.semanticdesktop.org/browser/trunk/java/org.semanticdesktop.nepomuk.comp.datawrapper.aperture/src/org/semanticdesktop/nepomuk/comp/datawrapper/aperture/impl/ApertureDataWrapperCrawlerHandler.java#L512
http://dev.nepomuk.semanticdesktop.org/browser/trunk/java/org.semanticdesktop.nepomuk.comp.datawrapper.aperture/src/org/semanticdesktop/nepomuk/comp/datawrapper/aperture/impl/ApertureDataWrapperCrawlerHandler.java#L550

in the second part, we can safely run multiple extractors on one stream,
because we buffered it before (into a file).
you can think of better schemes, of course...
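
As an illustration of that buffering idea (not a copy of the Nepomuk code linked above), a minimal sketch: the incoming stream is copied to a temp file once, each extractor gets its own fresh stream on that file, and all results land in one shared set. StreamExtractor and MultiExtractSketch are hypothetical names, and plain strings stand in for RDF statements, where merging would likewise be a set union:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in: an extractor reads a stream and adds statements to a shared result.
interface StreamExtractor {
    void extract(InputStream in, Set<String> statements) throws IOException;
}

class MultiExtractSketch {
    // Buffer the one-shot stream into a file, then run every extractor on its own copy.
    static Set<String> extractWithAll(InputStream source, List<StreamExtractor> extractors) {
        Path buffer = null;
        try {
            buffer = Files.createTempFile("extract-buffer", ".bin");
            Files.copy(source, buffer, StandardCopyOption.REPLACE_EXISTING);
            Set<String> merged = new LinkedHashSet<>();
            for (StreamExtractor extractor : extractors) {
                try (InputStream in = Files.newInputStream(buffer)) {
                    extractor.extract(in, merged);   // results are mixed by plain set union
                }
            }
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (buffer != null) {
                try {
                    Files.deleteIfExists(buffer);
                } catch (IOException ignored) {
                    // best-effort cleanup of the temp file
                }
            }
        }
    }

    // Two toy extractors that each consume the whole stream.
    static List<StreamExtractor> demoExtractors() {
        List<StreamExtractor> extractors = new ArrayList<>();
        extractors.add((in, out) -> out.add("byteSize=" + in.readAllBytes().length));
        extractors.add((in, out) -> out.add("plainText=" + new String(in.readAllBytes(), StandardCharsets.UTF_8)));
        return extractors;
    }

    public static void main(String[] args) {
        InputStream source = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(extractWithAll(source, demoExtractors()));
    }
}
```

Without the buffer the second extractor would see an already consumed stream, which is the reason for the file in the first place.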

best
Leo

Bye,
Daniel



--
____________________________________________________

DI Leo Sauermann       http://www.dfki.de/~sauermann

Deutsches Forschungszentrum fuer
Kuenstliche Intelligenz DFKI GmbH

Trippstadter Strasse 122
P.O. Box 2080           Fon:   +49 631 20575-116
D-67663 Kaiserslautern  Fax:   +49 631 20575-102
Germany                 Mail:  leo.sauermann@xxxxxxx

Geschaeftsfuehrung:
Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
____________________________________________________
