Hi Hannes,
yes, that’s the idea behind SMILA: to provide an infrastructure
with easy mechanisms for integrating additional functionality, so that
projects such as yours can be realized faster, without reinventing the wheel
every time.
You should definitely check out org.eclipse.smila.integration.solr.
Also check out which pipelets are available (there are not that many yet) —
they might provide reusable functionality for you.
Bye,
Daniel
From:
smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On
Behalf Of Hannes Carl Meyer
Sent: Tuesday, May 11, 2010 11:12 AM
To: smila-user@xxxxxxxxxxx
Subject: Re: [smila-user] Handling Streaming Resources vs JMS
Hi Daniel,
thank you! For testing purposes I'm going to split the data into handy chunks
and try to get my sample use case working with SMILA.
That's my plan so far:
- Split the stream into records
- Put records into a "Database" (HBase)
- Analyze records regarding: language, language-specific content, named
entities
- Index records + metadata into Solr index
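For the first step, I'm imagining something like the following (plain Java, independent of any SMILA API — I'm modeling a record as a simple Map for now, and the field names are just placeholders I made up):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Splits an incoming stream into one record per non-empty line.
 * A "record" is modeled here as a plain Map; in SMILA it would be
 * the framework's own record type instead.
 */
public class StreamRecordSplitter {

    public static List<Map<String, String>> split(InputStream in, String sourceId)
            throws IOException {
        List<Map<String, String>> records = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            int i = 0;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip empty lines between chunks
                }
                Map<String, String> record = new HashMap<>();
                record.put("id", sourceId + "-" + (i++)); // stable id per chunk
                record.put("content", line);
                records.add(record);
            }
        }
        return records;
    }
}
```

The resulting records would then go to HBase, through the analysis steps, and finally into the Solr index.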
If you have any advice on those steps, I would like to hear it.
I know SMILA is at a pretty early stage, but I have spent a lot of time
re-inventing this kind of infrastructure, and maybe SMILA could be a solution
for future projects.
Regards,
Hannes
On Tue, May 11, 2010 at 10:58 AM, <daniel.stucky@xxxxxxxxxxxxx>
wrote:
Hi Hannes,
thanks for your interest in
SMILA.
At the moment the data
exchange between Crawler and Connectivity does not support streaming. All the
object data is actually copied (as a byte[]) into record attachments
(or as Strings into record attributes). So you certainly cannot use data of
the size you plan to use.
However, perhaps you can
still use SMILA to do the job :-)
Assuming that all machines
you are running SMILA on are able to access the data to be processed (e.g. via a
public URL, a filesystem share, a database, etc.), your Crawler could
provide only the information necessary to access the data, not the data
itself (e.g. a URL, a path, or a database id). In the BPEL pipeline you
would then need to implement your own Pipelet that is capable of reading the data
via a stream and creating multiple records from the streamed data.
You may want to take a look
at the org.eclipse.smila.processing.pipelets.xmlprocessing.XmlSplitterPipelet
as a sample on how to generate new records from an existing record.
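Roughly, the idea looks like this (sketched in plain Java — this is NOT the actual SMILA Pipelet interface, just an illustration; the class name, the "url" attribute, and the record-as-Map shape are all made up for the example):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of a splitter "pipelet": the incoming record carries only a
 * reference to the data (here the "url" attribute); the pipelet opens
 * the stream itself and emits one child record per line read.
 * NOTE: not the real SMILA Pipelet interface, just the concept.
 */
public class StreamSplitterPipelet {

    /** Opens the referenced stream; kept separate so it can be overridden. */
    protected InputStream openStream(String url) throws IOException {
        return new URL(url).openStream();
    }

    public List<Map<String, String>> process(Map<String, String> parent)
            throws IOException {
        String url = parent.get("url"); // reference only — no payload was copied
        List<Map<String, String>> children = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(openStream(url), StandardCharsets.UTF_8))) {
            String line;
            int i = 0;
            while ((line = reader.readLine()) != null) {
                Map<String, String> child = new HashMap<>();
                child.put("id", parent.get("id") + "/" + (i++));
                child.put("content", line);
                children.add(child);
            }
        }
        return children;
    }
}
```

This way only small reference records travel between Crawler and Connectivity, and the big payload is streamed inside the pipeline.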
Bye,
Daniel
--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer