Re: [smila-user] Handling Streaming Ressources vs JMS
Hi Daniel,
thank you! For testing purposes I'm going to split the
data into handy chunks and try to get my sample use case working with
SMILA.
That's actually my plan so far:
- Split the stream into records
- Put records into a "database" (HBase)
- Analyze records regarding language, language-specific content, and named entities
- Index records plus metadata into a Solr index
If you have any advice on those steps, I would like to hear from you.
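A minimal Java sketch of the first step (splitting the incoming stream into records), assuming for illustration that record boundaries are newlines; the class and method names are made up, and the later steps (HBase storage, analysis, Solr indexing) would consume each record as it is produced:

```java
// Sketch of step 1 only: split a character stream into records,
// assuming newline-delimited input. Records are collected into a
// list for simplicity; a real setup would hand each record on to
// the storage/analysis/indexing steps as it is produced.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class StreamSplitter {

    // Reads the stream incrementally via BufferedReader, so the
    // full input never has to be held in memory at once.
    public static List<String> split(Reader input) throws IOException {
        List<String> records = new ArrayList<>();
        BufferedReader reader = new BufferedReader(input);
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.isEmpty()) {
                records.add(line);
            }
        }
        return records;
    }
}
```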
I know SMILA is at a pretty early stage, but I have spent a lot of time
re-inventing such infrastructures, and maybe SMILA could be a
solution for future projects.
Regards,
Hannes

On Tue, May 11, 2010 at 10:58 AM,
<daniel.stucky@xxxxxxxxxxxxx> wrote:
Hi Hannes,
thanks for your interest in SMILA.
At the moment the data exchange between Crawler and Connectivity
does not support streaming. All object data is actually copied (as a
byte[]) into record attachments (or as Strings into record attributes). So
you certainly cannot use data of the size you plan to use.
However, perhaps you can still use SMILA to do the job :-)
Assuming that all machines you run SMILA on can
access the data to be processed (e.g. via a public URL, a filesystem share, or
a database), your Crawler could provide only the information necessary to
access the data, not the data itself (e.g. a URL, a path, or a database
id). In the BPEL pipeline you would then need to implement your own Pipelet
that is capable of reading the data as a stream and creating multiple records
from the streamed data.
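As a rough illustration of that idea (this is not the actual SMILA Pipelet API; the Record class and the process method here are hypothetical), such a pipelet would receive a record carrying only a reference, resolve it to a stream, and emit one new record per chunk:

```java
// Hypothetical sketch only: NOT the real SMILA Pipelet interface.
// The Crawler emits a record containing just a reference (URL, path,
// database id); this processing step resolves the reference to a
// stream and creates one new record per chunk of the streamed data.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class ReferenceSplitter {

    // Minimal stand-in for a SMILA record: an id plus a content string.
    public static class Record {
        public final String id;
        public final String content;
        public Record(String id, String content) {
            this.id = id;
            this.content = content;
        }
    }

    // 'reference.content' would hold the URL or path; 'data' is the
    // stream obtained from it (e.g. via URL.openStream() or a
    // FileReader). Here one record is created per non-empty line.
    public static List<Record> process(Record reference, Reader data)
            throws IOException {
        List<Record> out = new ArrayList<>();
        BufferedReader reader = new BufferedReader(data);
        String line;
        int n = 0;
        while ((line = reader.readLine()) != null) {
            if (!line.isEmpty()) {
                out.add(new Record(reference.id + "#" + n++, line));
            }
        }
        return out;
    }
}
```

The point of the design is that only the small reference travels through Crawler and Connectivity, while the bulk data is streamed lazily inside the pipeline.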
You may want to take a look at the org.eclipse.smila.processing.pipelets.xmlprocessing.XmlSplitterPipelet
as a sample on how to generate new records from an existing record.
Bye,
Daniel
--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer