Hi Hannes,
thanks for your interest in SMILA.
At the moment the data exchange between Crawler and Connectivity
does not support streaming. The complete content of each object is copied
into the record, either as a byte[] attachment or as a String attribute. So
you certainly cannot pass data of the size you are planning through it.
However, perhaps you can still use SMILA to do the job :-)
Assuming that all machines you are running SMILA on are able to
access the data to be processed (e.g. via a public URL, a filesystem share or
a database), your Crawler could provide only the information necessary to
access the data, not the data itself (e.g. a URL, a path, or a database
ID). In the BPEL pipeline you would then need to implement your own Pipelet
that reads the data as a stream and creates multiple records
from the streamed data.
You may want to take a look at the org.eclipse.smila.processing.pipelets.xmlprocessing.XmlSplitterPipelet
as an example of how to generate new records from an existing record.
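To make the idea more concrete, here is a minimal sketch in plain Java of the splitting step only. It deliberately does not use any SMILA classes: StreamingSplitter, split, batchSize and sink are hypothetical names chosen for illustration, and the assumption is that your Pipelet would open a Reader on the URL or path stored in the record and create one new record per batch inside the sink callback.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper (not part of SMILA): splits a streamed text resource into batches. */
public class StreamingSplitter {

    /**
     * Reads the source line by line and passes batches of at most batchSize
     * lines to the sink. Only the current batch is held in memory.
     *
     * @return the number of batches handed to the sink
     */
    public static int split(Reader source, int batchSize, Consumer<List<String>> sink)
            throws IOException {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        int batches = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    // In a Pipelet, this is where a new record would be created.
                    sink.accept(batch);
                    batch = new ArrayList<>(batchSize);
                    batches++;
                }
            }
        }
        // Hand over the last, possibly smaller, batch.
        if (!batch.isEmpty()) {
            sink.accept(batch);
            batches++;
        }
        return batches;
    }
}
```

With this approach only the reference (URL, path or ID) travels through the queue, and the large payload is read directly from its source by the machine that processes it.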
Bye,
Daniel
From:
smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On
Behalf Of Hannes Carl Meyer
Sent: Tuesday, 11 May 2010 10:24
To: smila-user@xxxxxxxxxxx
Subject: [smila-user] Handling Streaming Ressources vs JMS
Hi,
I'm thinking about giving SMILA a try for an indexing and text analysis project
analyzing lots of realtime information such as Twitter's data.
Of course I started looking into SMILA's architecture (http://wiki.eclipse.org/SMILA/Architecture_Overview)
to see whether it would be possible to handle streaming resources.
Regarding the Architecture Overview, is it really necessary to use JMS between
the crawling and analysis?
I'm going to start with a dataset of 500GB of raw text messages and could
imagine going up to 4-5TB - imho this would create an overhead when handled
via JMS.
Looking forward to hearing about your experiences!
Regards,
Hannes
--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer