
RE: [smila-user] Handling Streaming Resources vs JMS

Hi Hannes,

 

yes, that's the idea behind SMILA: to provide an infrastructure with easy mechanisms for integrating additional functionality, so that projects such as yours can be realized faster, without reinventing the wheel every time.

 

You should definitely check out org.eclipse.smila.integration.solr. Also check out which pipelets are already available (there are not that many yet); some might provide reusable functionality for you.

 

Bye,

Daniel

 

From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On Behalf Of Hannes Carl Meyer
Sent: Tuesday, May 11, 2010 11:12
To: smila-user@xxxxxxxxxxx
Subject: Re: [smila-user] Handling Streaming Resources vs JMS

 

Hi Daniel,

thank you! For testing purposes I'm going to split the data into handy chunks and try to get my sample use case working with SMILA.

That's actually my plan so far:

- Split the stream into records
- Put records into a "Database" (HBase)
- Analyze records regarding: language, language-specific content, named entities
- Index records + metadata into Solr index

If you have any advice on those steps, I would like to hear from you.
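
For step 4, this is roughly what I have in mind; a minimal SolrJ sketch, where the Solr URL, the core name "messages" and the id/text/language fields are just assumptions for the example:

import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexer {
    public static void main(String[] args) throws IOException, SolrServerException {
        // URL and core name are assumptions for this sketch.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/messages").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "msg-42");        // record id
            doc.addField("text", "hello world"); // message body
            doc.addField("language", "en");      // metadata from the analysis step
            solr.add(doc);
            solr.commit(); // for bulk loads I would commit in batches, not per document
        }
    }
}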

I know SMILA is at a pretty early stage, but I have spent a lot of time re-inventing such infrastructure, and maybe SMILA could be a solution for future projects.

Regards,

Hannes

On Tue, May 11, 2010 at 10:58 AM, <daniel.stucky@xxxxxxxxxxxxx> wrote:

Hi Hannes,

 

thanks for your interest in SMILA.

 

At the moment the data exchange between Crawler and Connectivity does not support streaming. All object data is actually copied, either as record attachments (using a byte[]) or as record attributes (using Strings). So you certainly cannot handle data of the size you plan to use this way.
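
To illustrate (with a hypothetical Record class; these names are stand-ins, not the actual SMILA datamodel API), the current exchange is effectively a full in-memory copy:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a SMILA record; names are illustrative only. */
class Record {
    final Map<String, String> attributes = new HashMap<>();  // small values as Strings
    final Map<String, byte[]> attachments = new HashMap<>(); // content copied as byte[]
}

class CopySemanticsDemo {
    /** The whole source is read into a single byte[] and carried inside the
     *  record, so the content must fit into memory on every hop. */
    static Record crawl(Path source) throws IOException {
        Record record = new Record();
        record.attachments.put("Content", Files.readAllBytes(source)); // full copy
        return record;
    }
}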

 

However, perhaps you can still use SMILA to do the job :-)

 

Assuming that all machines you are running SMILA on are able to access the data to be processed (e.g. via a public URL, a filesystem share or a database), your Crawler could provide only the information necessary to access the data, but not the data itself (e.g. a URL, a path or a database id). In the BPEL pipeline you would then need to implement your own Pipelet that is capable of reading the data from a stream and creating multiple records from the streamed data.

 

You may want to take a look at org.eclipse.smila.processing.pipelets.xmlprocessing.XmlSplitterPipelet as a sample of how to generate new records from an existing record.
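
As a very rough sketch of the idea (simplified stand-ins, not the real Pipelet API): the incoming record carries only the source URL, and the pipelet streams the content and emits one new record per message, assuming one UTF-8 message per line:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

/** Simplified illustration of a splitter pipelet; the types are made up. */
class StreamSplitterSketch {

    static class MessageRecord {
        final String id;
        final String text;
        MessageRecord(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /** Opens the stream itself and emits one record per line, so the whole
     *  input is never buffered in memory. */
    static void split(URL source, Consumer<MessageRecord> emit) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(source.openStream(), StandardCharsets.UTF_8))) {
            String line;
            long seq = 0;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {
                    emit.accept(new MessageRecord(source + "#" + seq++, line));
                }
            }
        }
    }
}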

 

Bye,

Daniel

 

From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On Behalf Of Hannes Carl Meyer
Sent: Tuesday, May 11, 2010 10:24
To: smila-user@xxxxxxxxxxx
Subject: [smila-user] Handling Streaming Resources vs JMS

 

Hi,

I'm thinking about giving SMILA a try for an indexing and text analysis project analyzing lots of real-time information, such as Twitter's data.
Of course I started looking into SMILA's architecture (http://wiki.eclipse.org/SMILA/Architecture_Overview) to see whether it would be possible to handle streaming resources.

Regarding the Architecture Overview: is it really necessary to use JMS between crawling and analysis?
I'm going to start out with a dataset of 500 GB of raw text messages and could imagine growing to 4-5 TB; IMHO this would create considerable overhead when pushed through JMS (at, say, ~1 KB per message, 500 GB alone is on the order of 500 million messages).

Looking forward to hearing about your experiences!

Regards,

Hannes

--

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer





